The Wharton Research Data Service (WRDS) data set and Factor Model Factors

Information wants to be free
Attributed to Stewart Brand

On the one hand information wants to be expensive, because it's so valuable. The right information in the right place just changes your life.
Steve Wozniak, co-founder of Apple Computer

Table of Contents

Introduction

This page started out as a set of notes that I wrote while I was working with the Wharton Research Data Service (WRDS) data set for my Masters presentation in Computational Finance and Risk Management Masters at the University of Washington, on November 22, 2013.

I am grateful that I had access to the WRDS databases. Without this access I would not have been able to do my Masters presentation work. This is a fantastic resource for graduate students and researchers.

Locating the data that I needed within the WRDS data bases and cleaning up the data so that it could be used in my Masters presentation proved to be extremely time consuming. At one point I worried that I would end up giving my Masters presentation on data preparation since this work had consumed so much time. I hope that these notes will help others navigate the WRDS data (and spend less time on the data and more on their research).

Pension funds, endowments, hedge funds and other investment groups spend tens of thousands, or hundreds of thousands of dollars a year on market data. Some of these "industry" data sources are available through the Wharton Research Data Service. I don't know if there are differences in the quality of the academic databases vs. those provided to industrial subscribers.

Apparently different academic subscribers have different levels of access. So these notes are not comprehensive: a subscription may have more or less data access than is noted here. Some of these data sources, like the Fama French data may be available elsewhere (e.g., Kenneth French's web site).

I have focused on data sets that are useful for building portfolio factor models based on corporate balance sheet data (earnings per share, etc...). To explore the WRDS data, I first used the S&P 100 stocks and then later, the S&P 500 stocks. Most of the detail here involves stocks in the CRSP/Compustat S&P 500 index, from 1998 to 2013.

The WRDS data can be difficult to navigate. Each data set has a variety of sub-data sets. Even after reading the documentation it can be difficult to understand what is contained in the data set or how complete it is. These notes reflect many hours spent exploring the WRDS data set.

Data Set Overview

In most cases the descriptions of the data sets consist of edited versions of the description provided on the WRDS web site.

Detailed Data Source Notes

As far as I can tell, of the data sets that I have access to, there are only two that contain corporate factor data: IBES and Compustat. The IBES data also includes analysts forcasts.

I/B/E/S (IBES)

The IBES data set can be selected in a variety of ways. The IBES Detail History - Actuals data set appears to be the most promising when it comes to corporate factors (e.g., earnings per share, cash flow per share). Here "actual" values are in contrast to the forecast values.

The IBES data is surprising in its ommissions. Almost every stock has periods where there are no values provided for Book Value per Share, Cash Flow per Share and Return on Equity. This makes this data set difficult or impossible to use. Given what IBES charges for data I can only hope that they provide better quality for their industry subscribers.

IBES Detail History - Actuals

The IBES database has quarterly stock factor data. I experimented with selecting the factors shown below:

The SAL factor (revinue non-per-share) does not seem to be a useful factor since it needs to be scaled by the number of shares or the corporate size (capitalization).

Unlike the Compustat data, which is returned in tabular form, the IBES data is returned with one factor per line. This is shown below. The column names are:

OFTIC Official ticker
TICKERIBES ticker
PENDSperiod end
MEASUREe.g., the factor EPS, BPS, CPS...
PDICITYperiodicity (e.g., quarterly, yearly)
VALUEthe factor value
ANNDATESAnnouncement date
OFTIC,TICKER,PENDS,MEASURE,PDICITY,VALUE,ANNDATS
AEP,AEP,12/31/2004,EPS,QTR,0.42,01/27/2005
AEP,AEP,03/31/2005,EPS,QTR,0.88,04/28/2005
AEP,AEP,06/30/2005,EPS,QTR,0.61,07/29/2005
AEP,AEP,09/30/2005,EPS,QTR,0.95,10/27/2005
AEP,AEP,12/31/2005,EPS,QTR,0.29,02/01/2006
AEP,AEP,03/31/2006,EPS,QTR,0.96,04/27/2006
AEP,AEP,06/30/2006,EPS,QTR,0.44,07/27/2006
AEP,AEP,03/31/2006,BPS,QTR,23.82,04/28/2006
...
AEP,AEP,06/30/2006,BPS,QTR,23.8,07/27/2006
AEP,AEP,09/30/2006,BPS,QTR,24.05,10/31/2006
AEP,AEP,12/31/2006,BPS,QTR,23.83,01/30/2007
AEP,AEP,03/31/2007,BPS,QTR,23.94,04/26/2007
AEP,AEP,06/30/2007,BPS,QTR,24.22,07/31/2007
AEP,AEP,09/30/2007,BPS,QTR,24.76,10/24/2007
AEP,AEP,12/31/2007,BPS,QTR,25.31,01/29/2008
AEP,AEP,03/31/2008,BPS,QTR,,04/28/2008

To allow this data be used in an environment like R the data needs to be in a tabular format, like that shown below:

symdateanndateBPSCPSCSHDPSEBGEBSEPSFFOGPSGRMROAROESAL
AEP2004-12-312005-01-27NA0.447NA0.35NANA0.42NA0.42NANANA3500.0
AEP2005-03-312005-04-28NA0.7685NA0.35NANA0.88NA0.9NANANA3000.0
AEP2005-06-302005-07-29NA0.7765NA0.35NANA0.61NA0.58NANANA2800.0
AEP2005-09-302005-10-27NA1.8355NA0.35NANA0.95NA0.99NANANA3300.0
AEP2005-12-312006-02-01NA0.354NA0.37NANA0.29NA-0.38NANANA2900.0
AEP2006-03-312006-04-2723.821.8NA0.37NANA0.96NA0.97NA1.0611.693100.0
AEP2006-06-302006-07-2723.81.53NA0.37NANA0.44NA0.44NA0.4710.982900.0
AEP2006-09-302006-10-3124.052.28NA0.37NA2.890.99NA0.67NA1.0811.093600.0
AEP2006-12-312007-01-3023.83NANA0.39NA2.890.38NA0.46NA0.411.613000.0
AEP2007-03-312007-04-2623.94NANA0.39NA2.350.68NA0.68NA0.7110.343200.0
AEP2007-06-302007-07-3124.22NANA0.39NA2.310.64NA0.45NA0.6611.093100.0
AEP2007-09-302007-10-2424.76NANA0.39NA3.311.16NA1.02NA1.1711.513800.0
AEP2007-12-312008-01-2925.31NANA0.41NANA0.52NA0.58NA0.5211.93300.0
AEP2008-03-312008-04-24NA1.56NA0.41NA2.91.02NA1.4364.15NANA3467.0

Note that NA has been filled in for missing values.

I wrote a Java program (ConvertData.java) to convert the WRDS IBES value per line data into the table shown above (a downloadable form can be found here).

The WRDS query interface allows you to select a file on your local system that contains the stock ticker symbols for the query. I used a file that contained the S&P 100 stocks (note that the composition of the S&P 100 and other indexes change over time):

S&P 100 Stocks (as of October 2013)
AAPLABBVABTACNAEPAIGALLAMGNAMZNAPA
APCAXPBABACBAXBKBMYBRK.BCCAT
CLCMCSACOFCOPCOSTCSCOCVSCVXDDDIS
DOWDVNEBAYEMCEMREXCFFCXFDXFOXA
GDGEGILDGMGOOGGSHALHDHONHPQ
IBMINTCJNJJPMKOLLYLMTLOWMAMCD
MDLZMDTMETMMMMOMONMRKMSMSFTNKE
NOVNSCORCLOXYPEPPFEPGPMQCOMRTN
SBUXSLBSOSPGTTGTTWXTXNUNHUNP
UPSUSBUTXVVZWAGWFCWMBWMTXOM

You can download the stocks in list form here.

Although the IBES search has a check box for each of the factors above, many of them have no data or many missing values (for example, CPH, EBG, FFO, GRM). In some cases, not all time periods have all of the factors selected in the query (e.g., values are not supplied as empty values, but are omitted entirely for the time period).

I tried selecting fewer factors and got a data set with fewer missing values. For some stocks there are still enough missing values to make this data set problematic.

symdateanndateBPSCPSDPSEPSROESAL
BA2005-03-312005-04-2714.231.7550.250.6615.3212987.0
BA2005-06-302005-07-2713.791.870.250.715.6215028.0
BA2005-09-302005-10-2611.741.840.251.2624.3712629.0
BA2005-12-312006-02-0113.971.8620.250.5823.2814204.0
BA2006-03-312006-04-2614.551.840.30.8823.7114264.0
BA2006-06-302006-07-2613.620.730.3-0.2119.3314986.0
BA2006-09-302006-10-2514.062.250.30.8915.2914739.0
BA2006-12-312007-01-316.023.120.31.2946.7417541.0
BA2007-03-312007-04-257.220.940.351.1242.715365.0
BA2007-06-302007-07-257.544.680.351.3561.5517028.0
BA2007-09-302007-10-248.284.30.351.4362.7516517.0
BA2007-12-312008-01-3011.832.490.351.3545.0717477.0
BA2008-03-312008-04-2312.132.590.41.6148.4315990.0

The IBES time series with the fewest missing values is the earnings per share actual value. However, not all dates are covered.

S&P 500 Index Components

Compustat Monthly Updates - Index Constituents

Stocks are added and removed from the S&P 500 over time. When modeling portfolio performance with historical data it is important to use the historical S&P 500 to avoid survivor bias. Also, some stocks, like Google, have existed only relatively recently.

In order to download the historical S&P 500, the index for various time periods must be calculated. Some stocks do not exist anymore and are replaced by CRSP/Compustat codes. A file of GVKEY values must be generated to download the factors discussed here. See Building the S&P 500 Constituents. The web page include a link to R code that will build the index and the GVKEY file.

CRSP/Compustat Corporate Value Factors

Forrester Research analyst Sucharita Mulpuru told the Times that Amazon was "the teacher's pet of Wall Street," because no other company in the world "has the consistently abominable rate of profitability they do and yet has the stratospheric valuation they do."
http://www.salon.com/2013/10/25/why_wall_street_cant_quit_amazon

The raw numbers would seem to indicate that Microsoft is killing it while Amazon is floundering. But Wall Street disagrees: On Friday morning Microsoft's stock price hovered around $35 per share while Amazon rose to $357.
http://www.salon.com/2013/10/25/what_microsoft_profits_say_about_amazon_losses/Microsoft

Of the databases available on WRDS, the CRSP/Compustat combined database seems to be the most extensive database of fundamental corporate information. This is the core data that I used for my Masters presentation.

CRSP/Compustat Merged Database - Fundamentals Quarterly

This data consists of corporate factors from the quarterly corporate filings.

Corporations in different industrial sections differ in fundamental ways. A transportation company (trucking, railroad) has large capital equipment expenditures and depreciation, but no research and development expenses. A pharmaceutical company has research and develpment costs, capital costs for drug manufacture and depreciation. A software company has research and development costs but low capital expenses (outside of buildings and computers). Each of these companies will have different corporate factors.

Corporate reports contain a staggering number of categories. For those of us without the benefit of MBA or CPA graduate degrees, some of these factors are confusing. Another problem is that in many cases there are missing factors.

Factor Density for CRSP/Compustat Quarterly Factors

The tables below show the density of the factors, relative to the S&P 500 from January 1998 to September 2013. For example, [100.00:95.00] means that between 95 and 100 percent of the values are present (e.g., not NA).

In some cases the fact values are missing is valid. For example, not all stocks have dividends, so it is reasonable that dividend values are missing for some stocks.


[100.00:95.00]

PRCCQCSHPRQCSH12QSALEQIBADJQIBCOMQATQLSEQ
IBQLTMIBQNIQLTQSEQQICAPTQPIQCSHOQ
AOQCHEQEPSPIQEPSFIQCEQQOPEPSQEPSPXQCOGSQ
EPSFXQXOPRQEPSX12OEPS12OIADPQCSHFDQDILAVQOEPSXQ
EPSF12TXTQAPQRECTQREQCSTKQOANCFYCHECHY

[95.00:90.00]

REVTQLOQPPENTQOEPF12DLTTQOIBDPQ

[90.00:85.00]

DPQNOPIQCSHIQCAPSQACOQ

[85.00:80.00]

LCOQLCTQACTQWCAPQXINTQDLCQINVTQ

[80.00:75.00]

XSGAQANCQLLTQIBADJ12SPCEPQ

[75.00:70.00]

SPCEEPSPQSPCEDPQSPCEQSPCEEPSQSPCEDQ

[70.00:65.00]

SPCEP12SPCEEPSP12SPCEPD12REUNAQDPACTQPPEGTQSPCE12SPCEEPS12
SPCED12ACOMINCQ

[65.00:60.00]

ALTOQDVPSPQINTANQCSTKCVQ

[60.00:55.00]

TSTKQGDWLQ

[55.00:50.00]

TXDITCQTSTKNQTXPQLOXDRQSPIQ

[50.00:45.00]

CITOTALQPNCWIAPQPNCWIPPQCIBEGNIQPNCPQSTKCOQPNCWIAQPNCWIPQ
PNCQINTANOQRECTRQGLCEA12

[45.00:40.00]

PNC12

[40.00:35.00]

CHQGLCEEPS12GLCED12IVLTQPRCPQCICURRQPRCRAQMIBTQ
CSHOPQ

[35.00:30.00]

PRCAQPNCPEPSQPNCPDQPNCEPS12PNCD12PNCEPSQPNCDQPRCE12
PNCPEPS12PNCPD12MIIQGLCEPQPNCWIEPQXRDQPNCWIDPQPNCWIEPSQ
RECTAQPNCWIDQXOPTQPGLCEAQ

[30.00:25.00]

CIDERGLQAOCIPENQTEQQXIDOQXOPTEPSQPXOPTDQPTFVAQAOCIDERGLQ
CIPENQXOPTQRCPQCISECGLQ

[25.00:20.00]

RECDQCIQTXDIQXOPT12MIBQRCAQIBMIIQAOL2Q
XOPTEPSP12XOPTD12PMSAQXOPTD12XOPTEPS12DOQOPTVOLQOPTRFRQ
OPTLIFEQXOPTEPSQXOPTDQTFVLQAQPL1QPNCIAPQPNCIPPQSTKCPAQ
INVFGQGLCEEPSQGLCEDQIVSTQLOL2QRECTOQ

[20.00:15.00]

PNCIAQPNCIPQRCDQRCEPSQSETA12INVRMQPSTKQSETEPS12
SETD12PNCIEPSPQPNCIDPQDVPQFCAQINVWIPQDRCQPNCIEPSQ
PNCIDQ

[15.00:10.00]

PSTKNQOPTFVGRQOPTDRQPRCEPS12PRCD12MIBNQPRCPEPSQPRCEPSQ
PRCDQPRCPDQPRCPEPS12PRCPD12RRA12CIMIIQAUL3QDILADQ

[10.00:5.00]

SPIOPQTXDBQAQPQSPIOAQDRLTQAQAQDCOMQSETPQ
SETAQCIOTHERQAOCIOTHERQSPIDQSPIEPSQLUL3QNRTXTQGDWLIA12
GLPQSETDQWDPQAQDQSETEPSQGLAQGDWLID12GDWLIEPS12
AQEPSQNRTXTDQWDAQDTEPQNRTXTEPSQRREPS12RRD12DTEAQ
LQPL1QUCEQQUACOQULCOQUGIQUNIAMIQUOPIQURECTQ
PNRSHOQUAOQUINVQUNOPINCQUAPTQWDDQWDEPSQCSTKEQ
GLDQGLEPSQHEDGEGLQRRPQRRAQIVAOQGDWLAMQDTEDQ
DTEEPSQ

[5.00:0.00]

RLLQTIEQTIIQPSTKRQESOPCTQNIMQNPATQPLLQ
NCOQCAPR3QCAPR1QCAPR2QINVOQNIITQINTACCQDD1Q
USUBPCVQXIQTXDBCAQUCAPSQXACCQESUBQOBKQTXDBAQ
USUBDVPQGDWLIPQRREPSQGDWLIAQRRDQUDOLTQIVAEQQANOQ
LNOQACCHGQGDWLIDQGDWLIEPSQNPQUSPIQRDIPQRETQ
RDIPAQDPACREQRECUBQDPRETQSEQOQESOPTQTXWQRDIPDQ
RDIPEPSQUDMBQTFVCEQPRSHOQUDDQUPMCSTKQTXDBCLQUDPCOQ
FFOQUTEMQUPSTKQSRETQUDVPQESOPNRQARCEDQARCEEPSQ
ARCEQESOPRQUPMPFQUPMSUBPQAOCISECGLQUPDVPQUPSTKCQUCCONSQ
UPMPFSQ

100% to 60% CRSP/Compustat Quarterly Corporate Factors with Explanation

In most cases factors that don't have at least 2/3s of their values are not going to be useful. The tables below list the factors, with the description provided by Compustat.

[100.00:95.00]

PRCCQPrice Close - Quarter
CSHPRQCommon Shares Used to Calculate Earnings Per Share - Basic
CSH12QCommon Shares Used to Calculate Earnings Per Share - 12 Months Moving
SALEQSales/Turnover (Net)
IBADJQIncome Before Extraordinary Items - Adjusted for Common Stock Equivalents
IBCOMQ Income Before Extraordinary Items - Available for Common
ATQAssets - Total
LSEQLiabilities and Stockholders Equity - Total
IBQIncome Before Extraordinary Items
LTMIBQLiabilities - Total and Noncontrolling Interest
NIQNet Income (Loss)
LTQLiabilities - Total
SEQQStockholders Equity <Parent <Index Fundamental <Quarterly
ICAPTQ Invested Capital - Total - Quarterly
PIQPretax Income
CSHOQCommon Shares Outstanding
AOQAssets - Other - Total
CHEQCash and Short-Term Investments
EPSPIQEarnings Per Share (Basic) - Including Extraordinary Items
EPSFIQEarnings Per Share (Diluted) - Including Extraordinary Items
CEQQCommon/Ordinary Equity - Total
OPEPSQEarnings Per Share from Operations
EPSPXQ Earnings Per Share (Basic) - Excluding Extraordinary Items
COGSQ Cost of Goods Sold
EPSFXQEarnings Per Share (Diluted) - Excluding Extraordinary items
XOPRQOperating Expense- Total
EPSX12Earnings Per Share (Basic) - Excluding Extraordinary Items - 12 Months Moving
OEPS12Earnings Per Share from Operations - 12 Months Moving
OIADPQ Operating Income After Depreciation - Quarterly
CSHFDQCommon Shares for Diluted EPS
DILAVQDilution Available - Excluding Extraordinary Items
OEPSXQDilution Available - Excluding Extraordinary Items
EPSF12Earnings Per Share (Diluted) - Excluding Extraordinary Items - 12 Months Mo
TXTQIncome Taxes - Total
APQAccount Payable/Creditors - Trade
RECTQReceivables - Total
REQRetained Earnings
CSTKQCommon/Ordinary Stock (Capital)
OANCFY Operating Activities - Net Cash Flow
CHECHYCash and Cash Equivalents - Increase (Decrease)

[95.00:90.00]

REVTQ Revenue - Total
LOQLiabilities - Other
PPENTQProperty Plant and Equipment - Total (Net)
OEPF12Earnings Per Share - Diluted - from Operations - 12MM
DLTTQLong-Term Debt - Total
OIBDPQOperating Income Before Depreciation - Quarterly

[90.00:85.00]

DPQDepreciation and Amortization - Total
NOPIQNon-Operating Income (Expense) - Total
CSHIQCommon Shares Issued
CAPSQCapital Surplus/Share Premium Reserve
ACOQCurrent Assets - Other - Total

[85.00:80.00]

LCOQCurrent Liabilities - Other - Total
LCTQ Current Liabilities - Total
ACTQCurrent Assets - Total
WCAPQWorking Capital (Balance Sheet)
XINTQInterest and Related Expense- Total
DLCQDebt in Current Liabilities
INVTQInventories - Total

[80.00:75.00]

XSGAQSelling, General and Administrative Expenses
ANCQNon-Current Assets - Total
LLTQLong-Term Liabilities (Total)
IBADJ12Income Before Extra Items - Adj for Common Stock Equivalents - 12MM
SPCEPQS&P Core Earnings - Preliminary

[75.00:70.00]

SPCEEPSPQS&P Core Earnings EPS Basic - Preliminary
SPCEDPQS&P Core Earnings EPS Diluted - Preliminary
SPCEQS&P Core Earnings
SPCEEPSQS&P Core Earnings EPS Basic
SPCEDQS&P Core Earnings EPS Diluted

[70.00:65.00]

SPCEP12S&P Core Earnings 12MM - Preliminary
SPCEEPSP12S&P Core 12MM EPS - Basic - Preliminary
SPCEPD12S&P Core Earnings 12MM EPS Diluted - Preliminary
REUNAQUnadjusted Retained Earnings
DPACTQDepreciation, Depletion and Amortization (Accumulated)
PPEGTQProperty, Plant and Equipment - Total (Gross) - Quarterly
SPCE12S&P Core Earnings 12MM
SPCEEPS12S&P Core Earnings EPS Basic 12MM
SPCED12S&P Core Earnings EPS Diluted 12MM
ACOMINCQAccumulated Other Comprehensive Income (Loss)

[65.00:60.00]

ALTOQOther Long-term Assets
DVPSPQDividends per Share - Pay Date - Quarter
INTANQIntangible Assets - Total
CSTKCVQCarrying Value

Corporate Factor Building Blocks

Most of the Compustat corporate factors listed in the table below are used to build the factors described in Chapter 5 of Quantitative Equity Portfolio Management by Qian, Hua and Sorensen. As the table shows, most of these factors have few missing values. In some cases, dividends (DVPSPQ), missing values are not unexpected, since not all companies pay dividends.

Compustat Factors
Compustat CodeExplanationPercent NA (after data cleaning)
CEQQCommon/Ordinary Equity - Total 0.21
CHECHYCash and Cash Equivalents - Increase (Decrease) 4.44
CHEQCash and Short-Term Investments 0.13
COGSQ Cost of Goods Sold 0.12
CSHIQCommon Shares Issued12.41
CSHOQCommon Shares Outstanding (thousands)0
DLCCHYChanges in Current Debt43.96
DLCQDebt in Current Liabilities2.12
DLTISYLong-Term Debt - Issuance8.66
DLTRYLong-Term Debt - Reduction7.52
DLTTQLong-Term Debt - Total (millions)0.44
DVPSPQDividends per Share - Pay Date - Quarter0.02
DVYCash Dividends0.02
EPSPXQ Earnings Per Share (Basic) - Excluding Extraordinary Items 0.34
IBCOMQIncome Before Extraordinary Items - Available for Common0.03
IBQIncome Before Extraordinary Items0.03
LTQLiabilities - Total 0.09
NIQNet Income (Loss) 0.03
OANCFY Operating Activities - Net Cash Flow (year to date)4.44
OIADPQ Operating Income After Depreciation - Quarterly 1.38
PIQPretax Income0.06
PRCCQPrice Close - Quarter0
PRSTKCY Purchase of Common and Preferred Stock9.99
PSTKQPreferred/Preference Stock (Capital) - Total0.84
SALEQSales/Turnover (Net)0
SEQQStockholders Equity <Parent <Index Fundamental <Quarterly 0.30
SSTKYSale of Common and Preferred Stock8.45
TXTQIncome Taxes - Total0.10
XINTQInterest and Related Expense- Total14.10
XOPRQOperating Expense- Total 0.83
XSGAQSelling, General and Administrative Expenses (millions)22.41

Note:
In the CRSP/Compustat data the REVTQ (quarterly revenue) and SALEQ (quarterly sales) values seem to be the same. There are fewer missing values for SALEQ, so this value is used rather than REVTQ.

The distribution of NA (not present) values after interpolation for quarterly values is shown below:

Additional Metadata

In some cases the date defined in the datadate column is incorrect. To allow cleaning and filtering software to fix the dates the metadata fields (from CRSP/Compustat) should be included.

Metadata Fields
gvkey
datadate
fyear
fqtr
tic
conm
DATAFQTR
GGROUP
GSECTOR

If you want additional verification you can also include the DATAFQTR column in the data. This value (e.g., 1998Q1, 1998Q2) mirrors the fyear and fqtr columns.

The GGROUP and GSECTOR are Global Industry Classification Standard values

Cleaning up the Data

In our recollection, we have never encountered a large, real-world dataset that was directly usable without data cleaning.
Scaling Big Data Mining Infrastructure: The Twitter Experience by Jimmy Lin and Dmitriy Ryaboy, SIGKDD Explorations, V14, Issue 2, 2012

Fixing Date Values

The table below of Google (GOOG) data shows the correct format for the dates. Note that the date in the datadate field mirrors the year and quarter in the fyearq and fqtr columns.

Consistent Data Values
gvkeydatadatefyearqfqtrticconmDATAFQTRATQetc
16032909/30/200420043GOOGGOOGLE INC2004Q32888.5180 ...
16032912/31/200420044GOOGGOOGLE INC2004Q43313.3510 ...
16032903/31/200520051GOOGGOOGLE INC2005Q13865.1990 ...
16032906/30/200520052GOOGGOOGLE INC2005Q24497.7180 ...
16032909/30/200520053GOOGGOOGLE INC2005Q39451.0010 ...
16032912/31/200520054GOOGGOOGLE INC2005Q410271.8130 ...

The data below for Mylan Inc (MYL) shows an example where the date in the datadate column is wrong. I have tried to check this data against finance.yahoo.com data. It is hard to exactly match up the values, but it does apepar that the fyearq and fqtr close price is the correct one.

Along with being inaccurate, this date problem causes an error in R's indexes since some datadate values are repeated.

Inconsistent Data Values
gvkeydatadatefyearqfqtrticconmDATAFQTRATQetc
00763703/31/199819974MYLMYLAN INC1997Q4847.7530 ...
00763706/30/199819981MYLMYLAN INC1998Q1890.8990 ...
00763709/30/199819982MYLMYLAN INC1998Q2924.4280 ...
00763712/31/199819983MYLMYLAN INC1998Q31158.3130 ...
00763703/31/199919984MYLMYLAN INC1998Q41206.6610 ...
00763706/30/199919991MYLMYLAN INC1999Q11241.1060 ...
00763709/30/199919992MYLMYLAN INC1999Q21259.8280 ...
00763712/31/199919993MYLMYLAN INC1999Q31309.0420 ...
00763703/31/200019994MYLMYLAN INC1999Q41341.2300 ...
00763706/30/200020001MYLMYLAN INC2000Q11364.8180 ...

The CRSP/Compustat Merged Database - Fundamentals Quarterly data is extracted from the quarterly reports that public companies, traded on US exchanges, are required to file with the SEC. There are many factors and not all apply to all companies, so some will naturally be zero (blank).

There are other data values that should be reported and present for all companies. However, there are cases where these critical values are missing in the CRSP/Compustat data.

Before the Compustat data can be used, it must be processed to add critical values that are missing. Adding interpolated values is better than a missing value, but this may result in estimation error.

Filling in missing stock values

The R package tseries includes a function, get.hist.quote, which can be used to fetch stock values form Internet sources like finance.yahoo.com.

The historical S&P 500 includes stocks for companies that have been purchased by other companies or have otherwise gone out of business (Lehman Brothers, for example). The tickers for these stocks have been replaced by CRSP/Compustat symbols. If these symbols are passed to the get.hist.quote function there will be an error. This error is not an exception that can be caught, if the quote cannot be read. This will cause the R script to terminate.

To avoid this, the R function below can be used to check whether a value can be read from Yahoo before calling get.hist.quote().


#
# Return TRUE if the stock, on date, can be read from Yahoo
#

canGetStockInfo = function(ticker, date)
{
  start = date
  end = date
  url <- paste("http://chart.yahoo.com/table.csv?s=", ticker, 
               format(start, "&a=%m&b=%d&c=%Y"), 
               format(end, "&d=%m&e=%d&f=%Y"), "&g=d&q=q&y=0&z=",
	       ticker, "&x=.csv", sep = "")
  page = getURL(url) 
  ix = grep(pattern="404 Not Found", x = page)
  readURL = (length(ix) == 0)
  return(readURL)
}

R code to process the Compustat Data

An R script to clean the S&P 500 CRSP/Compustat data: fix_compustat_data.r

Fundamental Corporate Factors from Quantitative Equity Portfolio Management

The factors in this section are discussed in Chapter 5 of Quantitative Equity Portfolio Management by Qian, Hua and Sorensen, Chapman and Hall, 2007. These factors are synthesized from the Compustat quarterly data (except for the earnings forecast, which can be obtained from IBES).

In Appendex A5.1 of Qian et al there is a description of how these factors are calculated from the Compustat data. Unfortunately the current Compustat data available from WRDS in 2013 has a different structure. Qian et al use an obsolete Compustat quarterly data schema, is described here.

I have listed below the calculation I have used for the selected Qian et al factors.

Preprocessing Factor Data

In the CRSP/Compustat data the yearly values are "year-to-date" values. For example, if cash grows by 2 million a quarter, in the first quarter the cash will be 2 million, then 4 million in the second quarter, 6 million in the third quarter and 8 million in the fourth quarter. If the progression continues, then in the first quarter of the next year the cash will again be 2 million.

The yearly factors must be preprocessed to generate the quarterly values before they are used. The yearly factors used above are:

DLTISY
DLTRY
OANCFY

Distribution of the Value Factors

The plots below show the distributions of the standardized value factors synthesized from the corporate values from the S&P 500 stock universe over the period 1998 through 2013 (a distribution is standardized by subtracting the mean and dividing the result by the standard deviation).


R Code to Calculate the Value Factors

  • R code to calculate the value factors from CRSP/Compustat data: factor_calc.r


Ian Kaplan
October 2013
Last revised: February 2014

Back to topics in Quantitative Finance