I remain pleased with my ZODB implementation.
Since my last post, I have successfully downloaded all the market data I could find off of Yahoo! Finance. The program downloads all the market data for a specified ticker off of Yahoo’s servers, and then this information is saved into a binary tree in my ZODB, which is keyed by the tuple (date,ticker).
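One consequence of keying by (date, ticker) is that the keys sort by date first and ticker second, so all quotes for one day sit next to each other while the quotes for one ticker are scattered across the whole tree. The sketch below illustrates that ordering with the standard library's `bisect` standing in for the tree; it is not my actual ZODB code. The sample keys, the use of `datetime.date` for the date part, and the helper names `keys_for_day` and `keys_for_ticker` are all assumptions made for illustration.

```python
from bisect import bisect_left
from datetime import date, timedelta

# Made-up sample keys (not real quotes), kept sorted like keys in an ordered tree.
keys = sorted([
    (date(2008, 1, 3), "GM"),
    (date(2008, 1, 2), "F"),
    (date(2008, 1, 3), "F"),
    (date(2008, 1, 2), "GM"),
])

def keys_for_day(sorted_keys, day):
    """Return every key for one day: a single contiguous slice of the ordering."""
    # (day, "") sorts before any (day, ticker) because "" is the smallest string.
    lo = bisect_left(sorted_keys, (day, ""))
    hi = bisect_left(sorted_keys, (day + timedelta(days=1), ""))
    return sorted_keys[lo:hi]

def keys_for_ticker(sorted_keys, ticker):
    """Return every key for one ticker: this needs a scan over all keys."""
    return [k for k in sorted_keys if k[1] == ticker]
```

If per-ticker history turns out to be the common query, a (ticker, date) key would make that lookup the contiguous one instead.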
The download took 7 hours, and comprised 8 years of market data for all my tickers.
I’m not sure if everything is packed optimally or not, but the resulting ZODB pickle file is sixteen gigabytes.
I was committing changes to the database after every ticker (~14K tickers for the 10 years I did), and that is a lot of file reads and writes… I may be able to perform the download in less time if I space out the local ZODB transactions and let more downloaded data sit in RAM before committing everything to my hard drive… As it was, the RAM burden was tiny.
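Spacing out the commits could look roughly like the sketch below. It is an untested idea rather than what I actually ran: `store_in_batches`, `fetch`, `store`, and `batch_size` are hypothetical names, and the commit step is passed in as a plain callable so the sketch runs without a database. With ZODB, that callable would presumably be `transaction.commit`.

```python
def store_in_batches(tickers, fetch, store, commit, batch_size=100):
    """Save quotes ticker by ticker, committing once per batch_size tickers.

    fetch(ticker) is assumed to yield (key, quote) pairs, where key is the
    (date, ticker) tuple; store is any mapping (a dict here, a BTree in ZODB).
    Returns the number of commits performed.
    """
    pending = 0   # tickers stored since the last commit
    commits = 0
    for ticker in tickers:
        for key, quote in fetch(ticker):
            store[key] = quote
        pending += 1
        if pending >= batch_size:
            commit()
            commits += 1
            pending = 0
    if pending:
        # Flush the final, partially filled batch.
        commit()
        commits += 1
    return commits
```

With about 14K tickers and a batch size of 100, this would mean roughly 140 commits instead of 14K, at the cost of holding up to 100 tickers' worth of quotes in RAM between commits.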
But it is all irrelevant now that everything is downloaded. Seven hours is no big deal.
I wrote a small script to evaluate the success of the Yahoo! download. I randomly select three days in each calendar year and query the ZODB for quote information for these days… If two of three quotes are found, I assume that data downloaded for that calendar year.
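The check can be sketched as follows. This is a simplified reconstruction, not the script itself: the function names are made up, a plain dict stands in for the ZODB tree, and I assume here that keys use `datetime.date` and that only weekdays are sampled (otherwise a weekend pick would always count as a miss).

```python
import random
from datetime import date, timedelta

def random_weekdays(year, n, rng):
    """Pick n distinct Monday-to-Friday dates in the given calendar year."""
    start = date(year, 1, 1)
    days_in_year = (date(year + 1, 1, 1) - start).days
    all_days = [start + timedelta(days=i) for i in range(days_in_year)]
    weekdays = [d for d in all_days if d.weekday() < 5]
    return rng.sample(weekdays, n)

def year_has_data(quotes, ticker, year, n=3, needed=2, rng=None):
    """Return True if at least `needed` of n sampled days have a quote."""
    rng = rng or random.Random()
    hits = sum(1 for d in random_weekdays(year, n, rng) if (d, ticker) in quotes)
    return hits >= needed

def coverage_by_year(quotes, ticker, years, rng=None):
    """Map each year to True/False depending on whether data appears present."""
    return {y: year_has_data(quotes, ticker, y, rng=rng) for y in years}
```

Market holidays still count as misses in this version, which is part of why two hits out of three is enough to pass.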
The results are saved to alltickers.txt.
My first finding: There are a number of garbage tickers… numbers. They are somehow artifacts of the Damodaran XLS sheets I am using, though I can’t pin down an explanation entirely. I have verified the integrity of the fundamental data, so I remain confident in it (and am basically just ignoring these weird garbage tickers).
My second finding: I only have market data for 42% of the tickers. Some of this is explained by finding #1 above, but many tickers are simply dead companies whose tickers no longer appear on the Yahoo! Finance site and could not be found by my script. Or, other companies may have changed tickers (in which case, an alias ticker has all the market data — and the former ticker is a duplicate).
The dead companies are a problem. It may still be possible to harvest their data from Yahoo! Finance or elsewhere.
At the moment, I am not interested in perfecting my data cache. I wish to develop the historical screener / portfolio tools — get a working program — and then focus my efforts back on data integrity.
(OR: Buy a complete, private source of market data for $$$ and spare myself the trouble of cracking less-credible sources found on the ‘net.)
I don’t know precisely when I will get around to developing the first screener and portfolio tools (and a graphical user interface!!). I am happy having populated the database with horrible-integrity market data to begin the next phase of coding, and will take a rest and look at other exciting code projects like activity-based modeling (for academic research).
alltickers.txt shows the panoply of tickers I have harvested so far and what market data I have available, by year.
ford.txt shows the depth of market data I have for a typical company, Ford Motor (F). I have included some fundamental attributes (Current PE, Market Cap) and the closing price to show which quotes have that information, and also a final entry titled “# Attributes” which tells you how much data I have saved for each quote… 8 attributes for Yahoo! market data downloads, and 55+ attributes for fundamental data imported from the Damodaran XLS sheets.
| Attachment | Size |
|---|---|
| alltickers.txt | 814.5 KB |
| ford.txt | 184.1 KB |