"Midas Data Miner"

So the latest name for my historical stock screener project is Midas’ Data Miner… It’s good symbolism for what I hope to accomplish with an investing strategy some day.

Besides, mythology only ever contains a grain of truth (if that). For all we know, King Midas owned a MacBook and traded stocks online, and the bards merely fancied telling the tall tale of him turning trash into gold with his fingertips rather than the dull truth of his amassing wealth in bonds and shares. Not at all anachronistic, right?

Incoming Python

I am now developing it in Python. Originally, I had planned on C++, but Python is much more fun to develop in, and performance isn’t going to be as much of an issue as before (discussed next).

Insufficient RAM

Historical stock data sets are enormous. If I were to achieve my goal of decades of stock quotes for ten-thousand-plus firms, that’s a lot of data. Consider that each stock quote yields twenty or more distinct pieces of price or fundamental data, and multiply this by roughly 250 trading days per year, by however many thousands of firms, and however many years… and the pile of quotes grows.
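To put a rough number on that pile, here is a back-of-the-envelope sketch. The function name `estimate_bytes` is hypothetical, and the 20-year span, the 250 trading days per year, and the 8 bytes per value (one 64-bit float) are assumptions for illustration, not measurements of any real data set.

```python
def estimate_bytes(firms, years, days_per_year=250, fields=20, bytes_per_value=8):
    """Lower-bound size of the raw values, ignoring all container overhead."""
    return firms * years * days_per_year * fields * bytes_per_value


# Assumed inputs: 10,000 firms over 20 years, 20 fields per quote.
total = estimate_bytes(firms=10_000, years=20)
print(f"{total / 1e9:.1f} GB of raw values")
```

Python objects carry far more overhead than 8 bytes per number, so the real in-memory footprint would be considerably larger than this estimate.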

The data set is large enough that I cannot hope to load it all into RAM. Fortunately, I’ve come around to the idea of accessing any data set one day at a time, and this is how data sets are implemented in MDM. I can’t access my PZTMDB databases randomly; MDM can only iterate through them sequentially, one day at a time.

It’s very elegant, and perfectly in line with the demands of this app, as historical portfolios will also navigate the data one day at a time.
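The one-day-at-a-time idea can be sketched with a generator that only ever holds a single day of quotes in memory. This is not MDM's actual reader: the CSV layout, the column names (`date`, `ticker`, `close`) and the function name `iter_days` are all assumptions, and the input is assumed to be sorted by date already.

```python
import csv
import io
from itertools import groupby


def iter_days(lines):
    """Yield (date, {ticker: close}) one day at a time from date-sorted CSV lines."""
    reader = csv.DictReader(lines)
    # groupby only groups consecutive rows, hence the sorted-input assumption.
    for date, rows in groupby(reader, key=lambda row: row["date"]):
        yield date, {row["ticker"]: float(row["close"]) for row in rows}


# Made-up sample data standing in for a large file on disk.
SAMPLE = """date,ticker,close
2001-01-02,AAA,10.0
2001-01-02,BBB,20.0
2001-01-03,AAA,10.5
2001-01-03,BBB,19.5
"""

days = list(iter_days(io.StringIO(SAMPLE)))
for date, quotes in days:
    print(date, quotes)
```

With a real file, passing the open file object instead of `io.StringIO` and looping over `iter_days(...)` directly (without `list`) keeps memory use proportional to one day of quotes rather than the whole history.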

Data Sources

So far, in writing my classes, I’ve really only played with one XLS (Excel spreadsheet) source for fundamental data going back to January 2001. However, any other class I might write to access a different data set (perhaps Yahoo! market data through the web) must use the same interface as the already-implemented XLS class, so there isn’t really any critical thinking remaining… Just peon work.

Database (iterator) interface

Speaking of which, the Database (iterator) class and its children all have three public members: a date object, a dictionary object, and a next() method:

  • database.date is a datetime object for the iterator’s current position in the data set
  • database.dict is a keyed dictionary of the form { TICKER -> STOCK QUOTE DATA }
  • database.next() advances to the next day in the data set and updates the aforementioned date and dict members
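A minimal sketch of that three-member interface might look like the following. Only `date`, `dict` and `next()` come from the description above; the child class `InMemoryDatabase`, the sample quotes, the empty state before the first `next()` call, and the choice to raise StopIteration at the end of the data are all assumptions made for illustration.

```python
import datetime


class Database:
    """Base iterator exposing only `date`, `dict`, and `next()`."""

    def __init__(self):
        self.date = None  # position in the data set (a datetime.date here)
        self.dict = {}    # {ticker: quote data} for the current date

    def next(self):
        """Advance one day and refresh `date` and `dict`."""
        raise NotImplementedError


class InMemoryDatabase(Database):
    """Hypothetical child class backed by a list of (date, quotes) pairs."""

    def __init__(self, days):
        super().__init__()
        # Sort by date so iteration is strictly chronological.
        self._days = iter(sorted(days, key=lambda pair: pair[0]))

    def next(self):
        # Assumption: running past the last day raises StopIteration.
        self.date, self.dict = next(self._days)


# Made-up sample data, deliberately out of order.
sample = [
    (datetime.date(2001, 1, 3), {"AAA": {"close": 10.5, "pe": 12.0}}),
    (datetime.date(2001, 1, 2), {"AAA": {"close": 10.0, "pe": 11.5}}),
]

db = InMemoryDatabase(sample)
db.next()
print(db.date, db.dict)
```

Any other source, whether a spreadsheet or a web feed, would only need to override `next()` so that it fills in the same two attributes.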

So, it becomes moot whether “Database” actually describes the information I’m working with, as long as the object provides these three members. My next task is to write a version of this class that accesses data over the web using the Python urllib module, and to define a simulacrum iterator for Yahoo! data using only these three members.

Information is information, regardless of whether it is printed on an Excel sheet or available on the net.

Knowledge is Power

The next dull step of this project is to find whatever market data sets are freely available on the internet, and to implement each as a Database child class, whether Excel files or otherwise.

And one fine day, I shall have a plethora of market data in hand to do historical stock-trading system testing (say that 10x fast).

Roadmap

Once the data mining features are complete, the further planned features are:

  • Stock screener that identifies any stocks passing user-specified criteria (e.g. P/E ratio, market capitalization) on a given date (e.g. May 10, 2005)
  • Portfolio tool that runs a saved stock screen across a full date range (e.g. January 1, 1990 through January 1, 2010) with user-specified position sizing, etc.
  • Reporting that provides meaningful data about this portfolio (e.g. annualized percentage return, nominal value)
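None of this is written yet, but the screener and the per-annum return could be sketched along these lines. The function names `screen` and `annualized_return`, the field names `pe` and `market_cap`, the (low, high) bounds format and the sample quotes are hypothetical. The return figure is the standard compound annual growth rate, (end / start)^(1 / years) - 1.

```python
def screen(quotes, criteria):
    """Return the sorted tickers whose fields all fall within [low, high]."""
    passing = []
    for ticker, data in quotes.items():
        if all(
            field in data and low <= data[field] <= high
            for field, (low, high) in criteria.items()
        ):
            passing.append(ticker)
    return sorted(passing)


def annualized_return(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.10 means 10% per annum)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Made-up quotes for a single day, shaped like { TICKER -> STOCK QUOTE DATA }.
quotes = {
    "AAA": {"pe": 8.0, "market_cap": 5e9},
    "BBB": {"pe": 25.0, "market_cap": 8e9},
    "CCC": {"pe": 12.0, "market_cap": 2e8},
}
# Assumed criteria: P/E at most 15 and market cap of at least 1 billion.
criteria = {"pe": (0.0, 15.0), "market_cap": (1e9, float("inf"))}

picks = screen(quotes, criteria)
growth = annualized_return(10_000.0, 20_000.0, 20)
print(picks, f"{growth:.2%}")
```

A portfolio tool would presumably call something like `screen` once per day while stepping through the data, then report the annualized figure at the end.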

Discipline…

I would guarantee completing this before the end of Spring semester, but all bets are off if I can’t find the time (i.e., my status quo).

/s/ Patrick