Midas: Secondary Data

The end (of the first phase) is in sight. I started this project with a myopic view, writing code with only small, short-term goals in mind, but this morning I find great joy in having the rest of the code conceived in my mind (and on paper, and in MS Word). I know exactly what is left to be implemented; nothing is left undecided.

Secondary Data

To date, Midas Data Miner coding efforts have focused on mining and storing raw data, which lives in the binary tree aptly named “primary”. However, before I can write a backtesting application, I need to abstract the variety of PE ratios and other quote information derived from Damodaran or Yahoo (or someday, AAII SIPRO) into general, uniform quote objects. That way, no matter where I source my information, or how I change my data mining implementation in the future, the usage/abstraction remains consistent for any down-stream programs like Midas Backtester.

This abstraction will be the binary tree named “secondary”. Besides providing consistency for down-stream applications (aka a uniform naming scheme for accessing data), I can also utilize data structures that will perform faster — or eliminate excessive quote information that I don’t anticipate needing.
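
As a rough illustration of that uniform naming scheme (a sketch only, not the actual Midas code), the idea can be expressed as a small mapping layer. Everything named here is an assumption made for the example: the `Quote` class, the `FIELD_MAP` table, the `to_quote` helper, and the raw field names attributed to each source are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical raw field names per source; the real feeds may differ.
FIELD_MAP = {
    "yahoo": {"close": "Close", "pe_ratio": "PE Ratio"},
    "aaii_sipro": {"close": "Price", "pe_ratio": "Trailing PE"},
}


@dataclass(frozen=True)
class Quote:
    """Uniform ("secondary") quote object seen by down-stream programs."""
    ticker: str
    day: date
    close: Optional[float]
    pe_ratio: Optional[float]


def to_quote(source, ticker, day, raw):
    """Translate one source-specific ("primary") record into a Quote."""
    mapping = FIELD_MAP[source]

    def pick(attr):
        # Missing values stay None instead of raising.
        value = raw.get(mapping[attr])
        return float(value) if value is not None else None

    return Quote(ticker, day, pick("close"), pick("pe_ratio"))
```

Under these assumptions, supporting a new data source means adding one entry to `FIELD_MAP`; code that reads `quote.pe_ratio` does not change.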

Basically, the advantage of dividing the data into “primary” and “secondary” sets is that I gain sustainability (the data stays usable by other applications) and performance.

I’m thrilled to finally have an abstraction pinned down that accommodates freely-available Yahoo and Damodaran market data today and will also accommodate proprietary data sets like Value Line or AAII SIPRO a year from now, crucially WITHOUT having to re-write any down-stream applications to accommodate the new quote information. I will be even more thrilled when this is finished and I can move on to the backtester.

Specifics

I have attached a PDF detailing the data structures and some of the implementation. To summarize, the “secondary” data abstraction fulfills these goals:

  • Eliminate excessive quote information (and thereby improve performance)
  • Cull bad tickers (any companies for which I have insufficient market data)
  • Refine the primary data into more efficient data structures for enhanced backtesting performance
  • When primary source data is imported, flag the existing secondary keys it affects, to speed up subsequent secondary data refinements
  • Devise a quote attribute naming scheme which makes any variety of primary data sets accessible by down-stream apps

The details are all in the PDF.
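
The PDF holds the real design; the following is only a minimal sketch of two of the goals above (culling bad tickers, and flagging affected keys so that only those are refined again). It assumes plain dictionaries rather than the binary trees described in this post, and the names `MIN_QUOTES`, `cull_bad_tickers`, and `SecondaryStore` are hypothetical, as is the threshold of three quotes.

```python
MIN_QUOTES = 3  # assumed threshold for "sufficient" market data


def cull_bad_tickers(primary, min_quotes=MIN_QUOTES):
    """Keep only tickers that have at least min_quotes raw quotes."""
    return {t: q for t, q in primary.items() if len(q) >= min_quotes}


class SecondaryStore:
    """Refined data plus a set of keys flagged as needing a rebuild."""

    def __init__(self):
        self.data = {}
        self.dirty = set()

    def on_primary_import(self, tickers):
        # Flag every ticker touched by a primary import.
        self.dirty.update(tickers)

    def refine(self, primary, build):
        """Rebuild only the flagged tickers; return the ones rebuilt."""
        good = cull_bad_tickers(primary)
        rebuilt = []
        for ticker in sorted(self.dirty):
            if ticker in good:
                self.data[ticker] = build(good[ticker])
                rebuilt.append(ticker)
            else:
                # Insufficient data: drop any stale secondary entry.
                self.data.pop(ticker, None)
        self.dirty.clear()
        return rebuilt
```

In this sketch a second call to `refine` with no intervening import does no work, which is the point of the flagging.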

Midas Backtester

Completing this secondary data abstraction will conclude the first phase of developing my trading system; namely, the data mining. And I am excited to begin the next phase soon: developing Midas Backtester, the actual utility that will use all this data in simulating portfolios over historical time series of market data.

Some features (off the top of my head):

  • Graphical user interface
  • Stock screening (identify stocks matching criteria for a day)
  • Company In Focus (graph data for one company over time)
  • Portfolio Simulation (specify stock selection and position sizing rules to simulate a historical stock portfolio)
  • “Macro” screening (choose to be in or out of the market based on macro-economic factors)
  • Caches previously used stock screening criteria
  • Save portfolios

(These features were first discussed here.)
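
To make the stock screening item a little more concrete, here is a minimal sketch of matching one day's quotes against a list of criteria. It is not a design for Midas Backtester: the `screen` function, the dictionary layout, and the `pe_ratio` attribute name are assumptions made for the example.

```python
def screen(quotes_for_day, criteria):
    """Return the tickers whose quote satisfies every criterion.

    quotes_for_day maps ticker -> dict of quote attributes for one day.
    criteria is a list of functions taking a quote and returning a bool.
    """
    return sorted(
        ticker
        for ticker, quote in quotes_for_day.items()
        if all(test(quote) for test in criteria)
    )


# Example criterion: a known PE ratio below 15.
def low_pe(quote):
    return quote["pe_ratio"] is not None and quote["pe_ratio"] < 15
```

Representing criteria as plain functions keeps them easy to combine, though caching previously used criteria (another feature on the list) would probably call for a more declarative form.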

Time line

I would like to have these changes completed before summer term ends (Aug 20 or so)… allowing me to move on to Midas Backtester for the fall.

/s/ Patrick

Attachment: 30 July 2010 - midas data structures.pdf (469.73 KB)