S&P 500 Insights With Advanced Analytics

I started this project to see whether I could apply my tech skills (in what is called advanced analytics or business intelligence) to data about stocks. My aim is to find a way to analyze market behaviour using hard data, without human bias.

I hoped to find out whether the strategies of market gurus like Peter Lynch, Warren Buffett and Benjamin Graham are supported by the data. For example, Lynch often claims that macro does not matter when picking stocks. Buffett also pretty much ignores the overall economy, but uses a formula of his own that combines interest rates with P/E to determine whether stocks are overpriced.

My aim was to see whether hard data spanning a long time frame could show that they were right. Because the data is structured, I also knew there was a good chance my tools would find more correlations in it. I even hoped that such a large data set would reveal a hidden statistical gem that no one had found before.

The dataset I use is a fairly large one covering the S&P 500 back to 1871-01-01. I know the data from that time may be somewhat inaccurate, but I chose to include it anyway.

I use a correlation engine built into Power BI that iterates row by row over my dataset, matches every possible combination of rows and features, and uses statistical algorithms to find the correlations.
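To make the idea of pairwise correlation concrete, here is a minimal sketch in Python. It assumes the plain Pearson coefficient; I am not claiming this is what the Power BI engine computes internally. The function names and the toy numbers are made up for illustration and are not taken from my dataset.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of feature names to its Pearson coefficient."""
    matrix = {(name, name): 1.0 for name in columns}  # the diagonal
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        matrix[(a, b)] = matrix[(b, a)] = r  # the matrix is symmetric
    return matrix


# Made-up illustrative values, NOT rows from the S&P 500 dataset.
toy = {
    "price": [10.0, 11.0, 12.5, 12.0, 14.0],
    "dividend": [0.50, 0.52, 0.55, 0.56, 0.60],
    "long_rate": [3.0, 3.1, 2.9, 3.2, 3.0],
}

for pair, r in sorted(correlation_matrix(toy).items()):
    print(pair, round(r, 2))
```

Each feature is compared with every other feature once, which is why the resulting matrix is symmetric and has 1 on the diagonal.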

Each row in the dataset consists of the following features:

Consumer Price Index
Earnings
Dividend
Long interest rates
PE 10 (Price earnings ratio is based on average inflation-adjusted earnings from the previous 10 years, known as the Cyclically Adjusted PE Ratio (CAPE Ratio), Shiller PE Ratio, or PE 10)
Real dividend (adjusted dividend)
Real earnings (adjusted earnings)
Real price (adjusted price)
The data looks like this:

Date (yyyy-mm-dd)

1938-11-01 13.07 0.56 0.63 14.0 2.39 233.24 9.99 11.3 16.15
1938-12-01 12.69 0.51 0.64 14.0 2.38 226.46 9.1 11.42 15.76
1939-01-01 12.5 0.51 0.66 14.0 2.36 223.07 9.16 11.84 15.6

There is only one row per month. A dataset with one row per week or day might reveal better insights, but for the most part I believe one data point per month is good enough for this type of analysis.
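With one row per month, the ten-year window behind the PE 10 feature spans 120 rows. As a rough sketch of that definition (real price divided by the average of the previous ten years of real earnings), something like the following could be used. The function name `pe10`, the `window` parameter and the numbers are my own assumptions for illustration, and I am assuming each monthly earnings value is already an annualized, inflation-adjusted figure.

```python
def pe10(real_price, real_earnings_history, window=120):
    """Real price divided by the mean of the last `window` real-earnings values.

    `window` defaults to 120 because ten years of monthly rows is 120 rows.
    """
    if len(real_earnings_history) < window:
        raise ValueError("not enough history for the requested window")
    recent = real_earnings_history[-window:]
    average = sum(recent) / window
    if average <= 0:
        raise ValueError("average real earnings must be positive")
    return real_price / average


# Made-up example: flat real earnings of 10.0 for ten years, real price 150.0.
print(pe10(150.0, [10.0] * 120))
```

Averaging over ten years is what makes this ratio less jumpy than an ordinary P/E, since a single bad or good year only moves the denominator a little.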

The features and the number of rows in a data set are crucial for statistical analysis with this particular correlation engine. I could have used more features, but I was unable to find any that go back as far as 1871, so I chose the features that could match the very long time frame.

This leads me to the main strength of the analysis: data from the 1st of January 1871 until today. Good data from this period is very, very hard to find, and it is even harder to find it structured in a way that statistical algorithms can use. I would love to use a dataset with even more features, but I am very unlikely to find one for a decent price (and rightly so, since a dataset of such length is valuable and made more valuable by its scarcity).

Instead, what I have is an analysis over a very long time frame of data about what is nowadays called the S&P 500. In the future I plan to expand the data set with other types of data, but since that is so time consuming I must postpone it for now.

Explanations

First some explanations about the report. The correlation plot shows both numbers and colors. The darkest blue and the number 1 mean a perfect correlation. This only occurs on the diagonal; for example, dividend correlates perfectly with dividend.

A lower number means less correlation, and a minus sign means a negative correlation, as seen with the -0.2 between "PE10" and "Long interest rates". The colors work the same way: light blue means a small correlation and red means a negative correlation.

Interestingly, there is only a very weak correlation between interest rates and the rest of the features, which suggests that in the long run it does not matter much for the market how interest rates move. I thought I would find some kind of correlation between rising interest rates and falling stock prices, or between falling interest rates and rising stock prices, but none was found.

One could argue there could be a "lag effect" of, say, 6-12 months between prices and interest rates, since the market anticipates a rate increase. That may be so, but the correlation engine weights a month from 1890 the same as a month from 2015, and I don't believe the market was very efficient at pricing interest rates before the 1950s or 1960s. This may be the reason we don't see a clear correlation (even one with a lag) between interest rates and price.
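A lag effect like this can be checked by shifting one series against the other before correlating them. The sketch below is only an illustration under my own assumptions: it uses the plain Pearson coefficient, the name `lagged_correlation` is hypothetical, and the toy series is constructed so that prices react to rates exactly two months later. It is not data from my dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag] for a non-negative lag in rows."""
    if lag < 0 or len(leader) != len(follower) or lag > len(leader) - 2:
        raise ValueError("lag must be >= 0 and leave at least two pairs")
    return pearson(leader[: len(leader) - lag], follower[lag:])


# Made-up monthly rates; prices are built to react to rates two months later.
rates = [3.0, 3.5, 3.2, 4.0, 3.8, 3.1, 3.6, 4.2]
prices = [70.0, 68.0] + [100.0 - 10.0 * r for r in rates[:-2]]

for lag in range(4):
    print(lag, round(lagged_correlation(rates, prices, lag), 2))
```

Running the same function on a sub-period only (for example rows after 1950) would be one way to address the concern that old and recent months are weighted equally.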

(Or perhaps the Federal Reserve does a better job than most of us think, increasing and decreasing interest rates when it should.)

What surprised me the most was the poor correlation between PE10 and earnings, which scored only 0.6 out of a possible 1.

My only guess as to why is that the average P/E ratio has changed a lot over the years. Benjamin Graham, for example, had little problem finding stocks below a P/E of 6, while nowadays paying a P/E of 18 is not uncommon. I believe transparency and better regulation of listed companies have improved market participants' understanding of risk, so investors today are more willing to pay a higher price for future earnings than investors before the 1950s were.
The amount of money held by index funds and mutual funds also pushes the P/E up, since funds now hold many times more assets than they did long ago, money that must be put to good use in an environment with low interest rates. Many funds, for example, can't accept returns of just a few percent like they would get from interest rates; they are instead forced to try to achieve returns above 6-7% or even to "beat the market".

The low for interest rates in the 1950s was about 2.5%; compare that to the levels of interest we see nowadays. Those of you who know your history also know what happened in the 1950s: it is the best performing decade in the history of the stock market, for both small caps and large caps. The S&P 500 averaged 19.4% per year between 1950 and 1959.

Returning to my argument, today's low interest rate environment forces funds to buy stocks to achieve the return their investors expect. The sector must move away from cash and into higher paying assets like stocks.

The second most surprising thing about the data is the strong correlation that "Consumer Price Index" shows with many of the features. I will pay closer attention to this in the future; if there is a lag between the S&P 500 and CPI, it may be worth watching closely.

As expected, dividends and the S&P 500 show a nice correlation. This makes a lot of sense to me, since managers know their business best: they know better than anyone whether their profits are sustainable and whether it's a good time to pay or increase dividends.

The whole dataset is keyed on "date", meaning the date identifies the rest of the information on that row. You, dear reader, may find this useless information, but it is everything for me when making reports like this. It enables the correlation engine to look for statistical evidence of patterns such as whether the market always goes down between May and October, or in any other month.

Sell in May and go away

But no such correlation was found. If the famous claim "sell in May and go away" had any truth in it, I believe the model would have found it. It found all kinds of correlations not included here, such as when to lock your interest rate for the best chance of getting the lowest possible rate, and a correlation between the S&P 500 and CPI.
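For readers who want to check a seasonal claim like this on their own data, one simple approach is to split the monthly returns by calendar month and compare the averages. This is a minimal sketch under my own assumptions, not the method the correlation engine uses: the function name is hypothetical, each row is a (date, price) pair with one row per month, and the toy prices are made up to grow 1% every month so that no seasonal effect exists.

```python
def seasonal_mean_returns(rows):
    """Return (mean May-October return, mean November-April return).

    rows: list of ('YYYY-MM-DD', price) tuples sorted by date, one per month.
    The return earned during a month is assigned to the month of the earlier row.
    """
    summer, winter = [], []
    for (date, price), (_, next_price) in zip(rows, rows[1:]):
        month = int(date[5:7])
        monthly_return = next_price / price - 1.0
        if 5 <= month <= 10:
            summer.append(monthly_return)
        else:
            winter.append(monthly_return)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(summer), mean(winter)


# Made-up prices that grow 1% every month, so both halves of the year match.
toy_rows = [
    (f"{1990 + i // 12}-{i % 12 + 1:02d}-01", 100.0 * 1.01 ** i)
    for i in range(13)
]

summer_mean, winter_mean = seasonal_mean_returns(toy_rows)
print(round(summer_mean, 4), round(winter_mean, 4))
```

On real data a difference between the two averages would still need a significance test before it could be called evidence for or against the saying.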

This analysis was made purely by analyzing large amounts of data in a single dataset. There were many more correlations but I decided not to include them in this article as I found them less interesting.

The low correlation between long interest rates and both earnings and PE10 is interesting: there really isn't much of a correlation at all, and there is even a slight negative correlation between PE10 and long interest rates. Peter Lynch famously stated that if you spend 15 minutes per year analyzing the market, 12 of those minutes are wasted.

To me, the data shows that spending 3 minutes analyzing macro information over a 12-month period is a good recommendation from Lynch.

Conclusion

Many insights can be found in data and used as help and guidance for investors. The thing that struck me the most is the lack of correlation between interest rates and the S&P 500. Now I have a better understanding of why greats like Warren Buffett and Peter Lynch spend so little time analyzing macro and interest rates: in the long run it makes little difference to their stockholdings.

Investors should spend less time analyzing what the Federal Reserve or the overall market will do to their stocks.
For long-term investors it does not pay to take the interest rate into account; either the market is efficient or the Federal Reserve is doing a good enough job.
Do not listen to advice such as "sell in May and go away"; there is no statistical evidence for it.
The behaviour of funds, and how these behemoths move their money, is important.

Finding correlations in large datasets is my passion. The findings in this article have motivated me to work harder with better data. My next goal is to run correlations on a dataset that includes features such as insider holdings, including the titles of the insiders. This may show how strongly the price correlates with, say, the chairman's holdings in the company.

There is no limit to the amount of data I can use in my correlations, and digging deeper into individual stocks and their data is my next goal. If you like my work, please comment below with the stock you would like me to run correlations on and the type of data to use.