Predicting Stock Market Dips, Crashes and Corrections with Light Gradient Boosting Machines

Here’s the deal: as an occasional swing/day/options trader, I want to know when there’s a chance that a big dip might happen. By “big dip”, I mean taking the daily percentage change in the closing price, normalizing (z-scoring) it, and defining a “big” drop as anything more than 4 standard deviations below the mean. Depending on the data you use, this comes out to roughly a 5.4% drop from the day before. I wanted a model that was generalizable, aspiring toward universal, so rather than focusing on one stock, I fed in data from many of the major indices in the US and around the world, as well as broad ETFs from Vanguard, iShares and the like: funds covering wide swaths of the international market (North America, Europe and Asia), general index trackers like the S&P 500 and Russell 2000, and some industry-specific ETFs (e.g. health care, information technology). For the model architecture, I tried some basic LSTM networks, but in the end I found LightGBM (a light gradient boosting machine) faster for training and analysis.

The setup was this: take the last 20 closing prices in the data and see if you can forecast a big drop on the following day. I combined the data from these disparate sources to see if we could find general patterns; individual equities, I’ve found, have their own dynamics. I’m aiming this model at people or funds that hold diverse, ETF-based portfolios.
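As a concrete sketch of that setup, here is roughly how the windowing and labeling might look (the function name and array handling are mine; the 20-day window and 4-sigma threshold come from the article):

```python
import numpy as np

def make_windows(closes, window=20, z_thresh=-4.0):
    """Turn a daily closing-price series into (features, labels).

    Features: the previous `window` daily percentage changes.
    Label: True if the NEXT day's z-scored return falls below z_thresh,
    i.e. a "big drop" as defined in the article.
    """
    closes = np.asarray(closes, dtype=float)
    returns = np.diff(closes) / closes[:-1]          # daily pct change
    z = (returns - returns.mean()) / returns.std()   # z-score the returns
    X, y = [], []
    for i in range(window, len(returns)):
        X.append(returns[i - window:i])              # last 20 returns
        y.append(z[i] < z_thresh)                    # big drop tomorrow?
    return np.array(X), np.array(y)
```

Each ticker’s series would be windowed like this and the results stacked into one combined training set.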

Since we are dealing with a severely imbalanced set (majority/minority ratio around 300:1), I used the imblearn library’s oversampling module to create a balanced training set with basic SMOTE.

I then used randomized grid search for hyperparameter tuning. The scores I was most interested in were the precision and recall of the TRUE class, which can be summarized by the TRUE-class F1-score. We want to catch as many of the real drops as possible, without creating too many false alarms!

Then I used sklearn’s CalibratedClassifierCV to calibrate the classifier. This usually improves the model, though that depends on the test data, and also on the threshold and the point on the recall curve you’re interested in. For the sake of this experiment, I set 0.80 as my minimum acceptable TRUE-class recall: the model had to catch 80% of the real drops, and at that operating point I wanted to know which version of the classifier raises more false alarms. I also did some testing at 0.90 recall, where I’ve found the differences are a little less glaring.


Testing with IOO, I compared the calibrated and non-calibrated classifier on this ETF at the 0.8 TRUE-class recall level. You can see that at this level, the calibrated classifier (the one on top) has 16 false positives versus 26 for the non-calibrated classifier.

Scoring the two classifiers at the 0.8 recall level

And here’s what happens when you demand that minimum recall on the TRUE class is 0.9:

Classifier comparison at the 0.9 recall level.

Without doing a precise or exhaustive search for the ideal threshold, I found that 0.7 gives a maximum F1-score of 0.87 with the calibrated classifier, and that’s the best I could do for IOO.
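A coarse sweep like the one behind that number might look like this (the helper and grid are mine, not code from the article):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, probs, grid=np.linspace(0.05, 0.95, 19)):
    """Coarse sweep for the threshold that maximizes TRUE-class F1."""
    scores = [f1_score(y_true, probs >= t) for t in grid]
    i = int(np.argmax(scores))
    return grid[i], scores[i]
```

A finer grid (or optimizing the threshold directly on a validation fold) could squeeze out a little more, but as noted above the gains are marginal.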

As few as 8 errors on the whole set.

Of course, the devil is in the details of which kind of error costs you more money. If you’re a buy-and-hold investor, you probably shouldn’t care that much about a 5% drop that happens less than 1% of the time. But if you’re day trading leveraged S&P 500 ETFs or the like, then you might want to know whether you should take your gains and call it a day.

The clean code is on my GitHub, or you can see more of a notebook here.

