This is the project that I wasted many months of my life on. I say “waste” half-jokingly: I wanted to write an academic paper, but I made some fundamental errors, or just didn’t think hard enough, about certain aspects of the matter. Anyway, here’s a mini report on what I learned. As always, I make my Colab notebooks publicly available.

The data comes from the Beijing multisite data set from the UCI Machine Learning repository. I picked just 1 of the sites — the Tiantan (Temple of Heaven, downtown/central Beijing) monitoring site — to use for these experiments…

There are more cool time series libraries for Python than you can shake a stick at. You might have heard of some of them:

Each of these libraries has different methods for dealing with the various time series learning tasks: regression, classification and forecasting. Where they tend to differ is in the selection of methods they use, ranging from traditional statistical methods (e.g. ARIMA) to dynamic time warping, symbolic time series approximations, and more.

For those who are interested in time series classification (TSC), let’s talk about two types of machine learning methods…

How many dates should you go on before you settle down?

The classic “secretary” problem has many names and guises but it’s essentially a mathematical decision theory problem that goes like this: imagine you are interviewing a pool of candidates, and you want to find the best one, but you are under time constraints. The interviewees arrive in random order, and you evaluate each one with a “score” — but you have to decide if you want to hire them on the spot, or else pass them up and move to the next person. No backsies! So…what’s your algorithm for…
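For a concrete feel of the problem, here is a minimal sketch of the well-known "look, then leap" rule: pass on roughly the first n/e candidates, then hire the first one who beats everyone seen so far. The function names, the default pool size, and the trial count are my own illustrative choices, and this is one classic answer rather than the only possible one.

```python
import math
import random


def secretary_pick(scores, skip_fraction=1 / math.e):
    """Return the index chosen by the 'look, then leap' stopping rule."""
    n = len(scores)
    k = int(n * skip_fraction)  # candidates we observe but never hire
    best_seen = max(scores[:k], default=float("-inf"))
    for i in range(k, n):
        if scores[i] > best_seen:
            return i  # first candidate better than everyone in the look phase
    return n - 1  # nobody beat the benchmark, so we are stuck with the last one


def success_rate(n=100, trials=5000, seed=0):
    """Estimate how often the rule hires the single best candidate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        scores = list(range(n))  # n - 1 is the best possible score
        rng.shuffle(scores)      # candidates arrive in random order
        if scores[secretary_pick(scores)] == n - 1:
            wins += 1
    return wins / trials
```

Calling `success_rate()` should give something in the neighborhood of 1/e (about 0.37), which is the theoretical success probability of this rule for large pools.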

If you dabble in stock trading, as I do, you might wonder how you can tell how the stock is going to do by the time of the closing bell — is it going to close above where it started, or not? There are intraday patterns, surely — people always tell you stock trading activity comes in “waves”, and that things tend to slow down a bit during the lunch hours, and that there is a power hour towards the end where big moves can happen.

For this project — (Google Colab notebook publicly available here) — I am using…

You’ve probably heard of the Poisson distribution, a probability distribution often used for modeling counts, that is, non-negative integer values. Imagine you’re modeling “events”, like the number of customers that walk into a store, or birds that land in a tree, in a given hour. That’s what the Poisson is often used for. From the perspective of regression, we have this thing called Generalized Poisson Models (GPMs), where the response variable you are modeling follows some kind of Poisson-like distribution. …

Data: The Seoul Bike Sharing Data Set from the UCI Machine Learning Repository. The target is **Rented Bike Count**. That makes this count data, i.e. non-negative integer values. The predictors/features are mostly weather-related, e.g. sunshine, rain, wind, visibility, as well as temporal features such as the hour of the day, whether it’s a holiday or not, etc. You can view my entire Colab notebook here.

**Question: Should this be a Poisson regression problem?** Count data is often modeled assuming that it comes from a Poisson distribution. …
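One quick sanity check before reaching for Poisson regression is to compare the variance of the counts to their mean, since a Poisson distribution has the two equal. The sketch below is only an illustration of that check using the standard library; the function name is my own, and a raw ratio on the target ignores the effect of the predictors, so treat it as a rough first look rather than a formal test.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by mean; close to 1 for Poisson-like counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean of counts is zero, ratio is undefined")
    return variance(counts) / m


# Made-up toy counts, NOT taken from the bike sharing data.
toy_counts = [0, 2, 4]
ratio = dispersion_index(toy_counts)  # mean 2, sample variance 4
```

A ratio well above 1 suggests overdispersion, in which case something like a negative binomial model is often considered instead of a plain Poisson.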

The SELFBACK dataset contains wearable data of 9 activity classes: 6 ambulatory activities and 3 sedentary activities, performed by 33 participants.

Data are recorded with two tri-axial accelerometers sampling at 100Hz, mounted on the dominant side wrist and the thigh of the participant.

First, we create windows of data for the supervised learning problem. This involves taking “snapshots” (snippets of a multivariate time series). Then, I will see what kind of model pycaret comes up with vs. what we can get from keras or autokeras.
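The windowing step can be sketched roughly as follows. This is a plain-Python illustration, not the code from my notebook: `make_windows`, the window size, the step, and the majority-vote labeling are all assumptions made for the example.

```python
from collections import Counter


def make_windows(series, labels, window_size, step):
    """Cut a multivariate series into fixed-length, possibly overlapping windows.

    series: list of rows, one row per time step, each row a list of channel values
    labels: one activity label per time step
    Returns (X, y), where each X[i] is a list of window_size rows and y[i] is
    the most common label inside that window.
    """
    X, y = [], []
    for start in range(0, len(series) - window_size + 1, step):
        end = start + window_size
        X.append(series[start:end])
        y.append(Counter(labels[start:end]).most_common(1)[0][0])
    return X, y


# Toy example: 10 time steps, 2 channels, made-up labels.
toy_series = [[i, i * 2] for i in range(10)]
toy_labels = ["walk"] * 5 + ["sit"] * 5
X, y = make_windows(toy_series, toy_labels, window_size=4, step=2)
```

With data sampled at 100Hz, a window of a few hundred rows would correspond to a few seconds of movement; the right length is a tuning choice.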

Let’s dive right in — with one of the main problems I faced in handling…

The original paper is as follows:

Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5

This data set has 12 features and you can download it from the UCI Machine Learning Repository. It is a binary classification, supervised learning problem, with “DEATH_EVENT” as the target variable, 1 meaning died and 0 meaning survived.

*Here’s the question: what is the most efficient way to find the best learner (algorithm) and best feature subset? Sometimes, surprisingly small subsets of the features…*
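To see why efficiency matters here, consider the brute-force baseline: with 12 features there are 2^12 - 1 = 4095 non-empty subsets, and each one would need to be scored for every learner. The sketch below shows that exhaustive search in its simplest form. `best_subset` and `toy_score` are hypothetical helpers written for this illustration; in practice the scoring function would be something like a cross-validated metric for a given model.

```python
from itertools import combinations


def best_subset(features, score_fn, max_size=None):
    """Score every feature subset up to max_size and return the best one."""
    max_size = max_size or len(features)
    best_combo, best_score = (), float("-inf")
    for size in range(1, max_size + 1):
        for combo in combinations(features, size):
            score = score_fn(combo)
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score


def toy_score(combo):
    """Made-up score: rewards two specific features, penalizes subset size."""
    useful = {"serum_creatinine", "ejection_fraction"}
    return sum(1 for f in combo if f in useful) - 0.1 * len(combo)


# Illustrative feature names only; a real run would use all 12 columns.
toy_features = ["age", "serum_creatinine", "ejection_fraction", "platelets"]
combo, score = best_subset(toy_features, toy_score)
```

The cost grows exponentially with the number of features, which is exactly why smarter search strategies are worth exploring.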

Well, it wasn’t a complete waste; you live and learn. But I spent a lot of time on the project and wanted to write it up as an academic paper, which means you have to consider what added value your paper is offering to the world. I realized, after seeing my error, that I was not sharing a skillful time series forecasting model as I had hoped, but a model that was barely better than an educated guess.

In my case the data was drawn from the UCI Machine Learning Repository, and had to do with air quality…

This is the second in my Machine Learning Mini Project series, where I take some public data sets, such as those found in the UCI Machine Learning Repository (which are often used in peer-reviewed papers, so they have had some level of vetting), and see what I can learn using basic and advanced machine learning models. I don’t pretend to have any expert knowledge, nor am I sharing any state-of-the-art results or methods. These are meant as explorations, with my notebooks and code being publicly accessible. …

Machine Learning Engineer