This is the project that I wasted many months of my life on. I say “waste” half-jokingly: I wanted to write an academic paper, but I made some fundamental errors, or just didn’t think hard enough, about certain aspects of the matter. Anyway, here’s a mini report on what I learned. As always, I make my Colab notebooks publicly available.

The data comes from the Beijing multisite data set from the UCI Machine Learning repository. I picked just 1 of the sites — the Tiantan (Temple of Heaven, downtown/central Beijing) monitoring site — to use for these experiments. The data itself is just your typical structured numerical meteorological and pollutant data, collected on an hourly basis between roughly March 2013 and March 2017. …

Poisson Deviance

You’ve probably heard of the Poisson distribution, a probability distribution often used for modeling counts, that is, non-negative integer values. Imagine you’re modeling “events”, like the number of customers that walk into a store, or birds that land in a tree, in a given hour. That’s what the Poisson is often used for. From the perspective of regression, we have this thing called Generalized Poisson Models (GPMs), where the response variable you are modeling follows some kind of Poisson-ish distribution. …
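To make the section title concrete, here is a minimal sketch of the mean Poisson deviance using its standard textbook definition, 2 * mean(y * log(y / mu) - (y - mu)). The function name and the toy numbers are my own assumptions for illustration, not taken from the notebook.

```python
import math


def poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance between observed counts and predicted means."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        if mu <= 0:
            raise ValueError("predicted means must be strictly positive")
        # By convention, y * log(y / mu) is taken to be 0 when y == 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total / len(y_true)


# Made-up counts and two made-up sets of predictions, for illustration only.
observed = [0, 2, 5, 3]
print(poisson_deviance(observed, [0.5, 2.0, 4.0, 3.5]))  # predictions close to the counts
print(poisson_deviance(observed, [2.5, 2.5, 2.5, 2.5]))  # always predict the overall mean
```

Lower is better, and a perfect prediction gives a deviance of zero. Unlike squared error, the metric penalizes a miss on a small count more heavily than the same absolute miss on a large count.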

Forecasting Hourly Bike Rentals in Seoul with pycaret, keras, and Tensorflow Probability

Photo by Pixabay from Pexels

Data: The Seoul Bike Sharing Data Set from the UCI Machine Learning Repository. The target is Rented Bike Count. That means this is count data, that is, non-negative integer values. The predictors/features are mostly weather-related, e.g. sunshine, rain, wind, and visibility, as well as temporal features such as the hour of the day, whether it’s a holiday or not, etc. You can view my entire Colab notebook here.

Question: Should this be a Poisson regression problem? Count data is often modeled assuming that it comes from a Poisson distribution. …
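One quick sanity check on the Poisson assumption uses the fact that a Poisson distribution has equal mean and variance. The sketch below computes the variance-to-mean ratio of a list of counts. The function name and the counts are hypothetical, and this is not necessarily the check used in the notebook.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; roughly 1 for Poisson-like data."""
    m = float(mean(counts))
    if m == 0:
        raise ValueError("mean of counts is zero, ratio is undefined")
    return float(variance(counts)) / m


# Hypothetical hourly rental counts, invented for illustration only.
hourly_counts = [12, 30, 250, 410, 95, 8, 0, 640]
print(dispersion_ratio(hourly_counts))
```

A ratio far above 1 suggests overdispersion, in which case a plain Poisson model may understate the spread of the data.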

Machine Learning Mini-Project

Using pycaret and keras to find good models

Photo by Ketut Subiyanto from Pexels

The SELFBACK dataset contains wearable data of 9 activity classes; 6 ambulatory activities and 3 sedentary activities, performed by 33 participants.
Data are recorded with two tri-axial accelerometers sampling at 100Hz, mounted on the dominant side wrist and the thigh of the participant.

Experimental Structure

First, we create windows of data for the supervised learning problem. This involves taking “snapshots” (snippets of a multivariate time series). Then, I will see what kind of model pycaret comes up with vs. what we can get from keras or autokeras.

Creating Data Windows for Supervised Learning

Let’s dive right in — with one of the main problems I faced in handling this data set. I wanted to create a supervised learning problem, which means that I needed to use “windows” of data (slices of time, e.g. 1 second or 1/100th of a second long) which would serve as the features to train on, while an activity (e.g. “jogging”, “walking up stairs”) would serve as the class label. …
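The windowing idea can be sketched in plain Python as follows. This is a minimal illustration under my own assumptions: the function name, the majority-vote labeling rule, and the toy readings are not from the notebook. At 100Hz, a 1-second window would correspond to window_size=100.

```python
from collections import Counter


def make_windows(samples, labels, window_size, step):
    """Slice a labeled time series into fixed-length windows.

    Each window gets the most frequent label among its time steps
    (an assumed labeling rule; other rules are possible).
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    windows, window_labels = [], []
    for start in range(0, len(samples) - window_size + 1, step):
        end = start + window_size
        windows.append(samples[start:end])
        window_labels.append(Counter(labels[start:end]).most_common(1)[0][0])
    return windows, window_labels


# Toy stream: 10 time steps, each a made-up pair of accelerometer readings.
samples = [(0.1 * t, -0.1 * t) for t in range(10)]
labels = ["walking"] * 5 + ["jogging"] * 5
windows, window_labels = make_windows(samples, labels, window_size=4, step=2)
print(len(windows), window_labels)
```

A step smaller than the window size produces overlapping windows, which yields more training examples but also makes neighboring examples highly correlated.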

Machine Learning Mini-Project

(for Heart Failure Survival Prediction)

Photo by National Cancer Institute on Unsplash

The original paper is as follows:

Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20, 16 (2020).

This data set has 12 features and you can download it from the UCI Machine Learning Repository. It is a binary classification, supervised learning problem, with “DEATH_EVENT” as the target variable, 1 meaning died and 0 meaning survived.

Here’s the question: what is the most efficient way to find the best learner (algorithm) and best feature subset? Sometimes, surprisingly small subsets of the features perform better than the complete feature set. …
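As a baseline for the subset question, brute force is feasible here: 12 features give 2^12 - 1 = 4095 non-empty subsets. The sketch below enumerates subsets and ranks them with a pluggable scoring function. The function names, the feature names "a", "b", "c", and the toy scorer are all hypothetical stand-ins. In practice the scorer would be something like a cross-validated metric for a chosen learner.

```python
from itertools import combinations


def rank_feature_subsets(features, score_fn, max_size=None):
    """Score every non-empty feature subset up to max_size, best first."""
    max_size = len(features) if max_size is None else max_size
    scored = []
    for size in range(1, max_size + 1):
        for subset in combinations(features, size):
            scored.append((score_fn(subset), subset))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def toy_score(subset):
    """Placeholder scorer with invented weights; not a real model evaluation."""
    weights = {"a": 0.3, "b": 0.2, "c": 0.01}
    # Small penalty per feature so that uninformative features hurt the score.
    return sum(weights[name] for name in subset) - 0.05 * len(subset)


ranking = rank_feature_subsets(["a", "b", "c"], toy_score)
print(ranking[0])
```

The cost doubles with every added feature, so beyond a few dozen features a greedy or otherwise heuristic search becomes necessary.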

Well, it wasn’t a complete waste; you live and learn. But I spent a lot of time on the project and wanted to write it up as an academic paper, which means you have to consider what added value your paper is offering to the world. After seeing my error, I realized that I was not sharing a skillful time series forecasting model as I had hoped, but a model that was barely better than an educated guess.

In my case the data was drawn from the UCI Machine Learning Repository, and had to do with air quality and pollution in Beijing, China. …

Photo by Karolina Grabowska from Pexels

This is the second in my Machine Learning Mini Project series, where I take some public data sets, such as those found in the UCI Machine Learning Repository (which are often used in peer reviewed papers, so have some level of vetting), and see what I can learn using basic and advanced machine learning models. I don’t pretend to have any expert knowledge, nor am I sharing any state-of-the-art results or methods. These are meant as explorations, with my notebooks and code being publicly accessible. …

Start to finish on some open source data.

As a teacher, I’ve had to rely on video conferencing this last half year, but until now I have never bothered sharing any of my personal projects or explorations via video. I have a YouTube playlist where I take you through various aspects of a simple machine learning project. In this case, it’s Early Stage Diabetes Prediction, a basic binary outcome supervised classification project that involves a modest number of mostly categorical features. …

We know that, in order to train a model that generalizes well to unseen data, we cannot overfit the model to the training data. Given a real-life data set, there is no way to tell for certain how well a model will generalize, because you never know what kind of real-life data might be thrown at the model in the future. New data might not conform well to the distributions of existing data.

So you use cross-validation: divide the data into folds (let’s say 5, for now), train on all but one of the folds (4 of them), and then test your model on the 5th. In other words, always leave one fold out as the validation data. So you’re actually training 5 models, each of which has a different performance score on its respective validation set. Since these models might have different weights, in order to consolidate you have to find some “average” of the five models, something that would do well no matter what validation data it’s facing. …
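The fold bookkeeping described above can be sketched in a few lines of plain Python. The function name is my own, and the sketch assumes the rows are already in random order, since it does no shuffling.

```python
def k_fold_indices(n_samples, k=5):
    """Return k (train_indices, validation_indices) pairs covering all rows."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, validation))
        start += size
    return splits


for fold_number, (train, validation) in enumerate(k_fold_indices(10, k=5), start=1):
    print(fold_number, len(train), validation)
```

Each row lands in exactly one validation fold, so every model is scored on data it never saw during training.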

Image from the Gottman Institute

Yöntem et al. (2019) from Turkey recently published a paper, “Divorce Prediction Using Correlation Based Feature Selection and Artificial Neural Networks”. This study of Turkish couples, and whether they stayed married or divorced, was based on the principles of Gottman Couples Therapy (GCT). The graphic above shows some of the tenets of GCT. For example, creating shared meaning, managing conflict, and turning towards instead of away: these principles are all ingredients of a successful and enduring relationship, and the 54 questions are designed specifically to measure where couples are with regard to these particular dimensions.

For example, Question 33 reads: I can use negative statements about my spouse’s personality during our discussions. This question is then answered on a scale of 0 to 4, with 4 meaning strongly agree. …


Peijin Chen

Machine Learning Engineer
