You’ve probably heard of the Poisson distribution, a probability distribution often used for modeling counts, that is, positive integer values. Imagine you’re modeling “events”, like the number of customers that walk into a store, or birds that land in a tree in a given hour. That’s what the Poisson is often used for. From the perspective of regression, we have this thing called Generalized Poisson Models (GPMs), where the response variable you are modeling is some kind of Poisson-ish type distribution. In a Poisson distribution, the mean is equal to the variance, but in real life, the variance is often greater (over-dispersed) or less than the mean (under-dispersed).
One of the most intuitive and easy to understand explanations of GPMs is from Sachin Date, who, if you don’t know, explains many difficult topics well. What I want to focus on here is something kind of related — the Poisson deviance, or the Poisson loss function.
Most of the time, your models, whether they are linear models, neural networks, or some tree-based method (XGBoost, Random Forest, etc.) are going to use mean squared error (MSE) or root mean squared error (RMSE) as your objective function. As you might know, MSE tends to favor the median, and RMSE and tends to favor the mean of a conditional distribution — this is why people worry about how the loss function works with the outliers in their data sets. However, most of the time, with some normally distributed response, you expect the values of the response (if z-score normalized, especially), to be some kind of Gaussian with mean 0 and unit variance. However, if you are dealing with count data — all positive integers — and you don’t scale or transform the response, then maybe a Poisson distribution is the better description of the data. When the mean of a Poisson is over 10, and especially over 20, it can be approximated with a Gaussian; the heavy tail that you see when the mean (the rate) is low tends to disappear.
Take a look at the formula in the beginning of the post — the y_i is the ground truth, and the mu_i is your model’s prediction. Obviously, if y_i = mu_i, then you have a ln(1), which is 0, canceling out the first term, and the second as well, giving a deviance of 0. What’s interesting is what happens when your model errs on either side of the actual value. In the following snippet, I plot the loss of one example where the ground truth is set at 20 — meaning that we assume the variable is Poisson distributed with mean/rate = 20.