Intuition behind R2 and other regression evaluation metrics

There are many metrics for evaluating a regression model, but they often seem cryptic. Below is an attempt to build intuition for two commonly used ones: mean/median absolute error and R2 (the coefficient of determination).

Average Accuracy of the Model (Mean/Median Absolute Error)

Let’s assume you have a model that can predict house prices. Naturally, you won’t trust it until you evaluate it and establish some confidence in its expected error. So, to start with, you feed in the features (such as number of rooms, lot size, etc.) of a certain house and compare the predicted price (say 130K) to its actual price (say 120K). In this particular case we can say that the model over-estimated the price by 10K. But a single data point is not sufficient to make a general claim about the accuracy or expected error of the model. So we feed in the features of another 1000 houses and, for each of them, compute the error, i.e. the difference between the predicted and the actual price.

From descriptive statistics we know that there are different ways to summarize these 1000 error values. For instance, we can summarize the central tendency of the dataset with the mean or median, or even draw a boxplot to understand the distribution of the error.

Since we are interested in a numerical measure (rather than a visualization), using the mean to summarize all the observed errors makes sense. Thus we can compute the mean error.

However, there is a problem. What if the error is -10K (an under-estimate) for one house and 10K (an over-estimate) for another? Then the mean error will be 0. Intuitively this doesn’t make sense; it makes more sense to say that the expected error is 10K, i.e. we should operate on the absolute error rather than on the signed (under/over-estimate) error. Thus we have all the components of our first metric, namely Mean Absolute Error. To summarize, it’s called mean absolute error because:

  1. Error: because we are comparing the actual house price to the predicted house price
  2. Absolute: because we only consider the magnitude of the error, not whether it is an under- or over-prediction
  3. Mean: because we use the mean to describe the central tendency of the observed errors

Now, we know that the mean is sensitive to outliers, so sometimes we use the median instead, and the metric is then known as median absolute error. The advantage of mean/median absolute error is that the number is easy to interpret: if the mean absolute error of a model is 20K and the predicted price is 200K, then the actual price is most likely between 180K and 220K.
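The ideas above can be sketched in a few lines of Python. The prices below are hypothetical, chosen so the signed errors cancel out exactly, illustrating why we take the absolute value first:

```python
import numpy as np

# Hypothetical actual and predicted house prices (in thousands) for five houses.
actual    = np.array([120, 200, 150, 300, 180])
predicted = np.array([130, 190, 150, 310, 170])

signed_error = predicted - actual      # over- and under-estimates cancel out
mean_error = signed_error.mean()

absolute_error = np.abs(signed_error)
mean_absolute_error = absolute_error.mean()
median_absolute_error = np.median(absolute_error)

print(mean_error)              # 0.0 -- misleadingly "perfect"
print(mean_absolute_error)     # 8.0
print(median_absolute_error)   # 10.0
```

sklearn provides the same metrics as `sklearn.metrics.mean_absolute_error` and `sklearn.metrics.median_absolute_error`.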

Can it be better? (R2)

Data scientists are not only concerned with quantifying the error; they are also interested in determining whether the model can be improved. To answer this question, let’s first establish the best and the worst models.

Best Model
Theoretically, the best model is one for which the absolute error is zero for all test cases. As shown in the graph below, if we plot absolute error on the x-axis and the cumulative percentage of houses on the y-axis, then a point such as (50K, 0.6) indicates that for 60% of houses the absolute error is less than or equal to 50K.

So, given this graph, what will the best model look like?
Since its absolute error is always zero, its curve is simply a vertical line at 0 on the x-axis, extending to 100% on the y-axis.

Worst Model
Don’t confuse the word “worst” with the word “dumb”. Typically, to build a regression model we have a target variable (house price) and certain features or predictor variables, such as the number of rooms, lot size, etc. But what if no features are available? For instance, suppose the only information provided is the prices of 10K randomly selected houses. We can still build a model from this limited information: compute the mean house price over the 10K training samples, and have the model simply return this mean value. Say the mean is 215K; if we ask this model for the price of a house with a 5000 sq ft lot, it will simply return 215K. Let’s call this the mean model.

Theoretically, it can be shown that when no other information is available, the mean model minimizes the squared error. Intuitively this makes sense, as we often fall back on the mean value when we have no other information. The graph below shows what the curve for the mean model looks like.
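The mean model is simple enough to sketch in pure Python. The training prices below are hypothetical, chosen so their mean is the 215K from the example above; note that `predict` ignores its input entirely:

```python
# Hypothetical training prices (in thousands); no features are available.
train_prices = [180, 200, 230, 250]

class MeanModel:
    """The 'worst' baseline: always predicts the mean of the training prices."""

    def fit(self, prices):
        self.mean_ = sum(prices) / len(prices)
        return self

    def predict(self, features):
        # Features are ignored -- we know nothing about how they relate to price.
        return self.mean_

model = MeanModel().fit(train_prices)
print(model.predict({"lot_size": 5000}))  # 215.0
print(model.predict({"lot_size": 9999}))  # 215.0
```

sklearn ships this baseline ready-made as `sklearn.dummy.DummyRegressor(strategy="mean")`.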


Determining scope for improvement
From the above graph, we can observe a few things. First, as our model becomes better, it moves towards the best model, and hence the area between the best model and our model decreases. Correspondingly, the area between the worst model and our model increases. The total area, i.e. the area between the best and the worst model, remains the same. Let’s call this area the improvement opportunity. As our model gets better, it covers more of this improvement-opportunity area. This is exactly what the R2 metric captures: it indicates what portion of the total improvement opportunity our model covers, i.e.

R^2 = \frac{\text{area between our model and the mean model}}{\text{area between the best model and the mean model}}


Once we understand the above intuition, it is also easy to understand why there is often confusion about whether R2 ranges from 0 to 1 (as commonly stated, e.g. on Wikipedia) or can be negative (as in the sklearn library, where the score is at most 1 but has no lower bound). If we go by the formula above, R2 will always be positive and between 0 and 1. However, this doesn’t tell us where our model stands in comparison to the mean model; it implicitly assumes that our model is always better than the mean model, and hence lies between the mean model and the best model.

But in practice it is possible that our model is worse than the mean model and falls on the right side of the mean model’s curve. In that case \sum_i{(y_i - \hat{y}_i)^2} will be larger than \sum_i{(y_i - \bar{y})^2}, and hence R2 will be negative.
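We can verify this numerically with the standard sum-of-squares form, R^2 = 1 - SS_res / SS_tot, which is the same value sklearn's `r2_score` computes. The prices and predictions below are hypothetical:

```python
import numpy as np

def r2(actual, predicted):
    """R2 = 1 - SS_res / SS_tot."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    ss_res = np.sum((actual - predicted) ** 2)          # our model's squared error
    ss_tot = np.sum((actual - actual.mean()) ** 2)      # mean model's squared error
    return 1 - ss_res / ss_tot

actual = [120, 200, 150, 300, 180]

# A decent model: errors are small relative to the spread of prices.
good_pred = [130, 190, 150, 310, 170]
print(r2(actual, good_pred))   # ~0.98

# A model worse than just predicting the mean: R2 goes negative
# (about -4.04 here -- note it drops below -1, hence "no lower bound").
bad_pred = [300, 100, 250, 120, 280]
print(r2(actual, bad_pred))
```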

I hope we can now appreciate the beauty of R2 and understand the intuition behind it.
