R² or r² is a number which has been squared. Or is it?
It’s popular and feels sophisticated.
Especially if it’s ‘high’.
Higher is better. Or is it?
There are actually a number of ‘fallacies’ around the use of R² or r². First, let’s distinguish between the capital R and the lowercase r. The lowercase r² is for simple regression, where you have just one predictor and one predicted variable (for example, value y predicted by size x). It is also called the powerful-sounding coefficient of determination.
The capital R² is used in multiple regression, where there are several predictors (such as size, lot size, number of doors, number of bedrooms, or distance to downtown). It may then be called the coefficient of multiple determination.
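To make the two cases concrete, here is a minimal sketch in Python with NumPy. The house data and the helper names (`r_squared`, `fit_and_score`) are made up for illustration; they are not from our classes. It computes r² for one predictor, checks that it matches the squared correlation, and then computes R² with a second predictor added.

```python
import numpy as np

def r_squared(y, y_hat):
    """1 minus (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_and_score(predictors, y):
    """Least-squares fit with an intercept; returns R-squared."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return r_squared(y, X @ coef)

# Hypothetical data: size in sq ft, bedroom count, value in $1,000s.
size = np.array([1200.0, 1500.0, 1700.0, 2000.0, 2300.0, 2600.0])
bedrooms = np.array([2.0, 3.0, 3.0, 4.0, 4.0, 5.0])
value = np.array([210.0, 250.0, 265.0, 310.0, 330.0, 380.0])

# Simple regression: one predictor, lowercase r-squared.
r2_simple = fit_and_score([size], value)
r_pearson = np.corrcoef(size, value)[0, 1]

# Multiple regression: several predictors, capital R-squared.
R2_multiple = fit_and_score([size, bedrooms], value)

print(f"r2 (size only):            {r2_simple:.4f}")
print(f"squared correlation:       {r_pearson ** 2:.4f}")
print(f"R2 (size and bedrooms):    {R2_multiple:.4f}")
```

In the one-predictor case, r² equals the square of the ordinary correlation r. Adding a predictor to a least-squares fit with an intercept never lowers R², which is one reason a higher number by itself proves little.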
r² is a measure of linear dependence of one variable on another. This definition itself gives a couple of important clues to solve the mystery questions. “Linear” means that the relationship is “linear in the parameters,” but does not need to be linear in the variables themselves. These parameters (or coefficients) are like ‘adjustments’ in appraisal – but different in amount. (Those familiar with the polynomial curves we use in the Stats, Graphs, and Data Science class may remember that even though the polynomials are curved, the regression is still linear in the parameters, or coefficients.) If this is confusing, then the first point of this blog is simpler: for a regression that is nonlinear in its coefficients, R² is different, very different.
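A small sketch may help with “linear in the parameters.” The numbers below are invented, and the variable names are only illustrative. A curved (quadratic) relationship is fitted with ordinary linear least squares simply by adding x² as another column.

```python
import numpy as np

# Hypothetical curved data (roughly y = x squared, plus small bumps).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.1, 4.3, 8.8, 16.2, 25.9, 37.1, 50.2])

def r_squared(X, y):
    """Fit y = X @ coef by least squares and return R-squared."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

ones = np.ones_like(x)

# Straight line: y = b0 + b1*x
X_line = np.column_stack([ones, x])

# Quadratic curve: y = b0 + b1*x + b2*x^2
# Curved in x, yet each coefficient still enters as a plain multiplier,
# so this is still a linear regression.
X_quad = np.column_stack([ones, x, x ** 2])

r2_line = r_squared(X_line, y)
r2_quad = r_squared(X_quad, y)

print(f"R2, straight line: {r2_line:.4f}")
print(f"R2, quadratic:     {r2_quad:.4f}")
```

The same solver handles both fits; only the columns change. A model such as y = a·exp(b·x), where b sits inside the exponent, is not linear in its coefficients and cannot be set up this way.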
The second word, “dependence,” is something you must decide before you run the regression. The data itself tells you nothing about which variable depends on the other. You, the appraiser, the analyst, the data scientist, the asset economist – must decide what depends on what, before you push the regress button.
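Here is a rough illustration of why the data cannot decide this for you, using invented numbers. Regressing value on size and regressing size on value give the same r², but two different lines.

```python
import numpy as np

# Hypothetical data: size in 1,000 sq ft, value in $1,000s.
size = np.array([1.2, 1.5, 1.7, 2.0, 2.3, 2.6])
value = np.array([230.0, 210.0, 290.0, 270.0, 350.0, 330.0])

cov = np.cov(size, value, ddof=1)[0, 1]

# Slope when value is treated as depending on size.
slope_value_on_size = cov / np.var(size, ddof=1)

# Slope when size is treated as depending on value.
slope_size_on_value = cov / np.var(value, ddof=1)

# r-squared is symmetric: it is the same whichever way you regress.
r2 = np.corrcoef(size, value)[0, 1] ** 2

print(f"value on size slope:        {slope_value_on_size:.2f}")
print(f"size on value slope:        {slope_size_on_value:.5f}")
print(f"implied slope if inverted:  {1.0 / slope_size_on_value:.2f}")
print(f"r2 either way:              {r2:.4f}")
```

The product of the two slopes equals r², so the two lines agree only when the fit is perfect. The statistic is identical in both directions; the choice of which variable depends on which is yours.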
R² does measure the goodness of fit – how well a particular line fits the data. The least-squares regression line is the one that minimizes the sum of the squared vertical distances between the line and each data point. Because each distance (deviation) is squared, outliers have greater influence. (So the least similar comps are given extra weight – another fallacy for another day . . .)
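To see how much weight one unusual point can carry, consider this small sketch. The data are made up so that the first eight points show no linear trend at all; then a single far-away point is added.

```python
import numpy as np

def line_r_squared(x, y):
    """Fit a least-squares straight line and return its R-squared."""
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Eight hypothetical points with no linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([5.0, 3.0, 6.0, 4.0, 5.0, 3.0, 6.0, 4.0])

# The same points plus one extreme outlier at (30, 30).
x_out = np.append(x, 30.0)
y_out = np.append(y, 30.0)

r2_without = line_r_squared(x, y)
r2_with = line_r_squared(x_out, y_out)

print(f"R2 without the outlier: {r2_without:.4f}")
print(f"R2 with the outlier:    {r2_with:.4f}")
```

One point takes the R² from essentially zero to above 0.9. The squared deviation of that point is so large that the line is pulled toward it, and the fit looks impressive even though the other eight points show no relationship.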
I carry a list around in my wallet with several other data issues which must always be considered along with “how high is the R²?” Ready?
- Outliers
- Slope of the line
- Influential data points
- Number of data points
- The domain (smallest to the largest)
- Linearity
- Nature of the problem
- Deviations unevenly distributed
- Pragmatic (economic) relevance
- Data transformations made, or should have been made
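Two items on the list, the number of data points and pragmatic relevance, are easy to show in a short sketch. The figures are invented for illustration only.

```python
import numpy as np

def slope_and_r_squared(x, y):
    """Fit a least-squares straight line; return (slope, R-squared)."""
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return slope, 1.0 - ss_res / ss_tot

# Number of data points: any two distinct points sit exactly on a line.
two_sizes = np.array([1500.0, 2400.0])   # sq ft
two_values = np.array([250.0, 310.0])    # $1,000s
_, r2_two = slope_and_r_squared(two_sizes, two_values)

# Pragmatic relevance: a perfect fit whose slope hardly matters.
# Value rises by only 0.001 ($1,000s) per sq ft, i.e. $1 per sq ft.
sizes = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
values = 300.0 + 0.001 * sizes
slope_flat, r2_flat = slope_and_r_squared(sizes, values)

print(f"R2 with two points:    {r2_two:.4f}")
print(f"R2 with trivial slope: {r2_flat:.4f} (slope = {slope_flat:.4f})")
```

Both fits report an R² of 1. The first says nothing because two points always fit a line perfectly; the second is a perfect fit to an adjustment too small to matter economically.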
In some cases, a higher R² (or r²) is a good thing. In other cases it is not. It depends . . .
(Please note: R² (or r²) is completely different from the R and RStudio open source software programs that we use in our Stats, Graphs, and Data Science classes.)