Myths Fallacies and Foibles #3: R² - George Dell, SRA, MAI, ASA, CRE

August 9, 2017


R² or r² is a number which has been squared. Or is it?

It’s popular and feels sophisticated.

Especially if it’s ‘high’.

Higher is better. Or is it?

There are actually a number of ‘fallacies’ around the use of R² or r². First, let’s distinguish between the big letter R and the little letter r. The little r² is for simple regression, where you have just one predictor and one predicted variable. (For example, value y is predicted by size x.) It is also called the powerful-sounding coefficient of determination.
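To make the little r² concrete, here is a minimal Python sketch of a simple regression of price on size. The five comps are made-up numbers for illustration only, and the helper names (fit_line, r_squared, pearson_r) are hypothetical, not taken from any package. It also checks the opening question: in simple regression, r² really is the correlation r, squared.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(x, y):
    """Share of the variation in y that the fitted line accounts for."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def pearson_r(x, y):
    """Ordinary (Pearson) correlation between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Hypothetical comps: living area (sq ft) and sale price (dollars).
size = [1200, 1500, 1800, 2100, 2400]
price = [250000, 290000, 345000, 380000, 440000]

intercept, slope = fit_line(size, price)
r2 = r_squared(size, price)
r = pearson_r(size, price)

print(f"slope: {slope:.2f} dollars per sq ft")
print(f"r-squared from the fit: {r2:.4f}")
print(f"correlation squared:    {r ** 2:.4f}")
```

The last two printed numbers agree, which is why the little r² is written as a square. That shortcut only holds for one predictor and a least-squares line with an intercept.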

The capital R² is used in multivariate regression. In other words, there are several predictors (like size, lot size, number of doors, number of bedrooms, or distance to downtown). It may then be called the coefficient of multiple determination.

r² is a measure of the linear dependence of one variable on another. This definition itself gives a couple of important clues to solve the mystery questions. “Linear” means that the relationship is “linear in the parameters,” but it does not need to be linear in the variables themselves. These parameters (or coefficients) are like ‘adjustments’ in appraisal – but different in amount. (Those familiar with the polynomial curves we use in the Stats, Graphs, and Data Science¹ class may remember that even though the polynomials are curved, the regression is still linear in the parameters, or coefficients.) If this is confusing, then the first point of this blog is simpler: for models that are nonlinear in the parameters, R² is different, very different.
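A small Python sketch may help with “linear in the parameters.” The data here are contrived (a perfect curve, y = 2 + 0.5x²) and the helper name r_squared is hypothetical. The same least-squares machinery is run twice: once with x as the predictor, and once with x² as the predictor. The second model is curved in x, yet it is still an ordinary linear regression because the coefficients enter linearly.

```python
def r_squared(x, y):
    """r-squared of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Contrived data: a perfect curve, symmetric around x = 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [2 + 0.5 * xi ** 2 for xi in x]

# Straight line in x: the best-fitting line is flat, so it explains nothing.
r2_straight = r_squared(x, y)

# Same regression, but the predictor is x squared (a transformed variable).
x_squared = [xi ** 2 for xi in x]
r2_curved = r_squared(x_squared, y)

print(f"r-squared, y on x:         {r2_straight:.3f}")
print(f"r-squared, y on x squared: {r2_curved:.3f}")
```

A near-zero r² in the first run does not mean size and value are unrelated. It only means a straight line in x is the wrong shape for this data.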

The second word, “dependence,” is something you must decide before you run the regression. The data itself tells you nothing about which variable depends on the other. You, the appraiser, the analyst, the data scientist, the asset economist, must decide what depends on what before you push the regress button.
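Here is a hedged Python illustration of why the data cannot decide the direction for you. The numbers are invented and the helper name slope_and_r2 is hypothetical. Regressing y on x and regressing x on y give exactly the same r², but they give two different lines.

```python
def slope_and_r2(x, y):
    """Least-squares slope of y on x, plus the r-squared of that line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    # For a simple least-squares line this equals 1 - SS_res / SS_tot.
    r2 = sxy ** 2 / (sxx * syy)
    return slope, r2

# Invented data, no units implied.
x = [1, 2, 3, 4, 5]
y = [3, 2, 8, 6, 11]

slope_yx, r2_yx = slope_and_r2(x, y)  # y depends on x
slope_xy, r2_xy = slope_and_r2(y, x)  # x depends on y

print(f"y on x: slope {slope_yx:.4f}, r-squared {r2_yx:.4f}")
print(f"x on y: slope {slope_xy:.4f}, r-squared {r2_xy:.4f}")
print(f"1 / (slope of y on x): {1 / slope_yx:.4f}")
```

If the two regressions described the same line, the second slope would be the reciprocal of the first. It is not, unless the fit is perfect. The r² is identical either way, so it offers no hint about which direction is the right one.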

R² does measure the goodness of fit: how well a particular line fits the data. The least-squares regression itself picks that line by minimizing the sum of the squared distances between the line and each data point. Because each distance (deviation) is squared, outliers have greater influence. (So the least similar comps are given extra weight, which is another fallacy for another day . . .)
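The pull of a single outlier can be sketched in a few lines of Python. The data are contrived (five points exactly on a line, then one invented outlier) and the helper name slope_and_r2 is hypothetical.

```python
def slope_and_r2(x, y):
    """Least-squares slope of y on x, plus the r-squared of that line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # sxy^2 / (sxx * syy) equals 1 - SS_res / SS_tot for a simple line.
    return sxy / sxx, sxy ** 2 / (sxx * syy)

# Contrived data: five points exactly on the line y = 2x ...
x_clean = [1, 2, 3, 4, 5]
y_clean = [2, 4, 6, 8, 10]

# ... plus one hypothetical outlier far below that line.
x_out = x_clean + [6]
y_out = y_clean + [2]

slope_clean, r2_clean = slope_and_r2(x_clean, y_clean)
slope_out, r2_out = slope_and_r2(x_out, y_out)

print(f"without outlier: slope {slope_clean:.2f}, r-squared {r2_clean:.2f}")
print(f"with outlier:    slope {slope_out:.2f}, r-squared {r2_out:.2f}")
```

In this made-up case, one point out of six drags the slope from 2 to well under 1 and takes the r² from a perfect fit to almost nothing. Real data will rarely be this extreme, but the direction of the effect is the same.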

I carry a list around in my wallet with several other data issues which must always be considered along with “how high is the R²?” Ready?

Outliers
Slope of the line
Influential data points
Number of data points
The domain (smallest to the largest)
Linearity
Nature of the problem
Deviations unevenly distributed
Pragmatic (economic) relevance
Data transformations made, or should have been made
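One item on the list, the number of data points, is easy to demonstrate. The Python sketch below uses invented sales and a hypothetical helper named r_squared. Any two points with different sizes and prices produce a ‘perfect’ r² of 1, because a line can always pass through both. Add a third point and the number can collapse.

```python
def r_squared(points):
    """r-squared of the least-squares line through (x, y) pairs."""
    n = len(points)
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    syy = sum((py - mean_y) ** 2 for _, py in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    # Equivalent to 1 - SS_res / SS_tot for a simple least-squares line.
    return sxy ** 2 / (sxx * syy)

# Invented sales: (living area in sq ft, price in dollars).
pair_up = [(1500, 300000), (2600, 310000)]    # price rises with size
pair_down = [(1500, 400000), (2600, 250000)]  # price falls with size

r2_up = r_squared(pair_up)
r2_down = r_squared(pair_down)

# Add one more invented sale to the first pair.
r2_three = r_squared(pair_up + [(2000, 350000)])

print(f"two points, rising:  {r2_up:.3f}")
print(f"two points, falling: {r2_down:.3f}")
print(f"three points:        {r2_three:.3f}")
```

Both two-point ‘models’ score a perfect 1, even though they tell opposite stories about size and price. A high number from very few sales says little about the market.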

In some cases, a higher R² (or r²) is a good thing. In other cases it is not. It depends . . .

(Please note: R² (or r²) is completely different from the R and RStudio open source software programs that we use in our Stats, Graphs, and Data Science classes.)
