# Myths, Fallacies, and Foibles #3: R²

## It’s popular and feels sophisticated.

Especially if it’s ‘high’.

R² or r² is a number which has been squared. Or is it?

Higher is better. Or is it?

There are actually a number of ‘fallacies’ around the use of R² or r². First, let’s distinguish between the capital R and the lowercase r. The lowercase r² is for simple regression, where you have just one predictor and one predicted variable (like value *y* predicted by size *x*). It is also called the powerful-sounding *coefficient of determination*.

The capital R² is used in multiple regression. In other words, there are several predictors (like size, lot size, number of doors, number of bedrooms, or distance to downtown). It is then called the *coefficient of multiple determination*.
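As a small numeric sketch of that distinction (the house data below is invented for illustration, not from any appraisal): fit a one-predictor line to get the lowercase r², then add a second predictor to get the capital R².

```python
import numpy as np

# Hypothetical data: sale price (in $1000s) predicted by house size (sq ft)
size = np.array([1200., 1500., 1700., 2000., 2400.])
price = np.array([200., 250., 270., 330., 390.])

# Simple regression: price = a + b*size, fit by least squares -> lowercase r²
X1 = np.column_stack([np.ones_like(size), size])
coef1, *_ = np.linalg.lstsq(X1, price, rcond=None)
pred1 = X1 @ coef1
r2 = 1 - np.sum((price - pred1) ** 2) / np.sum((price - price.mean()) ** 2)

# Multiple regression: add a second predictor (lot size) -> capital R²
lot = np.array([5000., 6000., 5500., 8000., 9000.])
X2 = np.column_stack([np.ones_like(size), size, lot])
coef2, *_ = np.linalg.lstsq(X2, price, rcond=None)
pred2 = X2 @ coef2
R2 = 1 - np.sum((price - pred2) ** 2) / np.sum((price - price.mean()) ** 2)

print(round(r2, 3), round(R2, 3))
```

Note, by the way, that adding a predictor can never lower R² in ordinary least squares with an intercept, which is itself one reason a high R² can mislead.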

R² is a measure of **linear dependence** of one variable on another. This definition itself gives a couple of important clues to solve the mystery questions. “Linear” means that the relationship is “linear in the parameters,” but it does not need to be linear in the variables themselves. These parameters (or coefficients) are like ‘adjustments’ in appraisal – but different in amount. (Those familiar with the polynomial curves we use in the *Stats, Graphs, and Data Science*¹ class may remember that even though the polynomials are curved, the regression is still linear in the parameters, i.e. the coefficients.) If this is confusing, the first point of this blog is simpler: for models that are nonlinear in their coefficients, R² means something different, very different.
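To see “linear in the parameters” concretely, here is a sketch (the curved data is made up): a quadratic model is curved in *x*, yet it still fits by ordinary linear least squares, because the coefficients enter the model linearly.

```python
import numpy as np

# Hypothetical curved data: y rises, then levels off
x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2.1, 3.9, 5.2, 5.9, 6.1, 6.2])

# The model y = b0 + b1*x + b2*x^2 is curved in x, but LINEAR in the
# parameters b0, b1, b2 -- so ordinary least squares still applies,
# using the design matrix [1, x, x^2].
X = np.column_stack([np.ones_like(x), x, x ** 2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ b
R2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(R2, 3))
```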

The second word, “dependence,” must be decided *before* you run the regression. The data itself tells you nothing about which variable depends on the other. You – the appraiser, the analyst, the data scientist, the asset economist – must decide what depends on what, *before* you push the regress button.
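A quick sketch of that point (using simulated data, not anything from this post): r² is symmetric, so it cannot tell you which variable depends on which, yet regressing *y* on *x* gives a different line than regressing *x* on *y*.

```python
import numpy as np

# Simulated data with a moderate linear relationship
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

# Regress y on x, then x on y: the data cannot pick a direction for you.
b_yx = np.polyfit(x, y, 1)[0]   # slope of the y-on-x line
b_xy = np.polyfit(y, x, 1)[0]   # slope of the x-on-y line

# r² is the same either way (it is the squared correlation) ...
r2 = np.corrcoef(x, y)[0, 1] ** 2
# ... but the two fitted lines differ unless the fit is perfect:
# the product of the two slopes equals r², not 1.
print(round(b_yx * b_xy, 3), round(r2, 3))
```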

R² does measure the *goodness of fit* – how well a particular line fits the data. The least-squares fit minimizes the sum of the squared vertical distances between the line itself and each data point. Because each distance (deviation) is squared, outliers have greater influence. (So the *least* similar comps are given extra weight — another fallacy for another day . . .)
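A small sketch of that outlier sensitivity (toy numbers, invented here): a perfect line has r² of exactly 1, and moving a single point drags it down sharply because the deviation is squared.

```python
import numpy as np

def r_squared(x, y):
    # Fit a simple least-squares line and return its r²
    slope, intercept = np.polyfit(x, y, 1)
    pred = intercept + slope * x
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

x = np.arange(10.0)
y = 2 * x + 1                 # points exactly on a line: r² = 1
r2_clean = r_squared(x, y)

y_out = y.copy()
y_out[-1] += 30               # one outlier, squared into extra influence
r2_out = r_squared(x, y_out)

print(round(r2_clean, 3), round(r2_out, 3))
```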

I carry a list around in my wallet with several *other* data issues which must always be considered along with “how high is the r-squared?” Ready?

- Outliers
- Slope of the line
- Influential data points
- Number of data points
- The domain (smallest to largest value)
- Linearity
- Nature of the problem
- Deviations unevenly distributed
- Pragmatic (economic) relevance
- Data transformations made (or that should have been made)
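The classic demonstration of several items on this list at once is Anscombe’s quartet (Anscombe, 1973, not data from this post): four small datasets with nearly identical r², yet one is a clean line, one is a curve, one has an outlier, and one hangs on a single influential point. Asking only “how high is the r-squared?” cannot tell them apart.

```python
import numpy as np

# Anscombe's quartet: famous for identical summary statistics
# (including r²) despite radically different shapes.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

# For simple regression, r² is the squared correlation coefficient
r2s = [np.corrcoef(x, y)[0, 1] ** 2 for x, y in quartet]
print([round(v, 3) for v in r2s])  # all four come out near 0.67
```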

In some cases, a higher R² (or r²) is a good thing. In other cases it is not. It depends . . .

*(Please note: r-squared is completely different from the R software package.)*