Most of us grew up being taught pie chart statistics. In elementary school, that’s probably how you learned about data and graphing. Proportions were taught using a pie chart. This seemed relevant because it dealt with the practical problem of slicing a pie so five kids could share. Much else is taught without asking such practical, mouth-watering questions. Except, maybe, pizza!
How is data science different from statistics?
The explanation is that data science uses and reduces statistics. What does this mean?
In high school, although pie was still served, there was more emphasis on summary numbers (called ‘descriptive’ statistics), such as mean, median, range, and even deviation. The ‘normal’ bell curve was shown as being important, but no one seemed to know why it was important. Perhaps even skewed, leptokurtic, and mesokurtic curves appeared. The study of proportions was expanded to the concept of probability.
In college, there was little or no further attention to the ways of describing data. Too simple. The statistics professors often were the most ivory-tower of the bunch. No one seemed to really know what was going on. It was all about inferential statistics. (How do you infer something about a population by “drawing” a sample?) It was all quite stern and serious.
But what few noticed was that most statistics was not taught in the statistics department. It was taught in other departments: first by economists (who had big numbers), psychologists (who had samples), and medical researchers (who were not ethically allowed to perform experiments on people).
How could this happen?
Then something else happened. It was called data. Almost suddenly, electronically, instantly, it poured out of our desktop screens, right out of the cloud above!
No longer did I have to study a few carefully selected cases, to guess how the population behaved. No longer did I have to carefully pick a random sample from the population. I could just look at the whole population!
The whole population! How could this be? How can I see the whole population? It’s too big!
This is the first, most important difference between old statistics and modern data science. The nature of the problem is different.
- Statistics (particularly inferential statistics) is all about how to learn or ‘characterize’ a population by drawing a random sample. Randomness is king! The problem of understanding a large population of data is handled by reducing it to sample means, medians, deviations, p-values, regression t-scores, R-squares, and many other sophisticated and clever “statistical tests.”
- Data Science is all about how to understand the mass of population data directly. The human brain zonks out on groups larger than six or seven. (Just three comps is wonderful!) The human brain is really good at generalizing from three or four cases. And the human brain is really good at visualizing and seeing. Like graphs!
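The contrast between the two bullets above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "population" is synthetic (randomly generated numbers standing in for sale prices, not real market data), and the names `population`, `sample`, and `standard_error` are made up for this example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# ASSUMPTION: a synthetic "population" of 10,000 sale prices.
# These are generated numbers for illustration, not real sales.
population = [random.lognormvariate(12.5, 0.4) for _ in range(10_000)]

# Old statistics: draw a random sample, then infer the population mean
# from it, with a standard error to express the uncertainty.
sample = random.sample(population, 30)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

# Data science: the whole population is available, so summarize it
# directly. There is nothing to infer and no sampling error to report.
population_mean = statistics.mean(population)
population_median = statistics.median(population)

print(f"Estimate from a sample of 30: {sample_mean:,.0f} (+/- {standard_error:,.0f})")
print(f"Mean of the whole population: {population_mean:,.0f}")
print(f"Median of the whole population: {population_median:,.0f}")
```

With the sample, the answer comes with a margin of uncertainty. With the whole population, the numbers are simply what they are, and the remaining work is helping the human brain see them.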
So, this is the first main difference between old statistics and the modern science of data. Data science is about helping the human brain understand and analyze complete populations. For valuation, the burning question is: What is the population studied by appraisers, AVMs, evaluators, and guessers?
In coming analogue blogs, we will consider other differences.
Michael V. Sanders
March 25, 2020 @ 5:20 pm
The dilemma for me has always been whether we are studying a population of sales (from a geographic area over a specified time period), or a sample of properties that have sold from within that geographic area. On a lighter note, the story at the front end about pie charts and pizza instantly reminded me of one of Yogi Berra’s most famous quotes when asked whether he wanted his pizza cut into four or six slices – “You better cut the pizza in four pieces because I’m not hungry enough to eat six.”
susanrockypt
March 26, 2020 @ 2:39 pm
Love the humor, thank you!