Sunday, January 4, 2015

Lost in Translation, the Statistics Edition


If you speak multiple languages, then you're very much aware of words that easily get lost in translation. If you speak "Statistics," this happens just the same. Students tell me that one of the hardest things about learning Statistics is the vocabulary: learning and using the terminology correctly. Being a Statistics teacher has made me increasingly aware of certain statistical vocabulary that students, and people in general, use in ways that often don't mean exactly what they think they mean.

Here are five statistical words and concepts that I think can sometimes get lost in translation:

  1. Correlation -- We use "correlation" in the English language to denote any relationship between two things. In Statistics, however, this word really identifies a specific, linear relationship. To talk of a correlation between two variables that are non-linearly related even sounds a little silly to me. It can be hard for people to make that distinction, let alone resist the urge to make the leap from correlation to causation. I like to think of correlation as being conditioned on a linear relationship. I try to shift students away from using a correlation coefficient as evidence for the strength of a linear model since it can be very misleading, like a siren song.
  2. Prediction -- This one is more of a personal pet peeve that stems from my prior actuarial work. Many statisticians use predictive models and the term "predict" to indicate using a model to make an estimate. If I had a nickel for every time a student wrote the phrase "predict the future" when commenting on the utility of regression, or of statistics in general, I could retire early. As if statisticians have a crystal ball and can intuit what will happen in the future. Hardly. This is precisely one of the reasons the actuarial firm for which I worked banned "predict" and its derivatives from all outgoing documents. It has stuck with me to this day. I don't think the use of "prediction" is wrong; I just think it helps perpetuate some misconceptions. Instead, I encourage the use of the word "project" to replace "predict."
  3. Random* -- Yes, this is a tricky one. This word gets thrown around quite arbitrarily to informally mean a lot of things. "I randomly bumped into a friend in Park Slope." "Here's a random thought..." "She randomly dropped the glass." Nope, not really random. Randomness in Statistics is about the lack of pattern (e.g., a random set of digits), having an indeterminable value (e.g., a random variable), or having equal chance (e.g., a random sample). The problem here is not so much that we use "random" to generally mean pseudorandom, but that we use it to mean so many other things. So, what to use instead? Unexpected, miscellaneous, arbitrary, haphazard, accidental, ... there are so many better alternatives.
  4. Percent** -- I think this is a mathematical concept that often gets misused and misunderstood. Proportional reasoning is typically hard for students, and using percentages to represent ratios is even more difficult when it's unclear what the ratio is a ratio of. Talking about a percent gets foggy if one cannot answer the question: "A percent of what?" Is it percentage points, a joint percentage, a conditional percentage, or what? Even further, reported percentage comparisons can be misleading when they compare ratios with unequal denominators. In most cases, these translation issues are remedied by asking a hierarchy of questions to promote understanding, as well as by representing percents in different mathematical ways.
  5. Age -- When surveying people and asking about their demographic information, it can be difficult to correctly obtain even basic information. Classification questions about gender, ethnicity, and the like are somewhat obviously not so simple. The bias issues in this case can often be reduced by adding an "other" option when asking categorical questions. But why is age also hard to ask about? If someone is 19 years old, for example, it can be confusing as to whether that means the person has finished their 19th year of life or is starting it. Sometimes people round ages, especially when asked about the age of a relative. This is further complicated because how one states one's age varies across languages. While I don't think that the true definition of age varies too greatly, good surveys avoid this issue by asking people for their birth date.
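
Two of these items, correlation and percent, are easy to check with a few lines of arithmetic. What follows is only a minimal Python sketch: the data points and the 10% and 15% rates are made up for illustration, and the function names are hypothetical helpers, not part of any library.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient; it measures linear association only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative data (an assumption): x values symmetric around zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # y depends on x linearly
parabola = [x ** 2 for x in xs]    # y depends on x exactly, but not linearly

r_linear = pearson_r(xs, linear)      # perfect linear relationship: r = 1
r_parabola = pearson_r(xs, parabola)  # perfect dependence, yet r = 0


def percentage_point_change(old_pct, new_pct):
    """Absolute difference between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_percent_change(old_pct, new_pct):
    """Change expressed as a percent OF the old percentage."""
    return (new_pct - old_pct) / old_pct * 100


# Illustrative rates (an assumption): a rate that rises from 10% to 15%.
points = percentage_point_change(10.0, 15.0)    # 5 percentage points
relative = relative_percent_change(10.0, 15.0)  # a 50 percent increase

print(f"r (linear)   = {r_linear:.2f}")
print(f"r (parabola) = {r_parabola:.2f}")
print(f"change: {points:.0f} percentage points, or {relative:.0f} percent")
```

The parabola is completely determined by x, yet its correlation coefficient is zero, which is the sense in which correlation speaks only to linear relationships. Likewise, "up 5" and "up 50" describe the same change in the rate; the difference is the answer to "a percent of what?"
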
What's most important about these misconceptions is that they can help guide teaching, so that students better understand the proper use of vocabulary and avoid common pitfalls, which in turn promotes better statistical literacy.

* The site Random.Org uses atmospheric noise to produce random numbers, lists of digits, and other random-based applications.
** Ian Hay of the University of Tasmania is in part responsible for helping me think about this misconception. You can read more about it in his ICOTS 9 invited paper, Teaching Probability: Using Levels of Dialogue and Proportional Reasoning.

(Graphic from memegenerator.net via hellogiggles.com.)
