Monday, August 25, 2014

Using Real Data from the Real World... Because Really

I like to use real data in my Statistics classes. Off the top of my head, I can think of only one brief instance where I wouldn't. Unfortunately, most of the math textbooks and exams my students see contain lots of made-up real-world situations with painfully obvious fake data. So it's a real relief that most of the Statistics textbooks I know are chock-full of examples and problems that use genuine datasets. These data come from many different disciplines, from peer-reviewed journals, or from the author's personal data collection (and these are often fun).

For class assignments, I sometimes have my students collect data too. Other times, however, I want my students to complete a task using whatever existing dataset they wish. In these cases, the context is not really important, so I would much rather have them research and use something they find personally interesting. Most students love that they get to choose the topic, but a few have complained that they don't know where to look for data. Like, in alllll of the Internet they couldn't find any data anywhere that was worth investigating. Okayyy...

In my students' defense, there is so much data available publicly that I can understand how it can be overwhelming for a teenager. The datasets are out there but they're not always easy to find, or they require fancy software, or they're only available for a fee, or students don't know where to begin their search. So, this summer I set out to compile a list of good online sources for data that students and teachers can use in the Statistics classroom.

A few (okay, more than a few) comments about this list:

  1. All of the sources provide data and datasets free of charge. Unfortunately, that rules out some well-known sources. For example, data from the Gallup Poll Database is only available for a fee. That's not something I am personally interested in paying for at the moment, but if you are, you can find more information on the Gallup Analytics website.
  2. I've tried to describe how one might search each website or database. I've also tried to list some of the topics and resources for which I think each site is useful. Your mileage may vary.
  3. The sources under "Data Archives and Libraries" contain datasets that may be good for teaching but may not be suitable for projects and assignments in which you want students to produce original work. That is, in these cases there's very little thinking that needs to be done. Furthermore, some are more, um, organized or searchable than others.
  4. The data is not always available in a nice format. Some of the websites have the data in tabular form or allow users to download in various user-friendly formats, but that's not always the case. That said, I steered clear of sites like IPUMS that offer only microdata or require a data extraction system, since that kind of software is not readily available to my students.
  5. Because the data is real, there's no guarantee that it will be clean. I certainly don't shy away from messy data in my classroom, but that might be a different story for new teachers or in a middle-school setting.
  6. Some of the sites listed might seem redundant. For example, I've listed the new (and exciting!) NYC Open Data website. This is meant to be one-stop shopping for all New York City data. I have also, though, listed the links for several of the city agencies whose data I find particularly useful or that have yet to be included in the open data.
  7. This is a living document. Links and sites change all the time. If you have suggestions, questions, or kind comments, please feel free to email me.

