Telling Stories with Data
Notes on a data-first end-to-end workflow using quantitative analysis and R.
Last updated 15 March 2020
Chapter 1 Overview
1.1 On telling stories
Like many parents, when our child was born, one of the first things that my wife and I did regularly was read stories to him. In doing so we carried on a tradition that has occurred for millennia. Myths, fables, and fairy tales can be seen and heard all around us. Not only are they entertaining but they enable us to easily learn something about the world. While ‘The Very Hungry Caterpillar’ may seem quite far from the world of quantitative analysis, there are similarities. Both are trying to tell the reader a story.
When conducting quantitative analysis we are trying to tell the reader a story that will convince them of something. It may be as exciting as predicting elections, as banal as increasing internet advertising click rates by 0.01 per cent, as serious as finding the cause of some disease, or as fun as forecasting the winner of a basketball game. In any case, the key elements are the same. When writing fiction, Wikipedia suggests there are five key elements: character, plot, setting, theme, and style. When we are conducting quantitative analysis we have analogous concerns:
- What is the data? Who generated it and how?
- What is the data trying to say? How can we let it say this?
- What is the broader context surrounding the data? Where and when was it generated? Could other data have been generated?
- What are we hoping others will see from this data?
- How can we convince them of this?
In the past, certain elements of telling stories with quantitative data were easier. For instance, experimental design has a long and robust tradition within traditional applications such as agricultural and medical sciences, physics, and chemistry. Student’s t-distribution was identified by a chemist, William Sealy Gosset, who was working at Guinness and needed to assess the quality of the beer (Raju 2005)! It would have been possible for him to randomly sample the beer and change one aspect at a time. Indeed, many of the fundamental statistical methods that we use today were developed in an agricultural setting. In the settings for which they were developed it was typically possible to establish control groups, randomize, and easily deal with any ethical concerns. In such a setting any subsequent story that is told with the resulting data is likely to be fairly convincing.
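Gosset’s setting can be sketched with a small simulation. The batches, sample sizes, and numbers below are all invented for illustration; the point is only that, with random sampling and one aspect changed at a time, Student’s t-test (which relies on the distribution Gosset identified) can compare the two groups.

```r
# Hypothetical illustration of Gosset's setting: simulate measurements
# from two randomly sampled batches of beer in which one aspect (say,
# the yeast) has been changed, then compare them with Student's t-test.
# All numbers here are invented.
set.seed(853)

batch_original  <- rnorm(n = 10, mean = 5.0, sd = 0.3)  # e.g. alcohol (%)
batch_new_yeast <- rnorm(n = 10, mean = 5.2, sd = 0.3)

t.test(batch_original, batch_new_yeast)
```

With such small samples the normal approximation would be misleading, which is exactly why Gosset needed the t-distribution.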
Unfortunately, such a set-up is rarely possible in modern data science applications. On the other hand, there are many aspects that are easier today. For instance, we have well-developed statistical techniques, easier access to larger datasets, and open source statistical languages such as R. But the lack of ability to conduct traditional experiments means that we must turn to other aspects in order to tell a reader a convincing story about our data. These other aspects allow us to tell stories about causality even in the absence of a traditional experimental set-up.
1.2 Telling stories with data
These notes focus on what to do when traditional experimental design methods cannot be implemented or are not appropriate. The aim is to equip you with everything you need to write short technical memos that convince a reader of the story you are telling. These notes encourage research-based, independent learning. This means that you will develop your own questions and answer them to the extent that you can. We focus on methods that can provide some measure of causality even when we cannot conduct traditional experiments. Importantly, these approaches do not rely on ‘big data’, but instead on better using the data that are available. The purpose of the notes is to allow you to tell convincing stories using data and quantitative analysis. They blend theory, examples, and labs, to equip you with practical skills, a sophisticated workflow, and an appreciation for how more-advanced methods build on what is covered here.
Data science is multi-disciplinary. It takes the ‘best’ bits from fields such as statistics, data visualisation, programming, and experimental design (to name a few). As such, data science projects require a blend of these skills. These are hands-on notes in which you will learn these skills by conducting research projects using real-world data. This means that you will:
- obtain and clean relevant datasets;
- develop your own research questions;
- use statistical techniques to answer those questions; and
- communicate your results in a meaningful way.
These notes were developed in collaboration with professional data scientists as well as academics from a variety of fields. They are designed around approaches that are used extensively in academia, government, and industry. Furthermore, they include many aspects, such as data cleaning and communication, that are critical, but rarely taught. However, these notes do not contain everything that you need. Your learning must be ‘active’ when using these notes because that is the way you will continue to learn through the rest of your life and career. You need to seek out additional information, critically evaluate it, and apply it to your situation.
The workflow that we follow in these notes is:
- Research question development.
- Data collection.
- Data cleaning.
- Exploratory data analysis.
- Statistical modelling.
All of these aspects are critical to being able to tell convincing stories in the absence of traditional experimental set-ups. Importantly, when you are doing causal analysis, you are trying to convince a reader of your story. Your ability to convince them of your story depends on the quality of all aspects of your workflow.
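As a rough sketch, the workflow above might look like the following in R. The file name and column names are assumptions for illustration only; the hypothetical research question is whether occupancy has changed over time.

```r
# A minimal sketch of the end-to-end workflow; the CSV file and its
# columns ('date', 'occupancy') are hypothetical.
library(tidyverse)

# Data collection: read in the raw data
raw_data <- read_csv("shelter_usage.csv")

# Data cleaning: standardise column names and drop missing values
cleaned_data <- raw_data %>%
  rename_all(tolower) %>%
  drop_na(occupancy)

# Exploratory data analysis: make a graph
cleaned_data %>%
  ggplot(aes(x = date, y = occupancy)) +
  geom_line()

# Statistical modelling: a simple linear model
occupancy_model <- lm(occupancy ~ date, data = cleaned_data)
summary(occupancy_model)
```

Each of these steps gets its own chapter; the point here is only that the steps form one connected pipeline, and the credibility of the final model depends on everything upstream of it.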
If we were to expand on this workflow then we roughly get the chapters that are covered in these notes. From the first chapter we will have a workflow (make a graph then write about it convincingly) that could allow us to speak to causality. In each subsequent chapter we add aspects and depth to our workflow that will allow us to speak with increasing sophistication and credibility.
This workflow also aligns nicely with the skills that are sought in data scientists. For instance, Mango Solutions, a UK data science consultancy, describes ‘the six core capabilities of data scientists’ as:
- modeller; and
1.3 These notes
1.3.1 Software
The software that we use in these notes is R. This language was chosen because it is open-source, widely used, general enough to cover the entire workflow, yet specific enough to have plenty of the tools that we need for statistical analysis built in. We do not assume that you have used R before. Another reason for selecting R for these notes is that the community of R users is, in general, especially welcoming of newcomers, and there are many great beginner-friendly materials available.
If you don’t already know a programming language, then R is a great one to start with. If you already have a preferred programming language, then it wouldn’t hurt to pick up R as well. That said, if you have a good reason to prefer another open-source programming language (for instance, you use Python daily at work) then you may wish to stick with that. However, all examples in these notes are in R.
Please download R and RStudio onto your own computer. You can download R for free here: http://cran.utstat.utoronto.ca/, and you can download RStudio Desktop for free here: https://rstudio.com/products/rstudio/download/#download.
Please also create an account on RStudio Cloud: https://rstudio.cloud/. This will allow you to run R in the cloud, which will be helpful when you are getting started.
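Once everything is installed, a quick way to check that your set-up works is to run a few lines in the R console. The tidyverse collection of packages is a common starting point for this kind of workflow, so installing it now will save time later.

```r
# Check the installed version of R
R.version.string

# Install a package (only needed once) and then load it
# (needed each session)
install.packages("tidyverse")
library(tidyverse)

# Make a first graph using a dataset built into R
plot(mtcars$wt, mtcars$mpg)
```

If the plot appears and no errors are printed, your installation is working.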
1.3.2 Assumed background
These notes assume familiarity with first-year statistics. For instance, if you have taken a course or two that covered hypothesis testing and similar concepts then that should be enough.
These notes are structured around a 12-week course. Each chapter contains a list of required reading, as well as a list of recommended reading for those who are interested in the topic and want a starting place for further exploration. All chapters contain a summary of the key concepts and skills that are developed in that chapter. Code and technical chapters additionally contain a list of the main packages and functions that are used in the chapter. Each of the main content chapters (as opposed to labs) also has a pre-quiz. This is a short quiz that you should complete after doing the required readings, but before going through the chapter, to test your knowledge. After completing the chapter, you should go back through the lists and the pre-quiz to make sure that you understand each aspect.
There are five problem sets throughout these notes. These are opportunities for you to conduct your own research on a topic that is of interest to you. Although the initial problem sets require you to use data from the Toronto Open Data Portal (https://open.toronto.ca/), after those first few you are able to use any appropriate dataset. Although open-ended research may be new to you, the extent to which you are able to develop your own questions, use quantitative methods to explore them, and communicate your story to a reader, is the true measure of the success of these notes.
Many people gave generously of their time, code, and data to help develop these notes. Thank you to Monica Alexander, Michael Chong, and Sharla Gelfand for allowing their code to be used. Thank you to Kelly Lyons, Hareem Naveed, and Periklis Andritsos for helpful comments.
Thank you to the Winter 2020 Term INF2178 students at the University of Toronto, whose feedback greatly improved all aspects, especially: Aaron Miller, and Mounica Thanam.
Any comments or suggestions on these notes would be welcomed. You can contact me: firstname.lastname@example.org.
Raju, Tonse. 2005. “William Sealy Gosset and William A. Silverman: Two ‘Students’ of Science.” Pediatrics 116. https://doi.org/10.1542/peds.2005-1134.