Turning our world into data

A talk delivered at the Harvard Biostatistics Data Science in Action Summer Camp, 7 July 2021, organised by Jesse Gronsbell.

Author

Rohan Alexander

Published

July 6, 2021

Introduction

Hi, I’d like to thank Jesse Gronsbell for letting me talk today. My name is Rohan Alexander, and I’m an assistant professor in the Faculty of Information and the Department of Statistical Sciences at the University of Toronto.

Professor Gronsbell is also a professor in Statistical Sciences, but that’s about where my comparison with Jesse ends. Jesse works on a wide range of biostatistics problems, using classical statistical techniques to solve new problems, especially with large observational health data sets such as electronic health records. It’s a privilege to get to call her a colleague in data science.

I’m just someone who likes to play with data using the statistical programming language R (R Core Team 2020). And the nice thing about what we now call data science is that there’s space for both of us. And there’s space for you as well.

My path to data science

I’m 34 and so I’m in the ‘data science didn’t exist when I was an undergrad’ generation. Most of you are about 17, or about half my age. When I was 17, I wasn’t thinking about data science summer camps or self-driving cars, so it’s commendable that you’re already so advanced.

When I was 17 I just went to uni and studied economics, because that’s what my friends were doing. Luckily, I had some terrific mentors including Rabee Tourkey, Flavio Menezes, and Leo Yanes, and what they did was give broad instructions then get out of the way and check in weekly. While I don’t think I delivered anything of value to them, that experience of working directly with professors was formative for me.

One day I was in the library with a friend, and I wanted to go to the pub, but my friend insisted I wait until he’d applied for an internship at the Reserve Bank of Australia (RBA). I didn’t want to go to the pub without him, so I applied for that internship as well while I was waiting.

Be careful who you make friends with, because the result was that after uni I worked for a bit at the RBA, which is Australia’s central bank (think: the Federal Reserve). (My friend didn’t get the internship, but he ended up at Goldman Sachs, so he’s doing just fine.) That was great because I learnt just how vital communication, both written and spoken, is, and also how to navigate a traditional workplace.

After that I started a business with some friends because that was what I was doing when I wasn’t at work, and it seemed to make sense to make that my actual job. That start-up was great because I learnt about strategy, tactics, execution, teamwork, and leadership: all the stuff that traditional workplaces don’t let junior analysts worry about.

Why data science is exciting

I’d like to come back to that earlier point, that data science didn’t exist when I was 17, because it may imply that you should make decisions that optimise not just for what data science looks like right now, but also for what it could become. That’s a little difficult, but it’s also one of the things that makes data science so exciting. For a 17-year-old, that might mean choices like:

taking courses on fundamentals, not just fashionable applications;
reading books, not just whatever is trending on Hacker News; and
trying to be at the intersection of at least a few different areas, rather than hyper-specialised.

When that start-up finished, I started a PhD. My PhD is actually in economic history, so if I had my way, I’d spend my time in dusty, beautiful libraries reading old books and drinking coffee in the morning, and then writing code and drinking wine in the afternoon. Your library at Harvard is very nice, but not quite dusty enough for me, and you’re not allowed wine in there. The Faculty of Information, which is one of my appointments, originally existed to train librarians, so these days I get an office in a library, which is great, and my role is ‘human centered data science’, so it’s very much the best of both worlds. I get to turn our world into data, analyse it, and get paid to do it!

I spent most of my PhD trying to deal with big text datasets. And when I say ‘dealing’ I mean using R to clean and tidy them, which was the work of years. My supervisors—John Tang, Tim Hatton, Martine Mariotti, and Zach Ward—gave me the freedom to do what I wanted, again with regular weekly meetings. It’s not a topic that traditionally would have been appropriate, and I’m grateful they gave me the space because traditional economics is not for me, but data science is. And I hope that you also consider that it could be for you.

What data science means to me

What is data science? There’s no single agreed-upon definition, but a lot of people have tried. My other appointment is in the Department of Statistical Sciences, and my boss there says (Craiu 2019):

This unsureness isn’t necessarily the weakness many consider it. After all, who can really say what makes someone a poet or a scientist? Even so, some things can be said. In broad strokes, … someone with a data driven research agenda, who adheres to or aspires to using a principled implementation of statistical methods and uses efficient computation skills.

I like this definition, but I also think there’s value in having a simple definition, even if we lose a bit of specificity in the process. Probability, which is related to statistics, is often informally defined as ‘counting things’ (McElreath 2020, 10). In a similar informal sense, I’m currently playing around with a definition of data science that is something like ‘measuring stuff and averaging it, with purpose’.

One reason that I like this definition is that it doesn’t treat data as terra nullius, or nobody’s land. Statisticians tend to see data as the result of some data generating process, which we can never know directly but try to understand through the data. I’m oversimplifying here, and many statisticians care deeply about data, but at the same time, there are a lot of cases in statistics where the data kind of just appear; they belong to no one. But that’s just never the case.

Data have to be gathered, cleaned, and prepared, and these decisions matter (Huntington-Klein et al. 2021). I’ve come to believe that every dataset is sui generis, or in a class by itself, and so when you come to know one dataset really well, you just know one dataset, not all datasets.

During this data science camp, you’re going to be exposed to an awful lot of ‘science’. I’d like to spend a little more time in defence of ‘data’. And that’s another reason that I like my definition. I argue that the most important word in ‘data science’ is ‘data’ and would love to convince you of it.

I’ll take an example from Jordan (2019) where he talks about being in a medical office and being given some probability, based on prenatal screening, that his child, then a fetus, had Down syndrome. By way of background, you can decide to test to know for sure, but that test comes with the risk of the fetus not surviving, so this initial screening probability matters. Jordan (2019) describes how he found those probabilities were determined by a study done a decade earlier in the UK. The issue was that in the ensuing 10 years, imaging technology had improved. The test wasn’t expecting such high-resolution images, and so there had been a (false) increase in Down syndrome diagnoses once the images improved. There was no problem with the science here; it was the data that were the issue.

Tips for students

In my opinion, we are realising that it’s not just the ‘science’ bit that’s hard, it’s the ‘data’ bit as well. For instance, researchers went back and examined one of the most popular text datasets in computer science, and they found that around 30 per cent of the data were inappropriately duplicated (Bandy and Vincent 2021). There are a lot of people who could have told the computer scientists that those datasets would have big problems; there’s an entire field, linguistics, that specialises in this stuff. This is one of the dangers of any one field being hegemonic, and it’s why it’s important that you don’t specialise too early. Instead, pick at least two different areas, learn how the ‘insiders’ speak in each, and then be the person that translates between them.
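To give a feel for what checking a text dataset for duplication can look like, here is a minimal sketch in Python. It is not the method that Bandy and Vincent (2021) used; the function name `duplicate_share`, the rule that two texts match after lowercasing and collapsing whitespace, and the toy corpus are all assumptions made up for illustration.

```python
from collections import Counter


def duplicate_share(documents):
    """Return the fraction of documents that repeat an earlier document.

    Assumption for this sketch: two documents count as duplicates when
    they are identical after lowercasing and collapsing whitespace.
    """
    normalised = [" ".join(doc.lower().split()) for doc in documents]
    counts = Counter(normalised)
    # Every copy beyond the first occurrence of a text is a duplicate.
    extra_copies = sum(count - 1 for count in counts.values())
    return extra_copies / len(documents) if documents else 0.0


# Made-up toy corpus, purely for illustration.
toy_corpus = [
    "It was a dark and stormy night.",
    "it was a dark   and stormy night.",
    "Call me Ishmael.",
    "Call me Ishmael.",
    "A completely different book.",
]

print(duplicate_share(toy_corpus))
```

Even a crude check like this depends on a decision about what counts as ‘the same’ text, which is exactly the kind of judgement where knowing the data, and the people who study it, helps.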

Data science needs diversity. And it’s one reason that I’m excited to see you all here with your variety of skills and backgrounds: we need you in our research labs. We need your intelligence and enthusiasm. We need you to be in the room making contributions. And I think that, as it was for me, working directly with professors would be formative and enjoyable for you. I hope you’re lucky enough, like me, to be given the space to find what types of projects you’re intrinsically interested in, because then everything just becomes a lot easier.

How do you push down the door and work with professors? I was very lucky and it was a lot easier for me to get opportunities than it is for you, because there were fewer of us. But basically, what I think you should do is show a professor that you’ve got:

an interest; and
some skills.

One way to do this is through demonstrating your interest in coding and the data science skills that you’re developing at this camp. For instance, these days a lot of papers in the fields that I’m interested in need some type of R package to go alongside them, and it can be hard to find students who can do this. If a student emailed having made an R package, that would show a lot of interest and skills, and it would be clear how I could involve them. Another thing that is increasingly needed is a datasheet (Gebru et al. 2020). So you could put together documentation for a dataset and email that. It likewise shows that you’ve got a genuine interest and a base of skills that would enable you to be useful.

Thanks very much for letting me speak. I know that everyone says this to you, and with apologies to Olivia Rodrigo, but being 17 is just such an exciting time of one’s life, and I’m glad to see that you’re making the most of it. One seems to go straight from being the youngest in the room to being the oldest, without any middle. And, just so that you know, you kind of never work life out; everyone is just making it up as they go.

I’d be happy to take any questions.

Acknowledgments

Thank you very much to Monica for reading a draft of this.

References

Bandy, Jack, and Nicholas Vincent. 2021. “Addressing ‘Documentation Debt’ in Machine Learning Research: A Retrospective Datasheet for BookCorpus.” https://arxiv.org/abs/2105.05241.

Craiu, Radu V. 2019. “The Hiring Gambit: In Search of the Twofer Data Scientist.” Harvard Data Science Review 1 (1). https://doi.org/10.1162/99608f92.440445cb.

Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2020. “Datasheets for Datasets.” https://arxiv.org/abs/1803.09010.

Huntington-Klein, Nick, Andreu Arenas, Emily Beam, Marco Bertoni, Jeffrey R Bloem, Pralhad Burli, Naibin Chen, et al. 2021. “The Influence of Hidden Researcher Decisions in Applied Microeconomics.” Economic Inquiry.

Jordan, Michael I. 2019. “Artificial Intelligence—the Revolution Hasn’t Happened Yet.” Harvard Data Science Review 1 (1). https://doi.org/10.1162/99608f92.f06c6e61.

McElreath, Richard. 2020. Statistical Rethinking. CRC Press.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.