Opportunities Provided by Open Data and Reproducibility

Some thoughts on getting started with open data and reproducibility. A talk delivered at the University of Toronto, Stellar Stats Workshop, 28 May 2021, organised by Gwen Eadie & Josh Speagle.

Published: May 23, 2021

Introduction

Hi, my name is Rohan Alexander. Thank you very much to Gwen and Josh for the opportunity to speak. I’m a baby professor at the University of Toronto, in the Faculty of Information and the Department of Statistical Sciences. Today I would like to talk about taking some baby steps toward open data and reproducibility. And in particular the opportunities provided by open data and reproducibility across applied statistics and astronomy.

I’d like to talk about three benefits of openness and reproducibility:

  1. improved diversity broadly;
  2. improving your own (future) life; and
  3. building on-ramps for your collaborators.

And then three baby steps that you could get started with this afternoon:

  1. buy a bunch of notebooks;
  2. write a datasheet; and
  3. make a software package.

Three benefits

By way of background, M. Alexander (2019) says ‘research is reproducible if it can be reproduced exactly, given all the materials used in the study’. She goes on:

Reproducibility is not just publishing your analysis code. The entire workflow of a research project—from formulating hypotheses to dissemination of your results—has decisions and steps that ideally should be reproducible.

Open Data Institute (2017) defines open data as ‘data that’s available to everyone to access, use and share.’ Open data enables reproducibility and is the bedrock of progress.

Improved diversity

The first benefit that I see is improved diversity, broadly, in our fields. By making our work accessible to more people, we help to make education and research resources more equitable. In Canadian STEM fields diversity at an undergraduate level is not great, but it’s also not too bad—we reflect broader Canadian society to some extent—but it worsens significantly at each successive level (Villemure and Webb 2017). In my own Department of Statistical Sciences, University Professor Nancy Reid was the only tenured research-track woman in the department for decades. And in astronomy, speaking more broadly, Prescod-Weinstein (2021, 132) says ‘I’ve never met a Black woman professor in the field of theoretical cosmology because I am the first—not just the first professor, but also the first Black woman cosmology theory PhD’1.

Improving diversity is important because diverse teams are associated with higher performance (Hunt et al. 2020). That’s just a correlation, but, thinking about my own experiences, I do think that diversity in my team has improved my productivity. For instance, I’ve found that students with different experiences push back and criticize me about things that I haven’t considered. Addressing those points makes the work better.

As an aside, I also think that as Canada’s best university, and a public one at that, we have a duty to be more reflective of Canada. There’s a quote from one of my favourite books that I thought this audience would like, because it features an astronomer talking to a child, trying to convince him that the humanities are important. The astronomer says:

Any science is expensive and astronomy is more expensive than most…. If you need hundreds of millions of pounds sooner or later you are going to have to talk to people who don’t understand what you’re doing and don’t want to understand because they hated science at school.

DeWitt (2000, 396)

It’s entirely correct that these people are in charge, because that’s who we voted for. But it would be nice if they didn’t come to hate astronomy or statistics when they take it at university, and possibly part of preventing that is reflecting experiences other than just our own.

Improving your own (future) life

The second benefit is that ‘future-you’ will be helped. I think that most of us who write code to analyse data for a living have had the feeling of coming back to a project after six months and a lot of the time it’s literally like having to start a new project. The variables make no sense, and the code is unintelligible. The main question for me is always ‘why did I do that?’. And then I spend the afternoon re-coding and almost always end up back where I was.

If we’re spending eight hours a day writing code to analyse data, then almost anything that adds even one per cent to our productivity is worthwhile, let alone something that saves an afternoon. And that’s particularly the case with open data and reproducibility, because these benefits tend to compound over time and accrue not just to you but to other researchers. Talking about knowledge that only you have doesn’t make you knowledgeable, it makes you a crank, and that includes knowledge that only ‘past-you’ has.

Building on-ramps

The third benefit is that adopting open data and reproducibility can allow us to build better on-ramps for our collaborators. I love applied statistics, and I can’t quite believe that I get paid to do this job, and I really want more people to be able to work with me on my projects. These days I get to work with a lot of students, and I think the best way for me to make it easier for them to get up to speed is to adopt open data and reproducibility principles.

I don’t think of myself as a scary or intimidating person, but I’m regularly told in anonymous surveys that I am. New, especially undergraduate, collaborators may be hesitant to ask questions because they feel awkward, because they feel they should know the answer, or because they worry I’ll think they’re dumb. Some thoughts that may run through their heads when they watch me talk through some data analysis: “Why did he remove the 99s?”, “Maybe everyone does that?”, “Can I just look this up after he finishes talking?”, “Is he ever going to finish talking?”, “Oh no, now I missed it, what is he saying now?”. Open data and reproducibility mean that they do not have to rely on me, and I hope they make it easier for my collaborators to work with me.

Three baby steps

Right, so you’re convinced! You want these great benefits! How can you go about getting them? I’m not asking for anything big from you, and I’m not asking you to advocate, or even change much of anything that you do. (Indeed, if you’re faculty then I’m just asking you to spend some research funds and hire some undergrads!2)

Buy a bunch of notebooks

The first is to buy a notebook. (If you’re faculty, then go and buy a bunch for your team.) And then just write one dot point each time you write a code chunk. Let’s say you’re cleaning some text data, and you want to change all the instances of ‘Rohan’ to ‘Monica’. Just before you write the code that does that, or perhaps just after, write a simple dot point in that fancy notebook that you bought that explains what you did and why. That’s it. (Don’t worry too much about what you write initially – you’ll get better at it naturally over time.) At the end of the day, you’ll magically have a plain-English list of everything that was done to the dataset. That can easily be added to an appendix or added as comments and documentation alongside the code. Newton and da Vinci kept notebooks! And if that doesn’t convince you (and it shouldn’t, see: selection bias) the US NIH describes notebooks in science as ‘legal documents’ that support claims to patents, defend against allegations of fraud, and act as your scientific legacy (Ryan 2015).
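
To make the dot-point habit concrete, here is what the ‘Rohan’ to ‘Monica’ change above might look like, with the matching notebook entry reproduced as a comment. This is an illustrative Python sketch with invented data, not code from any actual project:

```python
# Notebook dot point: "Replaced every instance of 'Rohan' with 'Monica'
# in the raw text, because the names need to be swapped before analysis."
raw_lines = [
    "Rohan presented the results.",
    "Questions were directed to Rohan.",
]

cleaned_lines = [line.replace("Rohan", "Monica") for line in raw_lines]
print(cleaned_lines)
```

The point is not the code, which is trivial, but the pairing: every chunk like this gets one plain-English line in the notebook, and those lines later become comments, documentation, or an appendix almost for free.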

Write a datasheet

The second thing is writing a datasheet for your dataset. I had a quick look at some of the astronomy data repositories and there are some amazing things around. NASA came up immediately of course, but also a whole bunch associated with folks speaking at this workshop! I’ve never used astronomical datasets, but when I use political or economic datasets of a similar nature, I sometimes find that because I didn’t collect, clean, or prepare the dataset myself, it can be difficult to trust it. This is where datasheets come in (Gebru et al. 2020). Datasheets are basically nutrition labels for datasets. It’s really important to understand what you’re feeding your model, but plenty of researchers don’t have any idea. For instance, researchers recently went back and wrote a datasheet for one of the most popular datasets in computer science, and they found that around 30 per cent of the data were duplicated (Bandy and Vincent 2021)!

Instead of telling you how unhealthy various foods are, a datasheet tells you things like:

  • ‘Who created the dataset and on behalf of which entity?’
  • ‘Who funded the creation of the dataset?’
  • ‘Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?’
  • ‘Is any information missing from individual instances?’
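
One lightweight way to start is to keep the answers to those questions in a small structured file that lives alongside the dataset. Here is a minimal Python sketch: every field value below is an invented placeholder, and rendering to markdown is my own choice rather than anything prescribed by Gebru et al. (2020):

```python
# A minimal datasheet, following the question structure of Gebru et al. (2020).
# Every answer below is an invented placeholder -- substitute your own.
datasheet = {
    "Who created the dataset and on behalf of which entity?": "Jane Doe, Example Lab",
    "Who funded the creation of the dataset?": "Example Granting Agency",
    "Does the dataset contain all possible instances or is it a sample?": "A non-random sample of survey responses",
    "Is any information missing from individual instances?": "Yes, some records lack a date field",
}

# Render the question-and-answer pairs as a markdown document.
lines = ["# Datasheet", ""]
for question, answer in datasheet.items():
    lines.append(f"**{question}**")
    lines.append("")
    lines.append(answer)
    lines.append("")

markdown = "\n".join(lines)
print(markdown)
```

Saving the resulting string as, say, a `DATASHEET.md` in the data repository means the answers travel with the data rather than living in someone’s head.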

I’ve integrated datasheets into my teaching, and an example that a student wrote in Winter 2021 is Rosenthal (2021). I’m not saying that this isn’t helpful for the student who made it—I hope it was—but it’s especially helpful for me when I think about how best to use the dataset that he created.

The speaker after me, Renée Hlozek, has a bunch of papers documenting datasets, for instance LSST Dark Energy Science Collaboration et al. (2021), and another about data challenges (Hložek 2019) that I think would be interesting to combine with the idea of backfilling datasheets.

Make a software package

The third and final thing is to build out internal APIs for your code and then make them external. Now this may sound intimidating, but you can do it! (Or if you’re faculty, again, it’s something that you can have an undergraduate do.) My PhD is in economic history, and basically what that means is that I have a PhD in data gathering and cleaning. To my shame, I have written code that ‘downloads a PDF from a URL, saves it to my computer, pauses, and goes and gets another PDF from a very similar URL’ literally hundreds of times. And of course, no one should write the same code hundreds of times.

Last summer I found myself asking a student to go and write code to do that same task and I finally decided that enough was enough. Instead, I had them put together an R package. Now this was literally only the work of a week for them. So instead of everyone in the lab writing code each time they needed to download a PDF, they could just call this R package and use that instead: we wrote an internal API. I decided that I wanted to make it external facing and there’s now an R package—‘heapsofpapers’—that anyone can use (R. Alexander and Mahfouz 2021).
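
heapsofpapers itself is an R package, but the idea is small enough to sketch in a few lines of any language. Here is a hedged Python version of the ‘download, save, pause, repeat’ pattern; the function names and arguments are my own inventions, not the heapsofpapers API:

```python
import os
import time
import urllib.request


def filename_from_url(url):
    """Derive a filename from the last path segment of a URL."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def get_these_pdfs(urls, save_dir, pause_seconds=5):
    """Download each PDF in `urls` into `save_dir`, pausing between requests.

    Illustrative only: the real heapsofpapers package is written in R and
    has its own function names and arguments.
    """
    os.makedirs(save_dir, exist_ok=True)
    saved = []
    for url in urls:
        path = os.path.join(save_dir, filename_from_url(url))
        urllib.request.urlretrieve(url, path)  # fetch and save the PDF
        saved.append(path)
        time.sleep(pause_seconds)  # pause so we do not hammer the server
    return saved
```

Wrapping this up once and calling it everywhere is exactly the internal-API idea: the loop, the politeness pause, and the file naming live in one place instead of being rewritten by every person in the lab.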

If you want to do this and you use R then just follow ‘Chapter 2 - The Whole Game’ from Wickham and Bryan (2021). Within a few hours you can have a workable solution and within a month or two you can have a great external API.

And for those of you who are more Python focused, the speaker before me, Jo Bovy, has all these great Python packages on his GitHub, including a whole course on writing Python packages (Bovy 2021).

Making internal APIs external is one reason that Amazon is worth a trillion dollars (Benzell, Lagarda, and Van Alstyne 2017), and while I can’t promise that it’ll make you a millionaire, I can promise that you’ll get a paper out of it, and more importantly, help your field.

Concluding remarks

I think that open data and reproducibility can be intimidating. I have a two-year-old and I imagine that when he first started to walk he was pretty intimidated also. But he did take his first steps and now he runs around the whole day. And I assure you that it’s the same for open data and reproducibility.

I’m looking forward to diving further into the work of the other speakers at this workshop, and I thank Gwen and Josh for bringing these two communities together and for letting me speak today.

Acknowledgments

Thank you very much to Monica for reading a draft of this. If you want something to watch after reading this then I recommend M. Alexander (2021).

References

Alexander, Monica. 2019. Reproducibility in Demographic Research. Max Planck Institute for Demographic Research: Talk given at Open Science Workshop. https://www.monicaalexander.com/posts/2019-10-20-reproducibility/.
———. 2021. Getting Started with Sharing Code. https://youtu.be/yvM2C6aZ94k.
Alexander, Rohan, and A Mahfouz. 2021. “Heapsofpapers.” https://rohanalexander.github.io/heapsofpapers/.
Bandy, Jack, and Nicholas Vincent. 2021. “Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus.” https://arxiv.org/abs/2105.05241.
Benzell, Seth, Guillermo Lagarda, and Marshall W Van Alstyne. 2017. “The Impact of APIs on Firm Performance.” Boston University Questrom School of Business Research Paper, no. 2843326.
Bovy, Jo. 2021. Python Code Packaging for Scientific Software. https://pythonpackaging.info.
DeWitt, Helen. 2000. The Last Samurai. Talk Miramax Books.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2020. “Datasheets for Datasets.” https://arxiv.org/abs/1803.09010.
Hložek, Renée. 2019. “Data Challenges as a Tool for Time-Domain Astronomy.” Publications of the Astronomical Society of the Pacific 131 (1005): 118001. https://doi.org/10.1088/1538-3873/ab311d.
Hunt, Vivian, Sundiatu Dixon-Fyle, Sara Prince, and Kevin Dolan. 2020. Diversity Wins: How Inclusion Matters. McKinsey & Company. https://www.mckinsey.com/featured-insights/diversity-and-inclusion/diversity-wins-how-inclusion-matters.
LSST Dark Energy Science Collaboration, Bela Abolfathi, Robert Armstrong, Humna Awan, Yadu N. Babuji, Franz Erik Bauer, George Beckett, et al. 2021. “DESC DC2 Data Release Note.” https://arxiv.org/abs/2101.04855.
Open Data Institute. 2017. What Is ‘Open Data’ and Why Should We Care? https://theodi.org/article/what-is-open-data-and-why-should-we-care/.
Prescod-Weinstein, Chanda. 2021. The Disordered Cosmos. Bold Type Books.
Rosenthal, Thomas William. 2021. “Datasheet for the COFFEE_COFFEE_COFFEE Dataset, Version 0.1.” https://github.com/mrpotatocode/COFFEE_COFFEE_COFFEE/blob/main/journal/Week8/DataSheet-0.1.md.
Ryan, Philip. 2015. Keeping a Lab Notebook. NIH. https://youtu.be/-MAIuaOL64I.
Villemure, Serge, and Anne Webb. 2017. Strengthening Research Excellence Through Equity, Diversity and Inclusion. NSERC Presentation. https://www.nserc-crsng.gc.ca/_doc/EDI/EDIpresentation_EN.pdf.
Wickham, Hadley, and Jenny Bryan. 2021. R Packages. https://r-pkgs.org/.

Footnotes

  1. Thanks very much to Vianey Leos Barajas for gifting me this book.↩︎

  2. Statistical Sciences has something on the order of 4,000 undergrads, which is quite a lot, but the great thing about it is that the top 5 per cent are just incredibly strong. You could hire a few of them for literally $25 an hour, and then just get them to do these three things! Possibly that hiring would even add to the diversity of your team.↩︎