Remarks at Trinity College

Introductory remarks delivered at Trinity College High Table panel on data science at the University of Toronto on 19 October 2021.

Author
Published

October 19, 2021

The slides are here.

Introduction

Hi, my name is Rohan Alexander. I’m an assistant professor in the Faculty of Information and the Department of Statistical Sciences at the University of Toronto. I’d like to thank Academic Don, Statistics, Manmeet Sangha for the invitation to talk today at Trinity College.

When I asked the undergraduate students that I work with who was associated with ‘Trin’, it was no surprise to me that the college was well represented among the strongest ones. ‘Trinnies’ have a well-deserved reputation for academic excellence and the number of people that have shown up for this event speaks to how that tradition continues even under trying circumstances.

I’ve only got 5 minutes, so I’d like to touch on just two aspects:

  • An example of what I do in data science.
  • Some steps to get started in data science.

What I do in data science

There is no agreed definition of data science, but an informal one is: ‘humans measuring stuff, typically related to other humans, and using sophisticated averaging to explain and predict’.

I like that definition because it does not treat data as terra nullius, or nobody’s land. Data must be gathered, cleaned, and prepared, and these decisions matter. Every dataset is sui generis, or a class by itself.

A lot of what my research does is try to understand the specific nuances of specific datasets, especially in terms of who is included in them, and who is not, and understand how this affects inference.

One example of this is natural language processing. There we use really big models, built on a lot of data, to generate text. To make sure everyone knows what that means, here’s a model that has two parameters: \(y = x1 + x2\). And is trained on 5 rows of data.

OpenAI’s GPT-3 has 175 billion parameters and was trained on essentially the whole internet. I’ll do a quick demonstration of asking it: “What is Trinity College, Toronto?”, “Which famous person went to Trinity College, Toronto?”

[Live demo]

GPT-3, like any language model, can generate racist/sexist text. With a student I recently used it to see whether its ability to generate racist/sexist text, also meant that it could recognize racist/sexist text and explain what made it racist/sexist. The results were disappointing and lead to a variety of conclusions, including the fact that we need to do a better job of putting together the enormous datasets that underpin these models.

Some steps to get started in data science

One way to contribute in data science is to be at the intersection of two areas. Typically statistics and something else. Then you need to immerse yourself in the data. One of the most successful ‘Trinnies’ - Michael Ignatieff (IG-NAH-TEE-UHF) - says about politics that

The grind of politics, the endless travel, the meetings, the impossible schedule, the constant being on show are all in search of an authority that can be acquired in no other way. You have to learn the country.

And it’s the same in data science - you need to become an expert in some dataset by doing the grinding, endless, impossible schlep to understand it. Only then can you even begin to consider inference and prediction.

A lot of what you do in class is play house; when what you need to do is actually build something. One thing that helped me in this regard was getting to work with professors.

So how do you get to work with a professor? There’s an easy answer, which I expect is relevant for many ‘Trinnies’, and that’s to just to get an A+ in their class. But more generally, you should send an email that demonstrates that you’ve got:

  • an interest; and
  • some skills.

For instance, these days a lot of papers in the fields that I’m interested in need some type of R package to go alongside them, and it can be hard to find students who can do this. If a student emailed having made an R package, that would demonstrate a lot of interest and skills, and it would be clear how I could involve them. There’s an enormous difference between a student who emails claiming to know R, and a student who emails a link to a GitHub repo that demonstrates it.

Concluding remarks

Thank you very much to Manmeet for the invitation to be on this panel. It’s a great honour to be here, alongside my colleagues - Sam, Emma, and Morris - and one of my bosses - Helen! Very much looking forward to the discussion.