Review of ‘Data Science: A First Introduction’

A brief review of ‘Data Science: A First Introduction’ by Tiffany-Anne Timbers, Trevor Campbell, and Melissa Lee.

Published: November 30, 2021

“Data Science: A First Introduction” by Tiffany-Anne Timbers, Trevor Campbell, and Melissa Lee from the University of British Columbia’s Department of Statistics is the first-year data science textbook by which all others will be judged. It is a kind, yet rigorous, textbook that provides a foundation on which students can quickly answer exciting questions. One can imagine the authors saying to a new student ‘I’ve spent my life learning all this exciting stuff, and I can’t wait to share it with you’.

The book is effectively divided into three parts: the first four chapters focus on data; the next six chapters go through statistical methods; and the final three brief chapters introduce tools such as Jupyter, version control, and local installation.

Chapter 1 focuses on R and the tidyverse and is based around a dataset of languages spoken in Canada. It introduces key verbs, such as filter, select, arrange, and slice, as well as ggplot2. Chapter 2 covers how to get data into R, initially using CSV and TSV files, but then turning to SQL. This early focus on SQL (starting at p. 41) is a key feature of this book, and one that will pay dividends for students as they progress with subsequent courses. This chapter again uses the Canadian languages dataset. Chapter 3 focuses on cleaning and wrangling data within the tidy data framework. Key verbs including mutate, summarize, map, pivot_wider, and pivot_longer are introduced in this chapter, as is the base pipe operator |>. This decision to use the latest innovations is reflected in many other choices throughout the book. The chapter uses a dataset of Canadian city populations and the Canadian languages dataset. Chapter 4 focuses on data visualization with ggplot2, especially scatter plots, line plots, bar plots, histograms, and how to improve on the default plots. The chapter uses the Old Faithful eruptions dataset and Michelson’s speed of light dataset. The pages in this chapter that tell students how to explain a visualization (pp. 168-169) are an absolute treat.

Chapters 5 through to 10 focus on statistical methods. Chapters 5 and 6 focus on classification. Chapter 5 introduces K-nearest neighbors and uses it to introduce tidymodels and data preprocessing. Chapter 6 builds on this to cover training and test datasets, evaluation, and tuning, including lovely discussions of preprocessing and cross-validation. These chapters use the breast cancer dataset. There is a very nice discussion of what it means to be ‘good’ when it comes to accuracy on p. 224. Chapter 7 introduces regression using K-nearest neighbors and discusses underfitting and overfitting. It uses a real estate dataset from Sacramento. Chapter 8 turns to classical linear regression, both simple and multivariable. There is some discussion of multicollinearity and outliers as well as prediction. This chapter again uses the Sacramento real estate dataset. Chapter 9 focuses on clustering within the context of K-means, including thorough discussions of how it all works, and some of the limitations. It uses the Palmer Penguins dataset. Finally, Chapter 10 discusses statistical inference and sampling, including a lovely treatment of bootstrapping. It uses a dataset from Airbnb.

Chapters 11 through to 13 focus on tools. Chapter 11 introduces Jupyter including execution and markdown. Chapter 12 introduces version control and GitHub, covering not just commit, push, and pull, but also dealing with GitHub personal access tokens (PATs), cloning, and collaboration including merge conflicts. Finally, Chapter 13 briefly covers installing all this locally, on the assumption that the reader has, to this point, been able to use a cloud solution that was already set up for them.

To a certain extent data science textbooks are playing catch-up. We’ve long had methods texts in discipline-specific areas, but data science today is largely following demand from research and industry to move these methods into other areas: what discipline at a university covers real estate pricing? Into that void step Timbers, Campbell, and Lee, whose book is the standard by which all other introductory data science textbooks will be judged. It is a rare immediate addition to the pantheon of introductory data science textbooks released (or updated) in the past five years, taking its place alongside R4DS, ISLR, and Statistical Rethinking. It introduces data science in a way that specifies exactly what a student needs and anticipates many of their questions. Its content will represent table stakes in terms of what we expect of students after they take first-year classes.

I made it 10 per cent of the way through Timbers et al. before I learnt something new. Frankly I was surprised I made it so far. Data science pedagogy has been so disjointed, and so many of us are self-taught, that it is refreshing to have a classroom-tested textbook that is focused on workflows and reproducibility. The approaches are rigorous and opinionated, and the text is filled with kindness and warmth. It is the book that I wish I had when I first came to learn this material. The book is unashamedly focused on the newest innovations including tidymodels and the native pipe operator, and I soon found myself learning things, on average, at roughly one-thing-per-page, which was an exciting experience for someone who spends his days doing and teaching data science in R. This is a text that I can see myself coming back to regularly, not just in my teaching, but as a reference. I am hopeful that the authors will go on to write “Data Science” and “Advanced Data Science” without too much delay!

I mentioned kindness and warmth earlier, and they permeate the book. There are many sections that proactively address questions that new students often have. For instance, on p. 11, when a dataset is assigned to a name, the authors acknowledge how perplexing it can be to a student that nothing appears to happen. Pleasingly, the authors place reproducibility at the front and center of data science. For instance, the use of seeds is emphasised throughout the text, as is the importance of code that others can run.
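To make the point about seeds concrete, here is a minimal sketch. It is written in Python rather than the book's R (where `set.seed` plays this role), and the function name `draw_sample` and the seed values are hypothetical, chosen only for illustration. It shows that fixing the seed makes a "random" step repeatable, which is what lets someone else rerun the code and get the same result.

```python
import random


def draw_sample(population, k, seed):
    """Draw k items without replacement using a generator seeded with `seed`."""
    # A private generator keeps the example independent of global random state.
    rng = random.Random(seed)
    return rng.sample(population, k)


population = list(range(100))

# Two runs with the same (hypothetical) seed give the same sample.
first = draw_sample(population, 5, seed=2021)
second = draw_sample(population, 5, seed=2021)

# A different seed will usually, though not necessarily, give a different sample.
other = draw_sample(population, 5, seed=1)

print(first == second)
```

Without a fixed seed, each run would draw a different sample, and any number reported from it could not be checked by a reader rerunning the analysis.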

There is extensive use of ‘Notes’ to separate more reflective content. One can imagine many of these are the result of the countless hours the authors have spent teaching this material. More generally, all aspects are clearly explained and built up slowly. Chapter 3, which focuses on tidy data, is a particularly strong example of this. The text poses potential research questions throughout, which may help retain student interest.

While it’s clear that I think the book is great, there are a few areas where I wish they had expanded a little. For instance, on p. 3 Timbers et al. say ‘when you work with data, it is essential to think about how the data were collected, which affects the conclusions you can draw. If your data are biased, then your results will be biased.’ But there is not much coverage of this throughout the book. I wonder if subsequent editions of this book could consider more explicit coverage of ethics, especially around data collection, following, say, D’Ignazio and Klein’s Data Feminism, and also around applications of these methods, following, say, O’Neil’s Weapons of Math Destruction? Similarly, it may make sense to introduce sampling concepts alongside data collection.

A free version of the textbook is available here: https://ubc-dsci.github.io/introduction-to-datascience/ and it is forthcoming from CRC Press. When it’s available I’d recommend that everyone buy it and assign it in their classes.

Acknowledgments

I was provided with a PDF version of the book for the purposes of writing a blurb. There are no other financial disclosures. I printed and bound it at my own expense. At my invitation, Timbers has previously spoken at an event that I organized earlier this year, the Toronto Workshop on Reproducibility.