STA4101: Topics in Statistical Data Science
Last updated: 2025-11-24
Essentials
- Term: Fall 2025
- Day/Time: Monday, 2-5pm.
- Location: HA Building #024, Room 410
- Instructor: Rohan Alexander
- Contact: rohan.alexander@utoronto.ca
- Teaching Assistant: None
- Office: Room 630, 140 St George St, Toronto
- Office Hours: I am available after class; I am also in DoSS and able to meet most Tuesdays.
Overview
This course covers current topics in data science from a statistical perspective. I assume that you have some statistical analysis skills from other classes (for instance, that you've learned about linear regression). I emphasise practical aspects of data science, such as tools, workflows, reproducibility and communication. The main objective of this class is to provide you with a modern, end-to-end, reproducible data-science workflow, within which those analysis skills will sit.
Part 1 ensures that everyone has a common foundation of technical skills. These include: version control, reproducible environments, efficient data formats, Python (cleaning data with polars, estimating models with scikit-learn, and managing packages with uv), Quarto, database connections, using REST APIs, and implementing GitHub Actions.
Part 2 focuses on statistical workflow. Topics include writing and editing papers; gathering and validating data; data management; and model interpretation.
Part 3 integrates LLMs: using code assistants responsibly, running jobs on shared compute, making useful benchmarks, and deploying LLM-based tasks.
Learning objectives
The purpose of the course is to develop the skills needed to use data to tell stories using modern tools, with a strong statistical focus. By the end of the course, you should be able to:
- set up and use reproducible pipelines;
- manage real datasets;
- communicate effectively, both in writing and orally;
- create maintainable, production-ready analyses; and
- apply LLMs in data science appropriately.
Content
Part 1: Technical skills
Week 1 "Reproducible and trustworthy workflows for data science"
Learning outcomes:
- think seriously about workflow;
- know how and why to pull, commit, add, and push using Git and GitHub;
- be comfortable making and receiving Pull Requests, and creating and closing Issues using GitHub to work as a team;
- know how to set up and use a Python workspace in VS Code with uv;
- know how to import CSV and Parquet files and inspect them; build and transform datasets with polars; and
- know how to make presentations with Quarto.
Material:
- Deffner, Dominik, Natalia Fedorova, Jeffrey Andrews, and Richard McElreath, 2024, "Bridging theory and data: A computational workflow for cultural evolution", PNAS, 10.1073/pnas.2322887121.
- Zogheib, Ciara, "Data Practices of Cross-Domain Integration: Draw-and-Write Interviews with Interdisciplinary Scientists".
- Timbers, Tiffany A., Joel Ostblom, Florencia D'Andrea, Rodolfo Lourenzutti, and Daniel Chen, 2025, Reproducible and Trustworthy Workflows for Data Science (the chapters on Version Control and Project Environments).
- uv docs (especially the installation, first steps, installing Python, running scripts, working on projects sections).
- Parquet docs.
Class:
- Lecture:
- Workflow (20 min)
- Version control (20 min)
- Demonstrate:
- Pull, commit, add, and push using Git and GitHub (10 min)
- Making and receiving PRs and Issues, and resolving Git conflicts (10 min)
- Downloading uv, installing Python and core packages, creating project-specific environments, and linting (20 min)
- Saving and reading Parquet and CSVs (10 min)
- Cleaning and summarizing some data with polars (20 min; a sketch follows below)
- Installing Quarto and creating slides (10 min)
- Worksheet (40 min).
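The following is a minimal sketch of the polars step demonstrated above: read a CSV, clean it, summarize it, and write Parquet. The file and column names ("raw_survey.csv", "age", "province") are hypothetical.

```python
# Minimal polars sketch: read a CSV, clean it, and write Parquet.
# File and column names are hypothetical.
import polars as pl

raw = pl.read_csv("raw_survey.csv")

clean = (
    raw
    .drop_nulls(subset=["age"])                      # drop rows with missing age
    .filter(pl.col("age").is_between(18, 100))       # keep plausible ages
    .with_columns(pl.col("province").str.to_uppercase())
)

# Summarize by group and save in an efficient columnar format.
summary = clean.group_by("province").agg(pl.col("age").mean().alias("mean_age"))
clean.write_parquet("clean_survey.parquet")
print(summary)
```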
Week 2 "Using Python for data science"
Learning outcomes:
- know how to generate synthetic data from common probability distributions;
- know how to produce clear graphs and tables;
- know how to apply scikit-learn (ridge, lasso, logistic, Poisson, random forests) with bootstrap and cross-validation, and present results in Quarto with cross-references.
Material:
- Tufte, Edward, 2001, The Visual Display of Quantitative Information, Graphics Press, Chapter 1.
- Cleveland, William, 1994, The Elements of Graphing Data, Hobart Press, Chapter 2.
- Healy, Kieran, 2019, Data Visualization: A Practical Introduction, Princeton University Press, Chapter 1.
- Alexander, Rohan, 2023, Telling Stories with Data, Chapter 5.
- McKinney, Wes, Python for Data Analysis, 3rd edition.
- Efron, Bradley, and Trevor Hastie, 2016, Computer Age Statistical Inference, Cambridge University Press.
Class:
- Lecture:
- Effective graphs (40 min)
- Demonstrate:
- Generating synthetic datasets by sampling from normal, uniform, and Poisson distributions (20 min)
- Making effective graphs and tables (20 min)
- Fitting ridge, lasso, logistic, Poisson, and random forests with scikit-learn (30 min; a lasso sketch follows below)
- Implementing bootstrap and cross-validation (20 min)
- Presenting results in Quarto with cross-references (15 min)
- Worksheet (60 min).
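The following is a minimal sketch of simulating data and fitting a penalized model with scikit-learn, including cross-validation and a bootstrap. The sample size, coefficients, and number of bootstrap replicates are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(853)

# Simulate data: five standard-normal predictors and a sparse set of true coefficients.
n = 1_000
x = rng.normal(size=(n, 5))
beta = np.array([2.0, 0.0, -1.5, 0.0, 0.5])
y = x @ beta + rng.normal(size=n)

# Choose the lasso penalty by five-fold cross-validation.
cv_fit = LassoCV(cv=5).fit(x, y)
print("Selected penalty:", cv_fit.alpha_)
print("Estimated coefficients:", cv_fit.coef_)

# Nonparametric bootstrap of the first coefficient, holding the penalty fixed.
boot = [
    Lasso(alpha=cv_fit.alpha_).fit(x[idx], y[idx]).coef_[0]
    for idx in (rng.integers(0, n, size=n) for _ in range(200))
]
print("Bootstrap standard error of the first coefficient:", np.std(boot))
```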
Week 3 "DevOps for data science"
Learning outcomes:
- know how and why to create reproducible, portable environments;
- know how to obtain data through secure data access arrangements;
- understand efficient data engineering;
- know how to ensure observability and quality; and
- understand how to put a workflow into production with GitHub Actions.
Material:
- Gold, Alex, 2024, DevOps for Data Science, Chs 1--6.
- De Angelis, Inessa, 2026, "Using GitHub Actions for Communication Research: Applications with Bluesky Data".
Class:
- Lecture:
- Why reproducible, portable environments matter (15 min)
- Secure data access (15 min)
- Observability and quality checks (15 min)
- Demonstrate:
- Creating and using a Docker instance (15 min)
- Connecting to an API using environment variables for credentials (15 min)
- Efficient data engineering (15 min)
- Adding assertions to test joins, transformations, and model outputs (15 min; a sketch follows below)
- Writing a GitHub Actions workflow (20 min)
- Worksheet:
- Build a reproducible workflow that loads data securely, applies transformations with checks, and runs automatically with GitHub Actions (40 min)
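The following is a minimal sketch of the secure-access and quality-check pieces of this worksheet: pull data from an API using a credential stored in an environment variable, then assert some basic checks before saving. The URL, token name, and expected columns are hypothetical.

```python
# Minimal sketch: download data with a credential from an environment variable,
# then fail loudly if the data do not look as expected.
import os

import polars as pl
import requests

token = os.environ["EXAMPLE_API_TOKEN"]               # never hard-code credentials
resp = requests.get(
    "https://api.example.com/v1/records",             # hypothetical endpoint
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()

data = pl.DataFrame(resp.json()["records"])

# Lightweight observability: basic checks before anything downstream runs.
assert data.height > 0, "API returned no rows"
assert {"id", "value"} <= set(data.columns), "expected columns are missing"
assert data["id"].is_duplicated().sum() == 0, "duplicate ids after download"

data.write_parquet("records.parquet")
```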
Week 4
- In-class exam I.
Part 2: Statistical workflow
Week 5 "Writing and editing"
Learning outcomes:
- understand that writing is a critical skill, perhaps the most important of all those required to analyze data;
- know how to focus on one main message that we want to communicate to the reader;
- appreciate the value of being able to get to a first draft as quickly as possible; and
- know how to rewrite.
Material:
- Alexander, Rohan, 2023, Telling Stories with Data, Chapter 4.
- Caro, Robert, 2019, Working, pp.141--158.
- King, Stephen, 2000, On Writing, pp.111--137.
- Zinsser, William, 1976, On Writing Well, pp.6--32 and pp.169--177.
- King, Gary, 2006, "Publication, Publication", PS: Political Science & Politics, 39 (1): 119--25, 10.1017/S1049096506060252.
Class:
- Lecture:
- Features of good writing in the following categories: Title, Abstract, Introduction, Data, Model, Results, Discussion (40 min)
- Demonstrate:
- Provide examples of writing in the categories and show how to improve them (30 min)
- Worksheet:
- Provide three examples of analysis and have students draft the papers (30 min)
- Provide three examples of drafts and have students edit them (30 min)
- Make a plan, based on G. King (2006), for how you will write a meaningful paper by the end of this class. Detail three journals/conferences, in order, that you will submit it to, and why the paper would be a good fit at each (10 min)
Week 6 "Data gathering and validation"
Learning outcomes:
- think seriously about measurement;
- understand basic sampling concepts;
- know how to use Python to scrape websites and gather data from APIs;
- know how to establish tests for data;
- be aware of some common data issues.
Material:
- Alexander, Rohan, 2023, Telling Stories with Data, Ch 6.
- Statistics Canada, 2023, "Guide to the Census of Population, 2021", Ch 9, "Data quality evaluation".
- Bowley, Arthur Lyon, 1913, "Working-Class Households in Reading", Journal of the Royal Statistical Society 76 (7): 672--701, 10.2307/2339708.
- Neyman, Jerzy, 1934, "On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection", Journal of the Royal Statistical Society, 97 (4): 558--625, 10.2307/2342192.
- Quartz, "Bad Data Guide".
- Radcliffe, Nicholas J., 2026, Test-Driven Data Analysis, Chs 1--2.
Class:
- Lecture:
- Tea-tasting (15 min)
- Properties of measurements (10 min)
- Measurement error (10 min)
- Missing data (10 min)
- Data errors (15 min)
- Sampling (30 min)
- Demonstrate:
- Web scraping (30 min)
- Using an API (30 min)
- Writing data tests (30 min)
- Worksheet (30 min):
- Develop simulated examples of three errors from the Quartz Bad Data Guide. Then develop a test for them using pydantic (Python) or pointblank (R).
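As a starting point for the pydantic option, the following is a minimal sketch of a row-level data test. It assumes pydantic v2 and uses a hypothetical survey schema; the second row contains a simulated error.

```python
# Minimal sketch of a row-level data test with pydantic (v2).
from pydantic import BaseModel, ValidationError, field_validator

class Respondent(BaseModel):
    respondent_id: int
    age: int
    province: str

    @field_validator("age")
    @classmethod
    def age_in_range(cls, value: int) -> int:
        if not 0 <= value <= 120:
            raise ValueError("age must be between 0 and 120")
        return value

rows = [
    {"respondent_id": 1, "age": 34, "province": "ON"},
    {"respondent_id": 2, "age": -3, "province": "BC"},   # simulated error
]

for row in rows:
    try:
        Respondent(**row)
    except ValidationError as err:
        print(f"Row {row['respondent_id']} failed validation:\n{err}")
```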
Week 7 "Data management and model interpretation"
Learning outcomes:
- Know how to manage data including filenames, variable names, codebooks, folder organization and documentation.
- Know how to communicate model estimates for common models including linear, logistic, Poisson.
- Know how to build DAGs.
Material:
- Lewis, Crystal, 2024, Data Management in Large-Scale Education Research, https://datamgmtinedresearch.com, Chapters 3, 4, 5, 8.3-8.5, and 9.
- Arel-Bundock, Vincent, 2025, Model to Meaning: How to Interpret Statistical Models with marginaleffects for R and Python, https://marginaleffects.com, Chapter 15.
Class:
- Lecture:
- Data organization (15 min)
- Human subjects data (15 min)
- Data management plan (15 min)
- Documentation (15 min)
- Style guide (15 min)
- Demonstrate:
- Creating marginal effects tables and graphs for linear, logistic, and Poisson models (30 min; a sketch follows below)
- Worksheet:
- Provide output for three different models (linear, logistic, and Poisson) and have students write the results section (30 min)
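The following is a minimal sketch of communicating logistic-regression estimates on the probability scale. It uses simulated data and statsmodels' built-in average marginal effects as one option alongside the marginaleffects package discussed in the reading; the coefficients are arbitrary.

```python
# Minimal sketch: fit a logistic regression and report average marginal effects,
# which are easier to communicate than log-odds. Data are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(853)
n = 1_000
x = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(0.5 + 1.0 * x[:, 0] - 0.8 * x[:, 1])))
y = rng.binomial(1, p)

model = sm.Logit(y, sm.add_constant(x)).fit()
print(model.summary())

# Average change in P(y = 1) for a small change in each predictor.
print(model.get_margeff(at="overall").summary())
```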
Week 8
- In-class exam II
Part 3: Integrating LLMs
Week 9 "Foundation models"
Learning outcomes:
- Understand what a foundation model is and how to use one in an application.
Material:
- Huyen, Chip, 2025, AI Engineering: Building Applications with Foundation Models, Chs 1 and 2.
Class:
- Lecture:
- Building applications (60 min)
- Foundation models (60 min)
- Demonstrate:
- Creating an LLM-based application (40 min; a sketch follows below)
- Worksheet (30 min)
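The following is a minimal sketch of an LLM-based application: a function that classifies the sentiment of a review by calling a hosted model. It assumes the openai Python client with an API key in the OPENAI_API_KEY environment variable; the model name is illustrative and may need updating.

```python
# Minimal sketch of an LLM-based application: sentiment classification.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sentiment(review: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "Classify the review as positive, negative, or neutral. "
                        "Reply with one word."},
            {"role": "user", "content": review},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify_sentiment("The package arrived late and the box was damaged."))
```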
Week 10 "Evaluation"
Learning outcomes:
- Know how to evaluate foundation-model-based applications.
- Know what a benchmark is and why it is important.
- Know how to build a benchmark.
Material:
- Huyen, Chip, 2025, AI Engineering: Building Applications with Foundation Models, Chs 3 and 4.
- Reuel, Anka, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel J. Kochenderfer, 2024, "BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices", https://arxiv.org/abs/2411.12990.
- Hughes, Evelyn, and Rohan Alexander, "Autonomous end-to-end data analysis with LLMs".
Class:
- Lecture:
- Evaluation methodology (60 min)
- Evaluating AI systems (60 min)
- Demonstrate:
- Build a benchmark, as a class (60 min; a starting sketch follows below)
- Worksheet (30 min)
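The following is a minimal sketch of the structure of a benchmark: items with known answers, a model call, and a scoring rule. The items are toy examples and ask_model() is a placeholder to be replaced with a real model call (for instance, the application from Week 9).

```python
# Minimal sketch of a tiny benchmark scored by exact match.
benchmark = [
    {"question": "What is the median of 1, 3, 9?", "answer": "3"},
    {"question": "Is a p-value the probability that the null is true? yes/no", "answer": "no"},
    {"question": "How many parameters does y = a + bx have?", "answer": "2"},
]

def ask_model(question: str) -> str:
    # Placeholder: replace with a real LLM call.
    return "3"

def evaluate(items: list[dict]) -> float:
    correct = sum(
        ask_model(item["question"]).strip().lower() == item["answer"]
        for item in items
    )
    return correct / len(items)

print(f"Accuracy: {evaluate(benchmark):.2f}")
```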
Week 11 "Prompts and agents"
Learning outcomes:
- Know how to deploy an LLM-based application.
- Know how to use an agent such as Claude Code or Amp.
- Have principles for editing code.
Material:
- Huyen, Chip, 2025, AI Engineering: Building Applications with Foundation Models, Ch 5.
Class:
- Lecture:
- Prompt engineering (60 min)
- Demonstrate:
- Principles of editing (40 min)
- Setting up Claude Code (10 min)
- Using Claude Code (10 min)
- Worksheet: Provide five examples of code and have students edit them.
Week 12
- In-class exam III
Assessment
In-class exams
- Due dates: In class during Weeks 4, 8, and 12.
- Weight: 30 per cent each for the best two exams (the worst result is dropped).
- Task: Write a test under exam conditions. Questions are based on content from the corresponding part of the course.
Final paper
- Due date: Exam block.
- Weight: 40 per cent.
- Task: Write an original paper on a topic covered in the class. The paper should be of submittable quality and there should be a clear path to submission. Some ideas include:
- Develop a benchmark in a particular area of interest and then evaluate an LLM's performance.
- Identify a dataset of interest from the list here.