STA4101: Topics in Statistical Data Science
Last updated: 2025-11-24
Essentials
- Term: Fall 2025
- Day/Time: Monday, 2-5pm.
- Location: HA Building #024, Room 410
- Instructor: Rohan Alexander
- Contact: rohan.alexander@utoronto.ca
- Teaching Assistant: None
- Office: Room 630, 140 St George St, Toronto
- Office Hours: I am available after class; I am also in DoSS and able to meet most Tuesdays.
Overview
This course covers current topics in data science from a statistical perspective. I assume that you have some statistical analysis skills from other classes (for instance, that you've learned about linear regression). I emphasise practical aspects of data science, such as tools, workflows, reproducibility and communication. The main objective of this class is to provide you with a modern, end-to-end, reproducible data-science workflow, within which those analysis skills will sit.
Part 1 ensures that everyone has a common foundation of technical skills. These include: version control, reproducible environments, efficient data formats, Python (cleaning data with polars, estimating models with scikit-learn, and managing packages with uv), Quarto, database connections, using REST APIs, and implementing GitHub Actions.
Part 2 focuses on statistical workflow. Topics include writing and editing papers; gathering and validating data; data management; and model interpretation.
Part 3 integrates LLMs: using code assistants responsibly, running jobs on shared compute, making useful benchmarks, and deploying LLM-based tasks.
Learning objectives
The purpose of the course is to develop the skills needed to use data to tell stories using modern tools, with a strong statistical focus. By the end of the course, you should be able to:
- set up and use reproducible pipelines;
- manage real datasets;
- communicate effectively, both in writing and orally;
- create maintainable, production-ready analyses; and
- apply LLMs in data science appropriately.
Content
Part 1: Technical skills
Week 1 "Reproducible and trustworthy workflows for data science"
Learning outcomes:
- think seriously about workflow;
- know how and why to pull, commit, add, and push using Git and GitHub;
- be comfortable making and receiving Pull Requests, and creating and closing Issues using GitHub to work as a team;
- know how to set up and use a Python workspace in VS Code with uv;
- know how to import CSV and Parquet files and inspect them; build and transform datasets with polars; and
- know how to make presentations with Quarto.
Material:
- Deffner, Dominik, Natalia Fedorova, Jeffrey Andrews, and Richard McElreath, 2024, "Bridging theory and data: A computational workflow for cultural evolution", PNAS, 10.1073/pnas.2322887121.
- Zogheib, Ciara, "Data Practices of Cross-Domain Integration: Draw-and-Write Interviews with Interdisciplinary Scientists".
- Timbers, Tiffany A., Joel Ostblom, Florencia D'Andrea, Rodolfo Lourenzutti, and Daniel Chen, 2025, Reproducible and Trustworthy Workflows for Data Science (the chapters on Version Control and Project Environments).
- uv docs (especially the installation, first steps, installing Python, running scripts, working on projects sections).
- Parquet docs.
Class:
- Lecture:
- Workflow (20 min)
- Version control (20 min)
- Demonstrate:
- Pull, commit, add, and push using Git and GitHub (10 min)
- Making and receiving PRs and Issues, and resolving Git conflicts (10 min)
- Downloading uv, installing Python and core packages, creating project-specific environments, and linting (20 min)
- Saving and reading Parquet and CSVs (10 min)
- Cleaning and summarizing some data with polars (20 min; a sketch follows below)
- Installing Quarto and creating slides (10 min)
- Worksheet (40 min).
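The following is a minimal sketch of the polars step demonstrated above: read a CSV, clean it, summarize it, and write Parquet. The file and column names ("raw_survey.csv", "age", "province") are hypothetical.

```python
# Minimal polars sketch: read a CSV, clean it, and write Parquet.
# File and column names are hypothetical.
import polars as pl

raw = pl.read_csv("raw_survey.csv")

clean = (
    raw
    .drop_nulls(subset=["age"])                      # drop rows with missing age
    .filter(pl.col("age").is_between(18, 100))       # keep plausible ages
    .with_columns(pl.col("province").str.to_uppercase())
)

# Summarize by group and save in an efficient columnar format.
summary = clean.group_by("province").agg(pl.col("age").mean().alias("mean_age"))
clean.write_parquet("clean_survey.parquet")
print(summary)
```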
Week 2 "Using Python for data science"
Learning outcomes:
- know how to generate synthetic data from common probability distributions;
- know how to produce clear graphs and tables;
- know how to apply scikit-learn (ridge, lasso, logistic, Poisson, random forests) with bootstrap and cross-validation, and present results in Quarto with cross-references.
Material:
- Tufte, Edward, 2001, The Visual Display of Quantitative Information, Graphics Press, Chapter 1.
- Cleveland, William, 1994, The Elements of Graphing Data, Hobart Press, Chapter 2.
- Healy, Kieran, 2019, Data Visualization: A Practical Introduction, Princeton University Press, Chapter 1.
- Alexander, Rohan, 2023, Telling Stories with Data, Chapter 5.
- McKinney, Wes, Python for Data Analysis, 3rd edition.
- Efron, Bradley, and Trevor Hastie, 2016, Computer Age Statistical Inference, Cambridge University Press.
Class:
- Lecture:
- Effective graphs (40 min)
- Demonstrate:
- Generating synthetic datasets by sampling from normal, uniform, and Poisson distributions (20 min)
- Making effective graphs and tables (20 min)
- Fitting ridge, lasso, logistic, Poisson, and random forests with scikit-learn (30 min; a lasso sketch follows below)
- Implementing bootstrap and cross-validation (20 min)
- Presenting results in Quarto with cross-references (15 min)
- Worksheet (60 min).
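The following is a minimal sketch of simulating data and fitting a penalized model with scikit-learn, including cross-validation and a bootstrap. The sample size, coefficients, and number of bootstrap replicates are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(853)

# Simulate data: five standard-normal predictors and a sparse set of true coefficients.
n = 1_000
x = rng.normal(size=(n, 5))
beta = np.array([2.0, 0.0, -1.5, 0.0, 0.5])
y = x @ beta + rng.normal(size=n)

# Choose the lasso penalty by five-fold cross-validation.
cv_fit = LassoCV(cv=5).fit(x, y)
print("Selected penalty:", cv_fit.alpha_)
print("Estimated coefficients:", cv_fit.coef_)

# Nonparametric bootstrap of the first coefficient, holding the penalty fixed.
boot = [
    Lasso(alpha=cv_fit.alpha_).fit(x[idx], y[idx]).coef_[0]
    for idx in (rng.integers(0, n, size=n) for _ in range(200))
]
print("Bootstrap standard error of the first coefficient:", np.std(boot))
```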
Week 3 "DevOps for data science"
Learning outcomes:
- know how and why to create reproducible, portable environments;
- know how to obtain data through secure data access arrangements;
- understand efficient data engineering;
- know how to ensure observability and quality; and
- understand how to put a workflow into production with GitHub Actions.
Material:
- Gold, Alex, 2024, DevOps for Data Science, Chs 1--6.
- De Angelis, Inessa, 2026, "Using GitHub Actions for Communication Research: Applications with Bluesky Data".
Class:
- Lecture:
- Why reproducible, portable environments matter (15 min)
- Secure data access (15 min)
- Observability and quality checks (15 min)
- Demonstrate:
- Creating and using a Docker instance (15 min)
- Connecting to an API using environment variables for credentials (15 min)
- Efficient data engineering (15 min)
- Adding assertions to test joins, transformations, and model outputs (15 min; a sketch follows below)
- Writing a GitHub Actions workflow (20 min)
- Worksheet:
- Build a reproducible workflow that loads data securely, applies transformations with checks, and runs automatically with GitHub Actions (40 min)
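The following is a minimal sketch of the secure-access and quality-check pieces of this worksheet: pull data from an API using a credential stored in an environment variable, then assert some basic checks before saving. The URL, token name, and expected columns are hypothetical.

```python
# Minimal sketch: download data with a credential from an environment variable,
# then fail loudly if the data do not look as expected.
import os

import polars as pl
import requests

token = os.environ["EXAMPLE_API_TOKEN"]               # never hard-code credentials
resp = requests.get(
    "https://api.example.com/v1/records",             # hypothetical endpoint
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()

data = pl.DataFrame(resp.json()["records"])

# Lightweight observability: basic checks before anything downstream runs.
assert data.height > 0, "API returned no rows"
assert {"id", "value"} <= set(data.columns), "expected columns are missing"
assert data["id"].is_duplicated().sum() == 0, "duplicate ids after download"

data.write_parquet("records.parquet")
```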
Week 4
- In-class exam I.
Part 2: Statistical workflow
Week 5 "Writing and editing"
Learning outcomes:
- understand that writing is a critical skill, perhaps the most important of all those required to analyze data;
- know how to focus on one main message that we want to communicate to the reader;
- appreciate the value of being able to get to a first draft as quickly as possible; and
- know how to rewrite.
Material:
- Alexander, Rohan, 2023, Telling Stories with Data, Chapter 4.
- Caro, Robert, 2019, Working, pp.141--158.
- King, Stephen, 2000, On Writing, pp.111--137.
- Zinsser, William, 1976, On Writing Well, pp.6--32 and pp.169--177.
- King, Gary, 2006, "Publication, Publication", PS: Political Science & Politics, 39 (1): 119--25, 10.1017/S1049096506060252.
Class:
- Lecture:
- Features of good writing in the following categories: Title, Abstract, Introduction, Data, Model, Results, Discussion (40 min)
- Demonstrate:
- Provide examples of writing in the categories and show how to improve them (30 min)
- Worksheet:
- Provide three examples of analysis and have students draft the papers (30 min)
- Provide three examples of drafts and have students edit them (30 min)
- Make a plan, based on G. King (2006), for how you will write a meaningful paper by the end of this class. Detail three journals/conferences, in order, that you will submit it to, and why the paper would be a good fit at each (10 min)
Week 6 "Data gathering and validation"
Learning outcomes:
- think seriously about measurement;
- understand basic sampling concepts;
- know how to use Python to scrape websites and gather data from APIs;
- know how to establish tests for data;
- be aware of some common data issues.
Material:
- Alexander, Rohan, 2023, Telling Stories with Data, Ch 6.
- Statistics Canada, 2023, "Guide to the Census of Population, 2021", Ch 9, "Data quality evaluation".
- Bowley, Arthur Lyon, 1913, "Working-Class Households in Reading", Journal of the Royal Statistical Society 76 (7): 672--701, 10.2307/2339708.
- Neyman, Jerzy, 1934, "On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection", Journal of the Royal Statistical Society, 97 (4): 558--625, 10.2307/2342192.
- Quartz, "Bad Data Guide".
- Radcliffe, Nicholas J., 2026, Test-Driven Data Analysis, Chs 1--2.
Class:
- Lecture:
- Tea-tasting (15 min)
- Properties of measurements (10 min)
- Measurement error (10 min)
- Missing data (10 min)
- Data errors (15 min)
- Sampling (30 min)
- Demonstrate:
- Web scraping (30 min)
- Using an API (30 min)
- Writing data tests (30 min)
- Worksheet (30 min):
- Develop simulated examples of three errors from the Quartz Bad Data Guide. Then develop a test for them using pydantic (Python) or pointblank (R).
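As a starting point for the pydantic option, the following is a minimal sketch of a row-level data test. It assumes pydantic v2 and uses a hypothetical survey schema; the second row contains a simulated error.

```python
# Minimal sketch of a row-level data test with pydantic (v2).
from pydantic import BaseModel, ValidationError, field_validator

class Respondent(BaseModel):
    respondent_id: int
    age: int
    province: str

    @field_validator("age")
    @classmethod
    def age_in_range(cls, value: int) -> int:
        if not 0 <= value <= 120:
            raise ValueError("age must be between 0 and 120")
        return value

rows = [
    {"respondent_id": 1, "age": 34, "province": "ON"},
    {"respondent_id": 2, "age": -3, "province": "BC"},   # simulated error
]

for row in rows:
    try:
        Respondent(**row)
    except ValidationError as err:
        print(f"Row {row['respondent_id']} failed validation:\n{err}")
```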
Week 7 "Data management and model interpretation"
Learning outcomes:
- Know how to manage data including filenames, variable names, codebooks, folder organization and documentation.
- Know how to communicate model estimates for common models including linear, logistic, Poisson.
- Know how to build DAGs.
Material:
- Lewis, Crystal, 2024, Data Management in Large-Scale Education Research, https://datamgmtinedresearch.com, Chapters 3, 4, 5, 8.3-8.5, and 9.
- Arel-Bundock, Vincent, 2025, Model to Meaning: How to Interpret Statistical Models with marginaleffects for R and Python, https://marginaleffects.com, Chapter 15.
Class:
- Lecture:
- Data organization (15 min)
- Human subjects data (15 min)
- Data management plan (15 min)
- Documentation (15 min)
- Style guide (15 min)
- Demonstrate:
- Creating marginal effects tables and graphs for linear, logistic, and Poisson models (30 min; a sketch follows below)
- Worksheet:
- Provide output for three different models (linear, logistic, and Poisson) and have students write the results section (30 min)
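The following is a minimal sketch of communicating logistic-regression estimates on the probability scale. It uses simulated data and statsmodels' built-in average marginal effects as one option alongside the marginaleffects package discussed in the reading; the coefficients are arbitrary.

```python
# Minimal sketch: fit a logistic regression and report average marginal effects,
# which are easier to communicate than log-odds. Data are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(853)
n = 1_000
x = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(0.5 + 1.0 * x[:, 0] - 0.8 * x[:, 1])))
y = rng.binomial(1, p)

model = sm.Logit(y, sm.add_constant(x)).fit()
print(model.summary())

# Average change in P(y = 1) for a small change in each predictor.
print(model.get_margeff(at="overall").summary())
```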
Week 8
- In-class exam II
Part 3: Integrating LLMs
Week 9 "Foundation models"
Learning outcomes:
- Understand what a foundation model is and how to use one in an application.
Material:
- Huyen, Chip, 2025, AI Engineering: Building Applications with Foundation Models, Chs 1 and 2.
Class:
- Lecture:
- Building applications (60 min)
- Foundation models (60 min)
- Demonstrate:
- Creating an LLM-based application (40 min; a sketch follows below)
- Worksheet (30 min)
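The following is a minimal sketch of an LLM-based application: a function that classifies the sentiment of a review by calling a hosted model. It assumes the openai Python client with an API key in the OPENAI_API_KEY environment variable; the model name is illustrative and may need updating.

```python
# Minimal sketch of an LLM-based application: sentiment classification.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sentiment(review: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "Classify the review as positive, negative, or neutral. "
                        "Reply with one word."},
            {"role": "user", "content": review},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify_sentiment("The package arrived late and the box was damaged."))
```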
Week 10 "Evaluation"
Learning outcomes:
- Know how to evaluate foundation-model-based applications.
- Know what a benchmark is and why it is important.
- Know how to build a benchmark.
Material:
- Huyen, Chip, 2025, AI Engineering: Building Applications with Foundation Models, Chs 3 and 4.
- Reuel, Anka, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel J. Kochenderfer, 2024, "BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices", https://arxiv.org/abs/2411.12990.
- Hughes, Evelyn, and Rohan Alexander, "Autonomous end-to-end data analysis with LLMs".
Class:
- Lecture:
- Evaluation methodology (60 min)
- Evaluating AI systems (60 min)
- Demonstrate:
- Build a benchmark, as a class (60 min; a starting sketch follows below)
- Worksheet (30 min)
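The following is a minimal sketch of the structure of a benchmark: items with known answers, a model call, and a scoring rule. The items are toy examples and ask_model() is a placeholder to be replaced with a real model call (for instance, the application from Week 9).

```python
# Minimal sketch of a tiny benchmark scored by exact match.
benchmark = [
    {"question": "What is the median of 1, 3, 9?", "answer": "3"},
    {"question": "Is a p-value the probability that the null is true? yes/no", "answer": "no"},
    {"question": "How many parameters does y = a + bx have?", "answer": "2"},
]

def ask_model(question: str) -> str:
    # Placeholder: replace with a real LLM call.
    return "3"

def evaluate(items: list[dict]) -> float:
    correct = sum(
        ask_model(item["question"]).strip().lower() == item["answer"]
        for item in items
    )
    return correct / len(items)

print(f"Accuracy: {evaluate(benchmark):.2f}")
```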
Week 11 "Prompts and agents"
Learning outcomes:
- Know how to deploy an LLM-based application.
- Know how to use an agent such as Claude Code or Amp.
- Have principles for editing code.
Material:
- Huyen, Chip, 2025, AI Engineering: Building Applications with Foundation Models, Ch 5.
Class:
- Lecture:
- Prompt engineering (60 min)
- Demonstrate:
- Principles of editing (40 min)
- Setting up Claude Code (10 min)
- Using Claude Code (10 min)
- Worksheet: Provide five examples of code and have students edit them.
Week 12
- In-class exam III
Assessment
In-class exams
- Due dates: In class during Weeks 4, 8, and 12.
- Weight: 30 per cent each for the best two exams (the worst result is dropped).
- Task: Write a test under exam conditions. Questions are based on content from the corresponding part of the course.
Final paper
- Due date: Exam block.
- Weight: 40 per cent.
- Task: Write an original paper on a topic covered in the class. The paper should be of submittable quality and there should be a clear path to submission. Some ideas include:
- Develop a benchmark in a particular area of interest and then evaluate an LLM's performance.
- Identify a dataset of interest from the list here.