Topics in Statistical Data Science
Last updated: 2025-09-02
THESE ARE JUST RANDOM INITIAL THOUGHTS AND NOTHING IS FINALIZED YET.
Overview
This course covers current topics in data science from a statistical perspective. Emphasis is on practical aspects of data science, such as tools, workflows, reproducibility and communication.
Broadly, I assume that you have some statistical analysis skills from other classes. The main objective of this class is to provide you with a modern, end-to-end, reproducible data-science workflow, within which those analysis skills will sit.
Part 1 ensures that everyone has a common foundation of technical skills. These include: version control, reproducible environments, efficient data formats, Python (especially cleaning data with polars, estimating models with scikit-learn, and package management with uv), Quarto, database connections, REST APIs, and GitHub Actions.
Part 2 focuses on statistical workflow. Topics include writing and editing papers; gathering/validating data; data management; and model interpretation.
Part 3 integrates AI: using code assistants responsibly, running jobs on shared compute, making useful benchmarks, and deploying AI-based tasks.
Learning objectives
The purpose of the course is to develop the skills needed to effectively do data science using modern tools, with a strong statistical focus. By the end of the course, you should be able to:
- set up and use reproducible pipelines;
- manage real datasets;
- communicate results effectively;
- create maintainable, production-ready analyses; and
- apply LLMs in data science appropriately.
Content
Part 1: Technical skills
Week 1 "Reproducible and trustworthy workflows for data science"
8 September 2025
Learning outcomes:
- think seriously about workflow;
- know how and why to pull, add, commit, and push using Git and GitHub;
- be comfortable making and receiving Pull Requests, and creating and closing Issues using GitHub to work as a team;
- know how to set up and use a Python workspace in VS Code with uv;
- know how to import and inspect CSV and Parquet files, and build and transform datasets with polars; and
- know how to make presentations with Quarto.
Material:
- Deffner, Dominik, Natalia Fedorova, Jeffrey Andrews, and Richard McElreath, 2024, "Bridging theory and data: A computational workflow for cultural evolution", PNAS, 10.1073/pnas.2322887121.
- Timbers, Tiffany A., Joel Ostblom, Florencia D'Andrea, Rodolfo Lourenzutti, and Daniel Chen, 2025, Reproducible and Trustworthy Workflows for Data Science (Chapters on Version Control, and Project Environments).
- uv: An extremely fast Python package and project manager.
- Reading and Writing the Apache Parquet Format.
Class:
- Lecture: Workflow (20 min)
- Lecture: Version control (20 min)
- Demonstrate: Pull, add, commit, and push using Git and GitHub (20 min)
- Demonstrate: Making and receiving PRs and Issues, and resolving Git conflicts (20 min)
- Demonstrate: Installing uv, then using it to install Python and core packages, create project-specific environments, and lint code (20 min)
- Demonstrate: Saving and reading Parquet and CSVs (10 min)
- Demonstrate: Cleaning and summarizing some data with polars (20 min)
- Demonstrate: Installing Quarto and creating slides (20 min)
- Worksheet (20 min)
Week 2 "Using Python for data science"
15 September 2025
Learning outcomes:
- know how to generate synthetic data from common probability distributions;
- know how to produce clear graphs and tables; and
- know how to apply scikit-learn (ridge, lasso, logistic, Poisson, random forests) with bootstrap and cross-validation, and present results in Quarto with cross-references.
Material:
- McKinney, Wes, Python for Data Analysis, 3rd edition.
- Efron, Bradley, and Trevor Hastie, 2016, Computer Age Statistical Inference, Cambridge University Press.
- Tufte, Edward, 2001, The Visual Display of Quantitative Information, Graphics Press.
- Cleveland, William, 1994, The Elements of Graphing Data, Hobart Press.
Class:
- Lecture: Effective graphs (20 min)
- Demonstrate: Generating synthetic datasets by sampling from normal, uniform, and Poisson distributions (20 min)
- Demonstrate: Making effective graphs and tables (20 min)
- Demonstrate: Fitting ridge, lasso, logistic, Poisson, and random forests with scikit-learn (30 min)
- Demonstrate: Implementing bootstrap and cross-validation (20 min)
- Demonstrate: Presenting results in Quarto with cross-references (15 min)
- Worksheet (40 min)
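As a rough illustration of the synthetic-data and bootstrap demonstrations above, here is a minimal sketch that uses only the Python standard library so that it runs anywhere. The variable names (`height`, `wait`, `visits`), the distribution parameters, and the seed are all assumptions chosen for illustration; in class the same ideas would more likely be expressed with the course's own tooling.

```python
import math
import random
import statistics


def sample_poisson(rng: random.Random, lam: float) -> int:
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate(n: int, seed: int = 853) -> list[dict]:
    """Generate n synthetic rows; columns and parameters are illustrative."""
    rng = random.Random(seed)
    return [
        {
            "height": rng.gauss(170.0, 10.0),    # normal(mean=170, sd=10)
            "wait": rng.uniform(0.0, 30.0),      # uniform on [0, 30]
            "visits": sample_poisson(rng, 3.0),  # Poisson(rate=3)
        }
        for _ in range(n)
    ]


def bootstrap_ci(values, n_resamples: int = 2000, level: float = 0.95, seed: int = 853):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recompute the mean each time, then sort.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


data = simulate(500)
visits = [row["visits"] for row in data]
low, high = bootstrap_ci(visits)
print(f"mean visits = {statistics.fmean(visits):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Because the data-generating process is known, the interval can be checked against the true rate of 3, which is the main reason to simulate before touching real data.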
Week 3 "DevOps for data science"
22 September 2025
Learning outcomes:
- know how and why to create reproducible, portable environments;
- know how to access data securely;
- understand efficient data engineering;
- know how to ensure observability and quality; and
- understand how to put a workflow into production with GitHub Actions.
Material:
Class:
- Lecture: Why reproducible, portable environments matter (15 min)
- Demonstrate: Creating and using a Docker instance (15 min)
- Lecture: Secure data access (15 min)
- Demonstrate: Connecting to an API using environment variables for credentials (15 min)
- Lecture: Efficient data engineering (15 min)
- Lecture: Observability and quality checks (15 min)
- Demonstrate: Adding assertions to test joins, transformations, and model outputs (15 min)
- Demonstrate: Writing a GitHub Actions workflow (20 min)
- Worksheet: Build a reproducible workflow that loads data securely, applies transformations with checks, and runs automatically with GitHub Actions (30 min)
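Two of the ideas above, reading credentials from environment variables and adding assertions around a join, can be sketched as follows. This is only a sketch under stated assumptions: the variable name `DATA_API_TOKEN`, the bearer-token header, and the helper `left_join_checked` are hypothetical and do not come from any particular API or library.

```python
import os

# Hypothetical variable name; use whatever your data provider documents.
TOKEN_VAR = "DATA_API_TOKEN"


def get_api_token(var_name: str = TOKEN_VAR) -> str:
    """Read a credential from the environment rather than hard-coding it."""
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token


def left_join_checked(left, right, key):
    """Left-join two lists of dicts on `key`, asserting the join is safe."""
    right_keys = [row[key] for row in right]
    # A duplicated key on the right would silently duplicate left rows.
    assert len(right_keys) == len(set(right_keys)), "duplicate keys in right table"
    lookup = {row[key]: row for row in right}
    unmatched = [row[key] for row in left if row[key] not in lookup]
    # Unmatched keys would otherwise show up later as missing values.
    assert not unmatched, f"keys missing from right table: {unmatched}"
    joined = [{**row, **lookup[row[key]]} for row in left]
    # A left join against unique keys must preserve the row count.
    assert len(joined) == len(left), "row count changed during join"
    return joined


# For this standalone sketch only: supply a placeholder if nothing is set.
# In real use, export the variable in your shell or store it as a CI secret.
os.environ.setdefault(TOKEN_VAR, "placeholder-token")
headers = {"Authorization": f"Bearer {get_api_token()}"}  # assumed auth scheme

students = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}]
grades = [{"id": 1, "grade": 82}, {"id": 2, "grade": 77}]
joined = left_join_checked(students, grades, key="id")
print(joined)
```

Failing loudly at the join is a deliberate choice: a pipeline that runs unattended in GitHub Actions should stop rather than publish results built on a bad merge.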
Week 4
29 September 2025
Part 2: Statistical workflow
Week 5 "Writing and editing"
6 October 2025
Learning outcomes:
- understand that writing is a critical skill, perhaps the most important of all the skills required to analyze data;
- know how to focus on one main message that we want to communicate to the reader;
- appreciate the value of being able to get to a first draft as quickly as possible; and
- know how to rewrite.
Material:
- Alexander, Rohan, 2023, Telling Stories with Data.
- Caro, Robert, 2019, Working.
- King, Stephen, 2000, On Writing.
- Zinsser, William, 1976, On Writing Well.
- King, Gary, 2006, "Publication, Publication", PS: Political Science & Politics, 39 (1): 119--25, 10.1017/S1049096506060252.
Class:
- Lecture: Features of good writing in the following categories: Title, Abstract, Introduction, Data, Model, Results, Discussion (40 min)
- Demonstrate: Provide examples of writing in the categories and show how to improve them (30 min)
- Worksheet: Provide three examples of analysis and have students draft the papers (30 min)
- Worksheet: Provide three examples of drafts and have students edit them (30 min)
- Worksheet: Make a plan, based on G. King (2006), for how you will write a meaningful paper by the end of this class. Detail three journals/conferences, in order, that you will submit it to, and why the paper would be a good fit at each (10 min)
Week 6 "Data gathering and validation"
20 October 2025
Learning outcomes:
- think seriously about measurement;
- understand basic sampling concepts;
- know how to use Python to scrape websites and gather data from APIs;
- know how to establish tests for data; and
- know some common issues to be aware of.
Material:
- Alexander, Rohan, 2023, Telling Stories with Data.
- Statistics Canada, 2023, "Guide to the Census of Population, 2021", Statistics Canada.
- Bowley, Arthur Lyon, 1913, "Working-Class Households in Reading", Journal of the Royal Statistical Society 76 (7): 672--701, 10.2307/2339708.
- Neyman, Jerzy, 1934, "On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection", Journal of the Royal Statistical Society, 97 (4): 558--625, 10.2307/2342192.
- Quartz, "Bad Data Guide".
- Radcliffe, Nicholas J., 2026, Test-Driven Data Analysis, Chs 1--2. (This book is pre-publication. Extracts will be provided for class purposes.)
Class:
- Lecture:
- Demonstrate:
- Worksheet:
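To make "establish tests for data" concrete, the following is a minimal sketch of row-level data tests written with the standard library. The column names (`id`, `age`, `response`, `interview_date`), the allowed values, and the example rows are all made up for illustration and are not taken from the readings.

```python
from datetime import date

# Assumed set of valid answers for a hypothetical survey question.
VALID_RESPONSES = {"yes", "no"}


def check_rows(rows):
    """Return a list of problems found; an empty list means every test passed."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Uniqueness: each respondent should appear once.
        if row["id"] in seen_ids:
            problems.append(f"row {i}: duplicate id {row['id']}")
        seen_ids.add(row["id"])
        # Completeness and range: age must be present and plausible.
        if row["age"] is None:
            problems.append(f"row {i}: age is missing")
        elif not 0 <= row["age"] <= 120:
            problems.append(f"row {i}: age {row['age']} outside 0-120")
        # Domain: responses must come from the allowed set.
        if row["response"] not in VALID_RESPONSES:
            problems.append(f"row {i}: unexpected response {row['response']!r}")
        # Plausibility: interviews cannot happen in the future.
        if row["interview_date"] > date.today():
            problems.append(f"row {i}: interview_date is in the future")
    return problems


clean = [
    {"id": 1, "age": 34, "response": "yes", "interview_date": date(2021, 5, 11)},
    {"id": 2, "age": 58, "response": "no", "interview_date": date(2021, 5, 12)},
]
dirty = clean + [
    {"id": 2, "age": 150, "response": "maybe", "interview_date": date(2999, 1, 1)},
]

print(check_rows(clean))
for problem in check_rows(dirty):
    print(problem)
```

Collecting every problem before reporting, rather than stopping at the first, tends to suit data work because one bad extract usually fails several tests at once.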
Week 7 "Data management and model interpretation"
3 November 2025
Learning outcomes:
- Know how to rigorously manage data, including filenames, variable names, codebooks, folder organization, and documentation.
- Know how to communicate model estimates for common models, including linear, logistic, and Poisson.
Material:
Class:
- Lecture:
- Demonstrate:
- Worksheet: Provide output for five different models: linear, logistic, Poisson, X and Y, and then have the students write the results section.
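When writing up logistic or Poisson estimates, one common step is to exponentiate a coefficient so that it reads as an odds ratio or a rate ratio. A minimal sketch of that arithmetic follows; the coefficient values are invented for illustration, the interval is a simple Wald interval, and the helper names are hypothetical.

```python
import math


def ratio_with_ci(estimate: float, std_error: float, z: float = 1.96):
    """Exponentiate a log-scale coefficient and its Wald interval."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )


def describe(model: str, term: str, estimate: float, std_error: float) -> str:
    """Turn one coefficient into a sentence suitable for a results section."""
    # Logistic coefficients are log odds; Poisson coefficients are log rates.
    label = {"logistic": "odds", "Poisson": "expected count"}[model]
    ratio, low, high = ratio_with_ci(estimate, std_error)
    change = (ratio - 1) * 100
    direction = "higher" if change >= 0 else "lower"
    return (
        f"A one-unit increase in {term} is associated with a {abs(change):.0f}% "
        f"{direction} {label} (ratio {ratio:.2f}, 95% CI {low:.2f} to {high:.2f}), "
        f"holding other predictors constant."
    )


# Made-up estimates, purely to show the wording.
print(describe("logistic", "years of education", 0.25, 0.08))
print(describe("Poisson", "distance to clinic (km)", -0.10, 0.03))
```

Linear-model coefficients need no transformation: they are already on the scale of the outcome and can be reported directly with their intervals.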
Week 8
10 November 2025
Part 3: Integrating AI
Week 9 "Using Python for data science, with AI"
Learning outcomes:
- Know how to use Claude Code.
- Have principles for editing code.
Material:
Class:
- Lecture:
- Demonstrate:
- Worksheet:
Week 10 "Developing benchmarks"
17 November 2025
Learning outcomes:
- Understand key sampling principles.
- Know what a benchmark is and why it is important.
- Know how to build a benchmark.
Material:
- Reuel, Anka, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel J. Kochenderfer, 2024, "BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices", https://arxiv.org/abs/2411.12990.
- Hughes, Evelyn, and Rohan Alexander, "Autonomous end-to-end data analysis with LLMs".
- Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, 2009, "ImageNet: A Large-Scale Hierarchical Image Database", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), https://www.image-net.org/index.php.
Class:
- Lecture:
- Demonstrate:
- Worksheet:
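At its simplest, a benchmark is a fixed set of items with known answers, a scoring rule, and a summary that reports uncertainty along with the score. The sketch below shows that skeleton under heavy assumptions: the four arithmetic items, the exact-match scoring rule, and `toy_model` (a stand-in for a real LLM call) are all hypothetical.

```python
import math
from typing import Callable

# Toy items for illustration; a real benchmark would sample items
# deliberately from a well-defined population of tasks.
BENCHMARK = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is 10 - 6?", "answer": "4"},
    {"question": "What is 3 * 3?", "answer": "9"},
    {"question": "What is 8 / 2?", "answer": "4"},
]


def normalize(text: str) -> str:
    """Strip whitespace and case so formatting differences are not scored."""
    return text.strip().lower()


def evaluate(model: Callable[[str], str], items) -> dict:
    """Score a model by exact match and report a binomial standard error."""
    correct = sum(
        normalize(model(item["question"])) == normalize(item["answer"])
        for item in items
    )
    n = len(items)
    accuracy = correct / n
    std_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return {"n": n, "accuracy": accuracy, "std_error": std_error}


def toy_model(question: str) -> str:
    """Stand-in for a real model: always answers 4."""
    return " 4 "


result = evaluate(toy_model, BENCHMARK)
print(result)
```

Reporting the standard error next to the accuracy ties benchmarking back to sampling: with only a handful of items, differences between models are mostly noise.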
Week 11 "Deploying AI"
24 November 2025
Learning outcomes:
- Know how to deploy an AI-based application.
Material:
- Huyen, Chip, 2025, AI Engineering: Building Applications with Foundation Models, Chs 1--5.
Class:
- Lecture:
- Demonstrate:
- Worksheet:
Week 12
1 December 2025
Assessment
In-class presentation
- Due dates: You will present in class on the topic of the week.
- Weight: 20 per cent.
- Task: You should prepare slides and a worksheet using Quarto. Use the slides to deliver a 1-hour lecture about the topic of the week. There must be some aspect of live-coding in the lecture. After the lecture, you should have the class actively work through your worksheet for 30 min. Finally, you should answer questions and lead discussion for 30 min. You should use GitHub PRs to put together the slides, working within the class repo. You will be graded on: your use of GitHub to work as a team to create content, your content and its delivery, and your ability to answer questions. Topics will be allocated in Week 1.
Mid-terms
- Due dates: In-class Weeks 4, 8, and 12.
- Weight: 20 per cent each (x2); the worst of the three is dropped.
- Task: Write a test under exam conditions. Questions are based on content from the relevant part.
Final paper
- Due date: Exam block.
- Weight: 40 per cent.
- Task: Write an original paper on a topic covered in the class, submit it to a journal, and get past desk rejection. One idea would be to develop a benchmark in a particular area of interest and then evaluate an LLM's performance. You should discuss what you'd like to explore with me early in the semester, and you should have written permission from me by Reading Week.