STA4101: Topics in Statistical Data Science

Last updated: 2025-09-17

Essentials

Overview

This course covers current topics in data science from a statistical perspective. I assume that you have some statistical analysis skills from other classes (for instance, that you've learned about linear regression). I emphasise practical aspects of data science, such as tools, workflows, reproducibility and communication. The main objective of this class is to provide you with a modern, end-to-end, reproducible data-science workflow, within which those analysis skills will sit.

Part 1 ensures that everyone has a common foundation of technical skills. These include: version control, reproducible environments, efficient data formats, Python (especially cleaning data with polars, estimating models with scikit-learn, and package management with uv), Quarto, database connections, using REST APIs, and implementing GitHub Actions.

Part 2 focuses on statistical workflow. Topics include writing and editing papers; gathering and validating data; data management; and model interpretation.

Part 3 integrates LLMs: using code assistants responsibly, running jobs on shared compute, making useful benchmarks, and deploying LLM-based tasks.

The course GitHub repo is: https://github.com/RohanAlexander/STA4101.

Learning objectives

The purpose of the course is to develop the skills needed to effectively do data science using modern tools, with a strong statistical focus. By the end of the course, you should be able to:

Content

Part 1: Technical skills

Week 1 "Reproducible and trustworthy workflows for data science"

8 September 2025

Learning outcomes:

  1. think seriously about workflow;
  2. know how and why to pull, commit, add, and push using Git and GitHub;
  3. be comfortable making and receiving Pull Requests, and creating and closing Issues using GitHub to work as a team;
  4. know how to set up and use a Python workspace in VS Code with uv;
  5. know how to import and inspect CSV and Parquet files, and how to build and transform datasets with polars; and
  6. know how to make presentations with Quarto.

Material:

Class:

Week 2 "Using Python for data science"

15 September 2025

Learning outcomes:

  1. know how to generate synthetic data from common probability distributions;
  2. know how to produce clear graphs and tables; and
  3. know how to apply scikit-learn (ridge, lasso, logistic, Poisson, random forests) with bootstrap and cross-validation, and present results in Quarto with cross-references.
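The outcomes above can be sketched together in a few lines: simulate data from a known model, fit a ridge regression, assess it with cross-validation, and use a bootstrap to get rough intervals for the coefficients. The sample size, coefficients, penalty (`alpha=1.0`) and number of bootstrap draws are arbitrary choices for illustration, not values from the course.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data: three normal predictors and a linear outcome with noise.
rng = np.random.default_rng(seed=4101)
n_obs = 200
X = rng.normal(size=(n_obs, 3))
true_coefs = np.array([2.0, -1.0, 0.5])  # assumed "truth" for the simulation
y = X @ true_coefs + rng.normal(scale=0.5, size=n_obs)

# Five-fold cross-validated R^2 for a ridge regression.
cv_r2 = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print("Mean cross-validated R^2:", cv_r2.mean())

# Nonparametric bootstrap: resample rows with replacement and refit.
n_boot = 200
boot_coefs = np.empty((n_boot, 3))
for b in range(n_boot):
    idx = rng.integers(0, n_obs, size=n_obs)
    boot_coefs[b] = Ridge(alpha=1.0).fit(X[idx], y[idx]).coef_

# Percentile intervals for each coefficient.
ci_low, ci_high = np.percentile(boot_coefs, [2.5, 97.5], axis=0)
for j in range(3):
    print(f"coef {j}: 95% bootstrap interval ({ci_low[j]:.2f}, {ci_high[j]:.2f})")
```

Because the data are simulated, the intervals can be compared with the coefficients that generated them, which is a useful check before applying the same code to real data.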

Material:

Class:

Week 3 "DevOps for data science"

22 September 2025

Learning outcomes:

  1. know how and why to create reproducible, portable environments;
  2. know how to access data securely;
  3. understand efficient data engineering;
  4. know how to ensure observability and quality; and
  5. understand how to put a workflow into production with GitHub Actions.
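One common ingredient of secure data access is keeping credentials out of the code and reading them from the environment instead, which also works in automated runs. The sketch below assumes that interpretation. The helper `get_required_secret` and the variable name `STA4101_DB_PASSWORD` are hypothetical.

```python
import os
from collections.abc import Mapping


def get_required_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return a secret from the environment, failing loudly if it is absent."""
    value = env.get(name, "").strip()
    if not value:
        # Fail early with a clear message rather than connecting with no credentials.
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Demonstration with a stand-in mapping, so no real secret is needed to run this.
fake_env = {"STA4101_DB_PASSWORD": "not-a-real-password"}
password = get_required_secret("STA4101_DB_PASSWORD", env=fake_env)
print("Secret loaded, length:", len(password))
```

Passing the environment as an argument keeps the function easy to test, and printing only the length avoids leaking the secret into logs.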

Material:

Class:

Week 4

29 September 2025

Part 2: Statistical workflow

Week 5 "Writing and editing"

6 October 2025

Learning outcomes:

  1. understand that writing is a critical skill, perhaps the most important of all the skills required to analyze data;
  2. know how to focus on one main message that we want to communicate to the reader;
  3. appreciate the value of being able to get to a first draft as quickly as possible; and
  4. know how to rewrite.

Material:

Class:

Week 6 "Data gathering and validation"

20 October 2025

Learning outcomes:

  1. think seriously about measurement;
  2. understand basic sampling concepts;
  3. know how to use Python to scrape websites and gather data from APIs;
  4. know how to establish tests for data; and
  5. know some common issues to be aware of.
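To make outcome 4 concrete, here is a minimal sketch of tests for data written in plain Python: each row is checked for a unique identifier and a plausible age, and every problem is collected rather than stopping at the first. The field names (`id`, `age`), the age range and the function `check_rows` are assumptions for illustration only.

```python
def check_rows(rows: list[dict]) -> list[str]:
    """Return a description of every problem found; an empty list means all tests pass."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Test 1: every row has an id, and ids are unique.
        row_id = row.get("id")
        if row_id is None:
            problems.append(f"row {i}: missing id")
        elif row_id in seen_ids:
            problems.append(f"row {i}: duplicate id {row_id}")
        else:
            seen_ids.add(row_id)

        # Test 2: age is an integer in a plausible range (assumed 0 to 120).
        age = row.get("age")
        if not isinstance(age, int) or not 0 <= age <= 120:
            problems.append(f"row {i}: implausible age {age!r}")
    return problems


# Toy data (hypothetical) with two deliberate errors.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},
    {"id": 2, "age": 61},
]

problems = check_rows(rows)
for problem in problems:
    print(problem)
```

Collecting all failures in one pass gives a fuller picture of data quality than raising on the first bad row.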

Material:

Class:

Week 7 "Data management and model interpretation"

3 November 2025

Learning outcomes:

  1. Know how to rigorously manage data including filenames, variable names, codebooks, folder organization and documentation.
  2. Know how to communicate model estimates for common models including linear, logistic, Poisson.
  3. Know how to build DAGs.
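For outcome 2, one way to communicate estimates from logistic and Poisson models is to move them off the log scale: exponentiated coefficients become odds ratios or rate ratios, and predicted probabilities can be reported at chosen values of a predictor. This is a minimal sketch. The intercept and coefficient are invented numbers, not estimates from any real model.

```python
import math


def odds_ratio(log_odds_coef: float) -> float:
    """Logistic regression: multiplicative change in the odds per one-unit increase."""
    return math.exp(log_odds_coef)


def rate_ratio(log_rate_coef: float) -> float:
    """Poisson regression: multiplicative change in the expected count per one-unit increase."""
    return math.exp(log_rate_coef)


def logistic_probability(intercept: float, coef: float, x: float) -> float:
    """Predicted probability from a one-predictor logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * x)))


# Hypothetical estimates, chosen only for illustration.
intercept, coef = -1.0, 0.7

p0 = logistic_probability(intercept, coef, x=0.0)
p1 = logistic_probability(intercept, coef, x=1.0)

print(f"Odds ratio per unit of x: {odds_ratio(coef):.2f}")
print(f"Predicted probability at x = 0: {p0:.2f}")
print(f"Predicted probability at x = 1: {p1:.2f}")
```

Reporting predicted probabilities alongside the odds ratio helps because the change in probability depends on the starting point, while the odds ratio does not.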

Material:

Class:

Week 8

10 November 2025

Part 3: Integrating LLMs

Week 9 "Using Python for data science, with LLMs"

Learning outcomes:

  1. Know how to use Claude Code.
  2. Have principles for editing code.

Material:

Class:

Week 10 "Developing benchmarks"

17 November 2025

Learning outcomes:

  1. Understand key sampling principles.
  2. Know what a benchmark is and why it is important.
  3. Know how to build a benchmark.
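A benchmark can be sketched as a fixed set of items with known answers, a scoring rule, and a summary that acknowledges sampling uncertainty. The example below uses exact-match scoring and a binomial standard error. The items, the stand-in `toy_system` and the scoring rule are hypothetical choices, and a real benchmark would call an actual model in place of the stub.

```python
import math
from typing import Callable

BenchmarkItem = tuple[str, str]  # (question, expected answer)


def exact_match_accuracy(
    items: list[BenchmarkItem], system: Callable[[str], str]
) -> tuple[float, float]:
    """Return accuracy and its binomial standard error over the benchmark items."""
    if not items:
        raise ValueError("A benchmark needs at least one item.")
    scores = [
        1.0 if system(question).strip().lower() == expected.strip().lower() else 0.0
        for question, expected in items
    ]
    n = len(scores)
    accuracy = sum(scores) / n
    # Treat the items as a sample from a larger population of possible items.
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy, standard_error


def toy_system(question: str) -> str:
    """Stand-in for a model: a lookup table that knows only two answers."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(question, "unknown")


items = [
    ("2 + 2", "4"),
    ("capital of France", "paris"),
    ("3 * 3", "9"),
    ("capital of Canada", "Ottawa"),
]

accuracy, standard_error = exact_match_accuracy(items, toy_system)
print(f"Accuracy: {accuracy:.2f} (standard error {standard_error:.2f})")
```

With only four items the standard error is large, which illustrates why the number and selection of items matter as much as the headline score.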

Material:

Class:

Week 11 "Deploying LLMs"

24 November 2025

Learning outcomes:

  1. Know how to deploy an LLM-based application.

Material:

Class:

Week 12

1 December 2025

Assessment (option 1)

Mid-terms

Final paper

Assessment (option 2)

In-class presentation

Mid-terms

Final paper