Topics in Statistical Data Science

Last updated: 2025-09-02

THESE ARE JUST RANDOM INITIAL THOUGHTS AND NOTHING IS FINALIZED YET.

Overview

This course covers current topics in data science from a statistical perspective. Emphasis is on practical aspects of data science, such as tools, workflows, reproducibility and communication.

Broadly, I assume that you have some statistical analysis skills from other classes. The main objective of this class is to provide you with a modern, end-to-end, reproducible data-science workflow, within which those analysis skills will sit.

Part 1 ensures that everyone has a common foundation of technical skills. These include: version control, reproducible environments, efficient data formats, Python (especially cleaning data with polars, estimating models with scikit-learn, and package management with uv), Quarto, database connections, REST APIs, and GitHub Actions.

Part 2 focuses on statistical workflow. Topics include writing and editing papers; gathering/validating data; data management; and model interpretation.

Part 3 integrates AI: using code assistants responsibly, running jobs on shared compute, making useful benchmarks, and deploying AI-based tasks.

Learning objectives

The purpose of the course is to develop the skills needed to effectively do data science using modern tools, with a strong statistical focus. By the end of the course, you should be able to:

Content

Part 1: Technical skills

Week 1 "Reproducible and trustworthy workflows for data science"

8 September 2025

Learning outcomes:

  1. think seriously about workflow;
  2. know how and why to pull, commit, add, and push using Git and GitHub;
  3. be comfortable making and receiving Pull Requests, and creating and closing Issues using GitHub to work as a team;
  4. know how to set up and use a Python workspace in VS Code with uv;
  5. know how to import and inspect CSV/Parquet files, and build and transform datasets with polars; and
  6. know how to make presentations with Quarto.

Material:

Class:

Week 2 "Using Python for data science"

15 September 2025

Learning outcomes:

  1. know how to generate synthetic data from common probability distributions;
  2. know how to produce clear graphs and tables; and
  3. know how to apply scikit-learn (ridge, lasso, logistic, Poisson, random forests) with bootstrap and cross-validation, and present results in Quarto with cross-references.
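
As a rough illustration of the first outcome and of the bootstrap mentioned in the third, here is a minimal sketch that uses only the Python standard library rather than the polars and scikit-learn stack named above. The function names, the seed, and the distribution parameters are assumptions chosen for illustration, not course material.

```python
import random
import statistics


def simulate(n, seed=853):
    """Draw n synthetic observations from three common distributions."""
    rng = random.Random(seed)  # a seeded generator makes the draws reproducible
    return {
        "normal": [rng.gauss(0.0, 1.0) for _ in range(n)],
        "uniform": [rng.uniform(0.0, 10.0) for _ in range(n)],
        "exponential": [rng.expovariate(0.5) for _ in range(n)],  # rate 0.5, mean 2
    }


def bootstrap_mean_interval(values, reps=1000, seed=853):
    """Percentile bootstrap: resample with replacement, keep the middle 95% of means."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values))) for _ in range(reps)
    )
    lower_index = (25 * reps) // 1000
    upper_index = (975 * reps) // 1000 - 1
    return means[lower_index], means[upper_index]


data = simulate(500)
low, high = bootstrap_mean_interval(data["exponential"])
print(f"exponential sample mean: {statistics.fmean(data['exponential']):.2f}")
print(f"95% bootstrap interval for the mean: ({low:.2f}, {high:.2f})")
```

Because both functions take a seed, rerunning the script gives the same numbers, which is the property a reproducible workflow relies on.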

Material:

Class:

Week 3 "DevOps for data science"

22 September 2025

Learning outcomes:

  1. know how and why to create reproducible, portable environments;
  2. know how to access data securely;
  3. understand efficient data engineering;
  4. know how to ensure observability and quality; and
  5. understand how to put a workflow into production with GitHub Actions.

Material:

Class:

Week 4

29 September 2025

Part 2: Statistical workflow

Week 5 "Writing and editing"

6 October 2025

Learning outcomes:

  1. understand that writing is a critical skill, perhaps the most important of all the skills required to analyze data;
  2. know how to focus on one main message that we want to communicate to the reader;
  3. appreciate the value of being able to get to a first draft as quickly as possible; and
  4. know how to rewrite.

Material:

Class:

Week 6 "Data gathering and validation"

20 October 2025

Learning outcomes:

  1. think seriously about measurement;
  2. understand basic sampling concepts;
  3. know how to use Python to scrape websites and gather data from APIs;
  4. know how to establish tests for data; and
  5. know some common issues to be aware of.
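
To make "tests for data" concrete, here is a minimal sketch in plain Python. The column names (`id`, `age`, `province`), the allowed values, and the example rows are all hypothetical, and a real project would more likely use a dedicated validation library.

```python
# Hypothetical set of valid codes, used only for this illustration.
ALLOWED_PROVINCES = {"ON", "QC", "BC"}


def check_rows(rows):
    """Return a list of readable problems; an empty list means every test passed."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Test 1: identifiers are unique.
        if row["id"] in seen_ids:
            problems.append(f"row {i}: duplicate id {row['id']}")
        seen_ids.add(row["id"])
        # Test 2: age is present and within a plausible range.
        age = row["age"]
        if age is None:
            problems.append(f"row {i}: age is missing")
        elif not 0 <= age <= 120:
            problems.append(f"row {i}: age {age} outside 0-120")
        # Test 3: categorical values come from the expected set.
        if row["province"] not in ALLOWED_PROVINCES:
            problems.append(f"row {i}: unexpected province {row['province']!r}")
    return problems


clean = [
    {"id": 1, "age": 34, "province": "ON"},
    {"id": 2, "age": 58, "province": "QC"},
]
dirty = [
    {"id": 1, "age": 34, "province": "ON"},
    {"id": 1, "age": 150, "province": "ON"},
    {"id": 3, "age": None, "province": "Ontario"},
]

for name, rows in [("clean", clean), ("dirty", dirty)]:
    print(name, check_rows(rows))
```

Returning a list of problems, rather than stopping at the first failure, lets one run report everything that needs fixing in the dataset.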

Material:

Class:

Week 7 "Data management and model interpretation"

3 November 2025

Learning outcomes:

  1. Know how to rigorously manage data, including filenames, variable names, codebooks, folder organization, and documentation.
  2. Know how to communicate model estimates for common models, including linear, logistic, and Poisson regression.
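
One common way to communicate logistic and Poisson estimates is to exponentiate the coefficient, which is on a log scale, and report it as a ratio with an interval. The sketch below assumes a Wald interval with z = 1.96; the helper names, the term names, and the coefficient values are hypothetical and do not come from any fitted model.

```python
import math


def ratio_with_interval(estimate, std_error, z=1.96):
    """Exponentiate a log-scale coefficient and the ends of its Wald interval."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )


def describe(term, estimate, std_error, kind):
    """Turn a coefficient into one plain-language sentence."""
    outcome = {"logistic": "odds", "poisson": "expected count"}[kind]
    ratio, low, high = ratio_with_interval(estimate, std_error)
    change = (ratio - 1.0) * 100.0
    direction = "higher" if change >= 0 else "lower"
    return (
        f"A one-unit increase in {term} is associated with {abs(change):.0f}% "
        f"{direction} {outcome} (ratio {ratio:.2f}, 95% CI {low:.2f} to {high:.2f})."
    )


# Made-up coefficients, for illustration only.
print(describe("years_of_education", 0.25, 0.05, "logistic"))
print(describe("distance_km", -0.10, 0.02, "poisson"))
```

For a linear model the coefficient is already on the outcome scale, so it can be reported directly with its interval and units.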

Material:

Class:

Week 8

10 November 2025

Part 3: Integrating AI

Week 9 "Using Python for data science, with AI"

Learning outcomes:

  1. Know how to use Claude Code.
  2. Have principles for editing code.

Material:

Class:

Week 10 "Developing benchmarks"

17 November 2025

Learning outcomes:

  1. Understand key sampling principles.
  2. Know what a benchmark is and why it is important.
  3. Know how to build a benchmark.
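
At its simplest, a benchmark is a fixed set of cases with known answers, a scoring rule, and a report. The sketch below assumes exact-match scoring after light normalization. The `Case` class, the three cases, and `toy_system` are hypothetical stand-ins for whatever system is being evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str


def normalize(text):
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def run_benchmark(system, cases):
    """Score a system (any callable from prompt to answer) by exact-match accuracy."""
    if not cases:
        raise ValueError("a benchmark needs at least one case")
    failures = []
    correct = 0
    for case in cases:
        answer = system(case.prompt)
        if normalize(answer) == normalize(case.expected):
            correct += 1
        else:
            failures.append((case.prompt, answer))  # keep failures for inspection
    return {"n": len(cases), "accuracy": correct / len(cases), "failures": failures}


CASES = [
    Case("capital of France", "Paris"),
    Case("capital of Canada", "Ottawa"),
    Case("capital of Australia", "Canberra"),
]


def toy_system(prompt):
    """A deliberately imperfect stand-in for the system under test."""
    answers = {
        "capital of France": "Paris",
        "capital of Canada": "Toronto",
        "capital of Australia": " canberra ",
    }
    return answers.get(prompt, "")


report = run_benchmark(toy_system, CASES)
print(report)
```

Keeping the failures alongside the headline accuracy makes it easier to see whether errors cluster in a particular kind of case, which ties back to how the cases were sampled.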

Material:

Class:

Week 11 "Deploying AI"

24 November 2025

Learning outcomes:

  1. Know how to deploy an AI-based application.

Material:

Class:

Week 12

1 December 2025

Assessment

In-class presentation

Mid-terms

Final paper