Courses

Topics in Statistical Data Science

Last updated: 2025-07-29

THESE ARE JUST RANDOM INITIAL THOUGHTS AND NOTHING IS FINALIZED YET.

Overview

This course covers current topics in data science from a statistical perspective, with an emphasis on practical aspects such as tools, workflows, reproducibility, and communication. The exact topics and emphasis will vary from year to year, and not every topic will be covered exhaustively every year.

By the end of the course, students will (i) have a better understanding of current tools and practices in data science, (ii) have learned to communicate their findings effectively to a non-statistical audience, and (iii) be able to critically evaluate, from a statistical viewpoint, the use of the techniques they learned.

Learning objectives

The purpose of the course is to develop the skills needed to effectively do data science using modern tools, with a strong statistical focus. By the end of the course, you should be able to:

Content

Part 1: Technical skills

Week 1 "Research Software Engineering"

Learning outcomes:

  1. Know how and why to pull, commit, add, and push using Git and GitHub.
  2. Be comfortable making and receiving Pull Requests, and creating and closing Issues using GitHub to work as a team.
  3. Know how, when, and why to use CSVs and Parquet.
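
As a small illustration of the "why" in the third outcome: a CSV file stores only text, so column types have to be re-inferred every time the file is read, whereas Parquet stores a schema alongside the data. The sketch below uses only Python's standard library and hypothetical rows to show the type loss on a CSV round trip. It does not read or write Parquet itself.

```python
import csv
import io

# Hypothetical rows with an integer, a float, and a boolean column.
rows = [
    {"id": 1, "score": 0.5, "passed": True},
    {"id": 2, "score": 1.25, "passed": False},
]

# Write the rows to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "score", "passed"])
writer.writeheader()
writer.writerows(rows)

# Read them back: every value now comes back as a string.
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))

for original, restored in zip(rows, round_tripped):
    print(original, "->", restored)
```

Anyone reading this file must decide again whether `passed` is a boolean and whether `id` is a number or a label, which is one source of silent errors in shared analyses.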

Material:

Week 2 "Using Python for data science"

Learning outcomes:

  1. Know how to set up and use a Python workspace in VS Code using uv.
  2. Know how to import CSVs and Parquet files and glance at them.
  3. Be able to use polars for dataset creation and manipulation.
  4. Be able to simulate data from various probability distributions.
  5. Know how to build graphs and tables.
  6. Be able to use scikit-learn to implement statistical approaches including: ridge regression, logistic regression, Poisson regression, the bootstrap, cross-validation, lasso models, and random forests.
  7. Be able to use Quarto to generate cross-referenced papers.
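
To give a flavour of outcomes 4 and 6, here is a minimal sketch that simulates data from a normal distribution and computes a percentile bootstrap interval for the mean. To stay self-contained it uses only the standard library rather than polars or scikit-learn. The distribution parameters, sample size, and the helper name `bootstrap_mean_ci` are illustrative assumptions.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is reproducible

# Simulate 200 draws from a Normal(mean=10, sd=2) distribution.
sample = [rng.gauss(10, 2) for _ in range(200)]


def bootstrap_mean_ci(data, n_resamples=2000, level=0.95):
    """Percentile bootstrap interval for the mean of `data`."""
    # Resample with replacement, recompute the mean each time, and sort.
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper


lower, upper = bootstrap_mean_ci(sample)
print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"95% bootstrap interval = ({lower:.2f}, {upper:.2f})")
```

The same resample, recompute, summarize pattern applies to statistics other than the mean by swapping out `statistics.fmean`.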

Material:

Week 3 "DevOps for data science"

Learning outcomes:

  1. Know how to create and use per-project isolated environments.
  2. Know why and how to keep big data small.
  3. Know how to connect to databases, and manage credentials with environment variables and secrets.
  4. Understand how to use REST APIs.
  5. Know what observability means in data science and how to add checks for joins, transformations, and model quality.
  6. Know the difference between development (dev), test, and production (prod) environments.
  7. Understand how to use branches and the importance of small, frequent merges.
  8. Know how to use GitHub Actions with triggers, runners, build steps, tests, and secret management.
  9. Know how and why to use Docker containers.
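
One way to picture the join checks in outcome 5 is a join wrapper that refuses to duplicate rows silently and reports how many rows found no match. This is a sketch under stated assumptions: the tables are plain lists of dictionaries, and the function name `checked_left_join` and the example data are hypothetical.

```python
def checked_left_join(left, right, key):
    """Left-join two lists of dicts on `key`, with basic sanity checks."""
    # Check 1: the right-hand table must have unique keys, otherwise the
    # join would silently duplicate rows from the left-hand table.
    right_by_key = {}
    for row in right:
        if row[key] in right_by_key:
            raise ValueError(f"duplicate key in right table: {row[key]!r}")
        right_by_key[row[key]] = row

    joined, unmatched = [], 0
    for row in left:
        match = right_by_key.get(row[key])
        if match is None:
            unmatched += 1
            match = {}
        # On column-name clashes, values from the right table win.
        joined.append({**row, **match})

    # Check 2: a left join must preserve the number of left-hand rows.
    assert len(joined) == len(left)
    return joined, unmatched


# Hypothetical example tables.
patients = [{"id": 1, "age": 34}, {"id": 2, "age": 51}, {"id": 3, "age": 47}]
visits = [{"id": 1, "n_visits": 4}, {"id": 2, "n_visits": 1}]

joined, unmatched = checked_left_join(patients, visits, key="id")
print(f"{len(joined)} rows joined, {unmatched} without a match")
```

Logging the unmatched count on every run, rather than only raising on failure, is what turns a one-off check into something observable over time.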

Material:

Week 4

Part 2: Statistical workflow

Week 5 "Workflow; Written and oral communication"

Learning outcomes:

  1. Understand why writing is a critical skill, perhaps the most important of all the skills required to analyze data.
  2. Know how to focus on the one main message that we want to communicate to the reader.
  3. Appreciate the value of being able to get to a first draft as quickly as possible.
  4. Know how to rewrite.
  5. Think seriously about workflow.

Material:

Week 6 "Data gathering and validation"

Learning outcomes:

  1. Know how to use Python to scrape websites and gather data from APIs.
  2. Know how to establish tests for data.
  3. Know some common issues to be aware of.
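
As a rough sketch of what "tests for data" in outcome 2 can look like, the function below checks a small dataset for missing or duplicated identifiers and for implausible values. The column names, the age range of 0 to 120, and the function name `validate_rows` are assumptions chosen for illustration.

```python
def validate_rows(rows):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Test 1: identifiers are present and unique.
        if row.get("id") is None:
            problems.append(f"row {i}: missing id")
        elif row["id"] in seen_ids:
            problems.append(f"row {i}: duplicate id {row['id']}")
        else:
            seen_ids.add(row["id"])

        # Test 2: age is a number in a plausible range.
        age = row.get("age")
        if not isinstance(age, (int, float)) or not 0 <= age <= 120:
            problems.append(f"row {i}: implausible age {age!r}")
    return problems


# Hypothetical gathered data containing several common issues.
gathered = [
    {"id": 1, "age": 34},
    {"id": 1, "age": -5},
    {"id": None, "age": "unknown"},
]

problems = validate_rows(gathered)
for problem in problems:
    print(problem)
```

Returning a list of problems instead of stopping at the first one makes it easier to see the full extent of an issue in newly gathered data.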

Material:

Week 7 "Data management"

Learning outcomes:

  1. Know how to rigorously manage data including filenames, variable names, codebooks, folder organization and documentation.

Material:

Week 8 "Model interpretation"

Learning outcomes:

  1. Know how to communicate model estimates for common models including linear, logistic, and Poisson regression.
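
For logistic and Poisson regression, a coefficient is on the log-odds or log-count scale, so a common way to communicate it is to exponentiate it and report a percent change. The sketch below shows that conversion. The coefficient values and predictor names are hypothetical, and the helper names `percent_change` and `describe` are introduced only for this example.

```python
import math


def percent_change(coef):
    """Percent change in the odds (logistic) or the expected count (Poisson)
    for a one-unit increase in a predictor with coefficient `coef`."""
    return (math.exp(coef) - 1) * 100


def describe(model, predictor, coef):
    """Turn a coefficient into a plain-language sentence."""
    quantity = {
        "logistic": "odds of the outcome",
        "Poisson": "expected count",
    }[model]
    change = percent_change(coef)
    direction = "higher" if change >= 0 else "lower"
    return (
        f"A one-unit increase in {predictor} is associated with "
        f"{abs(change):.0f}% {direction} {quantity}, "
        f"holding other predictors fixed."
    )


# Hypothetical fitted coefficients, for illustration only.
print(describe("logistic", "years of education", 0.18))
print(describe("Poisson", "distance to clinic (km)", -0.05))
```

For a linear model no transformation is needed: the coefficient is already the expected change in the outcome per one-unit increase in the predictor.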

Material:

Task idea:

Week 9

Part 3: Integrating AI

Week 10 "Python with AI"

Learning outcomes:

  1. Know how to use Claude Code.
  2. Have principles for editing code.
  3. Know how to run code on SF Compute.

Week 11 "Developing benchmarks"

Learning outcomes:

  1. Understand key sampling principles.
  2. Know what a benchmark is and why it is important.
  3. Know how to build a benchmark.
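
To connect the sampling outcome with the benchmark outcomes, here is a minimal sketch of scoring a model on a benchmark and reporting the uncertainty that comes from having a finite sample of items. It assumes the items are a simple random sample from the population of tasks of interest. The items, the toy model, and the function name `score_benchmark` are all hypothetical.

```python
import math


def score_benchmark(predict, items):
    """Return accuracy and its standard error for `predict` on `items`."""
    correct = sum(predict(item["question"]) == item["answer"] for item in items)
    n = len(items)
    accuracy = correct / n
    # Binomial standard error, treating items as a simple random sample.
    std_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, std_error


# Hypothetical benchmark items.
items = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "1 + 3", "answer": "4"},
    {"question": "5 - 2", "answer": "3"},
    {"question": "3 * 3", "answer": "9"},
]


def toy_model(question):
    """A deliberately naive stand-in for a real model: always answers '4'."""
    return "4"


accuracy, std_error = score_benchmark(toy_model, items)
print(f"accuracy = {accuracy:.2f} (standard error {std_error:.2f})")
```

With only four items the standard error is large, which is the statistical argument for caring about how many items a benchmark has and how they were sampled.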

Week 12 "Deploying AI"

Learning outcomes:

  1. Know how to deploy an AI-based application.

Assessment

Using AI is encouraged in the course, but you still need to know the underlying material, in the same way that you still need to know how to add even though everyone has calculators. As such, assessment is divided into "secure" and "insecure" based on whether you will have access to AI to help you. Your overall grade is capped at the proportion that you earn on secure assessment. For instance, if you got 100% on secure assessment, then your grade would be worked out based on the weights below. But if you only got 50% on secure assessment, then even if you got 100% on everything else, your overall grade would be 50%.
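
The capping rule can be sketched as taking the smaller of the weighted total and the secure proportion. The weights, and the choice of which components count as secure, are hypothetical placeholders here because they are not yet specified. The sketch also assumes the weighted total includes the secure components.

```python
# Hypothetical components: (name, weight, is_secure, score as a proportion).
# Neither the weights nor the secure/insecure split are final.
components = [
    ("Mid-terms", 0.40, True, 0.50),
    ("Final paper", 0.60, False, 1.00),
]


def overall_grade(components):
    """Weighted total, capped at the proportion earned on secure assessment."""
    weighted = sum(weight * score for _, weight, _, score in components)
    secure_weight = sum(weight for _, weight, secure, _ in components if secure)
    secure_score = (
        sum(weight * score for _, weight, secure, score in components if secure)
        / secure_weight
    )
    return min(weighted, secure_score)


print(f"overall grade = {overall_grade(components):.0%}")
```

In this example the weighted total is 80%, but the secure proportion is 50%, so the cap applies and the overall grade is 50%.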

In-class presentation

Weekly online quiz

Mid-terms

Final paper