Topics in Statistical Data Science
Last updated: 2025-07-29
THIS IS JUST RANDOM INITIAL THOUGHTS AND NOTHING IS FINALIZED YET.
Overview
This course covers current topics in data science from a statistical perspective, with an emphasis on practical aspects such as tools, workflows, reproducibility, and communication. The exact content will vary from year to year, and not every topic will be covered exhaustively each year. By the end of the course students will: (i) have a better understanding of current tools and practices in data science; (ii) be able to effectively communicate their findings to a non-statistical audience; and (iii) be able to critically evaluate the techniques they have learned from a statistical viewpoint.
Learning objectives
The purpose of the course is to develop the skills needed to effectively do data science using modern tools, with a strong statistical focus. By the end of the course, you should be able to:
Content
Part 1: Technical skills
Week 1 "Research Software Engineering"
Learning outcomes:
- Know how and why to pull, add, commit, and push using Git and GitHub.
- Be comfortable making and receiving Pull Requests, and creating and closing Issues using GitHub to work as a team.
- Know how, when, and why to use CSVs and Parquet.
Material:
- Irving, Damien, Kate Hertweck, Luke Johnston, Joel Ostblom, Charlotte Wickham, and Greg Wilson, 2021 Research Software Engineering with Python, Chs 1-8, https://third-bit.com/py-rse/.
- Add something about Parquet
Week 2 "Using Python for data science"
Learning outcomes:
- Know how to set up and use a Python workspace in VS Code using uv.
- Know how to import CSVs and Parquet files and glance at them.
- Can use polars for dataset creation and manipulation.
- Can simulate data with various probability distributions.
- Can build graphs and tables.
- Can use scikit-learn to implement statistical approaches including: ridge regression, logistic regression, Poisson regression, the bootstrap, cross-validation, lasso models, and random forests.
- Can use Quarto to generate cross-referenced papers.
Material:
- Efron, Bradley, and Trevor Hastie, 2016, Computer Age Statistical Inference, Cambridge University Press.
- https://github.com/astral-sh/uv and https://github.com/astral-sh/ruff
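A sketch combining several of the outcomes above, simulating data and then fitting ridge regression with cross-validation in scikit-learn; the coefficients and seed are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(853)

# Simulate a linear signal with noise; two of the five predictors are irrelevant.
n, p = 200, 5
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

# Ridge regression scored with five-fold cross-validation.
model = Ridge(alpha=1.0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())
```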
Week 3 "DevOps for data science"
Learning outcomes:
- Know how to create and use per-project isolated environments.
- Know why and how to keep big data small.
- Know how to connect to databases, and manage credentials with environment variables and secrets.
- Understand how to use REST APIs.
- Know what observability means in data science and how to add checks for joins, transformations, and model quality.
- Know about dev, test, and prod.
- Understand how to use branches and the importance of small, frequent merges.
- Know how to use GitHub Actions with triggers, runners, build steps, tests, and secret management.
- Know how and why to use Docker containers.
Material:
Week 4 "Mid-term 1"
Part 2: Statistical workflow
Week 5 "Workflow; Written and oral communication"
Learning outcomes:
- Understand why writing is perhaps the most important of all the skills required to analyze data.
- Know how to focus on the one main message you want to communicate to the reader.
- Appreciate the value of getting to a first draft as quickly as possible.
- Know how to rewrite.
- Think seriously about workflow.
Material:
- Deffner, Dominik, Natalia Fedorova, Jeffrey Andrews, and Richard McElreath., 2024, "Bridging theory and data: A computational workflow for cultural evolution" PNAS, 10.1073/pnas.2322887121.
Week 6 "Data gathering and validation"
Learning outcomes:
- Know how to use Python to scrape websites and gather data from APIs
- Know how to establish tests for data
- Know some common issues to be aware of
Material:
- https://github.com/Quartz/bad-data-guide/blob/master/README.md
- Scraping and APIs and data validation
- Radcliffe, Nicholas J., 2026, Test-Driven Data Analysis, Chs 1-2. (This book is pre-publication. Extracts will be provided for class purposes.)
Week 7 "Data management"
Learning outcomes:
- Know how to rigorously manage data, including filenames, variable names, codebooks, folder organization, and documentation.
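A naming convention can itself be tested. The pattern below is one possible convention (an assumption, not a course requirement):

```python
import re

# One possible convention: ISO date prefix, lowercase words separated by
# underscores, and a .parquet or .csv extension.
PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}_[a-z0-9_]+\.(parquet|csv)$")

filenames = [
    "2024-03-01_raw_survey.csv",          # conforms
    "Final Data (v2).xlsx",               # spaces, capitals, ambiguous version
    "2024-03-02_cleaned_survey.parquet",  # conforms
]
print([name for name in filenames if not PATTERN.match(name)])
```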
Material:
Week 8 "Model interpretation"
Learning outcomes:
- Know how to communicate model estimates for common models including linear, logistic, Poisson.
Material:
- Arel-Bundock, Vincent, 2025, Model to Meaning: How to Interpret Statistical Models with marginaleffects for R and Python, https://marginaleffects.com.
Task idea:
- Provide output for five different models: linear, logistic, Poisson, X and Y, and then have the students write the results section.
Week 9 "Mid-term 2"
Part 3: Integrating AI
Week 10 "Python with AI"
Learning outcomes:
- Know how to use Claude Code.
- Have principles for editing code.
- Know how to run code on SF Compute.
Week 11 "Developing benchmarks"
Learning outcomes:
- Understand key sampling principles
- Know what a benchmark is and why it is important.
- Know how to build a benchmark.
Material:
- Reuel, Anka, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel J. Kochenderfer, 2024, "BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices", https://arxiv.org/abs/2411.12990.
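A toy sketch of scoring a benchmark with an uncertainty interval via the bootstrap; the labels and the model's accuracy level are simulated:

```python
import numpy as np

rng = np.random.default_rng(853)

# Toy benchmark: gold labels and a roughly 80%-accurate model's predictions.
gold = rng.integers(0, 2, size=200)
preds = np.where(rng.random(200) < 0.8, gold, 1 - gold)

correct = (preds == gold).astype(float)
accuracy = correct.mean()

# Bootstrap resampling reports an uncertainty interval, not just a point score.
boot = [rng.choice(correct, size=correct.size, replace=True).mean()
        for _ in range(2000)]
low, high = np.quantile(boot, [0.025, 0.975])
print(round(accuracy, 2), round(low, 2), round(high, 2))
```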
Week 12 "Deploying AI"
Learning outcomes:
- Know how to deploy an AI-based application.
Material:
- Huyen, Chip, 2025, AI Engineering: Building Applications with Foundation Models, Chs 1-5.
Assessment
Using AI is encouraged in this course, but you still need to know the underlying material, in the same way that you still need to know how to add even though everyone has a calculator. Assessment is therefore divided into "secure" and "insecure" components based on whether you will have access to AI. Your overall grade is capped at the proportion you earn on secure assessment. For instance, if you got 100% on secure assessment, your grade would be calculated using the weights below; but if you got only 50% on secure assessment, then even with 100% on everything else your overall grade would be 50%.
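One way to read the cap, as a sketch (the scores are illustrative):

```python
def overall_grade(secure_prop: float, weighted_total: float) -> float:
    """Overall grade is the weighted total, capped at the secure proportion."""
    return min(weighted_total, secure_prop)

# Full secure marks: graded purely on the weighted total.
print(overall_grade(1.00, 0.85))
# 50% on secure work caps the overall grade at 0.50, despite 100% elsewhere.
print(overall_grade(0.50, 1.00))
```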
In-class presentation
- Due dates: In class, during the week of your group's assigned topic.
- Type: Secure
- Weight: 20 per cent.
- Task: As part of a small team, prepare slides and a worksheet using Quarto. Use the slides to deliver a one-hour lecture about the topic of the week; the lecture must include some live-coding. After the lecture, have the class actively work through your worksheet for 30 minutes, and then answer questions and lead discussion for a further 30 minutes. Put the slides together using GitHub PRs, working within the class repo. You will be graded on: your use of GitHub to work as a team to create content, your content and its delivery, and your ability to answer questions. All team members will receive the same grade. Groups and topics will be assigned in Week 1.
Weekly online quiz
- Due dates: Weekly, weeks 1-3, 5-8, 10-12.
- Type: Insecure
- Weight: 5 per cent, total.
- Task: Complete an online quiz.
Mid-terms
- Due dates: In-class Week 4 and Week 9.
- Type: Secure
- Weight: 20 per cent each.
- Task: Write a test under exam conditions. Questions are based on the content of that part of the course.
Final paper
- Due date: Exam block.
- Type: Insecure
- Weight: 35 per cent.
- Task: Write an original paper on a topic covered in the class, submit it to a journal, and get through desk-rejection. One idea would be to develop a benchmark in a particular area of interest and then evaluate an LLM's performance. You should discuss what you'd like to explore with me early in the semester, and you should have written permission from me by Reading Week.