Surveys, sampling, and observational data
Last updated: 2024-12-01
Preamble
Overview
The best thing about being a statistician is that you get to play in everyone's backyard.
John Tukey
STA304 is an upper-level undergraduate course at the University of Toronto's Department of Statistical Sciences.
The work of applied statisticians, regardless of their specific job title and area of application, is the most important and exciting work in the world right now. The ability to gather data, analyse it, and communicate your understanding of the underlying process is incredibly valuable. In this course you will learn and apply the essentials of this.
We focus on surveys, sampling and observational data. The very stuff of statistical science! We will approach these topics from a practical perspective. You will actually run surveys and learn how messy it is to put one together. You will learn how to think about sampling, how to implement it, and why the details matter. You will forecast an election. And you will conduct original research. More generally, you will learn how to obtain and analyse data and use it to make sensible claims about the world.
Working as an applied statistician requires that you be able to do the following as part of a small team:
- Gather data in less-than-perfect settings.
- Efficiently prepare and clean data toward some purpose.
- Analyse it in a reproducible, thorough, modern, and statistically mature manner.
- Communicate your analysis to stakeholders including colleagues and clients with and without formal statistical training.
You likely have some of these skills already. This course will further develop them. At the end of the course you will have a portfolio of work focused on surveying, sampling, and observational data that you could show off to a potential employer.
Each week you will read relevant papers and books and engage with them through discussion with each other, me, and the TA. You will bring this all together and show off how much you have learnt through practical, ongoing assessment.
It is important to recognise that putting together everything that you have learnt to this point in this way will be difficult. It is not possible to cover everything that you will need to know. You should proactively identify and address your weaker areas by seeking additional information and resources. This course acts as a guide to what is important; it does not contain everything that is important.
This course is different to many other courses at the University of Toronto. At the end of this course, you will have a portfolio of work that you could show off to a potential employer. You will have developed the skills to work successfully as an applied statistician or data scientist. And you will know how to fill gaps in your knowledge yourself. A lot of scholarships and jobs these days ask for GitHub and blog links, etc., to show off a portfolio of your work. This is the class that gives you a chance to develop these. It is very important to have something to show, and it needs to go beyond what is done in a normal class.
How to succeed
In this course you will work in a self-directed, open-ended manner. Identify relevant areas of interest and then learn the skills that you need to explore those areas.
To successfully complete this course, you should expect to spend a large portion of your time reading and writing (both code and text). Deeply engage with the materials. Find a small study group and keep each other motivated and focused. At the start of the week, read the course notes, all compulsory materials, and some recommended materials based on your interest. After doing that, but before the 'lecture' time, you should complete the weekly quiz. During 'lectures' I'll live-code, discuss materials in the course notes, talk about an experiment, and you'll have a chance to discuss the materials with me.
You need to be more active in your learning in this course than in others: read the notes and related materials, and then go out there and teach yourself more and apply it. You will not be spoon-fed in this course. Each week try to write reproducible, understandable R code surrounded by beautifully crafted text that motivates, backgrounds, explains, discusses, and criticises. Make steady progress toward the assessment.
This is not a 'bird course'. Typically, after the term is finished, students say that the course is difficult but rewarding. The TAs and I are always available to answer any questions. Please come to office hours!
How we'll work
This webpage will provide almost all the guiding materials that you need and links to the relevant parts of the notes. The course notes are available here. Those contain notes and other material that you could go over. We'll use Quercus really only for assessment submission and grading.
A rough weekly flow for the course would be something like:
- Read the week's course notes.
- Read/watch/listen to the required materials.
- Complete the weekly quiz (before the lecture).
- Attend the lecture.
- Attend the lab.
- Make progress on a paper.
Advice from past students
Successful past students have the following advice (completely unedited by me):
- "Start reading and writing on a weekly basis, watch some videos on R and RMD but more importantly learn how to use Google."
- "It is not a wise idea to take this course if you did not take any other STA 300 level course before."
- "Start early, find a group of people you trust enough to divide the work up fairly. Let people work to their strengths (people who know R should do the modelling, good writers should write most of the reports, etc.)"
- "Not to worry if you don't do well on the first problem set—the nature of the course is to build up skills overtime, and it's meant to be challenging in the beginning. In the end, it is worth it because you learn very valuable applicable skills on how to write professional reports."
- "Work on your writing and direction following skills."
- "Look at the rubric. There were times that I lost marks because I didn't follow the rubric properly. Go to office hours, they are very useful as you can ask your own question and also get answers to questions other people ask and you didn't think of. Also, do the assignments to the best of your ability. You will lose marks if you don't put in effort and the only person you're hurting is yourself."
- "During lectures, focus more on the why the prof is doing what he's doing. When he runs certain commands in R, figure out why that sequence of code gives what you want, because it'll help adapt his code into your assignment code. just remembering what he's doing in lecture becomes useless really quickly since the thought process matters more. also, start everything early."
- "Do this course when you really want to learn something and have a lot of time to working on it."
- "you need to be very skillful in RStudio and latex. Otherwise you would be struggling."
- "Try to incorporate the feedback given and read a looottttttttt. Also start early on the problem sets because they tend to take a lot of time. Don't give up!"
- "-Find a good group for problem sets"
- "If the assignments stay the same, I would tell students to approach this class from the perspective of 'storytelling with statistics' rather than a statistics course. You need to use R, and Markdown, and have a solid understanding of concepts like regression and sampling, but more importantly you need to be able to interpret results and write about them in a way coherent and professional way."
- "do your readings"
- "Definitely get ready to write reports"
- "Do not take sta304 with Prof Rohan, it is pretty tough"
- "Start your work a bit earlier, make sure to follow the format expected and the rubric exactly."
- "Read course material. Figure out WHY this paper/video is being shown to you and what you generally learn from it. Surround yourself with people dedicated to putting in the effort to understand material and who are thorough in their work so you can discuss content and/or work together."
- "1. Be prepared to work extremely hard (8-11 hours a week). 2. Learn RStudio before course begins--STA130 is ideal preparation. 3. Start problem sets as soon as they are released."
- "learn to code early and extensively use the office hours with the prof."
- "This course requires lots of time dedicated and is not an "easy bird course" but is an incredibly rewarding course if one wants to learn how statistics is applied in the real world."
Past iterations
Content
Week 1
Week 2
Week 3
- Content:
- (Optional) R bootcamp (please allow six hours)
Week 4
- Content:
- (Optional) Git and GitHub bootcamp (please allow six hours)
Week 5
Week 6
Week 7
- Content:
- Guest lecture: Chris Henry, Bank of Canada
- Christopher Henry (Chris) is a Senior Economist at the Bank of Canada. He serves as lead economist for the consumer survey research program on the Currency Department's Economic Research and Analysis team. Chris first joined the Bank as a Research Assistant in 2012 and rejoined in 2021 after completing his PhD in Economics. In his role, Chris contributes to the design, implementation, and analysis of a range of surveys that measure the use of cash and alternative methods of payment. He holds a PhD in Economics from Université Clermont Auvergne (France) and an MSc in Mathematics from McMaster University.
Week 8
Week 9
Week 10
Week 11
Week 12
- Content:
- TBD based on class progress and interest.
Assessment
Summary
Quiz
- 20
- Weekly before the lecture
Tutorial
- 20
- Weekly the day before the tutorial
Paper 1
Paper 2
Paper 3
Paper 4
Final Paper (initial submission)
Final Paper (peer review)
Final Paper