Methods of Data Analysis I
Preamble
Overview
“Methods”, “Data”, “Analysis”—we consider such loaded words in this course! What is data? What does it mean to do analysis? And what methods? The very core of statistical sciences!
This course develops in students an appreciation for how our world becomes data, what to do in the face of overwhelming options, and how to do all this in a way that provides value to others.
It is concerned with statistical modelling, but also everything that comes before and after modelling, and in doing so ensures modelling and analysis are placed on a firmer foundation. In assessment, students will conduct end-to-end data science projects using real-world data, enabling them to fully understand potential pitfalls, and build a portfolio.
The focus of the learning will be on:
- actively reading and considering relevant literature;
- actively using the statistical programming language R in real-world conditions;
- gathering, cleaning, and preparing datasets; and
- choosing and implementing statistical models and evaluating their estimates.
Essentially this course provides students with everything that they need to know to be able to do the most exciting thing in the world: use data to tell convincing stories.
FAQ
- Can I audit this course? Sure, but it is pointless, because the only way to learn this stuff is to do the work.
- Why is there so much assessment? The only way to learn this stuff is to actually do the work, and students only do the work when they are assessed. It is unfortunate, but there is no way around it.
- How difficult is the course? Of students that enrol, the median student drops the course. But the modal overall grade at the end of the course is an A+. The course is not difficult, but the hands-on projects mean it is a lot of work.
- What is the format of the class? There are rarely lectures because those are not effective. You should read the relevant chapter and attempt the quiz before class. During class we will focus on examples, activities and discussion. We will also have industry guests discuss their experience.
- You are asking about X, but you didn’t teach that—what’s up with that? A key skill is being able to teach yourself what you need. In general, I will probably have directed you to the materials that you should go over, but you’re welcome to ask for more pointers if I’ve not been clear enough.
Learning objectives
The purpose of the course is to develop the core skills of data science that are applicable across academia and industry. By the end of the course, you should be able to:
- Engage critically with ideas and readings in data science (demonstrated in all papers but also tutorials and quizzes).
- Conduct research in data science in a reproducible and ethical way (demonstrated in all papers).
- Clearly communicate what was done, what was found, and why in writing (demonstrated in all papers).
- Understand what constitutes ethical high-quality data science practice, especially reproducibility and respect for those that underpin our data (demonstrated in all papers and selected quizzes).
- Respectfully identify strengths and weaknesses in the data science research conducted by others (demonstrated in quizzes, and the peer review).
- Develop the ability to appropriately choose and apply statistical models to real-world situations (demonstrated in the final paper).
- Conduct all aspects of the typical data science workflow (demonstrated in all papers).
- Reflect effectively on your own learning and professional development (demonstrated in some tutorials and quizzes).
Pre-requisites
- None.
Textbook
- Telling Stories with Data.
Content
Before class starts you should go through Chapter 1 “Telling stories with data” and Appendix A “R essentials”.
Week 1
- Drinking from a fire hose
- Guest: TBD
Week 2
- Reproducible workflows
- Writing research
- Guest: TBD
Week 3
- Static communication
- Guest: TBD
Week 4
- Farm data
- Gather data
- Guest: TBD
Week 5
- Hunt data
- Guest: TBD
Week 6
- Clean and prepare
- Store and share
- Guest: TBD
Week 7
- Missing data
- Linear models
- Guest: TBD
Week 8
- Linear models
- Guest: TBD
Week 9
- Directed Acyclic Graphs
- Generalized linear models
- Guest: TBD
Week 10
- Generalized linear models
- Guest: TBD
Week 11
- MRP and Prediction
- Guest: TBD
- We will focus on trying to predict the upcoming US presidential election, with a view to students being able to write a final paper that could be submitted to the PS: Political Science & Politics special issue.
Week 12
- Production
- Guest: TBD
Assessment
Summary
Item | Weight (%) | Due date | Notes |
---|---|---|---|
Quiz | 7 | Thursdays, noon, Weeks 1-12 | Only best seven out of twelve count. |
SQL quiz | 1 | Thursday, noon, Week 11 | You cannot pass the course if you do not get at least 70 per cent in this quiz. |
Personal website | 2 | Thursday, noon, Week 11 | You cannot pass the course if you do not get at least 70 per cent on this assessment. Create a personal website using Quarto and make it live via GitHub Pages. At a minimum, it must include a bio and a CV in PDF form. |
Tutorials | 6 | Thursdays, noon, Weeks 1-12 | Only best three out of twelve count. |
Term papers | 48 | Thursdays, noon, Weeks 3, 6, 9. Term Paper I: 25 January 2024. Term Paper II: 15 February 2024. Term Paper III: 14 March 2024. | You must submit Term Paper I in order to pass the course. Only the best two of three term papers count. Marking starts at noon on the following Monday, and you can update your submission until then, i.e. submissions made by noon, Thursday, Week 3 can be updated until noon, Monday, Week 4 (this is to allow you to incorporate peer review comments). Please do not make any changes after marking starts. Term Paper I: Donaldson Paper. Term Paper II: Mawson Paper. Term Paper III: pick one of Murrumbidgee Paper, Spadina Paper, or Spofforth Paper. |
Term papers (peer review) | 4 | Fridays, noon, Weeks 6, 9 | Conduct peer review for four other term papers, by creating a GitHub Issue or Pull Request. Papers will be distributed by a spreadsheet—add a link to the Issue/PR to a term paper that does not have four other entries. |
Final paper | 28 | Thursday, noon, Week 12 (4 April 2024) | You must submit this paper. Marking starts at noon, Monday, 22 April, and you can update your submission until then, i.e. submissions made by noon, Thursday, Week 12 can be updated until noon, Monday, 22 April (this is to allow you to incorporate peer review comments). Please do not make any changes after marking starts. |
Final paper (peer review) | 4 | Friday, noon, Week 12 (5 April 2024) | Conduct peer review for four other final papers, by creating a GitHub Issue or Pull Request. Papers will be distributed by a spreadsheet—add a link to the Issue/PR to a paper that does not have four other entries. |
You must submit Term Paper I. You must submit the Final Paper. You must submit and get at least 70 per cent on both the SQL quiz and the Personal website.
Beyond that, you have scope to pick an assessment schedule that works for you. I will take your best three of the twelve tutorials for that six per cent, and your best seven of twelve quizzes for that seven per cent. I take your two best papers from the three term papers for that 48 per cent (24 per cent for each). You get four per cent for peer review of term papers (one per cent per review). There is 28 per cent allocated for the Final paper, and four per cent for peer review of the Final paper.
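The weighting above can be sketched as a small calculation. This is only an illustrative sketch, not the official grade calculation: the function names are hypothetical, every mark is assumed to be a fraction between 0 and 1, and missing submissions are assumed to count as zero.

```python
def best_k_mean(scores, k):
    """Mean of the k highest scores; missing submissions count as zero (assumption)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / k


def overall_grade(quizzes, tutorials, term_papers, term_reviews,
                  final_paper, final_review, sql_quiz, website):
    """Overall mark out of 100, using the weights in the summary table.

    All inputs are fractions in [0, 1]; the list inputs hold one mark per submission.
    """
    return (
        7 * best_k_mean(quizzes, 7)         # best seven of twelve quizzes
        + 6 * best_k_mean(tutorials, 3)     # best three of twelve tutorials
        + 48 * best_k_mean(term_papers, 2)  # best two of three term papers, 24 each
        + 4 * term_reviews                  # peer review of term papers
        + 28 * final_paper
        + 4 * final_review                  # peer review of the final paper
        + 1 * sql_quiz
        + 2 * website
    )


def meets_minimums(submitted_term_paper_1, submitted_final, sql_quiz, website):
    """Check the pass requirements: required submissions and the 70 per cent thresholds."""
    return (submitted_term_paper_1 and submitted_final
            and sql_quiz >= 0.7 and website >= 0.7)
```

For instance, term paper marks of 0, 0.5, and 1.0 would contribute 48 × 0.75 = 36 points under this sketch, because the zero is dropped.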
Additional details:
- Quiz questions are drawn from those in the Quiz section that follows each chapter of Telling Stories with Data. Some of them are multiple choice, and you should expect to know the mark within a few days of submission. Please do them before coming to class.
- Tutorial questions are drawn from those in the Tutorial section that follows each chapter of Telling Stories with Data. The general expectation (although this differs from week to week) is about two pages of written content. You should expect to know the mark within a few days of the tutorial.
- In general term papers require a considerable amount of work, and are due after the material has been covered in quizzes and tutorials (i.e. you would draw on knowledge tested in the quizzes, and potentially material could be re-used from the tutorial material). In general, they require original work to some extent. Papers are taken from the Papers appendix of Telling Stories with Data and students have access to the grading rubrics before submission.
- If you already have a website, please communicate with me about this early in the term so that I can let you know whether it can be used for the purposes of this submission.
- Rubric for tutorial is:
  - 0 - Any typos, major grammatical errors, other table stakes issues for this level. Too short.
  - 0.25 - Grammatical errors, if relevant: tables/graphs not properly labeled, no references, other aspects that affect credibility.
  - 0.6 - Makes some interesting and relevant points, related to course material (including required materials), but lacking in terms of structure and story/argument.
  - 0.80 - Interesting paper that is well-structured, coherent, and credible.
  - 1 - As with 0.80, but exceptional in some way.
- Only the best two of three term papers count. This means each is worth 24 per cent.
- Peer review will occur for Term Paper I, but it is just an optional ‘practice’ round: students are typically not yet familiar enough with the expectations of the course to be able to provide valuable comments (other than noticing whether R has been cited!).
Other
Children in the classroom
Babies (bottle-feeding, nursing, etc) are welcome in class as often as necessary. You are welcome to take breaks to feed them or express milk as needed, either in the classroom or elsewhere including here. A list of baby change stations is also available here. Please communicate with me so that I can make sure that we have regular breaks to accommodate this.
For toddlers and older children, I understand that unexpected disruptions in childcare/school can happen. You are welcome to bring your child to class in order to cover unforeseen gaps.
Accommodations with regard to assessment
Please do not reveal your personal or medical information to me. I understand that illness or personal emergencies can happen from time to time. The following accommodations to assessment requirements exist to provide for those situations.
Straightforward (will automatically apply to all students, so there is no need to ask for these):
- Quiz: Only your best seven quizzes count.
- Tutorial: Only your best three tutorials count.
- Term Papers: Only your best two term papers count.
So for those, if you have a situation, then just do not submit (or in the case of Term Paper I, just submit a blank page).
Slightly more involved:
- Term Paper I: You must submit something for Term Paper I, even if it gets zero. If you have a medical emergency that makes it impossible for you to submit even a blank page for Term Paper I, then please email me. In that situation one of the remaining term papers must be done individually to ensure fairness with the rest of the class.
- Peer review: No accommodation or late submission is possible for this because it would hold up the rest of the class. That said, there are two opportunities to get the peer review marks for the term papers i.e. Term Paper II and Term Paper III, so if you cannot do any for Term Paper II, then just do four for Term Paper III. If you have a medical emergency that makes this impossible, then please email me and cc your faculty/department/college advisor so that we can work out an alternative plan.
- Final paper: The final paper is a critical piece of assessment. It is also up against deadlines for submission of grades (especially for graduating students). If you have a medical emergency that makes it impossible for you to submit before marking begins, then I may be able to grant you an extension of up to three days. Email me and cc your faculty/department/college advisor so that we can work out an alternative plan.
Re-grading
Marking mistakes happen and I want to correct those. Requests to have your work re-graded will not be accepted within 24 hours of the release of grades. This is to give you a chance to reflect. Similarly, requests to have your work re-graded more than seven days after the release of the grades will not be accepted. This is to ensure the course runs smoothly.
Inside that 1-7 day period, if you would like to request a re-grade, please email rohan.alexander@utoronto.ca and use the subject line “STA302: re-grade request”. Please specify where the marking mistake was made in relation to the marking guide. The entire assessment will be re-marked and it is possible that your grade could go down.
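As a rough illustration of the re-grade window, the check below tests whether a request falls between 24 hours and seven days after grades are released. The function name is hypothetical, and treating both boundaries as inclusive is an assumption rather than stated policy.

```python
from datetime import datetime, timedelta


def regrade_request_allowed(released_at, requested_at):
    """True if the request falls between 24 hours and 7 days after grade release.

    Both boundaries are treated as inclusive, which is an assumption.
    """
    elapsed = requested_at - released_at
    return timedelta(hours=24) <= elapsed <= timedelta(days=7)
```

Under this sketch, a request sent 12 hours after release is too early, one sent two days after release is inside the window, and one sent eight days after release is too late.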
Plenty of students get 0 on the first paper, but go on to get an A+ overall in the course. The nature of the work in this course requires students to adjust from what is expected in other courses, and the forgiving assessment weighting is designed to allow this.
Plagiarism and integrity
Please do not plagiarize. In particular, be careful to acknowledge the source of code—if it is extensive then through proper citation and if it is just a couple of lines from Stack Overflow then in a comment immediately next to the code.
You are responsible for knowing the content of the University of Toronto’s Code of Behaviour on Academic Matters.
Academic offenses include (but are not limited to) plagiarism, cheating, copying code without acknowledgement, and purchasing labor for assessments (of any kind). Academic offenses will be taken seriously and dealt with accordingly. If you have any questions about what is or is not permitted in this course, please just email me.
Please consult the University’s site on Academic Integrity. Please also see the definition of plagiarism in section B.I.1.(d) of the University’s Code of Behaviour on Academic Matters available here. Please read the Code. Please review Cite it Right and if you require further clarification, consult the site How Not to Plagiarize.
Late policy
I am trying to develop you to be able to get a job as a professional data scientist. Part of that is learning to manage your time effectively.
If no extension has been granted and no accommodation applies, then the late submission of an assessment item carries a penalty of 10 percentage points per day, up to a maximum of one week, after which it will no longer be accepted. For example, a problem set that would otherwise have received 8/10 will receive 7/10 if submitted one day late and 6/10 if submitted two days late.
Term papers and the final paper cannot be accepted after marking has started.
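The late penalty arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the official calculation: the function name and parameters are hypothetical, marks are expressed as percentages, lateness is counted in whole days, and the mark is assumed not to fall below zero.

```python
def late_mark(percent, days_late, penalty_per_day=10, max_days=7):
    """Mark (as a percentage) after the late penalty, or None if no longer accepted.

    Assumes whole days of lateness and a floor of zero; neither is stated in the policy.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late > max_days:
        return None  # more than one week late: not accepted
    return max(0, percent - penalty_per_day * days_late)
```

With these assumptions, a submission worth 80 per cent becomes 70 per cent at one day late and 60 per cent at two days late, matching the 8/10 example in the policy.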
Writing
Papers and reports should be well-written, well-organized, and easy to follow. They should flow easily from one point to the next. They should have proper sentence structure, spelling, vocabulary, and grammar. Each point should be articulated clearly and completely without being overly verbose. Papers should demonstrate your understanding of the topics you are studying in the course and your confidence in using the terms, techniques and issues you have learned. As always, references must be properly included and cited. If you have concerns about your ability to do any of this then please make use of the writing support provided by the faculty, colleges and the SGS Graduate Centre for Academic Communication.
Minimum submission requirement
If you will not be able to submit at least two term papers, or will be unable to submit the final paper, then it would be unfair to the other students to allow you to pass the course. But it is not a situation that I want to get into. Please ensure you and your college registrar or faculty/department advisor get in touch with me as early as possible if this may be the case for you so that we can work out a solution.