Supporters
We gratefully acknowledge the support of the Faculty of Information and the Department of Statistical Sciences at the University of Toronto, and CANSSI Ontario; in particular Dean Wendy Duff, Chair Radu Craiu, and Professor Lisa Strug for their support.
Overview
The Faculty of Information and the Department of Statistical Sciences at the University of Toronto are excited to host a two-day conference bringing together academic and industry participants on the critical issue of reproducibility in applied statistics and related areas. The conference is free and will be hosted online on Thursday 25 and Friday 26 February 2021. Everyone is welcome; you don't need to be affiliated with a university, and you can register here.
The conference has three broad areas of focus:
- Evaluating reproducibility: Systematically examining the extent of reproducibility of a paper, or even of a whole field, is important for understanding where weaknesses exist. Does, say, economics fall flat while demography shines? How should we approach these reproductions? What aspects contribute to the extent of reproducibility?
- Practices of reproducibility: We need new tools and approaches that encourage us to think more deeply about reproducibility and integrate it into everyday practice.
- Teaching reproducibility: While it is probably too late for most of us, how can we ensure that today’s students don’t repeat our mistakes? What are some case studies that show promise? How can we ensure this doesn’t happen again?
Recordings of the presentations are linked in the schedule below. Again, the conference is free and online via Zoom, and everyone is welcome; you don't need to be affiliated with a university. If you would like to attend, please sign up here.
Schedule
Thursday, 25 February, 2021
| Time | Presenter | Session | Recording |
|------|-----------|---------|-----------|
| 9:00-9:10am | Rohan Alexander, University of Toronto | Welcome | - |
| 9:10-9:20am | Radu Craiu, University of Toronto | Opening remarks | https://youtu.be/JGGVEgMBURU |
| 9:20-9:30am | Wendy Duff, University of Toronto | Opening remarks | https://youtu.be/Z3aWU1A0FCw |
| 9:30-10:25am | Mine Çetinkaya-Rundel, University of Edinburgh | Keynote - Teaching | https://youtu.be/ANH2tv2vkew |
| 10:30-11:30am | Riana Minocher, Max Planck Institute for Evolutionary Anthropology | Keynote - Evaluating | https://youtu.be/O3t8TwWeli0 |
| 11:30-11:55am | Tiffany Timbers, University of British Columbia | Teaching | https://youtu.be/mh93W8XimOg |
| Noon-12:25pm | Tyler Girard, University of Western Ontario | Teaching | https://youtu.be/k3qgmUAjIvA |
| 12:30-12:55pm | Shiro Kuriwaki, Harvard University | Practices | https://youtu.be/-J-eiPnmoNE |
| 1:00-1:25pm | Meghan Hoyer, Washington Post & Larry Fenn, AP | Practices | https://youtu.be/FFwMfNk83rc |
| 1:30-1:55pm | Tom Barton, Royal Holloway, University of London | Evaluating | https://youtu.be/YTdhcSDqFNQ |
| 2:00-2:25pm | Break | - | - |
| 2:30-2:55pm | Mauricio Vargas, Catholic University of Chile & Nicolas Didier, Arizona State University | Evaluating | https://youtu.be/VpTavLYEMgg |
| 3:00-3:25pm | Jake Bowers, University of Illinois & The Policy Lab | Practices | https://youtu.be/3N0YwJIbbHg |
| 3:30-3:55pm | Amber Simpson, Queen's University | Practices | https://youtu.be/uUfrcB6aynQ |
| 4:00-4:25pm | Garret Christensen, US FDIC | Evaluating | https://youtu.be/595KkVKJ29w |
| 4:30-4:55pm | Yanbo Tang, University of Toronto | Practices | https://youtu.be/0x6gOkldOvk |
| 5:00-5:25pm | Lauren Kennedy, Monash University | Practices | https://youtu.be/HhfogRbgbA4 |
| 5:30-6:00pm | Lisa Strug, University of Toronto & CANSSI Ontario | Closing remarks | https://youtu.be/B_9puTSp3f8 |
Friday, 26 February, 2021
| Time | Presenter | Session | Recording |
|------|-----------|---------|-----------|
| 8:00-8:30am | Nick Radcliffe and Pei Shan Yu, Global Open Finance Centre of Excellence & University of Edinburgh | Practices | https://youtu.be/pWEc8XoIIKE |
| 8:30-9:00am | Julia Schulte-Cloos, LMU Munich | Practices | - |
| 9:00-9:25am | Simeon Carstens, Tweag/IO | Practices | https://youtu.be/fpoFzDvrJAA |
| 9:30-9:55am | Break | - | - |
| 10:00-10:55am | Eva Vivalt, University of Toronto | Keynote - Practices | https://youtu.be/0WZUzf2oSGY |
| 11:00-11:25am | Andrés Cruz, Pontificia Universidad Católica de Chile | Practices | https://youtu.be/HjdPDEACxmA |
| 11:30-11:55am | Emily Riederer, Capital One | Practices | https://youtu.be/BknQ0ZNkMNY |
| Noon-12:25pm | Florencia D'Andrea, National Institute of Agricultural Technology | Practices | https://youtu.be/9FVUIPfBeXw |
| 12:30-12:55pm | John Blischak, Freelance scientific software developer | Practices | https://youtu.be/RrcaGukYDyE |
| 1:00-1:25pm | Shemra Rizzo, Genentech | Practices | https://youtu.be/rEYtB3CG76Q |
| 1:30-2:25pm | Break | - | - |
| 2:30-2:55pm | Wijdan Tariq, University of Toronto | Evaluating | - |
| 3:00-3:25pm | Sharla Gelfand, Freelance R Developer | Practices | https://youtu.be/G5Nm-GpmrLw |
| 3:30-3:55pm | Ryan Briggs, University of Guelph | Practices | https://youtu.be/_dgGbxItiB4 |
| 4:00-4:25pm | Monica Alexander, University of Toronto | Practices | https://youtu.be/yvM2C6aZ94k |
| 4:30-4:55pm | Annie Collins, University of Toronto | Practices | https://youtu.be/u4ibhN_nWyI |
| 5:00-5:25pm | Nancy Reid, University of Toronto | Practices | https://youtu.be/sIsOPuZOQL4 |
| 5:30-6:00pm | Rohan Alexander, University of Toronto | Closing remarks | https://youtu.be/7LttFNOI6p8 |
All times are Toronto / US east coast. 9am in Toronto 🇨🇦 is:
- 7:30pm in Bangalore 🇮🇳;
- 3pm in Berlin 🥨;
- 2pm in London 💂;
- 11am in Santiago 🇨🇱;
- 6am in Vancouver 🎿; and
- 1am in Melbourne 🦘.
Presenter biographies and abstracts
Keynotes:
- Eva Vivalt
- Bio: Eva Vivalt is an Assistant Professor in the Department of Economics at the University of Toronto. Her main research interests are in cash transfers, reducing barriers to evidence-based decision-making, and global priorities research.
- Abstract: An overview of the role of forecasts and a new platform for making them.
- Mine Çetinkaya-Rundel
- Bio: Mine Çetinkaya-Rundel is a Senior Lecturer in Statistics and Data Science in the School of Mathematics at the University of Edinburgh, currently on leave from her position as Associate Professor of the Practice in the Department of Statistical Science at Duke University, and a Professional Educator and Data Scientist at RStudio. She is the author of three open-source statistics textbooks and an instructor for Coursera. She is the chair-elect of the Statistical Education Section of the American Statistical Association. Her work focuses on innovation in statistics pedagogy, with an emphasis on student-centered learning, computation, reproducible research, and open-source education.
- Abstract: In the beginning was R Markdown. In this talk I will give a brief review of teaching statistics and data analysis through the lens of reproducibility with R Markdown, and how to use this tool effectively in teaching to maintain reproducibility as the scope of your students’ projects and their experience grow.
- Riana Minocher
- Bio: Riana Minocher is a doctoral student at the Max Planck Institute for Evolutionary Anthropology in Leipzig. She is an evolutionary biologist with broad interests. She has worked on a range of projects on human and non-human primate behaviour and ecology. She is particularly interested in the evolutionary processes that create and shape diversity between and within groups. Through her PhD research, she is keen on exploring the dynamics of cultural transmission and learning in human populations, to better understand the diverse patterns of behaviour we observe.
- Abstract: Interest in improving reproducibility, replicability and transparency of research has increased substantially across scientific fields over the last few decades. We surveyed 560 empirical, quantitative publications published between 1955 and 2018, to estimate the rate of reproducibility for research on social learning, a large subfield of behavioural ecology. We found supporting materials were available for less than 30% of publications during this period. The availability of data declines exponentially with time since publication, with a half-life of about six years, and this “data decay rate” varies systematically with both study design and study species. Conditional on materials being available, we estimate that a reasonable researcher could expect to successfully reproduce about 80% of published results, based on our evaluating a subset of 40 publications. Taken together, this indicates an overall success rate of 24% for both acquiring materials and recovering published results, with non-reproducibility of results primarily due to unavailable, incomplete, or poorly-documented data. We provide recommendations to improve the reproducibility of research on the ecology and evolution of social behaviour.
Invited presentations:
- Amber Simpson
- Bio: Amber Simpson is the Canada Research Chair in Biomedical Computing and Informatics and Associate Professor in the School of Computing (Faculty of Arts and Science) and the Department of Biomedical and Molecular Sciences (Faculty of Health Sciences) at Queen's University. She specializes in biomedical data science and computer-aided surgery. Her research group is focused on developing novel computational strategies for improving human health. She joined the Queen's University faculty in 2019, after four years as faculty at Memorial Sloan Kettering Cancer Center in New York and three years as a Research Assistant Professor in Biomedical Engineering at Vanderbilt University in Nashville. She is an American Association of Cancer Research award winner and the holder of multiple National Institutes of Health grants. She received her PhD in Computer Science from Queen's University.
- Abstract: The development of predictive and prognostic biomarkers is a major area of investigation in cancer research. Our lab specializes in the development of quantitative imaging markers for personalized treatment of cancer. Progress in developing these novel markers is limited by a lack of optimization, standardization, and validation, all critical barriers to clinical use. This talk will describe our work in the repeatability and reproducibility of imaging biomarkers.
- Andrés Cruz
- Bio: Andrés Cruz is an adjunct instructor at Pontificia Universidad Católica de Chile, where he teaches computational social science. He holds a BA and MA in Political Science, and is the co-editor of “R for Political Data Science: A Practical Guide” (CRC Press, 2020), an R manual for social science students and practitioners.
- Abstract: `inexact` is an RStudio addin to supervise fuzzy joins. Merging data sets is a simple procedure in most statistical software packages. However, applied researchers frequently face problems when dealing with data in which ID variables are not properly standardized. For instance, politicians' names can be spelled differently in multiple sources (press reports, official documents, etc.), causing regular merging methods to fail. The most common approach to fixing this issue when working with small and medium data sets is manually correcting the problematic values before merging; however, this solution is time-consuming and not reproducible. The `inexact` addin was created to help with this. The package draws on approximate string matching algorithms, which quantify the distance between two given strings. When merging data sets with non-standardized ID variables, `inexact` users benefit from automatic match suggestions, while also being able to override the automatic choices when needed, using a user-friendly graphical user interface (GUI). The output is simply code to perform the corrected merging procedure, which records the chosen algorithm and any corrections made by the user, ensuring reproducibility. A development version of `inexact` is available on GitHub.
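To give a sense of the underlying approach (this sketches approximate string matching in base R with `adist()`, not `inexact`'s own interface; the `suggest_matches()` helper and the example names are hypothetical):

```r
# Minimal sketch of the fuzzy-matching idea behind inexact, using
# base R's adist() (generalized Levenshtein distance).
# suggest_matches() is a hypothetical helper, not part of the package.
suggest_matches <- function(ids_x, ids_y) {
  d <- adist(ids_x, ids_y)  # distance matrix: rows = ids_x, cols = ids_y
  data.frame(
    id_x      = ids_x,
    suggested = ids_y[apply(d, 1, which.min)],  # closest candidate
    distance  = apply(d, 1, min)
  )
}

# Politicians' names spelled differently across two sources:
suggest_matches(
  c("Salvador Allende G.", "M. Bachelet"),
  c("Salvador Allende", "Michelle Bachelet")
)
```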
- Annie Collins
- Bio: Annie Collins is an undergraduate student in the Department of Mathematics specializing in applied mathematics and statistics with a minor in history and philosophy of science. In her free time, she focusses her efforts on student governance, promoting women’s representation in STEM, and working with data in the non-profit and charitable sector.
- Abstract: We create a dataset of all the pre-prints published on medRxiv between 28 January 2020 and 31 January 2021. We extract the text from these pre-prints and parse it, looking for keyword markers signalling the availability of the data and code underpinning the pre-print. We are unable to find markers of either open data or open code for 81 per cent of the pre-prints in our sample. Our paper demonstrates the need to have authors categorize the degree of openness of their pre-print as part of the medRxiv submission process and, more broadly, the need to better integrate open science training into a wide range of fields.
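As a rough illustration of what such keyword parsing can look like (the patterns below are illustrative assumptions, not the paper's actual marker list):

```r
# Illustrative keyword-marker check for open data/code statements in
# pre-print text; the patterns are examples, not the paper's list.
has_open_marker <- function(text) {
  patterns <- c(
    "github\\.com", "osf\\.io", "zenodo",
    "data (is|are) available", "code (is|are) available"
  )
  any(vapply(patterns, grepl, logical(1), x = tolower(text)))
}

has_open_marker("All code is available at github.com/example/repo.")  # TRUE
has_open_marker("Data were collected from patient records.")          # FALSE
```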
- Emily Riederer
- Bio: Emily Riederer is a Senior Analytics Manager at Capital One. Her team focuses on reimagining analytical infrastructure by building data products, elevating business analysis with novel data sources and statistical methods, and providing consultation and training to partner teams.
- Abstract: Complex software systems make performance guarantees through documentation and unit tests, and they communicate these to users with conscientious interface design. However, published data tables exist in a gray area; they are static enough not to be considered a 'service' or 'software', yet too raw to earn attentive user interface design. This ambiguity creates a disconnect between data producers and consumers and poses a risk for analytical correctness and reproducibility. In this talk, I will explain how controlled vocabularies can be used to form contracts between data producers and data consumers. Explicitly embedding meaning in each component of variable names is a low-tech and low-friction approach which builds a shared understanding of how each field in the dataset is intended to work. Doing so can offload the burden of data producers by facilitating automated data validation and metadata management. At the same time, data consumers benefit by a reduction in the cognitive load to remember names, a deeper understanding of variable encoding, and opportunities to more efficiently analyze the resulting dataset. After discussing the theory of controlled vocabulary column-naming and related workflows, I will illustrate these ideas with a demonstration of the `convo` R package, which aids in the creation, upkeep, and application of controlled vocabularies. This talk is based on my related blog post and R package.
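As a flavour of the idea (a hand-rolled sketch, not the `convo` package's actual API), one can validate column names against a two-level controlled vocabulary:

```r
# Hand-rolled sketch of controlled-vocabulary name checking; this is
# not the convo package's API. Names are assumed to follow
# <measure>_<entity>, e.g. "amt_spend" for a dollar amount of spend.
vocab <- list(
  measure = c("amt", "ind", "dt", "cat"),  # amount, indicator, date, category
  entity  = c("spend", "login", "open")
)

check_names <- function(nms, vocab) {
  parts <- strsplit(nms, "_")
  ok <- vapply(
    parts,
    function(p) length(p) == 2 && p[1] %in% vocab$measure && p[2] %in% vocab$entity,
    logical(1)
  )
  setNames(ok, nms)
}

check_names(c("amt_spend", "ind_login", "total_dollars"), vocab)
#>     amt_spend     ind_login total_dollars
#>          TRUE          TRUE         FALSE
```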
- Florencia D’Andrea
- Bio: Florencia D’Andrea is a post-doc at the Argentine National Institute of Agricultural Technology where she develops computer tools to assess the risk of pesticide applications for aquatic ecosystems. She holds a PhD in Biological Sciences from the University of Buenos Aires, Argentina, and is part of the ReproHack core-team and the R-Ladies global team. She believes that code and data should also be recognized as valuable products of scientific work.
- Abstract: Choose your own adventure to a reproducible scientific article: learnings from ReproHack. "I shared the code and data of my last scientific article; does that mean it is reproducible?" One might think that having access to the research data and the code used to analyze that data would be enough to reproduce published results, but reproduction is often much more involved. Is reproducibility dependent on the reviewer's knowledge? What things that we do not usually think about can affect reproducibility? Can the choice of how to capture the computational environment influence the experience of the reviewer? In this talk, we will think together about some of the steps necessary for someone else to be able to reproduce a scientific article or project. I will share some thoughts from my experience with ReproHack and show how reviewing is a great way to learn about reproducibility. What is ReproHack? ReproHack is a hackathon-style event focused on the reproducibility of research results. These hackathons provide a low-pressure sandbox environment for practicing reproducible research: authors can practice producing reproducible research and receive friendly feedback and appreciation of their efforts; participants can practice reviewing, learn about reproducibility best practices and common pitfalls from working with real-life materials rather than dummy examples, and grow in confidence about working more openly themselves; and the research community benefits from evaluating what best practice looks like in practice, and from more experience in both developing and reviewing materials.
- Garret Christensen
- Bio: Garret Christensen received his economics PhD from UC Berkeley in 2011. He is an economist with the FDIC. Before that he worked for the Census Bureau, and he was a project scientist with the Berkeley Initiative for Transparency in the Social Sciences and a Data Science Fellow with the Berkeley Institute for Data Science.
- Abstract: Adoption of Open Science Practices is Increasing: Survey Evidence on Attitudes, Norms and Behavior in the Social Sciences. Has there been meaningful movement toward open science practices within the social sciences in recent years? Discussions about changes in practices such as posting data and pre-registering analyses have been marked by controversy, including controversy over the extent to which change has taken place. This study, based on the State of Social Science (3S) Survey, provides the first comprehensive assessment of awareness of, attitudes towards, perceived norms regarding, and adoption of open science practices within a broadly representative sample of scholars from four major social science disciplines: economics, political science, psychology, and sociology. We observe a steep increase in adoption: as of 2017, over 80% of scholars had used at least one such practice, up from one quarter a decade earlier. Attitudes toward research transparency are on average similar between older and younger scholars, but the pace of change differs by field and methodology. Consistent with theories of normal science and scientific change, the timing of increases in adoption coincides with technological innovations and institutional policies. Patterns are consistent with most scholars underestimating the trend toward open science in their discipline.
- Jake Bowers
- Bio: Jake Bowers is a Senior Scientist at The Policy Lab and a member of the Lab’s data science practice. Jake is Associate Professor of Political Science and Statistics at the University of Illinois Urbana-Champaign. He has served as a Fellow in the Office of Evaluation Sciences in the General Services Administration of the US Federal Government and is Methods Director for the Evidence in Governance and Politics network. Jake holds a PhD in Political Science from the University of California, Berkeley, and a BA in Ethics, Politics and Economics from Yale University.
- Abstract: For evidence-based public policy to grow in impact and importance, practices to enhance scientific credibility should be brought into governmental contexts and also should be modified for those contexts. For example, few analyses of governmental data allow data sharing (in contrast with most scientific studies); and many analyses of governmental administrative data inform high stakes immediate decisions (in contrast with the slow accumulation of scientific knowledge). We make several proposals to adjust scientific norms of reproducibility and pre-registration to the policy context.
- John Blischak
- Bio: John Blischak is a freelance scientific software developer for the life sciences industry. He is the primary author of the R package workflowr and the co-maintainer of the CRAN Task View on Reproducible Research. He received his PhD in Genetics from the University of Chicago.
- Abstract: The `workflowr` R package helps organize computational research in a way that promotes effective project management, reproducibility, collaboration, and sharing of results. `workflowr` combines literate programming (knitr and rmarkdown) and version control (Git, via git2r) to generate a website containing time-stamped, versioned, and documented results. Any R user can quickly and easily adopt `workflowr`, which includes four key features: (1) `workflowr` automatically creates a directory structure for organizing data, code, and results; (2) `workflowr` uses the version control system Git to track different versions of the code and results without the user needing to understand Git syntax; (3) to support reproducibility, `workflowr` automatically includes code version information in webpages displaying results; and (4) `workflowr` facilitates online web hosting (e.g., GitHub Pages) to share results. Our goal is that `workflowr` will make it easier for researchers to organize and communicate reproducible results. Documentation and source code are available.
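A typical session, following the package's documented quick start, looks roughly like this (the project path and GitHub username are placeholders, and details may vary by version):

```r
# Rough sketch of a workflowr session, per the package's quick start;
# exact behaviour may differ across versions.
library(workflowr)

wflow_start("~/myproject")   # create project skeleton and Git repository
# ... write analyses in analysis/*.Rmd ...
wflow_build()                # render the website locally to check results
wflow_publish("analysis/*.Rmd",
              message = "Publish first versioned, documented results")
wflow_use_github("my-username")  # placeholder username; sets up GitHub Pages hosting
```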
- Julia Schulte-Cloos
- Bio: Julia Schulte-Cloos is a Marie Skłodowska-Curie funded research fellow at LMU Munich. She earned her PhD in Political Science from the European University Institute. Julia is passionate about developing tools and templates for generating reproducible workflows and creating reproducible research outputs with R Markdown.
- Abstract: We present a template package in R that allows users without any prior knowledge of R Markdown to implement reproducible research practices in their scientific workflows. We provide a single Rmd file that is fully optimized for two different output formats, HTML and PDF. While in the stage of exploratory analysis, and when focusing on content only, researchers may rely on the 'draft mode' of our template, which knits to HTML. When in the stage of research dissemination, and when focusing on the presentation of results, researchers may instead rely on the 'manuscript mode', which knits to PDF. Our template outlines the basics of successfully writing a reproducible paper in R Markdown by showing how to include citations, figures, and cross-references. It also provides examples of the use of `ggplot2` to include plots, both in static and animated outputs, and it shows how to present the most commonly used tables in scientific research (descriptive statistics and regression tables). Finally, in our template, we discuss some more advanced features of literate programming and helpful tweaks in R Markdown.
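The two modes amount to rendering the same source file to different output formats. A generic illustration with plain rmarkdown follows (the template package supplies its own optimized formats, so this only conveys the idea; `paper.Rmd` is a placeholder):

```r
# Generic illustration of one .Rmd, two output modes; the template
# package provides its own optimized formats, so this is only the idea.
library(rmarkdown)

render("paper.Rmd", output_format = "html_document")  # fast 'draft mode'
render("paper.Rmd", output_format = "pdf_document")   # polished 'manuscript mode'
```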
- Lauren Kennedy
- Bio: Lauren Kennedy is a lecturer in the Econometrics and Business Statistics department at Monash University. She works on applied statistical problems in the social sciences using primarily Bayesian methodology. Her most recent work is with survey data, particularly the use of model-and-poststratify methods to make population and subpopulation predictions.
- Abstract: Survey data is challenging to work with. It frequently contains entry errors (from respondent recollection or interviewer entry) that are difficult to verify and identify. Survey data is often received in a form that suits the data-entry software rather than a structure that is intuitive for an analyst to work with. When we consider tools like multilevel regression and poststratification, the challenges compound: even if the population data is precleaned before release, measurements and items in the sample need to be mapped to measurements and items in the population. In this talk we discuss case studies of how and where these challenges appear in practice.
- Larry Fenn
- Bio: Larry Fenn is a data journalist at the Associated Press. His investigative work has covered a broad range of topics, from guns to education to housing policy. Prior to journalism, he was an adjunct lecturer at Hunter College for applied mathematics and statistics.
- Abstract: Please see Meghan Hoyer.
- Mauricio Vargas Sepúlveda
- Bio: Mauricio Vargas Sepúlveda loves working with data and statistical programming, and is constantly learning new skills and tooling in his spare time. He mostly works in R due to its huge number of libraries and emphasis on reproducible analysis.
- Abstract: Evidence-based policymaking has become a high priority for governments across the world. The possibility of gaining efficiencies in public expenditure and of linking policy design to desired outcomes have been presented as significant advantages for the field of comparative policy. However, the same movement that supports the use of evidence in public policy decision-making has brought great concern about the sources of the supposed evidence. How should policymakers evaluate the evidence? The possibilities are open and depend on the institutional arrangements that support governmental operation and the possibility of properly judging the nature of the evidence. The science reproducibility movement could enlighten the discussion about the quality of evidence by providing a structured approach to assessing a source's validity, based on the possibility of reproducing the logic and analysis proper to scientific communication. This paper analyzes the nature and quality of civil society organizations' contributions to developing evidence for the policymaking process from a reproducibility perspective.
- Meghan Hoyer
- Bio: Meghan Hoyer is Data Director at The Washington Post where she leads data projects and acts as a consulting editor on data-driven stories, graphics and visualizations across the newsroom. Before this she helped lead the AP’s data journalism. Meghan earned a bachelor of science in journalism at Northwestern University and an MFA in creative nonfiction writing at Old Dominion University.
- Abstract: This talk will cover AP DataKit, an open-source command-line tool designed to better structure and manage projects, and, more generally, discuss creating sane, reproducible workflows.
- Monica Alexander
- Bio: Monica Alexander is an Assistant Professor in Statistical Sciences and Sociology at the University of Toronto. She received her PhD in Demography from the University of California, Berkeley. Her research interests include statistical demography, mortality and health inequalities, and computational social science.
- Abstract: Sharing code for papers and projects is an important part of reproducible research. However, sharing code can be difficult if researchers feel their code is 'not good enough' and may reflect poorly on their broader research skills. This presentation offers some brief reflections from research, consulting, and teaching experiences that helped me overcome my own barriers to sharing code, and that may help others do the same.
- Nancy Reid
- Bio: Nancy Reid is Professor of Statistical Sciences at the University of Toronto and Canada Research Chair in Statistical Theory and Applications. Her main area of research is theoretical statistics. This treats the foundations and properties of methods of statistical inference. She is interested in how best to use information in the data to construct inferential statements about quantities of interest. A very simple example of this is the widely quoted ‘margin of error’ in the reporting of polls, another is the ubiquitous ‘p-value’ reported in medical and health studies. Much of her research considers how to ensure that these inferential statements are both accurate and effective at summarizing complex sets of data.
- Abstract: Are p-values contributing to a crisis in replicability and reproducibility? This has been the topic of many dialogues, diatribes, and discussions among statisticians and scientists in recent years. I will share my thoughts on the issues, with emphasis on the role of inferential theory in helping to clarify the arguments.
- Nick Radcliffe
- Bio: Nick Radcliffe is the founder of the data science consulting and software firm Stochastic Solutions Limited, the Interim Chief Scientist at the Global Open Finance Centre of Excellence, and a Visiting Professor in Maths and Stats at the University of Edinburgh, Scotland. His background combines theoretical physics, operations research, machine learning, and stochastic optimization. Nick's current research interests include test-driven data analysis (an approach to improving the correctness of analytical results that combines ideas from reproducible research and test-driven development) and privacy-respecting analysis. He is the lead author of the open-source Python tdda package, which provides practical tools for testing analytical software and data, and also of the Miró data analysis suite.
- Abstract: The Global Open Finance Centre of Excellence is currently engaged in analysis of the financial impact of COVID-19 on the citizens and businesses of the UK. This research uses non-consented but de-identified financial data on individuals and businesses, on the basis of legitimate interest. All analysis is carried out in a highly locked-down analytical environment known as a Safe Haven. This talk will explain our approach to the challenges of ensuring the correctness and robustness of results in an environment where neither code nor input data can be opened up for review and even outputs need to be subject to disclosure control to reduce further any risks to privacy. Topics will include: testing input data for conformance and lack of personal identifiers using constraints; multiple implementations and verification of equivalence of results; regression tests and reference tests; verification of output artefacts; verification of output disclosure controls; data provenance and audit trails; test-driven data analysis—the underlying philosophy (and library) that we use to underpin this work.
- Nicolas Didier
- Bio: Nicolas Didier is a PhD student in Public Administration and Policy at Arizona State University. During his PhD and previous studies, he has worked extensively on developing evidence to inform policy on labour markets and public expenditure.
- Abstract: Please see Mauricio Vargas Sepúlveda.
- Ryan Briggs
- Bio: Ryan Briggs is a social scientist who studies the political economy of poverty alleviation. Most of his research focuses on the spatial targeting of foreign aid. He is an Assistant Professor in the Guelph Institute of Development Studies and Department of Political Science at the University of Guelph. Before that, he taught at Virginia Tech and American University.
- Abstract: It is hard to do research. One reason is that research has a production function in which one low-quality input (among many high-quality inputs) can poison the final result. This talk explains how such 'O-ring' production functions work and draws out lessons for applied researchers.
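The intuition fits in a couple of lines of R: if overall quality is the product of input qualities, a single weak input dominates the outcome (the numbers below are made up for illustration):

```r
# O-ring production: overall quality as the product of input qualities.
# One weak input (code = 0.2) poisons an otherwise strong project.
inputs <- c(design = 0.95, data = 0.90, code = 0.20, writeup = 0.95)
prod(inputs)  # ~0.16, despite three high-quality inputs
```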
- Sharla Gelfand
- Bio: Sharla Gelfand is a freelance R and Shiny developer specializing in enabling easy access to data and replacing manual, redundant processes with ones that are automated, reproducible, and repeatable. They also co-organize R-Ladies Toronto and the Greater Toronto Area R User Group. They like R (of course), dogs, learning Spanish, playing bass, and punk.
- Abstract: Getting stuck, looking around for a solution, and eventually asking for help is an inevitable and constant aspect of being a programmer. If you’ve ever looked up a question only to find some brave soul getting torn apart on Stack Overflow for not providing a minimum working example, you know it’s also one of the most intimidating parts! A minimum working example, or a reproducible example as it’s more often called in the R world, is one of the best ways to get help with your code - but what exactly is a reproducible example? How do you create one, and do it efficiently? Why is it so scary? This talk will cover what components are needed to make a good reproducible example to maximize your ability to get help (and to help yourself!), strategies for coming up with an example and testing its reproducibility, and why you should care about making one. We will also discuss how to extend the concept of reproducible examples beyond “Help! my code doesn’t work” to other environments where you might want to share code, like teaching and blogging.
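In the R world this workflow is commonly supported by the reprex package, which renders a self-contained snippet together with its output for pasting into Stack Overflow or GitHub. A minimal sketch (the snippet itself is arbitrary, and the talk is about the practice rather than any one package):

```r
# Minimal sketch: render a self-contained, shareable reproducible
# example with the reprex package. The snippet below is arbitrary.
library(reprex)

reprex({
  library(dplyr)
  starwars %>%
    filter(species == "Droid") %>%
    summarise(mean_height = mean(height, na.rm = TRUE))
})
```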
- Shemra Rizzo
- Bio: Shemra Rizzo is a senior data scientist in Genentech’s Personalized Healthcare group. Shemra’s role includes research on COVID-19 using electronic health records, and the development of data-driven approaches to evaluate clinical trial eligibility criteria. Shemra obtained her PhD in Biostatistics from UCLA. Before joining Genentech, Shemra was an assistant professor of statistics at UC Riverside, where her research covered topics in mental health, health disparities, and nutrition. In her free time, Shemra enjoys spending time with her family and running.
- Abstract: Real-world data for an emerging disease poses unique challenges. In this talk, I'll describe how our group made sense of complex Electronic Health Records (EHR) data for COVID-19 early in the pandemic. I will share our experience working towards reliable, replicable, and reproducible studies using licensed EHR data.
- Shiro Kuriwaki
- Bio: Shiro Kuriwaki is a PhD Candidate in the Department of Government at Harvard University. His research focuses on democratic representation in American politics. In an ongoing project, he studies the structure of voters' choices across levels of government and the political economy of local elections, using cast vote records and surveys. His other projects also help illuminate the mechanics of representation, including: public opinion and Congress, modern survey statistics and causal inference, and election administration. Prior to and during graduate school, he worked at the Analyst Institute in Washington, D.C.
- Abstract: I show how new features of the dataverse R package facilitate reproducibility in empirical, substantive projects. While packages and scripts make our code transparent and portable, the import of large and complex datasets is often a nuisance in project workflows that involve various data cleaning and wrangling tasks, and the Dataverse GUI can be tedious to integrate into a code-based workflow. Will Beasley and I, along with multiple other contributors, updated the dataverse R package for the first time since 2017, with the goal of spreading its use in empirical workflows. In this iteration, we make it easier to retrieve dataframes of various file formats, with options for version specification and variable subsetting. I also discuss the latest updates to pyDataverse, an independent implementation in Python that is currently more advanced in its implementation but focused on uploading and creating datasets on Dataverse.
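For instance, retrieving a dataframe directly from a Dataverse repository looks roughly like this (the file name and DOI are placeholders, and argument details may vary by package version):

```r
# Rough sketch of retrieving a dataset with the updated dataverse
# package; the file name and dataset DOI below are placeholders.
library(dataverse)

df <- get_dataframe_by_name(
  filename = "example-data.tab",        # placeholder file name
  dataset  = "doi:10.7910/DVN/XXXXXX",  # placeholder dataset DOI
  server   = "dataverse.harvard.edu",
  .f       = readr::read_tsv            # reader for the ingested file
)
```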
- Simeon Carstens
- Bio: Simeon Carstens is a Data Scientist at Tweag I/O, a software innovation lab and consulting company. Originally a physicist, Simeon completed a PhD and postdoctoral research in computational biology, focusing on Bayesian determination of three-dimensional chromosome structures.
- Abstract: Data analysis often requires a complex software environment containing one or several programming languages, language-specific modules and external dependencies, all in compatible versions. This poses a challenge to reproducibility: what good is a well-designed, tested and documented data analysis pipeline if it is difficult to replicate the software environment required to run it? Standard tools such as Python / R virtual environments solve part of the problem, but do not take into account external and system-level dependencies. Nix is a fully declarative, open-source package manager solving this problem: a program packaged with Nix comes with a complete description of its full dependency tree, down to system libraries. In this presentation, I will give an introduction to Nix, show in a live demo how to set up a fully reproducible software environment and compare Nix to existing solutions such as virtual environments and Docker.
- Tiffany Timbers
- Bio: Tiffany Timbers is an Assistant Professor of Teaching in the Department of Statistics and a Co-Director of the Master of Data Science program (Vancouver option) at the University of British Columbia. In these roles she teaches and develops curriculum around the responsible application of data science to solve real-world problems. One of her favourite courses to teach is a graduate course on collaborative software development, which focuses on how to create R and Python packages using modern tools and workflows.
- Abstract: In the data science courses at UBC, we define data science as the study and development of reproducible and auditable processes to obtain value (i.e., insight) from data. While reproducibility is core to our definition, most data science learners enter the field with other aspects of data science in mind, such as predictive modelling. This fact, along with the highly technical nature of the industry-standard reproducibility tools currently employed in data science, presents out-of-the-gate challenges to teaching reproducibility in the data science classroom. Put simply, students are not intrinsically motivated to learn this topic, and it is not an easy one for them to learn. What can a data science educator do? Over several iterations of teaching courses focused on reproducible data science tools and workflows, we have found that motivation, direct instruction, and practice are key to effectively teaching this challenging yet important subject. In this talk, I will present examples of how we deeply motivate, effectively instruct, and provide ample practice opportunities to our Master of Data Science students to engage them in learning about this topic.
- Tom Barton
- Bio: Tom Barton is a PhD student in Politics at Royal Holloway, University of London. His PhD focuses on the impact of Voter Identification laws on political participation and attitudes. More generally his interests include elections, public opinion (particularly social values) and quantitative research methods.
- Abstract: I reproduce Surridge, 2016, 'Education and liberalism: pursuing the link', Oxford Review of Education, 42:2, pp. 146-164, using the 1970 British Cohort Study (BCS70), but with a difference-in-differences regression approach and more waves of data. I find that while there is evidence for both socialisation and self-selection models, self-selection dominates the link between social values and university attendance. This is counter to what Surridge (2016) concluded. The need for re-specification was two-fold: first, Surridge's methodology did not fully test for causality; and second, later waves of data have become available since publication.
- Tyler Girard
- Bio: Tyler Girard is a PhD Candidate in political science at the University of Western Ontario (London, Ontario, Canada). His dissertation research seeks to explain the origins and diffusion of the global financial inclusion agenda by focusing on the role of ambiguous ideas in mobilizing and consolidating transnational coalitions. More generally, his work also explores new approaches to conceptual measurement in international relations.
- Abstract: In what ways can we incorporate reproducible practices in pedagogy for social science courses? I discuss how individual and group exercises centered around the replication of existing datasets and analyses offer a flexible tool for experiential learning. However, maximizing the benefits of such an approach requires customizing the activity to the students and the availability of instructor support. I offer several suggestions for effectively using replication exercises in both undergraduate and graduate level courses.
- Wijdan Tariq
- Bio: Wijdan Tariq is an undergraduate student in the Department of Statistical Sciences at the University of Toronto.
- Abstract: I undertake a narrow replication of Caicedo, 2019, 'The Mission: Human Capital Transmission, Economic Persistence, and Culture in South America', Quarterly Journal of Economics, 134:1, pp. 507-556. Caicedo reports on a remarkable, religiously inspired human capital intervention that took place in remote parts of South America 250 years ago and whose positive economic effects, he claims, persist to this day. I replicate some of the paper's key results using data files that are available on the Harvard Dataverse portal. I discuss some lessons learned in the process of replicating this paper and share some reflections on the state of reproducibility in economics.
- Yanbo Tang
- Bio: Yanbo Tang is a PhD candidate at the University of Toronto in the Department of Statistical Sciences, under the joint supervision of Nancy Reid and Daniel Roy. He is interested in the study and application of methods in higher order asymptotics and statistical inference in the presence of many nuisance parameters. Nowadays, he works under the careful gaze of his pet parrot.
- Abstract: Hypothesis testing results often rely on simple, yet important, assumptions about the distribution of p-values under the null and the alternative. We show that commonly held beliefs regarding the distribution of p-values are misleading when the variance or location of the test statistic is not well calibrated, or when the higher-order cumulants of the test statistic are not negligible. We further examine the impact of these misleading p-values on the reproducibility of scientific studies, with some examples focused on GWAS studies. Certain corrected tests are proposed and are shown to perform better than their traditional counterparts in certain settings.
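A small simulation conveys the flavour of the problem (illustrative only, not taken from the talk): when a test statistic's variance is larger than the test assumes, nominal p-values overstate significance:

```r
# Illustrative simulation: p-values are uniform when the null statistic
# is well calibrated, but anti-conservative when its variance is larger
# than the test assumes.
set.seed(1)
n_sim  <- 1e5
z_good <- rnorm(n_sim)            # calibrated: sd = 1, as the test assumes
z_bad  <- rnorm(n_sim, sd = 1.3)  # mis-calibrated: true sd larger than assumed
p_good <- 2 * pnorm(-abs(z_good))
p_bad  <- 2 * pnorm(-abs(z_bad))
mean(p_good < 0.05)  # ~0.05, the advertised error rate
mean(p_bad  < 0.05)  # ~0.13, inflated false positives
```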
Code of conduct
Code
The organizers of the Toronto Workshop on Reproducibility are dedicated to providing a harassment-free experience for everyone regardless of age, gender, sexual orientation, disability, physical appearance, race, or religion (or lack thereof).
All participants (including attendees, speakers, sponsors and volunteers) at the Toronto Workshop on Reproducibility are required to agree to the following code of conduct.
The code of conduct applies to all conference activities including talks, panels, workshops, and social events. It extends to conference-specific exchanges on social media, for instance posts tagged with the identifier of the conference (e.g. #TOrepro on Twitter), and replies to such posts.
Organizers will enforce this code throughout and expect cooperation in ensuring a safe environment for all.
Expected Behaviour
All conference participants agree to:
- Be considerate in language and actions, and respect the boundaries of fellow participants.
- Refrain from demeaning, discriminatory, or harassing behaviour and language. Please refer to ‘Unacceptable Behaviour’ for more details.
- Alert Rohan Alexander - rohan.alexander@utoronto.ca - or Kelly Lyons - kelly.lyons@utoronto.ca - if you notice someone in distress, or observe violations of this code of conduct, even if they seem inconsequential. Please refer to the section titled ‘What To Do If You Witness or Are Subject To Unacceptable Behaviour’ for more details.
Unacceptable Behaviour
Behaviour that is unacceptable includes, but is not limited to:
- Stalking
- Deliberate intimidation
- Unwanted photography or recording
- Sustained or willful disruption of talks or other events
- Use of sexual or discriminatory imagery, comments, or jokes
- Offensive comments related to age, gender, sexual orientation, disability, race or religion
- Inappropriate physical contact, which can include grabbing, or massaging or hugging without consent.
- Unwelcome sexual attention, which can include inappropriate questions of a sexual nature, asking for sexual favours or repeatedly asking for dates or contact information.
If you are asked to stop harassing behaviour, stop immediately. Even if your behaviour was meant to be friendly or a joke, it was clearly not taken that way; for the comfort of all conference attendees, you should stop.
Attendees who behave in a manner deemed inappropriate are subject to actions listed under ‘Procedure for Code of Conduct Violations’.
Additional Requirements for Conference Contributions
Presentation slides and posters should not contain offensive or sexualised material. If this material is impossible to avoid given the topic (for example text mining of material from hate sites) the existence of this material should be noted in the abstract and, in the case of oral contributions, at the start of the talk or session.
Procedure for Code of Conduct Violations
The organizing committee reserves the right to determine the appropriate response for all code of conduct violations. Potential responses include:
- a formal warning to stop harassing behaviour
- expulsion from the conference
- cancellation or early termination of talks or other contributions to the program
What To Do If You Witness or Are Subject To Unacceptable Behaviour
If you are being harassed, notice that someone else is being harassed, or have any other concerns relating to harassment, please contact Rohan Alexander - rohan.alexander@utoronto.ca, or Kelly Lyons - kelly.lyons@utoronto.ca.
We will take all good-faith reports of harassment by Toronto Workshop on Reproducibility participants seriously.
We reserve the right to reject any report we believe to have been made in bad faith. This includes reports intended to silence legitimate criticism.
We will respect confidentiality requests for the purpose of protecting victims of abuse. We will not name harassment victims without their affirmative consent.
Questions or concerns about the Code of Conduct can be addressed to rohan.alexander@utoronto.ca.
Acknowledgements
Parts of the above text are licensed CC BY-SA 4.0. Credit to SRCCON. This code of conduct was based on that developed for useR! 2018 which was a revision of the code of conduct used at previous useR!s and also drew from rOpenSci’s code of conduct.