Overview
This conference brings together academic and industry participants on
the critical issue of reproducibility in applied statistics and related
areas. The conference is free and hosted online. Everyone is welcome;
you do not need to be affiliated with a university.
The conference has three broad areas of focus:
- Evaluating reproducibility: Systematically examining the extent to
which a paper, or even a whole field, is reproducible is important for
understanding where weaknesses exist. Does, say, economics fall flat
while demography shines? How should we approach these reproductions?
What aspects contribute to the extent of reproducibility?
- Practices of reproducibility: We need new tools and
approaches that encourage us to think more deeply about reproducibility
and integrate it into everyday practice.
- Teaching reproducibility: While it is probably too late for most of
us, how can we ensure that today’s students don’t repeat our mistakes?
What are some case studies that show promise?
2022
Wednesday, 23 February 2022
| Time | Speaker | Title |
|------|---------|-------|
| 08:40-09:00 | Lisa Strug, University of Toronto | Introduction and welcome |
| 09:00-09:30 | Benjamin Haibe-Kains, University Health Network | The (Not-So-)Hard Path To Transparency and Reproducibility in AI Research |
| 09:30-10:00 | Colm-cille Caulfield, University of Cambridge | Reproducibility in an Uncertain World: How should academic data science researchers give advice? |
| 10:00-10:30 | Stephen Eglen, University of Cambridge | Evaluating the reproducibility of computational results reported in scientific journals |
| 10:30-11:00 | Valentin Danchev, University of Essex | Reproducibility and Replicability of Large Pre-trained Language Models |
| 11:00-11:30 | Monica Alexander, University of Toronto | Reproducibility in Demography: where are we at and where can we go? |
| 11:30-12:00 | Break | |
| 12:00-12:30 | Ariel Mundo, University of Arkansas | Statistics and reproducibility in biomedical research: Why we need both |
| 12:30-13:00 | Shilaan Alzahawi, Stanford University | Lay perceptions of scientific findings: Swayed by the crowd? |
| 13:00-13:30 | Break | |
| 13:30-14:00 | Fernando Hoces de la Guardia, University of California, Berkeley | Social Sciences Reproducibility Platform |
| 14:00-15:30 | Break | |
| 15:30-16:00 | Carl Laflamme, YCharOS | Antibody Characterization through Open Science (YCharOS) |
| 16:00-16:30 | Robert Hanisch, National Institute of Standards and Technology and Research Data Alliance | Reproducibility: A Metrology Perspective |
| 16:30-17:00 | Yann Joly, McGill University | Incentivizing open data sharing - what’s in it for me!? |
Thursday, 24 February 2022
| Time | Speaker | Title |
|------|---------|-------|
| 08:30-09:00 | Julien Chiquet, Université Paris-Saclay | Computo: a journal of the French Statistical Society promoting reproducibility |
| 09:00-09:30 | Nick Radcliffe, Global Open Finance Centre at the University of Edinburgh | Gentest: Automatic Test Generation for Data Science |
| 09:30-10:00 | Markus Fritsch, University of Passau | Towards reproducible GMM estimation |
| 10:00-10:30 | Break | |
| 10:30-11:00 | Aneta Piekut, Sheffield Methods Institute, University of Sheffield | Integrating reproducibility into the curriculum of an undergraduate social sciences degree |
| 11:00-12:30 | Break | |
| 12:30-13:00 | Jason Hattrick-Simpers, University of Toronto | Towards Trust and Reproducibility in Materials AI |
| 13:00-13:30 | Aya Mitani, University of Toronto | Reproducible, reliable, replicable? In-class exercise using peer-reviewed studies |
| 13:30-14:00 | Shannon Ellis, UC San Diego | Structuring & Managing Group Projects in Large-Enrollment Undergraduate Data Science Courses |
| 14:00-14:30 | Maria Tackett, Duke University | Knit, Commit, and Push: Teaching version control in undergraduate statistics courses |
| 14:30-15:00 | Break | |
| 15:00-15:30 | Lars Vilhuber, Cornell University | Teaching for large-scale Reproducibility Verification |
| 15:30-16:00 | Michael Geuenich, Lunenfeld Tanenbaum Research Institute and University of Toronto | With great data come great pipelines: creating flexible standardized pipelines for common biomedical analysis tasks using Snakemake |
| 16:00-16:30 | Paraskevi Massara, University of Toronto | MOSS4Research: A maturity model to evaluate and improve reproducibility in research projects |
| 16:30-17:00 | Chris Kenny, Harvard University | Reproducible Redistricting |
| 17:00-17:30 | Dewi Amaliah, Monash University | Reproducible Practice in Taming the Wild Data |
Friday, 25 February 2022
| Time | Speaker | Title |
|------|---------|-------|
| 09:00-09:30 | Marco Prado, University of Western Ontario | Reproducibility for Behavior Experiments in Basic Science |
| 09:30-10:00 | David Grubbs and Lara Spieker, CRC Press | On book publishing |
| 10:00-11:00 | Joelle Pineau, McGill University & Meta (Facebook) AI Research | Improving Reproducibility in Machine Learning Research |
| 11:00-11:30 | Debbie Yuster, Ramapo College of New Jersey | Infusing Reproducibility into Introductory Data Science |
| 11:30-12:00 | Colin Rundel, Duke University | Teaching Statistical computing with Git and GitHub |
| 12:00-12:30 | Mine Çetinkaya-Rundel, Duke University and RStudio | Reproducible authoring with Quarto |
| 12:30-13:00 | Erin Heerey, Western University | The Experimenter in the Room |
| 13:00-13:30 | John McLevey, University of Waterloo | Reproducibility and Principled Data Processing in Python |
| 13:30-14:00 | Break | |
| 14:00-14:30 | Kevin Wilson, Brown University and Jake Bowers, University of Illinois at Urbana-Champaign | Six Tips for Reproducible Field Experiments |
| 14:30-15:00 | Abel Brodeur, University of Ottawa | Introducing the Institute for Replication |
| 15:00-15:30 | Allison Koenecke, Cornell University and Microsoft Research | Reproducible Retrospective Analysis |
| 15:30-16:30 | Michael Hoffman, University Health Network and University of Toronto | Reproducibility standards for machine learning in the life sciences |
Presenter biographies and
abstracts
Keynotes
- Joelle Pineau
- Title: Improving Reproducibility in Machine Learning Research:
Findings from the NeurIPS Reproducibility Program and the ML
Reproducibility Challenge
- Biography: Joelle Pineau is an Associate Professor and William
Dawson Scholar at the School of Computer Science at McGill University,
where she co-directs the Reasoning and Learning Lab. She is a core
academic member of Mila and a Canada CIFAR AI chairholder. She is also
co-Managing Director of Facebook AI Research. She holds a BASc in
Engineering from the University of Waterloo, and an MSc and PhD in
Robotics from Carnegie Mellon University. Dr. Pineau’s research focuses
on developing new models and algorithms for planning and learning in
complex partially-observable domains. She also works on applying these
algorithms to complex problems in robotics, health care, games and
conversational agents. She serves on the editorial board of the Journal
of Machine Learning Research and is Past-President of the International
Machine Learning Society. She is a recipient of NSERC’s E.W.R. Steacie
Memorial Fellowship (2018), a Fellow of the Association for the
Advancement of Artificial Intelligence (AAAI), a Senior Fellow of the
Canadian Institute for Advanced Research (CIFAR), a member of the
College of New Scholars, Artists and Scientists of the Royal Society of
Canada, and a 2019 recipient of the Governor General’s Innovation
Awards.
- Michael Hoffman
- Title: Reproducibility standards for machine learning in the life
sciences
- Abstract: To make machine-learning analyses in the life sciences
more computationally reproducible, we propose standards based on data,
model and code publication, programming best practices and workflow
automation. By meeting these standards, the community of researchers
applying machine-learning methods in the life sciences can ensure that
their analyses are worthy of trust.
- Biography: Michael Hoffman creates predictive computational models
to understand interactions between genome, epigenome, and phenotype in
human cancers. His influential machine learning approaches have reshaped
researchers’ analysis of gene regulation. These approaches include the
genome annotation method Segway, which enables simple interpretation of
multivariate genomic data. He is a Senior Scientist at Princess Margaret
Cancer Centre and Associate Professor in the Departments of Medical
Biophysics and Computer Science, University of Toronto. He was named a
CIHR New Investigator and has received several awards for his academic
work, including the NIH K99/R00 Pathway to Independence Award, and the
Ontario Early Researcher Award.
Invited talks
- Abel
Brodeur
- Title: Introducing the Institute for Replication
- Biography: Abel Brodeur is an associate professor in the department
of economics at the University of Ottawa. He is the chair of the
Institute for Replication (I4R), which he founded in January 2022. I4R
works to improve the credibility of science by systematically
reproducing and replicating research findings in leading academic
journals.
- Allison Koenecke
- Title: Reproducible Retrospective Analysis
- Biography: Allison Koenecke is a postdoc at Microsoft Research in
the Machine Learning and Statistics group, and starting Summer 2022 will
be an Assistant Professor of Information Science at Cornell University.
Her research primarily spans two domains: algorithmic fairness in online
services, and causal inference in public health. Previously, she
received her PhD from Stanford’s Institute for Computational &
Mathematical Engineering.
- Aneta
Piekut
- Title: Integrating reproducibility into the curriculum of an
undergraduate social sciences degree
- Abstract: While appreciation for reproducibility and research
transparency in social sciences research has grown substantially
recently, teaching research reproducibility is still less common,
especially at the undergraduate level. Crucially, teaching reproducible
research to undergraduate students requires sequencing various open
science skills across the curriculum and normalising reproducible
research for students. In the talk I will discuss a reproducibility
assignment implemented in an undergraduate-level advanced Quantitative
Social Sciences course. As part of the assignment, students reproduced a
model in a paper published in a high-impact social science journal,
added a small extension, and published it as a reproducible report
online. I will reflect on the lessons learnt from teaching several
iterations of the module and whether one stand-alone ‘replication
project’ module is enough to change students’ practice.
- Biography: Aneta Piekut is a sociologist specialising in migration and
ethnic studies, including the measurement of attitudes, migrant
integration and segregation. At the Sheffield Methods Institute,
University of Sheffield, Aneta provides training to undergraduate and
postgraduate students in advanced quantitative methods, survey
methodology and mixed-method methodology. Aneta is committed to teaching
reproducible research methods; in 2020 she was a Project TIER Fellow
(https://www.projecttier.org/), and in 2021 joined its Executive
Committee.
- Ariel Mundo
- Title: Statistics and reproducibility in biomedical research: Why we
need both
- Abstract: The biomedical field still largely struggles to make
research reproducible. In this talk, I argue that part of this problem
is that most of us in biomedical research do not seem to realize the
importance of choosing appropriate statistical models for our data, and
how this in turn enables reproducibility. Moreover, I also argue that we
need a “statistical rethinking” in biomedical research in order to
establish reproducibility as a core aspect of our work.
- Biography: Ariel Mundo is a Fulbright alum and PhD Candidate in the
Department of Biomedical Engineering at the University of Arkansas. His
work focuses on the longitudinal study of changes in cancer metabolism
using optical and molecular tools, and the use of semi-parametric
methods to analyze such data. He is also an R enthusiast and avid
reader.
- Aya Mitani
- Title: Reproducible, reliable, replicable? In-class exercise using
peer-reviewed studies
- Abstract: I will share my experience in preparing and implementing
an in-class exercise to reproduce the results from peer-reviewed
publications in health science journals. The course, titled Analysis of
Correlated Data, enrolls 20 students mostly pursuing a Master of Science
degree in biostatistics. Challenges include finding a suitable clustered
or longitudinal study that provides original data and translating the
information given (and not given) in the “Methods” section into actual
code. Through this exercise, students learn whether the results are not
only reproducible but reliable, and whether the analysis can be
replicated on a different set of data. The goal through this exercise is
to teach the students how to write an applied manuscript or report as
modern biostatisticians.
- Biography: I am an Assistant Professor in the Division of
Biostatistics at the Dalla Lana School of Public Health (DLSPH) of the
University of Toronto. I obtained my Ph.D. in Biostatistics from Boston
University and did my postdoctoral research fellowship at Harvard T. H.
Chan School of Public Health. My research includes the development of
statistical methods for complex oral health data, multiple imputation
for missing data, modelling agreement in cancer screening studies, and
biased sampling designs in surveys and observational studies. At DLSPH,
I teach Analysis of Correlated Data and Introduction to Joint Modeling
in Health Research. I am passionate about incorporating good
reproducible research practices into my teaching. In 2021, I co-founded
the Health Data Working Group at DLSPH to provide an accessible space
for students and researchers to learn about data and coding outside of
the classroom. I live in Etobicoke with my husband and two
children.
- Benjamin Haibe-Kains
- Title: The (Not-So-)Hard Path To Transparency and Reproducibility in
AI Research
- Abstract: As artificial intelligence (AI) becomes a method of choice
to analyze biomedical data, the field is facing multiple challenges
around research reproducibility and transparency. Given the
proliferation of studies investigating the applications of AI in
research and clinical studies, it is essential for independent
researchers to be able to scrutinize and reproduce the results of a
study using its materials, and build upon them in future studies.
Computational reproducibility is achievable when the data can easily be
shared and the required computational resources are relatively common.
However, the complexity of AI algorithms and their implementation, the
need for specific computer hardware and the use of sensitive biomedical
data represent major obstacles in health-related AI research. In this
talk, I will describe the various aspects of an AI biomedical study that
are necessary for reproducibility and the platforms that exist for
sharing these materials with the scientific community.
- Biography: Dr. Benjamin Haibe-Kains is a Senior Scientist at the
Princess Margaret Cancer Centre (PM), University Health Network, and
Associate Professor in the Medical Biophysics department of the
University of Toronto. Dr. Haibe-Kains earned his PhD in Bioinformatics
at the Université Libre de Bruxelles (Belgium). Supported by a Fulbright
Award, he did his postdoctoral fellowship at the Dana-Farber Cancer
Institute and Harvard School of Public Health (USA). Dr. Haibe-Kains’
research focuses on the integration of high-throughput data from various
sources to simultaneously analyze multiple facets of carcinogenesis.
Dr. Haibe-Kains’ team is analyzing large-scale radiological and
(pharmaco)genomic datasets to develop new prognostic and predictive
models to improve cancer care.
- Carl Laflamme
- Title: Antibody Characterization through Open Science (YCharOS)
- Abstract: Global sales of commercial antibodies are estimated at $2
billion per year with approximately half that money wasted on
underperforming reagents. Both public and private sectors agree that a
robust, independent, and scalable process to characterize commercial
antibodies is required, but all attempts to find a solution have failed
due to the tangle of conflicting interests in both academia and
industry. YCharOS (Antibody Characterization through Open Science), in
collaboration with the Structural Genomics Consortium (SGC) and the
Montreal Neurological Institute (The Neuro, McGill University), has
created an open science ecosystem in which antibody manufacturers,
knockout cell line providers, academics, pharma and granting agencies
contribute resources and knowledge to solve the antibody liability
crisis. We have already publicly shared the identification of hundreds
of high-performing antibodies for dozens of neuroscience targets. We
have scaled up our platform, developed automation and expanded our team.
We now aim to characterize antibodies for the human proteome.
- Chris Kenny
- Title: Reproducible Redistricting
- Abstract: Modern redistricting is known for occurring behind closed
doors, where incumbent politicians can work to advance their
co-partisans’ interests. Recent advancements in political science and
statistical research have developed the tools to help resolve these
problems. I overview the R-package-based workflow that the ALARM Project
and its members use for research, advocacy, and testimony to courts. Key
packages developed for these purposes include redist, redistmetrics, and
geomander.
- Biography: Chris Kenny is a Ph.D. candidate in the Department of
Government at Harvard University, studying American Politics and
Political Methodology. He is currently the Political Science
Pre-Doctoral Fellow at the Harvard Election Law Clinic. His substantive
focus is on redistricting and gerrymandering. He primarily develops
open-source R tools for analyzing redistricting and voting rights in
geographic and contemporary contexts. He is an affiliate with the Center
for American Political Studies at Harvard University, The Institute for
Quantitative Social Science, and the Algorithm-Assisted Redistricting
Methodology (ALARM) Project.
- Colin Rundel
- Title: Teaching Statistical computing with Git and GitHub
- Colm-cille Caulfield
- Title: Reproducibility in an Uncertain World: How should academic
data science researchers give advice?
- David Grubbs and Lara Spieker
- Title: On book publishing
- Abstract: In this very practical and interactive workshop, four book
editors from Chapman and Hall/CRC will discuss why you should consider
publishing an R or Data Science book and why you should work with CRC.
The editors will go over the publishing process and provide best
practices for shaping your ideas and submitting a book proposal; discuss
their bestsellers and popular series as well as emerging topics and
trends. The lively discussion will provide plenty of opportunities for
the attendees to ask questions and discuss ideas.
- Debbie
Yuster
- Title: Infusing Reproducibility into Introductory Data Science
- Abstract: In this talk, I will discuss the role of reproducibility
in my Introduction to Data Science course. The course has no
prerequisites, so many students are coding and analyzing data for the
first time. They develop habits of reproducibility from the start: their
analyses are done within R Markdown documents, and GitHub is used to
facilitate both version control and collaboration among teammates.
Through scaffolded coding exercises, gradual onboarding to GitHub, and
focusing on a small subset of GitHub functionality, even beginner
students can become adept at using these technologies. I will also
discuss tips learned from teaching the course in a fully remote format,
and will provide pointers to training resources for instructors who want
to use similar tools and workflows in their own courses.
- Biography: Debbie Yuster is an Assistant Professor of Data Science
and Mathematics at Ramapo College of New Jersey. She holds a Ph.D. in
Mathematics from Columbia University. Prior to joining Ramapo, Debbie
served as a math professor at SUNY Maritime College, earning the SUNY
Chancellor’s Award for Excellence in Teaching. Debbie served as a
Visiting Data Science Scholar at the Wall Street Journal, and has
cultivated industry partnerships leading to undergraduate research
projects. She also has an interest in K-12 STEM outreach, having worked
with secondary school teachers and students for many years.
- Dewi Amaliah
- Title: Reproducible Practice in Taming the Wild Data
- Abstract: I will talk about my experience in refreshing the wages
data from the prominent survey, NLSY79, which is used as an example of
longitudinal data in a textbook (Singer and Willet, 2003). The
motivation of this study is to demonstrate the steps (extracting,
tidying, cleaning, and exploring) and clearly articulate the decisions
made when data is refreshed from the raw (wild) to the textbook (tame)
data. All of those steps are documented to ensure reproducibility.
- Erin Heerey
- Title: The Experimenter in the Room
- Fernando Hoces de la Guardia
- Title: Social Sciences Reproducibility Platform
- Kevin Wilson and Jake
Bowers
- Title: Six Tips for Reproducible Field Experiments
- Jason
Hattrick-Simpers
- Title: Towards Trust and Reproducibility in Materials AI
- Biography: Jason Hattrick-Simpers is a Professor at the Department
of Materials Science and Engineering, University of Toronto, and a
Research Scientist at CanmetMATERIALS. He graduated with a B.S. in
Mathematics and a B.S. in Physics from Rowan University and a Ph.D. in
Materials Science and Engineering from the University of Maryland. Prior
to joining UofT, Prof. Hattrick-Simpers was a staff scientist at the
National Institute of Standards and Technology (NIST) in Gaithersburg,
MD, where he co-developed tools for discovering novel corrosion-resistant
alloys, developed active learning approaches to guide thin
film and additive manufacturing alloy studies, and developed tools and
best practices to enable trust in AI within the materials science
community.
- John
McLevey
- Title: Reproducibility and Principled Data Processing in Python
- Julien Chiquet
- Title: Computo: a journal of the French Statistical Society
promoting reproducibility
- Abstract: This talk will present Computo
(https://computo.sfds.asso.fr/), a recently launched academic journal
that calls for higher standards in the publication of scientific
results. To achieve this goal, Computo goes beyond classical static
publications by leveraging technical advances in literate programming
and scientific reporting. Computo focuses on computational and
algorithmic methodological contributions to the fields of statistics and
machine learning. The journal is designed to allow authors to
demonstrate the usefulness of their methods for data analysis, but also
to promote the numerical illustration of theoretical properties. In the
era of the reproducibility crisis, Computo differs from other journals
in the centrality given to the issues of replicability and open science:
Computo is distributed solely online, free for authors and readers; it
systematically makes available the exchanges between authors and
reviewers, the latter being able to choose to remain anonymous; and
Computo uses an original publication format that guarantees the
reproducibility of results: articles are submitted and published in the
form of interactive documents (“notebooks” integrating text, code,
equations and bibliographic references), associated with a GitHub
repository configured to demonstrate, dynamically and durably, the
reproducibility of the contribution. On the Computo submission page, we
offer various templates to prepare your submissions, as well as an
example of a finalized article and the associated repository.
- Biography: Julien Chiquet, editor of Computo, is a senior researcher
in statistical learning. He is supported for this project by co-editors
Chloé Azencott, Pierre Neuvial and Nelle Varoquaux, all researchers in
machine learning and statistics.
- Lars
Vilhuber
- Title: Teaching for large-scale Reproducibility Verification
- Abstract: We describe a unique environment in which undergraduate
students from various STEM and social science disciplines are trained in
data provenance and reproducible methods, and then apply that knowledge
to real, conditionally accepted manuscripts and associated replication
packages. We describe in detail the recruitment, training, and regular
activities. While the activity is not part of a regular curriculum, the
skills and knowledge taught through explicit training of reproducible
methods and principles, and reinforced through repeated application in a
real-life workflow, contribute to the education of these undergraduate
students, and prepare them for post-graduation jobs and further
studies.
- Biography: Lars Vilhuber holds a Ph.D. in Economics from Université
de Montréal, Canada, and is currently on the faculty of the Cornell
University Economics Department. He has interests in labor economics,
statistical disclosure limitation and data dissemination, and
reproducibility and replicability in the social sciences. He is the Data
Editor of the American Economic Association, and Managing Editor of the
Journal of Privacy and Confidentiality.
- Lisa
Strug
- Title: Introduction and overview
- Biography: Dr. Strug is Professor in the Departments of Statistical
Sciences, Computer Science and cross-appointed in Biostatistics at the
University of Toronto and is a Senior Scientist in the Program in
Genetics and Genome Biology at the Hospital for Sick Children. Dr. Strug
is the inaugural Director of the Data Sciences Institute (DSI), a
tri-campus, multi-divisional, multi-institutional, multi-disciplinary
hub for data science activity at the University of Toronto and
affiliated Research Institutes. The DSI’s goal is to accelerate the
impact of data sciences across the disciplines to address pressing
societal questions and drive positive social change. Dr. Strug holds
several other leadership positions at the University of Toronto
including the Director of the Canadian Statistical Sciences Institute
Ontario Region (CANSSI Ontario), and at the Hospital for Sick Children
as Associate Director of the Centre for Applied Genomics and the Lead of
the Canadian Cystic Fibrosis Gene Modifier Consortium and the Biology of
Juvenile Myoclonic Epilepsy International Consortium. She is a
statistical geneticist and her research focuses on the development of
novel statistical approaches to analyze and integrate multi-omics data
to identify genetic contributors to complex human disease. She has
received several honours including the Tier 1 Canada Research Chair in
Genome Data Science.
- Marco Prado
- Title: Reproducibility for Behavior Experiments in Basic
Science
- Biography: Marco Prado is a scientist at the Robarts Research
Institute and a full professor at the University of Western Ontario,
where he holds a Canada Research Chair in Neurochemistry of Dementia. He
is interested in understanding how neurochemical alterations in
neurodegenerative diseases contribute to protein misfolding and
cognitive dysfunction. He has made contributions to understanding
maladaptive signaling in Alzheimer’s and Prion diseases by investigating
physiological functions of the prion protein and in how molecular
chaperones affect signaling and protein misfolding in neurodegenerative
diseases. He has developed multiple genetic mouse models of
neurochemical dysfunction in dementia. Marco’s group combines the use of
sophisticated touchscreen tests of high-level cognition and detailed
biochemical analysis to reveal several mechanisms regulating executive
function and mechanisms of pathological changes in mouse models. He is
currently spearheading with several colleagues an Open Science
Repository (www.mousebytes.ca) for high-level cognitive data in mouse
models of neurodegenerative disease. This effort will support a
community of more than 300 laboratories to increase reproducibility and
replicability of cognitive datasets in pre-clinical research. Marco
Prado has received several awards for his work, including the Guggenheim
Fellowship, and has published over 200 manuscripts.
- Maria Tackett
- Title: Knit, Commit, and Push: Teaching version control in
undergraduate statistics courses
- Abstract: In recent years there has been increased focus on
incorporating the skills required to conduct well-documented and
reproducible analyses in the undergraduate statistics curriculum.
Because data analysis is an iterative process, version control, a record
of changes to a set of files over time, is a foundational part of a
reproducible workflow. In this talk, I will describe how version control
with Git can be included as a learning objective in the first and second
statistics courses. I’ll discuss strategies for introducing version
control to students, incorporating it in individual and team-based
assignments, and assessing students’ understanding. I’ll also share
lessons learned and an example of how this can be implemented using
RStudio and GitHub.
- Biography: Maria Tackett is an Assistant Professor of the Practice
in the Department of Statistical Science at Duke University. Prior to
joining the faculty at Duke, Maria earned a Ph.D. in Statistics from the
University of Virginia and worked as a statistician at Capital One. Her
work focuses on using active learning strategies to increase engagement
in large undergraduate statistics courses, and understanding how
classroom practices impact students’ sense of community in these
courses. Maria is active in the statistics education community,
including serving as the current Communications Officer for the ASA
Section on Statistics and Data Science Education.
- Markus Fritsch
- Title: Towards reproducible GMM estimation
- Abstract: Generalized method of moments (GMM) estimation is a way
forward in regression setups where endogeneity is present. A practically
relevant area of application is the estimation of linear dynamic panel
data models. This context forces the researcher to make many decisions
that seem marginal at first, but which often affect the estimation and
inference dramatically. The decisions comprise the number and type of
employed moment conditions, their weighting scheme, how covariates
and/or dummy variables are included, whether we iterate the estimation
procedure and/or bias-correct, etc. Due to the many possible choices,
clear documentation and reproducibility are vital for the communication
of GMM estimation results. We provide guidelines for reproducible GMM
estimation and demonstrate their relevance by replicating and extending
several empirical applications. (A sketch of where these choices enter
the estimator follows this entry.)
- Biography: Markus Fritsch is an Assistant Professor at the Chair of
Statistics and Data Analytics at the University of Passau. He is the
creator of the CRAN package pdynmc. His research interests include data
science and statistical learning, GMM estimation, quantile regression,
and reproducible applied statistics.
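To make concrete where these choices enter, recall the textbook form of
the GMM estimator: it minimizes a quadratic form in the sample moment
conditions, so the moment function g and the weighting matrix W_N (the
notation below is standard, not taken from the talk itself) directly
change the result.

```latex
\hat{\beta}_{\mathrm{GMM}}
  = \arg\min_{\beta}\; \bar{g}_N(\beta)^{\top}\, W_N\, \bar{g}_N(\beta),
\qquad
\bar{g}_N(\beta) = \frac{1}{N}\sum_{i=1}^{N} g(z_i, \beta)
```

Iterating the estimation (re-estimating W_N from a previous step) or
bias-correcting changes the estimator again, which is why the abstract
argues all of these choices must be documented.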
- Michael Geuenich
- Title: With great data come great pipelines: creating flexible
standardized pipelines for common biomedical analysis tasks using
Snakemake
- Abstract: Biomedical data analysis pipelines are becoming
increasingly complex as projects frequently involve the analysis of raw
data from distinct batches and experimental modalities. Work frequently
starts with processing and normalizing several large datasets in a
variety of ways, often requiring custom filtering approaches for each
individual dataset. Existing and novel analysis methods are then
frequently applied to the processed data using a variety of parameters
prior to subsequent benchmarking, resulting in many individual analysis
steps that need to be tied together. Importantly, some data processing
steps are frequently dependent on the data itself, requiring inspection
of preliminary results before being able to run a standardized pipeline
in full. In addition, pre-processing steps are frequently revised as
part of the iterative analysis workflow common to most projects, thus
requiring downstream analyses to be re-run as input data changes. These
challenges make it cumbersome and error-prone to run individual
analysis steps manually. Workflow managers such as Snakemake allow for
the creation of reproducible and easily maintainable pipelines that tie
these steps together (a minimal sketch follows this entry).
- Biography: Michael is a PhD student in the computational track of
the molecular genetics department at the University of Toronto and at
the Lunenfeld Tanenbaum Research Institute with Kieran Campbell. His
work focusses on better understanding immune escape in pancreatic cancer
using machine learning tools and a diverse set of -omics data.
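To give a flavour of the approach, here is a minimal, hypothetical
Snakefile; the batch names, paths, and scripts are invented for
illustration, but the structure shows how Snakemake ties steps together
and re-runs only the steps whose inputs have changed.

```
# Minimal hypothetical Snakefile: each rule declares its inputs and
# outputs, and Snakemake derives the dependency graph from them.

BATCHES = ["batch1", "batch2"]

rule all:
    input:
        "results/summary.csv"

# Normalize each raw batch independently (one job per batch).
rule normalize:
    input:
        "data/raw_{batch}.csv"
    output:
        "processed/{batch}_normalized.csv"
    shell:
        "python scripts/normalize.py {input} {output}"

# Combine all normalized batches; downstream steps re-run automatically
# whenever an upstream input is revised.
rule summarize:
    input:
        expand("processed/{batch}_normalized.csv", batch=BATCHES)
    output:
        "results/summary.csv"
    shell:
        "python scripts/summarize.py {input} {output}"
```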
- Mine Çetinkaya-Rundel
- Title: Reproducible authoring with Quarto
- Biography: I am a Professor of the Practice and the Director of
Undergraduate Studies at the Department of Statistical Science and an
affiliated faculty in the Computational Media, Arts, and Cultures
program at Duke University. My work focuses on innovation in statistics
and data science pedagogy, with an emphasis on computing, reproducible
research, student-centered learning, and open-source education. I work
on integrating computation into the undergraduate statistics curriculum,
using reproducible research methodologies and analysis of real and
complex datasets. In addition to my academic position, I also work with
RStudio, where I focus primarily on education for open-source R packages
as well as building resources and tools for educators teaching
statistics and data science with R and RStudio.
- Monica Alexander
- Title: Reproducibility in Demography: where are we at and where can
we go?
- Biography: Monica Alexander is an Assistant Professor in Statistical
Sciences and Sociology at the University of Toronto. Her research
focuses on developing statistical methods to help measure demographic
and health outcomes. She received a PhD in Demography and Masters in
Statistics from the University of California, Berkeley. She has worked
on research projects with organizations such as UNICEF, the World Health
Organization, the Bill and Melinda Gates Foundation, and the Human
Mortality Database.
- Nick
Radcliffe
- Title: Gentest: Automatic Test Generation for Data Science
- Abstract: This talk will focus on reference tests—scripts that test
the ongoing correctness of scripts, programs and pipelines with a
particular focus on data science-oriented tasks. The TDDA library has
long offered support to allow humans to write useful tests for data
science workflows, with a focus on supporting tests for what might be
called semantic/functional correctness, rather than syntactic/form
correctness. New “Gentest” functionality in TDDA goes further by
automating large parts of test production. Using Gentest, researchers
can concentrate on developing robust/correct analysis pipelines,
verifying them in the usual way (probably by hand), and then use Gentest
to generate executable tests automatically. Although Gentest is written
in Python, it can also be used to generate tests for R or almost any
other language. If all goes well, this talk will include a demonstration
of automatically generating tests for R scripts. (A generic sketch of
the reference-testing idea follows this entry.)
- Biography: Nick Radcliffe is the founder of the data science
consulting and software firm, Stochastic Solutions Limited, the Interim
Chief Scientist at the Global Open Finance Centre of Excellence, and a
Visiting Professor in Maths and Stats at University of Edinburgh,
Scotland. His background combines theoretical physics, operations
research, machine learning and stochastic optimization. Nick’s current
research interests include a focus on test-driven data analysis, (an
approach to improving correctness of analytical results that combines
ideas from reproducible research and test-driven development) and
privacy-respecting analysis. He is the lead author of the open-source
Python tdda package, which provides practical tools for testing
analytical software and data, and also of the Miró data analysis
suite.
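For readers unfamiliar with reference testing, here is a minimal generic
sketch of the idea in Python/pandas: run the pipeline, then compare its
output to a stored, previously verified result. This illustrates the
concept Gentest automates; it is not the tdda API, and the paths and toy
pipeline are hypothetical.

```python
# Generic reference-test sketch: fail whenever pipeline output drifts
# from a stored, hand-verified reference result.
import pandas as pd

def run_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    """Toy analysis step: mean score per group, sorted for stable comparison."""
    return (
        raw.groupby("group", as_index=False)["score"]
        .mean()
        .sort_values("group", ignore_index=True)
    )

def test_pipeline_matches_reference():
    raw = pd.read_csv("data/raw.csv")                 # hypothetical input path
    expected = pd.read_csv("reference/expected.csv")  # verified-by-hand output
    result = run_pipeline(raw)
    # Any regression (or intended change) in the results fails the test.
    pd.testing.assert_frame_equal(result, expected)
```

Gentest’s contribution, per the abstract, is generating such tests
automatically from a pipeline the researcher has already verified by
hand.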
- Paraskevi Massara
- Title: MOSS4Research: A maturity model to evaluate and improve
reproducibility in research projects.
- Abstract: Our ability to gather large amounts of data, store it and
analyze it efficiently has created new research opportunities in health
sciences and it has led to novel practices. One such practice is the
creation of large datasets that can be used in multiple studies
effectively increasing our research output. However, big data is no free
lunch and it comes with its own challenges. On one hand, improper
management of data may lead to problems when communicating or sharing
data. Different terminology, inaccessible storage,
ethical/economic/social barriers may be some of the problems related to
sharing the common large datasets. On the other hand, improper
management is not confined to data, but can extend to the analysis as
well, where processes or analytical tasks are not properly documented or
permanently stored. These problems significantly inhibit
the reproducibility of studies, which in turn may make the verification
of research results practically impossible, and they can also lead to
waste in terms of lost data, time, effort and funds. Other practical
domains, such as computer science or engineering, have long employed
methods to systematically document data and processing tasks to allow
for repetition and reproducibility. Based on such methods, we propose a
novel framework to evaluate the maturity of the reproducibility
practices employed in the context of individual projects or within an
entire research team. The framework consists of a self-assessment
questionnaire and a maturity model to allow teams to evaluate the
maturity of their reproducibility practices, and a guide on how to
increase their maturity level. The guide contains practices drawn from
other domains to improve communication, collaboration and
reproducibility.
- Biography: Paraskevi Massara is a PhD candidate supervised by Drs.
Elena Comelli and Robert Bandsma. Her research interests include growth
pattern detection in children in association with the gut microbiome.
She is a coding enthusiast and an aspiring data scientist with extensive
practical experience with programming, machine learning and statistics,
and development and management platforms such as GitHub. She is a member
of R-Ladies and Women Who Code. She is the recipient of several awards,
including the Ontario Graduate Scholarship, the Peterborough Hunter
Charitable Foundation Graduate Award, and the Connaught International
Scholarship.
- Robert
Hanisch
- Title: Reproducibility: A Metrology Perspective
- Biography: Dr. Robert J. Hanisch is the Director of the Office of
Data and Informatics in the Material Measurement Laboratory at NIST.
Prior to this appointment (July 2014) he was a Senior Scientist at the
Space Telescope Science Institute (STScI), Baltimore, Maryland, and
Director of the US Virtual Astronomical Observatory. In the past
twenty-five years Dr. Hanisch has led many efforts in the astronomy
community in the area of information systems and services, focusing
particularly on efforts to improve the accessibility and
interoperability of data archives and catalogs. He was the first chair
of the International Virtual Observatory Alliance Executive Committee
(2002-2003) and continues as a member of the IVOA Executive. From 2000
to 2002 he served as Chief Information Officer at STScI, overseeing all
computing, networking, and information services for the Institute. Prior
to that he had oversight responsibilities for the Hubble Space Telescope
Data Archive and led the effort to establish the Multimission Archive at
Space Telescope—MAST—as the optical/UV archive center for NASA
astrophysics missions. He has served as chair of the Program Organizing
Committee for the Astronomical Data Analysis Software and Systems
(ADASS) conferences, chair of the Astrophysics Data Centers Coordinating
Committee, and co-chair of the Decadal Survey Study Group on
Computation, Simulation, and Data Handling. He is currently president of
IAU Commission 5 (Data and Documentation), chair of the IAU Comm. 5
Working Group on Virtual Observatories, Data Centers, and Networks, and
co-chair of the Comm. 5 Working Group on Libraries. He completed his
Ph.D. in Astronomy in 1981 at the University of Maryland, College Park,
working in the field of extragalactic radio astronomy with Prof. William
Erickson.
- Shannon Ellis
- Title: Structuring & Managing Group Projects in Large-Enrollment
Undergraduate Data Science Courses
- Abstract: Computational notebooks are a popular tool for generating
technical data science reports, as they allow for narrative text, code,
and code outputs in a single explanatory document. Given their
popularity, many data science courses utilize computational notebooks
for instruction, assignments, and projects, the output of which can be
analyzed to better understand student behavior and improve instruction.
Here, we present the results from the analysis of 686 final group data
science projects from 8 iterations of the undergraduate course COGS 108:
Data Science in Practice to explain how students approach open-ended
data science projects and provide data science instructors with general
recommendations on structuring and managing reproducible data science
projects in large-enrollment data science courses.
- Biography: Shannon E. Ellis is an Assistant Teaching Professor at UC
San Diego in the Cognitive Science Department, where her primary focus
is teaching programming and data science to thousands of undergraduate
students each academic year. Prior to her arrival at UC San Diego,
Shannon received her Ph.D. in Human Genetics from the Johns Hopkins
School of Medicine and completed a postdoctoral fellowship in the
Department of Biostatistics at the Johns Hopkins Bloomberg School of
Public Health.
Shannon is particularly passionate about data science, ethical data
analysis, and education. She aims to ensure that data science education
is accessible to everyone, with a particular focus on individuals from
marginalized groups who typically have not had access to such materials
and training.
- Shilaan Alzahawi
- Title: Lay perceptions of scientific findings: Swayed by the
crowd?
- Abstract: Every day, important scientific findings are rejected by
the public at large. To increase public faith in science, some have
proposed the use
of crowd science. Drawing from theories on social norms and numerical
cognition, we test whether crowd science improves lay perceptions of
scientific findings. We run an experiment (N = 2,019; preregistration,
data, code, and materials at osf.io/vedb4) to study the effects of
scientific findings emerging from a crowd of researchers (vs. a typical
research collaboration) on lay consumers’ posterior beliefs, confidence
in an aggregate effect size estimate, and ratings of credibility, bias,
and error. We focus on crowdsourced data analysis: a crowd of scientists
who independently analyze the same data to estimate and report a
parameter of interest. Contrary to our hypotheses, we do not find that
consistent crowd estimates increase the sway and credibility of
scientific findings to lay consumers: instead, to our surprise, they
lead to lower posterior beliefs and higher ratings of error. In the
future, it is important for crowd scientists to consider how to tackle
science skepticism and effectively communicate variable crowd estimates
to lay consumers.
- Biography: Shilaan Alzahawi is a Master’s student in Statistics at
Ghent University and a PhD candidate in Organizational Behavior at
Stanford University. Shilaan is interested in meta-science and
inferential statistics, with a particular interest in the coordination
and effectiveness of large-scale science collaborations.
- Stephen Eglen
- Title: Evaluating the reproducibility of computational results
reported in scientific journals
- Abstract: A recent study
(http://dx.doi.org/10.1371/journal.pbio.3001107) estimated that only 2%
of biomedical articles shared code relating to computations. This lack
of sharing of code inhibits reproducibility of findings and reusability
of methods. I will introduce our CODECHECK project
(https://codecheck.org.uk), which reviews computational findings
underlying research articles in the biosciences. Compared to traditional
peer review, this review is open and interactive, with the aim of
helping all authors make their work reproducible. All code/data required
to reproduce computational results, and the results themselves, are
shared freely following FAIR guidelines. We hope our system will be used
across multiple publishers and bring a cultural change towards more
transparent, open, and reusable computational workflows. This is joint
work with Daniel Nüst.
- Biography: SJE is Professor of Computational Neuroscience, in the
Department of Applied Mathematics and Theoretical Physics, University of
Cambridge. He has a long-standing interest in open science and
reproducible research. He co-leads the CODECHECK project for
reproducibility of computations in scientific publications (https://codecheck.org.uk). He is an associate editor for
bioRxiv and is on advisory boards for F1000Research and Gigabyte.
- Valentin
Danchev
- Title: Reproducibility and Replicability of Large Pre-trained
Language Models
- Abstract: A major recent development in artificial intelligence and
deep learning research is large language models (LLMs) (e.g., BERT,
GPT-3, Gopher) that are trained on a massive amount of language data and
are subsequently applied to a wide range of downstream tasks. Over the
last couple of years, LLMs have been adopted and have shown promise
across research domains, pointing to the importance of evaluating the
scientific potential and challenges of these models through the lenses
of research transparency, computational reproducibility, and
replicability. While challenges for reproducibility and replicability in
data-intensive computational applications are not new, pre-trained LLMs
built on deep learning approaches bring some novel epistemic challenges
as well as related ethical and social risks. Specifically, the massive
and often sensitive, publicly unavailable, and proprietary data sets on
which these models are pre-trained; the scale of the models with
hundreds of billions of parameters and associated computationally
intensive infrastructure; and the pre-trained nature of the models
forming a basis for subsequent applications in the context of restricted
access to many of the models, their software, and training procedures
can all pose challenges to research transparency, computational
reproducibility, and replicability. I will discuss these challenges and
outline possible improvements drawing on principles of responsible and
reproducible research and on recent frameworks and practices in
data-intensive computational sciences aiming to securely access and
model sensitive data at scale.
- Biography: Valentin Danchev is a Lecturer in Computational Social
Science at the University of Essex and a Fellow of the Software
Sustainability Institute. He holds a DPhil from the University of Oxford
and held postdoctoral positions at the University of Chicago and the
Stanford University School of Medicine. His research combines
computational methods from data science and network analysis with
approaches from reproducible research and metascience to study the
transparency, reproducibility, bias, and social impact of data-intensive
research, with a current focus on evaluating and improving the
transparency and reproducibility of applications of data science,
artificial intelligence, and machine learning in the social and health
sciences. In another stream of research, he uses computational social
science and network analysis to examine health-related misinformation,
digital-health interventions, and inequality in network structures of
global migration. He teaches data science with an emphasis on open
reproducible workflows and responsible analysis of real-world data.
- Yann Joly
- Title: Incentivizing open data sharing - what’s in it for me!?
2021
Thursday, 25 February, 2021
| Time | Speaker | Topic | Video |
|------|---------|-------|-------|
| 9:00-9:10am | Rohan Alexander, University of Toronto | Welcome | - |
| 9:10-9:20am | Radu Craiu, University of Toronto | Opening remarks | https://youtu.be/JGGVEgMBURU |
| 9:20-9:30am | Wendy Duff, University of Toronto | Opening remarks | https://youtu.be/Z3aWU1A0FCw |
| 9:30-10:25am | Mine Çetinkaya-Rundel, University of Edinburgh | Keynote - Teaching | https://youtu.be/ANH2tv2vkew |
| 10:30-11:30am | Riana Minocher, Max Planck Institute for Evolutionary Anthropology | Keynote - Evaluating | https://youtu.be/O3t8TwWeli0 |
| 11:30-11:55am | Tiffany Timbers, University of British Columbia | Teaching | https://youtu.be/mh93W8XimOg |
| Noon-12:25pm | Tyler Girard, University of Western Ontario | Teaching | https://youtu.be/k3qgmUAjIvA |
| 12:30-12:55pm | Shiro Kuriwaki, Harvard University | Practices | https://youtu.be/-J-eiPnmoNE |
| 1:00-1:25pm | Meghan Hoyer, Washington Post & Larry Fenn, AP | Practices | https://youtu.be/FFwMfNk83rc |
| 1:30-1:55pm | Tom Barton, Royal Holloway, University of London | Evaluating | https://youtu.be/YTdhcSDqFNQ |
| 2:00-2:25pm | Break | - | - |
| 2:30-2:55pm | Mauricio Vargas, Catholic University of Chile & Nicolas Didier, Arizona State University | Evaluating | https://youtu.be/VpTavLYEMgg |
| 3:00-3:25pm | Jake Bowers, University of Illinois & The Policy Lab | Practices | https://youtu.be/3N0YwJIbbHg |
| 3:30-3:55pm | Amber Simpson, Queen’s University | Practices | https://youtu.be/uUfrcB6aynQ |
| 4:00-4:25pm | Garret Christensen, US FDIC | Evaluating | https://youtu.be/595KkVKJ29w |
| 4:30-4:55pm | Yanbo Tang, University of Toronto | Practices | https://youtu.be/0x6gOkldOvk |
| 5:00-5:25pm | Lauren Kennedy, Monash University | Practices | https://youtu.be/HhfogRbgbA4 |
| 5:30-6:00pm | Lisa Strug, University of Toronto & CANSSI Ontario | Closing remarks | https://youtu.be/B_9puTSp3f8 |
Friday, 26 February, 2021
| Time | Speaker | Topic | Video |
|------|---------|-------|-------|
| 8:00-8:30am | Nick Radcliffe and Pei Shan Yu, Global Open Finance Centre of Excellence & University of Edinburgh | Practices | https://youtu.be/pWEc8XoIIKE |
| 8:30-9:00am | Julia Schulte-Cloos, LMU Munich | Practices | - |
| 9:00-9:25am | Simeon Carstens, Tweag/IO | Practices | https://youtu.be/fpoFzDvrJAA |
| 9:30-9:55am | Break | - | - |
| 10:00-10:55am | Eva Vivalt, University of Toronto | Keynote - Practices | https://youtu.be/0WZUzf2oSGY |
| 11:00-11:25am | Andrés Cruz, Pontificia Universidad Católica de Chile | Practices | https://youtu.be/HjdPDEACxmA |
| 11:30-11:55am | Emily Riederer, Capital One | Practices | https://youtu.be/BknQ0ZNkMNY |
| Noon-12:25pm | Florencia D’Andrea, National Institute of Agricultural Technology | Practices | https://youtu.be/9FVUIPfBeXw |
| 12:30-12:55pm | John Blischak, Freelance scientific software developer | Practices | https://youtu.be/RrcaGukYDyE |
| 1:00-1:25pm | Shemra Rizzo, Genentech | Practices | https://youtu.be/rEYtB3CG76Q |
| 1:30-2:25pm | Break | - | - |
| 2:30-2:55pm | Wijdan Tariq, University of Toronto | Evaluating | - |
| 3:00-3:25pm | Sharla Gelfand, Freelance R Developer | Practices | https://youtu.be/G5Nm-GpmrLw |
| 3:30-3:55pm | Ryan Briggs, University of Guelph | Practices | https://youtu.be/_dgGbxItiB4 |
| 4:00-4:25pm | Monica Alexander, University of Toronto | Practices | https://youtu.be/yvM2C6aZ94k |
| 4:30-4:55pm | Annie Collins, University of Toronto | Practices | https://youtu.be/u4ibhN_nWyI |
| 5:00-5:25pm | Nancy Reid, University of Toronto | Practices | https://youtu.be/sIsOPuZOQL4 |
| 5:30-6:00pm | Rohan Alexander, University of Toronto | Closing remarks | https://youtu.be/7LttFNOI6p8 |
Presenter biographies and
abstracts
Keynotes:
- Eva Vivalt
- Bio: Eva Vivalt is an Assistant Professor in the Department of
Economics at the University of Toronto. Her main research interests are
in cash transfers, reducing barriers to evidence-based decision-making,
and global priorities research.
- Abstract: An overview of the role of forecasts and a new platform
for making them.
- Mine
Çetinkaya-Rundel
- Bio: Mine Çetinkaya-Rundel is a Senior Lecturer in Statistics and
Data Science in the School of Mathematics at the University of
Edinburgh, and
currently on leave as Associate Professor of the Practice in the
Department of Statistical Science at Duke University as well as a
Professional Educator and Data Scientist at RStudio. She is the author
of three open source statistics textbooks and is an instructor for
Coursera. She is the chair-elect of the Statistical Education Section of
the American Statistical Association. Her work focuses on innovation in
statistics pedagogy, with an emphasis on student-centered learning,
computation, reproducible research, and open-source education.
- Abstract: In the beginning was R Markdown. In this talk I will give
a brief review of teaching statistics and data analysis through the lens
of reproducibility with R Markdown, and how to use this tool effectively
in teaching to maintain reproducibility as the scope of your students’
projects and their experience grow.
- Riana
Minocher
- Bio: Riana Minocher is a doctoral student at the Max Planck
Institute for Evolutionary Anthropology in Leipzig. She is an
evolutionary biologist with broad interests. She has worked on a range
of projects on human and non-human primate behaviour and ecology. She is
particularly interested in the evolutionary processes that create and
shape diversity between and within groups. Through her PhD research, she
is keen on exploring the dynamics of cultural transmission and learning
in human populations, to better understand the diverse patterns of
behaviour we observe.
- Abstract: Interest in improving reproducibility, replicability and
transparency of research has increased substantially across scientific
fields over the last few decades. We surveyed 560 empirical,
quantitative publications published between 1955 and 2018, to estimate
the rate of reproducibility for research on social learning, a large
subfield of behavioural ecology. We found supporting materials were
available for less than 30% of publications during this period. The
availability of data declines exponentially with time since publication,
with a half-life of about six years, and this “data decay rate” varies
systematically with both study design and study species. Conditional on
materials being available, we estimate that a reasonable researcher
could expect to successfully reproduce about 80% of published results,
based on our evaluation of a subset of 40 publications. Taken together,
this indicates an overall success rate of 24% for both acquiring
materials and recovering published results (see the arithmetic sketch
after this entry), with non-reproducibility of results primarily due to
unavailable, incomplete, or poorly-documented data. We provide
recommendations to improve the reproducibility of research on the
ecology and evolution of social behaviour.
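The 24% headline figure is simply the product of the two stages; a
two-line sketch of that arithmetic, using the rounded rates quoted in
the abstract:

```python
# Overall success = P(materials available) x P(reproduce | materials).
p_materials = 0.30                  # materials available for < 30% of papers
p_reproduce_given_materials = 0.80  # ~80% of results recovered when they exist
print(round(p_materials * p_reproduce_given_materials, 2))  # 0.24, the 24% rate
```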
Invited presentations:
- Amber
Simpson
- Bio: Amber Simpson is the Canada Research Chair in Biomedical
Computing and Informatics and Associate Professor in the School of
Computing (Faculty of Arts and Science) and Department of Biomedical and
Molecular Sciences (Faculty of Health Sciences). She specializes in
biomedical data science and computer-aided surgery. Her research group
is focused on developing novel computational strategies for improving
human health. She joined the Queen’s University faculty in 2019, after
four years as faculty at Memorial Sloan Kettering Cancer Center in New
York and three years as a Research Assistant Professor in Biomedical
Engineering at Vanderbilt University in Nashville. She is an American
Association of Cancer Research award winner and the holder of multiple
National Institutes of Health grants. She received her PhD in Computer
Science from Queen’s University.
- Abstract: The development of predictive and prognostic biomarkers is
a major area of investigation in cancer research. Our lab specializes in
the development of quantitative imaging markers for personalized
treatment of cancer. Progress in developing these novel markers is
limited by a lack of optimization, standardization, and validation, all
critical barriers to clinical use. This talk will describe our work in
the repeatability and reproducibility of imaging biomarkers.
- Andrés Cruz
- Bio: Andrés Cruz is an adjunct instructor at Pontificia Universidad
Católica de Chile, where he teaches computational social science. He
holds a BA and MA in Political Science, and is the co-editor of “R for
Political Data Science: A Practical Guide” (CRC Press, 2020), an R
manual for social science students and practitioners.
- Abstract: inexact is an RStudio addin to supervise fuzzy joins.
Merging data sets is a simple procedure in most statistical software
packages. However, applied researchers frequently face problems when
dealing with data in which ID variables are not properly standardized.
For instance, politicians’ names can be spelled differently in multiple
sources (press reports, official documents, etc.), causing regular
merging methods to fail. The most common approach to fixing this issue
when working with small and medium data sets is manually correcting the
problematic values before merging. However, this solution is
time-consuming and not reproducible. The inexact RStudio addin was
created to help with this. The package draws on approximate string
matching algorithms, which quantify the distance between two given
strings. When merging data sets with non-standardized ID variables,
inexact users benefit from automatic match suggestions, while also
being able to override the automatic choices when needed, using a
user-friendly graphical user interface (GUI). The output is simply code
to perform the corrected merging procedure, which records the algorithm
employed and any corrections made by the user, ensuring
reproducibility. A development version of inexact is available on
GitHub.
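The kind of approximate matching that inexact supervises can be
sketched with base R's edit distances; this is an illustration of the
underlying idea, not the addin's actual code, and the data frames and
column names are hypothetical:

```r
# Minimal fuzzy-join sketch using base R's adist() (generalized edit
# distance). All data and names here are hypothetical.
official <- data.frame(name = c("Michelle Bachelet", "Sebastian Pinera"),
                       party = c("PS", "RN"))
press <- data.frame(name = c("M. Bachelet", "Sebastian Piñera"),
                    mentions = c(120, 95))

# For each press-report name, take the closest official name
d <- adist(press$name, official$name)
press$matched <- official$name[apply(d, 1, which.min)]

# An ordinary merge now succeeds on the corrected key
merge(press, official, by.x = "matched", by.y = "name")
```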
- Annie Collins
- Bio: Annie Collins is an undergraduate student in the Department of
Mathematics specializing in applied mathematics and statistics with a
minor in history and philosophy of science. In her free time, she
focusses her efforts on student governance, promoting women’s
representation in STEM, and working with data in the non-profit and
charitable sector.
- Abstract: We create a dataset of all the pre-prints published on
medRxiv between 28 January 2020 and 31 January 2021. We extract the text
from these pre-prints and parse them looking for keyword markers
signalling the availability of the data and code underpinning the
pre-print. We are unable to find markers of either open data or open
code for 81 per cent of the pre-prints in our sample. Our paper
demonstrates the need to have authors categorize the degree of openness
of their pre-print as part of the medRxiv submissions process, and more
broadly, the need to better integrate open science training into a wide
range of fields.
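The keyword-marker scan can be illustrated with a short hedged sketch;
the marker list below is invented for illustration and is not the
authors' actual list:

```r
# Illustrative scan of pre-print text for open data/code markers.
# The marker patterns are hypothetical stand-ins.
texts <- c("Code and data are at https://github.com/example/repo.",
           "Data are available from the authors upon reasonable request.")
markers <- c("github\\.com", "osf\\.io", "zenodo", "dataverse")
has_marker <- grepl(paste(markers, collapse = "|"), texts,
                    ignore.case = TRUE)
mean(!has_marker)  # share of texts with no marker; 0.5 in this toy case
```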
- Emily Riederer
- Bio: Emily Riederer is a Senior Analytics Manager at Capital One.
Her team focuses on reimagining our analytical infrastructure by
building data products, elevating business analysis with novel data
sources and statistical methods, and providing consultation and training
to our partner teams.
- Abstract: Complex software systems make performance guarantees
through documentation and unit tests, and they communicate these to
users with conscientious interface design. However, published data
tables exist in a gray area; they are static enough not to be considered
a ‘service’ or ‘software’, yet too raw to earn attentive user interface
design. This ambiguity creates a disconnect between data producers and
consumers and poses a risk for analytical correctness and
reproducibility. In this talk, I will explain how controlled
vocabularies can be used to form contracts between data producers and
data consumers. Explicitly embedding meaning in each component of
variable names is a low-tech and low-friction approach which builds a
shared understanding of how each field in the dataset is intended to
work. Doing so can offload the burden of data producers by facilitating
automated data validation and metadata management. At the same time,
data consumers benefit by a reduction in the cognitive load to remember
names, a deeper understanding of variable encoding, and opportunities to
more efficiently analyze the resulting dataset. After discussing the
theory of controlled vocabulary column-naming and related workflows, I
will illustrate these ideas with a demonstration of the convo R
package, which aids in the creation, upkeep, and application of
controlled vocabularies. This talk is based on my related blog post and
R package.
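The contract idea can be sketched in a few lines of R. The
{measure}_{entity}_{unit} pattern and stub lists below are
hypothetical, and this is a generic illustration rather than convo's
API:

```r
# Validate column names against a hypothetical controlled vocabulary
# with a {measure}_{entity}_{unit} naming pattern (not convo's API).
vocab <- list(measure = c("amt", "n", "ind"),
              entity  = c("cust", "acct"),
              unit    = c("usd", "cnt", "flag"))

valid_name <- function(nm) {
  parts <- strsplit(nm, "_", fixed = TRUE)[[1]]
  length(parts) == 3 &&
    parts[1] %in% vocab$measure &&
    parts[2] %in% vocab$entity &&
    parts[3] %in% vocab$unit
}

vapply(c("amt_cust_usd", "balance"), valid_name, logical(1))
#> amt_cust_usd      balance
#>         TRUE        FALSE
```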
- Florencia D’Andrea
- Bio: Florencia D’Andrea is a post-doc at the Argentine National
Institute of Agricultural Technology where she develops computer tools
to assess the risk of pesticide applications for aquatic ecosystems. She
holds a PhD in Biological Sciences from the University of Buenos Aires,
Argentina, and is part of the ReproHack core-team and the R-Ladies
global team. She believes that code and data should also be recognized
as valuable products of scientific work.
- Abstract: Choose your own adventure to a reproducible scientific
article: learnings from ReproHack. “I shared the code and data of my
last scientific article; does that mean it is reproducible?” One might
think that having access to the research data and the code used to
analyze that data would be enough to reproduce published results, but
it is often much more involved than that. Does reproducibility depend
on the reviewer’s knowledge? What things that we do not usually think
about can affect reproducibility? Can the choice of how to capture the
computational environment influence the experience of the reviewer? In
this talk, we will think together about some of the steps necessary for
someone else to be able to reproduce a scientific article or project. I
will share some thoughts from my experience with ReproHack and show how
reviewing is a great way to learn about reproducibility. What is
ReproHack? ReproHack is a hackathon-style event focused on the
reproducibility of research results. These hackathons provide a
low-pressure sandbox environment for practicing reproducible research:
authors can practice producing reproducible research and receive
friendly feedback and appreciation of their efforts; participants can
practice reviewing and learn about reproducibility best practices, as
well as common pitfalls, by working with real-life materials rather
than dummy examples, and they also get inspired and grow more confident
about working openly themselves; and the research community benefits
from evaluating what best practice looks like in practice, and from
more practice in both developing and reviewing materials.
- Garret
Christensen
- Bio: Garret Christensen received his economics PhD from UC Berkeley
in 2011. He is an economist with the FDIC. Before that he worked for the
Census Bureau, and he was a project scientist with the Berkeley
Initiative for Transparency in the Social Sciences and a Data Science
Fellow with the Berkeley Institute for Data Science.
- Abstract: Adoption of Open Science Practices is Increasing: Survey
Evidence on Attitudes, Norms and Behavior in the Social Sciences. Has
there been meaningful movement toward open science practices within the
social sciences in recent years? Discussions about changes in practices
such as posting data and pre-registering analyses have been marked by
controversy—including controversy over the extent to which change has
taken place. This study, based on the State of Social Science (3S)
Survey, provides the first comprehensive assessment of awareness of,
attitudes towards, perceived norms regarding, and adoption of open
science practices within a broadly representative sample of scholars
from four major social science disciplines: economics, political
science, psychology, and sociology. We observe a steep increase in
adoption: as of 2017, over 80% of scholars had used at least one such
practice, rising from one quarter a decade earlier. Attitudes toward
research transparency are on average similar between older and younger
scholars, but the pace of change differs by field and methodology.
Consistent with theories of normal science and scientific change, the
timing of increases in adoption coincides with technological innovations
and institutional policies. Patterns are consistent with most scholars
underestimating the trend toward open science in their discipline.
- Jake Bowers
- Bio: Jake Bowers is a Senior Scientist at The Policy Lab and a
member of the Lab’s data science practice. Jake is Associate Professor
of Political Science and Statistics at the University of Illinois
Urbana-Champaign. He has served as a Fellow in the Office of Evaluation
Sciences in the General Services Administration of the US Federal
Government and is Methods Director for the Evidence in Governance and
Politics network. Jake holds a PhD in Political Science from the
University of California, Berkeley, and a BA in Ethics, Politics and
Economics from Yale University.
- Abstract: For evidence-based public policy to grow in impact and
importance, practices to enhance scientific credibility should be
brought into governmental contexts and also should be modified for those
contexts. For example, few analyses of governmental data allow data
sharing (in contrast with most scientific studies); and many analyses of
governmental administrative data inform high stakes immediate decisions
(in contrast with the slow accumulation of scientific knowledge). We
make several proposals to adjust scientific norms of reproducibility and
pre-registration to the policy context.
- John Blischak
- Bio: John Blischak is a freelance scientific software developer for
the life sciences industry. He is the primary author of the R package
workflowr and the co-maintainer of the CRAN Task View on Reproducible
Research. He received his PhD in Genetics from the University of
Chicago.
- Abstract: The workflowr R package helps organize computational
research in a way that promotes effective project management,
reproducibility, collaboration, and sharing of results. workflowr
combines literate programming (knitr and rmarkdown) and version control
(Git, via git2r) to generate a website containing time-stamped,
versioned, and documented results. Any R user can quickly and easily
adopt workflowr, which includes four key features: (1) workflowr
automatically creates a directory structure for organizing data, code,
and results; (2) workflowr uses the version control system Git to track
different versions of the code and results without the user needing to
understand Git syntax; (3) to support reproducibility, workflowr
automatically includes code version information in webpages displaying
results; and (4) workflowr facilitates online web hosting (e.g. GitHub
Pages) to share results. Our goal is that workflowr will make it easier
for researchers to organize and communicate reproducible results.
Documentation and source code are available.
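A minimal session with workflowr's core functions might look like the
following; the project name is a placeholder:

```r
# Minimal workflowr session; "myproject" is a placeholder name.
library(workflowr)

wflow_start("myproject")   # create the directory skeleton and Git repo
# ... write analyses in analysis/*.Rmd ...
wflow_build()              # render the Rmd files into the local website
wflow_publish("analysis/index.Rmd",
              message = "Add first analysis")  # commit source and results
```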
- Julia Schulte-Cloos
- Bio: Julia Schulte-Cloos is a Marie Skłodowska-Curie funded research
fellow at LMU Munich. She earned her PhD in Political Science from
the European University Institute. Julia is passionate about developing
tools and templates for generating reproducible workflows and creating
reproducible research outputs with R Markdown.
- Abstract: We present a template package in R that allows users
without any prior knowledge of R Markdown to implement reproducible
research practices in their scientific workflows. We provide a single
Rmd-file that is fully optimized for two different output formats, HTML
and PDF. While in the stage of explorative analysis and when focusing on
content only, researchers may rely on the ‘draft mode’ of our template,
which knits to HTML. When in the stage of research dissemination and
focusing on the presentation of results, in contrast, researchers may
rely on the ‘manuscript mode’, which knits to PDF. Our template
outlines the basics of successfully writing a reproducible paper in R
Markdown by showing how to include citations, figures, and
cross-references. It also provides examples of using ggplot2 to include
plots, both in static and animated outputs, and it shows how to present
the most commonly used tables in scientific research (descriptive
statistics and regression tables). Finally, in our template, we discuss
some more advanced features of literate programming and helpful tweaks
in R Markdown.
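The two-mode design rests on R Markdown's ability to knit a single
source file to different output formats. A sketch of that underlying
mechanism (the file name is a placeholder; the template's actual modes
are configured in its YAML header):

```r
# Knit one hypothetical Rmd source to both output formats.
library(rmarkdown)

render("paper.Rmd", output_format = "html_document")  # fast 'draft mode'
render("paper.Rmd", output_format = "pdf_document")   # 'manuscript mode'
```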
- Lauren Kennedy
- Bio: Lauren Kennedy is a lecturer in the Econometrics and Business
Statistics department at Monash University. She works on applied
statistical problems in the social sciences using primarily Bayesian
methodology. Her most recent work is with survey data, particularly the
use of model-and-poststratify methods to make population and
subpopulation predictions.
- Abstract: Survey data is challenging to work with. It frequently
contains entry errors (whether from respondent recollection or
interviewer entry) that are difficult to verify and identify. Survey
data is often received in a form that suits the software used for data
entry, which is not necessarily a structure that is intuitive for an
analyst to work with. When we
consider the use of tools like multilevel regression and
poststratification, our challenges compound. Even if the population data
is precleaned before release, measurements and items in the sample need
to be mapped to measurements and items in the population. In this talk
we discuss case studies of how and where these challenges appear in
practice.
- Larry Fenn
- Bio: Larry Fenn is a data journalist at the Associated Press. His
investigative work has covered a broad range of topics, from guns to
education to housing policy. Prior to journalism, he was an adjunct
lecturer at Hunter College for applied mathematics and statistics.
- Abstract: Please see Meghan Hoyer.
- Mauricio Vargas Sepúlveda
- Bio: Mauricio Vargas Sepúlveda loves working with data and
statistical programming, and is constantly learning new skills and
tooling in his spare time. He mostly works in R due to its huge number
of libraries and emphasis on reproducible analysis.
- Abstract: Evidence-based policymaking has become a high priority for
governments across the world. The possibility of gaining efficiencies
in public expenditure and of linking policy design to desired outcomes
has been presented as a significant advantage for the field of
comparative policy. However, the same movement that supports the use of
evidence in public policy decision-making has brought great concern
about the sources of the supposed evidence. How should policymakers
evaluate the evidence? The possibilities are open and depend on the
institutional arrangements that support governmental operation and the
possibility of properly judging the nature of the evidence. The science
reproducibility movement could enlighten the discussion about the
quality of evidence by providing a structured approach to a source’s
validity, based on the possibility of reproducing the logic and
analysis proper to scientific communication. This paper analyzes the
nature and quality of civil society organizations’ contributions to the
development of evidence for the policymaking process from a
reproducibility perspective.
- Meghan Hoyer
- Bio: Meghan Hoyer is Data Director at The Washington Post where she
leads data projects and acts as a consulting editor on data-driven
stories, graphics and visualizations across the newsroom. Before this
she helped lead the AP’s data journalism. Meghan earned a bachelor of
science in journalism at Northwestern University and an MFA in creative
nonfiction writing at Old Dominion University.
- Abstract: This talk will cover AP DataKit, an open-source
command-line tool designed to better structure and manage projects,
and will, more generally, discuss creating sane, reproducible
workflows.
- Monica Alexander
- Bio: Monica Alexander is an Assistant Professor in Statistical
Sciences and Sociology at the University of Toronto. She received her
PhD in Demography from the University of California, Berkeley. Her
research interests include statistical demography, mortality and health
inequalities, and computational social science.
- Abstract: Sharing code for papers and projects is an important part
of reproducible research. However, sometimes sharing code may be
difficult, if the researcher feels their code is ‘not good enough’ and
may reflect poorly on their broader research skills. This presentation
contains some brief reflections from research, consulting, and teaching
experiences that have helped me overcome my own barriers to sharing
code, and that may help others do the same.
- Nancy Reid
- Bio: Nancy Reid is Professor of Statistical Sciences at the
University of Toronto and Canada Research Chair in Statistical Theory
and Applications. Her main area of research is theoretical statistics.
This treats the foundations and properties of methods of statistical
inference. She is interested in how best to use information in the data
to construct inferential statements about quantities of interest. A very
simple example of this is the widely quoted ‘margin of error’ in the
reporting of polls, another is the ubiquitous ‘p-value’ reported in
medical and health studies. Much of her research considers how to ensure
that these inferential statements are both accurate and effective at
summarizing complex sets of data.
- Abstract: Are p-values contributing to a crisis in replicability and
reproducibility? This has been the topic of many dialogues, diatribes,
and discussions among statisticians and scientists in recent years. I
will share my thoughts on the issues, with emphasis on the role of
inferential theory in helping to clarify the arguments.
- Nick Radcliffe
- Bio: Nick Radcliffe is the founder of the data science consulting
and software firm, Stochastic Solutions Limited, the Interim Chief
Scientist at the Global Open Finance Centre of Excellence, and a
Visiting Professor in Maths and Stats at University of Edinburgh,
Scotland. His background combines theoretical physics, operations
research, machine learning and stochastic optimization. Nick’s current
research interests include a focus on test-driven data analysis (an
approach to improving the correctness of analytical results that
combines ideas from reproducible research and test-driven development)
and
privacy-respecting analysis. He is the lead author of the open-source
Python tdda package, which provides practical tools for testing
analytical software and data, and also of the Miró data analysis
suite.
- Abstract: The Global Open Finance Centre of Excellence is currently
engaged in analysis of the financial impact of COVID-19 on the citizens
and businesses of the UK. This research uses non-consented but
de-identified financial data on individuals and businesses, on the basis
of legitimate interest. All analysis is carried out in a highly
locked-down analytical environment known as a Safe Haven. This talk will
explain our approach to the challenges of ensuring the correctness and
robustness of results in an environment where neither code nor input
data can be opened up for review and even outputs need to be subject to
disclosure control to further reduce any risks to privacy. Topics will
include: testing input data for conformance and lack of personal
identifiers using constraints; multiple implementations and verification
of equivalence of results; regression tests and reference tests;
verification of output artefacts; verification of output disclosure
controls; data provenance and audit trails; test-driven data
analysis—the underlying philosophy (and library) that we use to underpin
this work.
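Constraint checking of input data can be sketched generically in R.
This is a hypothetical illustration in the spirit of test-driven data
analysis, not the (Python) tdda package's API; the column names and
rules are invented:

```r
# Test input data against declared constraints before analysis.
# Columns and rules are hypothetical.
check_constraints <- function(df) {
  stopifnot(
    "amount must be non-negative"  = all(df$amount >= 0, na.rm = TRUE),
    "dates within study window"    = all(df$date >= as.Date("2020-01-01")),
    "no direct identifier columns" = !any(c("name", "nino") %in% names(df))
  )
  invisible(df)
}

txns <- data.frame(amount = c(12.5, 3.2),
                   date = as.Date(c("2020-03-01", "2020-04-15")))
check_constraints(txns)  # silent on success; fails loudly on a violation
```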
- Nicolas Didier
- Bio: Nicolas Didier is studying for a PhD in Public Administration
and Policy at Arizona State University. During his PhD and previous
studies, he has worked extensively on developing evidence to inform
policy on labour markets and public expenditure.
- Abstract: Please see Mauricio Vargas Sepúlveda.
- Ryan Briggs
- Bio: Ryan Briggs is a social scientist who studies the political
economy of poverty alleviation. Most of his research focuses on the
spatial targeting of foreign aid. He is an Assistant Professor in the
Guelph Institute of Development Studies and Department of Political
Science at the University of Guelph. Before that, he taught at Virginia
Tech and American University.
- Abstract: It is hard to do research. One reason is that research has
a production function in which one low-quality input (among many
high-quality inputs) can poison the final result. This talk explains
how such ‘o-ring’ production functions work and draws out lessons for
applied researchers.
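A toy numerical illustration of the o-ring logic (the step names and
quality values are ours, purely illustrative):

```r
# O-ring production: overall quality is the product of step qualities,
# so a single weak step dominates. Numbers are purely illustrative.
steps <- c(design = 0.95, data = 0.95, code = 0.95, writeup = 0.95)
prod(steps)                         # ~0.81 when every input is strong
prod(replace(steps, "code", 0.3))   # ~0.26: one weak input poisons it all
```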
- Sharla Gelfand
- Bio: Sharla Gelfand is a freelance R and Shiny developer
specializing in enabling easy access to data and replacing manual,
redundant processes with ones that are automated, reproducible, and
repeatable. They also co-organize R-Ladies Toronto and the Greater
Toronto Area R User Group. They like R (of course), dogs, learning
Spanish, playing bass, and punk.
- Abstract: Getting stuck, looking around for a solution, and
eventually asking for help is an inevitable and constant aspect of being
a programmer. If you’ve ever looked up a question only to find some
brave soul getting torn apart on Stack Overflow for not providing a
minimum working example, you know it’s also one of the most intimidating
parts! A minimum working example, or a reproducible example as it’s more
often called in the R world, is one of the best ways to get help with
your code - but what exactly is a reproducible example? How do you
create one, and do it efficiently? Why is it so scary? This talk will
cover what components are needed to make a good reproducible example to
maximize your ability to get help (and to help yourself!), strategies
for coming up with an example and testing its reproducibility, and why
you should care about making one. We will also discuss how to extend the
concept of reproducible examples beyond “Help! my code doesn’t work” to
other environments where you might want to share code, like teaching and
blogging.
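One common tooling route in R is the reprex package (our illustration
of the general advice, not necessarily the talk's example); the snippet
inside the call is a hypothetical question about NA handling:

```r
# Build a self-contained reproducible example: libraries, minimal data,
# and the puzzling code, rendered with its output for easy sharing.
library(reprex)

reprex({
  df <- data.frame(x = c(1, 2, NA))
  mean(df$x)  # returns NA: the behaviour I want to ask about
})
# reprex() renders the snippet together with its output, ready to paste
# into Stack Overflow or a GitHub issue.
```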
- Shemra Rizzo
- Bio: Shemra Rizzo is a senior data scientist in Genentech’s
Personalized Healthcare group. Shemra’s role includes research on
COVID-19 using electronic health records, and the development of
data-driven approaches to evaluate clinical trial eligibility criteria.
Shemra obtained her PhD in Biostatistics from UCLA. Before joining
Genentech, Shemra was an assistant professor of statistics at UC
Riverside, where her research covered topics in mental health, health
disparities, and nutrition. In her free time, Shemra enjoys spending
time with her family and running.
- Abstract: Real-world data for an emerging disease has unique
challenges. In this talk, I’ll describe how our group made sense of
complex Electronic Health Records (EHR) data for COVID-19 early in the
pandemic. I will share our experience working towards reliable,
replicable and reproducible studies using EHR licensed data.
- Shiro Kuriwaki
- Bio: Shiro Kuriwaki is a PhD Candidate in the Department of
Government at Harvard University. His research focuses on democratic
representation in American Politics. In an ongoing project, he studies
the structure of voters’ choices across levels of government and the
political economy of local elections, using cast vote records and
surveys. His other projects also help understand the mechanics of
representation, including: public opinion and Congress, modern survey
statistics and causal inference, and election administration. Prior to
and during graduate school, he worked at the Analyst Institute in
Washington D.C.
- Abstract: I show how new features of the dataverse R package
facilitate reproducibility in empirical, substantive projects. While
packages and scripts make our code transparent and portable, the import
of large and complex datasets is often a nuisance in project workflows
that involve various data cleaning and wrangling tasks, and the GUI for
Dataverse can sometimes be tedious to integrate into a code-based
workflow. Will Beasley and I, along with multiple other contributors,
updated the dataverse R package for the first time since 2017 with the
goal of spreading its use in empirical workflows. In this iteration, we
make it easier to retrieve data frames in various file formats, with
options for version specification and variable subsetting. I also
discuss the latest updates to pyDataverse, an independent
implementation in Python that is currently more advanced in its
implementation but focused on uploading and creating datasets on
Dataverse.
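Retrieving a file as a data frame with the updated package might look
like the following; the function is from the package's documented
interface, while the file name and DOI are placeholders:

```r
# Sketch of pulling a Dataverse file straight into a data frame.
library(dataverse)

dat <- get_dataframe_by_name(
  filename = "survey_sample.tab",        # placeholder file name
  dataset  = "doi:10.7910/DVN/XXXXXX",   # placeholder dataset DOI
  server   = "dataverse.harvard.edu",
  .f       = readr::read_tsv             # reader applied to the download
)
```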
- Simeon Carstens
- Bio: Simeon Carstens is a Data Scientist at Tweag I/O, a software
innovation lab and consulting company. Originally a physicist, Simeon
did a PhD and postdoc research in computational biology, focusing on
Bayesian determination of three-dimensional chromosome structures.
- Abstract: Data analysis often requires a complex software
environment containing one or several programming languages,
language-specific modules and external dependencies, all in compatible
versions. This poses a challenge to reproducibility: what good is a
well-designed, tested and documented data analysis pipeline if it is
difficult to replicate the software environment required to run it?
Standard tools such as Python / R virtual environments solve part of the
problem, but do not take into account external and system-level
dependencies. Nix is a fully declarative, open-source package manager
solving this problem: a program packaged with Nix comes with a complete
description of its full dependency tree, down to system libraries. In
this presentation, I will give an introduction to Nix, show in a live
demo how to set up a fully reproducible software environment and compare
Nix to existing solutions such as virtual environments and Docker.
- Tiffany Timbers
- Bio: Tiffany Timbers is an Assistant Professor of Teaching in the
Department of Statistics and a Co-Director of the Master of Data
Science program (Vancouver Option) at the University of British
Columbia. In these roles she teaches and develops curriculum around the
responsible application of Data Science to solve real-world problems.
One of her favourite courses she teaches is a graduate course on
collaborative software development, which focuses on teaching how to
create R and Python packages using modern tools and workflows.
- Abstract: In the data science courses at UBC, we define data science
as the study and development of reproducible and auditable processes to
obtain value (i.e., insight) from data. While reproducibility is core to
our definition, most data science learners enter the field with other
aspects of data science in mind, such as predictive modelling. This
fact, along with the highly technical nature of the industry-standard
reproducibility tools currently employed in data science, presents
out-of-the-gate challenges in teaching reproducibility in the data
science classroom. Put simply, students are not as intrinsically
motivated to learn this topic, and it is not an easy one for them to
learn. What can a data science educator do? Over several iterations of
teaching courses focused on reproducible data science tools and
workflows, we have found that motivation, direct instruction and
practice are key to effectively teaching this challenging, yet important
subject. In this talk, I will present examples of how we deeply
motivate, effectively instruct and provide ample practice opportunities
to our Master of Data Science students to effectively engage them in
learning about this topic.
- Tom Barton
- Bio: Tom Barton is a PhD student in Politics at Royal Holloway,
University of London. His PhD focuses on the impact of Voter
Identification laws on political participation and attitudes. More
generally his interests include elections, public opinion (particularly
social values) and quantitative research methods.
- Abstract: I reproduce Surridge, 2016, ‘Education and liberalism:
pursuing the link’, Oxford Review of Education, 42:2,
pp. 146-164, using the 1970 British Cohort Study (BCS70), but with a
difference-in-differences regression approach and more waves of data. I
find that whilst there is evidence for both the socialisation and
self-selection models, self-selection dominates the link between social
values and university attendance. This is counter to what Surridge
(2016) concluded. The need for re-specification was two-fold: first,
Surridge’s methodology did not fully test for causality; and second,
data from later waves have since become available.
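The difference-in-differences setup can be sketched with simulated
data; the variable names are hypothetical stand-ins, not BCS70's actual
measures:

```r
# Difference-in-differences sketch on simulated data.
set.seed(1)
bcs70 <- data.frame(university = rep(c(0, 1), each = 200),
                    post       = rep(c(0, 1), times = 200))
bcs70$liberalism <- 3 + 0.5 * bcs70$university + 0.2 * bcs70$post +
  0.4 * bcs70$university * bcs70$post + rnorm(400)

fit <- lm(liberalism ~ university * post, data = bcs70)
coef(fit)["university:post"]  # the difference-in-differences estimate
```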
- Tyler Girard
- Bio: Tyler Girard is a PhD Candidate in political science at the
University of Western Ontario (London, Ontario, Canada). His
dissertation research seeks to explain the origins and diffusion of the
global financial inclusion agenda by focusing on the role of ambiguous
ideas in mobilizing and consolidating transnational coalitions. More
generally, his work also explores new approaches to conceptual
measurement in international relations.
- Abstract: In what ways can we incorporate reproducible practices in
pedagogy for social science courses? I discuss how individual and group
exercises centered around the replication of existing datasets and
analyses offer a flexible tool for experiential learning. However,
maximizing the benefits of such an approach requires customizing the
activity to the students and the availability of instructor support. I
offer several suggestions for effectively using replication exercises in
both undergraduate and graduate level courses.
- Wijdan Tariq
- Bio: Wijdan Tariq is an undergraduate student in the Department of
Statistical Sciences at the University of Toronto.
- Abstract: I undertake a narrow replication of Caicedo, 2019, ‘The
Mission: Human Capital Transmission, Economic Persistence, and Culture
in South America’, Quarterly Journal of Economics, 134:1,
pp. 507-556. Caicedo reports on a remarkable, religiously inspired human
capital intervention that took place in remote parts of South America
250 years ago and whose positive economic effects, he claims, persist to
this day. I replicate some of the paper’s key results using data files
that are available on the Harvard Dataverse portal. I discuss some
lessons learned in the process of replicating this paper and share some
reflections on the state of reproducibility in economics.
- Yanbo Tang
- Bio: Yanbo Tang is a PhD candidate at the University of Toronto in
the Department of Statistical Sciences, under the joint supervision of
Nancy Reid and Daniel Roy. He is interested in the study and application
of methods in higher order asymptotics and statistical inference in the
presence of many nuisance parameters. Nowadays, he works under the
careful gaze of his pet parrot.
- Abstract: Hypothesis testing results often rely on simple, yet
important assumptions about the behavior of the distribution of
p-values under the null and alternative. We show that commonly held
beliefs regarding the distribution of p-values are misleading when the
variance or location of the test statistic is not well-calibrated, or
when the higher-order cumulants of the test statistic are not
negligible. We further examine the impact of these misleading p-values
on the reproducibility of scientific studies, with some examples
focused on GWAS studies. Certain corrected tests are proposed and shown
to perform better than their traditional counterparts in certain
settings.
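The calibration point can be illustrated with a tiny simulation (the
numbers are ours, purely illustrative):

```r
# p-values are uniform under the null only when the test statistic is
# well-calibrated; an inflated variance distorts the distribution.
set.seed(1)
z_good <- rnorm(1e5)            # correctly calibrated N(0, 1) statistic
z_bad  <- rnorm(1e5, sd = 1.2)  # variance miscalibrated by 20%
p_good <- 2 * pnorm(-abs(z_good))
p_bad  <- 2 * pnorm(-abs(z_bad))
mean(p_good < 0.05)  # ~0.05, the nominal rate
mean(p_bad  < 0.05)  # ~0.10: false positives roughly double
```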
Code of conduct
Code
The organizers of the Toronto Workshop on Reproducibility are
dedicated to providing a harassment-free experience for everyone
regardless of age, gender, sexual orientation, disability, physical
appearance, race, or religion (or lack thereof).
All participants (including attendees, speakers, sponsors and
volunteers) at the Toronto Workshop on Reproducibility are required to
agree to the following code of conduct.
The code of conduct applies to all conference activities including
talks, panels, workshops, and social events. It extends to
conference-specific exchanges on social media, for instance posts tagged
with the identifier of the conference (e.g. #TOrepro on Twitter), and
replies to such posts.
Organizers will enforce this code throughout and expect cooperation
in ensuring a safe environment for all.
Expected Behaviour
All conference participants agree to:
- Be considerate in language and actions, and respect the boundaries
of fellow participants.
- Refrain from demeaning, discriminatory, or harassing behaviour and
language. Please refer to ‘Unacceptable Behaviour’ for more
details.
- Alert Rohan Alexander - rohan.alexander@utoronto.ca - or Kelly Lyons - kelly.lyons@utoronto.ca - if you notice someone in
distress, or observe violations of this code of conduct, even if they
seem inconsequential. Please refer to the section titled ‘What To Do If
You Witness or Are Subject To Unacceptable Behaviour’ for more
details.
Unacceptable Behaviour
Behaviour that is unacceptable includes, but is not limited to:
- Stalking
- Deliberate intimidation
- Unwanted photography or recording
- Sustained or willful disruption of talks or other events
- Use of sexual or discriminatory imagery, comments, or jokes
- Offensive comments related to age, gender, sexual orientation,
disability, race or religion
- Inappropriate physical contact, which can include grabbing,
massaging, or hugging without consent.
- Unwelcome sexual attention, which can include inappropriate
questions of a sexual nature, asking for sexual favours or repeatedly
asking for dates or contact information.
If you are asked to stop harassing behaviour, you should stop
immediately. Even if your behaviour was meant to be friendly or a joke,
it was clearly not taken that way, and for the comfort of all
conference attendees you should stop.
Attendees who behave in a manner deemed inappropriate are subject to
actions listed under ‘Procedure for Code of Conduct Violations’.
Additional Requirements for Conference
Contributions
Presentation slides and posters should not contain offensive or
sexualised material. If this material is impossible to avoid given the
topic (for example text mining of material from hate sites) the
existence of this material should be noted in the abstract and, in the
case of oral contributions, at the start of the talk or session.
Procedure for Code of Conduct Violations
The organizing committee reserves the right to determine the
appropriate response for all code of conduct violations. Potential
responses include:
- a formal warning to stop harassing behaviour
- expulsion from the conference
- cancellation or early termination of talks or other contributions to
the program
What To Do If You Witness or Are Subject To Unacceptable
Behaviour
If you are being harassed, notice that someone else is being
harassed, or have any other concerns relating to harassment, please
contact Rohan Alexander - rohan.alexander@utoronto.ca, or Kelly Lyons - kelly.lyons@utoronto.ca.
We will take all good-faith reports of harassment by Toronto Workshop
on Reproducibility participants seriously.
We reserve the right to reject any report we believe to have been
made in bad faith. This includes reports intended to silence legitimate
criticism.
We will respect confidentiality requests for the purpose of
protecting victims of abuse. We will not name harassment victims without
their affirmative consent.
Questions or concerns about the Code of Conduct can be addressed to
rohan.alexander@utoronto.ca.
Acknowledgements
Parts of the above text are licensed CC BY-SA 4.0. Credit to SRCCON.
This code of conduct was based on that developed for useR! 2018 which
was a revision of the code of conduct used at previous useR!s and also
drew from rOpenSci’s code of conduct.