I look at Toronto shelter usage numbers between 2017 and January 2021. I document and adjust for a systematic error in the 2017 data. The data show that homelessness in Toronto is a large problem; essentially all shelters are almost always full. COVID changed the nature of the problem by reducing the number of people using shelters, however they must be sleeping somewhere. I compare January 2021 with January 2020 and estimate that following COVID there are now around an additional 3,500 people sleeping somewhere outside of shelters, possibly outdoors.
The extent of homelessness in Toronto was highlighted for the R community in December 2020 when the dataset was used in TidyTuesday. That dataset contained data for 2017 to 2019, inclusive. In this post I expand the dataset through to January 2021 to see what has happened since COVID. I also document and adjust for a systematic error in the 2017 dataset. I compare the usage of shelters this year with last year. I estimate that on Friday, 22 January 2021, when it was -11C with winds of 17km/h, there were roughly 3,634 additional people sleeping somewhere other than a shelter, compared with this time last year.
Set up workspace and gather data
I’ll use the R statistical programming language (R Core Team 2020). The datasets are accessed via the opendatatoronto package (Gelfand 2020). This package wraps around the City of Toronto’s Open Data Portal and allows the direct import of data rather than the need to visit the website. The website is great, but using the package enhances reproducibility. I’ll use the tidyverse package to make data manipulation easier (Wickham et al. 2019).
library(opendatatoronto)library(tidyverse)
── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Get the data# Based on https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-12-01/readme.mdall_data <- opendatatoronto::search_packages("Daily Shelter Occupancy") %>% opendatatoronto::list_package_resources() %>% dplyr::filter(name %in%c("daily-shelter-occupancy-2017-csv","daily-shelter-occupancy-2018-csv", "daily-shelter-occupancy-2019-csv", "daily-shelter-occupancy-2020.csv")) %>%group_split(name) %>%# Don't totally understand how this worksmap_dfr(get_resource, .id ="file")write_csv(all_data, "inputs/raw_data.csv")
Let’s just have a quick look at the data.
all_data <-read_csv("inputs/raw_data.csv", col_types =c("iiccccccccccii")) # The col_types hieroglyphics above stand for integer, integer, character... etc.head(all_data)
The column names aren’t overly nice to type, there are a few columns that we’re not really going to use much, and finally a few of the columns have data that are less obvious than they should be. For instance, the ‘file’ tells us the year of the data, but because of the import settings it’s 1, 2… instead of 2017, 2018….
toronto_shelters <- all_data %>% janitor::clean_names() %>%# Make the column names easier to type. Thanks Sharla!mutate(file_year =case_when( file =="1"~2017, file =="2"~2018, file =="3"~2019, file =="4"~2020,TRUE~-1) ) %>%# Just make the column easier to deal withselect(-id, -file)
The main issue with the data is the dates. In 2017-2019 (inclusive) they appear to be year-month-day, but for 2020 it seems like month-day-year. The separator is also inconsistent between ‘-’ and ‘/’. I’ll first clean that up, check our guesses, and then get to the main issue. I’m going to draw on the lubridate package (Grolemund and Wickham 2011).
library(lubridate)toronto_shelters <- toronto_shelters %>%# 1st line removes times (probs don't actually need to do) and 2nd makes the separation consistentmutate(occupancy_date =str_remove(occupancy_date, "T[:digit:]{2}:[:digit:]{2}:[:digit:]{2}"),occupancy_date =str_replace_all(occupancy_date, "/", "-") ) %>%# Parsing differs between 2017-2019 and 2020. Last line is a catch-all - shouldn't get there.mutate(date =case_when( file_year =="2020"~mdy(occupancy_date, quiet =TRUE), file_year %in%c("2017", "2018", "2019") ~ymd(occupancy_date, quiet =TRUE),TRUE~ NA_Date_ ) ) %>%select(file_year, date, occupancy_date, organization_name:capacity)
Check content of day, month, and year
Let’s just check that my guess of the date orderings was at least plausible by looking at the distribution of year, month, and day bits.
toronto_shelters %>%filter(file_year %in%c(2017, 2018, 2019)) %>%count(one) %>%rename(Year = one, Number = n) %>% kableExtra::kbl(caption ="Count of entries by year for 2017-2019") %>% kableExtra::kable_styling()
Count of entries by year for 2017-2019
Year
Number
2017
38700
2018
37770
2019
39446
toronto_shelters %>%filter(file_year %in%c(2017, 2018, 2019)) %>%count(two) %>%rename(Month = two, Number = n) %>% kableExtra::kbl(caption ="Count of entries by month for 2017-2019") %>% kableExtra::kable_styling()
toronto_shelters %>%filter(file_year ==2020) %>%count(three) %>%rename(Year = three, Number = n) %>% kableExtra::kbl(caption ="Count of entries by year for 2020") %>% kableExtra::kable_styling()
Count of entries by year for 2020
Year
Number
2020
41061
That’s all looking fine. We’d know that we have issues if the distribution of the days wasn’t roughly uniform, or if we have values other than [1-12] in the month.
Check columns agree about the year
Let’s now also check that the year implied by the date matches the year implied by the file.
That’s also fine. And I’ll clean-up by removing the unnecessary columns.
One last thing - plot raw data
Everything seems fine, but it’s always important to ‘Plot. Your. Raw. Data.’, so before moving on, I should plot the raw data to see if there’s anything else going on. (Here, students seem to get confused what ‘raw’ means; I’m using it to refer to as close to the original dataset as possible, so no sums, or averages, etc, if possible. Not necessarily before any cleaning. Sometimes your data are too disperse for that so there will be an element of manipulation. The main point is that you, at the very least, need to plot the data that you’re going to be modelling.)
Let’s just plot the order. As this dataset has been put together by a human we’d expect that it’d be in order of date. Let’s just plot the date in the order it appears in the dataset (Figure @ref(fig:plotorder)).
😱😱😱 This is a bit of a ‘hacky’ graph but it illustrates the point which is that the data are not in order in the dataset. If they were in order, then we’d expect them to be along the diagonal.
It’s super weird that they’re not in order in the raw data. Above, I checked by splitting them into pieces (day, month, year) and the counts were okay. But the ‘hacky’ graph was pretty hacky, so let’s try to summarise the data a little and then have another look. We’ll get a count by date and the sector of the shelter.
# Based on Lisa Lendway: # https://github.com/llendway/tidy_tuesday_in_thirty/blob/main/2020_12_01_tidy_tuesday.Rmdtoronto_shelters_by_day <- toronto_shelters %>%# We only want rows with both occupancy and capacity tidyr::drop_na(occupancy, capacity) %>%# We want to know the occupancy by date and sectorgroup_by(date, sector) %>%summarise(occupancy =sum(occupancy),capacity =sum(capacity),usage = occupancy / capacity, .groups ='drop')head(toronto_shelters_by_day)
We are interested in availability of shelter spots in Toronto on the basis of sector for each day. Different sectors focus on different folks: Co-ed, Families, Men, Women, Youth. Now for each day for each sector we have a proportion (note: horrifyingly >1 is possible). In the notes to the data we’re told that the capacity in 2020 may not be accurate, so for this chart we’ll just focus on 2017-2019 (inclusive) (Figure @ref(fig:plotoccupancyrate)).
That one looks kind of okay, but we’ll again see the problem immediately when we plot the raw number occupied (we can bring this through to include 2020 as it’s not to do with capacity) (Figure @ref(fig:plotoccupancy)).
We can see that using modified data hides the problem. Let’s focus on 2017, as that’s where the biggest issue is and facet by month (Figure @ref(fig:sheltersin2017)).
toronto_shelters_by_day %>%filter(year(date) ==2017) %>%ggplot(aes(x =day(date), y = occupancy, color = sector)) +geom_point(aes(group = sector), alpha =0.3) +scale_y_continuous(limits =c(0, NA)) +labs(color ="Type",x ="Day",y ="Occupancy (number)",title ="Toronto shelters in 2017",subtitle ="Occupancy per day") +facet_wrap(vars(month(date, label =TRUE)),scales ="free_x") +theme_minimal() +scale_color_brewer(palette ="Set1")
Occupancy in Toronto shelters in 2017
Just to check, let’s plot the same for 2018 (Figure @ref(fig:sheltersin2018)).
This gives us an idea of what we ought to expect in 2017 - why should they be significantly different? To start, focus on January 2017 and see if that makes it clearer what is going on (Figure @ref(fig:sheltersinjan2017)).
Daily occupancy in Toronto shelters in February 2017
We’ve clearly got a problem with the first twelve days of the month. We noted at the start that when you look at the data it’s a bit odd in that it’s not in order. Let’s take another look at that by going back to the data as it was given to us (as opposed to the data by day that we’ve been using) (Figure @ref(fig:sheltersin2017byrow)).
toronto_shelters %>%mutate(counter =1:nrow(toronto_shelters)) %>%filter(year(date) ==2017) %>%ggplot(aes(x = counter, y = date)) +geom_point(alpha =0.3) +labs(x ="Row in the dataset",y ="Date of that row") +theme_minimal()
Date of each row in order in 2017
Although there’s no rule that says the dataset has to be in order of the date, if it were, then all the points would lie on the diagonal line. We have a lot of deviation from that. To get a sense of what we’re expecting let’s look at all four years (Figure @ref(fig:sheltersdatebyrow)).
toronto_shelters %>%mutate(counter =1:nrow(toronto_shelters)) %>%ggplot(aes(x = counter, y = date)) +geom_point(alpha =0.3) +facet_wrap(vars(year(date)),scales ="free") +labs(x ="Row in the dataset",y ="Date of that row") +theme_minimal()
Date of each row in order (2017-2020)
It looks like 2020 is as we’d expect. 2019 has a few odd situations, but not too many. 2018 has a small cluster early in the dataset and then possibly something systematic toward the middle. But it’s clear that 2017 has a large number of systematic issues.
In general, I think that in 2017 the first 12 days are the wrong way around, i.e we think it’s year-month-day, but it’s actually year-day-month, but there are exceptions. As a first pass, let’s just try to flip those first 12 days of each month and see if that helps. It’ll be fairly blunt, but hopefully gets us somewhere.
Now let’s take a look (Figure @ref(fig:sheltersdatebyrowadj)).
toronto_shelters %>%mutate(counter =1:nrow(toronto_shelters)) %>%filter(year(date) ==2017) %>%ggplot(aes(x = counter, y = changed_date)) +geom_point(alpha =0.3) +labs(x ="Row in the dataset",y ="Date of that row") +theme_minimal()
Date of each row in order in 2017 after adjustment
We can see that’s almost entirely taken care of the systematic differences. However it’s probably a little blunt. For instance, notice there are now no entries below the diagonal (Figure @ref(fig:sheltersdatebyrowadj2017)).
toronto_shelters_adjusted <- toronto_shelters %>%# We only want rows with occupancy tidyr::drop_na(occupancy, capacity) %>%# We want to know the occupancy by date and sectorgroup_by(changed_date, sector) %>%summarise(occupancy =sum(occupancy), .groups ='drop') toronto_shelters_adjusted %>%filter(year(changed_date) ==2017) %>%ggplot(aes(x =day(changed_date), y = occupancy, color = sector)) +geom_point(aes(group = sector), alpha =0.3) +scale_y_continuous(limits =c(0, NA)) +labs(color ="Type",x ="Changed day",y ="Occupancy (number)") +facet_wrap(vars(month(changed_date, label =TRUE)),scales ="free_x") +theme_minimal() +scale_color_brewer(palette ="Set1")
Toronto shelters daily occupancy in 2017 after adjustment
We could keep going here to try to get to the bottom of it, but the baby is going to wake up soon and I’ve got a history of wasting too much time on these types of things. One of the great things about the City of Toronto Data Portal is that each dataset has a publisher and a contact email. I’ll email them and will update this when they get back to me.
Model
…With a laptop, some free software, and a cup of coffee, I can examine what ought to seem like a staggering amount of information. …I sit here at home, surveying the scope of what’s being inflicted on people across the country and around the world as this disease spreads. … … People sometimes think (or complain) that working with quantitative data like this inures you to the reality of the human lives that lie behind the numbers. Numbers and measures are crude; they pick up the wrong things; they strip out the meaning of what’s happening to real people; they make it easy to ignore what can’t be counted. There’s something to those complaints. But it’s mostly a lazy critique. In practice, I find that far from distancing you from questions of meaning, quantitative data forces you to confront them. The numbers draw you in. Working with data like this is an unending exercise in humility, a constant compulsion to think through what you can and cannot see, and a standing invitation to understand what the measures really capture—what they mean, and for whom. …”
Kieran Healy, 2020, The Kitchen Counter Observatory, 21 May. https://kieranhealy.org/blog/archives/2020/05/21/the-kitchen-counter-observatory/
Let’s start by looking at the effect of COVID on occupancy. On 17 March Ontario declared a state of emergency, so let’s focus on the time around then (Figure @ref(fig:focuson2020) includes a dashed line at that point).
It’s clear that soon after the state of emergency was declared the usage of shelters began to decrease. Understandably the homeless, like all of us, want to socially distance to the extent possible and this likely means avoiding shelters. If you live in Toronto one thing that you’ll notice is that there are a lot of homeless people living in parks since COVID started. While this was fine in summer, the issue is that in winter it is well below 0C overnight.
We might like to see if we can estimate how many additional people are sleeping outside in January. We’ll do this by comparing the number that slept in the shelters in January 2021, compared with the number in the shelters in January 2020, which is before COVID hit Toronto. As Kieran Healy says, an exercise like this is confronting.
While it’s a stretch to say that if they’re not sleeping in a shelter then they must be sleeping outside, it’s also not clear where else they could be if they’re not in a shelter. It’s possible that they have found permanent housing and so are no longer using shelters, however I don’t know of any substantial change in public policy or resource allocation such that the whole difference is due to people finding permanent homes.
We’ll use the data that the city has so far released for 2021. The data provider warns us against using capacity, so we’ll focus on occupancy.
For each day we want to know the number in each category. We’re then going to compare this to this time last year.
toronto_shelters_by_day_current <- current_data %>%# We only want rows with occupancy tidyr::drop_na(occupancy) %>%# We want to know the occupancy by date and sectorgroup_by(date, sector) %>%summarise(occupancy =sum(occupancy), .groups ='drop') %>%mutate(month_day =paste(month(date), day(date), sep ="-")) %>%rename(occupancy_2021 = occupancy)head(toronto_shelters_by_day_current)
toronto_shelters_by_day_2020 <- toronto_shelters %>%# We only want rows with occupancy tidyr::drop_na(occupancy, capacity) %>%# We want to know the occupancy by date and sectorgroup_by(changed_date, sector) %>%summarise(occupancy =sum(occupancy), .groups ='drop') %>%filter(year(changed_date) ==2020) %>%mutate(month_day =paste(month(changed_date), day(changed_date), sep ="-")) %>%# Don't have to rename but it makes the join easierrename(date = changed_date, occupancy_2020 = occupancy)head(toronto_shelters_by_day_2020)
Now we’re going to combine the datasets so that for each day, say ‘4 January’, we know that in 2020 shelter usage was X and on that day in 2021 we know that shelter usage was Y (Table @ref(tab:jandaily)).
toronto_shelters_by_day_current <- toronto_shelters_by_day_current %>%left_join(toronto_shelters_by_day_2020, by =c("month_day"="month_day", "sector"="sector")) %>%rename(date = date.x) %>%select(date, sector, occupancy_2020, occupancy_2021) %>%mutate(difference = occupancy_2021 - occupancy_2020)toronto_shelters_by_day_current %>%rename(Date = date,Sector = sector,`Occupancy in 2020`= occupancy_2020,`Occupancy in 2021`= occupancy_2021, Difference = difference) %>% kableExtra::kbl(caption ="Comparison of shelter usage in January 2021 with January 2020 by day") %>% kableExtra::kable_styling()
Comparison of shelter usage in January 2021 with January 2020 by day
Date
Sector
Occupancy in 2020
Occupancy in 2021
Difference
2021-01-01
Co-ed
744
515
-229
2021-01-01
Families
2670
1128
-1542
2021-01-01
Men
1996
986
-1010
2021-01-01
Women
845
377
-468
2021-01-01
Youth
534
285
-249
2021-01-02
Co-ed
744
511
-233
2021-01-02
Families
2660
1122
-1538
2021-01-02
Men
1990
985
-1005
2021-01-02
Women
838
382
-456
2021-01-02
Youth
532
282
-250
2021-01-03
Co-ed
743
505
-238
2021-01-03
Families
2645
1125
-1520
2021-01-03
Men
1985
976
-1009
2021-01-03
Women
834
384
-450
2021-01-03
Youth
534
276
-258
2021-01-04
Co-ed
745
508
-237
2021-01-04
Families
2636
1138
-1498
2021-01-04
Men
1985
990
-995
2021-01-04
Women
848
387
-461
2021-01-04
Youth
535
276
-259
2021-01-05
Co-ed
744
510
-234
2021-01-05
Families
2621
1137
-1484
2021-01-05
Men
1991
984
-1007
2021-01-05
Women
844
383
-461
2021-01-05
Youth
529
271
-258
2021-01-06
Co-ed
742
510
-232
2021-01-06
Families
2607
1139
-1468
2021-01-06
Men
1991
987
-1004
2021-01-06
Women
845
388
-457
2021-01-06
Youth
534
267
-267
2021-01-07
Co-ed
742
509
-233
2021-01-07
Families
2611
1138
-1473
2021-01-07
Men
1986
979
-1007
2021-01-07
Women
848
389
-459
2021-01-07
Youth
535
268
-267
2021-01-08
Co-ed
742
509
-233
2021-01-08
Families
2618
1133
-1485
2021-01-08
Men
1998
979
-1019
2021-01-08
Women
848
386
-462
2021-01-08
Youth
536
269
-267
2021-01-09
Co-ed
743
506
-237
2021-01-09
Families
2638
1133
-1505
2021-01-09
Men
1999
979
-1020
2021-01-09
Women
840
386
-454
2021-01-09
Youth
533
260
-273
2021-01-10
Co-ed
745
500
-245
2021-01-10
Families
2627
1128
-1499
2021-01-10
Men
1999
971
-1028
2021-01-10
Women
838
383
-455
2021-01-10
Youth
535
263
-272
2021-01-11
Co-ed
741
502
-239
2021-01-11
Families
2616
1121
-1495
2021-01-11
Men
2003
971
-1032
2021-01-11
Women
835
383
-452
2021-01-11
Youth
538
254
-284
2021-01-12
Co-ed
740
503
-237
2021-01-12
Families
2634
1136
-1498
2021-01-12
Men
1996
965
-1031
2021-01-12
Women
844
383
-461
2021-01-12
Youth
536
262
-274
2021-01-13
Co-ed
741
500
-241
2021-01-13
Families
2629
1146
-1483
2021-01-13
Men
1993
956
-1037
2021-01-13
Women
842
378
-464
2021-01-13
Youth
532
260
-272
2021-01-14
Co-ed
739
498
-241
2021-01-14
Families
2635
1145
-1490
2021-01-14
Men
1994
915
-1079
2021-01-14
Women
842
372
-470
2021-01-14
Youth
537
265
-272
2021-01-15
Co-ed
741
506
-235
2021-01-15
Families
2655
1131
-1524
2021-01-15
Men
1983
920
-1063
2021-01-15
Women
847
371
-476
2021-01-15
Youth
539
266
-273
2021-01-16
Co-ed
744
511
-233
2021-01-16
Families
2654
1133
-1521
2021-01-16
Men
1988
918
-1070
2021-01-16
Women
845
370
-475
2021-01-16
Youth
533
264
-269
2021-01-17
Co-ed
742
513
-229
2021-01-17
Families
2649
1130
-1519
2021-01-17
Men
1992
923
-1069
2021-01-17
Women
844
366
-478
2021-01-17
Youth
532
264
-268
2021-01-18
Co-ed
743
507
-236
2021-01-18
Families
2639
1125
-1514
2021-01-18
Men
1999
923
-1076
2021-01-18
Women
844
371
-473
2021-01-18
Youth
535
264
-271
2021-01-19
Co-ed
744
504
-240
2021-01-19
Families
2631
1106
-1525
2021-01-19
Men
1995
931
-1064
2021-01-19
Women
849
379
-470
2021-01-19
Youth
536
263
-273
2021-01-20
Co-ed
742
503
-239
2021-01-20
Families
2629
1081
-1548
2021-01-20
Men
2001
941
-1060
2021-01-20
Women
840
371
-469
2021-01-20
Youth
536
263
-273
2021-01-21
Co-ed
745
501
-244
2021-01-21
Families
2647
1075
-1572
2021-01-21
Men
2007
941
-1066
2021-01-21
Women
839
367
-472
2021-01-21
Youth
534
263
-271
2021-01-22
Co-ed
744
505
-239
2021-01-22
Families
2648
1058
-1590
2021-01-22
Men
2014
949
-1065
2021-01-22
Women
843
367
-476
2021-01-22
Youth
531
267
-264
2021-01-23
Co-ed
740
508
-232
2021-01-23
Families
2653
1058
-1595
2021-01-23
Men
2008
959
-1049
2021-01-23
Women
842
362
-480
2021-01-23
Youth
530
273
-257
Let’s get some average statistics for January by sector (Table @ref(tab:janaverage)).
january_average <- toronto_shelters_by_day_current %>%group_by(sector) %>%summarise(Difference =mean(difference)) %>%mutate(Difference =as.integer(Difference))january_average %>%rename(Sector = sector,`Difference this January compared with last`= Difference) %>% kableExtra::kbl(caption ="Overall comparison of shelter usage in January 2021 with January 2020") %>% kableExtra::kable_styling()
Overall comparison of shelter usage in January 2021 with January 2020
Sector
Difference this January compared with last
Co-ed
-236
Families
-1516
Men
-1037
Women
-465
Youth
-267
Let’s look at a comparison graph (Figure @ref(fig:finalgraph)).
Comparison of shelter usage in January 2021 with January 2020
I estimate that in January, on average, there are roughly 3,517 additional people sleeping outside in below freezing conditions this year compared with last year. That is to say, shelter usage is that much below what it was, and I’m not sure where else they could go. In particular, on Friday night, when it was horrendously cold, I estimate there were 3,634 additional people likely sleeping outside, by comparing the usage of shelters on 22 January 2020 with 22 January 2021.
Even if I’m off by an order of magnitude, the city and province clearly should do more.
Acknowledgments
Thank you to Monica Alexander for helpful comments.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.”Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.”Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.