I look at Toronto shelter usage numbers between 2017 and January 2021. I document and adjust for a systematic error in the 2017 data. The data show that homelessness in Toronto is a large problem; essentially all shelters are almost always full. COVID changed the nature of the problem by reducing the number of people using shelters, however they must be sleeping somewhere. I compare January 2021 with January 2020 and estimate that following COVID there are now around an additional 3,500 people sleeping somewhere outside of shelters, possibly outdoors.
The extent of homelessness in Toronto was highlighted for the R community in December 2020 when the dataset was used in TidyTuesday. That dataset contained data for 2017 to 2019, inclusive. In this post I expand the dataset through to January 2021 to see what has happened since COVID. I also document and adjust for a systematic error in the 2017 dataset. I compare the usage of shelters this year with last year. I estimate that on Friday, 22 January 2021, when it was -11C with winds of 17km/h, there were roughly 3,634 additional people sleeping somewhere other than a shelter, compared with this time last year.
I’ll use the R statistical programming language (R Core Team 2020). The datasets are accessed via the opendatatoronto
package (Gelfand 2020). This package wraps around the City of Toronto’s Open Data Portal and allows the direct import of data rather than the need to visit the website. The website is great, but using the package enhances reproducibility. I’ll use the tidyverse
package to make data manipulation easier (Wickham et al. 2019).
# Get the data
# Based on https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-12-01/readme.md
all_data <-
opendatatoronto::search_packages("Daily Shelter Occupancy") %>%
opendatatoronto::list_package_resources() %>%
dplyr::filter(name %in% c("daily-shelter-occupancy-2017-csv",
"daily-shelter-occupancy-2018-csv",
"daily-shelter-occupancy-2019-csv",
"daily-shelter-occupancy-2020.csv")) %>%
group_split(name) %>% # Don't totally understand how this works
map_dfr(get_resource, .id = "file")
write_csv(all_data, "inputs/raw_data.csv")
Let’s just have a quick look at the data.
all_data <- read_csv("inputs/raw_data.csv",
col_types = c("iiccccccccccii"))
# The col_types hieroglyphics above stand for integer, integer, character... etc.
head(all_data)
# A tibble: 6 x 14
file `_id` OCCUPANCY_DATE ORGANIZATION_NA… SHELTER_NAME
<int> <int> <chr> <chr> <chr>
1 1 1 2017-01-01T00… COSTI Immigrant… COSTI Recep…
2 1 2 2017-01-01T00… Christie Ossing… Christie Os…
3 1 3 2017-01-01T00… Christie Ossing… Christie Os…
4 1 4 2017-01-01T00… Christie Refuge… Christie Re…
5 1 5 2017-01-01T00… City of Toronto Birchmount …
6 1 6 2017-01-01T00… City of Toronto Birkdale Re…
# … with 9 more variables: SHELTER_ADDRESS <chr>,
# SHELTER_CITY <chr>, SHELTER_PROVINCE <chr>,
# SHELTER_POSTAL_CODE <chr>, FACILITY_NAME <chr>,
# PROGRAM_NAME <chr>, SECTOR <chr>, OCCUPANCY <int>,
# CAPACITY <int>
The column names aren’t overly nice to type, there are a few columns that we’re not really going to use much, and finally a few of the columns have data that are less obvious than they should be. For instance, the ‘file’ tells us the year of the data, but because of the import settings it’s 1, 2… instead of 2017, 2018….
toronto_shelters <-
all_data %>%
janitor::clean_names() %>% # Make the column names easier to type. Thanks Sharla!
mutate(file_year =
case_when(
file == "1" ~ 2017,
file == "2" ~ 2018,
file == "3" ~ 2019,
file == "4" ~ 2020,
TRUE ~ -1)
) %>% # Just make the column easier to deal with
select(-id, -file)
The main issue with the data is the dates. In 2017-2019 (inclusive) they appear to be year-month-day, but for 2020 it seems like month-day-year. The separator is also inconsistent between ‘-’ and ‘/.’ I’ll first clean that up, check our guesses, and then get to the main issue. I’m going to draw on the lubridate
package (Grolemund and Wickham 2011).
library(lubridate)
toronto_shelters <-
toronto_shelters %>%
# 1st line removes times (probs don't actually need to do) and 2nd makes the separation consistent
mutate(occupancy_date = str_remove(occupancy_date, "T[:digit:]{2}:[:digit:]{2}:[:digit:]{2}"),
occupancy_date = str_replace_all(occupancy_date, "/", "-")
) %>%
# Parsing differs between 2017-2019 and 2020. Last line is a catch-all - shouldn't get there.
mutate(date = case_when(
file_year == "2020" ~ mdy(occupancy_date, quiet = TRUE),
file_year %in% c("2017", "2018", "2019") ~ ymd(occupancy_date, quiet = TRUE),
TRUE ~ NA_Date_
)
) %>%
select(file_year, date, occupancy_date, organization_name:capacity)
Let’s just check that my guess of the date orderings was at least plausible by looking at the distribution of year, month, and day bits.
toronto_shelters <-
toronto_shelters %>%
separate(occupancy_date, into = c('one', 'two', 'three'), sep = "-", remove = FALSE)
toronto_shelters %>%
filter(file_year %in% c(2017, 2018, 2019)) %>%
count(one) %>%
rename(Year = one, Number = n) %>%
kableExtra::kbl(caption = "Count of entries by year for 2017-2019") %>%
kableExtra::kable_styling()
Year | Number |
---|---|
2017 | 38700 |
2018 | 37770 |
2019 | 39446 |
toronto_shelters %>%
filter(file_year %in% c(2017, 2018, 2019)) %>%
count(two) %>%
rename(Month = two, Number = n) %>%
kableExtra::kbl(caption = "Count of entries by month for 2017-2019") %>%
kableExtra::kable_styling()
Month | Number |
---|---|
01 | 9747 |
02 | 8868 |
03 | 9912 |
04 | 9600 |
05 | 9950 |
06 | 9625 |
07 | 9850 |
08 | 9743 |
09 | 9430 |
10 | 9694 |
11 | 9497 |
12 | 10000 |
toronto_shelters %>%
filter(file_year %in% c(2017, 2018, 2019)) %>%
count(three) %>%
ggplot(aes(x = three, y = n)) +
geom_point() +
theme_minimal() +
labs(x = "Day",
y = "Number")
Figure 1: Distribution of days for 2017-2019
And again, but for 2020.
toronto_shelters %>%
filter(file_year == 2020) %>%
count(one) %>%
rename(Month = one, Number = n) %>%
kableExtra::kbl(caption = "Count of entries by month for 2020") %>%
kableExtra::kable_styling()
Month | Number |
---|---|
01 | 3503 |
02 | 3277 |
03 | 3555 |
04 | 3562 |
05 | 3671 |
06 | 3534 |
07 | 3601 |
08 | 3355 |
09 | 3210 |
10 | 3317 |
11 | 3195 |
12 | 3281 |
toronto_shelters %>%
filter(file_year == 2020) %>%
count(two) %>%
ggplot(aes(x = two, y = n)) +
geom_point() +
theme_minimal() +
labs(x = "Day",
y = "Number")
Figure 2: Distribution of days for 2020
toronto_shelters %>%
filter(file_year == 2020) %>%
count(three) %>%
rename(Year = three, Number = n) %>%
kableExtra::kbl(caption = "Count of entries by year for 2020") %>%
kableExtra::kable_styling()
Year | Number |
---|---|
2020 | 41061 |
That’s all looking fine. We’d know that we have issues if the distribution of the days wasn’t roughly uniform, or if we have values other than [1-12] in the month.
Let’s now also check that the year implied by the date matches the year implied by the file.
# A tibble: 0 x 18
# … with 18 variables: file_year <dbl>, date <date>,
# occupancy_date <chr>, one <chr>, two <chr>,
# three <chr>, organization_name <chr>,
# shelter_name <chr>, shelter_address <chr>,
# shelter_city <chr>, shelter_province <chr>,
# shelter_postal_code <chr>, facility_name <chr>,
# program_name <chr>, sector <chr>, occupancy <int>,
# capacity <int>, check_year <lgl>
toronto_shelters <-
toronto_shelters %>%
select(-occupancy_date, -one, -two, -three, -file_year)
That’s also fine. And I’ll clean-up by removing the unnecessary columns.
Everything seems fine, but it’s always important to ‘Plot. Your. Raw. Data.’ so before moving on, I should plot the raw data to see if there’s anything else going on. (Here, students seem to get confused what ‘raw’ means; I’m using it to refer to as close to the original dataset as possible, so no sums, or averages, etc, if possible. Not necessarily before any cleaning. Sometimes your data are too disperse for that so there will be an element of manipulation. The main point is that you, at the very least, need to plot the data that you’re going to be modelling.)
Let’s just plot the order. As this dataset has been put together by a human we’d expect that it’d be in order of date. Let’s just plot the date in the order it appears in the dataset (Figure 3).
toronto_shelters %>%
mutate(row_number = c(1:nrow(toronto_shelters))) %>%
ggplot(aes(x = row_number, y = date), alpha = 0.1) +
geom_point() +
theme_minimal() +
labs(x = "Row number",
y = "Date")
Figure 3: Comparison of row number with date
😱😱😱. This is a bit of a ‘hacky’ graph but it illustrates the point which is that the data are not in order in the dataset. If they were in order, then we’d expect them to be along the diagonal.
It’s super weird that they’re not in order in the raw data. Above, I checked by splitting them into pieces (day, month, year) and the counts were okay. But the ‘hacky’ graph was pretty hacky, so let’s try to summarise the data a little and then have another look. We’ll get a count by date and the sector of the shelter.
# Based on Lisa Lendway:
# https://github.com/llendway/tidy_tuesday_in_thirty/blob/main/2020_12_01_tidy_tuesday.Rmd
toronto_shelters_by_day <-
toronto_shelters %>%
# We only want rows with both occupancy and capacity
tidyr::drop_na(occupancy, capacity) %>%
# We want to know the occupancy by date and sector
group_by(date, sector) %>%
summarise(occupancy = sum(occupancy),
capacity = sum(capacity),
usage = occupancy / capacity, .groups = 'drop')
head(toronto_shelters_by_day)
# A tibble: 6 x 5
date sector occupancy capacity usage
<date> <chr> <int> <int> <dbl>
1 2017-01-01 Co-ed 530 582 0.911
2 2017-01-01 Families 1030 1150 0.896
3 2017-01-01 Men 1595 1706 0.935
4 2017-01-01 Women 641 697 0.920
5 2017-01-01 Youth 499 518 0.963
6 2017-01-02 Co-ed 458 582 0.787
We are interested in availability of shelter spots in Toronto on the basis of sector for each day. Different sectors focus on different folks: Co-ed, Families, Men, Women, Youth. Now for each day for each sector we have a proportion (note: horrifyingly >1 is possible). In the notes to the data we’re told that the capacity in 2020 may not be accurate, so for this chart we’ll just focus on 2017-2019 (inclusive) (Figure 4).
# Graph 2017-2019 (inc)
toronto_shelters_by_day %>%
filter(year(date) != "2020") %>%
ggplot(aes(x = date, y = usage, color = sector)) +
geom_point(aes(group = sector), alpha = 0.3) +
scale_y_continuous(limits = c(0, NA)) +
labs(color = "Type",
x = "Date",
y = "Occupancy rate") +
theme_minimal() +
scale_color_brewer(palette = "Set1")
Figure 4: Occupancy rate per day in Toronto shelters
That one looks kind of okay, but we’ll again see the problem immediately when we plot the raw number occupied (we can bring this through to include 2020 as it’s not to do with capacity) (Figure 5).
toronto_shelters_by_day %>%
ggplot(aes(x = date, y = occupancy, color = sector)) +
geom_point(aes(group = sector), alpha = 0.3) +
scale_y_continuous(limits = c(0, NA)) +
labs(color = "Type",
x = "Date",
y = "Occupancy (number)") +
theme_minimal() +
scale_color_brewer(palette = "Set1")
Figure 5: Occupancy per day in Toronto shelters
We can see that using modified data hides the problem. Let’s focus on 2017, as that’s where the biggest issue is and facet by month (Figure 6).
toronto_shelters_by_day %>%
filter(year(date) == 2017) %>%
ggplot(aes(x = day(date), y = occupancy, color = sector)) +
geom_point(aes(group = sector), alpha = 0.3) +
scale_y_continuous(limits = c(0, NA)) +
labs(color = "Type",
x = "Day",
y = "Occupancy (number)",
title = "Toronto shelters in 2017",
subtitle = "Occupancy per day") +
facet_wrap(vars(month(date, label = TRUE)),
scales = "free_x") +
theme_minimal() +
scale_color_brewer(palette = "Set1")
Figure 6: Occupancy in Toronto shelters in 2017
Just to check, let’s plot the same for 2018 (Figure 7).
toronto_shelters_by_day %>%
filter(year(date) == 2018) %>%
ggplot(aes(x = day(date), y = occupancy, color = sector)) +
geom_point(aes(group = sector), alpha = 0.3) +
scale_y_continuous(limits = c(0, NA)) +
labs(color = "Type",
x = "Day",
y = "Occupancy (number)") +
facet_wrap(vars(month(date, label = TRUE)),
scales = "free_x") +
theme_minimal() +
scale_color_brewer(palette = "Dark2")
Figure 7: Daily occupancy in Toronto shelters in 2018
This gives us an idea of what we ought to expect in 2017 - why should they be significantly different? To start, focus on January 2017 and see if that makes it clearer what is going on (Figure 8).
toronto_shelters_by_day %>%
filter(year(date) == 2017) %>%
filter(month(date) == 1) %>%
ggplot(aes(x = day(date), y = occupancy, color = sector)) +
geom_point(aes(group = sector)) +
scale_y_continuous(limits = c(0, NA)) +
labs(color = "Type",
x = "Day",
y = "Occupancy (number)") +
theme_minimal() +
scale_color_brewer(palette = "Set1")
Figure 8: Daily occupancy in Toronto shelters in January 2017
This perhaps gives us some idea of what is going on. Let’s just check February and see if it looks similar (Figure 9).
toronto_shelters_by_day %>%
filter(year(date) == 2017) %>%
filter(month(date) == 2) %>%
ggplot(aes(x = day(date), y = occupancy, color = sector)) +
geom_point(aes(group = sector)) +
scale_y_continuous(limits = c(0, NA)) +
labs(color = "Type",
x = "Day",
y = "Occupancy (number)") +
theme_minimal() +
scale_color_brewer(palette = "Set1")
Figure 9: Daily occupancy in Toronto shelters in February 2017
We’ve clearly got a problem with the first twelve days of the month. We noted at the start that when you look at the data it’s a bit odd in that it’s not in order. Let’s take another look at that by going back to the data as it was given to us (as opposed to the data by day that we’ve been using) (Figure 10).
toronto_shelters %>%
mutate(counter = 1:nrow(toronto_shelters)) %>%
filter(year(date) == 2017) %>%
ggplot(aes(x = counter, y = date)) +
geom_point(alpha = 0.3) +
labs(x = "Row in the dataset",
y = "Date of that row") +
theme_minimal()
Figure 10: Date of each row in order in 2017
Although there’s no rule that says the dataset has to be in order of the date, if it were, then all the points would lie on the diagonal line. We have a lot of deviation from that. To get a sense of what we’re expecting let’s look at all four years (Figure 11).
toronto_shelters %>%
mutate(counter = 1:nrow(toronto_shelters)) %>%
ggplot(aes(x = counter, y = date)) +
geom_point(alpha = 0.3) +
facet_wrap(vars(year(date)),
scales = "free") +
labs(x = "Row in the dataset",
y = "Date of that row") +
theme_minimal()
Figure 11: Date of each row in order (2017-2020)
It looks like 2020 is as we’d expect. 2019 has a few odd situations, but not too many. 2018 has a small cluster early in the dataset and then possibly something systematic toward the middle. But it’s clear that 2017 has a large number of systematic issues.
In general, I think that in 2017 the first 12 days are the wrong way around, i.e we think it’s year-month-day, but it’s actually year-day-month, but there are exceptions. As a first pass, let’s just try to flip those first 12 days of each month and see if that helps. It’ll be fairly blunt, but hopefully gets us somewhere.
toronto_shelters <-
toronto_shelters %>%
mutate(
year = year(date),
month = month(date),
day = day(date),
date = as.character(date),
changed_date = if_else(
date %in% c("2017-02-01", "2017-03-01", "2017-04-01", "2017-05-01", "2017-06-01",
"2017-07-01", "2017-08-01", "2017-09-01", "2017-10-01", "2017-11-01",
"2017-12-01", "2017-01-02", "2017-03-02", "2017-04-02", "2017-05-02",
"2017-06-02", "2017-07-02", "2017-08-02", "2017-09-02", "2017-10-02",
"2017-11-02", "2017-12-02", "2017-01-03", "2017-02-03", "2017-04-03",
"2017-05-03", "2017-06-03", "2017-07-03", "2017-08-03", "2017-09-03",
"2017-10-03", "2017-11-03", "2017-12-03", "2017-01-04", "2017-02-04",
"2017-03-04", "2017-05-04", "2017-06-04", "2017-07-04", "2017-08-04",
"2017-09-04", "2017-10-04", "2017-11-04", "2017-12-04", "2017-01-05",
"2017-02-05", "2017-03-05", "2017-04-05", "2017-06-05", "2017-07-05",
"2017-08-05", "2017-09-05", "2017-10-05", "2017-11-05", "2017-12-05",
"2017-01-06", "2017-02-06", "2017-03-06", "2017-04-06", "2017-05-06",
"2017-07-06", "2017-08-06", "2017-09-06", "2017-10-06", "2017-11-06",
"2017-12-06", "2017-01-07", "2017-02-07", "2017-03-07", "2017-04-07",
"2017-05-07", "2017-06-07", "2017-08-07", "2017-09-07", "2017-10-07",
"2017-11-07", "2017-12-07", "2017-01-08", "2017-02-08", "2017-03-08",
"2017-04-08", "2017-05-08", "2017-06-08", "2017-07-08", "2017-09-08",
"2017-10-08", "2017-11-08", "2017-12-08", "2017-01-09", "2017-02-09",
"2017-03-09", "2017-04-09", "2017-05-09", "2017-06-09", "2017-07-09",
"2017-08-09", "2017-10-09", "2017-11-09", "2017-12-09", "2017-01-10",
"2017-02-10", "2017-03-10", "2017-04-10", "2017-05-10", "2017-06-10",
"2017-07-10", "2017-08-10", "2017-09-10", "2017-11-10", "2017-12-10",
"2017-01-11", "2017-02-11", "2017-03-11", "2017-04-11", "2017-05-11",
"2017-06-11", "2017-07-11", "2017-08-11", "2017-09-11", "2017-10-11",
"2017-12-11", "2017-01-12", "2017-02-12", "2017-03-12", "2017-04-12",
"2017-05-12", "2017-06-12", "2017-07-12", "2017-08-12", "2017-09-12",
"2017-10-12", "2017-11-12"),
paste(year, day, month, sep = "-"),
paste(year, month, day, sep = "-"),
),
changed_date = ymd(changed_date)
) %>%
select(-year, -month, -day)
Now let’s take a look (Figure 12).
toronto_shelters %>%
mutate(counter = 1:nrow(toronto_shelters)) %>%
filter(year(date) == 2017) %>%
ggplot(aes(x = counter, y = changed_date)) +
geom_point(alpha = 0.3) +
labs(x = "Row in the dataset",
y = "Date of that row") +
theme_minimal()
Figure 12: Date of each row in order in 2017 after adjustment
We can see that’s almost entirely taken care of the systematic differences. However it’s probably a little blunt. For instance, notice there are now no entries below the diagonal (Figure 13).
toronto_shelters_adjusted <-
toronto_shelters %>%
# We only want rows with occupancy
tidyr::drop_na(occupancy, capacity) %>%
# We want to know the occupancy by date and sector
group_by(changed_date, sector) %>%
summarise(occupancy = sum(occupancy), .groups = 'drop')
toronto_shelters_adjusted %>%
filter(year(changed_date) == 2017) %>%
ggplot(aes(x = day(changed_date), y = occupancy, color = sector)) +
geom_point(aes(group = sector), alpha = 0.3) +
scale_y_continuous(limits = c(0, NA)) +
labs(color = "Type",
x = "Changed day",
y = "Occupancy (number)") +
facet_wrap(vars(month(changed_date, label = TRUE)),
scales = "free_x") +
theme_minimal() +
scale_color_brewer(palette = "Set1")
Figure 13: Toronto shelters daily occupancy in 2017 after adjustment
We could keep going here to try to get to the bottom of it, but the baby is going to wake up soon and I’ve got a history of wasting too much time on these types of things. One of the great things about the City of Toronto Data Portal is that each dataset has a publisher and a contact email. I’ll email them and will update this when they get back to me.
…With a laptop, some free software, and a cup of coffee, I can examine what ought to seem like a staggering amount of information. …I sit here at home, surveying the scope of what’s being inflicted on people across the country and around the world as this disease spreads. … … People sometimes think (or complain) that working with quantitative data like this inures you to the reality of the human lives that lie behind the numbers. Numbers and measures are crude; they pick up the wrong things; they strip out the meaning of what’s happening to real people; they make it easy to ignore what can’t be counted. There’s something to those complaints. But it’s mostly a lazy critique. In practice, I find that far from distancing you from questions of meaning, quantitative data forces you to confront them. The numbers draw you in. Working with data like this is an unending exercise in humility, a constant compulsion to think through what you can and cannot see, and a standing invitation to understand what the measures really capture—what they mean, and for whom. …"
Kieran Healy, 2020, The Kitchen Counter Observatory, 21 May. https://kieranhealy.org/blog/archives/2020/05/21/the-kitchen-counter-observatory/
Let’s start by looking at the effect of COVID on occupancy. On 17 March Ontario declared a state of emergency, so let’s focus on the time around then (Figure 14 includes a dashed line at that point).
toronto_shelters_adjusted %>%
filter(year(changed_date) == 2020) %>%
ggplot(aes(x = changed_date, y = occupancy, color = sector)) +
geom_point(aes(group = sector), alpha = 0.3) +
geom_vline(xintercept = ymd("2020-03-17"), linetype = 'dotted')+
scale_y_continuous(limits = c(0, NA)) +
labs(color = "Type",
x = "Changed date",
y = "Occupancy (number)") +
theme_minimal() +
scale_color_brewer(palette = "Set1")
Figure 14: Shelter usage in 2020
It’s clear that soon after the state of emergency was declared the usage of shelters began to decrease. Understandably the homeless, like all of us, want to socially distance to the extent possible and this likely means avoiding shelters. If you live in Toronto one thing that you’ll notice is that there are a lot of homeless people living in parks since COVID started. While this was fine in summer, the issue is that in winter it is well below 0C overnight.
We might like to see if we can estimate how many additional people are sleeping outside in January. We’ll do this by comparing the number that slept in the shelters in January 2021, compared with the number in the shelters in January 2020, which is before COVID hit Toronto. As Kieran Healy says, an exercise like this is confronting.
While it’s a stretch to say that if they’re not sleeping in a shelter then they must be sleeping outside, it’s also not clear where else they could be if they’re not in a shelter. It’s possible that they have found permanent housing and so are no longer using shelters, however I don’t know of any substantial change in public policy or resource allocation such that the whole difference is due to people finding permanent homes.
We’ll use the data that the city has so far released for 2021. The data provider warns us against using capacity, so we’ll focus on occupancy.
current_data <- opendatatoronto::get_resource("29852011-c9c2-4b6d-a79a-b718f5ed1ae2")
write_csv(current_data, "inputs/raw_data-January.csv")
current_data <- read_csv("inputs/raw_data-January.csv",
col_types = c("icccicccicccc"))
head(current_data)
# A tibble: 6 x 13
`_id` OCCUPANCY_DATE SECTOR SHELTER_POSTAL_… CAPACITY
<int> <chr> <chr> <chr> <int>
1 1.91e6 2021-01-01T05… Women M5S 2P1 8
2 1.91e6 2021-01-01T05… Famil… M5S 2P1 174
3 1.91e6 2021-01-01T05… Famil… M5S 2P1 140
4 1.91e6 2021-01-01T05… Famil… M5S 2P1 360
5 1.91e6 2021-01-01T05… Co-ed M5S 2P1 8
6 1.91e6 2021-01-01T05… Women M5S 2P1 8
# … with 8 more variables: SHELTER_PROVINCE <chr>,
# FACILITY_NAME <chr>, SHELTER_NAME <chr>,
# OCCUPANCY <int>, ORGANIZATION_NAME <chr>,
# SHELTER_ADDRESS <chr>, SHELTER_CITY <chr>,
# PROGRAM_NAME <chr>
Just want to do the same basic clean up of the dates as before.
current_data <-
current_data %>%
janitor::clean_names() %>% # Make the column names easier to type. Thanks Sharla!
mutate(occupancy_date = str_remove(occupancy_date, "T[:digit:]{2}:[:digit:]{2}:[:digit:]{2}"),
) %>%
mutate(date = ymd(occupancy_date, quiet = TRUE))
head(current_data)
# A tibble: 6 x 14
id occupancy_date sector shelter_postal_… capacity
<int> <chr> <chr> <chr> <int>
1 1.91e6 2021-01-01 Women M5S 2P1 8
2 1.91e6 2021-01-01 Famil… M5S 2P1 174
3 1.91e6 2021-01-01 Famil… M5S 2P1 140
4 1.91e6 2021-01-01 Famil… M5S 2P1 360
5 1.91e6 2021-01-01 Co-ed M5S 2P1 8
6 1.91e6 2021-01-01 Women M5S 2P1 8
# … with 9 more variables: shelter_province <chr>,
# facility_name <chr>, shelter_name <chr>,
# occupancy <int>, organization_name <chr>,
# shelter_address <chr>, shelter_city <chr>,
# program_name <chr>, date <date>
For each day we want to know the number in each category. We’re then going to compare this to this time last year.
toronto_shelters_by_day_current <-
current_data %>%
# We only want rows with occupancy
tidyr::drop_na(occupancy) %>%
# We want to know the occupancy by date and sector
group_by(date, sector) %>%
summarise(occupancy = sum(occupancy), .groups = 'drop') %>%
mutate(month_day = paste(month(date), day(date), sep = "-")) %>%
rename(occupancy_2021 = occupancy)
head(toronto_shelters_by_day_current)
# A tibble: 6 x 4
date sector occupancy_2021 month_day
<date> <chr> <int> <chr>
1 2021-01-01 Co-ed 515 1-1
2 2021-01-01 Families 1128 1-1
3 2021-01-01 Men 986 1-1
4 2021-01-01 Women 377 1-1
5 2021-01-01 Youth 285 1-1
6 2021-01-02 Co-ed 511 1-2
toronto_shelters_by_day_2020 <-
toronto_shelters %>%
# We only want rows with occupancy
tidyr::drop_na(occupancy, capacity) %>%
# We want to know the occupancy by date and sector
group_by(changed_date, sector) %>%
summarise(occupancy = sum(occupancy), .groups = 'drop') %>%
filter(year(changed_date) == 2020) %>%
mutate(month_day = paste(month(changed_date), day(changed_date), sep = "-")) %>%
# Don't have to rename but it makes the join easier
rename(date = changed_date,
occupancy_2020 = occupancy)
head(toronto_shelters_by_day_2020)
# A tibble: 6 x 4
date sector occupancy_2020 month_day
<date> <chr> <int> <chr>
1 2020-01-01 Co-ed 744 1-1
2 2020-01-01 Families 2670 1-1
3 2020-01-01 Men 1996 1-1
4 2020-01-01 Women 845 1-1
5 2020-01-01 Youth 534 1-1
6 2020-01-02 Co-ed 744 1-2
Now we’re going to combine the datasets so that for each day, say ‘4 January,’ we know that in 2020 shelter usage was X and on that day in 2021 we know that shelter usage was Y (Table 5).
toronto_shelters_by_day_current <-
toronto_shelters_by_day_current %>%
left_join(toronto_shelters_by_day_2020, by = c("month_day" = "month_day", "sector" = "sector")) %>%
rename(date = date.x) %>%
select(date, sector, occupancy_2020, occupancy_2021) %>%
mutate(difference = occupancy_2021 - occupancy_2020)
toronto_shelters_by_day_current %>%
rename(Date = date,
Sector = sector,
`Occupancy in 2020` = occupancy_2020,
`Occupancy in 2021` = occupancy_2021,
Difference = difference) %>%
kableExtra::kbl(caption = "Comparison of shelter usage in January 2021 with January 2020 by day") %>%
kableExtra::kable_styling()
Date | Sector | Occupancy in 2020 | Occupancy in 2021 | Difference |
---|---|---|---|---|
2021-01-01 | Co-ed | 744 | 515 | -229 |
2021-01-01 | Families | 2670 | 1128 | -1542 |
2021-01-01 | Men | 1996 | 986 | -1010 |
2021-01-01 | Women | 845 | 377 | -468 |
2021-01-01 | Youth | 534 | 285 | -249 |
2021-01-02 | Co-ed | 744 | 511 | -233 |
2021-01-02 | Families | 2660 | 1122 | -1538 |
2021-01-02 | Men | 1990 | 985 | -1005 |
2021-01-02 | Women | 838 | 382 | -456 |
2021-01-02 | Youth | 532 | 282 | -250 |
2021-01-03 | Co-ed | 743 | 505 | -238 |
2021-01-03 | Families | 2645 | 1125 | -1520 |
2021-01-03 | Men | 1985 | 976 | -1009 |
2021-01-03 | Women | 834 | 384 | -450 |
2021-01-03 | Youth | 534 | 276 | -258 |
2021-01-04 | Co-ed | 745 | 508 | -237 |
2021-01-04 | Families | 2636 | 1138 | -1498 |
2021-01-04 | Men | 1985 | 990 | -995 |
2021-01-04 | Women | 848 | 387 | -461 |
2021-01-04 | Youth | 535 | 276 | -259 |
2021-01-05 | Co-ed | 744 | 510 | -234 |
2021-01-05 | Families | 2621 | 1137 | -1484 |
2021-01-05 | Men | 1991 | 984 | -1007 |
2021-01-05 | Women | 844 | 383 | -461 |
2021-01-05 | Youth | 529 | 271 | -258 |
2021-01-06 | Co-ed | 742 | 510 | -232 |
2021-01-06 | Families | 2607 | 1139 | -1468 |
2021-01-06 | Men | 1991 | 987 | -1004 |
2021-01-06 | Women | 845 | 388 | -457 |
2021-01-06 | Youth | 534 | 267 | -267 |
2021-01-07 | Co-ed | 742 | 509 | -233 |
2021-01-07 | Families | 2611 | 1138 | -1473 |
2021-01-07 | Men | 1986 | 979 | -1007 |
2021-01-07 | Women | 848 | 389 | -459 |
2021-01-07 | Youth | 535 | 268 | -267 |
2021-01-08 | Co-ed | 742 | 509 | -233 |
2021-01-08 | Families | 2618 | 1133 | -1485 |
2021-01-08 | Men | 1998 | 979 | -1019 |
2021-01-08 | Women | 848 | 386 | -462 |
2021-01-08 | Youth | 536 | 269 | -267 |
2021-01-09 | Co-ed | 743 | 506 | -237 |
2021-01-09 | Families | 2638 | 1133 | -1505 |
2021-01-09 | Men | 1999 | 979 | -1020 |
2021-01-09 | Women | 840 | 386 | -454 |
2021-01-09 | Youth | 533 | 260 | -273 |
2021-01-10 | Co-ed | 745 | 500 | -245 |
2021-01-10 | Families | 2627 | 1128 | -1499 |
2021-01-10 | Men | 1999 | 971 | -1028 |
2021-01-10 | Women | 838 | 383 | -455 |
2021-01-10 | Youth | 535 | 263 | -272 |
2021-01-11 | Co-ed | 741 | 502 | -239 |
2021-01-11 | Families | 2616 | 1121 | -1495 |
2021-01-11 | Men | 2003 | 971 | -1032 |
2021-01-11 | Women | 835 | 383 | -452 |
2021-01-11 | Youth | 538 | 254 | -284 |
2021-01-12 | Co-ed | 740 | 503 | -237 |
2021-01-12 | Families | 2634 | 1136 | -1498 |
2021-01-12 | Men | 1996 | 965 | -1031 |
2021-01-12 | Women | 844 | 383 | -461 |
2021-01-12 | Youth | 536 | 262 | -274 |
2021-01-13 | Co-ed | 741 | 500 | -241 |
2021-01-13 | Families | 2629 | 1146 | -1483 |
2021-01-13 | Men | 1993 | 956 | -1037 |
2021-01-13 | Women | 842 | 378 | -464 |
2021-01-13 | Youth | 532 | 260 | -272 |
2021-01-14 | Co-ed | 739 | 498 | -241 |
2021-01-14 | Families | 2635 | 1145 | -1490 |
2021-01-14 | Men | 1994 | 915 | -1079 |
2021-01-14 | Women | 842 | 372 | -470 |
2021-01-14 | Youth | 537 | 265 | -272 |
2021-01-15 | Co-ed | 741 | 506 | -235 |
2021-01-15 | Families | 2655 | 1131 | -1524 |
2021-01-15 | Men | 1983 | 920 | -1063 |
2021-01-15 | Women | 847 | 371 | -476 |
2021-01-15 | Youth | 539 | 266 | -273 |
2021-01-16 | Co-ed | 744 | 511 | -233 |
2021-01-16 | Families | 2654 | 1133 | -1521 |
2021-01-16 | Men | 1988 | 918 | -1070 |
2021-01-16 | Women | 845 | 370 | -475 |
2021-01-16 | Youth | 533 | 264 | -269 |
2021-01-17 | Co-ed | 742 | 513 | -229 |
2021-01-17 | Families | 2649 | 1130 | -1519 |
2021-01-17 | Men | 1992 | 923 | -1069 |
2021-01-17 | Women | 844 | 366 | -478 |
2021-01-17 | Youth | 532 | 264 | -268 |
2021-01-18 | Co-ed | 743 | 507 | -236 |
2021-01-18 | Families | 2639 | 1125 | -1514 |
2021-01-18 | Men | 1999 | 923 | -1076 |
2021-01-18 | Women | 844 | 371 | -473 |
2021-01-18 | Youth | 535 | 264 | -271 |
2021-01-19 | Co-ed | 744 | 504 | -240 |
2021-01-19 | Families | 2631 | 1106 | -1525 |
2021-01-19 | Men | 1995 | 931 | -1064 |
2021-01-19 | Women | 849 | 379 | -470 |
2021-01-19 | Youth | 536 | 263 | -273 |
2021-01-20 | Co-ed | 742 | 503 | -239 |
2021-01-20 | Families | 2629 | 1081 | -1548 |
2021-01-20 | Men | 2001 | 941 | -1060 |
2021-01-20 | Women | 840 | 371 | -469 |
2021-01-20 | Youth | 536 | 263 | -273 |
2021-01-21 | Co-ed | 745 | 501 | -244 |
2021-01-21 | Families | 2647 | 1075 | -1572 |
2021-01-21 | Men | 2007 | 941 | -1066 |
2021-01-21 | Women | 839 | 367 | -472 |
2021-01-21 | Youth | 534 | 263 | -271 |
2021-01-22 | Co-ed | 744 | 505 | -239 |
2021-01-22 | Families | 2648 | 1058 | -1590 |
2021-01-22 | Men | 2014 | 949 | -1065 |
2021-01-22 | Women | 843 | 367 | -476 |
2021-01-22 | Youth | 531 | 267 | -264 |
2021-01-23 | Co-ed | 740 | 508 | -232 |
2021-01-23 | Families | 2653 | 1058 | -1595 |
2021-01-23 | Men | 2008 | 959 | -1049 |
2021-01-23 | Women | 842 | 362 | -480 |
2021-01-23 | Youth | 530 | 273 | -257 |
Let’s get some average statistics for January by sector (Table 6).
january_average <-
toronto_shelters_by_day_current %>%
group_by(sector) %>%
summarise(Difference = mean(difference)) %>%
mutate(Difference = as.integer(Difference))
january_average %>%
rename(Sector = sector,
`Difference this January compared with last` = Difference) %>%
kableExtra::kbl(caption = "Overall comparison of shelter usage in January 2021 with January 2020") %>%
kableExtra::kable_styling()
Sector | Difference this January compared with last |
---|---|
Co-ed | -236 |
Families | -1516 |
Men | -1037 |
Women | -465 |
Youth | -267 |
Let’s look at a comparison graph (Figure 15).
toronto_shelters_by_day_current %>%
select(-difference) %>%
pivot_longer(cols = c("occupancy_2020", "occupancy_2021"),
names_to = "year",
values_to = "number"
) %>%
mutate(year = if_else(year == 'occupancy_2020', '2020', '2021')) %>%
ggplot(aes(x = date, y = number, color = year)) +
geom_point() +
facet_wrap(vars(sector)) +
theme_minimal() +
labs(x = "Date",
y = "Occupancy in shelters (#)",
color = "Year") +
scale_color_brewer(palette = "Set1")
Figure 15: Comparison of shelter usage in January 2021 with January 2020
I estimate that in January, on average, there are roughly 3,517 additional people sleeping outside in below freezing conditions this year compared with last year. That is to say, shelter usage is that much below what it was, and I’m not sure where else they could go. In particular, on Friday night, when it was horrendously cold, I estimate there were 3,634 additional people likely sleeping outside, by comparing the usage of shelters on 22 January 2020 with 22 January 2021.
Even if I’m off by an order of magnitude, the city and province clearly should do more.
Thank you to Monica Alexander for helpful comments.