*Disclaimer: None of the contents of this blog post should be constrewed as tax advice. If you have any questions about your taxes, speak to an expert.*

It is tax time in Canada, which means that there are a ton of numbers to look over and details to review. One of those details is my public transit spending. Canadians can claim their public transit spending on Line 364–Public transit amount. The credit was discontinued last year in a surprisingly hypocritical move by our “pro-environment” government. This year is the last year that the credit can be claimed, for the period from January 1, 2017 to June 30, 2017.

If you are a Metropass user, it is super easy to calculate your transit fare. You simply need to add up the cost of each Metropass you purchased for the period. It has been a very long time since I have used a Metropass, as I usually end up spending less than the $116.75 that a Post-Secondary Metropass costs. Normally, day-use purchases such as tickets or tokens do not qualify for the Public Transit Credit. Presto cards, however, do qualify, based on some specific criteria laid out on the CRA website.

Presto users can claim the cost of any series of at least 32 one-way trips taken on a single transit service within a 31-day (or shorter) period. To do so, one must first download their Transit Usage Report. In the post, I will poke around my own TRU in R to see if I can maximize the amount that I can claim. The data in this report is real, but I have removed the location information for a smidgen of personal privacy. I should also note that I first opened my TRU in LibreOffice to clean up the date and amount column, which weren’t R-friendly.

As this is an R post. First, I will load the libraries I will be using:

```
library(dplyr)
library(tidyr)
library(lubridate)
library(zoo)
```

Now I will read in my TRU from the `.csv`

file, clean up the date using `lubridate`

, and filter to the TTC, which is the only agency that I use enough to qualify for the tax credit.

```
dat <- readr::read_csv("../../static/files/CONOR_TUR_2018.csv", col_types = "Dccd")
dat <- dat %>%
mutate(Date = ymd(Date)) %>%
filter(`Transit Agency` == "Toronto Transit Commission")
head(dat)
## # A tibble: 6 x 4
## Date `Transit Agency` Type Amount
## <date> <chr> <chr> <dbl>
## 1 2017-01-01 Toronto Transit Commission Fare Payment -3.00
## 2 2017-01-02 Toronto Transit Commission Fare Payment -3.00
## 3 2017-01-02 Toronto Transit Commission Fare Payment -3.00
## 4 2017-01-03 Toronto Transit Commission Fare Payment -3.00
## 5 2017-01-03 Toronto Transit Commission Fare Payment -3.00
## 6 2017-01-04 Toronto Transit Commission Fare Payment -3.00
```

First, let’s see if I was right to forego Metropasses last year:

```
dat %>% group_by(Yearmon = as.yearmon(Date)) %>% select(Yearmon, Amount) %>% summarize_all(sum)
## # A tibble: 9 x 2
## Yearmon Amount
## <S3: yearmon> <dbl>
## 1 Jan 2017 - 66.0
## 2 Feb 2017 -126
## 3 Mar 2017 -120
## 4 Apr 2017 - 84.0
## 5 Aug 2017 - 42.0
## 6 Sep 2017 -102
## 7 Oct 2017 -108
## 8 Nov 2017 -132
## 9 Dec 2017 -108
```

Oops, it looks like I should have bought a Metropass in February, March, and November of last year, as it would have been cheaper than what I paid in individual ($3) fares. Shoot!

All is not lost, however, because I can still claim some of that transit spending back on my tax return. The TRU has each transaction listed individually, but I’m interested in how many transactions I make over a given time period. As such, I will first add up any transactions made on the same day of the week, and also pad the list with complete dates from January 1 to June 30.

```
dat <- dat %>% group_by(Date) %>% select(Date, Amount) %>% summarize(trips = n(), amount = sum(Amount)) %>% ungroup
dat <- dat %>% complete(Date = full_seq(Date, period = 1), fill = list(trips = 0, amount = 0)) %>% filter(Date <= "2017-06-30")
dat[94:102,] # Just as an example
## # A tibble: 9 x 3
## Date trips amount
## <date> <dbl> <dbl>
## 1 2017-04-04 2.00 -6.00
## 2 2017-04-05 0 0
## 3 2017-04-06 2.00 -6.00
## 4 2017-04-07 2.00 -6.00
## 5 2017-04-08 0 0
## 6 2017-04-09 0 0
## 7 2017-04-10 2.00 -6.00
## 8 2017-04-11 1.00 -3.00
## 9 2017-04-12 2.00 -6.00
```

The CRA doesn’t specify that the 31-day periods that we should examine have to line up with the start of the month, so I’ll check for rolling 31-day periods with start dates on January 1 through January 31.

```
out <- NULL
for (i in 1:31) {
# Calculate cumulative trips and fares over rolling 31-day periods
result <- tibble(trips = rollapplyr(dat$trips[i:nrow(dat)], width = 31, by = 31,
FUN = sum, partial = TRUE),
amount = rollapplyr(dat$amount[i:nrow(dat)], width = 31, by = 31,
FUN = sum, partial = TRUE))
result <- result %>% filter(trips >= 32) %>% summarize_all(sum)
out <- bind_rows(out, result)
}
head(out)
## # A tibble: 6 x 2
## trips amount
## <dbl> <dbl>
## 1 85.0 -255
## 2 82.0 -246
## 3 82.0 -246
## 4 84.0 -252
## 5 84.0 -252
## 6 82.0 -246
```

Let’s see what my maximum claim might be.

```
out$trips[which(out$trips == max(out$trips))[1]]
## [1] 118
out$amount[which(out$trips == max(out$trips))[1]]
## [1] -354
```

I can claim 118 trips, which cost a total of $354, if I consider my rolling 31-day periods to have started on January 21.

Now, the above assumes that the 31-day periods mandated by the CRA be continuous, e.g. January 1 to 31, February 1 to March 2, March 3 to April 2, etc. This isn’t actually specified, anywhere on their CRA website, though. I thought it might be interesting to instead look at combinations of independent 31-day periods with *no overlap*. That means that I can try to maximize my claim be windowing my 31-day periods appropriately, while not double-claiming any single day. Let’s take a look.

First, I will get cumulative totals for *all* rolling 31-day periods from January 1 through the end of my data, and get the list of dates contained in each period. Since my periods are right-aligned, the dates will be \(T_i - 30 \ldots T_i\).

```
dat2 <-mutate(dat,
cum_trips = rollapplyr(trips, width = 31, FUN = sum, na.rm = TRUE,
partial = TRUE, align = "right"),
cum_amount = rollapplyr(amount, width = 31, FUN = sum, na.rm = TRUE,
partial = TRUE, align = "right"),
dates = lapply(dat$Date,
function (x) { seq.Date(from = x - 30, to = x, by = 1)})
)
# Filter only those periods that have 32 or more trips.
dat2 <- filter(dat2, cum_trips >= 32)
head(dat2)
## # A tibble: 6 x 6
## Date trips amount cum_trips cum_amount dates
## <date> <dbl> <dbl> <dbl> <dbl> <list>
## 1 2017-02-16 2.00 -6.00 33.0 - 99.0 <date [31]>
## 2 2017-02-17 2.00 -6.00 35.0 -105 <date [31]>
## 3 2017-02-18 1.00 -3.00 36.0 -108 <date [31]>
## 4 2017-02-19 0 0 36.0 -108 <date [31]>
## 5 2017-02-20 0 0 36.0 -108 <date [31]>
## 6 2017-02-21 1.00 -3.00 37.0 -111 <date [31]>
```

Now I am interested in finding out how these periods can be combined. As such, I will craft a function that generates combinations of \(m\) such periods, and selects only those that have \(31 * m\) unique dates, i.e. no overlapping dates. In my case, I will look for combinations of between 1 and 4 periods. There are two reasons for using a maximum of four periods: first, I was overseas from April 26 to August, so I know that it is futile to look for more 31-day periods from January 1 to June 30; second, and more pragmatically, at \(m = 5\) the combination process becomes very slow.

```
get_combs <- function(m) {
print(paste("Getting combinations for", m, "31-day period(s)."))
# Get list of combinations of our date lists
combs <- combn(dat2$dates, m, simplify = FALSE)
# Flatten the second levels (so we have a list of dates, and not a list of lists)
combs <- lapply(combs, unlist)
# Get the lengths of unique values in each combination
lengths_unique <- (lapply(combs, function (x) {length(unique(x))}))
# Choose combinations that have as many unique values as days
combs[lengths_unique == m*31]
}
combs <- do.call("c", lapply(1:4, function (x) {get_combs(x)}))
## [1] "Getting combinations for 1 31-day period(s)."
## [1] "Getting combinations for 2 31-day period(s)."
## [1] "Getting combinations for 3 31-day period(s)."
## [1] "Getting combinations for 4 31-day period(s)."
```

Now that I have combinations of 31-day periods with no overlap, I want to get some values back.

```
get_values <- function(x) {
c(trips = sum(dat$trips[dat$Date %in% x], na.rm = TRUE),
cost = sum(dat$amount[dat$Date %in% x]))
}
values <- do.call("bind_rows", lapply(combs, get_values))
head(values)
## # A tibble: 6 x 2
## trips cost
## <dbl> <dbl>
## 1 33.0 - 99.0
## 2 35.0 -105
## 3 36.0 -108
## 4 36.0 -108
## 5 36.0 -108
## 6 37.0 -111
```

So, did all of that hard work pay off? Let’s see:

```
values$trips[values$trips == max(values$trips)][1]
## [1] 118
values$cost[values$trips == max(values$trips)][1]
## [1] -354
```

Oops, it doesn’t look like it! It seems that I had already captured the maximum number of trips in my previous method, so this experiment with combinations proved to be futile. My claim is still at 118 trips with a total cost of $354.

What’s next? Well, I supposed another idea would be to compare all the possible periods of 31 days *or less* that also contain 32 trips or more. ~~For me, I think I’ll call it quits with the claim of $354, after all, this exercise is probably going to cost me more in lost marking time than it is worth on my tax return!~~

**Update 2018/02/25:** It seems I can’t rest until I solve something. I had a couple of thoughts as I tried to sleep last night on how I could implement this test. Here is what I came up with.

First, I will find all rolling periods of between 15 and 31 days that contain at least 32 one-way trips.

```
get_rolls <- function(days) {
mutate(dat,
cum_trips = rollapplyr(trips, width = days, FUN = sum, na.rm = TRUE,
partial = TRUE, align = "right"),
cum_amount = rollapplyr(amount, width = days, FUN = sum, na.rm = TRUE,
partial = TRUE, align = "right"),
dates = lapply(dat$Date,
function (x) { seq.Date(from = x - (days - 1), to = x, by = 1)}),
period_size = rep(days, nrow(dat))) %>% filter(cum_trips >= 32)
}
rolls <- do.call("bind_rows", lapply(15:31, function (x) {get_rolls(x)}))
```

Now, I have 526 periods with 32 trips or more that I could combine. Unfortunately, I won’t actually be able to do this! Let’s digress for a minute and look at the number of possible combinations for 526 periods using Ramanujan’s approximation. The number of possible combinations of \(k\) items from a set of \(n\) can be calculated as:

```
# Code from: https://stackoverflow.com/questions/40527010/r-how-can-i-calculate-large-numbers-in-n-choose-k
ramanujan <- function(n){
n*log(n) - n + log(n*(1 + 4*n*(1+2*n)))/6 + log(pi)/2
}
ncombs <- function(n,k){
round(exp(ramanujan(n) - ramanujan(k) - ramanujan(n-k)))
}
```

Therefore, the approximate number of possible combinations of three periods is:

```
ncombs(nrow(rolls), 3)
## [1] 24117431
```

While that number isn’t *huge*, it doesn’t take long for the combinations to fill the 8 gigs of RAM on my laptop and cause the machine to lock up. A more advanced approach is necessary, but also out of my wheelhouse. For now, as a proof of concept, I’ll just run the combinations with \(k\) of 1 (528 periods) and 2 (\(1.3808\times 10^{5}\) combinations) using a slightly tweaked version of the formula I used above.

```
get_combs_from_rolls <- function(n) {
print(paste("Getting combinations for", n, "period(s)."))
combs <- combn(rolls$dates, n, simplify = FALSE)
combs <- lapply(combs, unlist)
lengths_unique <- (lapply(combs, function (x) {length(unique(x))}))
# Choose combinations that have as many unique values as days
combs[lengths_unique == lengths(combs)]
}
combs <- do.call("c", lapply(1:2, function (x) {get_combs_from_rolls(x)}))
## [1] "Getting combinations for 1 period(s)."
## [1] "Getting combinations for 2 period(s)."
```

Now let’s get the values.

```
values <- do.call("bind_rows", lapply(combs, get_values))
values$trips[values$trips == max(values$trips)][1]
## [1] 87
values$cost[values$trips == max(values$trips)][1]
## [1] -261
```

So, looks like a top combination of two periods gives me 87 trips at a total cost of $261. While this number is still lower than the numbers I calculated above (and therefore, less useful to me on my tax return), I am pretty sure that this method could theoretically produce the absolute maximum return based on finding the best combination of periods that have at least 32 trips. I do admit, however, that the difference, if any, from the quick and easy method at the top of this post is probably so minute that it would likely make zero real-world difference on one’s tax return. At least I learned a lot about the `apply`

family of functions in R!