15 Vector tools

You are reading the work-in-progress second edition of R for Data Science. This chapter is currently currently a dumping ground for ideas, and we don’t recommend reading it. You can find the polished first edition at https://r4ds.had.co.nz.

15.1 Introduction

%in%

c()

library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#>  ggplot2 3.3.5           purrr   0.3.4     
#>  tibble  3.1.5           dplyr   1.0.7     
#>  tidyr   1.1.4           stringr 1.4.0.9000
#>  readr   2.0.2           forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#>  dplyr::filter() masks stats::filter()
#>  dplyr::lag()    masks stats::lag()
library(nycflights13)

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

15.2 Counts

  • Counts: You’ve seen n(), which takes no arguments, and returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).

    # Which destinations have the most carriers?
    not_cancelled %>% 
      group_by(dest) %>% 
      summarise(carriers = n_distinct(carrier)) %>% 
      arrange(desc(carriers))
    #> # A tibble: 104 × 2
    #>   dest  carriers
    #>   <chr>    <int>
    #> 1 ATL          7
    #> 2 BOS          7
    #> 3 CLT          7
    #> 4 ORD          7
    #> 5 TPA          7
    #> 6 AUS          6
    #> # … with 98 more rows

    Counts are so useful that dplyr provides a simple helper if all you want is a count:

    not_cancelled %>% 
      count(dest)
    #> # A tibble: 104 × 2
    #>   dest      n
    #>   <chr> <int>
    #> 1 ABQ     254
    #> 2 ACK     264
    #> 3 ALB     418
    #> 4 ANC       8
    #> 5 ATL   16837
    #> 6 AUS    2411
    #> # … with 98 more rows

    Just like with group_by(), you can also provide multiple variables to count().

    not_cancelled %>% 
      count(carrier, dest)
    #> # A tibble: 312 × 3
    #>   carrier dest      n
    #>   <chr>   <chr> <int>
    #> 1 9E      ATL      56
    #> 2 9E      AUS       2
    #> 3 9E      AVL      10
    #> 4 9E      BNA     452
    #> 5 9E      BOS     853
    #> 6 9E      BTV       2
    #> # … with 306 more rows

    You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:

    not_cancelled %>% 
      count(tailnum, wt = distance)
    #> # A tibble: 4,037 × 2
    #>   tailnum      n
    #>   <chr>    <dbl>
    #> 1 D942DN    3418
    #> 2 N0EGMQ  239143
    #> 3 N10156  109664
    #> 4 N102UW   25722
    #> 5 N103US   24619
    #> 6 N104UW   24616
    #> # … with 4,031 more rows

15.3 Window functions

  • Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), which you’ll learn about shortly.

    (x <- 1:10)
    #>  [1]  1  2  3  4  5  6  7  8  9 10
    lag(x)
    #>  [1] NA  1  2  3  4  5  6  7  8  9
    lead(x)
    #>  [1]  2  3  4  5  6  7  8  9 10 NA
  • Ranking: there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.

    y <- c(1, 2, 2, NA, 3, 4)
    min_rank(y)
    #> [1]  1  2  2 NA  4  5
    min_rank(desc(y))
    #> [1]  5  3  3 NA  2  1

    If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile(). See their help pages for more details.

    row_number(y)
    #> [1]  1  2  3 NA  4  5
    dense_rank(y)
    #> [1]  1  2  2 NA  3  4
    percent_rank(y)
    #> [1] 0.00 0.25 0.25   NA 0.75 1.00
    cume_dist(y)
    #> [1] 0.2 0.6 0.6  NA 0.8 1.0
  • Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:

    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(
        first_dep = first(dep_time), 
        last_dep = last(dep_time)
      )
    #> `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
    #> # A tibble: 365 × 5
    #> # Groups:   year, month [12]
    #>    year month   day first_dep last_dep
    #>   <int> <int> <int>     <int>    <int>
    #> 1  2013     1     1       517     2356
    #> 2  2013     1     2        42     2354
    #> 3  2013     1     3        32     2349
    #> 4  2013     1     4        25     2358
    #> 5  2013     1     5        14     2357
    #> 6  2013     1     6        16     2355
    #> # … with 359 more rows

    These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:

    not_cancelled %>% 
      group_by(year, month, day) %>% 
      mutate(r = min_rank(desc(dep_time))) %>% 
      filter(r %in% range(r))
    #> # A tibble: 770 × 20
    #> # Groups:   year, month, day [365]
    #>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    #> 1  2013     1     1      517            515         2      830            819
    #> 2  2013     1     1     2356           2359        -3      425            437
    #> 3  2013     1     2       42           2359        43      518            442
    #> 4  2013     1     2     2354           2359        -5      413            437
    #> 5  2013     1     3       32           2359        33      504            442
    #> 6  2013     1     3     2349           2359       -10      434            445
    #> # … with 764 more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
    #> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
    #> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, r <int>

15.3.1 dplyr

flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
  • Find the worst members of each group:

    flights_sml %>% 
      group_by(year, month, day) %>%
      filter(rank(desc(arr_delay)) < 10)
    #> # A tibble: 3,306 × 7
    #> # Groups:   year, month, day [365]
    #>    year month   day dep_delay arr_delay distance air_time
    #>   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
    #> 1  2013     1     1       853       851      184       41
    #> 2  2013     1     1       290       338     1134      213
    #> 3  2013     1     1       260       263      266       46
    #> 4  2013     1     1       157       174      213       60
    #> 5  2013     1     1       216       222      708      121
    #> 6  2013     1     1       255       250      589      115
    #> # … with 3,300 more rows
  • Find all groups bigger than a threshold:

    popular_dests <- flights %>% 
      group_by(dest) %>% 
      filter(n() > 365)
    popular_dests
    #> # A tibble: 332,577 × 19
    #> # Groups:   dest [77]
    #>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
    #> 1  2013     1     1      517            515         2      830            819
    #> 2  2013     1     1      533            529         4      850            830
    #> 3  2013     1     1      542            540         2      923            850
    #> 4  2013     1     1      544            545        -1     1004           1022
    #> 5  2013     1     1      554            600        -6      812            837
    #> 6  2013     1     1      554            558        -4      740            728
    #> # … with 332,571 more rows, and 11 more variables: arr_delay <dbl>,
    #> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
    #> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
  • Standardise to compute per group metrics:

    popular_dests %>% 
      filter(arr_delay > 0) %>% 
      mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
      select(year:day, dest, arr_delay, prop_delay)
    #> # A tibble: 131,106 × 6
    #> # Groups:   dest [77]
    #>    year month   day dest  arr_delay prop_delay
    #>   <int> <int> <int> <chr>     <dbl>      <dbl>
    #> 1  2013     1     1 IAH          11  0.000111 
    #> 2  2013     1     1 IAH          20  0.000201 
    #> 3  2013     1     1 MIA          33  0.000235 
    #> 4  2013     1     1 ORD          12  0.0000424
    #> 5  2013     1     1 FLL          19  0.0000938
    #> 6  2013     1     1 ORD           8  0.0000283
    #> # … with 131,100 more rows

A grouped filter is a grouped mutate followed by an ungrouped filter. I generally avoid them except for quick and dirty manipulations: otherwise it’s hard to check that you’ve done the manipulation correctly.

Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries). You can learn more about useful window functions in the corresponding vignette: vignette("window-functions").

15.3.2 Exercises

  1. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

  2. Which plane (tailnum) has the worst on-time record?

  3. What time of day should you fly if you want to avoid delays as much as possible?

  4. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

  5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.

  6. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

  7. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.