28  Iteration

You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at https://r4ds.had.co.nz.

28.1 Introduction

In this chapter, you’ll learn tools for iteration, repeatedly performing the same action on different objects. You’ve already learned a number of special-purpose tools for iteration.

Now it’s time to learn some more general tools. Tools for iteration can quickly become very abstract, but in this chapter we’ll keep things concrete to make it as easy as possible to learn the basics. We’re going to focus on three related tools for three related tasks: modifying multiple columns, reading multiple files, and saving multiple objects. We’ll conclude with a brief discussion of for-loops, an important iteration technique that we deliberately don’t cover here, and provide a few pointers for learning more.

28.1.1 Prerequisites

This chapter relies on features only found in purrr 1.0.0, which is still in development. If you want to live life on the edge you can get the dev version with devtools::install_github("tidyverse/purrr").

In this chapter, we’ll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You’ve seen dplyr before, but purrr is new. We’re going to use just a couple of purrr functions in this chapter, but it’s a great package to explore as you improve your programming skills.

This chapter also relies on a function that hasn’t yet been implemented for dplyr:

pick <- function(cols) {
  across({{ cols }})
}

28.2 Modifying multiple columns

Imagine you have this simple tibble:

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

And you want to compute the median of every column. You could do it with copy-and-paste:

df |> summarise(
  a = median(a),
  b = median(b),
  c = median(c),
  d = median(d),
  n = n()
)
#> # A tibble: 1 × 5
#>        a      b       c     d     n
#>    <dbl>  <dbl>   <dbl> <dbl> <int>
#> 1 -0.246 -0.287 -0.0567 0.144    10

But that breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of variables. Instead you can use across():

df |> summarise(
  across(a:d, median),
  n = n()
)
#> # A tibble: 1 × 5
#>        a      b       c     d     n
#>    <dbl>  <dbl>   <dbl> <dbl> <int>
#> 1 -0.246 -0.287 -0.0567 0.144    10

across() has three particularly important arguments, which we’ll discuss in detail in the following sections. You’ll use the first two every time you use across():

  • The first argument, .cols, specifies which columns you want to iterate over. It uses tidy-select syntax, just like select().
  • The second argument, .fns, specifies what to do with each column.

The .names argument gives you control over the output names, and is particularly useful when you use across() with mutate(). We’ll also discuss two important variations, if_any() and if_all(), which work with filter().

28.2.1 Selecting columns with .cols

The first argument to across() selects the columns to transform. This argument uses the same specifications as select() (Section 4.3.2), so you can use functions like starts_with() and ends_with() to select variables based on their name. Grouping columns are automatically ignored because they’re carried along for the ride by the dplyr verb.

There are two additional selection techniques that are particularly useful for across(): everything() and where(). everything() is straightforward: it selects every (non-grouping) column:

df <- tibble(
  grp = sample(2, 10, replace = TRUE),
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df |> 
  group_by(grp) |> 
  summarise(across(everything(), median))
#> # A tibble: 2 × 5
#>     grp       a       b     c     d
#>   <int>   <dbl>   <dbl> <dbl> <dbl>
#> 1     1 -0.0935 -0.0163 0.363 0.364
#> 2     2  0.312  -0.0576 0.208 0.565

where() allows you to select columns based on their type:

  • where(is.numeric) selects all numeric columns.
  • where(is.character) selects all string columns.
  • where(is.Date) selects all date columns.
  • where(is.POSIXct) selects all date-time columns.
  • where(is.logical) selects all logical columns.

df <- tibble(
  x1 = 1:3,
  x2 = runif(3),
  y1 = sample(letters, 3),
  y2 = c("banana", "apple", "egg")
)

df |> 
  summarise(across(where(is.numeric), mean))
#> # A tibble: 1 × 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     2 0.370

df |> 
  summarise(across(where(is.character), str_flatten))
#> # A tibble: 1 × 2
#>   y1    y2            
#>   <chr> <chr>         
#> 1 kjh   bananaappleegg

Just like other selectors, you can combine these with Boolean algebra. For example, !where(is.numeric) selects all non-numeric columns and starts_with("a") & where(is.logical) selects all logical columns whose name starts with “a”.
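As a small sketch of how these combinations behave, the example below uses a made-up tibble (flags and its column names are hypothetical, chosen only for illustration) and assumes dplyr is installed:

```r
library(dplyr)

# Hypothetical data: two logical columns, one numeric, one character.
flags <- tibble(
  a_ok = c(TRUE, FALSE, TRUE),
  a_score = c(1.5, 2.5, 3.5),
  b_ok = c(FALSE, FALSE, TRUE),
  label = c("x", "y", "z")
)

# !where(is.numeric) keeps every column that is not numeric.
non_numeric <- flags |> select(!where(is.numeric))
names(non_numeric)

# starts_with("a") & where(is.logical) keeps only logical columns
# whose name starts with "a"; here we count the TRUE values in each.
a_logical <- flags |>
  summarise(across(starts_with("a") & where(is.logical), sum))
a_logical
```

Note that a_score is excluded from the second selection because it is numeric rather than logical, even though its name starts with “a”.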

28.2.2 Defining the action with .fns

The second argument to across() defines how each column will be transformed. In simple cases, this will just be the name of an existing function, but you might want to supply additional arguments or perform multiple transformations, as described below.

Let’s motivate this problem with a simple example: what happens if we have some missing values in our data? median() propagates those missing values, giving us a suboptimal output:

rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
}

df <- tibble(
  a = rnorm_na(5, 1),
  b = rnorm_na(5, 1),
  c = rnorm_na(5, 2),
  d = rnorm(5)
)
df |> 
  summarise(
    across(a:d, median),
    n = n()
  )
#> # A tibble: 1 × 5
#>       a     b     c     d     n
#>   <dbl> <dbl> <dbl> <dbl> <int>
#> 1    NA    NA    NA 0.704     5

It’d be nice to be able to pass along na.rm = TRUE to median() to remove these missing values. To do so, instead of calling median() directly, we need to create a new function that calls median() with the correct arguments:

df |> 
  summarise(
    across(a:d, function(x) median(x, na.rm = TRUE)),
    n = n()
  )
#> # A tibble: 1 × 5
#>       a      b      c     d     n
#>   <dbl>  <dbl>  <dbl> <dbl> <int>
#> 1 0.429 -0.721 -0.796 0.704     5

This is a little verbose, so R comes with a handy shortcut: for this sort of throw away function1, you can replace function with \:

df |> 
  summarise(
    across(a:d, \(x) median(x, na.rm = TRUE)),
    n = n()
  )

In either case, across() effectively expands to the following code:

df |> summarise(
  a = median(a, na.rm = TRUE),
  b = median(b, na.rm = TRUE),
  c = median(c, na.rm = TRUE),
  d = median(d, na.rm = TRUE),
  n = n()
)

When we remove the missing values before computing the median, it would be nice to know just how many values we were removing. We can find that out by supplying two functions to across(): one to compute the median and the other to count the missing values. You can supply multiple functions with a named list:

df |> 
  summarise(
    across(a:d, list(
      median = \(x) median(x, na.rm = TRUE),
      n_miss = \(x) sum(is.na(x))
    )),
    n = n()
  )
#> # A tibble: 1 × 9
#>   a_median a_n_miss b_median b_n_miss c_median c_n_miss d_median d_n_miss     n
#>      <dbl>    <int>    <dbl>    <int>    <dbl>    <int>    <dbl>    <int> <int>
#> 1    0.429        1   -0.721        1   -0.796        2    0.704        0     5

If you look carefully, you might intuit that the columns are named using a glue specification (Section 15.3.2) like {.col}_{.fn} where .col is the name of the original column and .fn is the name of the function. That’s not a coincidence! As you’ll learn in the next section, you can use the .names argument to supply your own glue spec.

28.2.3 Column names

The result of across() is named according to the specification provided in the .names argument. We could specify our own if we wanted the name of the function to come first2:

df |> 
  summarise(
    across(
      a:d, 
      list(
        median = \(x) median(x, na.rm = TRUE),
        n_miss = \(x) sum(is.na(x))
      ), 
      .names = "{.fn}_{.col}"
    ),
    n = n(),
  )
#> # A tibble: 1 × 9
#>   median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d     n
#>      <dbl>    <int>    <dbl>    <int>    <dbl>    <int>    <dbl>    <int> <int>
#> 1    0.429        1   -0.721        1   -0.796        2    0.704        0     5

The .names argument is particularly important when you use across() with mutate(). By default the output of across() is given the same names as the inputs. This means that across() inside of mutate() will replace existing columns:

df |> 
  mutate(
    across(a:d, \(x) coalesce(x, 0))
  )
#> # A tibble: 5 × 4
#>        a      b      c      d
#>    <dbl>  <dbl>  <dbl>  <dbl>
#> 1  0     -0.463  0      2.13 
#> 2 -0.382 -0.980  0      0.704
#> 3  0.434  0     -1.06   0.715
#> 4  1.06   1.21  -0.796 -1.09 
#> 5  0.424 -1.28  -0.785  0.402

If you’d like to instead create new columns, you can use the .names argument to give the output new names:

df |> 
  mutate(
    across(a:d, \(x) x * 2, .names = "{.col}_double")
  )
#> # A tibble: 5 × 8
#>        a      b      c      d a_double b_double c_double d_double
#>    <dbl>  <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1 NA     -0.463 NA      2.13    NA       -0.925    NA       4.27 
#> 2 -0.382 -0.980 NA      0.704   -0.764   -1.96     NA       1.41 
#> 3  0.434 NA     -1.06   0.715    0.868   NA        -2.11    1.43 
#> 4  1.06   1.21  -0.796 -1.09     2.13     2.43     -1.59   -2.18 
#> 5  0.424 -1.28  -0.785  0.402    0.848   -2.56     -1.57    0.803

28.2.4 Filtering

across() is a great match for summarise() and mutate() but it’s not such a great fit for filter(), because there you usually combine multiple conditions with either | or &. So dplyr provides two variants of across() called if_any() and if_all():

df |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
#> # A tibble: 3 × 4
#>        a      b     c     d
#>    <dbl>  <dbl> <dbl> <dbl>
#> 1 NA     -0.463 NA    2.13 
#> 2 -0.382 -0.980 NA    0.704
#> 3  0.434 NA     -1.06 0.715
# same as:
df |> filter(if_any(a:d, is.na))
#> # A tibble: 3 × 4
#>        a      b     c     d
#>    <dbl>  <dbl> <dbl> <dbl>
#> 1 NA     -0.463 NA    2.13 
#> 2 -0.382 -0.980 NA    0.704
#> 3  0.434 NA     -1.06 0.715

df |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
#> # A tibble: 0 × 4
#> # … with 4 variables: a <dbl>, b <dbl>, c <dbl>, d <dbl>
# same as:
df |> filter(if_all(a:d, is.na))
#> # A tibble: 0 × 4
#> # … with 4 variables: a <dbl>, b <dbl>, c <dbl>, d <dbl>
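if_any() and if_all() also accept anonymous functions, just like across(). The sketch below uses a small made-up tibble (scores is hypothetical) so that the result is easy to check by eye:

```r
library(dplyr)

# Hypothetical data: row 1 is all positive, rows 2 and 3 each contain
# one negative value.
scores <- tibble(
  a = c(1, -2, 3),
  b = c(4, 5, -6),
  c = c(7, 8, 9)
)

# Keep rows where at least one of a, b, c is negative.
any_negative <- scores |> filter(if_any(a:c, \(x) x < 0))

# Keep rows where every one of a, b, c is positive.
all_positive <- scores |> filter(if_all(a:c, \(x) x > 0))

any_negative
all_positive
```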

28.2.5 across() in functions

across() is particularly useful to program with because it allows you to operate with multiple variables. For example, Jacob Scott uses this little helper to expand all date variables into year, month, and day variables:

expand_dates <- function(df) {
  df |> 
    mutate(
      across(
        where(lubridate::is.Date), 
        list(year = year, month = month, day = mday)
      )
    )
}

It also makes it easy to supply multiple variables in a single argument because the first argument uses tidy-select. You just need to remember to embrace that argument. For example, this function will compute the means of numeric variables by default. But by supplying the second argument you can choose to summarize just selected variables:

summarise_means <- function(df, summary_vars = where(is.numeric)) {
  df |> 
    summarise(
      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
      n = n()
    )
}
diamonds |> 
  group_by(clarity) |> 
  summarise_means()
#> # A tibble: 8 × 9
#>   clarity carat depth table price     x     y     z     n
#>   <ord>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 I1      1.28   62.7  58.3 3924.  6.76  6.71  4.21   741
#> 2 SI2     1.08   61.8  57.9 5063.  6.40  6.40  3.95  9194
#> 3 SI1     0.850  61.9  57.7 3996.  5.89  5.89  3.64 13065
#> 4 VS2     0.764  61.7  57.4 3925.  5.66  5.66  3.49 12258
#> 5 VS1     0.727  61.7  57.3 3839.  5.57  5.58  3.44  8171
#> 6 VVS2    0.596  61.7  57.0 3284.  5.22  5.23  3.22  5066
#> # … with 2 more rows

diamonds |> 
  group_by(clarity) |> 
  summarise_means(c(carat, x:z))
#> # A tibble: 8 × 6
#>   clarity carat     x     y     z     n
#>   <ord>   <dbl> <dbl> <dbl> <dbl> <int>
#> 1 I1      1.28   6.76  6.71  4.21   741
#> 2 SI2     1.08   6.40  6.40  3.95  9194
#> 3 SI1     0.850  5.89  5.89  3.64 13065
#> 4 VS2     0.764  5.66  5.66  3.49 12258
#> 5 VS1     0.727  5.57  5.58  3.44  8171
#> 6 VVS2    0.596  5.22  5.23  3.22  5066
#> # … with 2 more rows

28.2.6 Vs pivot_longer()

Before we go on, it’s worth pointing out an interesting connection between across() and pivot_longer(). In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. For example, we could rewrite our multiple summary across() as:

df |> 
  pivot_longer(a:d) |> 
  group_by(name) |> 
  summarise(
    median = median(value, na.rm = TRUE),
    n_miss = sum(is.na(value))
  )
#> # A tibble: 4 × 3
#>   name  median n_miss
#>   <chr>  <dbl>  <int>
#> 1 a      0.429      1
#> 2 b     -0.721      1
#> 3 c     -0.796      2
#> 4 d      0.704      0

This is a useful technique to know about because sometimes you’ll hit a problem that’s not currently possible to solve with across(): when you have groups of variables that you want to compute with simultaneously. For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:

df3 <- tibble(
  a_val = rnorm(10),
  a_w = runif(10),
  b_val = rnorm(10),
  b_w = runif(10),
  c_val = rnorm(10),
  c_w = runif(10),
  d_val = rnorm(10),
  d_w = runif(10)
)

There’s currently no way to do this with across()3, but it’s relatively straightforward with pivot_longer():

df3_long <- df3 |> 
  pivot_longer(
    everything(), 
    names_to = c("group", ".value"), 
    names_sep = "_"
  )
df3_long
#> # A tibble: 40 × 3
#>   group    val     w
#>   <chr>  <dbl> <dbl>
#> 1 a      0.404 0.678
#> 2 b      1.74  0.650
#> 3 c     -0.921 0.261
#> 4 d     -0.953 0.327
#> 5 a      2.04  0.665
#> 6 b     -1.64  0.815
#> # … with 34 more rows

df3_long |> 
  group_by(group) |> 
  summarise(mean = weighted.mean(val, w))
#> # A tibble: 4 × 2
#>   group    mean
#>   <chr>   <dbl>
#> 1 a      0.109 
#> 2 b      0.585 
#> 3 c     -0.746 
#> 4 d     -0.0142

If needed, you could pivot_wider() this back to the original form.
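As a rough sketch of that last step, assuming a summary shaped like the one above (means_long is hypothetical and its numbers are made up, not the output shown earlier), pivot_wider() spreads the groups back into one column each:

```r
library(dplyr)
library(tidyr)

# Hypothetical long-form summary: one row per group.
means_long <- tibble(
  group = c("a", "b", "c", "d"),
  mean = c(0.1, 0.6, -0.7, 0.0)
)

# One column per group, one row in total.
means_wide <- means_long |>
  pivot_wider(names_from = group, values_from = mean)
means_wide
```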

28.2.7 Exercises

  1. Compute the number of unique values in each column of palmerpenguins::penguins.

  2. Compute the mean of every column in mtcars.

  3. Group diamonds by cut, clarity, and color then count the number of observations and the mean of each numeric variable.

  4. What happens if you use a list of functions, but don’t name them? How is the output named?

  5. It is possible to use across() inside filter() where it’s equivalent to if_all(). Can you explain why?

  6. Adjust expand_dates() to automatically remove the date columns after they’ve been expanded. Do you need to embrace any arguments?

  7. Explain what each step of the pipeline in this function does. What special feature of where() are we taking advantage of?

    show_missing <- function(df, group_vars, summary_vars = everything()) {
      df |> 
        group_by(pick({{ group_vars }})) |> 
        summarise(
          across({{ summary_vars }}, \(x) sum(is.na(x))),
          .groups = "drop"
        ) |>
        select(where(\(x) any(x > 0)))
    }
    nycflights13::flights |> show_missing(c(year, month, day))

28.3 Reading multiple files

In the previous section, you learned how to use dplyr::across() to repeat a transformation on multiple columns. In this section, you’ll learn how to use purrr::map() to read every file in a directory. Let’s start with a little motivation: imagine you have a directory full of Excel spreadsheets4 you want to read. You could do it with copy and paste:

data2019 <- readxl::read_excel("data/y2019.xls")
data2020 <- readxl::read_excel("data/y2020.xls")
data2021 <- readxl::read_excel("data/y2021.xls")
data2022 <- readxl::read_excel("data/y2022.xls")

And then use dplyr::bind_rows() to combine them all together:

data <- bind_rows(data2019, data2020, data2021, data2022)

You can imagine that this would get tedious quickly, especially if you had 400 files, not four. So in the following sections, you’ll learn how to automate this sort of task. There are three basic steps: use dir() to list all the files, then use purrr::map() to read each of them into a list, then use purrr::list_rbind() to combine them into a single data frame. We’ll then discuss how you can handle situations of increasing heterogeneity, where you can’t do exactly the same thing to every file.

28.3.1 Listing files in a directory

dir() lists the files in a directory. You’ll almost always use three arguments:

  • path, the first argument, is the directory to look in.

  • pattern is a regular expression that the file names must match. The most common pattern is something like \\.xlsx$ or \\.csv$ to match an extension, but you can use whatever you need to extract the data files from anything else living in that directory.

  • full.names determines whether or not the directory name should be included in the output. You almost always want this to be TRUE.
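To see how the three arguments interact without depending on any files on disk, the sketch below builds a throwaway directory first. The directory and file names (dir-demo, 2019.xlsx, and so on) are made up for illustration, and the files are empty placeholders rather than real spreadsheets:

```r
# Create a temporary directory with two "spreadsheets" and one other file.
demo_dir <- file.path(tempdir(), "dir-demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("2019.xlsx", "2020.xlsx", "notes.txt")))

# pattern keeps only the names that match the regular expression;
# by default you get bare file names.
bare_names <- dir(demo_dir, pattern = "\\.xlsx$")
bare_names

# full.names = TRUE prepends the directory, giving paths you can pass
# straight to a reading function.
xlsx_paths <- dir(demo_dir, pattern = "\\.xlsx$", full.names = TRUE)
basename(xlsx_paths)
file.exists(xlsx_paths)
```

With the default full.names = FALSE, the results would only be usable if your working directory happened to be demo_dir, which is why TRUE is usually the safer choice.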

To make our motivating example concrete, this book contains a folder with 12 Excel spreadsheets containing data from the gapminder package. Each file contains one year’s worth of data for 142 countries. We can list them all with the appropriate call to dir():

paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
paths
#>  [1] "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx"
#>  [3] "data/gapminder/1962.xlsx" "data/gapminder/1967.xlsx"
#>  [5] "data/gapminder/1972.xlsx" "data/gapminder/1977.xlsx"
#>  [7] "data/gapminder/1982.xlsx" "data/gapminder/1987.xlsx"
#>  [9] "data/gapminder/1992.xlsx" "data/gapminder/1997.xlsx"
#> [11] "data/gapminder/2002.xlsx" "data/gapminder/2007.xlsx"

28.3.2 purrr::map() and list_rbind()

Now that we have these 12 paths, we could call read_excel() 12 times to get 12 data frames. In general, we won’t know how many files there are to read, so instead of saving each data frame to its own variable, we’ll put them all into a list, something like this:

list(
  readxl::read_excel("data/gapminder/1952.xlsx"),
  readxl::read_excel("data/gapminder/1957.xlsx"),
  readxl::read_excel("data/gapminder/1962.xlsx"),
  ...,
  readxl::read_excel("data/gapminder/2007.xlsx")
)

Now that’s just as tedious to type as before, but we can use a shortcut: purrr::map(). map() is similar to across(), but instead of doing something to each column in a data frame, it does something to each element of a vector. map(x, f) is shorthand for:

list(
  f(x[[1]]),
  f(x[[2]]),
  ...,
  f(x[[n]])
)
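A minimal check of that equivalence, using sqrt() as a stand-in for f and a short numeric vector as x (both chosen only for illustration):

```r
library(purrr)

x <- c(1, 4, 9)

# The long-hand version: call the function on each element in turn.
by_hand <- list(sqrt(x[[1]]), sqrt(x[[2]]), sqrt(x[[3]]))

# The map() version: same calls, same list, no repetition.
with_map <- map(x, sqrt)

identical(by_hand, with_map)
```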

So we can use map() to get a list of 12 data frames:

files <- map(paths, readxl::read_excel)
length(files)
#> [1] 12

files[[1]]
#> # A tibble: 142 × 5
#>   country     continent lifeExp      pop gdpPercap
#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
#> 1 Afghanistan Asia         28.8  8425333      779.
#> 2 Albania     Europe       55.2  1282697     1601.
#> 3 Algeria     Africa       43.1  9279525     2449.
#> 4 Angola      Africa       30.0  4232095     3521.
#> 5 Argentina   Americas     62.5 17876956     5911.
#> 6 Australia   Oceania      69.1  8691212    10040.
#> # … with 136 more rows

(This is another data structure that doesn’t display particularly compactly with str(), so you might want to load it into RStudio and inspect it with View().)

Now we can use purrr::list_rbind() to combine that list of data frames into a single data frame:

list_rbind(files)
#> # A tibble: 1,704 × 5
#>   country     continent lifeExp      pop gdpPercap
#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
#> 1 Afghanistan Asia         28.8  8425333      779.
#> 2 Albania     Europe       55.2  1282697     1601.
#> 3 Algeria     Africa       43.1  9279525     2449.
#> 4 Angola      Africa       30.0  4232095     3521.
#> 5 Argentina   Americas     62.5 17876956     5911.
#> 6 Australia   Oceania      69.1  8691212    10040.
#> # … with 1,698 more rows

Or we could do both steps at once in a pipeline:

paths |> 
  map(readxl::read_excel) |> 
  list_rbind()

What if we want to pass in extra arguments to read_excel()? We use the same technique that we used with across(). For example, it’s often useful to peek at just the first row of each file with n_max = 1:

paths |> 
  map(\(path) readxl::read_excel(path, n_max = 1)) |> 
  list_rbind()
#> # A tibble: 12 × 5
#>   country     continent lifeExp      pop gdpPercap
#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
#> 1 Afghanistan Asia         28.8  8425333      779.
#> 2 Afghanistan Asia         30.3  9240934      821.
#> 3 Afghanistan Asia         32.0 10267083      853.
#> 4 Afghanistan Asia         34.0 11537966      836.
#> 5 Afghanistan Asia         36.1 13079460      740.
#> 6 Afghanistan Asia         38.4 14880372      786.
#> # … with 6 more rows

This makes it clear that something is missing: there’s no year column because that value is recorded in the path, not the individual files. We’ll tackle that problem next.

28.3.3 Data in the path

Sometimes the name of the file is itself data. In this example, the file name contains the year, which is not otherwise recorded in the individual files. To get that column into the final data frame, we need to do two things.

Firstly, we name the vector of paths. The easiest way to do this is with the set_names() function, which can take a function. Here we use basename() to extract just the file name from the full path:

paths <- paths |> set_names(basename) 
paths
#>                  1952.xlsx                  1957.xlsx 
#> "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx" 
#>                  1962.xlsx                  1967.xlsx 
#> "data/gapminder/1962.xlsx" "data/gapminder/1967.xlsx" 
#>                  1972.xlsx                  1977.xlsx 
#> "data/gapminder/1972.xlsx" "data/gapminder/1977.xlsx" 
#>                  1982.xlsx                  1987.xlsx 
#> "data/gapminder/1982.xlsx" "data/gapminder/1987.xlsx" 
#>                  1992.xlsx                  1997.xlsx 
#> "data/gapminder/1992.xlsx" "data/gapminder/1997.xlsx" 
#>                  2002.xlsx                  2007.xlsx 
#> "data/gapminder/2002.xlsx" "data/gapminder/2007.xlsx"

Those names are automatically carried along by all the map functions, so the list of data frames will have those same names:

paths |> 
  map(readxl::read_excel) |> 
  names()
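You can see the names being carried along without reading any files. In this sketch, demo_paths is a made-up vector of paths that are never opened, and nchar() stands in for the reading function:

```r
library(purrr)

# Hypothetical paths; they only need to be strings for this example.
demo_paths <- c("data/x/1952.xlsx", "data/x/1957.xlsx")

# set_names() with a function computes each name from the element itself.
named_paths <- set_names(demo_paths, basename)

# map() keeps the names of its input on its output.
path_lengths <- map(named_paths, nchar)
names(path_lengths)
```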

Then we use the names_to argument of list_rbind() to tell it to save the names to a new column called year, then use readr::parse_number() to turn it into a number.

paths |> 
  set_names(basename) |> 
  map(readxl::read_excel) |> 
  list_rbind(names_to = "year") |> 
  mutate(year = parse_number(year))
#> # A tibble: 1,704 × 6
#>    year country     continent lifeExp      pop gdpPercap
#>   <dbl> <chr>       <chr>       <dbl>    <dbl>     <dbl>
#> 1  1952 Afghanistan Asia         28.8  8425333      779.
#> 2  1952 Albania     Europe       55.2  1282697     1601.
#> 3  1952 Algeria     Africa       43.1  9279525     2449.
#> 4  1952 Angola      Africa       30.0  4232095     3521.
#> 5  1952 Argentina   Americas     62.5 17876956     5911.
#> 6  1952 Australia   Oceania      69.1  8691212    10040.
#> # … with 1,698 more rows

In more complicated cases, there might be another variable stored in the directory name, or maybe the file name contains multiple bits of data. In that case, use set_names() (without any arguments) to record the full path, and then use tidyr::separate() and friends to turn them into useful columns.

paths |> 
  set_names() |> 
  map(readxl::read_excel) |> 
  list_rbind(names_to = "year") |> 
  separate(
    year, 
    into = c(NA, "directory", "file", "ext"), 
    sep = "[/.]"
  )
#> # A tibble: 1,704 × 8
#>   directory file  ext   country     continent lifeExp      pop gdpPercap
#>   <chr>     <chr> <chr> <chr>       <chr>       <dbl>    <dbl>     <dbl>
#> 1 gapminder 1952  xlsx  Afghanistan Asia         28.8  8425333      779.
#> 2 gapminder 1952  xlsx  Albania     Europe       55.2  1282697     1601.
#> 3 gapminder 1952  xlsx  Algeria     Africa       43.1  9279525     2449.
#> 4 gapminder 1952  xlsx  Angola      Africa       30.0  4232095     3521.
#> 5 gapminder 1952  xlsx  Argentina   Americas     62.5 17876956     5911.
#> 6 gapminder 1952  xlsx  Australia   Oceania      69.1  8691212    10040.
#> # … with 1,698 more rows

28.3.4 Save your work

Now that you’ve done all this hard work to get to a nice tidy data frame, it’s a great time to save your work:

gapminder <- paths |> 
  set_names(basename) |> 
  map(readxl::read_excel) |> 
  list_rbind(names_to = "year") |> 
  mutate(year = parse_number(year))

write_csv(gapminder, "gapminder.csv")

If you’re working in a project, we’d suggest calling the file that does this sort of data prep work something like 0-cleanup.R. The 0 in the file name suggests that this should be run before anything else.

If your input data files change over time, you might consider learning a tool like targets to set up your data cleaning code to automatically re-run whenever one of the input files is modified.

28.3.5 Many simple iterations

Here we’ve just loaded the data directly from disk, and were lucky enough to get a tidy dataset. In most cases, you’ll need to do some additional tidying, and you have two basic options: you can do one round of iteration with a complex function, or do multiple rounds of iteration with multiple simple functions. In our experience most folks reach first for one complex iteration, but you’re often better off doing multiple simple iterations.

For example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine. One way to approach the problem is to write a function that takes a file and does all those steps, and then call map() once:

process_file <- function(path) {
  df <- read_csv(path)
  
  df |> 
    filter(!is.na(id)) |> 
    mutate(id = tolower(id)) |> 
    pivot_longer(jan:dec, names_to = "month")
}

paths |> 
  map(process_file) |> 
  list_rbind()

Another approach is to read all the files and combine them together first. Then you only need to perform each tidying step once, on the combined data frame:

paths |> 
  map(read_csv) |> 
  list_rbind() |> 
  filter(!is.na(id)) |> 
  mutate(id = tolower(id)) |> 
  pivot_longer(jan:dec, names_to = "month")

We recommend the second approach because it stops you getting fixated on getting the first file right before moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.

28.3.6 Heterogeneous data

Unfortunately it’s sometimes not possible to go from map() straight to list_rbind() because the data frames are so heterogeneous that list_rbind() either fails or yields a data frame that’s not very useful. In that case, it’s still useful to start by loading all of the files:

files <- paths |> 
  map(readxl::read_excel) 

Then a very useful strategy is to convert the structure of the data frames to data so that you can explore using your data science skills. One way to do so is with this handy df_types function that returns a tibble with one row for each column:

df_types <- function(df) {
  tibble(
    col_name = names(df), 
    col_type = map_chr(df, vctrs::vec_ptype_full),
    n_miss = map_int(df, \(x) sum(is.na(x)))
  )
}

df_types(starwars)
#> # A tibble: 14 × 3
#>   col_name   col_type  n_miss
#>   <chr>      <chr>      <int>
#> 1 name       character      0
#> 2 height     integer        6
#> 3 mass       double        28
#> 4 hair_color character      5
#> 5 skin_color character      0
#> 6 eye_color  character      0
#> # … with 8 more rows
df_types(nycflights13::flights)
#> # A tibble: 19 × 3
#>   col_name       col_type n_miss
#>   <chr>          <chr>     <int>
#> 1 year           integer       0
#> 2 month          integer       0
#> 3 day            integer       0
#> 4 dep_time       integer    8255
#> 5 sched_dep_time integer       0
#> 6 dep_delay      double     8255
#> # … with 13 more rows

You can then apply this function to all of the files, and maybe do some pivoting to make it easy to see where there are differences.

files |> 
  map(df_types) |> 
  list_rbind(names_to = "file_name") |> 
  select(-n_miss) |> 
  pivot_wider(names_from = col_name, values_from = col_type)
#> # A tibble: 12 × 6
#>   file_name country   continent lifeExp pop    gdpPercap
#>       <int> <chr>     <chr>     <chr>   <chr>  <chr>    
#> 1         1 character character double  double double   
#> 2         2 character character double  double double   
#> 3         3 character character double  double double   
#> 4         4 character character double  double double   
#> 5         5 character character double  double double   
#> 6         6 character character double  double double   
#> # … with 6 more rows

If the files have heterogeneous formats you might need to do more processing before you can successfully merge them. Unfortunately we’re now going to leave you to figure that out on your own, but you might want to read about map_if() and map_at(). map_if() allows you to selectively modify elements of a list based on their values; map_at() allows you to selectively modify elements based on their names.
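To give a flavour of what that might look like, here is a small sketch, not a recipe. It assumes a made-up situation where one file calls its year column yr instead of year; the list sheets, its element names, and its contents are all hypothetical:

```r
library(purrr)
library(dplyr)

# Hypothetical list of data frames with inconsistent column names.
sheets <- list(
  first = tibble(year = 1952, pop = 10),
  second = tibble(yr = 1957, pop = 12)
)

# map_if(): modify only the elements whose contents satisfy a predicate,
# here "has a column called yr".
fixed_if <- sheets |>
  map_if(\(df) "yr" %in% names(df), \(df) rename(df, year = yr))

# map_at(): modify only the elements picked out by name.
fixed_at <- sheets |>
  map_at("second", \(df) rename(df, year = yr))

# Either way, the columns now line up and the list can be combined.
combined <- list_rbind(fixed_if)
combined
```

map_if() is usually the more robust choice when you don’t know in advance which files are affected, while map_at() is handy once you’ve identified the specific offenders.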

28.3.7 Handling failures

Sometimes the structure of your data might be sufficiently wild that you can’t even read all the files with a single command. And then you’ll encounter one of the downsides of map(): it succeeds or fails as a whole. map() will either successfully read all of the files in a directory or fail with an error. This is annoying: why does one failure prevent you from accessing all the other successes?

Luckily, purrr comes with a helper to tackle this problem: possibly(). When you wrap a function in possibly(), a failure will instead return a fallback value of your choosing, here NULL. list_rbind() automatically ignores NULLs, so the following code will always succeed:

files <- paths |> 
  map(possibly(\(path) readxl::read_excel(path), NULL))

data <- files |> list_rbind()

Now comes the hard part of figuring out why they failed and what to do about it. Start by getting the paths that failed:

failed <- map_vec(files, is.null)
paths[failed]
#> named character(0)

Then call the import function again for each failure and figure out what went wrong.
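Because every gapminder file reads cleanly, the output above is empty. To see the whole workflow with an actual failure, the sketch below swaps in a hypothetical reader, read_fake(), that deliberately errors on any path containing “bad”; nothing is read from disk. It uses map_lgl() rather than map_vec() to find the failures, which gives the same logical vector here:

```r
library(purrr)

# Hypothetical stand-in for a file reader: fails on "bad" paths.
read_fake <- function(path) {
  if (grepl("bad", path)) stop("Can't read ", path)
  data.frame(path = path, n_chars = nchar(path))
}

fake_paths <- c("a.csv", "bad.csv", "ccc.csv")

# possibly() turns each error into NULL instead of stopping the whole map().
results <- map(fake_paths, possibly(read_fake, otherwise = NULL))

# The NULL elements tell us which inputs failed.
failed <- map_lgl(results, is.null)
fake_paths[failed]

# list_rbind() skips the NULLs and combines the successes.
combined <- list_rbind(results)
combined

# Re-running the unwrapped function on a failure surfaces the real error.
try(read_fake(fake_paths[failed][[1]]))
```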

28.4 Saving multiple outputs

In the last section, you learned about map(), which is useful for reading multiple files into a single object. In this section, we’ll now explore the opposite: how can you take one or more R objects and save them to one or more files? We’ll explore this challenge using three examples:

  • Saving multiple data frames into one database.
  • Saving multiple data frames into multiple csv files.
  • Saving multiple plots to multiple .png files.

28.4.1 Writing to a database

Sometimes when working with many files at once, it’s not possible to fit all your data into memory at once. If you can’t map(files, read_csv) how can you work with your data? One approach is to put it all into a database and then use dbplyr to access just the subsets that you need.

If you’re lucky, the database package will provide a handy function that will take a vector of paths and load them all into the database. This is the case with duckdb’s duckdb_read_csv():

con <- DBI::dbConnect(duckdb::duckdb())
duckdb::duckdb_read_csv(con, "gapminder", paths)

But here we don’t have csv files, we have Excel spreadsheets. So we’re going to have to do it “by hand”. And you can use this same pattern for databases that don’t have a handy function for loading many csv files.

We need to start by creating a table that we will fill in with data. The easiest way to do this is by creating a template from the existing data. So we begin by loading a single row from one file and adding the year to it:

template <- readxl::read_excel(paths[[1]], n_max = 1)
template$year <- 1952
template
#> # A tibble: 1 × 6
#>   country     continent lifeExp     pop gdpPercap  year
#>   <chr>       <chr>       <dbl>   <dbl>     <dbl> <dbl>
#> 1 Afghanistan Asia         28.8 8425333      779.  1952

Now we can connect to the database and use DBI::dbCreateTable() to turn our template into a database table:

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)

dbCreateTable() doesn’t use the data in template, just the variable names and types. So if we inspect the gapminder table now, we’ll see that it’s empty but has the variables we need:

con |> tbl("gapminder")
#> # Source:   table<gapminder> [0 x 6]
#> # Database: DuckDB 0.5.1 [unknown@Linux 5.15.0-1020-azure:R 4.2.1/:memory:]
#> # … with 6 variables: country <chr>, continent <chr>, lifeExp <dbl>, pop <dbl>,
#> #   gdpPercap <dbl>, year <dbl>

Next, we need a function that takes a single file path, reads it into R, and adds the result to the gapminder table. Adding rows to an existing table is the job of DBI::dbAppendTable():

append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))
  
  DBI::dbAppendTable(con, "gapminder", df)
}

Now we need to call append_file() once for each path. That’s certainly possible with map():

paths |> map(append_file)

But we don’t actually care about the output, so instead of map() it’s slightly nicer to use walk(). walk() does exactly the same thing as map() but throws the output away:

paths |> walk(append_file)
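
The difference is easy to see with a toy function. In this sketch, announce() and toy_paths are made up for illustration; the point is only that map() collects the return values, while walk() calls the function for its side effect and invisibly returns its input.

```r
library(purrr)

# Hypothetical function with a side effect (printing) and a return value
announce <- function(path) {
  cat("appending", path, "\n")
  nchar(path)
}

toy_paths <- c("1952.xlsx", "1957.xlsx")

# map() keeps the return values in a list
out_map <- map(toy_paths, announce)

# walk() runs the same calls but hands back its input unchanged
out_walk <- walk(toy_paths, announce)

is.list(out_map)
identical(out_walk, toy_paths)
```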

Now we can see that we have all the data in our table:

con |> 
  tbl("gapminder") |> 
  count(year)
#> # Source:   SQL [?? x 2]
#> # Database: DuckDB 0.5.1 [unknown@Linux 5.15.0-1020-azure:R 4.2.1/:memory:]
#>    year     n
#>   <dbl> <dbl>
#> 1  1952   142
#> 2  1957   142
#> 3  1962   142
#> 4  1967   142
#> 5  1972   142
#> 6  1977   142
#> # … with more rows
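
Putting the pieces together, here is a self-contained sketch of the same create-then-append pattern that runs without any spreadsheets. toy_year() and the table name toy are hypothetical stand-ins for the yearly files and the gapminder table, and the sketch assumes the DBI, duckdb, and purrr packages are installed.

```r
library(DBI)
library(purrr)

# Hypothetical stand-in for reading one yearly spreadsheet
toy_year <- function(year) {
  data.frame(
    country = c("A", "B"),
    lifeExp = c(50, 60),
    year = year
  )
}

con <- dbConnect(duckdb::duckdb())

# Create an empty table using only the names and types of the template
dbCreateTable(con, "toy", toy_year(1952))

# Append one batch of rows per year; walk() discards the return values
walk(c(1952, 1957, 1962), \(year) dbAppendTable(con, "toy", toy_year(year)))

# Count the rows per year to confirm that every batch arrived
counts <- dbGetQuery(
  con,
  "SELECT year, COUNT(*) AS n FROM toy GROUP BY year ORDER BY year"
)
counts

# Close the connection and shut down the in-memory database
dbDisconnect(con, shutdown = TRUE)
```

Disconnecting at the end is optional in a short script, but it is a useful habit when the database lives in a file rather than in memory.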

28.4.2 Writing csv files

The same basic principle applies if we want to write multiple csv files, one for each group. Let’s imagine that we want to take the ggplot2::diamonds data and save one csv file for each clarity. First we need to make those individual datasets. One way to do that is with dplyr’s group_split():

by_clarity <- diamonds |> 
  group_by(clarity) |> 
  group_split()

This produces a list of length 8, containing one tibble for each unique value of clarity:

length(by_clarity)
#> [1] 8

by_clarity[[1]]
#> # A tibble: 741 × 10
#>   carat cut       color clarity depth table price     x     y     z
#>   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  0.32 Premium   E     I1       60.9    58   345  4.38  4.42  2.68
#> 2  1.17 Very Good J     I1       60.2    61  2774  6.83  6.9   4.13
#> 3  1.01 Premium   F     I1       61.8    60  2781  6.39  6.36  3.94
#> 4  1.01 Fair      E     I1       64.5    58  2788  6.29  6.21  4.03
#> 5  0.96 Ideal     F     I1       60.7    55  2801  6.37  6.41  3.88
#> 6  1.04 Premium   G     I1       62.2    58  2801  6.46  6.41  4   
#> # … with 735 more rows

If we were going to save these data frames by hand, we might write something like:

write_csv(by_clarity[[1]], "diamonds-I1.csv")
write_csv(by_clarity[[2]], "diamonds-SI2.csv")
write_csv(by_clarity[[3]], "diamonds-SI1.csv")
...
write_csv(by_clarity[[8]], "diamonds-IF.csv")

This is a little different to our previous uses of map() because there are two arguments changing, not just one. That means that we’ll need to use map2() instead of map().

But before we can use map2() we need to figure out the names for those files. The most general way to do so is to use dplyr::group_keys() to get the unique values of the grouping variables, then use mutate() and str_glue() to make a path:

keys <- diamonds |> 
  group_by(clarity) |> 
  group_keys()
keys
#> # A tibble: 8 × 1
#>   clarity
#>   <ord>  
#> 1 I1     
#> 2 SI2    
#> 3 SI1    
#> 4 VS2    
#> 5 VS1    
#> 6 VVS2   
#> # … with 2 more rows

paths <- keys |> 
  mutate(path = str_glue("diamonds-{clarity}.csv")) |> 
  pull()
paths
#> diamonds-I1.csv
#> diamonds-SI2.csv
#> diamonds-SI1.csv
#> diamonds-VS2.csv
#> diamonds-VS1.csv
#> diamonds-VVS2.csv
#> diamonds-VVS1.csv
#> diamonds-IF.csv

This feels a bit fiddly here because we’re only working with a single grouping variable, but you can imagine this is very powerful when you’re grouping by multiple variables.
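
To make that concrete, here is a sketch that groups by two variables. It uses the built-in mtcars data rather than diamonds, and the names car_keys and car_paths as well as the mtcars-{cyl}-{am}.csv naming scheme are assumptions chosen for illustration.

```r
library(dplyr)
library(stringr)

# One row per combination of the two grouping variables
car_keys <- mtcars |>
  group_by(cyl, am) |>
  group_keys()

# Both key columns are available inside str_glue()
car_paths <- car_keys |>
  mutate(path = str_glue("mtcars-{cyl}-{am}.csv")) |>
  pull(path)
car_paths
```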

Now that we have all the pieces in place, we can eliminate the need to copy and paste by running walk2():

walk2(by_clarity, paths, write_csv)

This is shorthand for:

write_csv(by_clarity[[1]], paths[[1]])
write_csv(by_clarity[[2]], paths[[2]])
write_csv(by_clarity[[3]], paths[[3]])
...
write_csv(by_clarity[[8]], paths[[8]])
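
If you’d like to try this pattern without cluttering your working directory, one option is to write into a temporary folder. This sketch assumes dplyr, purrr, and readr are installed; it uses mtcars instead of diamonds, and out_dir is a hypothetical scratch location.

```r
library(dplyr)
library(purrr)
library(readr)

# Split the data and get the matching keys, both in the same group order
by_cyl <- mtcars |>
  group_by(cyl) |>
  group_split()

cyl_keys <- mtcars |>
  group_by(cyl) |>
  group_keys()

# Hypothetical scratch directory under the session's temp folder
out_dir <- file.path(tempdir(), "mtcars-by-cyl")
dir.create(out_dir, showWarnings = FALSE)

cyl_paths <- file.path(out_dir, paste0("mtcars-", cyl_keys$cyl, ".csv"))

# Pair each data frame with its path and write it out
walk2(by_cyl, cyl_paths, write_csv)

# Check that every file was created
file.exists(cyl_paths)
```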

28.4.3 Saving plots

We can take the same basic approach to create many plots. We’re jumping the gun here a bit because you won’t learn how to save a single plot until Section 30.7, but hopefully you’ll get the basic idea.

Let’s assume you’ve already split up the data using group_split(). Now you can use map() to create a list of many plots⁵:

plots <- by_clarity |>
  map(\(df) ggplot(df, aes(carat)) + geom_histogram(binwidth = 0.01))

(If this was a more complicated plot you’d use a named function so there’s more room for all the details.)
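
As a rough illustration of that idea, here is what a named-function version might look like. carat_histogram() is a hypothetical helper, and the sketch uses base split() so that it runs on its own; it assumes ggplot2 and purrr are installed.

```r
library(ggplot2)
library(purrr)

# Hypothetical helper: a named function leaves room for titles, labels, and themes
carat_histogram <- function(df) {
  ggplot(df, aes(carat)) +
    geom_histogram(binwidth = 0.01) +
    labs(
      title = paste("Clarity:", df$clarity[[1]]),
      x = "Carat",
      y = "Count"
    )
}

# split() returns a list named by the levels of clarity
by_clarity_named <- split(diamonds, diamonds$clarity)

# One plot per clarity level
clarity_plots <- map(by_clarity_named, carat_histogram)
names(clarity_plots)
```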

Then you create the file names:

paths <- keys |> 
  mutate(path = str_glue("clarity-{clarity}.png")) |> 
  pull()
paths
#> clarity-I1.png
#> clarity-SI2.png
#> clarity-SI1.png
#> clarity-VS2.png
#> clarity-VS1.png
#> clarity-VVS2.png
#> clarity-VVS1.png
#> clarity-IF.png

Then use walk2() with ggsave() to save each plot:

walk2(paths, plots, \(path, plot) ggsave(path, plot, width = 6, height = 6))

This is shorthand for:

ggsave(paths[[1]], plots[[1]], width = 6, height = 6)
ggsave(paths[[2]], plots[[2]], width = 6, height = 6)
ggsave(paths[[3]], plots[[3]], width = 6, height = 6)
...
ggsave(paths[[8]], plots[[8]], width = 6, height = 6)

28.4.4 Exercises

  1. Imagine you have a table of student data containing (amongst other variables) school_name and student_id. Sketch out what code you’d write if you wanted to save all the information for each student in a file called {student_id}.csv in the {school_name} directory.

28.5 For loops

Before we finish up this chapter, we have a duty to mention another important technique for iteration in R: the for loop. for loops are a powerful and general tool that you definitely need to learn as you become a more experienced R programmer. But we skip them here because, as you’ve seen, you can solve a whole bunch of useful problems just by learning across(), map(), and walk2(). If you’d like to learn more about for loops, https://adv-r.hadley.nz/control-flow.html#loops is one place to start.

Some people will tell you to avoid for loops because they are slow. They’re wrong! (Well, at least they’re rather out of date, as for loops haven’t been slow for many years.) The chief benefit of using functions like map() is not speed, but clarity: once you’ve mastered the basic idea, they make your code easier to write and to read.
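
For comparison, here is a small sketch of the same task written both ways. toy is a made-up data frame; the for loop and map_dbl() both produce a named vector of column medians, and the difference lies in how much bookkeeping you write yourself.

```r
library(purrr)

# Made-up data for illustration
toy <- data.frame(
  a = c(1, 5, 3),
  b = c(2, 4, 9),
  c = c(7, 8, 6)
)

# For loop: allocate the output, then fill it in one column at a time
medians_loop <- vector("double", ncol(toy))
names(medians_loop) <- names(toy)
for (col in names(toy)) {
  medians_loop[[col]] <- median(toy[[col]])
}

# map_dbl(): the allocation and the loop are handled for you
medians_map <- map_dbl(toy, median)

all.equal(medians_loop, medians_map)
```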

28.6 Summary

In this chapter you learned iteration tools to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs. But in general, iteration is a superpower: if you know the right iteration technique, you can easily go from fixing one problem to fixing any number of problems. Once you’ve mastered the techniques in this chapter, we highly recommend learning more by reading https://purrr.tidyverse.org and the Functionals chapter of Advanced R.

This chapter concludes the programming section of the book. You’ve now learned the basics of programming in R. You now know the data types that underpin all of the objects you work with, and have two powerful techniques (functions and iteration) for reducing the duplication in your code. We hope you’ve got a taste for how programming can help your analyses, and that you’ve made a solid start on your journey to become not just a data scientist who uses R, but a data scientist who can program in R.


  1. These are often called anonymous functions because you don’t give them a name with <-.↩︎

  2. You can’t currently change the order of the columns, but you could reorder them after the fact using relocate() or similar.↩︎

  3. Maybe there will be one day, but currently we don’t see how.↩︎

  4. If you instead had a directory of csv files with the same format, you could use the technique from Section 8.4.↩︎

  5. You can print plots to get a crude animation — you’ll get one plot for each element of plots.↩︎