20  Missing values

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

20.1 Introduction

You’ve already learned the basics of missing values earlier in the book. You first saw them in Section 4.4.2 where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in Section 14.2.2. Now we’ll come back to them in more depth, so you can learn more of the details.

We’ll start by discussing some general tools for working with missing values recorded as NAs. We’ll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. We’ll finish off with a related discussion of empty groups, caused by factor levels that don’t appear in the data.

20.1.1 Prerequisites

The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.

20.2 Explicit missing values

To begin, let’s explore a few handy tools for creating or eliminating explicit missing values, i.e. cells where you see an NA.

20.2.1 Last observation carried forward

A common use for missing values is as a data entry convenience. When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated:

treatment <- tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         NA,
  "Katherine Burke",  1,         4
)

You can fill in these missing values with tidyr::fill(). It works like select(), taking a set of columns:

treatment |>
  fill(everything())
#> # A tibble: 4 × 3
#>   person           treatment response
#>   <chr>                <dbl>    <dbl>
#> 1 Derrick Whitmore         1        7
#> 2 Derrick Whitmore         2       10
#> 3 Derrick Whitmore         3       10
#> 4 Katherine Burke          1        4

This treatment is sometimes called “last observation carried forward”, or locf for short. You can use the .direction argument to fill in missing values that have been generated in more exotic ways.
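As a small sketch of a non-default fill direction, suppose a value was recorded only on the last row of each block, so it has to be carried backward instead of forward. The sales data below is made up for illustration; .direction = "up" is one of the values fill() accepts ("down" is the default).

```r
library(tidyr)
library(tibble)

# Hypothetical data: `year` is recorded only on the last row of the block,
# so the missing values must be filled from below.
sales <- tribble(
  ~quarter, ~year, ~sales,
  "Q1",     NA,    66013,
  "Q2",     NA,    69182,
  "Q3",     NA,    53175,
  "Q4",     2000,  21001
)

# Carry the last recorded year upward into the rows above it
filled_up <- sales |>
  fill(year, .direction = "up")
```

After this, every row of filled_up has year equal to 2000.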

20.2.2 Fixed values

Sometimes missing values represent some fixed and known value, most commonly 0. You can use dplyr::coalesce() to replace them:

x <- c(1, 4, 5, 7, NA)
coalesce(x, 0)
#> [1] 1 4 5 7 0

You could use mutate() together with across() to apply this treatment to (say) every numeric column in a data frame:

df |> 
  mutate(across(where(is.numeric), \(x) coalesce(x, 0)))
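The snippet above relies on a df that isn’t defined in this chapter. Here is a self-contained sketch with a made-up data frame; the column names are only for illustration. It uses an anonymous function inside across(), which avoids passing extra arguments through across() itself.

```r
library(dplyr)

# Hypothetical data frame with one character and two numeric columns
df <- tibble(
  id = c("a", "b", "c"),
  x  = c(1, NA, 3),
  y  = c(NA, 5, 6)
)

# Replace NA with 0 in every numeric column; `id` is left untouched
df_zero <- df |>
  mutate(across(where(is.numeric), \(col) coalesce(col, 0)))
```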

20.2.3 Sentinel values

Sometimes you’ll hit the opposite problem, where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.

If possible, handle this when reading in the data, for example, by using the na argument to readr::read_csv(). If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use dplyr::na_if():

x <- c(1, 4, 5, 7, -99)
na_if(x, -99)
#> [1]  1  4  5  7 NA

You could apply this transformation to every numeric column in a data frame with the following code.

df |> 
  mutate(across(where(is.numeric), \(x) na_if(x, -99)))
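For comparison, here is a sketch of handling the sentinel at read time with the na argument of read_csv(). The CSV content is invented, and wrapping a string in I() to mark it as literal data assumes readr 2.0.0 or newer.

```r
library(readr)

# Hypothetical CSV content in which -99 marks a missing measurement.
# I() tells readr to treat the string as literal data rather than a path.
raw <- I("id,score\n1,7\n2,-99\n3,4\n")

# Treat empty strings, "NA", and "-99" as missing while parsing
scores <- read_csv(raw, na = c("", "NA", "-99"), show_col_types = FALSE)
```

Note that na applies to every column, so this is only appropriate when -99 can’t be a legitimate value anywhere in the file.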

20.2.4 NaN

Before we continue, there’s one special type of missing value that you’ll encounter from time to time: a NaN (pronounced “nan”), or not a number. It’s not that important to know about because it generally behaves just like NA:

x <- c(NA, NaN)
x * 10
#> [1]  NA NaN
x == 1
#> [1] NA NA
is.na(x)
#> [1] TRUE TRUE

In the rare case you need to distinguish an NA from a NaN, you can use is.nan(x).
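A short sketch of telling the two apart: is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN, so combining them isolates the “plain” NAs.

```r
x <- c(NA, NaN, 1)

# TRUE only for NaN
only_nan <- is.nan(x)

# TRUE only for NA values that are not NaN
only_na <- is.na(x) & !is.nan(x)
```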

You’ll generally encounter a NaN when you perform a mathematical operation that has an indeterminate result:

0 / 0 
#> [1] NaN
0 * Inf
#> [1] NaN
Inf - Inf
#> [1] NaN
sqrt(-1)
#> Warning in sqrt(-1): NaNs produced
#> [1] NaN

20.3 Implicit missing values

So far we’ve talked about missing values that are explicitly missing, i.e. you can see an NA in your data. But missing values can also be implicitly missing, if an entire row of data is simply absent from the data. Let’s illustrate the difference with a simple data set that records the price of some stock each quarter:

stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

This dataset has two missing observations:

  • The price in the fourth quarter of 2020 is explicitly missing, because its value is NA.

  • The price for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.

One way to think about the difference is with this Zen-like koan:

An explicit missing value is the presence of an absence.

An implicit missing value is the absence of a presence.

Sometimes you want to make implicit missings explicit in order to have something physical to work with. In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them. The following sections discuss some tools for moving between implicit and explicit missingness.

20.3.1 Pivoting

You’ve already seen one tool that can make implicit missings explicit and vice versa: pivoting. Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value. For example, if we pivot stocks to put the quarter in the columns, both missing values become explicit:

stocks |>
  pivot_wider(
    names_from = qtr, 
    values_from = price
  )
#> # A tibble: 2 × 5
#>    year   `1`   `2`   `3`   `4`
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  2020  1.88  0.59  0.35 NA   
#> 2  2021 NA     0.92  0.17  2.66

By default, making data longer preserves explicit missing values, but if they are structural missing values that only exist because the data is not tidy, you can drop them (make them implicit) by setting values_drop_na = TRUE. See the examples in Section 6.2 for more details.
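As a sketch of the reverse direction, the following starts from a wide version of the stocks data (typed in by hand here so the example is self-contained) and pivots it longer while dropping the NA cells.

```r
library(tidyr)
library(tibble)

# The widened stocks data, entered by hand for this example
stocks_wide <- tibble(
  year = c(2020, 2021),
  `1`  = c(1.88, NA),
  `2`  = c(0.59, 0.92),
  `3`  = c(0.35, 0.17),
  `4`  = c(NA, 2.66)
)

# Pivot back to one row per year and quarter, dropping rows whose price is NA
stocks_long <- stocks_wide |>
  pivot_longer(
    cols = !year,
    names_to = "qtr",
    values_to = "price",
    values_drop_na = TRUE
  )
```

Both missing values become implicit, leaving six rows. That includes the price for the fourth quarter of 2020, which was explicitly missing in the original data, so use this option only when the NAs are purely structural.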

20.3.2 Complete

tidyr::complete() allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of year and qtr should exist in the stocks data:

stocks |>
  complete(year, qtr)
#> # A tibble: 8 × 3
#>    year   qtr price
#>   <dbl> <dbl> <dbl>
#> 1  2020     1  1.88
#> 2  2020     2  0.59
#> 3  2020     3  0.35
#> 4  2020     4 NA   
#> 5  2021     1 NA   
#> 6  2021     2  0.92
#> # … with 2 more rows

Typically, you’ll call complete() with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the stocks dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for year:

stocks |>
  complete(year = 2019:2021, qtr)
#> # A tibble: 12 × 3
#>    year   qtr price
#>   <dbl> <dbl> <dbl>
#> 1  2019     1 NA   
#> 2  2019     2 NA   
#> 3  2019     3 NA   
#> 4  2019     4 NA   
#> 5  2020     1  1.88
#> 6  2020     2  0.59
#> # … with 6 more rows

If the range of a variable is correct, but not all values are present, you could use full_seq(x, 1) to generate all values from min(x) to max(x) spaced out by 1.
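A minimal sketch of full_seq() inside complete(), using a made-up dataset in which the year 2020 is absent entirely:

```r
library(tidyr)
library(tibble)

# Hypothetical data: the range 2019-2021 is right, but 2020 never appears
stocks_gap <- tibble(
  year  = c(2019, 2019, 2021, 2021),
  qtr   = c(1, 2, 1, 2),
  price = c(1.1, 1.2, 3.1, 3.2)
)

# full_seq(year, 1) generates 2019, 2020, 2021, so the rows for 2020 are added
completed <- stocks_gap |>
  complete(year = full_seq(year, 1), qtr)
```

The result has six rows, and the two rows for 2020 have an NA price.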

In some cases, the complete set of observations can’t be generated by a simple combination of variables. In that case, you can do manually what complete() does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with dplyr::full_join().
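Here is a sketch of that manual approach applied to the stocks data. In this simple case the full set of rows could be generated by complete() itself; expand_grid() stands in for whatever technique you need to build the rows that should exist.

```r
library(dplyr)
library(tidyr)

stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Every row that should exist, built independently of the observed data
all_rows <- expand_grid(
  year = c(2020, 2021),
  qtr  = c(1, 2, 3, 4)
)

# Rows with no match in stocks get an NA price
stocks_full <- all_rows |>
  full_join(stocks, by = c("year", "qtr"))
```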

20.3.3 Joins

This brings us to another important way of revealing implicitly missing observations: joins. Often you can only know that values are missing from one dataset when you go to join it to another. dplyr::anti_join() is particularly useful at revealing these values. The following example shows how two anti_join()s reveal that we’re missing information for four airports and 722 planes.

library(nycflights13)

flights |> 
  distinct(faa = dest) |> 
  anti_join(airports)
#> Joining, by = "faa"
#> # A tibble: 4 × 1
#>   faa  
#>   <chr>
#> 1 BQN  
#> 2 SJU  
#> 3 STT  
#> 4 PSE

flights |> 
  distinct(tailnum) |> 
  anti_join(planes)
#> Joining, by = "tailnum"
#> # A tibble: 722 × 1
#>   tailnum
#>   <chr>  
#> 1 N3ALAA 
#> 2 N3DUAA 
#> 3 N542MQ 
#> 4 N730MQ 
#> 5 N9EAMQ 
#> 6 N532UA 
#> # … with 716 more rows

The default behavior of joins is to succeed if observations in x don’t have a match in y. If you’re worried about this, and you have dplyr 1.1.0 or newer, you can use the new unmatched = "error" argument to tell joins to error if any rows in x don’t have a match in y.
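The following sketch shows unmatched = "error" with inner_join() on two made-up tables. It assumes dplyr 1.1.0 or newer. As far as we understand the dplyr documentation, which input is checked depends on the type of join, so consult the documentation for the join you’re using.

```r
library(dplyr)

# Hypothetical tables: key 3 in x has no partner in y
x <- tibble(key = c(1, 2, 3), val_x = c("a", "b", "c"))
y <- tibble(key = c(1, 2), val_y = c("A", "B"))

# With unmatched = "error", the join fails instead of silently dropping key 3
result <- tryCatch(
  inner_join(x, y, by = "key", unmatched = "error"),
  error = function(e) "join failed: unmatched rows"
)
```

Here result holds the message string rather than a joined data frame, because the unmatched row triggered an error.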

20.3.4 Exercises

  1. Can you find any relationship between the carrier and the rows that appear to be missing from planes?

20.4 Factors and empty groups

A final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors. For example, imagine we have a dataset that contains some health information about people:

health <- tibble(
  name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age = c(34L, 88L, 75L, 47L, 56L),
)

And we want to count the number of smokers with dplyr::count():

health |> count(smoker)
#> # A tibble: 1 × 2
#>   smoker     n
#>   <fct>  <int>
#> 1 no         5

This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty. We can ask count() to keep all the groups, even those not seen in the data, by using .drop = FALSE:

health |> count(smoker, .drop = FALSE)
#> # A tibble: 2 × 2
#>   smoker     n
#>   <fct>  <int>
#> 1 yes        0
#> 2 no         5

The same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying drop = FALSE to the appropriate discrete axis:

ggplot(health, aes(smoker)) +
  geom_bar() +
  scale_x_discrete()

ggplot(health, aes(smoker)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)

A bar chart with a single value on the x-axis, "no".

The same bar chart as the last plot, but now with two values on the x-axis, "yes" and "no". There is no bar for the "yes" category.

The same problem comes up more generally with dplyr::group_by(). And again you can use .drop = FALSE to preserve all factor levels:

health |> 
  group_by(smoker, .drop = FALSE) |> 
  summarise(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )
#> Warning in min(age): no non-missing arguments to min; returning Inf
#> Warning in max(age): no non-missing arguments to max; returning -Inf
#> # A tibble: 2 × 6
#>   smoker     n mean_age min_age max_age sd_age
#>   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
#> 1 yes        0      NaN     Inf    -Inf   NA  
#> 2 no         5       60      34      88   21.6

We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors. There’s an important distinction between empty vectors, which have length 0, and missing values, which each have length 1.

# A vector containing two missing values
x1 <- c(NA, NA)
length(x1)
#> [1] 2

# A vector containing nothing
x2 <- numeric()
length(x2)
#> [1] 0

All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see mean(age) returning NaN because mean(age) = sum(age)/length(age), which here is 0/0. max() and min() return -Inf and Inf for empty vectors, so if you combine the results with a non-empty vector of new data and recompute, you’ll get the minimum or maximum of the new data (see footnote 1).
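A small sketch of these identities, using an empty vector and some made-up new data:

```r
empty <- numeric()
new_data <- c(3, 8, 5)

# min() and max() warn on empty input, so silence the warnings here
min_empty <- suppressWarnings(min(empty))  # Inf
max_empty <- suppressWarnings(max(empty))  # -Inf

# Combining the empty-group summaries with new data recovers the new data's range
combined_min <- min(min_empty, min(new_data))
combined_max <- max(max_empty, max(new_data))

# sum(empty) / length(empty) is 0 / 0
mean_empty <- mean(empty)
```

combined_min and combined_max equal the minimum and maximum of new_data, and mean_empty is NaN.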

A sometimes simpler approach is to perform the summary and then make the implicit missings explicit with complete().

health |> 
  group_by(smoker) |> 
  summarise(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  ) |> 
  complete(smoker)
#> # A tibble: 2 × 6
#>   smoker     n mean_age min_age max_age sd_age
#>   <fct>  <int>    <dbl>   <int>   <int>  <dbl>
#> 1 yes       NA       NA      NA      NA   NA  
#> 2 no         5       60      34      88   21.6

The main drawback of this approach is that you get an NA for the count, even though you know that it should be zero.
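One possible workaround, sketched below, is the fill argument of complete(), which supplies a replacement value for the newly created rows. The summary is trimmed to two columns to keep the example short.

```r
library(dplyr)
library(tidyr)

health <- tibble(
  name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age = c(34L, 88L, 75L, 47L, 56L)
)

# Use 0 for the count in rows that complete() adds; other summaries stay NA
summary_tbl <- health |>
  group_by(smoker) |>
  summarise(
    n = n(),
    mean_age = mean(age)
  ) |>
  complete(smoker, fill = list(n = 0L))
```

The "yes" row now has n equal to 0, while mean_age remains NA.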


  1. In other words, min(c(x, y)) is always equal to min(min(x), min(y)).