12  Tibbles

You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proofreading. You can find the complete first edition at https://r4ds.had.co.nz.

12.1 Introduction

Throughout this book we work with “tibbles” instead of R’s traditional data.frame. Tibbles are data frames, but they tweak some older behaviors to make your life a little easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It’s difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the tibble package, which provides opinionated data frames that make working in the tidyverse a little easier. In most places, I’ll use the terms tibble and data frame interchangeably; when I want to draw particular attention to R’s built-in data frame, I’ll call it a data.frame.

If this chapter leaves you wanting to learn more about tibbles, you might enjoy vignette("tibble").

12.1.1 Prerequisites

In this chapter we’ll explore the tibble package, part of the core tidyverse.
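
A minimal setup, assuming the tidyverse is already installed, is to attach it once at the top of your script. This makes tibble() and tribble() available, along with dplyr functions such as pull() that appear later in the chapter:

```r
# Attach the core tidyverse packages, including tibble and dplyr
library(tidyverse)
```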

12.2 Creating tibbles

If you need to make a tibble “by hand”, you can use tibble() or tribble(). tibble() works by assembling individual vectors:

x <- c(1, 2, 5)
y <- c("a", "b", "h")

tibble(x, y)
#> # A tibble: 3 × 2
#>       x y    
#>   <dbl> <chr>
#> 1     1 a    
#> 2     2 b    
#> 3     5 h

You can also name the inputs, provide data inline with c(), and perform computation that refers to columns defined earlier in the same call:

tibble(
  x1 = x,
  x2 = c(10, 15, 25),
  y = sqrt(x1^2 + x2^2)
)
#> # A tibble: 3 × 3
#>      x1    x2     y
#>   <dbl> <dbl> <dbl>
#> 1     1    10  10.0
#> 2     2    15  15.1
#> 3     5    25  25.5

Every column in a data frame or tibble must be the same length, so you’ll get an error if the lengths are different:

tibble(
  x = c(1, 5),
  y = c("a", "b", "c")
)
#> Error:
#> ! Tibble columns must have compatible sizes.
#> • Size 2: Existing data.
#> • Size 3: Column `y`.
#> ℹ Only values of size one are recycled.

As the error suggests, values of length one will be recycled to the same length as everything else:

tibble(
  x = 1:5,
  y = "a",
  z = TRUE
)
#> # A tibble: 5 × 3
#>       x y     z    
#>   <int> <chr> <lgl>
#> 1     1 a     TRUE 
#> 2     2 a     TRUE 
#> 3     3 a     TRUE 
#> 4     4 a     TRUE 
#> 5     5 a     TRUE

Another way to create a tibble is with tribble(), which is short for transposed tibble. tribble() is customized for data entry in code: column headings start with ~ and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy-to-read form:

tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)
#> # A tibble: 2 × 3
#>   x         y     z
#>   <chr> <dbl> <dbl>
#> 1 a         2   3.6
#> 2 b         1   8.5

Finally, if you have a regular data.frame you can turn it into a tibble with as_tibble():

as_tibble(mtcars)
#> # A tibble: 32 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#> 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
#> # … with 26 more rows

The inverse of as_tibble() is as.data.frame(); it converts a tibble back into a regular data.frame.
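
As a rough sketch of the round trip, you can check the class of the object at each step. The object names mtcars_tbl and mtcars_df are only illustrative:

```r
library(tibble)

# Convert a built-in data.frame to a tibble and back again
mtcars_tbl <- as_tibble(mtcars)
mtcars_df <- as.data.frame(mtcars_tbl)

class(mtcars_tbl)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(mtcars_df)
#> [1] "data.frame"
```

Note that as_tibble() drops row names by default, so the car names in mtcars do not survive this particular round trip.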

12.3 Non-syntactic names

It’s possible for a tibble to have column names that are not valid R variable names; these are called non-syntactic names. For example, a name might not start with a letter, or it might contain unusual characters like a space. To refer to these variables, you need to surround them with backticks, `:

tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)
tb
#> # A tibble: 1 × 3
#>   `:)`  ` `   `2000`
#>   <chr> <chr> <chr> 
#> 1 smile space number

You’ll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.
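
Here is a small sketch of what that looks like with dplyr. The sales tibble and its column names are made up for illustration:

```r
library(tibble)
library(dplyr)

# Hypothetical data with non-syntactic column names
sales <- tibble(
  `2000` = c(10, 20),
  `unit price` = c(1.5, 2)
)

# Backticks are needed wherever the columns are referred to by name
big_orders <- sales |>
  filter(`2000` > 10) |>
  select(`unit price`)

big_orders$`unit price`
#> [1] 2
```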

12.4 Tibbles vs. data.frame

There are two main differences in the usage of a tibble vs. a classic data.frame: printing and subsetting. If these differences cause problems when working with older packages, you can turn a tibble back to a regular data frame with as.data.frame().

12.4.1 Printing

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature inspired by str():

tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
#> # A tibble: 1,000 × 5
#>   a                   b              c     d e    
#>   <dttm>              <date>     <int> <dbl> <chr>
#> 1 2022-05-22 01:07:25 2022-05-28     1 0.368 n    
#> 2 2022-05-22 19:12:35 2022-06-02     2 0.612 l    
#> 3 2022-05-22 13:36:14 2022-06-12     3 0.415 p    
#> 4 2022-05-22 02:57:31 2022-06-11     4 0.212 m    
#> 5 2022-05-21 23:21:48 2022-06-08     5 0.733 i    
#> 6 2022-05-22 10:22:45 2022-06-04     6 0.460 n    
#> # … with 994 more rows

Where possible, tibbles also use color to draw your eye to important differences. One of the most important distinctions is between the string "NA" and the missing value, NA:

tibble(x = c("NA", NA))
#> # A tibble: 2 × 1
#>   x    
#>   <chr>
#> 1 NA   
#> 2 <NA>

Tibbles are designed to avoid overwhelming your console when you print large data frames. But sometimes you need more output than the default display. There are a few options that can help.

First, you can explicitly print() the data frame and control the number of rows (n) and the width of the display. width = Inf will display all columns:

library(nycflights13)

flights |> 
  print(n = 10, width = Inf)
#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#>    arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
#>        <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
#>  1        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
#>  2        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
#>  3        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
#>  4       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
#>  5       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
#>  6        12 UA        1696 N39463  EWR    ORD        150      719     5     58
#>  7        19 B6         507 N516JB  EWR    FLL        158     1065     6      0
#>  8       -14 EV        5708 N829AS  LGA    IAD         53      229     6      0
#>  9        -8 B6          79 N593JB  JFK    MCO        140      944     6      0
#> 10         8 AA         301 N3ALAA  LGA    ORD        138      733     6      0
#>    time_hour          
#>    <dttm>             
#>  1 2013-01-01 05:00:00
#>  2 2013-01-01 05:00:00
#>  3 2013-01-01 05:00:00
#>  4 2013-01-01 05:00:00
#>  5 2013-01-01 06:00:00
#>  6 2013-01-01 05:00:00
#>  7 2013-01-01 06:00:00
#>  8 2013-01-01 06:00:00
#>  9 2013-01-01 06:00:00
#> 10 2013-01-01 06:00:00
#> # … with 336,766 more rows

You can also control the default print behavior by setting options:

  • options(tibble.print_max = n, tibble.print_min = m): if there are more than n rows, print only the first m rows. Use options(tibble.print_min = Inf) to always show all rows.

  • Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen.
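
As a sketch of how these options might be used together, you can set them temporarily and then restore the previous values. The specific values 20 and 5 are arbitrary choices for illustration:

```r
library(tibble)

# Save the current settings so they can be restored afterwards
old <- options(tibble.print_max = 20, tibble.print_min = 5, tibble.width = Inf)

# 1,000 rows exceeds print_max (20), so only print_min (5) rows are shown
tibble(x = 1:1000)

# Restore the previous settings
options(old)
```

Saving the return value of options() is a base R idiom: it returns the old values of whatever you changed, so passing them back undoes the change.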

You can see a complete list of options by looking at the package help with package?tibble.

A final option is to use RStudio’s built-in data viewer to get a scrollable view of the complete dataset. This is also often useful at the end of a long chain of manipulations.

flights |> View()

12.4.2 Extracting variables

So far all the tools you’ve learned have worked with complete data frames. If you want to pull out a single variable, you can use dplyr::pull():

tb <- tibble(
  id = LETTERS[1:5],
  x1  = 1:5,
  y1  = 6:10
)

tb |> pull(x1) # by name
#> [1] 1 2 3 4 5
tb |> pull(1)  # by position
#> [1] "A" "B" "C" "D" "E"

pull() also takes an optional name argument that specifies the column to be used as names for a named vector, which you’ll learn about in Chapter 30.

tb |> pull(x1, name = id)
#> A B C D E 
#> 1 2 3 4 5

You can also use the base R tools $ and [[. [[ can extract by name or position; $ only extracts by name but is a little less typing.

# Extract by name
tb$x1
#> [1] 1 2 3 4 5
tb[["x1"]]
#> [1] 1 2 3 4 5

# Extract by position
tb[[1]]
#> [1] "A" "B" "C" "D" "E"

Compared to a data.frame, tibbles are stricter: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.

# Tibbles complain a lot:
tb$x
#> Warning: Unknown or uninitialised column: `x`.
#> NULL
tb$z
#> Warning: Unknown or uninitialised column: `z`.
#> NULL

# Data frames use partial matching and don't complain if a column doesn't exist
df <- as.data.frame(tb)
df$x
#> [1] 1 2 3 4 5
df$z
#> NULL

For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.

12.4.3 Subsetting

Lastly, there are some important differences when using [. With data.frames, [ sometimes returns a data.frame, and sometimes returns a vector, which is a common source of bugs. With tibbles, [ always returns another tibble. This can sometimes cause problems when working with older code. If you hit a function that doesn’t work with a tibble, just use as.data.frame() to turn your tibble back to a data.frame.
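
A minimal sketch of the difference, using a small made-up data frame (df_small and tb_small are illustrative names):

```r
library(tibble)

df_small <- data.frame(id = c("A", "B"), x1 = 1:2)
tb_small <- as_tibble(df_small)

# data.frame: selecting a single column with [ drops to a vector
df_small[, "x1"]
#> [1] 1 2
# ... but selecting two columns returns a data.frame
class(df_small[, c("id", "x1")])
#> [1] "data.frame"

# tibble: [ returns a tibble no matter how many columns are selected
class(tb_small[, "x1"])
#> [1] "tbl_df"     "tbl"        "data.frame"

# Use [[ when you want the vector itself
tb_small[["x1"]]
#> [1] 1 2
```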

12.4.4 Exercises

  1. How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data.frame).

  2. Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data.frame behaviors cause you frustration?

    df <- data.frame(abc = 1, xyz = "a")
    df$x
    df[, "xyz"]
    df[, c("abc", "xyz")]

  3. If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the referenced variable from a tibble?

  4. Practice referring to non-syntactic names in the following data frame by:

    1. Extracting the variable called 1.
    2. Plotting a scatterplot of 1 vs 2.
    3. Creating a new column called 3 which is 2 divided by 1.
    4. Renaming the columns to one, two and three.

    annoying <- tibble(
      `1` = 1:10,
      `2` = `1` * 2 + rnorm(length(`1`))
    )

  5. What does tibble::enframe() do? When might you use it?

  6. What option controls how many additional column names are printed at the footer of a tibble?