27  Vectors

27.1 Introduction

So far we’ve talked about individual data types like numbers, strings, factors, tibbles and more. Now it’s time to learn more about how they fit together into a holistic structure. This offers relatively little immediate benefit, but it’s a necessary foundation for building your programming knowledge.

In this chapter we’ll explore the vector data type, the type that underlies pretty much all objects that we use to store data in R.

27.1.1 Prerequisites

The focus of this chapter is on base R data structures, so it isn’t essential to load any packages. We will, however, use a handful of functions from the purrr package to avoid some inconsistencies in base R.

27.2 Vectors

There are two fundamental types of vectors:

  1. Atomic vectors, of which there are six types: logical, integer, double, character, complex, and raw. Integer and double vectors are collectively known as numeric vectors. Raw and complex are rarely used during data analysis, so we won’t discuss them here.

  2. Lists, which are sometimes called recursive vectors because lists can contain other lists.

The chief difference between atomic vectors and lists is that atomic vectors are homogeneous (every element is the same type), while lists can be heterogeneous (every element can be a different type). Figure 27.1 summarizes the interrelationships.

A diagram that uses nested sets to show how R's vector types are related. There are two types at the top level: vectors and NULL. Inside vectors there are two types: atomic and list. Inside atomic there are three types: logical, numeric, and character. Inside numeric there are two types: integer, and double.

Figure 27.1: The hierarchy of R’s vector types.

27.2.1 Properties

Every vector has two key properties:

  1. Its type, which is one of logical, integer, double, character, list, etc. You can determine this with typeof().

    typeof(letters)
    #> [1] "character"
    typeof(1:10)
    #> [1] "integer"
    typeof(2.5)
    #> [1] "double"

    Sometimes you want to do different things based on the type of vector. One option is to use typeof(). Another is to use a test function which returns a TRUE or FALSE. Base R provides many functions like is.vector() and is.atomic(), but they often return surprising results. Instead, it’s safer to use the is_* functions provided by purrr, which correspond exactly to Figure 27.1.

  2. Its length, which you can determine with length().

    x <- list("a", "b", 1:10)
    length(x)
    #> [1] 3
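As a minimal sketch of why the base R test functions can surprise you, the example below uses only base R. The attribute name "units" is a hypothetical choice made for illustration.

```r
# A plain integer vector passes is.vector()
x <- 1:3
is.vector(x)
#> [1] TRUE

# Adding any attribute other than names makes is.vector() return FALSE,
# even though the underlying type has not changed
attr(x, "units") <- "cm"
is.vector(x)
#> [1] FALSE
typeof(x)
#> [1] "integer"

# A factor is stored as integers, but is.numeric() returns FALSE for it
f <- factor(c("a", "b"))
typeof(f)
#> [1] "integer"
is.numeric(f)
#> [1] FALSE
```

In other words, typeof() reports how the data is stored, while the is.*() functions often answer a slightly different question.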

Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create S3 vectors, which have additional behavior. You’ve seen three S3 vectors in this book: factors, dates, and date-times. We’ll come back to those in Section 27.4.

27.2.2 Atomic vectors

While technically speaking there are six types of atomic vector, in practice we only worry about three: logical vectors, numeric vectors, and character vectors.

  • Logical vectors were the subject of Chapter 13. They’re the simplest type of atomic vector because they can take only three possible values: FALSE, TRUE, and NA.
  • Numeric vectors were the subject of Chapter 14. Numeric vectors can either be integers or doubles. We lump them together in this book because there are few important differences when doing data analysis. The one important difference was discussed in Section 13.2.1: doubles are fundamentally approximations because they are floating point numbers that cannot always be precisely represented with a fixed amount of memory.
  • Character vectors were the subject of Chapter 15. They’re the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain any amount of data.
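The difference between integers and doubles can be sketched with a few lines of base R. This is only an illustration of the approximation issue, not a complete treatment of floating point arithmetic.

```r
# Numeric literals are doubles unless you add the L suffix
typeof(1)
#> [1] "double"
typeof(1L)
#> [1] "integer"

# Doubles are approximations, so exact equality can fail
x <- sqrt(2)^2
x == 2
#> [1] FALSE

# Comparing with a tolerance is safer
isTRUE(all.equal(x, 2))
#> [1] TRUE
```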

27.2.3 Lists

Lists are a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures, as you saw in Chapter 24. You create a list with list().

Unlike atomic vectors, lists can contain a mix of objects:

y <- list("a", 1L, 1.5, TRUE)
str(y)
#> List of 4
#>  $ : chr "a"
#>  $ : int 1
#>  $ : num 1.5
#>  $ : logi TRUE

Lists can even contain other lists!

z <- list(list(1, 2), list(3, 4))
str(z)
#> List of 2
#>  $ :List of 2
#>   ..$ : num 1
#>   ..$ : num 2
#>  $ :List of 2
#>   ..$ : num 3
#>   ..$ : num 4

27.2.4 Missing values and NULL

Note that each type of atomic vector has its own missing value:

NA            # logical
#> [1] NA
NA_integer_   # integer
#> [1] NA
NA_real_      # double
#> [1] NA
NA_character_ # character
#> [1] NA

This is usually unimportant because NA will almost always be automatically converted to the correct type.

There’s one other related object: NULL. NULL is often used to represent the absence of a vector (as opposed to NA which is used to represent the absence of a value in a vector). NULL typically behaves like a vector of length 0. NULL is sort of the equivalent of a missing value inside a list.
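A short base R sketch may make the distinction between NA and NULL more concrete. The variable names are arbitrary.

```r
# NA is a value of length 1; NULL has length 0
length(NA)
#> [1] 1
length(NULL)
#> [1] 0

# NA is converted to the type of the vector it appears in
typeof(c(NA, 1L))
#> [1] "integer"
typeof(c(NA, "a"))
#> [1] "character"

# NULL disappears inside c(), but a list can hold it as an element
x <- c(1, NULL, 3)
length(x)
#> [1] 2
y <- list(1, NULL, 3)
length(y)
#> [1] 3
is.null(y[[2]])
#> [1] TRUE
```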

27.2.5 Names

All types of vectors can be named. You can name them during creation with c() or list():

x <- c(x = 1, y = 2, z = 4)
x
#> x y z 
#> 1 2 4

It’s important to notice this display, because it can be surprising at first. str() is always a great tool to check the object is structured as you expect.

str(x)
#>  Named num [1:3] 1 2 4
#>  - attr(*, "names")= chr [1:3] "x" "y" "z"

Or after the fact with purrr::set_names():

x <- list(1, 2, 3)
x |> 
  set_names(c("a", "b", "c")) |> 
  str()
#> List of 3
#>  $ a: num 1
#>  $ b: num 2
#>  $ c: num 3

You can also pass set_names() a function. This is particularly useful if you have a character vector. And we’ll see an important use for it in Section 28.3.3.

x <- c("a", "b", "c")
x |> set_names(str_to_upper)
#>   A   B   C 
#> "a" "b" "c"

Named vectors are most useful for subsetting, described in Section 27.3.
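Names are stored as an attribute, so you can also inspect, set, or remove them with base R. The following is a small sketch of that, offered as an alternative view rather than a recommended workflow.

```r
x <- c(1, 2, 4)

# An unnamed vector has no names
names(x)
#> NULL

# Assigning to names() attaches them after the fact
names(x) <- c("x", "y", "z")
names(x)
#> [1] "x" "y" "z"

# unname() returns the vector without its names
unname(x)
#> [1] 1 2 4
```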

27.2.6 Coercion

There are two ways to convert, or coerce, one type of vector to another:

  1. Explicit coercion happens when you call a function like as.logical(), as.integer(), as.double(), or as.character(). Whenever you find yourself using explicit coercion, you should always check whether you can make the fix upstream, so that the vector never had the wrong type in the first place. For example, you may need to tweak your readr col_types specification.

  2. Implicit coercion happens when you use a vector in a specific context that expects a certain type of vector. For example, when you use a logical vector with a numeric summary function, or when you use a double vector where an integer vector is expected.

Because explicit coercion is used relatively rarely, and is largely easy to understand, we’ll focus on implicit coercion here. Just beware of using the as.*() functions on lists; if you need to get a list into a simple vector, put it inside a data frame and use the tools from Chapter 24.

as.character(list(1, 2, 3))
#> [1] "1" "2" "3"
as.character(list(1, list(2, list(3))))
#> [1] "1"                "list(2, list(3))"

You’ve already seen the most important type of implicit coercion: using a logical vector in a numeric context. In this case TRUE is converted to 1 and FALSE converted to 0. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues:

x <- sample(20, 100, replace = TRUE)
y <- x > 10
sum(y)  # how many are greater than 10?
#> [1] 38
mean(y) # what proportion are greater than 10?
#> [1] 0.38

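It’s also important to understand what happens when you try to create a vector containing multiple types with c(): the result takes the most complex type present, following the order logical < integer < double < character < list. This is generally rather too flexible, because the conversion happens silently.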

typeof(c(TRUE, 1L))
#> [1] "integer"
typeof(c(1L, 1.5))
#> [1] "double"
typeof(c(1.5, "a"))
#> [1] "character"
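The same ordering explains two cases that are easy to trip over: combining with a string turns everything into text, and combining with a list produces a list. A brief sketch in base R:

```r
# Logical and double values are converted to strings
c(TRUE, "a")
#> [1] "TRUE" "a"
c(1.5, "a")
#> [1] "1.5" "a"

# Combining an atomic vector with a list gives a list
x <- c(1, list("a"))
typeof(x)
#> [1] "list"
length(x)
#> [1] 2
```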

27.2.7 Exercises

  1. Carefully read the documentation of is.vector(). What does it actually test for? Why does is.atomic() not agree with the definition of atomic vectors above?

  2. Describe the difference between is.finite(x) and !is.infinite(x).

  3. A logical vector can take 3 possible values. How many possible values can an integer vector take? How many possible values can a double take? Use Google to do some research.

  4. Brainstorm at least four functions that allow you to convert a double to an integer. How do they differ? Be precise.

  5. What functions from the readr package allow you to turn a string into logical, integer, and double vector?

  6. Compare and contrast setNames() with purrr::set_names().

  7. Draw the following lists as nested sets:

    1. list(a, b, list(c, d), list(e, f))
    2. list(list(list(list(list(list(a))))))

27.3 Subsetting

There are three subsetting tools in base R: [, [[, and $. [ selects a vector; [[ selects a single element; and $ selects a single element by name. We’ll see how they apply to atomic vectors and lists, and then how they combine to provide an alternative to filter() and select() for working with data frames.

To explain more complicated list manipulation functions, it’s helpful to have a visual representation of lists and vectors. For example, take these three lists:

x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))

We’ll draw them as follows:

There are three principles:

  1. Lists have rounded corners. Atomic vectors have square corners.

  2. Children are drawn inside their parent, and have a slightly darker background to make it easier to see the hierarchy.

  3. The orientation of the children (i.e. rows or columns) isn’t important, so we’ll pick a row or column orientation to either save space or illustrate an important property in the example.

To learn more about the applications of subsetting, read the “Subsetting” chapter of Advanced R: http://adv-r.had.co.nz/Subsetting.html#applications.

27.3.1 Atomic vectors

[ is the subsetting function, and is called like x[a]. There are four types of things that you can subset a vector with:

  1. A numeric vector containing only integers. The integers must either be all positive, all negative, or zero.

    Subsetting with positive integers keeps the elements at those positions:

    x <- c("one", "two", "three", "four", "five")
    x[c(3, 2, 5)]
    #> [1] "three" "two"   "five"

    By repeating a position, you can actually make a longer output than input. (This makes subsetting a bit of a misnomer).

    x[c(1, 1, 5, 5, 5, 2)]
    #> [1] "one"  "one"  "five" "five" "five" "two"

    Negative values drop the elements at the specified positions:

    x[c(-1, -3, -5)]
    #> [1] "two"  "four"

    It’s an error to mix positive and negative values:

    x[c(1, -1)]
    #> Error in x[c(1, -1)]: only 0's may be mixed with negative subscripts

    The error message mentions subsetting with zero, which returns no values:

    x[0]
    #> character(0)

    This is not useful very often, but it can be helpful if you want to create unusual data structures to test your functions with.

  2. Subsetting with a logical vector keeps all values corresponding to a TRUE value. This is most often useful in conjunction with the comparison functions.

    x <- c(10, 3, NA, 5, 8, 1, NA)
    
    # All non-missing values of x
    x[!is.na(x)]
    #> [1] 10  3  5  8  1
    
    # All even (or missing!) values of x
    x[x %% 2 == 0]
    #> [1] 10 NA  8 NA

  3. If you have a named vector, you can subset it with a character vector:

    x <- c(abc = 1, def = 2, xyz = 5)
    x[c("xyz", "def")]
    #> xyz def 
    #>   5   2

    Like with positive integers, you can also use a character vector to duplicate individual entries.

  4. The simplest type of subsetting is nothing, x[], which returns the complete x. This is not useful for subsetting vectors, but as we’ll see shortly it is useful when subsetting 2d structures like tibbles.

There is an important variation of [ called [[. [[ only ever extracts a single element, and always drops names. It’s a good idea to use it whenever you want to make it clear that you’re extracting a single item, as in a for loop. The distinction between [ and [[ is most important for lists, as we’ll see shortly.
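For an atomic vector, the difference between [ and [[ can be sketched as follows, reusing the named vector from the example above.

```r
x <- c(abc = 1, def = 2, xyz = 5)

# [ keeps the name of the selected element
x["def"]
#> def 
#>   2

# [[ extracts a single element and drops the name
x[["def"]]
#> [1] 2
names(x[["def"]])
#> NULL

# [[ refuses to select more than one element:
# x[[c(1, 2)]] would be an error
```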

27.3.2 Lists

There are three ways to subset a list, which we’ll illustrate with a list named a:

a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

  • [ extracts a sub-list. The result will always be a list.

    str(a[1:2])
    #> List of 2
    #>  $ a: int [1:3] 1 2 3
    #>  $ b: chr "a string"
    str(a[4])
    #> List of 1
    #>  $ d:List of 2
    #>   ..$ : num -1
    #>   ..$ : num -5

    Like with vectors, you can subset with a logical, integer, or character vector.

  • [[ extracts a single component from a list. It removes a level of hierarchy from the list.

    str(a[[1]])
    #>  int [1:3] 1 2 3
    str(a[[4]])
    #> List of 2
    #>  $ : num -1
    #>  $ : num -5

  • $ is a shorthand for extracting named elements of a list. It works similarly to [[ except that you don’t need to use quotes.

    a$a
    #> [1] 1 2 3
    a[["a"]]
    #> [1] 1 2 3

The distinction between [ and [[ is really important for lists, because [[ drills down into the list while [ returns a new, smaller list. Compare the code and output above with the visual representation in Figure 27.2.

Figure 27.2: Subsetting a list, visually.
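One way to check which operator you need is to look at the type and length of the result. The sketch below reuses the list a defined earlier in this section.

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

# [ always returns a list, even for a single element
typeof(a[1])
#> [1] "list"
length(a[4])
#> [1] 1

# [[ returns the element itself
typeof(a[[1]])
#> [1] "integer"
length(a[[4]])
#> [1] 2

# You can chain [[ and $ to drill down into nested lists
a[[4]][[1]]
#> [1] -1
a$d[[2]]
#> [1] -5
```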

The difference between [ and [[ is very important, but it’s easy to get confused. To help you remember, let me show you an unusual pepper shaker in Figure 27.3. If this pepper shaker is your list pepper, then pepper[1] is a pepper shaker containing a single pepper packet, as in Figure 27.4. pepper[2] would look the same, but would contain the second packet. pepper[1:2] would be a pepper shaker containing two pepper packets. pepper[[1]] would extract the pepper packet itself, as in Figure 27.5.

A photo of a glass pepper shaker. Instead of the pepper shaker containing pepper, it contains many packets of pepper.

Figure 27.3: A pepper shaker that Hadley once found in his hotel room.

A photo of the glass pepper shaker containing just one packet of pepper.

Figure 27.4: pepper[1]

A single packet of pepper.

Figure 27.5: pepper[[1]]

27.3.3 Data frames

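Subsetting a data frame with a single index (1d subsetting) behaves like subsetting a list, because a data frame is a list of columns. Subsetting with two indices (2d subsetting) combines subsetting rows and subsetting columns.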
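A minimal sketch of both forms, using a small hypothetical base R data frame called df. Tibbles differ in some details (see the exercises), so treat this as an illustration of base data frames only.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# 1d subsetting works like a list: [ returns a smaller data frame,
# [[ and $ return the column itself
class(df["x"])
#> [1] "data.frame"
df[["x"]]
#> [1] 1 2 3
df$y
#> [1] "a" "b" "c"

# 2d subsetting selects rows and columns at the same time
df[2, ]
#>   x y
#> 2 2 b
df[df$x > 1, "y"]
#> [1] "b" "c"
```

Leaving the column position empty, as in df[2, ], uses the "subset with nothing" rule described earlier to keep every column.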

27.3.4 Exercises

  1. Create functions that take a vector as input and return:

    1. The last value. Should you use [ or [[?
    2. The elements at even numbered positions.
    3. Every element except the last value.
    4. Only even numbers (and no missing values).

  2. Why is x[-which(x > 0)] not the same as x[x <= 0]?

  3. What happens when you subset with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?

  4. What happens if you subset a tibble as if you’re subsetting a list? What are the key differences between a list and a tibble?

27.4 Attributes and S3 vectors

Any vector can contain arbitrary additional metadata through its attributes. You can think of attributes as a named list of vectors that can be attached to any object. You can get and set individual attribute values with attr() or see them all at once with attributes().

x <- 1:10
attr(x, "greeting")
#> NULL
attr(x, "greeting") <- "Hi!"
attr(x, "farewell") <- "Bye!"
attributes(x)
#> $greeting
#> [1] "Hi!"
#> 
#> $farewell
#> [1] "Bye!"

There are three very important attributes that are used to implement fundamental parts of R:

  1. Names are used to name the elements of a vector.
  2. Dimensions (dims, for short) make a vector behave like a matrix or array.
  3. Class is used to implement the S3 object oriented system.

You’ve seen names above, and we won’t cover dimensions because we don’t use matrices in this book. The S3 vectors you’ve seen in this book each combine a class with an underlying base type:

  • Factors (factor) are built on top of integer vectors.
  • Dates (Date) are built on top of double vectors.
  • Date-times (POSIXct) are built on top of double vectors.

27.4.1 Class

It remains to describe the class, which controls how generic functions work. Generic functions are key to object oriented programming in R, because they make functions behave differently for different classes of input. A detailed discussion of object oriented programming is beyond the scope of this book, but you can read more about it in Advanced R at http://adv-r.had.co.nz/OO-essentials.html#s3.

Here’s what a typical generic function looks like:

as.Date
#> function (x, ...) 
#> UseMethod("as.Date")
#> <bytecode: 0x55bdcef02270>
#> <environment: namespace:base>

The call to “UseMethod” means that this is a generic function, and it will call a specific method, a function, based on the class of the first argument. (All methods are functions; not all functions are methods). You can list all the methods for a generic with methods():

methods("as.Date")
#> [1] as.Date.character   as.Date.default     as.Date.factor     
#> [4] as.Date.numeric     as.Date.POSIXct     as.Date.POSIXlt    
#> [7] as.Date.vctrs_sclr* as.Date.vctrs_vctr*
#> see '?methods' for accessing help and source code

For example, if x is a character vector, as.Date() will call as.Date.character(); if it’s a factor, it’ll call as.Date.factor().

You can see the specific implementation of a method with getS3method():

getS3method("as.Date", "default")
#> function (x, ...) 
#> {
#>     if (inherits(x, "Date")) 
#>         x
#>     else if (is.null(x)) 
#>         .Date(numeric())
#>     else if (is.logical(x) && all(is.na(x))) 
#>         .Date(as.numeric(x))
#>     else stop(gettextf("do not know how to convert '%s' to class %s", 
#>         deparse1(substitute(x)), dQuote("Date")), domain = NA)
#> }
#> <bytecode: 0x55bdd324f788>
#> <environment: namespace:base>
getS3method("as.Date", "numeric")
#> function (x, origin, ...) 
#> {
#>     if (missing(origin)) {
#>         if (!length(x)) 
#>             return(.Date(numeric()))
#>         if (!any(is.finite(x))) 
#>             return(.Date(x))
#>         stop("'origin' must be supplied")
#>     }
#>     as.Date(origin, ...) + x
#> }
#> <bytecode: 0x55bdd3251750>
#> <environment: namespace:base>

The most important S3 generic is print(): it controls how the object is printed when you type its name at the console. Other important generics are the subsetting functions [, [[, and $.
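To see dispatch in action, here is a minimal sketch of a home-made generic. The generic describe() and its methods are hypothetical names invented for this example; they are not part of base R or any package.

```r
# The generic: UseMethod() picks a method based on the class of x
describe <- function(x, ...) {
  UseMethod("describe")
}

# Fallback used when no more specific method exists
describe.default <- function(x, ...) {
  paste("something of type", typeof(x))
}

# Methods are ordinary functions named generic.class
describe.factor <- function(x, ...) {
  paste("a factor with", nlevels(x), "levels")
}
describe.Date <- function(x, ...) {
  paste("a date vector of length", length(x))
}

describe(1:3)
#> [1] "something of type integer"
describe(factor(c("a", "b")))
#> [1] "a factor with 2 levels"
describe(as.Date("1971-01-01"))
#> [1] "a date vector of length 1"
```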

27.4.2 Factors

Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:

x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
typeof(x)
#> [1] "integer"
attributes(x)
#> $levels
#> [1] "ab" "cd" "ef"
#> 
#> $class
#> [1] "factor"
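The relationship between the integer codes and the levels can be sketched like this, reusing the factor from the example above.

```r
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))

# The underlying integers are positions in the levels vector
as.integer(x)
#> [1] 1 2 1

# Looking up those positions in the levels recovers the labels
levels(x)[as.integer(x)]
#> [1] "ab" "cd" "ab"

# as.character() does the same lookup for you
as.character(x)
#> [1] "ab" "cd" "ab"
```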

27.4.3 Dates and date-times

Dates in R are numeric vectors that represent the number of days since 1 January 1970.

x <- as.Date("1971-01-01")
unclass(x)
#> [1] 365

typeof(x)
#> [1] "double"
attributes(x)
#> $class
#> [1] "Date"

Date-times are numeric vectors with class POSIXct that represent the number of seconds since 1 January 1970. (In case you were wondering, “POSIXct” stands for “Portable Operating System Interface”, calendar time.)

x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)
#> [1] 3600
#> attr(,"tzone")
#> [1] "UTC"

typeof(x)
#> [1] "double"
attributes(x)
#> $class
#> [1] "POSIXct" "POSIXt" 
#> 
#> $tzone
#> [1] "UTC"

The tzone attribute is optional. It controls how the time is printed, not what absolute time it refers to.

attr(x, "tzone") <- "US/Pacific"
x
#> [1] "1969-12-31 17:00:00 PST"

attr(x, "tzone") <- "US/Eastern"
x
#> [1] "1969-12-31 20:00:00 EST"
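You can confirm that changing tzone leaves the underlying number untouched. The sketch below uses base R's as.POSIXct() instead of lubridate so that it runs without extra packages, and "America/Los_Angeles" is just an example time zone name.

```r
x <- as.POSIXct("1970-01-01 01:00", tz = "UTC")

# The stored value is the number of seconds since 1970-01-01 00:00 UTC
as.numeric(x)
#> [1] 3600

# Changing the time zone changes the display, not the stored value
attr(x, "tzone") <- "America/Los_Angeles"
as.numeric(x)
#> [1] 3600

# Dates work the same way, but count days
d <- as.Date("1971-01-01")
as.numeric(d)
#> [1] 365
```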

There is another type of date-time called POSIXlt. These are built on top of named lists:

y <- as.POSIXlt(x)
typeof(y)
#> [1] "list"
attributes(y)
#> $names
#>  [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"   "yday"  
#>  [9] "isdst"  "zone"   "gmtoff"
#> 
#> $class
#> [1] "POSIXlt" "POSIXt" 
#> 
#> $tzone
#> [1] "US/Eastern" "EST"        "EDT"

POSIXlts are rare inside the tidyverse. They do crop up in base R, because they are needed to extract specific components of a date, like the year or month. Since lubridate provides helpers for you to do this instead, you don’t need them. POSIXcts are always easier to work with, so if you find you have a POSIXlt, you should always convert it to a regular date-time with lubridate::as_datetime().
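For reference, here is a base R sketch of what extracting components from a POSIXlt looks like. The date is an arbitrary example, and as.POSIXct() is used for the conversion back so the code runs without lubridate.

```r
x <- as.POSIXct("2020-03-15 10:30:00", tz = "UTC")
y <- as.POSIXlt(x, tz = "UTC")

# Components are list elements; year counts from 1900 and mon from 0
y$year + 1900
#> [1] 2020
y$mon + 1
#> [1] 3
y$mday
#> [1] 15
y$hour
#> [1] 10

# Converting back gives the same point in time
as.POSIXct(y) == x
#> [1] TRUE
```

The offsets on year and mon are one reason these objects are awkward to work with directly.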

27.5 Other types

27.5.1 Tibbles

Tibbles are augmented lists: they have class “tbl_df” + “tbl” + “data.frame”, and names (column) and row.names attributes:

tb <- tibble::tibble(x = 1:5, y = 5:1)
typeof(tb)
#> [1] "list"
attributes(tb)
#> $class
#> [1] "tbl_df"     "tbl"        "data.frame"
#> 
#> $row.names
#> [1] 1 2 3 4 5
#> 
#> $names
#> [1] "x" "y"

The difference between a tibble and a list is that all the elements of a data frame must be vectors with the same length. All functions that work with tibbles enforce this constraint.

Traditional data.frames have a very similar structure:

df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
#> [1] "list"
attributes(df)
#> $names
#> [1] "x" "y"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1 2 3 4 5

The main difference is the class. The class of a tibble includes “data.frame”, which means tibbles inherit the regular data frame behavior by default.
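Because a data frame is a list underneath, the usual list tools work on it. A small sketch with the base data frame from above:

```r
df <- data.frame(x = 1:5, y = 5:1)

# A data frame is a list whose elements are the columns
is.list(df)
#> [1] TRUE
length(df)
#> [1] 2

# Every column has the same length
lengths(df)
#> x y 
#> 5 5

# inherits() checks whether a class is present in the class vector
inherits(df, "data.frame")
#> [1] TRUE
```

The same inherits() check also returns TRUE for a tibble, since its class vector includes “data.frame”.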

27.5.2 Exercises

  1. What does hms::hms(3600) return? How does it print? What primitive type is the augmented vector built on top of? What attributes does it use?

  2. Try and make a tibble that has columns with different lengths. What happens?