8  Data import

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

8.1 Introduction

Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you’ll learn how to read plain-text rectangular files into R. Here, we’ll only scratch the surface of data import, but many of the principles will translate to other forms of data, which we’ll come back to in ?sec-wrangle.

8.1.1 Prerequisites

In this chapter, you’ll learn how to load flat files in R with the readr package, which is part of the core tidyverse.

8.2 Getting started

Most of readr’s functions are concerned with turning flat files into data frames:

  • read_csv() reads comma delimited files, read_csv2() reads semicolon separated files (common in countries where , is used as the decimal place), read_tsv() reads tab delimited files, and read_delim() reads in files with any delimiter.

  • read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions(). read_table() reads a common variation of fixed width files where columns are separated by white space.

  • read_log() reads Apache style log files. (But also check out webreadr which is built on top of read_log() and provides many more helpful tools.)

These functions all have similar syntax: once you’ve mastered one, you can use the others with ease. For the rest of this chapter we’ll focus on read_csv(). Not only are csv files one of the most common forms of data storage, but once you understand read_csv(), you can easily apply your knowledge to all the other functions in readr.

8.3 Reading data from a file

Here is what a simple CSV file with a row for column names (also commonly referred to as the header row) and six rows of data looks like.

#> Student ID,Full Name,favourite.food,mealPlan,AGE
#> 1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
#> 2,Barclay Lynn,French fries,Lunch only,5
#> 3,Jayendra Lyne,N/A,Breakfast and lunch,7
#> 4,Leon Rossini,Anchovies,Lunch only,
#> 5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
#> 6,Güvenç Attila,Ice cream,Lunch only,6

Note that the ,s separate the columns. Table 8.1 shows a representation of the same data as a table.

Table 8.1: Data from the students.csv file as a table.
Student ID Full Name favourite.food mealPlan AGE
1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 Barclay Lynn French fries Lunch only 5
3 Jayendra Lyne N/A Breakfast and lunch 7
4 Leon Rossini Anchovies Lunch only NA
5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 Güvenç Attila Ice cream Lunch only 6

The first argument to read_csv() is the most important: it’s the path to the file to read.

students <- read_csv("data/students.csv")
#> Rows: 6 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): Full Name, favourite.food, mealPlan, AGE
#> dbl (1): Student ID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

When you run read_csv() it prints out a message that tells you how many rows (excluding the header row) and columns the data has along with the delimiter used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about how to retrieve the full column specification as well as how to quiet this message. This message is an important part of readr, which we’ll come back to in Section 22.2 on parsing a file.

You can also supply an inline csv file. This is useful for experimenting with readr and for creating reproducible examples to share with others:

read_csv("a,b,c
1,2,3
4,5,6")
#> # A tibble: 2 × 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

In both cases read_csv() uses the first line of the data for the column names, which is a very common convention. There are two cases where you might want to tweak this behavior:

  1. Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines; or use comment = "#" to drop all lines that start with (e.g.) #.

    read_csv("The first line of metadata
      The second line of metadata
      x,y,z
      1,2,3", skip = 2)
    #> # A tibble: 1 × 3
    #>       x     y     z
    #>   <dbl> <dbl> <dbl>
    #> 1     1     2     3
    
    read_csv("# A comment I want to skip
      x,y,z
      1,2,3", comment = "#")
    #> # A tibble: 1 × 3
    #>       x     y     z
    #>   <dbl> <dbl> <dbl>
    #> 1     1     2     3
  2. The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:

    read_csv("1,2,3\n4,5,6", col_names = FALSE)
    #> # A tibble: 2 × 3
    #>      X1    X2    X3
    #>   <dbl> <dbl> <dbl>
    #> 1     1     2     3
    #> 2     4     5     6

    ("\n" is a convenient shortcut for adding a new line. You’ll learn more about it and other types of string escape in Chapter 16.)

    Alternatively you can pass col_names a character vector which will be used as the column names:

    read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
    #> # A tibble: 2 × 3
    #>       x     y     z
    #>   <dbl> <dbl> <dbl>
    #> 1     1     2     3
    #> 2     4     5     6

Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file:

read_csv("a,b,c\n1,2,.", na = ".")
#> # A tibble: 1 × 3
#>       a     b c    
#>   <dbl> <dbl> <lgl>
#> 1     1     2 NA

This is all you need to know to read ~75% of CSV files that you’ll encounter in practice. You can also easily adapt what you’ve learned to read tab separated files with read_tsv() and fixed width files with read_fwf(). To read in more challenging files, you’ll need to learn more about how readr parses each column, turning them into R vectors.

8.3.1 First steps

Let’s take another look at the students data. In the favourite.food column, there are a bunch of food items and then the character string N/A, which should have been an real NA that R will recognize as “not available”. This is something we can address using the na argument.

students <- read_csv("data/students.csv", na = c("N/A", ""))

students
#> # A tibble: 6 × 5
#>   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
#>          <dbl> <chr>            <chr>              <chr>               <chr>
#> 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2            2 Barclay Lynn     French fries       Lunch only          5    
#> 3            3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6            6 Güvenç Attila    Ice cream          Lunch only          6

Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. For example, the column names in the students file we read in are formatted in non-standard ways. You might consider renaming them one by one with dplyr::rename() or you might use the janitor::clean_names() function turn them all into snake case at once.1 This function takes in a data frame and returns a data frame with variable names converted to snake case.

library(janitor)
students |>
  clean_names()
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>        <dbl> <chr>            <chr>              <chr>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6

Another common task after reading in data is to consider variable types. For example, meal_type is a categorical variable with a known set of possible values. In R, factors can be used to work with categorical variables. We can convert this variable to a factor using the factor() function. You’ll learn more about factors in Chapter 18.

students <- students |>
  clean_names() |>
  mutate(meal_plan = factor(meal_plan))

students
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>        <dbl> <chr>            <chr>              <fct>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6

Note that the values in the meal_type variable has stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (<chr>) to factor (<fct>).

Before you move on to analyzing these data, you’ll probably want to fix the age column as well: currently it’s a character variable because of the one observation that is typed out as five instead of a numeric 5. We discuss the details of fixing this issue in Chapter 23 in further detail.

8.3.2 Compared to base R

If you’ve used R before, you might wonder why we’re not using read.csv(). There are a few good reasons to favor readr functions over the base equivalents:

  • They are typically much faster (~10x) than their base equivalents. Long running jobs have a progress bar, so you can see what’s happening. If you’re looking for raw speed, try data.table::fread(). It doesn’t fit quite so well into the tidyverse, but it can be quite a bit faster.

  • They produce tibbles, and they don’t use row names or munge the column names. These are common sources of frustration with the base R functions.

  • They are more reproducible. Base R functions inherit some behavior from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.

8.3.3 Exercises

  1. What function would you use to read a file where fields were separated with “|”?

  2. Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

  3. What are the most important arguments to read_fwf()?

  4. Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By default, read_csv() assumes that the quoting character will be ". What argument to read_csv() do you need to specify to read the following text into a data frame?

    "x,y\n1,'a,b'"
  5. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

    read_csv("a,b\n1,2,3\n4,5,6")
    read_csv("a,b,c\n1,2\n1,2,3,4")
    read_csv("a,b\n\"1")
    read_csv("a,b\n1,2\na,b")
    read_csv("a;b\n1;3")

8.4 Reading data from multiple files

Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: 01-sales.csv for January, 02-sales.csv for February, and 03-sales.csv for March. With read_csv() you can read these data in at once and stack them on top of each other in a single data frame.

sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
#> Rows: 19 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): month
#> dbl (4): year, brand, item, n
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 19 × 6
#>   file              month    year brand  item     n
#>   <chr>             <chr>   <dbl> <dbl> <dbl> <dbl>
#> 1 data/01-sales.csv January  2019     1  1234     3
#> 2 data/01-sales.csv January  2019     1  8721     9
#> 3 data/01-sales.csv January  2019     1  1822     2
#> 4 data/01-sales.csv January  2019     2  3333     1
#> 5 data/01-sales.csv January  2019     2  2156     9
#> 6 data/01-sales.csv January  2019     2  3987     6
#> # … with 13 more rows

With the additional id parameter we have added a new column called file to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.

If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the dir_ls() function from the fs package to find the files for you by matching a pattern in the file names.

library(fs)
sales_files <- dir_ls("data", glob = "*sales.csv")
sales_files
#> data/01-sales.csv data/02-sales.csv data/03-sales.csv

8.5 Writing to a file

readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). Both functions increase the chances of the output file being read back in correctly by:

  • Always encoding strings in UTF-8.

  • Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.

If you want to export a csv file to Excel, use write_excel_csv() — this writes a special character (a “byte order mark”) at the start of the file which tells Excel that you’re using the UTF-8 encoding.

The most important arguments are x (the data frame to save), and file (the location to save it). You can also specify how missing values are written with na, and if you want to append to an existing file.

write_csv(students, "students.csv")

Now let’s read that csv file back in. Note that the type information is lost when you save to csv:

students
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>        <dbl> <chr>            <chr>              <fct>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6
write_csv(students, "students-2.csv")
read_csv("students-2.csv")
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>        <dbl> <chr>            <chr>              <chr>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6

This makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load in. There are two alternatives:

  1. write_rds() and read_rds() are uniform wrappers around the base functions readRDS() and saveRDS(). These store data in R’s custom binary format called RDS:

    write_rds(students, "students.rds")
    read_rds("students.rds")
    #> # A tibble: 6 × 5
    #>   student_id full_name        favourite_food     meal_plan           age  
    #>        <dbl> <chr>            <chr>              <fct>               <chr>
    #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
    #> 2          2 Barclay Lynn     French fries       Lunch only          5    
    #> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
    #> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
    #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
    #> 6          6 Güvenç Attila    Ice cream          Lunch only          6
  2. The feather package implements a fast binary file format that can be shared across programming languages:

    library(feather)
    write_feather(students, "students.feather")
    read_feather("students.feather")
    #> # A tibble: 6 × 5
    #>   student_id full_name        favourite_food     meal_plan             age
    #>        <dbl> <chr>            <chr>              <fct>               <dbl>
    #> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    #> 2          2 Barclay Lynn     French fries       Lunch only              5
    #> 3          3 Jayendra Lyne    NA                 Breakfast and lunch     7
    #> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    #> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    #> 6          6 Güvenç Attila    Ice cream          Lunch only              6

Feather tends to be faster than RDS and is usable outside of R. RDS supports list-columns (which you’ll learn about in Chapter 25; feather currently does not.


  1. The janitor package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that uses |>.↩︎