Importing from Text and Excel

Sam Mason

Learning goals

Following this lecture, you should be able to:

  1. Report the absolute and relative paths of a file

  2. Import data from delimited text files and excel files stored locally and online

Packages

library(tidyverse)
  • The {tibble} package will be used to create toy datasets

  • The {readr} package will be used to import data from flat files

Data import

  • Data import: the process of getting the data into R and RStudio

    • From files (.csv, .tsv, .xlxs)

    • From web APIs

    • From web scraping

    • From relational databases (SQL)

    • From parquet files (“big data”; Apache Arrow)

Common data file formats

  • Most of the flat, rectangular datasets that you’ll come across in the wild use only a few different file formats

Delimited text files

  • Comma-separated values (.csv)

  • Tab-separated values (.tsv)

Excel files

  • Built on XML framework (a hierarchical tree scheme)

  • Requires a separate R package ({readxl})

Local file paths

  • Every file on your computer (and online) has a unique address called a path
  • Local files paths are made up of three parts:
    1. Start at the drive (C:\ in Windows, / in macOS and Linux)
    2. A directory (sequence of nested folders)
    3. A file name and extension (e.g., .csv, .xlsx, etc.)
  • In R we always treat file paths as strings
    • "/Users/sam.mason/Desktop/file.txt"

Absolute paths

  • A file’s full address (from drive to file name and extension) is called its absolute path

  • "/Users/sam.mason/Documents/Teaching/data-wrangling/dsc210-slides/10_file_import/file.txt"

  • Absolute paths will always get you to the file you’re after, but they’re tedious and prone to spelling errors

Tip: File paths in Windows

File paths in Windows generally use the backslash “\” path delimeter instead of the forward slash “/”. In R (and most other programming languages) the backslash character has a very special meaning (more on that when talk about regular expressions) when used within a string. Windows users will find that they may need to specify paths using double backslahes to avoid issues (e.g. "C:\\Users\\Sam").

Relative paths

  • We can define a more compact file address by giving the path relative to the current working directory

  • When you start a fresh R session the working directory is (by default) set to "C:\Users\<username>" in Windows and "/Users/<username>" in macOS. This is called the “home” directory ("~")

  • Once you save your script/notebook, the working directory changes to the folder that you saved to

getwd()
[1] "/Users/sam.mason/Documents/teaching/data_wrangling/dsc210-slides/10_file_import"
  • The entirety of the current working directory can represented in a path using "."
list.files(path = ".")
[1] "_extensions"              "10_data"                 
[3] "10_file_import_files"     "10_file_import.html"     
[5] "10_file_import.qmd"       "10_file_import.rmarkdown"
[7] "10_images"               

In-class exercises

To be completed in Posit Cloud using the “In-class exercises: Data import” project.

Note: Posit Cloud

When we use Posit Cloud in our browsers, we’re really just logging into a computer (called a “server”) somewhere that runs RStudio. All of the Posit Cloud servers are Linux machines, so the drive is given by “/”.

  1. Posit Cloud plops you in a folder called “project”. What is the absolute path of the “project” folder?
  2. What is the working directory of this Posit Cloud project?
  3. Give the relative path of the “squirrel_census.csv” file
  4. Find the file called “file.txt” and report its relative path
  1. “/cloud/project”
  2. “/cloud/project”
  3. “./data/squirrel_census.csv”
  4. “./top_sneaky/do_not_enter/empty_folder/you_saw_through_my_bluff/file.txt”

Structure of .csv files

  • .csv files use commas (“,”) to delimit columns, and end-of-line escapes (“\n”) to denote rows

Reading in .csv files

  • We read .csv files into R using readr::read_csv()

The raw data

Importing into R

read_csv(file = "./10_data/example_1.csv")
# A tibble: 5 × 4
  col_a col_b col_c col_d
  <dbl> <chr> <dbl> <lgl>
1     1 a       1.2 TRUE 
2     2 b       3.4 FALSE
3     3 c       5.6 NA   
4     4 d       7.8 TRUE 
5     5 e       9   FALSE
  • R first sees all columns as characters and then parses each into more appropriate vector types

Tip: Importing as tibble

The readr::read_csv() function will always create a tibble. The almost identically named base::read.csv() will create a base R data frame object instead.

Controlling column parsing

  • The read_csv() function usually does a great job at parsing columns properly, but sometimes we want more control

read_csv("./10_data/example_2.csv")
# A tibble: 5 × 4
  region country        year gdp           
  <lgl>  <chr>         <dbl> <chr>         
1 NA     Canada         2022 $2,221,218.00 
2 NA     Cuba           2020 $107,352.00   
3 NA     Greenland      2020 $3,130.00     
4 NA     Mexico         2022 $1,322,740.00 
5 NA     United States  2022 $25,346,805.00
read_csv("./10_data/example_2.csv",
         col_types = list(
           year = col_integer(),
           gdp = col_number()),
         na = "")
1
Manually set column types using a named list
2
The year column should be parsed as an integer
3
The gdp column should be pasted as a numeric
4
Convert only empty cells to NA
# A tibble: 5 × 4
  region country        year      gdp
  <chr>  <chr>         <int>    <dbl>
1 NA     Canada         2022  2221218
2 NA     Cuba           2020   107352
3 NA     Greenland      2020     3130
4 NA     Mexico         2022  1322740
5 NA     United States  2022 25346805

Dealing with metadata lines

  • Let’s read in an unfamiliar .csv file and hope for the best!
read_csv("./10_data/example_3.csv") |> print(width = 75)
# A tibble: 9 × 5
  `title:`       `“Highest-grossing films”`               ...3  ...4  ...5 
  <chr>          <chr>                                    <chr> <chr> <chr>
1 source:        "\"https://en.wikipedia.org/wiki/List_o… <NA>  <NA>  <NA> 
2 date accessed: "2022-09-27"                             <NA>  <NA>  <NA> 
3 <NA>            <NA>                                    <NA>  <NA>  <NA> 
4 rank           "peak"                                   title gross year 
5 1              "1"                                      Avat… $2,8… 2009 
6 2              "1"                                      Aven… $2,7… 2019 
7 3              "1"                                      Tita… $2,1… 1997 
8 4              "3"                                      Star… $2,0… 2015 
9 5              "4"                                      Aven… $2,0… 2018 

read_csv("./10_data/example_3.csv",
         col_types = list(
           gross = col_number()),
         skip = 4) # skip the first four lines
# A tibble: 5 × 5
   rank  peak title                             gross  year
  <dbl> <dbl> <chr>                             <dbl> <dbl>
1     1     1 Avatar                       2877897339  2009
2     2     1 Avengers: Endgame            2797501328  2019
3     3     1 Titanic                      2187535296  1997
4     4     3 Star Wars: The Force Awakens 2068223624  2015
5     5     4 Avengers: Infinity War       2048359754  2018

Specifying column names

  • Sometimes .csv files don’t include column names, or include bad column names that we want to change
read_csv("./10_data/example_4.csv",
         col_names = FALSE) |> # to prevent values from becoming names
  print(n = 5)
# A tibble: 10 × 7
  X1               X2           X3    X4          X5    X6     X7
  <chr>            <chr>        <chr> <chr>    <dbl> <dbl>  <dbl>
1 10/10/1949 20:30 san marcos   tx    cylinder  2700  29.9  -97.9
2 10/10/1949 21:00 lackland afb tx    light     7200  29.4  -98.6
3 10/10/1956 21:00 edna         tx    circle      20  29.0  -96.6
4 10/10/1960 20:00 kaneohe      hi    light      900  21.4 -158. 
5 10/10/1961 19:00 bristol      tn    sphere     300  36.6  -82.2
# ℹ 5 more rows
read_csv("./10_data/example_4.csv",
         col_names = c("date_time", "city", "state", "shape",
                       "duration_s", "latitude", "longitude")) |>
  print(n = 5)
# A tibble: 10 × 7
  date_time        city         state shape    duration_s latitude longitude
  <chr>            <chr>        <chr> <chr>         <dbl>    <dbl>     <dbl>
1 10/10/1949 20:30 san marcos   tx    cylinder       2700     29.9     -97.9
2 10/10/1949 21:00 lackland afb tx    light          7200     29.4     -98.6
3 10/10/1956 21:00 edna         tx    circle           20     29.0     -96.6
4 10/10/1960 20:00 kaneohe      hi    light           900     21.4    -158. 
5 10/10/1961 19:00 bristol      tn    sphere          300     36.6     -82.2
# ℹ 5 more rows

Importing from a URL

  • A URL gives the path of a file located on a remote server
read_csv("https://raw.githubusercontent.com/hadley/data-baby-names/master/baby-names.csv")
# A tibble: 258,000 × 4
    year name    percent sex  
   <dbl> <chr>     <dbl> <chr>
 1  1880 John     0.0815 boy  
 2  1880 William  0.0805 boy  
 3  1880 James    0.0501 boy  
 4  1880 Charles  0.0452 boy  
 5  1880 George   0.0433 boy  
 6  1880 Frank    0.0274 boy  
 7  1880 Joseph   0.0222 boy  
 8  1880 Thomas   0.0214 boy  
 9  1880 Henry    0.0206 boy  
10  1880 Robert   0.0204 boy  
# ℹ 257,990 more rows

Reading in .tsv files

  • Data files using the .tsv format have columns delimited by tabs
read_tsv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-24/christmas_lyrics.tsv") |>
  select(title, line, lyric)
# A tibble: 7,525 × 3
   title             line lyric                                     
   <chr>            <dbl> <chr>                                     
 1 JINGLE BELL ROCK     1 Jingle bell, jingle bell, jingle bell rock
 2 JINGLE BELL ROCK     2 Jingle bells swing and jingle bells ring  
 3 JINGLE BELL ROCK     3 Snowin' and blowin' up bushels of fun     
 4 JINGLE BELL ROCK     4 Now, the jingle hop has begun             
 5 JINGLE BELL ROCK     5 Jingle bell, jingle bell, jingle bell rock
 6 JINGLE BELL ROCK     6 Jingle bells chime in jingle bell time    
 7 JINGLE BELL ROCK     7 Dancin' and prancin' in Jingle Bell Square
 8 JINGLE BELL ROCK     8 In the frosty air                         
 9 JINGLE BELL ROCK     9 What a bright time, it's the right time   
10 JINGLE BELL ROCK    10 To rock the night away                    
# ℹ 7,515 more rows

In-class exercises

To be completed in Posit Cloud using the “In-class exercises: Data import” project.

  1. Go to the Tidy Tuesday GitHub repository
  2. Open the “data” folder
  3. Find a dataset, organized into folder by year and week
  4. Open the dataset by clicking on it
  5. Click the “Raw” button in the upper right-hand corner of the data viewer
  6. Copy the URL
  7. Import the data you found on GitHub into the Posit Cloud project using the URL

An example using the data from “tidytuesday/data/2024-05-28/planting_2021.csv”

plantings <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/refs/heads/master/data/2024/2024-05-28/planting_2021.csv")

Reading files with other delimiters

  • We use read_delim() to handle all other delimited text files

    • All previously discussed arguments behave the same

    • Must define the delim = argument

    • Possible delimiters include ;, :, |, and ~

RStudio’s data import GUI

A really helpful tool for loading in unfamiliar datasets one at a time

Reading in .xlsx files

  • We use read_excel() from the {readxl} package to import .xls and .xlsx files

    • All previously discussed arguments behave the same

    • The sheet = argument is an integer or string that specifies the sheet to read in

    • In research, the first sheet of an excel file usually contains metadata

Writing data frames to file

  • We use write_csv() and write_delim() from the {readr} package to create local delimited text files from data frames
library(palmerpenguins)
write_csv(x = penguins, # the R data object
          file = "/Users/sam.mason/Desktop/penguins.csv")

write_delim(x = penguins,
            file = "/Users/sam.mason/Desktop/penguins.txt",
            delim = "|")
  • Think carefully about your choice of delimiter

    • Don’t use , if your values are sentences

    • Don’t use : if your values are times