Following this lecture, you should be able to:
Report the absolute and relative paths of a file
Import data from delimited text files and excel files stored locally and online
The {tibble}
package will be used to create toy datasets
The {readr}
package will be used to import data from flat files
Data import: the process of getting the data into R and RStudio
From files (.csv, .tsv, .xlxs)
From web APIs
From web scraping
From relational databases (SQL)
From parquet files (“big data”; Apache Arrow)
Delimited text files
Comma-separated values (.csv)
Tab-separated values (.tsv)
Excel files
Built on XML framework (a hierarchical tree scheme)
Requires a separate R package ({readxl}
)
C:\
in Windows, /
in macOS and Linux)"/Users/sam.mason/Desktop/file.txt"
A file’s full address (from drive to file name and extension) is called its absolute path
"/Users/sam.mason/Documents/Teaching/data-wrangling/dsc210-slides/10_file_import/file.txt"
Absolute paths will always get you to the file you’re after, but they’re tedious and prone to spelling errors
Tip: File paths in Windows
File paths in Windows generally use the backslash “\” path delimeter instead of the forward slash “/”. In R (and most other programming languages) the backslash character has a very special meaning (more on that when talk about regular expressions) when used within a string. Windows users will find that they may need to specify paths using double backslahes to avoid issues (e.g. "C:\\Users\\Sam"
).
We can define a more compact file address by giving the path relative to the current working directory
When you start a fresh R session the working directory is (by default) set to "C:\Users\<username>"
in Windows and "/Users/<username>"
in macOS. This is called the “home” directory ("~"
)
Once you save your script/notebook, the working directory changes to the folder that you saved to
"."
To be completed in Posit Cloud using the “In-class exercises: Data import” project.
Note: Posit Cloud
When we use Posit Cloud in our browsers, we’re really just logging into a computer (called a “server”) somewhere that runs RStudio. All of the Posit Cloud servers are Linux machines, so the drive is given by “/”.
readr::read_csv()
Tip: Importing as tibble
The readr::read_csv()
function will always create a tibble. The almost identically named base::read.csv()
will create a base R data frame object instead.
read_csv()
function usually does a great job at parsing columns properly, but sometimes we want more controlread_csv("./10_data/example_2.csv",
col_types = list(
year = col_integer(),
gdp = col_number()),
na = "")
NA
# A tibble: 5 × 4
region country year gdp
<chr> <chr> <int> <dbl>
1 NA Canada 2022 2221218
2 NA Cuba 2020 107352
3 NA Greenland 2020 3130
4 NA Mexico 2022 1322740
5 NA United States 2022 25346805
# A tibble: 9 × 5
`title:` `“Highest-grossing films”` ...3 ...4 ...5
<chr> <chr> <chr> <chr> <chr>
1 source: "\"https://en.wikipedia.org/wiki/List_o… <NA> <NA> <NA>
2 date accessed: "2022-09-27" <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA>
4 rank "peak" title gross year
5 1 "1" Avat… $2,8… 2009
6 2 "1" Aven… $2,7… 2019
7 3 "1" Tita… $2,1… 1997
8 4 "3" Star… $2,0… 2015
9 5 "4" Aven… $2,0… 2018
read_csv("./10_data/example_3.csv",
col_types = list(
gross = col_number()),
skip = 4) # skip the first four lines
# A tibble: 5 × 5
rank peak title gross year
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 1 Avatar 2877897339 2009
2 2 1 Avengers: Endgame 2797501328 2019
3 3 1 Titanic 2187535296 1997
4 4 3 Star Wars: The Force Awakens 2068223624 2015
5 5 4 Avengers: Infinity War 2048359754 2018
read_csv("./10_data/example_4.csv",
col_names = FALSE) |> # to prevent values from becoming names
print(n = 5)
# A tibble: 10 × 7
X1 X2 X3 X4 X5 X6 X7
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 10/10/1949 20:30 san marcos tx cylinder 2700 29.9 -97.9
2 10/10/1949 21:00 lackland afb tx light 7200 29.4 -98.6
3 10/10/1956 21:00 edna tx circle 20 29.0 -96.6
4 10/10/1960 20:00 kaneohe hi light 900 21.4 -158.
5 10/10/1961 19:00 bristol tn sphere 300 36.6 -82.2
# ℹ 5 more rows
read_csv("./10_data/example_4.csv",
col_names = c("date_time", "city", "state", "shape",
"duration_s", "latitude", "longitude")) |>
print(n = 5)
# A tibble: 10 × 7
date_time city state shape duration_s latitude longitude
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 10/10/1949 20:30 san marcos tx cylinder 2700 29.9 -97.9
2 10/10/1949 21:00 lackland afb tx light 7200 29.4 -98.6
3 10/10/1956 21:00 edna tx circle 20 29.0 -96.6
4 10/10/1960 20:00 kaneohe hi light 900 21.4 -158.
5 10/10/1961 19:00 bristol tn sphere 300 36.6 -82.2
# ℹ 5 more rows
# A tibble: 258,000 × 4
year name percent sex
<dbl> <chr> <dbl> <chr>
1 1880 John 0.0815 boy
2 1880 William 0.0805 boy
3 1880 James 0.0501 boy
4 1880 Charles 0.0452 boy
5 1880 George 0.0433 boy
6 1880 Frank 0.0274 boy
7 1880 Joseph 0.0222 boy
8 1880 Thomas 0.0214 boy
9 1880 Henry 0.0206 boy
10 1880 Robert 0.0204 boy
# ℹ 257,990 more rows
read_tsv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-24/christmas_lyrics.tsv") |>
select(title, line, lyric)
# A tibble: 7,525 × 3
title line lyric
<chr> <dbl> <chr>
1 JINGLE BELL ROCK 1 Jingle bell, jingle bell, jingle bell rock
2 JINGLE BELL ROCK 2 Jingle bells swing and jingle bells ring
3 JINGLE BELL ROCK 3 Snowin' and blowin' up bushels of fun
4 JINGLE BELL ROCK 4 Now, the jingle hop has begun
5 JINGLE BELL ROCK 5 Jingle bell, jingle bell, jingle bell rock
6 JINGLE BELL ROCK 6 Jingle bells chime in jingle bell time
7 JINGLE BELL ROCK 7 Dancin' and prancin' in Jingle Bell Square
8 JINGLE BELL ROCK 8 In the frosty air
9 JINGLE BELL ROCK 9 What a bright time, it's the right time
10 JINGLE BELL ROCK 10 To rock the night away
# ℹ 7,515 more rows
To be completed in Posit Cloud using the “In-class exercises: Data import” project.
We use read_delim()
to handle all other delimited text files
All previously discussed arguments behave the same
Must define the delim =
argument
Possible delimiters include ;
, :
, |
, and ~
A really helpful tool for loading in unfamiliar datasets one at a time
We use read_excel()
from the {readxl}
package to import .xls and .xlsx files
All previously discussed arguments behave the same
The sheet =
argument is an integer or string that specifies the sheet to read in
In research, the first sheet of an excel file usually contains metadata
write_csv()
and write_delim()
from the {readr}
package to create local delimited text files from data framesThink carefully about your choice of delimiter
Don’t use ,
if your values are sentences
Don’t use :
if your values are times
DSC 210 Data Wrangling