Following this lecture, you should be able to:
Install and load R packages from CRAN
Understand aesthetics, geometries, and stats, and the relationships among them
Write {ggplot2} code to visualize the distributions of both categorical and numerical variables
{ggplot2}?
“R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile.” ~ Hadley Wickham (creator of {ggplot2})
Used by professionals
Consistent with other tidyverse packages
Based on a consistent “grammar”
{ggplot2}?
{ggplot2} is a package
Packages are documented collections of (mostly) functions
Packages can be installed from CRAN (Comprehensive R Archive Network)
Tip: Forgetting to load packages
Forgetting to load a package (using library()) is a classic mistake that we all make from time to time. Be on the lookout for error messages like Error...could not find function <function name>. Check to make sure you spelled the name of the function correctly, and then check to make sure that you didn’t forget to load the package that the function comes from!
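A minimal sketch of the install-then-load workflow described above. The install line is commented out because it only needs to run once per machine and requires an internet connection, while library() has to run in every new R session.

```r
# Install once, from the console. This downloads the package from CRAN.
# install.packages("ggplot2")

# Load at the start of every session so the package's functions can be found.
library(ggplot2)

# With the package loaded, R can find ggplot(). Without the library() call,
# using ggplot() would fail with a "could not find function" error.
exists("ggplot")
```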
To be completed in Posit Cloud using the “In-class exercises: Plotting basics” project.
{ggplot2} and {palmerpenguins} packages using the console.
344 observations (rows)
8 variables (columns)
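Assuming the counts above describe the penguins data frame from {palmerpenguins}, they can be checked directly once the package is loaded:

```r
library(palmerpenguins)

nrow(penguins)   # number of observations (rows): 344
ncol(penguins)   # number of variables (columns): 8
names(penguins)  # the variable names available for plotting
```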
All ggplots start with ggplot()
Key arguments: data= and mapping=
We give ggplot() the dataset we want to visualize
ggplot() doesn’t know which variable(s) to plot
Map variables from the data to plot aesthetics
Aesthetics are the visual properties of the plot
Data can be plotted in many different ways
We need to tell R which plotting geometry we want
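The three steps above (data, aesthetic mappings, geometry) can be sketched as follows. The choice of bill_depth_mm and body_mass_g, and the object names p_blank, p_mapped, and p_points, are illustrative assumptions rather than part of the lecture.

```r
library(ggplot2)
library(palmerpenguins)

# Step 1: data only. ggplot() draws an empty canvas because nothing is mapped yet.
p_blank <- ggplot(data = penguins)

# Step 2: map variables to aesthetics. The axes appear, but no data are drawn.
p_mapped <- ggplot(data = penguins,
                   mapping = aes(x = bill_depth_mm, y = body_mass_g))

# Step 3: add a geometry. geom_point() draws one point per penguin.
p_points <- p_mapped + geom_point()

p_points  # rows with missing measurements are dropped with a warning
```

Layers are joined with +, so swapping geom_point() for another geometry changes how the same mapped data are drawn.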
To be completed in Posit Cloud using the “In-class exercises: Plotting basics” project.
bill_length_mm mapped to the y-axis
flipper_length_mm mapped to the x-axis and species mapped to the y-axis
body_mass_g mapped to the x-axis and species mapped to the color aesthetic
Categorical variables are those that can only take on a finite set of values
In the penguins dataset, species is a categorical variable
We can plot the distribution of the species variable using a bar plot
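A minimal sketch of such a bar plot, assuming {ggplot2} and {palmerpenguins} are installed (the object name p_bar is only for illustration):

```r
library(ggplot2)
library(palmerpenguins)

# One bar per species; bar height is the number of penguins of that species
p_bar <- ggplot(data = penguins,
                mapping = aes(x = species)) +
  geom_bar()

p_bar
```

Only the x aesthetic is mapped here; the bar heights are counted by the geometry itself.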
Tip: Sorting categorical variables (factors)
The fct_infreq() function from {forcats} can be called on mapped variables to order them by frequency (count). For example, in a bar plot of species, writing the mapping as mapping = aes(x = fct_infreq(species)) will produce a bar plot with bars in descending order.
Numerical variables are those that take on numeric values, and can be continuous or discrete
In the penguins dataset, bill_length_mm is a continuous numerical variable
Histograms are one method used to plot the distribution of a continuous numerical variable
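A minimal histogram sketch for bill_length_mm. The bin width of 1 (mm) is an arbitrary starting value chosen for illustration, not a recommendation from the lecture.

```r
library(ggplot2)
library(palmerpenguins)

# Each bar counts the penguins whose bill length falls within a 1 mm wide bin
p_hist <- ggplot(data = penguins,
                 mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 1)

p_hist  # penguins with a missing bill length are dropped with a warning
```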
Tip: Setting a bin width
When plotting a histogram, make sure to play around with different binwidth= values (an argument in geom_histogram(), not ggplot()). Try to choose a bin width that shows the pattern of the data well without being too “noisy.”
Note: Density curve bandwidth
The specific shape of a density curve is controlled, in part, by the bandwidth, which is set with the bw= argument in geom_density(). Unlike binwidth= in geom_histogram(), we won’t play around with this argument because the default value is suitable for our purposes (i.e., quick and dirty plotting).
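A minimal density curve sketch for the same variable, leaving the bandwidth at its default. The object name p_density is an assumption for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# A smoothed estimate of the distribution of bill length
p_density <- ggplot(data = penguins,
                    mapping = aes(x = bill_length_mm)) +
  geom_density()

p_density  # missing values are dropped with a warning
```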
The aesthetics (position, shape, color, etc.) of a plot do not need to be mapped to variables in the dataset
We can define constant aesthetics outside of the mapping= argument, within the geometry we want to modify
Say we wanted to color the density curve green, and fill its shape with lightgreen
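One way to sketch this, assuming the density curve is drawn for bill_length_mm (the variable choice and the object name p_green are illustrative):

```r
library(ggplot2)
library(palmerpenguins)

# color= and fill= are set as constants inside geom_density(), outside aes(),
# so they apply to the whole curve rather than being mapped to a variable
p_green <- ggplot(data = penguins,
                  mapping = aes(x = bill_length_mm)) +
  geom_density(color = "green", fill = "lightgreen")

p_green
```

Because the values sit outside aes(), ggplot2 treats them as literal colors and no legend is drawn for them.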
To be completed in Posit Cloud using the “In-class exercises: Plotting basics” project.
Tip: Color vs. fill aesthetics
In two-dimensional geometries (like bar plots and histograms), the color= aesthetic sets the outline color, and the fill= aesthetic sets the fill color.
island variable.
bill_depth_mm with a bin width of 10. Is this an appropriate bin width?
year variable. This is technically a numerical variable, but why might it make more sense to use a bar plot instead of a histogram or density curve?
flipper_length_mm variable that is filled sky blue with a gray outline. What am I doing wrong? It’s all red. In a new code cell, write some code that will actually accomplish what I’m trying to do here.
When plotting distributions, we tend to only map the x-axis aesthetic
Where do all of our distributions get their y-axis from?
Each distribution geometry calls a statistical transformation (stat) function in the background
geom_bar() calls stat_count()
geom_histogram() calls stat_bin()
geom_density() calls stat_density()
Big idea: Geometries and stats
All geometries have a default stat, and all stats have a default geometry. Sometimes, the default stat for a geometry is stat_identity(), which leaves the data unchanged.
Tip: Finding default stats
If you’re ever unsure of a geometry’s default stat, call up the help documentation for the corresponding geom_ function (for example, ?geom_bar) and look at the default string value of the stat= argument.
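The pairing between geometries and stats can be sketched by drawing the same bar plot two ways and then inspecting the values the stat computed. The object names p_geom and p_stat are assumptions for illustration; layer_data() is a {ggplot2} helper that returns a layer's data after the stat has run.

```r
library(ggplot2)
library(palmerpenguins)

# A geometry with its default stat (stat_count) left implicit ...
p_geom <- ggplot(data = penguins, mapping = aes(x = species)) +
  geom_bar()

# ... and the same plot written stat-first, with the geometry named explicitly
p_stat <- ggplot(data = penguins, mapping = aes(x = species)) +
  stat_count(geom = "bar")

# The count column computed by the stat is what ends up on the y-axis
layer_data(p_geom)$count
layer_data(p_stat)$count
```

Both plots should look identical, which is one way to see where the y-axis of a distribution plot comes from.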
To be completed in Posit Cloud using the “In-class exercises: Plotting basics” project.
DSC 210 Data Wrangling