Plotting Basics and Distributions

Sam Mason

Learning goals

Following this lecture, you should be able to:

Install and load R packages from CRAN
Understand aesthetics, geometries, and stats, and the relationships among them
Write {ggplot2} code to visualize the distributions of both categorical and numerical variables

Why `{ggplot2}`?

“R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile.” ~ Hadley Wickham (creator of {ggplot2})

Used by professionals
Consistent with other tidyverse packages
Based on a consistent “grammar”

What is `{ggplot2}`?

{ggplot2} is a package
Packages are documented collections of (mostly) functions
Packages can be installed from CRAN (Comprehensive R Archive Network)

Tip: Forgetting to load packages

Forgetting to load a package (using library()) is a classic mistake that we all make from time to time. Be on the lookout for error messages like Error...could not find function <function name>. Check that you spelled the name of the function correctly, and then check that you didn’t forget to load the package that the function comes from!
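As a rough sketch of what this error looks like, the snippet below calls a function that R cannot find and captures the resulting message. The name not_a_loaded_function is hypothetical and simply stands in for any function whose package has not been loaded; tryCatch() is used only so the message can be printed without halting the script.

```r
# Hypothetical function name: stands in for, e.g., ggplot() before library(ggplot2) is run
msg <- tryCatch(
  not_a_loaded_function(1),
  error = function(e) conditionMessage(e)  # capture the error message as a string
)

# The message names the function that R could not find
print(msg)
```

If the function name is spelled correctly, the usual fix is a library() call near the top of the notebook.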

In-class exercises

To be completed in Posit Cloud using the “In-class exercises: Plotting basics” project.

Exercises
Solutions

1. Install the {ggplot2} and {palmerpenguins} packages using the console.
2. Create a new section header by placing your cursor on a new line and typing “##” with a space afterward. Name this section “Installing and loading packages”
3. Create a new code cell (option+command+I in macOS; Ctrl+Alt+I in Windows)
4. Write some code to load in both packages

Exercise 1 solution

# to be completed in the console, not in the Quarto notebook
install.packages("ggplot2") # don't forget the quotes
install.packages("palmerpenguins")

Exercise 4 solution

library(ggplot2) # no quotes when you load it in
library(palmerpenguins)

Palmer penguins

344 observations (rows)
8 variables (columns)

Creating a plot object

All ggplots start with ggplot()
Key arguments: data= and mapping=
We give ggplot() the dataset we want to visualize

ggplot() doesn’t know which variable(s) to plot
Map variables from the data to plot aesthetics
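The two steps above might look something like the following sketch. The choice of body_mass_g as the mapped variable is only for illustration, and the object names p_blank and p_mapped are made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Step 1: give ggplot() the dataset only; this renders as a blank canvas
p_blank <- ggplot(data = penguins)

# Step 2: map a variable to the x-axis aesthetic; an axis appears,
# but nothing is drawn yet because no geometry has been added
p_mapped <- ggplot(data = penguins,
                   mapping = aes(x = body_mass_g))

p_blank
p_mapped
```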

Mapping variables to aesthetics

Aesthetics are the visual properties of the plot
Data can be plotted in many different ways
We need to tell R which plotting geometry we want

Plotting geometries

Geometries define the specific type of plot (e.g., scatter, bar, histogram, etc.)
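To illustrate, here is a small sketch in which the same data and mapping are drawn with two different geometries. The variable body_mass_g and the object names are chosen only for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Same data and mapping, two different geometries
p_hist <- ggplot(data = penguins,
                 mapping = aes(x = body_mass_g)) +
  geom_histogram()   # distribution drawn as binned bars

p_dens <- ggplot(data = penguins,
                 mapping = aes(x = body_mass_g)) +
  geom_density()     # distribution drawn as a smooth curve

p_hist
p_dens
```

Geometries are added to the plot object with +, which is why each ggplot() call ends with a + before the geom_ function.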

In-class exercises

To be completed in Posit Cloud using the “In-class exercises: Plotting basics” project.

Exercises
Solutions

5. Create a new section header and call it “Aesthetics and geometries”
6. Create each plot in its own code cell
   (a) A histogram with bill_length_mm mapped to the y-axis
   (b) A box plot with flipper_length_mm mapped to the x-axis and species mapped to the y-axis
   (c) A density curve with body_mass_g mapped to the x-axis and species mapped to the color aesthetic

Exercise 6 solutions

Part (a)

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm)) +
  geom_histogram()

Part (b)

ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm,
                     y = species)) +
  geom_boxplot()

Part (c)

ggplot(data = penguins,
       mapping = aes(x = body_mass_g,
                     color = species)) +
  geom_density()

More aesthetics to know

Categorical distributions

Categorical variables are those that can only take on a finite set of values
In the penguins dataset, species is a categorical variable
We can plot the distribution of the species variable using a bar plot
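A bar plot of species could be written along these lines. This is a minimal sketch and not necessarily the exact code from the slide; the object name p_species is made up for the example.

```r
library(ggplot2)
library(palmerpenguins)

p_species <- ggplot(data = penguins,
                    mapping = aes(x = species)) +
  geom_bar()  # one bar per species; bar height is the number of rows

p_species
```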

Tip: Sorting categorical variables (factors)

The fct_infreq() function from {forcats} can be called on mapped variables to order them by frequency (count). For example, in the code above, rewriting line 2 as mapping = aes(x = fct_infreq(species))) + will produce a bar plot with bars in descending order.
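A sketch of the reordered bar plot is shown here. It assumes {forcats} has been installed (it is not one of the packages installed in the earlier exercise), and the object names are made up for the example.

```r
library(ggplot2)
library(forcats)         # provides fct_infreq(); assumed to be installed
library(palmerpenguins)

# fct_infreq() reorders factor levels from most to least frequent
species_by_count <- fct_infreq(penguins$species)
levels(species_by_count)

# Wrapping the mapped variable reorders the bars in the plot
p_sorted <- ggplot(data = penguins,
                   mapping = aes(x = fct_infreq(species))) +
  geom_bar()

p_sorted
```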

Numerical distributions

Numerical variables are those that take on number values and can be continuous or discrete
In the penguins dataset, bill_length_mm is a continuous numerical variable
Histograms are one method used to plot the distribution of a continuous numerical variable

Tip: Setting a bin width

When plotting a histogram, make sure to play around with different binwidth= values (an argument in geom_histogram(), not ggplot()). Try to choose a bin width that shows the pattern of the data well without being too “noisy.”
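As a quick sketch, a bin width can be set like this. The value of 1 (i.e., 1 mm per bin) is only a starting point for experimentation, not a recommendation from the lecture.

```r
library(ggplot2)
library(palmerpenguins)

p_bins <- ggplot(data = penguins,
                 mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 1)  # binwidth goes inside geom_histogram(), not ggplot()

p_bins
```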

Bin width sensitivity

The shape of a histogram can change meaningfully with small changes in bin width

Density curve geometries are less sensitive in this way
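One way to see this sensitivity is to draw the same variable with two different bin widths and compare. The values 0.5 and 5 below are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Narrow bins: many bars, more detail, more noise
p_narrow <- ggplot(data = penguins,
                   mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 0.5)

# Wide bins: few bars, smoother, but detail is lost
p_wide <- ggplot(data = penguins,
                 mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 5)

p_narrow
p_wide
```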

Density curves

Think of a density curve as a smooth histogram geometry

Note: Density curve bandwidth

The specific shape of a density curve is controlled, in part, by the bw= (bandwidth) argument in geom_density(). Unlike binwidth= in geom_histogram(), we won’t play around with this argument because the default value will always be suitable for our purposes (i.e., quick and dirty plotting).
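A density curve with the default bandwidth might be drawn as in this sketch; the variable bill_length_mm is chosen only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p_density <- ggplot(data = penguins,
                    mapping = aes(x = bill_length_mm)) +
  geom_density()  # default bandwidth; no extra arguments needed

p_density
```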

Unmapped aesthetics

The aesthetics (position, shape, color, etc.) of a plot do not need to be mapped to variables in the dataset
We can define constant aesthetics outside of the mapping= argument within the geometry we want to modify
Say we wanted to color the density curve green, and fill its shape with lightgreen
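The green density curve described above could be sketched as follows. The choice of bill_length_mm is an assumption made for this example; the key point is that color= and fill= sit inside geom_density() and outside aes().

```r
library(ggplot2)
library(palmerpenguins)

p_green <- ggplot(data = penguins,
                  mapping = aes(x = bill_length_mm)) +
  geom_density(color = "green",       # constant outline color
               fill = "lightgreen")   # constant fill color

p_green
```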

In-class exercises

To be completed in Posit Cloud using the “In-class exercises: Plotting basics” project.

Tip: Color vs. fill aesthetics

In two-dimensional geometries (like bar plots and histograms), the color= aesthetic sets the outline color, and the fill= aesthetic sets the fill color.
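For instance, a histogram with a black outline and a white fill might be written like this. The colors and the bin width are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p_outline <- ggplot(data = penguins,
                    mapping = aes(x = bill_length_mm)) +
  geom_histogram(color = "black",  # outline of each bar
                 fill = "white",   # inside of each bar
                 binwidth = 1)

p_outline
```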

Exercises
Solutions

7. Create a new section header and call it “Plotting distributions”
8. Create each plot in its own code cell
   (a) A distribution of the island variable.
   (b) A red histogram of bill_depth_mm with a bin width of 10. Is this an appropriate bin width?
   (c) A distribution of the year variable. This is technically a numerical variable, but why might it make more sense to use a bar plot instead of a histogram or density curve?
9. In the code below, I try to generate a density curve for the flipper_length_mm variable that is filled sky blue with a gray outline. What am I doing wrong? It’s all red. In a new code cell, write some code that will actually accomplish what I’m trying to do here.

ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm)) +
  geom_density(mapping = aes(color = "gray40", fill = "skyblue"))

Exercise 8 solutions

Part (a)

# island is categorical, so we'll use a bar plot
# fct_infreq() is optional; it requires library(forcats)
ggplot(data = penguins,
       mapping = aes(x = fct_infreq(island))) +
  geom_bar()

Part (b)

ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm)) +
  geom_histogram(fill = "red",
                 binwidth = 10)
# This bin width is waaay too large!
# A bin width around 1 would be appropriate

Part (c)

ggplot(data = penguins,
       mapping = aes(x = year)) +
  geom_bar()

# This is a tricky one. In theory, year, as a measurement of
# time, can be continuous, but in this dataset it is essentially
# categorical: it takes on only three different values.

Exercise 9 solution

# I want color and fill to be unmapped (constant) aesthetics, so
# they should be set outside of aes(), not inside the mapping argument.
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm)) +
  geom_density(color = "gray40", fill = "skyblue")

Statistical transformations

When plotting distributions, we tend to only map the x-axis aesthetic
Where do all of our distributions get their y-axis from?
Each distribution geometry calls a statistical transformation (stat) function in the background
- geom_bar() calls stat_count()
- geom_histogram() calls stat_bin()
- geom_density() calls stat_density()
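To see what a stat actually computes, one option (not covered in the lecture itself) is ggplot2's layer_data() function, which returns the data frame a layer ends up drawing. The sketch below compares the counts from stat_count() with counts tabulated directly from the data.

```r
library(ggplot2)
library(palmerpenguins)

p_island <- ggplot(data = penguins,
                   mapping = aes(x = island)) +
  geom_bar()

# Data frame produced by stat_count() for the bar layer
computed <- layer_data(p_island)
computed[, c("x", "count")]

# The same counts, tabulated directly from the dataset
table(penguins$island)
```

The count column is what ends up on the y-axis, even though only the x-axis aesthetic was mapped.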

Big idea: Geometries and stats

All geometries have a default stat, and all stats have a default geometry. Sometimes, the default stat for a geometry is stat_identity(), which leaves the data unchanged.

For these distribution geometries, the stats calculate the y-axes based on the variable mapped to the x-axis

Tip: Finding default stats

If you’re ever unsure of a geometry’s default stat, call up the help documentation for the corresponding geom_ function and look at the string value of the stat= argument.
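As an alternative to reading the help page, the default pairing can also be inspected programmatically. This is a sketch that relies on the layer objects returned by geom_ and stat_ functions storing their stat and geom in the $stat and $geom fields; it is not part of the original lecture.

```r
library(ggplot2)

# Default stat attached to each geometry (first class name of the stat object)
bar_stat   <- class(geom_bar()$stat)[1]
hist_stat  <- class(geom_histogram()$stat)[1]
dens_stat  <- class(geom_density()$stat)[1]
point_stat <- class(geom_point()$stat)[1]   # a geometry whose default stat is identity

# Default geometry attached to a stat
count_geom <- class(stat_count()$geom)[1]

c(bar_stat, hist_stat, dens_stat, point_stat, count_geom)
```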

In-class exercises

To be completed in Posit Cloud using the “In-class exercises: Plotting basics” project.

Exercise
Solution

In a new code cell, create a bar plot where island is mapped to the x-axis and sex is mapped to fill. Instead of using geom_bar(), use stat_count().

Exercise 10 solution

ggplot(data = penguins,
       mapping = aes(x = island, fill = sex)) +
  stat_count() # calling geom_bar() in the background
