Plotting Basics and Distributions

Sam Mason

Learning goals

Following this lecture, you should be able to:

  • Install and load R packages from CRAN

  • Understand aesthetics, geometries, and stats, and the relationships among them

  • Write {ggplot2} code to visualize the distributions of both categorical and numerical variables

Why {ggplot2}?

“R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile.” ~ Hadley Wickham (creator of {ggplot2})

  • Used by professionals

  • Consistent with other tidyverse packages

  • Based on a consistent “grammar”

What is {ggplot2}?

  • {ggplot2} is a package

  • Packages are documented collections of (mostly) functions

  • Packages can be installed from CRAN (Comprehensive R Archive Network)

Tip: Forgetting to load packages

Forgetting to load a package (using library()) is a classic mistake that we all make from time to time. Be on the lookout for error messages like Error...could not find function <function name>. Check that you spelled the name of the function correctly, and then check that you didn’t forget to load the package that the function comes from!

In-class exercises

To be completed in Posit Cloud using the “In-class exercises: Plotting basics” project.

  1. Install the {ggplot2} and {palmerpenguins} packages using the console.
  2. Create a new section header by placing your cursor on a new line and typing “##” with a space afterward. Name this section “Installing and loading packages”.
  3. Create a new code cell (option+command+I in macOS; Ctrl+Alt+I in Windows)
  4. Write some code to load in both packages

Exercise 1 solution

# to be completed in the console, not in the Quarto notebook
install.packages("ggplot2") # don't forget the quotes
install.packages("palmerpenguins")

Exercise 4 solution

library(ggplot2) # no quotes when you load it in
library(palmerpenguins)

Palmer penguins

  • 344 observations (rows)

  • 8 variables (columns)

Creating a plot object

  • All ggplots start with ggplot()

  • Key arguments: data= and mapping=

  • We give ggplot() the dataset we want to visualize

  • ggplot() doesn’t know which variable(s) to plot

  • Map variables from the data to plot aesthetics
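As a minimal sketch of these steps (assuming {ggplot2} and {palmerpenguins} are installed), the code below builds a plot object from the data alone and then adds a mapping. The object names p_empty and p_mapped are only for illustration. Neither plot draws any data yet, because no geometry has been added.

```r
library(ggplot2)
library(palmerpenguins)

# Data only: ggplot() draws a blank canvas because it has no variables to plot
p_empty <- ggplot(data = penguins)
p_empty

# Data + mapping: the x-axis now spans the range of bill_length_mm,
# but nothing is drawn inside the panel yet
p_mapped <- ggplot(data = penguins,
                   mapping = aes(x = bill_length_mm))
p_mapped
```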

Mapping variables to aesthetics

  • Aesthetics are the visual properties of the plot

  • Data can be plotted in many different ways

  • We need to tell R which plotting geometry we want

Plotting geometries

  • Geometries define the specific type of plot (e.g., scatter, bar, histogram, etc.)
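To illustrate how a geometry completes a plot, here is a small sketch that adds geom_point() to a data-plus-mapping call, which produces a scatter plot. The choice of flipper_length_mm and body_mass_g is just an example, and p_scatter is an illustrative name. Expect a warning about removed rows, since a few penguins have missing measurements.

```r
library(ggplot2)
library(palmerpenguins)

# data + mapping + geometry: geom_point() draws one point per penguin
p_scatter <- ggplot(data = penguins,
                    mapping = aes(x = flipper_length_mm,
                                  y = body_mass_g)) +
  geom_point()
p_scatter
```

Swapping geom_point() for a different geom_ function changes the type of plot without changing the data or the mapping.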

In-class exercises

To be completed in Posit Cloud using the “In-class exercises: Plotting basics” project.

  1. Create a new section header and call it “Aesthetics and geometries”
  2. Create each plot in its own code cell
    1. A histogram with bill_length_mm mapped to the y-axis
    2. A box plot with flipper_length_mm mapped to the x-axis and species mapped to the y-axis
    3. A density curve with body_mass_g mapped to the x-axis and species mapped to the color aesthetic

Exercise 6 solutions

Part (a)
ggplot(data = penguins,
       mapping = aes(y = bill_length_mm)) +
  geom_histogram()
Part (b)
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm,
                     y = species)) +
  geom_boxplot()
Part (c)
ggplot(data = penguins,
       mapping = aes(x = body_mass_g,
                     color = species)) +
  geom_density()

More aesthetics to know

Categorical distributions

  • Categorical variables are those that can only take on a finite set of values

  • In the penguins dataset, species is a categorical variable

  • We can plot the distribution of the species variable using a bar plot
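A bar plot of species might look like the sketch below (p_species is an illustrative name). Only the x-axis is mapped; geom_bar() works out the bar heights on its own.

```r
library(ggplot2)
library(palmerpenguins)

# species is categorical, so geom_bar() draws one bar per species,
# with bar height equal to the number of penguins of that species
p_species <- ggplot(data = penguins,
                    mapping = aes(x = species)) +
  geom_bar()
p_species
```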

Tip: Sorting categorical variables (factors)

The fct_infreq() function from {forcats} can be called on mapped variables to order them by frequency (count). For example, in a bar plot of species, writing the mapping as mapping = aes(x = fct_infreq(species)) will produce bars in descending order of count. Remember to load {forcats} with library(forcats) first.
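The sketch below shows what fct_infreq() does to the species factor before plotting it. It assumes {forcats} is installed in addition to {ggplot2} and {palmerpenguins}.

```r
library(ggplot2)
library(forcats)
library(palmerpenguins)

# fct_infreq() reorders the factor levels from most to least frequent
levels(penguins$species)              # original (alphabetical) order
levels(fct_infreq(penguins$species))  # ordered by count

# The bars now appear from tallest to shortest
ggplot(data = penguins,
       mapping = aes(x = fct_infreq(species))) +
  geom_bar()
```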

Numerical distributions

  • Numerical variables are those that take on numeric values, and they can be continuous or discrete

  • In the penguins dataset, bill_length_mm is a continuous numerical variable

  • Histograms are one method used to plot the distribution of a continuous numerical variable

Tip: Setting a bin width

When plotting a histogram, make sure to play around with different binwidth= values (an argument in geom_histogram(), not ggplot()). Try to choose a bin width that shows the pattern of the data well without being too “noisy.”
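As a quick sketch, the histogram below sets binwidth= inside geom_histogram(). The value of 1 (one millimeter per bin) is only a starting point for experimenting, and p_hist is an illustrative name.

```r
library(ggplot2)
library(palmerpenguins)

# binwidth= belongs to geom_histogram(); here each bin spans 1 mm
p_hist <- ggplot(data = penguins,
                 mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 1)
p_hist
```

R will warn that rows with missing bill lengths were removed; that is expected with this dataset.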

Bin width sensitivity

  • The shape of a histogram can change meaningfully with small changes in bin width

  • Density curve geometries are less sensitive in this way
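One way to see this sensitivity is to draw the same variable with two different bin widths. The values 0.5 and 5 below are arbitrary choices for illustration, as are the object names.

```r
library(ggplot2)
library(palmerpenguins)

p_base <- ggplot(data = penguins,
                 mapping = aes(x = bill_length_mm))

# Narrow bins: many bars, lots of detail, but "noisy"
p_narrow <- p_base + geom_histogram(binwidth = 0.5)
p_narrow

# Wide bins: few bars, smooth, but the shape of the data is hidden
p_wide <- p_base + geom_histogram(binwidth = 5)
p_wide
```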

Density curves

  • Think of a density curve as a smooth histogram geometry
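A density curve for the same variable can be sketched by swapping the geometry (p_density is an illustrative name):

```r
library(ggplot2)
library(palmerpenguins)

# Same data and mapping as a histogram, but drawn as a smooth curve
p_density <- ggplot(data = penguins,
                    mapping = aes(x = bill_length_mm)) +
  geom_density()
p_density
```

Note that the y-axis now shows density rather than count.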

Note: Density curve bandwidth

The specific shape of a density curve is controlled, in part, by the bandwidth, which is set with the bw= argument in geom_density(). Unlike binwidth= in geom_histogram(), we won’t play around with this argument because the default value will be suitable for our purposes (i.e., quick and dirty plotting).

Unmapped aesthetics

  • The aesthetics (position, shape, color, etc.) of a plot do not need to be mapped to variables in the dataset

  • We can define constant aesthetics outside of the mapping= argument within the geometry we want to modify

  • Say we wanted to color the density curve green, and fill its shape with lightgreen
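A sketch of that plot is shown below. The variable on the x-axis (bill_length_mm) is an assumption made for illustration; the point is that color= and fill= sit inside geom_density() and outside of aes().

```r
library(ggplot2)
library(palmerpenguins)

# color= and fill= are constants here, so they go outside of aes()
p_green <- ggplot(data = penguins,
                  mapping = aes(x = bill_length_mm)) +
  geom_density(color = "green", fill = "lightgreen")
p_green
```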

In-class exercises

To be completed in Posit Cloud using the “In-class exercises: Plotting basics” project.

Tip: Color vs. fill aesthetics

In two-dimensional geometries (like bar plots and histograms), the color= aesthetic sets the outline color, and the fill= aesthetic sets the fill color.

  1. Create a new section header and call it “Plotting distributions”
  2. Create each plot in its own code cell
    1. A distribution of the island variable.
    2. A red histogram of bill_depth_mm with a bin width of 10. Is this an appropriate bin width?
    3. A distribution of the year variable. This is technically a numerical variable, but why might it make more sense to use a bar plot instead of a histogram or density curve?
  3. In the code below, I try to generate a density curve for the flipper_length_mm variable that is filled sky blue with a gray outline. What am I doing wrong? It’s all red. In a new code cell, write some code that will actually accomplish what I’m trying to do here.
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm)) +
  geom_density(mapping = aes(color = "gray40", fill = "skyblue"))

Exercise 8 solutions

Part (a)
# island is categorical, so we'll use a bar plot
# fct_infreq() comes from {forcats}, so run library(forcats) first
ggplot(data = penguins,
       mapping = aes(x = fct_infreq(island))) + # optional: sort bars by count
  geom_bar()
Part (b)
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm)) +
  geom_histogram(fill = "red",
                 binwidth = 10)
# This bin width is waaay too large!
# A bin width around 1 would be appropriate
Part (c)
ggplot(data = penguins,
       mapping = aes(x = year)) +
  geom_bar()

# This is a tricky one. In theory, year, as a measurement of
# time, can be continuous, but in this dataset it is essentially
# categorical: it takes on only three values (2007, 2008, 2009).

Exercise 9 solution

# Color and fill should be constant (unmapped) aesthetics here, so set
# them directly in geom_density() instead of inside aes(). Inside aes(),
# "gray40" and "skyblue" are treated as data values, not as colors.
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm)) +
  geom_density(color = "gray40", fill = "skyblue")

Statistical transformations

  • When plotting distributions, we tend to only map the x-axis aesthetic

  • Where do all of our distributions get their y-axis from?

  • Each distribution geometry calls a statistical transformation (stat) function in the background

    • geom_bar() calls stat_count()

    • geom_histogram() calls stat_bin()

    • geom_density() calls stat_density()

Big idea: Geometries and stats

All geometries have a default stat, and all stats have a default geometry. Sometimes, the default stat for a geometry is stat_identity(), which leaves the data unchanged.

  • For these distribution geometries, the stats calculate the y-axes based on the variable mapped to the x-axis
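To see what a stat computes, one option is to inspect the data that {ggplot2} actually draws. The sketch below uses layer_data(), a {ggplot2} helper that returns a layer's data after its stat has run, and compares the result with counts computed by hand. The name p_bar is illustrative.

```r
library(ggplot2)
library(palmerpenguins)

p_bar <- ggplot(data = penguins,
                mapping = aes(x = species)) +
  geom_bar()

# stat_count() has added a count column; this is what ends up on the y-axis
layer_data(p_bar)[, c("x", "count")]

# The same counts, computed by hand
table(penguins$species)
```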

Tip: Finding default stats

If you’re ever unsure of a geometry’s default stat, call up the help documentation for the corresponding geom_ function and look at the string value of the stat= argument.
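Besides reading the help page, the default stat can also be checked in code. The sketch below relies on the fact that each layer created by a geom_ or stat_ function stores its stat and geometry objects; treat it as a convenience for exploration rather than the recommended route.

```r
library(ggplot2)

# Interactive route: open the help page and read the stat= argument
# ?geom_bar

# Programmatic route: look at the stat stored in each layer
class(geom_bar()$stat)[1]        # "StatCount"
class(geom_histogram()$stat)[1]  # "StatBin"
class(geom_density()$stat)[1]    # "StatDensity"
class(geom_point()$stat)[1]      # "StatIdentity"

# It works the other way around, too: stats store a default geometry
class(stat_count()$geom)[1]      # "GeomBar"
```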

In-class exercises

To be completed in Posit Cloud using the “In-class exercises: Plotting basics” project.

  1. In a new code cell, create a bar plot where island is mapped to the x-axis and sex is mapped to fill. Instead of using geom_bar(), use stat_count().

Exercise 10 solution

ggplot(data = penguins,
       mapping = aes(x = island, fill = sex)) +
  stat_count() # uses the bar geometry (geom = "bar") in the background