Visualizing Relationships

Sam Mason

Learning goals

Following this lecture, you should be able to:

Distinguish between different fundamental variable types
Choose an appropriate plotting geometry based on the variables being related
Write {ggplot2} code to produce plots that relate two (or more) variables
Understand positions, layers, and facets, and how they relate to one another and to aesthetics, geometries, and stats

Fundamental variable types

Categorical variables take on values from a small set of options (levels)
- Nominal variables have no natural ordering to their levels
- Ordinal variables have a natural ordering to their levels
Numerical variables take on number values
- Continuous variables can (in theory) take on an infinite number of numerical values within a specified range
- Discrete variables take on counting number values

Tip: Plotting discrete numericals

When plotting, discrete numericals with few unique values (think of palmerpenguins::penguins$year) are best treated as categoricals.
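As a minimal sketch of this tip (assuming {ggplot2} and {palmerpenguins} are installed), wrapping year in factor() tells {ggplot2} to treat it as a categorical. The choice of body_mass_g for the y-axis and the name p_year are just for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# year is stored as a number, but it only takes on three unique values
unique(penguins$year)

# factor() converts year to a categorical, so each year gets its own box
p_year <- ggplot(data = penguins,
                 mapping = aes(x = factor(year),
                               y = body_mass_g)) +
  geom_boxplot()
p_year
```

Without factor(), {ggplot2} would see a single continuous x variable and would not split the data into one box per year.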

Relating two categoricals

Motivating question

In this dataset, does the proportion of females (relative to males) differ depending on the island sampled?

In other words, is there some sort of relationship between the sex of the penguins and the island that they live on?
Both sex and island are categorical variables

ggplot(
  data = penguins,
  mapping = aes(x = island)) +
  geom_bar()

ggplot(
  data = drop_na(penguins, sex), # removing obs w/ NA sex
  mapping = aes(x = island, fill = sex)) +
  geom_bar()

Note: The drop_na() function

The drop_na() function comes from {tidyr}. It takes a data frame as its first argument (e.g., penguins), followed by one or more columns (e.g., sex). It removes every observation that has an NA value in any of the given columns.
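To see what drop_na() does, here is a small sketch on a made-up data frame. The data frame toy and its values are hypothetical and exist only for this illustration.

```r
library(tidyr)

# A tiny made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(
  id   = 1:4,
  sex  = c("female", NA, "male", "male"),
  mass = c(3500, 3700, NA, 3900))

# Only rows with NA in the sex column are removed (row 2)
toy_sex <- drop_na(toy, sex)
toy_sex

# With no columns given, rows with NA in any column are removed (rows 2 and 3)
toy_all <- drop_na(toy)
toy_all
```

Naming the column keeps observations that are only missing values you don't need for the plot.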

Position adjustments

We can use position adjustments to change how the geometry is organized
Position adjustments are set within the geometry function, using the position= argument

Big idea: Position adjustments

The position aesthetics (arguments x= and y=) are the big dogs in terms of how a geometry is plotted in space. In general, position adjustments tweak the position aesthetics, though some make more substantive changes (e.g., in a bar plot, position = "fill" rescales every stack to the same height, so the bars show proportions instead of counts).
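The sketch below compares three position adjustments on the sex-by-island bar plot. It assumes {ggplot2}, {palmerpenguins}, and {tidyr} are loaded; the object names (base, p_stack, p_dodge, p_fill) are just for illustration.

```r
library(ggplot2)
library(palmerpenguins)
library(tidyr)

# Shared data and mappings; only the position adjustment changes below
base <- ggplot(data = drop_na(penguins, sex),
               mapping = aes(x = island, fill = sex))

p_stack <- base + geom_bar()                    # default: position = "stack"
p_dodge <- base + geom_bar(position = "dodge")  # bars placed side by side
p_fill  <- base + geom_bar(position = "fill")   # stacks rescaled to height 1

p_fill
```

Since the motivating question asks about proportions, position = "fill" is probably the easiest of the three to read here.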

Relating a categorical and a numerical

Motivating question

How does flipper length tend to differ among the three penguin species?

We’re looking to visualize the relationship between flipper_length_mm and species

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm)) +
  # Observations from all species
  # combined in single curve
  geom_density(fill = "gray30")

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm,
                fill = species,
                color = species)) +
  geom_density(alpha = 0.25) # transparency 25%

Interpreting box plots

A box plot would be the more traditional approach to visualizing this relationship

ggplot(data = penguins,
       mapping = aes(x = species,
                     y = flipper_length_mm)) +
  geom_boxplot()
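To connect the picture to numbers: the box runs from the first quartile to the third quartile, and the line inside it marks the median. As a rough sketch (assuming {palmerpenguins} is installed), we can compute those three values for one species by hand. The names adelie and adelie_quartiles are just for illustration.

```r
library(palmerpenguins)

# Flipper lengths for the Adelie penguins only
adelie <- penguins$flipper_length_mm[penguins$species == "Adelie"]

# First quartile, median, and third quartile:
# the bottom, middle line, and top of the Adelie box
adelie_quartiles <- quantile(adelie,
                             probs = c(0.25, 0.5, 0.75),
                             na.rm = TRUE)  # ignore missing measurements
adelie_quartiles
```

The whiskers reach out to the most extreme observations within 1.5 times the box height (the interquartile range) of the box edges; anything beyond that is drawn as an individual point.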

Relating two numericals

Motivating question

In general, do heavier penguins tend to have longer flippers?

Here we’ll need to relate two continuous variables, body_mass_g and flipper_length_mm

Tip: Response and explanatory variables

The response (a.k.a. dependent) variable is the one that we are interested in learning about. The explanatory (a.k.a. independent) variable is the one that we think might explain the variability we observe in the response. As a rule, we map the response variable to the y-axis, and the explanatory variable to the x-axis.
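One way to sketch this for the motivating question (assuming {ggplot2} and {palmerpenguins} are loaded) is a scatterplot with body mass on the x-axis and flipper length on the y-axis. The name p_scatter is just for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# We suspect body mass helps explain flipper length, so
# body_mass_g goes on the x-axis and flipper_length_mm on the y-axis
p_scatter <- ggplot(data = penguins,
                    mapping = aes(x = body_mass_g,
                                  y = flipper_length_mm)) +
  geom_point()
p_scatter
```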

In-class exercises

To be completed in Posit Cloud using an RStudio project that you create.

Exercises
Solutions

Navigate to Posit Cloud and go to “Your Workspace”
Click on the “New Project” button (upper right-hand corner) and select “New RStudio Project”
Name your project “In-class exercises: Visualizing relationships”
Click on File > New File > Quarto Document…
Give your document a title (“In-class exercises: Visualizing relationships” would be a good title) and enter your name as the author, then click “Create”
Select everything from the “Quarto” heading down and delete it
Create a new header (using two “#”) named “Packages”
In a new code cell below the header, install and load {ggplot2} and {palmerpenguins}
Create a new header (using two “#”) named “Visualizing relationships”
Create a plot that appropriately visualizes each of the following relationships. Use separate code cells for each plot.
1. The relationship between bill_length_mm and bill_depth_mm
2. The relationship between sex and species
3. The relationship between species and body_mass_g

Exercise 10 solutions

Part (a)

# x and y mapping doesn't matter here
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm)) +
  geom_point()

Part (b)

ggplot(data = drop_na(penguins, sex), # install/load {tidyr}
       mapping = aes(x = species,
                     fill = sex)) +
  geom_bar(position = "dodge") # other positions also correct

Part (c)

# Colored density curve solution
ggplot(data = penguins,
       mapping = aes(x = body_mass_g,
                     fill = species,
                     color = species)) +
  geom_density(alpha = 0.25)

# Box plot solution
ggplot(data = penguins,
       mapping = aes(x = species,
                     y = body_mass_g)) +
  geom_boxplot()

Relating more than two variables

Motivating question

How does body mass differ by sex, and does this relationship change depending on the species of penguin?

We have three variables here, body_mass_g, sex, and species

Tip: Mapping three variables

When visualizing the relationship between three (or more) variables, you may need to play around with the mappings to find an arrangement that makes it easy to address your motivating question.
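As a sketch of what "playing around with the mappings" might look like (assuming {ggplot2}, {palmerpenguins}, and {tidyr} are loaded), here are two arrangements of the same three variables. The names p_by_species and p_by_sex are just for illustration.

```r
library(ggplot2)
library(palmerpenguins)
library(tidyr)

# Arrangement 1: species on the x-axis, boxes filled by sex.
# Females and males sit side by side within each species.
p_by_species <- ggplot(data = drop_na(penguins, sex),
                       mapping = aes(x = species,
                                     y = body_mass_g,
                                     fill = sex)) +
  geom_boxplot()

# Arrangement 2: sex on the x-axis, boxes filled by species.
# The three species sit side by side within each sex.
p_by_sex <- ggplot(data = drop_na(penguins, sex),
                   mapping = aes(x = sex,
                                 y = body_mass_g,
                                 fill = species)) +
  geom_boxplot()

p_by_species
```

Both plots show the same six boxes; the first arrangement puts the female and male boxes next to each other, which arguably makes the sex comparison in the motivating question easier to see.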

Motivating question

Does the relationship between bill length and flipper length change depending on the species?

Adding more layers

Trend lines can help us more easily see patterns in scatterplots

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = flipper_length_mm,
                     color = species)) +
  geom_point() +
  geom_smooth(method = "lm") # line of best fit (regression)

Big idea: Layers

The trend line geometry represents a new layer that we’ve added on top of our scatterplot. In {ggplot2} code, each call to a geom_*() function adds a layer. All layers inherit the aesthetic mappings defined in the call to ggplot().

Global and local mappings

Any aesthetics mapped in ggplot() are global — they apply to all layers
We can also map aesthetics locally (i.e., within geom_*() calls), in which case they apply only to that layer
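A minimal sketch of the difference, assuming {ggplot2} and {palmerpenguins} are loaded (the names p_global and p_local are just for illustration):

```r
library(ggplot2)
library(palmerpenguins)

# Global color mapping: both layers are split by species,
# so we get one trend line per species
p_global <- ggplot(data = penguins,
                   mapping = aes(x = bill_length_mm,
                                 y = flipper_length_mm,
                                 color = species)) +
  geom_point() +
  geom_smooth(method = "lm")

# Local color mapping: only the points are colored by species,
# so geom_smooth() fits a single trend line to all penguins
p_local <- ggplot(data = penguins,
                  mapping = aes(x = bill_length_mm,
                                y = flipper_length_mm)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")

p_local
```

Which version you want depends on the question: per-species lines for within-species trends, a single line for the overall trend.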

Faceting for complex plots

Motivating question

In this dataset, does the proportion of females (relative to males) differ depending on the island sampled, and do island-specific proportions differ by year?

Looks like we need to relate three categorical variables: sex, island, and year

Tip: Faceting

Think of faceting as just another aesthetic mapping. The facet aesthetic “pulls” the plot apart so that each value of the “mapped” variable gets its own panel. In general, only categorical variables should be used to facet.
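For the motivating question above, one possible sketch (assuming {ggplot2}, {palmerpenguins}, and {tidyr} are loaded) facets the sex-by-island bar plot by year using facet_wrap(). The name p_facet is just for illustration.

```r
library(ggplot2)
library(palmerpenguins)
library(tidyr)

p_facet <- ggplot(data = drop_na(penguins, sex),
                  mapping = aes(x = island, fill = sex)) +
  geom_bar(position = "fill") +  # proportions rather than counts
  facet_wrap(~year)              # one panel per year
p_facet
```

Each panel repeats the same island-by-sex plot for a single year, so the panels can be compared side by side.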

More concise code

In R, argument names can be dropped as long as the argument values are given in the expected order
Use the help documentation to find the argument order

With arg. names

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm,
                     color = species)) +
  geom_point() +
  geom_smooth(mapping = aes(fill = species),
              method = "lm")

Without arg. names

ggplot(penguins,
       aes(bill_length_mm,
           bill_depth_mm,
           color = species)) +
  geom_point() +
  geom_smooth(aes(fill = species),
              method = "lm")

All layers inherit the data from ggplot()
The aes() function expects x= first and y= second; all other arguments must be named

Tip: Function argument names

Just write the damn argument names. Who are you trying to impress? It makes your code more readable and it mitigates the risk of errors.
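A quick base R sketch of why names help, using seq() (whose first three arguments are from=, to=, and by=) as a stand-in for any function:

```r
# Positional matching: values must be given in the documented order
by_position <- seq(1, 10, 2)

# Named matching: same result, and the intent is obvious to the reader
by_name <- seq(from = 1, to = 10, by = 2)

# With names, the order no longer matters
shuffled <- seq(by = 2, to = 10, from = 1)

identical(by_position, by_name)  # TRUE
identical(by_name, shuffled)     # TRUE
```

Without names, shuffling the values would silently change their meaning (or throw an error), which is exactly the risk that argument names guard against.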

In-class exercises

To be completed in Posit Cloud.

Exercises
Solutions

Create a new header (using two “#”) named “Complex relationships”
Create each of the following plots in its own code cell
1. A boxplot showing the relationship between species and bill_depth_mm where boxes are filled by sex
2. A density curve plot showing the relationship between body_mass_g and sex faceted by species
3. A scatterplot showing the relationship between bill_length_mm and body_mass_g colored by species and faceted by island

Exercise 12 solutions

Part (a)

ggplot(data = drop_na(penguins, sex),
       mapping = aes(x = species,
                     y = bill_depth_mm,
                     fill = sex)) +
  geom_boxplot()

Part (b)

ggplot(data = drop_na(penguins, sex),
       mapping = aes(x = body_mass_g,
                     fill = sex,
                     color = sex)) +
  geom_density(alpha = 0.25) +
  facet_wrap(~species)

Part (c)

ggplot(data = penguins,
       mapping = aes(x = body_mass_g,
                     y = bill_length_mm,
                     color = species)) +
  geom_point() +
  facet_wrap(~island)