Following this lecture, you should be able to:
Write {ggplot2} code to produce plots that relate two (or more) variables
Categorical variables take on values from a small set of options (levels)
Nominal variables have no natural ordering to their levels
Ordinal variables have a natural ordering to their levels
Numerical variables take on number values
Continuous variables can (in theory) take on an infinite number of numerical values within a specified range
Discrete variables take on counting number values
Tip: Plotting discrete numericals
When plotting, discrete numericals with few unique values (think of palmerpenguins::penguins$year) are best treated as categoricals.
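As a minimal sketch of this tip, wrapping year in factor() tells {ggplot2} to treat it as categorical. Pairing it with body_mass_g and a boxplot is an illustrative choice, not something taken from the lecture.

```r
library(ggplot2)
library(palmerpenguins)

# year only takes a few unique values, so factor() turns it into a
# categorical variable with one level per year
year_plot <- ggplot(data = penguins,
                    mapping = aes(x = factor(year), y = body_mass_g)) +
  geom_boxplot()

year_plot
```

Without factor(), year is read as a continuous variable and you would get a single box (plus a warning about a missing grouping) rather than one box per year.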
Motivating question
In this dataset, does the proportion of females (relative to males) differ depending on the island sampled?
In other words, is there some sort of relationship between the sex of the penguins and the island that they live on?
Both sex and island are categorical variables
Note: The drop_na() function
The drop_na() function comes from {tidyr}. It takes a data frame as its first argument (e.g., penguins) and then one or more columns (e.g., sex). This function removes all observations that have NA values in the given column(s).
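A short sketch of drop_na() in action, using the penguins data and the sex column from the note. The object name penguins_sex is made up for illustration.

```r
library(tidyr)
library(palmerpenguins)

# Keep only the observations where sex is not missing
penguins_sex <- drop_na(penguins, sex)

# Compare the number of rows before and after
nrow(penguins)
nrow(penguins_sex)

# Confirm that no NA values remain in the sex column
anyNA(penguins_sex$sex)
```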
We can use position adjustments to change how the geometry is organized
Positional adjustments are set within the geometry function
Big idea: Position adjustments
The position aesthetics (arguments x= and y=) are the big dogs in terms of how a geometry is plotted in space. In general, position adjustments tweak the position aesthetics, though some make more substantive changes (e.g., in a bar plot, position = "fill" triggers a whole new statistical transformation).
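The sketch below shows one way to compare position adjustments for the sex-by-island question. It assumes a bar plot with island on the x-axis and sex mapped to fill. The object names are hypothetical.

```r
library(ggplot2)
library(tidyr)
library(palmerpenguins)

penguins_sex <- drop_na(penguins, sex)

# Default position ("stack"): bar segments are counts, stacked by sex
stacked <- ggplot(data = penguins_sex,
                  mapping = aes(x = island, fill = sex)) +
  geom_bar()

# position = "dodge": one bar per sex, placed side by side
dodged <- ggplot(data = penguins_sex,
                 mapping = aes(x = island, fill = sex)) +
  geom_bar(position = "dodge")

# position = "fill": each stacked bar is rescaled to a total height of 1,
# so the segments show proportions rather than counts
filled <- ggplot(data = penguins_sex,
                 mapping = aes(x = island, fill = sex)) +
  geom_bar(position = "fill")

filled
```

Note that the position argument sits inside geom_bar(), not inside aes().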
Motivating question
How does flipper length tend to differ among the three penguin species?
flipper_length_mm and species
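One possible way to relate a numerical variable to a categorical one is a boxplot. The choice of geom_boxplot() here is an assumption for illustration. Other geometries could address the same question.

```r
library(ggplot2)
library(palmerpenguins)

# One box per species summarizes the distribution of flipper lengths
flipper_plot <- ggplot(data = penguins,
                       mapping = aes(x = species, y = flipper_length_mm)) +
  geom_boxplot()

flipper_plot
```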
Motivating question
In general, do heavier penguins tend to have longer flippers?
body_mass_g and flipper_length_mm
Tip: Response and explanatory variables
The response (a.k.a. dependent) variable is the one that we are interested in learning about. The explanatory (a.k.a. independent) variable is the one that we think might explain the variability we observe in the response. As a rule, we map the response variable to the y-axis, and the explanatory variable to the x-axis.
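For the body mass question, a scatterplot following this rule might look like the sketch below. It treats flipper_length_mm as the response (y-axis) and body_mass_g as the explanatory variable (x-axis), which is one reading of the motivating question.

```r
library(ggplot2)
library(palmerpenguins)

# Explanatory variable on x, response variable on y
mass_flipper <- ggplot(data = penguins,
                       mapping = aes(x = body_mass_g,
                                     y = flipper_length_mm)) +
  geom_point()

mass_flipper
```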
To be completed in Posit Cloud using an RStudio project that you create.
{ggplot2} and {palmerpenguins}
bill_length_mm and bill_depth_mm
sex and species
species and body_mass_g
Motivating question
How does body mass differ by sex, and does this relationship change depending on the species of penguin?
body_mass_g, sex, and species
Tip: Mapping three variables
When visualizing the relationship between three (or more) variables, you may need to play around with the mappings to find an arrangement that makes it easy to address your motivating question.
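As a sketch of what playing around with the mappings can look like, here are two arrangements of the same three variables. Both are assumptions for illustration. Which one reads better depends on the comparison you care about.

```r
library(ggplot2)
library(tidyr)
library(palmerpenguins)

penguins_sex <- drop_na(penguins, sex)

# Arrangement 1: species on x, sex as fill
# (makes it easy to compare sexes within each species)
by_species <- ggplot(data = penguins_sex,
                     mapping = aes(x = species, y = body_mass_g,
                                   fill = sex)) +
  geom_boxplot()

# Arrangement 2: sex on x, species as fill
# (makes it easy to compare species within each sex)
by_sex <- ggplot(data = penguins_sex,
                 mapping = aes(x = sex, y = body_mass_g,
                               fill = species)) +
  geom_boxplot()

by_species
```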
Motivating question
Does the relationship between bill length and flipper length change depending on the species?
Big idea: Layers
The trend line geometry represents a new layer that we’ve added on top of our scatterplot. In {ggplot2} code, each call to a geom_*() function adds a layer. All layers inherit the aesthetic mappings defined in the first call to ggplot().
Any aesthetics mapped in ggplot() are global: they apply to all layers
We can also map aesthetics locally (i.e., within geom_*() calls)
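The difference between global and local mappings can be sketched with the bill length and flipper length question. Using geom_smooth(method = "lm") for the trend line is an assumption. The lecture's own trend line may use different settings.

```r
library(ggplot2)
library(palmerpenguins)

# Global color mapping: both layers inherit it, so we get
# colored points AND one trend line per species
global_color <- ggplot(data = penguins,
                       mapping = aes(x = bill_length_mm,
                                     y = flipper_length_mm,
                                     color = species)) +
  geom_point() +
  geom_smooth(method = "lm")

# Local color mapping: only the points are colored by species,
# so the trend line layer fits a single line to all penguins
local_color <- ggplot(data = penguins,
                      mapping = aes(x = bill_length_mm,
                                    y = flipper_length_mm)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")

global_color
```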
Motivating question
In this dataset, does the proportion of females (relative to males) differ depending on the island sampled, and do island-specific proportions differ by year?
sex, island, and year
Tip: Faceting
Think of faceting as just another aesthetic mapping. The facet aesthetic “pulls” the plot apart so that each value of the “mapped” variable gets its own panel. In general, only categorical variables should be used to facet.
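A minimal sketch of faceting for the sex, island, and year question, assuming a filled bar plot and facet_wrap() as the faceting function:

```r
library(ggplot2)
library(tidyr)
library(palmerpenguins)

penguins_sex <- drop_na(penguins, sex)

# Each value of year gets its own panel; within a panel,
# bars show the proportion of each sex on each island
by_year <- ggplot(data = penguins_sex,
                  mapping = aes(x = island, fill = sex)) +
  geom_bar(position = "fill") +
  facet_wrap(facets = vars(year))

by_year
```

Here year works as a faceting variable because it only takes a few unique values, so it behaves like a categorical.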
In R, argument names can be dropped as long as the argument values are given in the expected order
Use the help documentation to find the argument order
With arg. names
All layers inherit the data from ggplot()
The aes() function expects x= first and y= second; all other arguments must be named
Tip: Function argument names
Just write the damn argument names. Who are you trying to impress? It makes your code more readable and it mitigates the risk of errors.
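To make the comparison concrete, here is a sketch of the same plot written with and without argument names. The variables plotted are an illustrative choice.

```r
library(ggplot2)
library(palmerpenguins)

# With argument names: the intent of every value is explicit
named <- ggplot(data = penguins,
                mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point()

# Without argument names: this only works because ggplot() expects
# data then mapping, and aes() expects x then y
unnamed <- ggplot(penguins, aes(body_mass_g, flipper_length_mm)) +
  geom_point()

named
```

The two versions produce the same plot, but the unnamed style is easy to get wrong. For example, the first argument of geom_point() is mapping rather than data, so passing a data frame there without a name leads to an error.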
To be completed in Posit Cloud.
species and bill_depth_mm where boxes are filled by sex
body_mass_g and sex faceted by species
bill_length_mm and body_mass_g colored by species and faceted by island
DSC 210 Data Wrangling