ggplot2 Fundamentals Example

Author: Erik Westlund

Published: June 11, 2025

Modified: June 12, 2025

# Install required packages if not already installed
required_packages <- c("dplyr", "ggplot2", "forcats", "kableExtra", "readr", "ggtext", "here")  # here is needed for here::here() below
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load required packages
library(dplyr)
library(ggplot2)
library(forcats)
library(kableExtra)
library(readr)
library(ggtext)  # Add ggtext for markdown support

Research Question

We had substantial missing data on depression/anxiety questions. For this reason, let’s focus on the relationship between location and binge drinking.

Workflow with ggplot2:

  1. Start with data
  2. Pick an aesthetic mapping
  3. Choose a geometric object
  4. Add statistical transformations
  5. Adjust finer details: scales, coordinate systems, faceting, etc.

Build this up, layer by layer.
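As a minimal sketch of the five steps, the example below uses mtcars, a dataset that ships with R. It is used purely for illustration and is not part of this analysis. Each step adds to the previous object.

```r
library(ggplot2)

# Step 1: start with data (mtcars ships with R; illustration only)
# Step 2: pick an aesthetic mapping
s2 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg))

# Step 3: choose a geometric object (bars)
# Step 4: add a statistical transformation (mean mpg per cylinder group)
s4 <- s2 + stat_summary(fun = mean, geom = "bar")

# Step 5: adjust finer details (labels and theme)
s5 <- s4 +
  labs(x = "Cylinders", y = "Mean miles per gallon") +
  theme_minimal()

s5
```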

Step 1. Data

Let’s load the data.

df_final <- readRDS(here::here("data", "processed", "cdc_prams_df_final.rds"))

df_final |> 
  glimpse()
Rows: 1,222
Columns: 8
$ location_abbr                        <fct> AR, AR, AR, AR, AR, AR, AR, AR, A…
$ subgroup_cat                         <chr> "Adequacy of Prenatal care", "Ade…
$ subgroup                             <chr> "ADEQUATE PNC", "INADEQUATE PNC",…
$ depression_within_3_months_birth     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
$ anxiety_within_3_months_birth        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
$ binge_drinking_within_3_months_birth <dbl> 24.6, 12.6, 22.2, 36.0, 21.5, 24.…
$ alcohol_use_within_3_months_birth    <dbl> 53.5, 28.7, 40.9, 54.1, 45.8, 49.…
$ location                             <fct> Arkansas, Arkansas, Arkansas, Ark…

Let’s filter out missing data now to avoid warnings later.

df_binge_location <- df_final |>
  filter(!is.na(binge_drinking_within_3_months_birth)) |> 
  select(location_abbr, subgroup_cat, subgroup, location, binge_drinking_within_3_months_birth)
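To see how much is missing before dropping rows, one option is to count NA values per column. The sketch below uses a small made-up data frame; toy_df and its values are hypothetical and are not the PRAMS data.

```r
library(dplyr)

# Hypothetical toy data standing in for df_final
toy_df <- tibble(
  location = c("A", "B", "C", "D"),
  depression = c(NA, NA, 12.1, NA),
  binge_drinking = c(24.6, 12.6, NA, 36.0)
)

# Count missing values in every column
na_counts <- toy_df |>
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts

# Keep only rows where the outcome of interest is present
toy_complete <- toy_df |>
  filter(!is.na(binge_drinking))
```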

Filtering out missing values is not enough on its own. Remember, the data contains subgroupings:

df_binge_location |>
  select(subgroup_cat, subgroup) |>
  distinct() |> 
  kable()
subgroup_cat                                 subgroup
Adequacy of Prenatal care                    ADEQUATE PNC
Adequacy of Prenatal care                    INADEQUATE PNC
Adequacy of Prenatal care                    INTERMEDIATE PNC
Adequacy of Prenatal care                    UNKNOWN PNC
Birth Weight                                 LBW (<=2500g)
Birth Weight                                 NBW (>2500g)
Income (years 2004 and beyond)               $10,000 to $24,999
Income (years 2004 and beyond)               $25,000 to $49,999
Income (years 2004 and beyond)               $50,000 or more
Income (years 2004 and beyond)               Less than $10,000
Marital Status                               MARRIED
Marital Status                               OTHER
Maternal Age (3 Levels)                      20-29 yrs
Maternal Age (3 Levels)                      30+ yrs
Maternal Age (3 Levels)                      <20 yrs
Maternal Age (4 Levels)                      20-24 yrs
Maternal Age (4 Levels)                      25-34 yrs
Maternal Age (4 Levels)                      35+ yrs
Maternal Age (4 Levels)                      <20 yrs
Maternal Age - 18 to 44 years in groupings   Age 18 - 24
Maternal Age - 18 to 44 years in groupings   Age 25 - 29
Maternal Age - 18 to 44 years in groupings   Age 30 - 44
Maternal Age - 18 to 44 years in groupings   Age < 18
Maternal Age - 18 to 44 years only           Age 18 - 44
Maternal Education                           12 yrs
Maternal Education                           < 12 yrs
Maternal Education                           >12 yrs
Maternal Race/Ethnicity                      Black, non-Hispanic
Maternal Race/Ethnicity                      Hispanic
Maternal Race/Ethnicity                      White, non-Hispanic
Medicaid Recipient                           Medicaid
Medicaid Recipient                           Non-Medicaid
Mother Hispanic                              Hispanic
Mother Hispanic                              Non-Hispanic
None                                         None
Number of Previous Live Births               0
Number of Previous Live Births               1 or more
On WIC during Pregnancy                      Non-WIC
On WIC during Pregnancy                      WIC
Pregnancy Intendedness                       Intended
Pregnancy Intendedness                       Unintended
Smoked 3 months before Pregnancy             Non-Smoker
Smoked 3 months before Pregnancy             Smoker
Smoked last 3 months of Pregnancy            Non-Smoker
Smoked last 3 months of Pregnancy            Smoker
Maternal Race/Ethnicity                      Other non-Hispanic

For now, let’s keep only the rows where subgroup_cat is “None”.

df_binge_location <- df_binge_location |>
  filter(subgroup_cat == "None")

Step 2. Aesthetic Mappings

  • What is your X (independent) variable?
  • What is your Y (dependent) variable?

Here:

  • x = location
  • y = binge_drinking_within_3_months_birth

p1 <- df_binge_location |>
    ggplot(aes(x = location, y = binge_drinking_within_3_months_birth)) 

p1

We can see already we’re likely going to want to do a coordinate flip, but let’s save that for later.

Note how we save our plot as an object, p1.

It’s often nice to build plots up, step by step, with p{n} where n is the step number.

Step 3. Geometric Object

  • Our x is a categorical variable
  • Our y is a continuous variable

It makes sense to use a bar chart.

p2 <- p1 + geom_bar(stat = "identity")

p2
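ggplot2 also provides geom_col(), which takes bar heights directly from the y values, so it can stand in for geom_bar(stat = "identity"). A small sketch with made-up numbers (toy_rates is hypothetical):

```r
library(ggplot2)

# Hypothetical pre-aggregated rates, one row per location
toy_rates <- data.frame(
  location = c("A", "B", "C"),
  rate = c(20.5, 31.2, 25.0)
)

base <- ggplot(toy_rates, aes(x = location, y = rate))

# These two layers draw the same bars
bars_identity <- base + geom_bar(stat = "identity")
bars_col <- base + geom_col()

bars_col
```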

Step 4. Statistical Transformations

  • The public 2011 PRAMS data we are using is aggregated at the location level.
  • For this reason, we don’t need to do any statistical transformations.

If we instead had individual-level data and wanted to plot it at the location level, we would need an aggregation function. I’ll show this below with simulated data.
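A minimal sketch of that situation, assuming simulated individual-level data. The data frame sim_individuals, its locations, and the underlying rates are all invented for illustration. Each row is one respondent with a 0/1 indicator, and stat_summary() computes the percentage per location at plot time.

```r
library(ggplot2)

set.seed(123)  # make the simulation reproducible

# Simulated respondents: 200 per location, 1 = reported binge drinking
sim_individuals <- data.frame(
  location = rep(c("A", "B", "C"), each = 200),
  binge = c(rbinom(200, 1, 0.20), rbinom(200, 1, 0.30), rbinom(200, 1, 0.25))
)

# Let ggplot2 aggregate: the mean of a 0/1 variable times 100 is a percentage
p_sim <- ggplot(sim_individuals, aes(x = location, y = binge)) +
  stat_summary(fun = function(v) mean(v) * 100, geom = "bar") +
  labs(y = "Percent")

p_sim
```

The same numbers could instead be computed first with dplyr's group_by() and summarise(), and then drawn with geom_bar(stat = "identity") as above.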

Step 5. Adjust Fine Details

Coordinate Flip

First, let’s flip the coordinates.

p3 <- p2 + coord_flip()

p3

That looks a lot better.
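In ggplot2 3.3.0 and later, bar geoms can also be drawn horizontally by mapping the categorical variable to y, which is an alternative to coord_flip(). A sketch with hypothetical data (toy_rates is made up):

```r
library(ggplot2)

# Hypothetical pre-aggregated rates, one row per location
toy_rates <- data.frame(
  location = c("A", "B", "C"),
  rate = c(20.5, 31.2, 25.0)
)

# Category on y, value on x: the bars come out horizontal without coord_flip()
p_horizontal <- ggplot(toy_rates, aes(x = rate, y = location)) +
  geom_col()

p_horizontal
```

With this approach the axis names used in scales and themes match what is displayed, at the cost of departing from the x = category, y = value mapping used in the rest of this walkthrough.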

Fixing Labels

Right now, the labels are reverse alphabetical. Let’s fix that.

p4 <- p3 + 
  scale_x_discrete(limits = rev(levels(df_binge_location$location)))

p4

We might also consider sorting by the rate of binge drinking.

p5 <- df_binge_location |>
  mutate(location = fct_reorder(location, binge_drinking_within_3_months_birth)) |>
  ggplot(aes(x = location, y = binge_drinking_within_3_months_birth)) +
  geom_bar(stat = "identity") +
  coord_flip()

p5

Ah – something of an insight!
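To see what fct_reorder() does, it helps to look at the factor levels before and after. In this sketch the data frame toy_rates and its values are hypothetical.

```r
library(forcats)

# Hypothetical rates, one row per location
toy_rates <- data.frame(
  location = factor(c("A", "B", "C")),
  rate = c(31.2, 20.5, 25.0)
)

# Default factor levels are alphabetical
levels(toy_rates$location)

# Reorder levels by rate, ascending; ggplot2 draws discrete axes in level order
reordered <- fct_reorder(toy_rates$location, toy_rates$rate)
levels(reordered)

# Use .desc = TRUE to put the largest value first
reordered_desc <- fct_reorder(toy_rates$location, toy_rates$rate, .desc = TRUE)
levels(reordered_desc)
```

After coord_flip(), the first level sits at the bottom of the axis, so ascending order places the highest rate at the top.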

Labels

Our labels are not great. We do not need an axis title for location, since the location names speak for themselves, but we do need to label the percent axis and give the plot a title and a description.

p6 <- p5 +
  labs(
    x = NULL,
    y = "Percent",
    title = "Binge Drinking Before Pregnancy",
    subtitle = "Percentage of mothers who reported binge drinking in the 3 months before pregnancy"
  )

p6

You can consider where to put the necessary description. Possible options:

  • title
  • subtitle
  • caption
  • x-axis or y-axis labels

Above, I opted for the subtitle.
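For comparison, here is a sketch of the caption option, which places the description below the plot. The data frame toy_rates and the label text are hypothetical.

```r
library(ggplot2)

# Hypothetical pre-aggregated rates, one row per location
toy_rates <- data.frame(
  location = c("A", "B", "C"),
  rate = c(20.5, 31.2, 25.0)
)

p_caption <- ggplot(toy_rates, aes(x = location, y = rate)) +
  geom_bar(stat = "identity") +
  labs(
    x = NULL,
    y = "Percent",
    title = "Short headline",
    caption = "Longer description or data source, shown below the plot"
  )

p_caption
```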

Theme

I find the minimal theme to be a good starting point.

p7 <- p6 +
  theme_minimal()

p7

Gridlines

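Here, the horizontal gridlines are not helpful. (Remember, we used coord_flip() above. Theme elements refer to the axes as displayed, so the horizontal gridlines are controlled by the .y elements even though location is mapped to x.)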

p8 <- p7 +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank()
  )

p8

Let’s make the title bold:

p9 <- p8 +
  theme(plot.title = element_text(face = "bold"))

p9

Emphasizing Certain Values

Imagine we were interested in comparing New York with everyone else. We have two New York bars.

We could reduce the opacity of the bars, color the New York bars, and then annotate the specific values.

Let’s start by graying out the bars and coloring the two New York bars red.

Let’s also rebuild the entire plot from scratch so we can see all the steps we took.

p_ny <- df_binge_location |>
  mutate(location = fct_reorder(location, binge_drinking_within_3_months_birth)) |>
  ggplot(aes(x = location, y = binge_drinking_within_3_months_birth)) +
  geom_bar(
    stat = "identity",
    aes(fill = location %in% c("New York (excluding NYC)", "New York City"))
  ) +
  scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "lightgray")) +
  coord_flip() +
  theme_minimal() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    legend.position = "none" # Remove the legend since we don't need it
  ) +
  labs(
    x = NULL,
    y = "Percent",
    title = "Binge Drinking Before Pregnancy",
    subtitle = "Percentage of mothers who reported binge drinking in the 3 months before pregnancy"
  ) +
  theme(plot.title = element_text(face = "bold"))


p_ny

Convert to Lollipop Chart

I feel the gray is too overwhelming and the red is too bright. We can convert it to a so-called lollipop chart. The information is the same, but for many bars this can look cleaner.

p_ny2 <- df_binge_location |>
    mutate(
      location = fct_reorder(location, binge_drinking_within_3_months_birth),
      is_ny = location %in% c("New York (excluding NYC)", "New York City")
    ) |>
    ggplot(aes(x = location, y = binge_drinking_within_3_months_birth)) +
    geom_segment(aes(xend = location, yend = 0,
                    color = is_ny, linewidth = is_ny)) +
    geom_point(aes(color = is_ny, size = is_ny)) +
    scale_color_manual(
      values = c("TRUE" = "#E63946", "FALSE" = "grey")  # Softer red
    ) +
    # Map is_ny to widths and sizes so they cannot drift out of sync with the rows
    scale_linewidth_manual(values = c("TRUE" = 1.5, "FALSE" = 0.8)) +
    scale_size_manual(values = c("TRUE" = 4.5, "FALSE" = 2.2)) +
    scale_x_discrete(
      labels = function(x) ifelse(x %in% c("New York (excluding NYC)", "New York City"), 
                                 paste0("**", x, "**"), 
                                 x)
    ) +
    coord_flip() +
    theme_minimal() +
    theme(
      panel.grid.major.y = element_blank(),
      panel.grid.minor.y = element_blank(),
      legend.position = "none",  # Remove legend
      axis.text.y = element_markdown(size = 11)  # Use markdown for labels
    ) +
    labs(
      x = NULL,
      y = "Percent",
      title = "Binge Drinking Before Pregnancy in New York vs. Other States",
      subtitle = "Percentage of mothers who reported binge drinking in the 3 months before pregnancy"
    ) +
    theme(plot.title = element_text(face = "bold"))


p_ny2
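The labels argument of scale_x_discrete() accepts a function that receives the axis breaks as a character vector and returns the text to display. Pulled out as a standalone helper, the logic looks like this (bold_if_highlighted is a hypothetical name introduced here for illustration):

```r
# Hypothetical helper: wrap selected labels in Markdown bold markers
bold_if_highlighted <- function(x, highlight) {
  ifelse(x %in% highlight, paste0("**", x, "**"), x)
}

ny_labels <- c("New York (excluding NYC)", "New York City")

# Only the matching label is wrapped; the other is returned unchanged
bolded <- bold_if_highlighted(c("Arkansas", "New York City"), highlight = ny_labels)
bolded
```

The ** markers are rendered as bold only because axis.text.y is set to element_markdown() from ggtext. With the default element_text() they would appear literally.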

Add Text Annotations

p_ny3 <- p_ny2 +
  geom_text(data = df_binge_location |>
              filter(location %in% c("New York (excluding NYC)", "New York City")) |>
              mutate(location = fct_reorder(location, binge_drinking_within_3_months_birth)),
            aes(x = location, y = binge_drinking_within_3_months_birth,
                label = paste0(round(binge_drinking_within_3_months_birth, 1), "%")),
            hjust = -0.5,
            vjust = 0.5,
            size = 4,
            fontface = "bold")

p_ny3

This isn’t perfect, but I think it’s a vast improvement and conveys the message well.

We can also force the scale to 100%; this can sometimes help viewers realize the actual rates.

Force Scale to 100%

p_ny4 <- p_ny3 +
  scale_y_continuous(limits = c(0, 100))

p_ny4
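Since the axis now runs from 0 to 100, the tick labels can also carry a percent sign. A sketch with made-up data (toy_rates is hypothetical), using a plain labelling function so that no extra packages are assumed:

```r
library(ggplot2)

# Hypothetical pre-aggregated rates, one row per location
toy_rates <- data.frame(
  location = c("A", "B", "C"),
  rate = c(20.5, 31.2, 25.0)
)

p_pct <- ggplot(toy_rates, aes(x = location, y = rate)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_y_continuous(
    limits = c(0, 100),                   # force the full 0-100 range
    labels = function(v) paste0(v, "%")   # append a percent sign to each tick
  )

p_pct
```

Note that limits set on a scale drop observations that fall outside them, which is harmless here because every percentage lies between 0 and 100.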

White Space Adjustments

One final touch: I do not think there is enough vertical space between the title/subtitle and the lollipop lines. Likewise, the x-axis title is a bit too close to the tick labels.

p_ny5 <- p_ny4 +
  theme(
    plot.subtitle = element_text(margin = margin(b = 12)),
    axis.title.x = element_text(margin = margin(t = 10))
  )

p_ny5