AI & LLM Example

Author

Erik Westlund

Published

June 11, 2025

Modified

June 12, 2025

Overview

Data Preparation

df <- read_csv(here::here("data", "raw", "cdc_Influenza_Vaccination_Coverage_for_All_Ages__6__Months__20250610.csv"))

df |> glimpse()
Rows: 220,729
Columns: 11
$ Vaccine              <chr> "Seasonal Influenza", "Seasonal Influenza", "Seas…
$ `Geography Type`     <chr> "Counties", "Counties", "Counties", "Counties", "…
$ Geography            <chr> "New Haven", "New Haven", "New Haven", "New Haven…
$ FIPS                 <chr> "09009", "09009", "09009", "09009", "09009", "090…
$ `Season/Survey Year` <chr> "2018", "2021", "2020", "2021", "2018", "2019", "…
$ Month                <dbl> 1, 1, 1, 1, 1, 1, 5, 4, 11, 10, 9, 8, 9, 1, 12, 1…
$ `Dimension Type`     <chr> ">=18 Years", ">=18 Years", "Age", "Age", "Age", …
$ Dimension            <chr> "Non-Medical Setting", "Non-Medical Setting", ">=…
$ `Estimate (%)`       <chr> "45.5", "53.0", "52.4", "50.2", "34.0", "49.5", "…
$ `95% CI (%)`         <chr> "43.9 to 47.2", "46.0 to 60.9", "50.6 to 54.3", "…
$ `Sample Size`        <dbl> NA, NA, NA, NA, NA, NA, 48, 48, 552, 552, 552, 55…

Chat with the LLM

Data Preparation

From here, we can chat with the LLM to help us prepare the data for a visualization.

Below are the key prompts we sent to ChatGPT, along with its results. Some of the LLM’s missteps are excluded.

Initial Data Exploration

I am working with an unfamiliar data set. Here is a glimpse using dplyr:

df <- read_csv(here::here("data", "raw", "cdc_Influenza_Vaccination_Coverage_for_All_Ages__6__Months__20250610.csv"))
Rows: 220729 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Vaccine, Geography Type, Geography, FIPS, Season/Survey Year, Dimen...
dbl (2): Month, Sample Size

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df |> glimpse()
Rows: 220,729
Columns: 11
$ Vaccine              <chr> "Seasonal Influenza", "Seasonal Influenza", "Seas…
$ `Geography Type`     <chr> "Counties", "Counties", "Counties", "Counties", "…
$ Geography            <chr> "New Haven", "New Haven", "New Haven", "New Haven…
$ FIPS                 <chr> "09009", "09009", "09009", "09009", "09009", "090…
$ `Season/Survey Year` <chr> "2018", "2021", "2020", "2021", "2018", "2019", "…
$ Month                <dbl> 1, 1, 1, 1, 1, 1, 5, 4, 11, 10, 9, 8, 9, 1, 12, 1…
$ `Dimension Type`     <chr> ">=18 Years", ">=18 Years", "Age", "Age", "Age", …
$ Dimension            <chr> "Non-Medical Setting", "Non-Medical Setting", ">=…
$ `Estimate (%)`       <chr> "45.5", "53.0", "52.4", "50.2", "34.0", "49.5", "…
$ `95% CI (%)`         <chr> "43.9 to 47.2", "46.0 to 60.9", "50.6 to 54.3", "…
$ `Sample Size`        <dbl> NA, NA, NA, NA, NA, NA, 48, 48, 552, 552, 552, 55…

The bot then summarized the data set.

Using the following prompt, we then sent the LLM the actual data file for more context (🚨 do not do this with private data! 🚨):

“I am actually going to give you the file. Interpret it. We are going to use dplyr and ggplot2 to prepare it and visualize aspects of it. Tell me what you see.”

It responded with a summary, warning us about some of the odd structures of the data.

We then asked it to clean the data:

“Yes, let’s get clean, standardized column names. Let’s make all factor variables factor types. Let’s extract the confidence intervals into separate columns. Let’s just start there. Give me R code that I can paste into RStudio.”

Note that from here on, ChatGPT's responses included library calls along with the new code. Those calls are omitted below because the libraries were already loaded earlier in this document.
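The setup chunk that loads the libraries is not visible in this rendered output. Judging only from the functions called in the code on this page, it presumably resembles the sketch below; the exact set of packages is an assumption, not a copy of the original chunk.

```r
# Assumed setup, inferred from the functions used on this page.
library(tidyverse) # read_csv(), glimpse(), mutate(), str_*(), fct_relevel(), ggplot()
library(janitor)   # clean_names()
library(here)      # here::here() for project-relative file paths
```

Loading `tidyverse` attaches readr, dplyr, stringr, forcats, and ggplot2 in one call, which covers everything here except `clean_names()` and `here()`.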

Its first attempt was broken because it did not fully understand the column names that clean_names() produces for this data.

We eventually sent it an explanation of the data structure, after which it recovered and returned the following:

# Clean and prepare
clean_df <- df |>
  clean_names() |>
  mutate(
    vaccine = factor(vaccine),
    geography_type = factor(geography_type),
    geography = factor(geography),
    season_survey_year = factor(season_survey_year),
    dimension_type = factor(dimension_type),
    dimension = factor(dimension),
    estimate = as.numeric(str_replace(estimate_percent, "[^0-9.]", "")),
    ci_clean = str_remove(x95_percent_ci_percent, " ‡| †"),
    ci_lower = as.numeric(str_extract(ci_clean, "^[0-9.]+")),
    ci_upper = as.numeric(str_extract(ci_clean, "[0-9.]+$"))
  )
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `estimate = as.numeric(str_replace(estimate_percent, "[^0-9.]",
  ""))`.
Caused by warning:
! NAs introduced by coercion
# Preview result
glimpse(clean_df)
Rows: 220,729
Columns: 15
$ vaccine                <fct> "Seasonal Influenza", "Seasonal Influenza", "Se…
$ geography_type         <fct> Counties, Counties, Counties, Counties, Countie…
$ geography              <fct> New Haven, New Haven, New Haven, New Haven, New…
$ fips                   <chr> "09009", "09009", "09009", "09009", "09009", "0…
$ season_survey_year     <fct> 2018, 2021, 2020, 2021, 2018, 2019, 2009-10, 20…
$ month                  <dbl> 1, 1, 1, 1, 1, 1, 5, 4, 11, 10, 9, 8, 9, 1, 12,…
$ dimension_type         <fct> >=18 Years, >=18 Years, Age, Age, Age, Age, Rac…
$ dimension              <fct> "Non-Medical Setting", "Non-Medical Setting", "…
$ estimate_percent       <chr> "45.5", "53.0", "52.4", "50.2", "34.0", "49.5",…
$ x95_percent_ci_percent <chr> "43.9 to 47.2", "46.0 to 60.9", "50.6 to 54.3",…
$ sample_size            <dbl> NA, NA, NA, NA, NA, NA, 48, 48, 552, 552, 552, …
$ estimate               <dbl> 45.5, 53.0, 52.4, 50.2, 34.0, 49.5, 57.7, 57.7,…
$ ci_clean               <chr> "43.9 to 47.2", "46.0 to 60.9", "50.6 to 54.3",…
$ ci_lower               <dbl> 43.9, 46.0, 50.6, 45.4, 32.6, 47.8, 40.2, 40.2,…
$ ci_upper               <dbl> 47.2, 60.9, 54.3, 55.8, 35.5, 51.3, 75.2, 75.2,…
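
The coercion warning above most likely comes from entries in estimate_percent that are not numbers. The sketch below walks through the string handling on a few made-up values. The sample strings, including "NR †", are hypothetical stand-ins and are not taken from the CDC file.

```r
library(stringr)

# Hypothetical raw values; "NR †" is an invented stand-in for a non-numeric entry.
est_raw <- c("45.5", "NR †")
ci_raw  <- c("43.9 to 47.2", "46.0 to 60.9 ‡")

# str_replace() drops only the first non-numeric character ("NR †" becomes "R †"),
# while str_replace_all() drops every one of them ("NR †" becomes "").
first_only  <- str_replace(est_raw, "[^0-9.]", "")
all_matches <- str_replace_all(est_raw, "[^0-9.]", "")

# Neither leftover parses as a number, so the estimate is NA either way.
estimate <- suppressWarnings(as.numeric(first_only))

# Strip a footnote marker, then pull the leading and trailing numbers.
ci_clean <- str_remove(ci_raw, " ‡| †")
ci_lower <- as.numeric(str_extract(ci_clean, "^[0-9.]+"))
ci_upper <- as.numeric(str_extract(ci_clean, "[0-9.]+$"))
```

If the non-numeric entries really are missing-value codes, an NA estimate is a reasonable outcome. Counting them with something like sum(is.na(clean_df$estimate)) would confirm how many rows are affected before they are filtered out later.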

Data Summary

We then sent it the following to show it the structure of the data, which is somewhat unusual:

table(clean_df$dimension)

                                             >=18 Years 
                                                  26534 
                                             >=6 Months 
                                                  10449 
                                             >=65 Years 
                                                  10903 
                                            13-17 Years 
                                                  10265 
                                            18-49 Years 
                                                   9931 
                               18-49 Years at High Risk 
                                                  11038 
                           18-49 Years not at High Risk 
                                                   9257 
                                            18-64 Years 
                                                   9942 
                               18-64 Years at High Risk 
                                                   9898 
                           18-64 Years not at High Risk 
                                                   8018 
                               25-64 Years at High Risk 
                                                    496 
                25-64 Years not in Initial Target Group 
                                                    496 
                                             5-12 Years 
                                                  10279 
                                            50-64 Years 
                                                  11125 
                                    6 Months - 17 Years 
                                                  12018 
                                     6 Months - 4 Years 
                                                  10282 
6 Months - 64 Years at High Risk (Initial Target Group) 
                                                    496 
         American Indian or Alaska Native, Non-Hispanic 
                                                   1702 
                                    Asian, Non-Hispanic 
                                                   1712 
                                    Black, Non-Hispanic 
                                                  11067 
                                             Greater 65 
                                                    770 
                              Greater than 18 Years flu 
                                                    770 
                              Greater than 6 Months flu 
                                                    682 
                                               Hispanic 
                                                  11106 
                                        Medical Setting 
                                                    770 
                                    Non-Medical Setting 
                                                   6975 
                  Other or Multiple Races, Non-Hispanic 
                                                  11093 
                                         Pharmacy/Store 
                                                    770 
                                                 School 
                                                    198 
                                    White, Non-Hispanic 
                                                  11115 
                                              Workplace 
                                                    572 
table(clean_df$dimension_type)

         >=18 Years          >=65 Years         18-49 Years         18-64 Years 
               6829                 416                 416                 416 
        50-64 Years 6 Months - 17 Years                 Age  Race and Ethnicity 
                416                 792              163649               47795 

Here is what we have:

Please notice that the age categories have overlap, so it’s clear that these data are pre-aggregated, and we’ll want to make sure we use mutually exclusive categories.

Here is what we want:

We want to create a time series that shows the vaccination rate, by age, over time.

Before you give me code, what do you recommend?

It went over challenges and recommendations.

We then asked it:

“OK, give me that dataframe before we visualize”

mutually_exclusive_ages <- c(
  "6 Months - 4 Years",
  "5-12 Years",
  "13-17 Years",
  "18-49 Years",
  "50-64 Years",
  ">=65 Years"
)

# Build the cleaned time series dataframe
age_time_series <- clean_df |>
  filter(
    dimension_type == "Age",
    dimension %in% mutually_exclusive_ages,
    !is.na(estimate)
  ) |>
  group_by(season_survey_year, age_group = dimension) |>
  summarise(
    mean_estimate = mean(estimate, na.rm = TRUE),
    n_records = n(),
    .groups = "drop"
  ) |>
  mutate(
    age_group = fct_relevel(age_group, mutually_exclusive_ages)
  )

# View the resulting structure
glimpse(age_time_series)
Rows: 85
Columns: 4
$ season_survey_year <fct> 2009-10, 2009-10, 2010-11, 2010-11, 2010-11, 2010-1…
$ age_group          <fct> ">=65 Years", "50-64 Years", ">=65 Years", "13-17 Y…
$ mean_estimate      <dbl> 46.28232, 38.60872, 52.04114, 30.16828, 25.02017, 4…
$ n_records          <int> 1680, 1158, 598, 558, 575, 585, 580, 582, 602, 562,…
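
The result has 85 rows, and the first rows of the glimpse suggest that the 2009-10 season contains only two of the six age groups. A quick way to see which groups each season is missing is sketched below on a small toy table. The names `toy`, `expected`, and `coverage` are illustrative, and the toy rows are invented rather than drawn from the real data.

```r
library(dplyr)

# Toy stand-in for age_time_series with one incomplete season.
toy <- tibble(
  season_survey_year = c("2009-10", "2009-10", "2010-11", "2010-11", "2010-11"),
  age_group = c(">=65 Years", "50-64 Years", ">=65 Years", "50-64 Years", "18-49 Years")
)

# Age groups we expect to see in every season (shortened for the example).
expected <- c("18-49 Years", "50-64 Years", ">=65 Years")

# For each season, count the groups present and list the ones that are absent.
coverage <- toy |>
  group_by(season_survey_year) |>
  summarise(
    n_present = n_distinct(age_group),
    missing = paste(setdiff(expected, age_group), collapse = ", "),
    .groups = "drop"
  )

coverage
```

Running the same summary on age_time_series with mutually_exclusive_ages as the expected set would show which seasons have gaps, which matters when reading the line plot.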

Data Visualization

With the data prepared, we asked for a line plot broken out by age group. It gave us the following code, which worked the first time:

ggplot(age_time_series, aes(x = season_survey_year, y = mean_estimate, group = age_group, color = age_group)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  labs(
    title = "Influenza Vaccination Coverage Over Time by Age Group",
    x = "Season/Survey Year",
    y = "Mean Vaccination Estimate (%)",
    color = "Age Group"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

This image needs refinement, which ChatGPT would happily provide if we continued, but it worked. We wrote zero code ourselves.
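
As one example of the kind of refinement we might ask for next, the sketch below fixes the y-axis to the full 0 to 100 range and labels the ticks as percentages. It is only an illustration: the `toy_series` data frame and its values are invented so the block runs on its own, and the styling choices are assumptions rather than anything ChatGPT produced.

```r
library(ggplot2)

# Invented stand-in for age_time_series so the example is self-contained.
toy_series <- data.frame(
  season_survey_year = factor(rep(c("2018-19", "2019-20", "2020-21"), times = 2)),
  age_group = rep(c("18-49 Years", ">=65 Years"), each = 3),
  mean_estimate = c(30, 32, 31, 60, 62, 61)
)

p <- ggplot(
  toy_series,
  aes(x = season_survey_year, y = mean_estimate, group = age_group, color = age_group)
) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  # Show the full percentage scale and append a percent sign to each tick label.
  scale_y_continuous(limits = c(0, 100), labels = function(x) paste0(x, "%")) +
  labs(
    title = "Influenza Vaccination Coverage Over Time by Age Group",
    x = "Season/Survey Year",
    y = "Mean Vaccination Estimate",
    color = "Age Group"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

p
```

Swapping `toy_series` for age_time_series applies the same changes to the real plot.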