Rows: 220729 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Vaccine, Geography Type, Geography, FIPS, Season/Survey Year, Dimen...
dbl (2): Month, Sample Size
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Using the following prompt, we then sent the LLM the actual data file for more context (🚨 do not do this with private data! 🚨):
“I am actually going to give you the file. Interpret it. We are going to use dplyr and ggplot2 to prepare it and visualize aspects of it. Tell me what you see.”
It responded with a summary, warning us about some of the odd structures of the data.
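If the data were private, one alternative to uploading the file is to share only its structure. The following is a minimal sketch in base R, not something we did in this session; `private_df` and its columns are hypothetical stand-ins for a sensitive data set.

```r
# Hypothetical stand-in for a data frame that must not leave your machine
private_df <- data.frame(
  patient_id = c("a1", "b2", "c3"),
  age = c(34, 51, 67),
  vaccinated = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Describe structure only: column names, types, and missingness, no cell values
schema <- data.frame(
  column = names(private_df),
  type = unname(vapply(private_df, function(x) class(x)[1], character(1))),
  n_missing = unname(vapply(private_df, function(x) sum(is.na(x)), integer(1))),
  stringsAsFactors = FALSE
)

schema
```

The resulting `schema` table can be pasted into a prompt in place of the data. Column names can themselves be sensitive, so review them before sharing.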
We then asked it to clean the data:
“Yes, let’s get clean, standardized column names. Let’s make all factor variables factor types. Let’s extract the confidence intervals into separate columns. Let’s just start there. Give me R code that I can paste into RStudio.”
Note that from here on, ChatGPT's responses included library calls along with the new code. Those libraries are already loaded above.
It first gave us broken code because it did not fully understand the result of applying clean_names() to the data.
We eventually sent it the following, explaining the data structure, after which it recovered:
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `estimate = as.numeric(str_replace(estimate_percent, "[^0-9.]",
""))`.
Caused by warning:
! NAs introduced by coercion
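The warning above is worth a closer look. `str_replace()` substitutes only the first match, so a value with more than one non-numeric character still fails to parse. The sketch below illustrates this with base R's `sub()` and `gsub()`, which behave like `str_replace()` and `str_replace_all()` in this respect. The sample values are hypothetical, chosen to resemble footnoted or suppressed entries; we have not verified which raw values triggered the warning.

```r
# Hypothetical raw values: a plain number, a suppressed estimate, a footnoted number
estimate_percent <- c("45.3", "NR †", "12.1 ‡")

# sub() drops only the first non-numeric character, so "12.1 ‡" becomes "12.1‡"
first_only <- suppressWarnings(
  as.numeric(sub("[^0-9.]", "", estimate_percent))
)

# gsub() drops every non-numeric character, so "12.1 ‡" becomes "12.1"
all_matches <- as.numeric(gsub("[^0-9.]", "", estimate_percent))

# Compare how many values fail to parse under each approach
comparison <- data.frame(
  raw = estimate_percent,
  first_only = first_only,
  all_matches = all_matches
)
comparison
colSums(is.na(comparison[, c("first_only", "all_matches")]))
```

Under either approach, a suppressed value with no digits still ends up as `NA`, which is usually what we want for a missing estimate.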
We then sent it this to inform it of the structure of the data, which is somewhat unusual:
table(clean_df$dimension)
| dimension | Count |
|---|---|
| >=18 Years | 26534 |
| >=6 Months | 10449 |
| >=65 Years | 10903 |
| 13-17 Years | 10265 |
| 18-49 Years | 9931 |
| 18-49 Years at High Risk | 11038 |
| 18-49 Years not at High Risk | 9257 |
| 18-64 Years | 9942 |
| 18-64 Years at High Risk | 9898 |
| 18-64 Years not at High Risk | 8018 |
| 25-64 Years at High Risk | 496 |
| 25-64 Years not in Initial Target Group | 496 |
| 5-12 Years | 10279 |
| 50-64 Years | 11125 |
| 6 Months - 17 Years | 12018 |
| 6 Months - 4 Years | 10282 |
| 6 Months - 64 Years at High Risk (Initial Target Group) | 496 |
| American Indian or Alaska Native, Non-Hispanic | 1702 |
| Asian, Non-Hispanic | 1712 |
| Black, Non-Hispanic | 11067 |
| Greater 65 | 770 |
| Greater than 18 Years flu | 770 |
| Greater than 6 Months flu | 682 |
| Hispanic | 11106 |
| Medical Setting | 770 |
| Non-Medical Setting | 6975 |
| Other or Multiple Races, Non-Hispanic | 11093 |
| Pharmacy/Store | 770 |
| School | 198 |
| White, Non-Hispanic | 11115 |
| Workplace | 572 |
table(clean_df$dimension_type)
| dimension_type | Count |
|---|---|
| >=18 Years | 6829 |
| >=65 Years | 416 |
| 18-49 Years | 416 |
| 18-64 Years | 416 |
| 50-64 Years | 416 |
| 6 Months - 17 Years | 792 |
| Age | 163649 |
| Race and Ethnicity | 47795 |
Here is what we have:
Please notice that the age categories have overlap, so it’s clear that these data are pre-aggregated, and we’ll want to make sure we use mutually exclusive categories.
Here is what we want:
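We want to create a time series that shows the vaccination rate, by age, over time.
Before you give me code, what do you recommend?
It went over challenges and recommendations.
We then asked it:
“OK, give me that dataframe before we visualize”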
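To make the overlap concern concrete, here is a small sketch of filtering to non-overlapping age bands. The `exclusive` vector matches the bands used in this document's source code; the `toy` rows and their estimates are hypothetical and only illustrate the filtering step.

```r
# Hypothetical rows mimicking the `dimension` column; estimates are made up
toy <- data.frame(
  dimension = c(">=18 Years", "18-49 Years", "50-64 Years",
                ">=65 Years", "18-64 Years"),
  estimate = c(45, 35, 48, 70, 40),
  stringsAsFactors = FALSE
)

# Age bands that do not overlap one another
exclusive <- c("6 Months - 4 Years", "5-12 Years", "13-17 Years",
               "18-49 Years", "50-64 Years", ">=65 Years")

# Keep only rows whose band is in the mutually exclusive set
kept <- toy[toy$dimension %in% exclusive, ]

# Bands dropped because they overlap the kept ones
dropped <- setdiff(toy$dimension, exclusive)

kept
dropped
```

Here `>=18 Years` and `18-64 Years` are dropped because each spans several of the kept bands, so plotting them alongside the others would show the same people twice.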
With the data prepared, we asked for a line plot broken out by age group. It gave us the following code, which worked the first time:
    ggplot(age_time_series, aes(x = season_survey_year, y = mean_estimate, group = age_group, color = age_group)) +
      geom_line(linewidth = 1.2) +
      geom_point(size = 2) +
      labs(
        title = "Influenza Vaccination Coverage Over Time by Age Group",
        x = "Season/Survey Year",
        y = "Mean Vaccination Estimate (%)",
        color = "Age Group"
      ) +
      theme_minimal(base_size = 14) +
      theme(
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom"
      )
This image needs refinement, which ChatGPT would happily provide if we continued, but it worked. We wrote zero code ourselves.
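One refinement to consider: the plotted `mean_estimate` is a simple average over every row in a season and age group, so a row with a small sample counts as much as one with a large sample. The sketch below shows how a sample-size-weighted mean can differ. The numbers are hypothetical, and weighting by sample size is only one possible choice, not a substitute for proper survey weights.

```r
# Hypothetical rows for one season and age group across three geographies
toy <- data.frame(
  estimate = c(40, 50, 60),
  sample_size = c(100, 100, 800)
)

# Simple average treats every row equally
unweighted <- mean(toy$estimate)

# Weighted average lets larger samples count for more
weighted <- weighted.mean(toy$estimate, w = toy$sample_size)

c(unweighted = unweighted, weighted = weighted)
```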
Source Code
---
title: "AI & LLM Example"
author: "Erik Westlund"
date: "2025-06-11"
date-modified: "`r format(Sys.Date(), '%Y-%m-%d')`"
editor:
  render-on-save: true
---

## Overview

* We'll use an LLM to help us prepare data for a visualization
* Let's use the [CDC's data on influenza vaccination coverage for all ages](https://data.cdc.gov/Flu-Vaccinations/Influenza-Vaccination-Coverage-for-All-Ages-6-Mont/vh55-3he6/about_data)

## Data Preparation

```{r setup}
#| echo: false
#| message: false

# Install required packages if not already installed
# ("here" is included because the chunks below call here::here())
required_packages <- c("dplyr", "ggplot2", "forcats", "here", "janitor", "kableExtra", "readr", "stringr", "tidyr")
new_packages <- required_packages[!(required_packages %in% installed.packages()[, "Package"])]
if (length(new_packages)) install.packages(new_packages)

# Load required packages
library(dplyr)
library(ggplot2)
library(forcats)
library(janitor)
library(kableExtra)
library(readr)
library(stringr)
library(tidyr)
```

```{r}
#| message: false
df <- read_csv(here::here("data", "raw", "cdc_Influenza_Vaccination_Coverage_for_All_Ages__6__Months__20250610.csv"))

df |> glimpse()
```

## Chat with the LLM

### Data Preparation

From here, we can chat with the LLM to help us prepare the data for a visualization.

Below I provide key aspects of the prompts sent to the ChatGPT LLM as well as the results. I've excluded some of the LLM's missteps.

### Initial Data Exploration

I am working with an unfamiliar data set. Here is a glimpse using dplyr:

```{r}
df <- read_csv(here::here("data", "raw", "cdc_Influenza_Vaccination_Coverage_for_All_Ages__6__Months__20250610.csv"))

df |> glimpse()
```

The bot then summarized the data set.

Using the following prompt, we then sent the LLM the actual data file for more context (🚨 do not do this with private data! 🚨):

"I am actually going to give you the file. Interpret it. We are going to use dplyr and ggplot2 to prepare it and visualize aspects of it. Tell me what you see."

It responded with a summary, warning us about some of the odd structures of the data.

We then asked it to clean the data:

"Yes, let's get clean, standardized column names. Let's make all factor variables factor types. Let's extract the confidence intervals into separate columns. Let's just start there. Give me R code that I can paste into RStudio."

Note that from here on, ChatGPT's responses included library calls along with the new code. Those libraries are already loaded above.

It first gave us broken code because it did not fully understand the result of applying `clean_names()` to the data.

We eventually sent it the following, explaining the data structure, after which it recovered:

```{r}
# Clean and prepare
clean_df <- df |>
  clean_names() |>
  mutate(
    vaccine = factor(vaccine),
    geography_type = factor(geography_type),
    geography = factor(geography),
    season_survey_year = factor(season_survey_year),
    dimension_type = factor(dimension_type),
    dimension = factor(dimension),
    estimate = as.numeric(str_replace(estimate_percent, "[^0-9.]", "")),
    ci_clean = str_remove(x95_percent_ci_percent, " ‡| †"),
    ci_lower = as.numeric(str_extract(ci_clean, "^[0-9.]+")),
    ci_upper = as.numeric(str_extract(ci_clean, "[0-9.]+$"))
  )

# Preview result
glimpse(clean_df)
```

### Data Summary

We then sent it this to inform it of the structure of the data, which is somewhat unusual:

```{r}
table(clean_df$dimension)
table(clean_df$dimension_type)
```

Here is what we have:

Please notice that the age categories have overlap, so it's clear that these data are pre-aggregated, and we'll want to make sure we use mutually exclusive categories.

Here is what we want:

We want to create a time series that shows the vaccination rate, by age, over time.

Before you give me code, what do you recommend?

It went over challenges and recommendations.

We then asked it:

"OK, give me that dataframe before we visualize"

```{r}
mutually_exclusive_ages <- c(
  "6 Months - 4 Years",
  "5-12 Years",
  "13-17 Years",
  "18-49 Years",
  "50-64 Years",
  ">=65 Years"
)

# Build the cleaned time series dataframe
age_time_series <- clean_df |>
  filter(
    dimension_type == "Age",
    dimension %in% mutually_exclusive_ages,
    !is.na(estimate)
  ) |>
  group_by(season_survey_year, age_group = dimension) |>
  summarise(
    mean_estimate = mean(estimate, na.rm = TRUE),
    n_records = n(),
    .groups = "drop"
  ) |>
  mutate(
    age_group = fct_relevel(age_group, mutually_exclusive_ages)
  )

# View the resulting structure
glimpse(age_time_series)
```

### Data Visualization

With the data prepared, we asked for a line plot broken out by age group. It gave us the following code, which worked the first time:

```{r}
ggplot(age_time_series, aes(x = season_survey_year, y = mean_estimate, group = age_group, color = age_group)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  labs(
    title = "Influenza Vaccination Coverage Over Time by Age Group",
    x = "Season/Survey Year",
    y = "Mean Vaccination Estimate (%)",
    color = "Age Group"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )
```

This image needs refinement, which ChatGPT would happily provide if we continued, but it worked. We wrote zero code ourselves.