Data Preparation Example

Author

Erik Westlund

Published

June 11, 2025

Modified

June 12, 2025

Data Source

We use CDC’s PRAMStat data for 2011.

Note

In this file, I use exposition to explain what I am doing and what I am thinking. In this context, this is an unknown data set we’re working with and I am trying to teach others.

For a project orreport, you would want to first read the data file documentation and codebook first to understand the data. Your commentary in the notebook should be targeted to the audience, conveying everything required for people to understand the steps you took and why you took them.

This data is aggregated at the location level and comes to us in “long” format.

Loading Required Libraries

We first need to load in the libraries we’ll use.

We will use:

readr to load in the data ( I prefer readr over the built-in read.csv function because it is faster and has better default settings for avoiding common issues with CSVs.)
dplyr to clean data.
kableExtra is used to make tables look nice.
DT to make tables interactive and get nice pagination on long tables.
tidyr to pivot the data to wide format.

# Install required packages if not already installed
required_packages <- c("dplyr", "DT", "kableExtra", "readr", "tidyr")
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load required packages
library(dplyr)
library(DT)
library(kableExtra)
library(readr)
library(tidyr)

To hide this code from the output, I typically prepend the above code with.

#| echo: false
#| message: false

Here, for illustration, I hide messages but not the code block.

Read and Explore the Data

Here we read the data using the readr::read_csv function. Note that we use here::here to ensure the path to the data is correct relative to the project root.

We then use the glimpse function to take a look at the data. Glimpse tells us the number of rows and columns in the data, the names of the columns, the type of each column, and the first few rows of the data.

I create a new object df_clean to store the cleaned data. For now, we’ll just copy the original data to this object. This is a good habit to get into, as it allows you to keep the original data for reference and to easily revert to it if needed.

# Load the data using here::here to ensure correct path resolution
df <- read_csv(here::here("data", "raw", "cdc_PRAMStat_Data_for_2011_20250610.csv"))

# Take a look at the data structure
df |> glimpse()

Rows: 520,381
Columns: 27
$ Year                       <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2…
$ LocationAbbr               <chr> "AR", "AR", "CO", "CO", "DE", "DE", "GA", "…
$ LocationDesc               <chr> "Arkansas", "Arkansas", "Colorado", "Colora…
$ Class                      <chr> "Prenatal Care", "Prenatal Care", "Prenatal…
$ Topic                      <chr> "Prenatal Care - Content", "Prenatal Care -…
$ Question                   <chr> "During any of your prenatal care visits  d…
$ DataSource                 <chr> "PRAMS", "PRAMS", "PRAMS", "PRAMS", "PRAMS"…
$ Response                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Data_Value_Unit            <chr> "%", "%", "%", "%", "%", "%", "%", "%", "%"…
$ Data_Value_Type            <chr> "Percentage", "Percentage", "Percentage", "…
$ Data_Value                 <dbl> 52.0, 52.0, 39.0, 35.0, 22.0, 19.0, 71.0, 3…
$ Data_Value_Footnote_Symbol <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1"…
$ Data_Value_Footnote        <chr> "Missing includes not applicable, don't kno…
$ Data_Value_Std_Err         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Low_Confidence_Limit       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ High_Confidence_Limit      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Sample_Size                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Break_Out                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Break_Out_Category         <chr> "Maternal Age - 18 to 44 years in groupings…
$ Geolocation                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ClassId                    <chr> "CLA11", "CLA11", "CLA11", "CLA11", "CLA11"…
$ TopicId                    <chr> "TOP19", "TOP19", "TOP19", "TOP19", "TOP19"…
$ QuestionId                 <chr> "QUO333", "QUO67", "QUO333", "QUO67", "QUO3…
$ LocationId                 <dbl> 5, 5, 8, 8, 10, 10, 13, 15, 15, 25, 25, 24,…
$ BreakOutId                 <chr> "BOC17", "BOC17", "BOC17", "BOC17", "BOC17"…
$ BreakOutCategoryid         <chr> "BOC17", "BOC17", "BOC17", "BOC17", "BOC17"…
$ ResponseId                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

df_clean <- df

Tabbing Out Data

Always read the codebook first to understand the data. Nevertheless, it’s still helpful to tab out the data to get a sense of the variables and their values.

Let’s first look at the topics availble

table(df$Topic) |> kable()

Var1	Freq
Abuse - Physical	6799
Alcohol Use	17875
Assisted Reproduction	1964
Breastfeeding	8829
Contraception - Conception	32460
Contraception - Postpartum	26716
Delivery - Method	999
Delivery - Payment	14802
HIV Test	9975
Hospital Length of Stay	9564
Household Characteristics	3469
Income	8143
Infant Health Care	6377
Injury Prevention	1692
Insurance Coverage	2997
Maternal Health Care	2220
Medicaid	11770
Mental Health	5994
Morbidity - Infant	5697
Morbidity - Maternal	35790
Multivitamin Use	12185
Obesity	11517
Oral Health	10543
Preconception Health	21977
Preconception Morbidity	7142
Pregnancy History	12875
Pregnancy Intention	14249
Pregnancy Outcome	2928
Pregnancy Recognition	2996
Prenatal Care Provider	999
Prenatal Care - Content	36688
Prenatal Care - Initiation	8991
Prenatal Care - Location	1692
Prenatal Care - Payment	14846
Prenatal Care - Visits	7930
Sleep Behaviors	19536
Smoke Exposure	2986
Stress	56867
Tobacco Use	56440
WIC	2862

Suppose we’re intersted in the relationship between Alcohol Use and Mental Health.

Let’s filter the data to include only those topics, and then look at the questions available.

df |>
  filter(Topic %in% c("Alcohol Use", "Mental Health")) |>
  select(QuestionId, Topic, Question) |>
  distinct() |>
  kable()

QuestionId	Topic	Question
QUO100	Alcohol Use	Indicator of drinking any alcohol during the past two years
QUO271	Alcohol Use	(*PCH) Indicator of binge drinking (4+ drinks) during 3 months before pregnancy
QUO8	Alcohol Use	(*PCH) Indicator of drinking alcohol during the three months before pregnancy
QUO9	Alcohol Use	Indicator of whether mother reported having any alcoholic drinks during the last 3 months of pregnancy
QUO99	Alcohol Use	Change in drinking between three months before pregnancy and last three months of pregnancy
QUO133	Mental Health	During the 3 months before you got pregnant with your new baby did you have depression?
QUO219	Mental Health	(*PCH) Indicator of whether mother reported frequent postpartum depressive symptoms (years 2009 - 2011)
QUO232	Mental Health	During the 3 months before you got pregnant with your new baby did you have anxiety?
QUO97	Mental Health	Did a doctor nurse or other health care worker talk with you about baby blues or postpartum depression during pregnancy or after your delivery?
QUO133	Mental Health	During the 3 months before you got pregnant with your new baby, did you have depression?
QUO232	Mental Health	During the 3 months before you got pregnant with your new baby, did you have anxiety?
QUO97	Mental Health	Did a doctor, nurse, or other health care worker talk with you about baby blues or postpartum depression during pregnancy or after your delivery?

Filtering Data

Suppose we’re interested in the relationship between drinking in the first three months before pregnancy and reporting depression and anxiety. H

question_ids_of_interest <- c(
  "QUO271", # Binge drinking in the first three months before pregnancy
  "QUO8",   # Any alcohol use in the first three months before pregnancy
  "QUO133", # Depression reported in the first three months before pregnancy
  "QUO232"  # Anxiety reported in the first three months before pregnancy
)

df_clean <- df_clean |>
  filter(QuestionId %in% question_ids_of_interest)

df_clean |>
  head() |>
  kable()

Year	LocationAbbr	LocationDesc	Class	Topic	Question	DataSource	Response	Data_Value_Unit	Data_Value_Type	Data_Value	Data_Value_Footnote_Symbol	Data_Value_Footnote	Data_Value_Std_Err	Low_Confidence_Limit	High_Confidence_Limit	Sample_Size	Break_Out	Break_Out_Category	Geolocation	ClassId	TopicId	QuestionId	LocationId	BreakOutId	BreakOutCategoryid	ResponseId
2011	PRAMS Total	PRAMS Total	Maternal Behavior/Health	Alcohol Use	(*PCH) Indicator of binge drinking (4+ drinks) during 3 months before pregnancy	PRAMS	NO	%	Percentage	78.4	NA	NA	NA	76.7	79.9	7089	LBW (<=2500g)	Birth Weight	NA	CLA9	TOP2	QUO271	59	BWT1	BOC1	RES23
2011	PRAMS Total	PRAMS Total	Maternal Behavior/Health	Alcohol Use	(*PCH) Indicator of binge drinking (4+ drinks) during 3 months before pregnancy	PRAMS	YES	%	Percentage	21.6	NA	NA	NA	20.1	23.3	1832	LBW (<=2500g)	Birth Weight	NA	CLA9	TOP2	QUO271	59	BWT1	BOC1	RES40
2011	PRAMS Total	PRAMS Total	Maternal Behavior/Health	Alcohol Use	(*PCH) Indicator of binge drinking (4+ drinks) during 3 months before pregnancy	PRAMS	NO	%	Percentage	76.8	NA	NA	NA	76.0	77.5	20588	NBW (>2500g)	Birth Weight	NA	CLA9	TOP2	QUO271	59	BWT2	BOC1	RES23
2011	PRAMS Total	PRAMS Total	Maternal Behavior/Health	Alcohol Use	(*PCH) Indicator of binge drinking (4+ drinks) during 3 months before pregnancy	PRAMS	YES	%	Percentage	23.2	NA	NA	NA	22.5	24.0	6099	NBW (>2500g)	Birth Weight	NA	CLA9	TOP2	QUO271	59	BWT2	BOC1	RES40
2011	PRAMS Total	PRAMS Total	Maternal Behavior/Health	Alcohol Use	(*PCH) Indicator of binge drinking (4+ drinks) during 3 months before pregnancy	PRAMS	NA	%	Percentage	877.0	1	Missing includes not applicable, don’t know, not recorded, no responses, and legitimate skips	NA	NA	NA	NA	NA	On WIC during Pregnancy	NA	CLA9	TOP2	QUO271	59	BOC10	BOC10	NA
2011	PRAMS Total	PRAMS Total	Maternal Behavior/Health	Alcohol Use	(*PCH) Indicator of binge drinking (4+ drinks) during 3 months before pregnancy	PRAMS	NO	%	Percentage	74.5	NA	NA	NA	73.5	75.4	13716	Non-WIC	On WIC during Pregnancy	NA	CLA9	TOP2	QUO271	59	WIC1	BOC10	RES23

I personally prefer to remove values from the data set we do not need. We can always add them back in. By using select we also reorder the columns to make it easier to take in.

df_clean <- df_clean |>
  select(
    QuestionId,
    Topic,
    Question,
    Response,
    Data_Value,
    Data_Value_Unit,
    Data_Value_Type,
    LocationAbbr,
    LocationDesc,
    Break_Out,
    Break_Out_Category
  )

Renaming Variables

I prefer to rename variables to follow a consistent style and naming convention (e.g., lowercase, snake_case, noun first). Here I also rename variables to be more descriptive, such as with subgroups. Some would prefer keeping the original names, but I prefer the ergonomics of this approach.

We can use the DT::datatable function to view the data in our document in an interactive, filterable way. You’ll want to be careful with inline this data, but with public data, this is a good way to share data (or a selection of it) with others.

df_clean <- df_clean |>
  rename_with(tolower) |>
  rename(
    question_id = questionid,
    value = data_value,
    unit = data_value_unit,
    type = data_value_type,
    location_abbr = locationabbr,
    location = locationdesc,
    subgroup = break_out,
    subgroup_cat = break_out_category
  )

DT::datatable(df_clean)

Understanding The Data

First note that this data still needs work. Even without the codebook, we can seee:

The response/value has “yes,” “no”, and NA values.
The location_abbr has both aggregations and location values.
The meaning of each row depends on the subgroup and subgroup_cat variables.

First, let’s get a sense of the locations in the data

df_clean |> 
  group_by(location) |>
  summarise(
    location_abbr = first(location_abbr),
    n = n()
  ) |>
kable()

location	location_abbr	n
Arkansas	AR	210
Colorado	CO	216
Delaware	DE	438
Georgia	GA	210
Hawaii	HI	438
Maine	ME	216
Maryland	MD	438
Massachusetts	MA	216
Michigan	MI	438
Minnesota	MN	438
Missouri	MO	432
Nebraska	NE	210
New Jersey	NJ	216
New Mexico	NM	210
New York (excluding NYC)	NY	210
New York City	YC	216
Oklahoma	OK	216
Oregon	OR	210
PRAMS Total	PRAMS Total	432
Pennsylvania	PA	210
Rhode Island	RI	216
Utah	UT	438
Vermont	VT	216
Washington	WA	216
West Virginia	WV	438
Wisconsin	WI	438
Wyoming	WY	438

We see that some non-standard choices are made here: New York and New York City are both listed as locations. Also, aggregations are included under PRAMS Total.

Lets get an overview of the subgroup categories:

df_clean |> 
  group_by(subgroup_cat, subgroup) |>
  summarise(n = n()) |>
  kable()

subgroup_cat	subgroup	n
Adequacy of Prenatal care	ADEQUATE PNC	152
Adequacy of Prenatal care	INADEQUATE PNC	152
Adequacy of Prenatal care	INTERMEDIATE PNC	152
Adequacy of Prenatal care	UNKNOWN PNC	152
Adequacy of Prenatal care	NA	49
Birth Weight	LBW (<=2500g)	152
Birth Weight	NBW (>2500g)	152
Birth Weight	NA	67
Income (years 2004 and beyond)	$10,000 to $24,999	152
Income (years 2004 and beyond)	$25,000 to $49,999	152
Income (years 2004 and beyond)	$50,000 or more	152
Income (years 2004 and beyond)	Less than $10,000	152
Income (years 2004 and beyond)	NA	49
Marital Status	MARRIED	152
Marital Status	OTHER	152
Marital Status	NA	67
Maternal Age (3 Levels)	20-29 yrs	152
Maternal Age (3 Levels)	30+ yrs	152
Maternal Age (3 Levels)	<20 yrs	152
Maternal Age (3 Levels)	NA	49
Maternal Age (4 Levels)	20-24 yrs	152
Maternal Age (4 Levels)	25-34 yrs	152
Maternal Age (4 Levels)	35+ yrs	152
Maternal Age (4 Levels)	<20 yrs	152
Maternal Age (4 Levels)	NA	67
Maternal Age - 18 to 44 years in groupings	Age 18 - 24	152
Maternal Age - 18 to 44 years in groupings	Age 25 - 29	152
Maternal Age - 18 to 44 years in groupings	Age 30 - 44	152
Maternal Age - 18 to 44 years in groupings	Age 45+	152
Maternal Age - 18 to 44 years in groupings	Age < 18	152
Maternal Age - 18 to 44 years in groupings	NA	76
Maternal Age - 18 to 44 years only	Age 18 - 44	152
Maternal Age - 18 to 44 years only	NA	49
Maternal Education	12 yrs	152
Maternal Education	< 12 yrs	152
Maternal Education	>12 yrs	152
Maternal Education	NA	76
Maternal Race/Ethnicity	Black, non-Hispanic	152
Maternal Race/Ethnicity	Hispanic	152
Maternal Race/Ethnicity	Other non-Hispanic	152
Maternal Race/Ethnicity	White, non-Hispanic	152
Maternal Race/Ethnicity	NA	67
Medicaid Recipient	Medicaid	152
Medicaid Recipient	Non-Medicaid	152
Medicaid Recipient	NA	67
Mother Hispanic	Hispanic	152
Mother Hispanic	Non-Hispanic	152
Mother Hispanic	NA	67
None	None	152
Number of Previous Live Births	0	152
Number of Previous Live Births	1 or more	152
Number of Previous Live Births	NA	76
On WIC during Pregnancy	Non-WIC	152
On WIC during Pregnancy	WIC	152
On WIC during Pregnancy	NA	76
Pregnancy Intendedness	Intended	152
Pregnancy Intendedness	Unintended	152
Pregnancy Intendedness	NA	49
Smoked 3 months before Pregnancy	Non-Smoker	152
Smoked 3 months before Pregnancy	Smoker	152
Smoked 3 months before Pregnancy	NA	76
Smoked last 3 months of Pregnancy	Non-Smoker	152
Smoked last 3 months of Pregnancy	Smoker	152
Smoked last 3 months of Pregnancy	NA	49

We now have a sense of the structure of the entire data file.

Let’s finally use the group_by and summarise functions to try narrow down groups to Ns of 1.

df_summary <- df_clean |>
  group_by(question_id, location_abbr, subgroup_cat, subgroup, response) |>
  summarise(
    n = n(),
    question = first(question),
    mean_value = mean(value)
  )

DT::datatable(df_summary)

We now know that the following data combine to form a “base” row:

question_id
location_abbr
subgroup_cat
subgroup
response

With this in mind, we can now see that the following data combine to form a “base” row.

Preparing The Data For Visualizaiton

You’ll notice how the data is in a kind of long format: there are yes, no, and NA values for each question. We want to get all values per question per location/subgroup/subgroup_cat.

Steps:

Filter out the aggregate (non-location specific) values
Filter out NAs on the subroup (break out) category: of course, you would need to identify WHY these are NAs and address this issue if there is a reason.

We can then use the tidyr package to pivot the data to wide format.

df_final <- df_clean |>
  filter(location_abbr != "PRAMS Total") |>
  filter(!is.na(subgroup)) |>
  filter(response == "YES") |>  # Only keep YES responses
  pivot_wider(
    id_cols = c(location_abbr, subgroup_cat, subgroup),
    names_from = question_id,
    values_from = value,
    names_prefix = "q_"  # Add prefix to question ID columns
  ) |>
  rename(
    depression_within_3_months_birth = q_QUO133,
    anxiety_within_3_months_birth = q_QUO232,
    alcohol_use_within_3_months_birth = q_QUO8,
    binge_drinking_within_3_months_birth = q_QUO271
  ) |> 
  arrange(location_abbr, subgroup_cat, subgroup)
  

DT::datatable(df_final)

Data Diagnostics

Let’s do a quick analysis of missing values.

df_final |> 
  summarise(
    across(everything(), ~ 100 * sum(is.na(.)) / n())
  ) |>
  kable()

location_abbr	subgroup_cat	subgroup	depression_within_3_months_birth	anxiety_within_3_months_birth	binge_drinking_within_3_months_birth	alcohol_use_within_3_months_birth
0	0	0	63.42062	63.42062	5.07365	5.07365

Let’s examine missing data across all measures, by location.

df_final |>
  group_by(location_abbr) |>
  summarise(
    depression_missing = 100 * mean(is.na(depression_within_3_months_birth)),
    anxiety_missing = 100 * mean(is.na(anxiety_within_3_months_birth)),
    alcohol_missing = 100 * mean(is.na(alcohol_use_within_3_months_birth)),
    binge_missing = 100 * mean(is.na(binge_drinking_within_3_months_birth))
  ) |>
  arrange(location_abbr) |>
  kable(digits = 1)

location_abbr	depression_missing	anxiety_missing	alcohol_missing	binge_missing
AR	100.0	100.0	4.3	4.3
CO	100.0	100.0	2.1	2.1
DE	4.3	4.3	4.3	4.3
GA	100.0	100.0	2.1	2.1
HI	4.3	4.3	4.3	4.3
MA	100.0	100.0	4.3	4.3
MD	4.3	4.3	4.3	4.3
ME	100.0	100.0	14.9	14.9
MI	2.1	2.1	2.1	2.1
MN	4.3	4.3	4.3	4.3
MO	4.3	4.3	4.3	4.3
NE	100.0	100.0	2.1	2.1
NJ	100.0	100.0	6.4	6.4
NM	100.0	100.0	4.3	4.3
NY	100.0	100.0	4.3	4.3
OK	100.0	100.0	2.1	2.1
OR	100.0	100.0	2.1	2.1
PA	100.0	100.0	4.3	4.3
RI	100.0	100.0	6.4	6.4
UT	6.4	6.4	6.4	6.4
VT	100.0	100.0	12.8	12.8
WA	100.0	100.0	4.3	4.3
WI	2.1	2.1	2.1	2.1
WV	10.6	10.6	10.6	10.6
WY	6.4	6.4	6.4	6.4
YC	100.0	100.0	6.4	6.4

Data on alcohol use is more comprehensive than on anxiety and depression. Please note that just because we’re focused on data visualization does not mean our concerns over missing data are any less important than, say, when we are doing staitstical modeling.

Save The Data

Now, let’s save this processed data to an RDS file. Before we do, let’s merge back in our full location for reference and turn our location variables into factors.

df_final <- df_final |>
  left_join(
    df_clean |>
      select(location_abbr, location) |>
      distinct(),
    by = "location_abbr"
  ) |>
  mutate(
    location_abbr = factor(location_abbr),
    location = factor(location)
  )

saveRDS(df_final, here::here("data", "processed", "cdc_prams_df_final.rds"))