To simulate data, we implement our causal model. We will start by simulating data for root nodes or exogenous nodes (i.e., those without arrows going into them).
States/Geography
Using data from the Census and the Commonwealth Fund health grades, we’ll generate a data frame with some state-level data. This data will be used to generate state-level variables for patients, such as the political/economic conditions related to health. It will also be used to structure the distribution of race/ethnicity in the population per state.
## Population Data from US Census 2020. Combined with health scores from Commonwealth Fund## Health Scorecard data. The weights measure is based upon population data from the Census.## `pec` is a z-score of the rank of the state in terms of population and used as a ## measure of state political & economic conditions with respect to health.## This data is for illustration purposes only.state_population_data <-read.csv(here::here("data/raw/states_census2020_ranks.csv")) |>mutate(state_weight = population/sum(population)) |>mutate(inverted_rank =max(rank) +1- rank, # Invert ranks (lower is better)pec =scale(inverted_rank, center =TRUE, scale =TRUE)[, 1] # Z-score ) |>select(-inverted_rank) # Load in data on race/ethnicity population from states. This is from the Census2020 file.# It uses their one-race only and separates Hispanic from non-Hispanic. The weights# are just the proportion of the population of one category vs the sum of the population in# all race/ethnicity categories. # This data is meant for illustration purposes only.state_data <- readr::read_csv(here::here("data/raw/race_by_state_census2020.csv")) |>pivot_longer(cols =-"Label",names_to ="State",values_to ="Population") |>pivot_wider(names_from ="Label", values_from = Population) |>mutate(across(-State, ~as.numeric(gsub(",", "", .)))) |>rename(state = State,white = White,black = Black,hispanic ="Hispanic or Latino",asian = Asian,aian ="American Indian and Alaska Native",nhpi ="Native Hawaiian and Other Pacific Islander",other = Other, ) |>mutate(sum = white + hispanic + black + asian + aian + other + nhpi,white = white / sum,black = black / sum,hispanic = hispanic / sum,asian = asian / sum,aian = aian / sum,other = other / sum,nhpi = nhpi / sum, ) |>rename(state_name = state) |>select(-sum) |>left_join(state_population_data |>select(state_name, state, state_weight, pec)) |>filter(!is.na(state))
Rows: 7 Columns: 53
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Label
num (52): Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecti...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining with `by = join_by(state_name)`
We need to generate providers. We will have a provider for roughly every 100 patients. We will generate a provider quality measure which is mostly random, but determined partially by state political and economic considerations with respect to health.
Providers will vary by state population. We below show a random sample of 50 providers.
We’ll start by generating a date frame with patients. We will generate a state for each patient based on the population of each state. We will randomly assign a race/ethnicity based upon state-level proportions. We will then assign that person to a provider in their state.
# Start by creating a dataframe with patients having state IDs proportional to# the population of each state. Then row-wise generate race/ethnicity categories.data <-data.frame(state =sample( state_data$state, n,replace =TRUE,prob = state_data$state_weight)) |>merge(state_data |>select(state, pec, re_cats), by ="state") |>rowwise() |>mutate(race =sample(re_cats, size =1, prob =c_across(all_of(re_cats)))) |>ungroup() |>select(-re_cats) |>mutate(id =row_number())
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(re_cats)
# Now:
data %>% select(all_of(re_cats))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
data |>sample_n(10) |>kable()
state
pec
race
id
TN
-0.5381382
hispanic
44601
NC
0.2690691
white
31863
NV
-0.7399401
black
34578
NY
1.0762765
white
36326
KS
-0.9417419
white
21635
NC
0.2690691
white
31121
MD
1.1435437
white
26242
MI
0.4708710
white
27201
MO
-0.3363364
white
29266
IN
-0.4708710
white
18294
Assign Patient to Provider
data <- data |>rowwise() |>mutate(provider_id =sample(providers$id[providers$state == state], 1)) |>ungroup() |>left_join(providers |>select(id, quality), by =c("provider_id"="id")) |>select(id, provider_id, quality, state, pec, race) |>rename(provider_quality = quality )data |>sample_n(10) |>kable()
id
provider_id
provider_quality
state
pec
race
46480
46
0.9183326
VA
0.3363364
white
9219
314
-1.2354202
CO
0.8744746
white
23482
176
0.2513363
KS
-0.9417419
white
9035
100
1.8010039
CO
0.8744746
hispanic
12261
323
0.0338907
FL
-0.4036037
hispanic
16551
368
0.9353468
IL
0.2018018
white
27001
421
0.8893215
MI
0.4708710
white
48287
150
0.1608852
WA
1.2108110
asian
29880
25
0.6985157
MO
-0.3363364
white
12666
178
0.4011250
FL
-0.4036037
white
Exogenous Nodes
Now every patient has a state, a provider, a race/ethnicity, and some associated data by state.
In addition to race/ethnicity and state, the following variables have no prior causes in our model:
AGE: Age of the patient.
PCE: Parent community connections
PED: Parent education
PI: Parent intelligence
PM: Parent motivation
PSR: Parent resilience
We will generate n observations for each of these variables.
## Next generate all the other exogenous variables# We need to assign a provider to each patient. This provider needs a qualitydata <- data |>mutate(age =gen_mother_ages(n),parent_income =gen_incomes( n,median_income =60000,sd =25000,min_income =0,max_income =1000000 ),parent_intelligence =rnorm(n, 1, 1),parent_resilience =rnorm(n, 1, 1),parent_motivation =rnorm(n, 1, 1),parent_community_connections =rnorm(n, 0, 1),parent_edu =gen_education(n, ed_cats, parent_intelligence, parent_resilience, parent_motivation, parent_community_connections, parent_income), )data |>sample_n(10) |>kable()
id
provider_id
provider_quality
state
pec
race
age
parent_income
parent_intelligence
parent_resilience
parent_motivation
parent_community_connections
parent_edu
5285
493
1.2630188
CA
0.8072074
white
26
46351
3.2436271
0.2525813
1.6954949
0.2546516
hs
45413
70
1.5896805
UT
-0.1345346
hispanic
23
55997
1.8129310
0.8719194
1.3830863
1.4434692
some_college
7687
230
0.4252149
CA
0.8072074
hispanic
39
65023
0.9022916
0.1343653
0.2312583
-0.9743949
hs
32576
449
0.9623769
NJ
0.5381382
white
22
60420
2.3778012
2.5190224
1.0760966
-2.5790280
college
1293
204
-0.0219117
AR
-1.0762765
white
34
56641
0.3425865
0.5454315
2.8228031
-0.1848488
some_college
17874
266
2.3467958
IL
0.2018018
hispanic
23
66054
0.7721575
-0.5256282
1.0021947
0.9933364
hs
46442
66
2.5322108
VA
0.3363364
black
23
62409
1.5751782
-0.1781548
1.4913066
0.7047459
some_college
23954
92
0.1438871
KY
-0.2690691
white
27
47060
1.1336424
1.1333922
2.2675736
-0.0986224
college
10513
56
0.4856040
FL
-0.4036037
white
17
62376
1.0374659
2.9382283
1.9864772
-2.6375022
less_than_hs
14780
198
0.9845173
GA
-0.6054055
black
28
100030
1.3601545
0.4665285
1.5547804
0.6022043
hs
Working our way up from the root nodes
Religion is patterned by race but has no other determinants.
data <- data |>mutate(religion =gen_religion(n, rel_cats, race),)data |>select(id, race, religion) |>sample_n(10) |>kable()
id
race
religion
41739
white
other
30510
white
other
47129
white
christian
14477
white
christian
16091
black
other
23753
white
christian
45524
white
christian
36187
asian
hindu
20614
hispanic
christian
24763
white
other
The following variables are determined by parents’ values:
---title: "Data Simulation"author: "Erik Westlund"date: "`r Sys.Date()`"output: html_documenteditor_options: chunk_output_type: console---```{r setup, include=FALSE}knitr::opts_chunk$set(echo =TRUE)# Install required packages if not already installedrequired_packages <-c("dplyr", "tidyr", "janitor", "readr", "forcats", "kableExtra")new_packages <- required_packages[!(required_packages %in%installed.packages()[,"Package"])]if(length(new_packages)) install.packages(new_packages)# Load required packageslibrary(dplyr)library(tidyr)library(janitor)library(readr)library(forcats)library(kableExtra)n <-50000n_providers =round(50000/100)source(here::here("examples", "colors.R"))source(here::here("examples", "utils.R"))re_cats <-c("white", "hispanic", "black", "asian", "aian", "nhpi", "other")ed_cats <-c("less_than_hs", "hs", "some_college", "college", "post_grad")rel_cats <-c("christian", "muslim", "jewish", "buddhist", "hindu", "other")job_type_cats =c("unemployed", "unskilled", "trade", "professional")```## Data SimulationTo simulate data, we implement our causal model. We will start by simulating data for root nodes or exogenous nodes (i.e., those without arrows going into them).## States/GeographyUsing data from the Census and the Commonwealth Fund health grades, we'll generate a data frame with some state-level data. This data will be used to generate state-level variables for patients, such as the political/economic conditions related to health. It will also be used to structure the distribution of race/ethnicity in the population per state.```{r gen_state}## Population Data from US Census 2020. Combined with health scores from Commonwealth Fund## Health Scorecard data. The weights measure is based upon population data from the Census.## `pec` is a z-score of the rank of the state in terms of population and used as a ## measure of state political & economic conditions with respect to health.## This data is for illustration purposes only.state_population_data <-read.csv(here::here("data/raw/states_census2020_ranks.csv")) |>mutate(state_weight = population/sum(population)) |>mutate(inverted_rank =max(rank) +1- rank, # Invert ranks (lower is better)pec =scale(inverted_rank, center =TRUE, scale =TRUE)[, 1] # Z-score ) |>select(-inverted_rank) # Load in data on race/ethnicity population from states. This is from the Census2020 file.# It uses their one-race only and separates Hispanic from non-Hispanic. The weights# are just the proportion of the population of one category vs the sum of the population in# all race/ethnicity categories. # This data is meant for illustration purposes only.state_data <- readr::read_csv(here::here("data/raw/race_by_state_census2020.csv")) |>pivot_longer(cols =-"Label",names_to ="State",values_to ="Population") |>pivot_wider(names_from ="Label", values_from = Population) |>mutate(across(-State, ~as.numeric(gsub(",", "", .)))) |>rename(state = State,white = White,black = Black,hispanic ="Hispanic or Latino",asian = Asian,aian ="American Indian and Alaska Native",nhpi ="Native Hawaiian and Other Pacific Islander",other = Other, ) |>mutate(sum = white + hispanic + black + asian + aian + other + nhpi,white = white / sum,black = black / sum,hispanic = hispanic / sum,asian = asian / sum,aian = aian / sum,other = other / sum,nhpi = nhpi / sum, ) |>rename(state_name = state) |>select(-sum) |>left_join(state_population_data |>select(state_name, state, state_weight, pec)) |>filter(!is.na(state))state_data |>kable()readr::write_csv(state_data, here::here("data/processed/state_data.csv"))```## ProvidersWe need to generate providers. We will have a provider for roughly every 100 patients. We will generate a provider quality measure which is mostly random, but determined partially by state political and economic considerations with respect to health.Providers will vary by state population. We below show a random sample of 50 providers.```{r gen_providers}providers <- state_data |>slice(rep(1:n(), each =1)) |>arrange(state) |>mutate(id =1:n(), ) |>select(id, state, pec)remaining_providers <- n_providers -nrow(providers)additional_providers <-data.frame(id = (nrow(providers) +1):n_providers,state =sample(state_data$state, remaining_providers, replace =TRUE, prob = state_data$state_weight)) |>left_join(state_data |>select(state, pec), by ="state")providers <-bind_rows(providers, additional_providers) |>mutate(quality =gen_provider_quality(n_providers, pec) ) |>arrange(id) |>select(id, state, quality)providers |>sample_n(50) |>kable()```## PatientsWe'll start by generating a date frame with patients. We will generate a state for each patient based on the population of each state. We will randomly assign a race/ethnicity based upon state-level proportions. We will then assign that person to a provider in their state.```{r gen_ind}# Start by creating a dataframe with patients having state IDs proportional to# the population of each state. Then row-wise generate race/ethnicity categories.data <-data.frame(state =sample( state_data$state, n,replace =TRUE,prob = state_data$state_weight)) |>merge(state_data |>select(state, pec, re_cats), by ="state") |>rowwise() |>mutate(race =sample(re_cats, size =1, prob =c_across(all_of(re_cats)))) |>ungroup() |>select(-re_cats) |>mutate(id =row_number())data |>sample_n(10) |>kable()```## Assign Patient to Provider```{r patient_to_provider}data <- data |>rowwise() |>mutate(provider_id =sample(providers$id[providers$state == state], 1)) |>ungroup() |>left_join(providers |>select(id, quality), by =c("provider_id"="id")) |>select(id, provider_id, quality, state, pec, race) |>rename(provider_quality = quality )data |>sample_n(10) |>kable()```## Exogenous NodesNow every patient has a state, a provider, a race/ethnicity, and some associated data by state.In addition to race/ethnicity and state, the following variables have no prior causes in our model:* `AGE`: Age of the patient.* `PCE`: Parent community connections* `PED`: Parent education* `PI`: Parent intelligence* `PM`: Parent motivation* `PSR`: Parent resilienceWe will generate `n` observations for each of these variables.```{r data_exog}## Next generate all the other exogenous variables# We need to assign a provider to each patient. This provider needs a qualitydata <- data |>mutate(age =gen_mother_ages(n),parent_income =gen_incomes( n,median_income =60000,sd =25000,min_income =0,max_income =1000000 ),parent_intelligence =rnorm(n, 1, 1),parent_resilience =rnorm(n, 1, 1),parent_motivation =rnorm(n, 1, 1),parent_community_connections =rnorm(n, 0, 1),parent_edu =gen_education(n, ed_cats, parent_intelligence, parent_resilience, parent_motivation, parent_community_connections, parent_income), )data |>sample_n(10) |>kable()```## Working our way up from the root nodesReligion is patterned by race but has no other determinants.```{r religion}data <- data |>mutate(religion =gen_religion(n, rel_cats, race),)data |>select(id, race, religion) |>sample_n(10) |>kable()```The following variables are determined by parents' values:* `I`: Intelligence* `SR`: Resilience* `M`: Motivation* `CE`: Community connections```{r ind}data <- data |>mutate(intelligence =gen_correlated(parent_intelligence, target_r =0.5),resilience =gen_correlated(parent_resilience, target_r =0.5),motivation =gen_correlated(parent_motivation, target_r =0.5),community_connections =gen_correlated(parent_community_connections, target_r =0.5),)data |>select(id, parent_intelligence, intelligence, parent_resilience, resilience, parent_motivation, motivation, parent_community_connections, community_connections) |>sample_n(10) |>kable()```The following variables are determined by a mixture personality, geography, and parents' values:* Education (varies by personal traits and parents' class position)* Cultural orientation (namely, trust of institutions; varies by parents class position, religion, geography, and commu nity connections)* Job type (varies by educational attainment and parental income)* Dependents (varies by income, job type, and age)* Insurance (varies by job type, state conditions, age)* Distance to provider (varies by state conditions)```{r ind_ses}data <- data |>mutate(edu =gen_education(n, ed_cats, intelligence, resilience, motivation, community_connections, parent_income, parent_edu),income =gen_incomes( n,median_income =60000,sd =25000,min_income =0,max_income =1000000 ),cultural_orientation =gen_cultural_orientation(n, parent_income, parent_edu, pec, religion, community_connections),job_type =gen_job_type(n, job_type_cats, ed_cats, edu, parent_income),dependents =gen_dependents(n, income, job_type, age),insurance =gen_insurance(n, job_type_cats, job_type, pec, age),distance_to_provider =gen_distance(n, pec),)data |>select(id, edu, income, cultural_orientation, job_type, dependents, insurance, distance_to_provider) |>sample_n(10) |>kable()```Comorbidity is determined by age, SES, and other comorbidities.* Obesity (varies by state conditions, age)* Multiple gestation (varies by age, obesity)* Diabetes (varies by age, obesity, income)* Heart disease (varies by age, obesity, diabetes)* Placenta previa (varies by multiple gestation)* Hypertension (varies by age, obesity)* Gestational hypertension (varies by hypertension, multiple gestation)* Preeclampsia (varies by age, hypertension, gestational hypertension, multiple gestation)```{r comorboidities}data <- data |>mutate(obesity =gen_obesity(n, income, edu, pec, age, target_prevalence=0.35),multiple_gestation =gen_multiple_gestation(n, age, obesity, target_prevalence =0.03),diabetes =gen_diabetes(n, age, obesity, income, target_prevalence =0.1),heart_disease =gen_heart_disease(n, age, obesity, diabetes, target_prevalence =0.15),placenta_previa =gen_placenta_previa(n, age, multiple_gestation, target_prevalence =0.01),hypertension =gen_hypertension(n, age, obesity, target_prevalence =0.2),gest_hypertension =gen_gest_hypertension(n, hypertension, multiple_gestation, target_prevalence =0.05),preeclampsia =gen_preeclampsia(n, age, hypertension, gest_hypertension, multiple_gestation, target_prevalence =0.02),)data |>select(id, obesity, multiple_gestation, diabetes, heart_disease, placenta_previa, hypertension, gest_hypertension, preeclampsia) |>sample_n(10) |>kable()```## Immediate causes of receipt of comprehensive postnatal careProvider quality is determined by state conditions and already calculated.Personal capacity (to attend visits) is determined by:* dependents* job type* income* distance to provider```{r personal_capacity} data <- data |>mutate(personal_capacity =gen_personal_capacity(n, dependents, job_type, income, distance_to_provider))```Now generate the risk profile, which is a function of:* provider quality (negatively correlated)* age* obesity* multiple gestation* diabetes* heart disease* placenta previa* hypertension* gestational hypertension* preeclampsia```{r risk_profile}data <- data |>mutate(risk_profile =gen_risk_profile(n, provider_quality = providers$quality[match(data$provider_id, providers$id)], age, obesity, multiple_gestation, diabetes, heart_disease, placenta_previa, hypertension, gest_hypertension, preeclampsia))```Now generate risk aversion, which we see as a function of the negative conseuqences from getting really sick and how much people want to avoid them.* insurance (people without insurance will be more risk averse because of cost; positively correlated)* provider quality (people with better providers will be less risk averse since they trust the care they can get; negatively correlated)* risk profile (people with higher risk will be more risk averse since they will have more consequences if getting ill; positively correlated)```{r risk_aversion}data <- data |>mutate(risk_aversion =gen_risk_aversion(n, insurance, provider_quality = providers$quality[match(data$provider_id, providers$id)], risk_profile))```Provider trust is a function of:* race/ethnicity (racial minorities are less likely to trust providers; negatively correlated)* provider quality (people with better providers will be more likely to trust them; positively correlated)* cultural orientation (people trusting institutions will be more likely to trust providers; positively correlated)```{r provider_trust}data <- data |>mutate(provider_trust =gen_provider_trust(n, re_cats, race, provider_quality, cultural_orientation))```And finally, willingness to pay, which we see as a function of:* provider quality (people with better providers will be willing to pay more; positively correlated)* income (people with higher income will be willing to pay more; positively correlated)* insurance (people with insurance will be willing to pay more since they don't have to cover it out of pocket; positively correlated)* risk aversion (people who are more risk averse will be willing to pay more; positively correlated)* cultural orientation (people who trust institutions will be willing to pay more; positively correlated)```{r willingness_to_pay}data <- data |>mutate(willingness_to_pay =gen_willingness_to_pay(n, provider_quality = providers$quality[match(data$provider_id, providers$id)], income, insurance, risk_aversion, cultural_orientation))```Finally, we generate the outcome of interest, receipt of comprehensive postnatal care, which is a function of:* personal capacity (people with more capacity will be more likely to attend visits; positively correlated)* willingness to pay (people who are willing to pay more will be more likely to attend visits; positively correlated)* provider quality (people with better providers will be more likely to attend visits; positively correlated)* provider trust (people who trust their providers will be more likely to attend visits; positively correlated)* risk aversion (people who are more risk averse will be more likely to attend visits; negatively correlated)* risk profile (people with higher risk will be more likely to attend visits; positively correlated)```{r gen_dv}data <- data |>mutate(received_comprehensive_postnatal_care =gen_received_comprehensive_postnatal_care(n, personal_capacity, willingness_to_pay, provider_quality, provider_trust, risk_aversion, risk_profile) )```We're going to include an income variable, but it's going to be censored and have measurement error.```{r income_osberved}sri_labels <-c("$0–$24,999", "$25,000–$49,999", "$50,000–$74,999", "$75,000–$99,999", "$100,000–$124,999", "$125,000–$149,999", "$150,000–$174,999", "$175,000+")data <- data |>mutate(self_report_income =cut(pmin(pmax(income +rnorm(n, mean =0, sd =5000), 0), 200000), # Add noise and capbreaks =seq(0, 200000, by =25000),include.lowest =TRUE,right =FALSE,labels = sri_labels ))table(data$self_report_income)analysis_data <- data |>select( id, provider_id, state, received_comprehensive_postnatal_care, self_report_income, age, edu, race, insurance, job_type, dependents, distance_to_provider, obesity, multiple_gestation, diabetes, heart_disease, placenta_previa, hypertension, gest_hypertension, preeclampsia) |>mutate(self_report_income =as.character(self_report_income),job_type =as.character(job_type),dependents =as.character(dependents), ) |>rename(race_ethnicity = race )readr::write_csv(analysis_data, here::here("data/processed/simulated_data.csv")) ```