Data Visualization: Day 3

Erik Westlund

2025-06-12

Housekeeping

Course Evaluations

JHSPH aims for 100% course evaluation completion
Please complete the course evaluation at https://courseevalsystem.jhsph.edu/.
I will follow up with an email reminder or two.

Problem Set 3 & Final Project

The final problem set is due tomorrow.
The final project is due in three weeks, on July 3rd.
Review the final assignment sheet on CoursePlus or GitHub
Submit all assignments via email or on CoursePlus
I prefer you put your work in the work directory on your GitHub repository fork

Final Project

Email it to me or submit it on CoursePlus
Basic requirements:
- Submit a rendered notebook
- Each visualization needs a justification
- 10 visualizations total
- Up to 5 can be univariate
- At least 5 must have two or more variables
- At least 2 must have three or more variables
- Make them polished

Support

Please feel free to reach out to me via email or Teams
I am happy to meet with you on Zoom to discuss your work and answer questions

Overview for Day 3

Day 3: What to Visualize – From Question to Plot

Scientific questions and DAGs
What to visualize: distributions, time, space, relationships, models
Key design principles: honesty, annotation, small multiples
Case studies: 7 applied examples

Theme: Science Drives Visualization

Data visualization is valuable for its scientific application
It helps us:
- Understand our data
- Understand our models
- Communicate our findings in an impactful way

DAGs, Scientific Complexity, Causal Inference, and Visualization

From Question to DAG to Plot

We’ll start with a scientific question.
We’ll sketch out the causal structure using a DAG.
From this, we can assess the typical problems that scientific studies like this one face.
We can then use the DAG to guide our visualization choices

Scientific Question

“What causes mothers to receive comprehensive postnatal care?”

Directed Acyclic Graphs

A DAG is a graph that represents the causal relationships between variables.
It is a directed graph, meaning that the edges have a direction.
It is an acyclic graph, meaning that there are no cycles in the graph.
It is a graph, meaning that it has nodes and edges.
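The three properties above can be checked directly. This is a minimal sketch, assuming the dagitty package is installed; the node names A, B, and C are placeholders, not variables from the course data.

```r
library(dagitty)

# Three nodes, two directed edges, no cycles: a valid DAG
g_dag <- dagitty("dag { A -> B -> C }")

# Adding C -> A closes a loop, so this graph is no longer acyclic
g_cyclic <- dagitty("dag { A -> B -> C -> A }")

dag_is_acyclic <- isAcyclic(g_dag)
loop_is_acyclic <- isAcyclic(g_cyclic)

# Direction matters: B is a parent of C, and A is a parent of B
parents_of_c <- parents(g_dag, "C")
parents_of_b <- parents(g_dag, "B")
```

The acyclicity check is what rules out feedback loops; a process with feedback over time would need time-indexed nodes to fit in a DAG.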

How DAGs Help Us In Science & Visualization

DAGs help us:
- Identify the variables that are important to the question
- Identify the variables that confound the relationship of interest
- Identify the variables that are missing
- Identify the variables that are measured incorrectly
Having a good idea of the causal structure of a question helps us know what to visualize.
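As one illustration of identifying confounders, the sketch below asks dagitty which variables need adjusting for. It uses a toy subset of the course DAG (PQ, PT, RCP) chosen for brevity, and treating PT as the exposure is an assumption made only for this example.

```r
library(ggdag)    # dagify()
library(dagitty)  # adjustmentSets()

# Toy subset: provider quality (PQ) affects both provider trust (PT)
# and receipt of care (RCP), so PQ confounds the PT -> RCP relationship.
toy_dag <- dagify(
  RCP ~ PT + PQ,
  PT ~ PQ,
  exposure = "PT",
  outcome = "RCP"
)

# Minimal sets of variables to adjust for when estimating PT -> RCP
adj_sets <- adjustmentSets(toy_dag)
print(adj_sets)
```

A variable that shows up in an adjustment set is a natural candidate for stratified or faceted plots.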

DAGs as a Visualization Tool

DAGs:
1. are themselves visualizations
2. encode scientific problems and causal structures
We can use the ggdag R package (built on dagitty) to visualize the causal structure of the problems around the motivating research question

Big Picture

library(ggdag)  # provides dagify(), ggdag(), and theme_dag()

rcp_dag <- dagify(RCP ~ PQ + PC + WTP + PT + RA + RP)

rcp_dag |>
    ggdag() +
    theme_dag()

RCP = “Received Comprehensive Postnatal Care”
PQ = “Provider Quality”
PC = “Personal Capacity”
WTP = “Willingness to Pay”
PT = “Provider Trust”
RA = “Risk Aversion”
RP = “Risk Profile”

Provider Quality (PQ) Sub-DAG

pq_dag <- dagify(
  PQ ~ PEC,
  PEC ~ ST
)

pq_dag |>
    ggdag() +
    theme_dag()

PQ = “Provider Quality”
PEC = “Political/Economic Conditions”
ST = “State”

Personal Capacity (PC) Sub-DAG

pc_dag <- dagify(
  PC ~ DEP + JOB + INC + DIS,
  DEP ~ INC + JOB + CO,
  JOB ~ EDU + CO + R + M + I + CC,
  DIS ~ PEC,
  INC ~ JOB + EDU + RE + PEC + AGE
)

pc_dag |>
    ggdag() +
    theme_dag()

PC = “Personal Capacity”
DEP = “Dependents”
JOB = “Job Type”
INC = “Income”
DIS = “Distance to Provider”
CO = “Cultural Orientation”
EDU = “Educational Attainment”
R = “Resilience”
M = “Motivation”
I = “Intelligence”
CC = “Community Connections”
RE = “Race/Ethnicity”
AGE = “Age”

Willingness to Pay (WTP) Sub-DAG

wtp_dag <- dagify(
  WTP ~ PQ + INC + INS + RA + CO,
  INS ~ JOB + PEC + AGE,
  CO ~ PED + PIN + REL + PEC + CC
)

wtp_dag |>
    ggdag() +
    theme_dag()

WTP = “Willingness to Pay”
PQ = “Provider Quality”
INC = “Income”
INS = “Insurance”
RA = “Risk Aversion”
CO = “Cultural Orientation”
JOB = “Job Type”
PEC = “Political/Economic Conditions”
AGE = “Age”
PED = “Parent Education”
PIN = “Parent Income”
REL = “Religion”
CC = “Community Connections”

Provider Trust (PT) Sub-DAG

pt_dag <- dagify(
  PT ~ PQ + RE + CO,
  CO ~ PED + PIN + REL + PEC + CC
)

pt_dag |>
    ggdag() +
    theme_dag()

PT = “Provider Trust”
PQ = “Provider Quality”
RE = “Race/Ethnicity”
CO = “Cultural Orientation”
PED = “Parent Education”
PIN = “Parent Income”
REL = “Religion”
PEC = “Political/Economic Conditions”
CC = “Community Connections”

Risk Aversion (RA) Sub-DAG

ra_dag <- dagify(
  RA ~ PQ + RP + INS,
  RP ~ PQ + AGE + OBE + MG + DM + HD + PP + PR + HT + GHT,
  INS ~ JOB + PEC + AGE
)

ra_dag |>
    ggdag() +
    theme_dag()

RA = “Risk Aversion”
PQ = “Provider Quality”
RP = “Risk Profile”
INS = “Insurance”
AGE = “Age”
OBE = “Obesity”
MG = “Multiple Gestation”
DM = “Diabetes Mellitus”
HD = “Heart Disease”
PP = “Placenta Previa”
PR = “Preeclampsia”
HT = “Hypertension”
GHT = “Gestational Hypertension”
JOB = “Job Type”
PEC = “Political/Economic Conditions”

Risk Profile (RP) Sub-DAG

rp_dag <- dagify(
  RP ~ PQ + AGE + OBE + MG + DM + HD + PP + PR + HT + GHT,
  OBE ~ PEC + AGE,
  MG ~ AGE + OBE,
  DM ~ AGE + OBE + INC,
  HD ~ AGE + OBE + DM,
  PP ~ AGE + MG,
  PR ~ AGE + HT + GHT + MG,
  HT ~ AGE + OBE,
  GHT ~ HT + MG
)

rp_dag |>
    ggdag() +
    theme_dag()

RP = “Risk Profile”
PQ = “Provider Quality”
AGE = “Age”
OBE = “Obesity”
MG = “Multiple Gestation”
DM = “Diabetes Mellitus”
HD = “Heart Disease”
PP = “Placenta Previa”
PR = “Preeclampsia”
HT = “Hypertension”
GHT = “Gestational Hypertension”
PEC = “Political/Economic Conditions”
INC = “Income”

Complete DAG Codification

complete_dag <- dagify(
  RCP ~ PQ + PC + WTP + PT + RA + RP,
  PQ ~ PEC,
  PC ~ DEP + JOB + INC + DIS,
  WTP ~ PQ + INC + INS + RA + CO,
  PT ~ PQ + RE + CO,
  RA ~ PQ + RP + INS,
  RP ~ PQ + AGE + OBE + MG + DM + HD + PP + PR + HT + GHT,
  PEC ~ ST,
  DEP ~ INC + JOB + CO,
  JOB ~ EDU + CO + R + M + I + CC,
  DIS ~ PEC,
  INC ~ JOB + EDU + RE + PEC + AGE,
  INS ~ JOB + PEC + AGE,
  CO ~ PED + PIN + REL + PEC + CC,
  PIN ~ PEC + PED,
  EDU ~ PED + PIN + R + M + I + CC,
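  R ~ PR,  # caution: PR is also "Preeclampsia" below; if a separate parent node was intended (cf. PM, PI, PCC), it needs a distinct name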
  M ~ PM,
  I ~ PI,
  CC ~ PCC,
  REL ~ RE,
  OBE ~ PEC + AGE,
  MG ~ AGE + OBE,
  DM ~ AGE + OBE + INC,
  HD ~ AGE + OBE + DM,
  PP ~ AGE + MG,
  PR ~ AGE + HT + GHT + MG,
  HT ~ AGE + OBE,
  GHT ~ HT + MG
)

Complete DAG Visualization

complete_dag |>
    ggdag() +
    theme_dag()

DAGs and Scientific Complexity

So What?

I think it’s worth stepping back and considering how complex many scientific questions really are
We do not always observe everything we want to, but with the help of DAGs we can:
- Identify the variables that are important to the question
- Identify what we can actually observe
- Identify what we cannot observe
- Assess the limitations we face with causal analysis
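To make the last two points concrete, here is a hedged sketch using a toy subset of the course DAG. Declaring Motivation (M) as unobserved is an assumption for illustration, as is treating EDU as the exposure.

```r
library(ggdag)    # dagify()
library(dagitty)  # latents(), adjustmentSets()

# Toy subset: motivation (M) drives both education (EDU) and job type (JOB).
# Marking M as latent is an illustrative assumption.
edu_dag <- dagify(
  JOB ~ EDU + M,
  EDU ~ M + PED,
  exposure = "EDU",
  outcome = "JOB",
  latent = "M"
)

unobserved <- latents(edu_dag)

# With M unobserved, the backdoor path EDU <- M -> JOB cannot be blocked
# by observed variables, so we expect no valid adjustment set.
adj_sets <- adjustmentSets(edu_dag)
n_sets <- length(adj_sets)
```

An empty result is itself informative: it signals that any plot of EDU against JOB shows an association that observed covariates alone cannot de-confound.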

Next Steps

We are now going to work through several applications/case studies, each addressing a different aspect of visualization.

Side Quest: Simulated Data

In examples/dag_sim_data.qmd we simulated data that encodes the DAG structure.
Let’s take a very brief glance at that file to see how we simulated the data.

Workflow

Workflow: `ggplot` Themes and Staying DRY

In ggplot_themes_and_staying_dry.qmd we explore how to make a ggplot theme
We also discuss how to stay DRY (Don’t Repeat Yourself) when working with multi-file notebooks
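A minimal sketch of the idea, not the contents of the course notebook: the function name theme_course() and the specific styling choices are hypothetical.

```r
library(ggplot2)

# Hypothetical shared theme; name and styling are illustrative only
theme_course <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold"),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )
}

# Every plot picks up the same look from one definition
p <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
  geom_point() +
  labs(title = "Fuel economy by weight", colour = "Cylinders") +
  theme_course()
```

In a multi-file project, one option is to keep the function in a single script and load it with source() from each notebook, so a styling change happens in one place.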

Workflow: Saving Visualizations

In saving_visualizations.qmd I show how to export/save visualizations
This includes some guidance on file format choices and best practices
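As a rough sketch of exporting with ggsave(), assuming temporary output paths and a 6 x 4 inch figure chosen only for illustration:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()

# Temporary paths keep the sketch self-contained; a real project would
# write to a project folder instead.
png_path <- file.path(tempdir(), "scatter.png")
pdf_path <- file.path(tempdir(), "scatter.pdf")

# Raster output: set physical size and resolution explicitly
ggsave(png_path, plot = p, width = 6, height = 4, units = "in", dpi = 300)

# Vector output: scales without pixelation, so resolution is not a concern
ggsave(pdf_path, plot = p, width = 6, height = 4, units = "in")
```

Passing the plot and dimensions explicitly, rather than relying on the last plot and the current device size, tends to make exports reproducible.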

Applications

Application 1: Effective and Honest Scales

In applications_1_effective_and_honest_scales.qmd we look at how to use scales and position effectively
We examine how axes can be used to mislead or clarify
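The sketch below contrasts an honest and a misleading axis for the same numbers. The data are made up for illustration and do not come from the course notebook.

```r
library(ggplot2)

# Made-up values for illustration only
df <- data.frame(group = c("A", "B", "C"), rate = c(92, 95, 97))

# Bars encode value by length, so the axis should start at zero
p_honest <- ggplot(df, aes(group, rate)) +
  geom_col() +
  scale_y_continuous(limits = c(0, 100))

# Zooming in exaggerates small differences: the bar for 97 now looks
# about three times as tall as the bar for 92
p_truncated <- ggplot(df, aes(group, rate)) +
  geom_col() +
  coord_cartesian(ylim = c(90, 100))

# When a zoomed view is justified, points avoid implying that length
# encodes magnitude
p_points <- ggplot(df, aes(group, rate)) +
  geom_point(size = 3) +
  scale_y_continuous(limits = c(90, 100))
```

Note that the truncated bar chart uses coord_cartesian() to zoom; setting scale limits of 90 to 100 on geom_col() would instead drop the bars, because they start at zero, outside the limits.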

Application 2: Choropleths for Spatial Data

In applications_2_choropleths_for_spatial_data.qmd we create a choropleth map of the United States using the simulated data from above
We show how even with maps, small multiples help find clarity
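A hedged sketch of a small-multiples choropleth. It assumes the maps package is installed (map_data() needs it) and uses the built-in USArrests data as a stand-in for the course's simulated data.

```r
library(ggplot2)

# State outlines for the contiguous US; map_data() requires the maps package
states <- map_data("state")

# Built-in USArrests data stands in for the course's simulated data.
# Each measure is standardized so both panels can share one fill scale.
measures <- rbind(
  data.frame(region = tolower(rownames(USArrests)), measure = "Murder",
             z = as.numeric(scale(USArrests$Murder))),
  data.frame(region = tolower(rownames(USArrests)), measure = "Assault",
             z = as.numeric(scale(USArrests$Assault)))
)

# Join values to outlines, then restore the drawing order of each polygon
map_df <- merge(states, measures, by = "region")
map_df <- map_df[order(map_df$measure, map_df$group, map_df$order), ]

p <- ggplot(map_df, aes(long, lat, group = group, fill = z)) +
  geom_polygon(colour = "white") +
  coord_quickmap() +
  facet_wrap(~ measure) +
  scale_fill_viridis_c(name = "Std. rate") +
  theme_void()
```

Re-sorting by the order column after merge() matters: merge() does not preserve row order, and unsorted vertices produce scrambled polygons.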

Application 3: Dot Plots for Spatial Data

In applications_3_dot_plot_for_spatial_data.qmd we create a dot plot of the data using the same simulated data.
We show how spatial variation can often be visualized better without actual maps.
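One way to sketch this idea is an ordered dot plot of state-level values. The built-in USArrests data is used here as a stand-in for the course's simulated data.

```r
library(ggplot2)

# Built-in USArrests data stands in for the course's simulated data
arrests <- data.frame(
  state = rownames(USArrests),
  murder = USArrests$Murder
)

# Sorting states by value makes rank and gaps readable, and every state
# gets equal visual weight regardless of land area
p <- ggplot(arrests, aes(x = murder, y = reorder(state, murder))) +
  geom_point() +
  labs(x = "Murder arrests per 100,000 (1973)", y = NULL) +
  theme_minimal(base_size = 9)
```

Unlike a choropleth, position along a common scale lets the reader compare any two states directly, at the cost of losing geographic adjacency.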

Application 4: Distribution Plots

In applications_4_distribution_plots.qmd we examine distributions of variables
We create box plots, violin plots, and ridgeline plots to show the distribution of a variable across a population.
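A minimal sketch combining two of these plot types, using the built-in iris data rather than the course data. Ridgeline plots need an extra package such as ggridges, so they are left out here.

```r
library(ggplot2)

# Built-in iris data stands in for the course's simulated data
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  # Violin: overall shape of each group's distribution
  geom_violin(fill = "grey90") +
  # Narrow box plot inside: median and quartiles
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  labs(x = NULL, y = "Sepal length (cm)")
```

Layering the two shows both the summary statistics and the shape that a box plot alone would hide, such as bimodality.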

Application 5: Visualizing Time Trends

In applications_5_visualizing_time_trends.qmd we examine time trends of variables
We create line plots to show how a variable changes over time.
We also show how to visualize time trends using Sankey and sunburst plots.
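For the line-plot case, a short sketch using the economics dataset that ships with ggplot2; the choice of variable and the ten-year axis breaks are illustrative assumptions.

```r
library(ggplot2)

# economics ships with ggplot2: monthly US series with a Date column
p <- ggplot(economics, aes(x = date, y = unemploy / pop)) +
  geom_line() +
  # Date-aware axis: one labelled break per decade
  scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
  labs(x = NULL, y = "Unemployed share of population")
```

Plotting a share rather than the raw count keeps population growth from masquerading as a trend.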

Application 6: Visualizing Correlations and Models

In applications_6_visualizing_correlations_and_models.qmd we visualize correlations and model outputs
We’ll look at why it’s important to visualize correlations and model outputs
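Anscombe's quartet, which ships with base R, is a compact illustration of why a correlation coefficient should be plotted and not only reported. The sketch below is an example chosen for this purpose, not necessarily the one used in the course notebook.

```r
library(ggplot2)

# Anscombe's quartet is available as datasets::anscombe
sets <- 1:4
cors <- sapply(sets, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(cors, 3)  # all four correlations are about 0.816

# Reshape to long format so each set becomes one facet
quartet <- do.call(rbind, lapply(sets, function(i) {
  data.frame(
    set = paste("Set", i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

# Same fitted line in every panel, very different data underneath
p <- ggplot(quartet, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ set)
```

The four panels share nearly identical correlations and regression lines, yet only one of them looks like the linear relationship the summary numbers suggest.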

Summary and Takeaways

Visualization choices should be driven by the scientific question
Experimentation is key; often the “obvious” plot is not the best plot
Iterate: build in layers; get the basics down; then polish
Use small multiples, annotations, and thoughtful scales to guide the viewer
