Data Visualization: Day 3

Erik Westlund

2025-06-12

Housekeeping

Course Evaluations

  • JHSPH aims for 100% course evaluation completion
  • Please complete the course evaluation by https://courseevalsystem.jhsph.edu/.
  • I will follow up with an email reminder or tow.

Problem Set 3 & Final Project

  • The final problem set is due tomorrow.
  • The final project is due in three weeks, on July 3rd.
  • Review the final assignment sheet on CoursePlus or GitHub
  • Submit all assignments via email or on CoursePlus
  • I prefer you put your work in the work directory on your GitHub repository fork

Final Project

  • Email it to me or submit it on CoursePlus
  • Basic requirements:
    • Submit a rendered notebook
    • Each visualizatoin needs a justification
    • 10 visualizations total
    • 5 can be univariate
    • 5 must have at least two variables
    • 2 must have three or more
    • Make them polished

Support

  • Please feel free to reach out to me via email or Teams
  • I am happy to meet with you on Zoom to discuss your work and answer questions

Overview for Day 3

Day 3: What to Visualize – From Question to Plot

  • Scientific questions and DAGs
  • What to visualize: distributions, time, space, relationships, models
  • Key design principles: honesty, annotation, small multiples
  • Case studies: 7 applied examples

Theme: Science Drives Visualization

  • Data visualization is valuable for its scientific application
  • It helps us:
    • Understand our data
    • Understand our models
    • Communicate our findings in an impactful way

DAGs, Scientific Complexity, Causal Inference, and Visualization

From Question to DAG to Plot

  • We’ll start with a scientific question.
  • We’ll sketch out the causal structure using a DAG.
  • From this, we can assess typical problems scientific studies like this face
  • We can then use the DAG to guide our visualization choices

Scientific Question

“What causes mothers to receive comprehensive postnatal care?”

Directed Acyclic Graphs

  • A DAG is a graph that represents the causal relationships between variables.
  • It is a directed graph, meaning that the edges have a direction.
  • It is an acyclic graph, meaning that there are no cycles in the graph.
  • It is a graph, meaning that it has nodes and edges.

How DAGs Help Us In Science & Visualization

  • DAGs help us:

    • Identify the variables that are important to the question
    • Identify the variables that are confounding
    • Identify the variables that are missing
    • Identify the variables that are measured incorrectly
  • Having a good idea of the causal structure of a question helps us know what to visualize.

DAGs as a Visualization Tool

  • DAGs:

    1. are themselves visualizations
    2. encode scientific problems and causal structures
  • We can use the daggity R package to visualize the the causal structure of the problems around the motivating research question

Big Picture

rcp_dag <- dagify(RCP ~ PQ + PC + WTP + PT + RA + RP)

rcp_dag |>
    ggdag() +
    theme_dag()

  • RCP = “Received Comprehensive Postnatal Care”
  • PQ = “Provider Quality”
  • PC = “Personal Capacity”
  • WTP = “Willingness to Pay”
  • PT = “Provider Trust”
  • RA = “Risk Aversion”
  • RP = “Risk Profile”

Provider Quality (PQ) Sub-DAG

pq_dag <- dagify(
  PQ ~ PEC,
  PEC ~ ST
)

pq_dag |>
    ggdag() +
    theme_dag()

  • PQ = “Provider Quality”
  • PEC = “Political/Economic Conditions”
  • ST = “State”

Personal Capacity (PC) Sub-DAG

pc_dag <- dagify(
  PC ~ DEP + JOB + INC + DIS,
  DEP ~ INC + JOB + CO,
  JOB ~ EDU + CO + R + M + I + CC,
  DIS ~ PEC,
  INC ~ JOB + EDU + RE + PEC + AGE
)

pc_dag |>
    ggdag() +
    theme_dag()

  • PC = “Personal Capacity”
  • DEP = “Dependents”
  • JOB = “Job Type”
  • INC = “Income”
  • DIS = “Distance to Provider”
  • CO = “Cultural Orientation”
  • EDU = “Educational Attainment”
  • R = “Resilience”
  • M = “Motivation”
  • I = “Intelligence”
  • CC = “Community Connections”
  • RE = “Race/Ethnicity”
  • AGE = “Age”

Willingness to Pay (WTP) Sub-DAG

wtp_dag <- dagify(
  WTP ~ PQ + INC + INS + RA + CO,
  INS ~ JOB + PEC + AGE,
  CO ~ PED + PIN + REL + PEC + CC
)

wtp_dag |>
    ggdag() +
    theme_dag()

  • WTP = “Willingness to Pay”
  • PQ = “Provider Quality”
  • INC = “Income”
  • INS = “Insurance”
  • RA = “Risk Aversion”
  • CO = “Cultural Orientation”
  • JOB = “Job Type”
  • PEC = “Political/Economic Conditions”
  • AGE = “Age”
  • PED = “Parent Education”
  • PIN = “Parent Income”
  • REL = “Religion”
  • CC = “Community Connections”

Provider Trust (PT) Sub-DAG

pt_dag <- dagify(
  PT ~ PQ + RE + CO,
  CO ~ PED + PIN + REL + PEC + CC
)

pt_dag |>
    ggdag() +
    theme_dag()

  • PT = “Provider Trust”
  • PQ = “Provider Quality”
  • RE = “Race/Ethnicity”
  • CO = “Cultural Orientation”
  • PED = “Parent Education”
  • PIN = “Parent Income”
  • REL = “Religion”
  • PEC = “Political/Economic Conditions”
  • CC = “Community Connections”

Risk Aversion (RA) Sub-DAG

ra_dag <- dagify(
  RA ~ PQ + RP + INS,
  RP ~ PQ + AGE + OBE + MG + DM + HD + PP + PR + HT + GHT,
  INS ~ JOB + PEC + AGE
)

ra_dag |>
    ggdag() +
    theme_dag()

  • RA = “Risk Aversion”
  • PQ = “Provider Quality”
  • RP = “Risk Profile”
  • INS = “Insurance”
  • AGE = “Age”
  • OBE = “Obesity”
  • MG = “Multiple Gestation”
  • DM = “Diabetes Mellitus”
  • HD = “Heart Disease”
  • PP = “Placenta Previa”
  • PR = “Preeclampsia”
  • HT = “Hypertension”
  • GHT = “Gestational Hypertension”
  • JOB = “Job Type”
  • PEC = “Political/Economic Conditions”

Risk Profile (RP) Sub-DAG

rp_dag <- dagify(
  RP ~ PQ + AGE + OBE + MG + DM + HD + PP + PR + HT + GHT,
  OBE ~ PEC + AGE,
  MG ~ AGE + OBE,
  DM ~ AGE + OBE + INC,
  HD ~ AGE + OBE + DM,
  PP ~ AGE + MG,
  PR ~ AGE + HT + GHT + MG,
  HT ~ AGE + OBE,
  GHT ~ HT + MG
)

rp_dag |>
    ggdag() +
    theme_dag()

  • RP = “Risk Profile”
  • PQ = “Provider Quality”
  • AGE = “Age”
  • OBE = “Obesity”
  • MG = “Multiple Gestation”
  • DM = “Diabetes Mellitus”
  • HD = “Heart Disease”
  • PP = “Placenta Previa”
  • PR = “Preeclampsia”
  • HT = “Hypertension”
  • GHT = “Gestational Hypertension”
  • PEC = “Political/Economic Conditions”
  • INC = “Income”

Complete DAG Codification

complete_dag <- dagify(
  RCP ~ PQ + PC + WTP + PT + RA + RP,
  PQ ~ PEC,
  PC ~ DEP + JOB + INC + DIS,
  WTP ~ PQ + INC + INS + RA + CO,
  PT ~ PQ + RE + CO,
  RA ~ PQ + RP + INS,
  RP ~ PQ + AGE + OBE + MG + DM + HD + PP + PR + HT + GHT,
  PEC ~ ST,
  DEP ~ INC + JOB + CO,
  JOB ~ EDU + CO + R + M + I + CC,
  DIS ~ PEC,
  INC ~ JOB + EDU + RE + PEC + AGE,
  INS ~ JOB + PEC + AGE,
  CO ~ PED + PIN + REL + PEC + CC,
  PIN ~ PEC + PED,
  EDU ~ PED + PIN + R + M + I + CC,
  R ~ PR,
  M ~ PM,
  I ~ PI,
  CC ~ PCC,
  REL ~ RE,
  OBE ~ PEC + AGE,
  MG ~ AGE + OBE,
  DM ~ AGE + OBE + INC,
  HD ~ AGE + OBE + DM,
  PP ~ AGE + MG,
  PR ~ AGE + HT + GHT + MG,
  HT ~ AGE + OBE,
  GHT ~ HT + MG
)

Complete DAG Visualization

complete_dag |>
    ggdag() +
    theme_dag()

DAGs and Scientific Complexity

So What?

  • I think it’s worth stepping back and considering how complex many scientific questions really are
  • We do not always observe everythign we want to, but with the help of DAGs we can:
    • Identify the variables that are important to the question
    • Identify what we can actually observe
    • Identify what we cannot observe
    • Assess the limitations we face with causal analysis

Next Steps

  • We are now going to work through several applications/case studies, each addressing a different aspect of visualization.

Side Quest: Simulated Data

  • In examples/dag_sim_data.qmd we simulated data that encodes the DAG structure.
  • Let’s take a very brief glance at that file to see how we simulated the data.

Worfklow

Workflow: ggplot themes Staying DRY

  • In ggplot_themes_and_staying_dry.qmd we explore how to make a ggplot theme
  • We also discuss how to stay DRY (Don’t Repeat Yourself) when working with multifile notebooks

Workflow: Saving Visualizations

  • In saving_visualizations.qmd I show how to export/save visualizations
  • This includes some guidance on file format choices and best practices

Applications

Application 1: Effective and Honest Scales

  • In applications_1_effective_and_honest_scales.qmd we look at how to use scales and position effectively
  • We examine how axes can be used to mislead or clarify

Application 2: Choropleths for Spatial Data

  • In applications_2_choropleths_for_spatial_data.qmd we create a choropleth map of the United States using the simulated data from above
  • We show how even with maps, small multiples help find clarity

Application 3: Dot Plots for Spatial Data

  • In applications_3_dot_plot_for_spatial_data.qmd we create a dot plot of the data using the same simulated data.
  • We show how spatial variation can often be visualized better without actual maps.

Application 4: Distribution Plots

  • In applications_4_distribution_plots.qmd we examine distributions of variables
  • We create box plots, violin plots, and ridgeline plots to show the distribution of a variable across a population.

Application 6: Visualizing Correlations and Models

  • In applications_6_visualizing_correlations_and_models.qmd we visualize correlations and model outputs
  • We’ll look at why it’s important to visualize correlations and model outputs

Summary and Takeaways

  • Visualization choices should be driven by the scientific question
  • Experimentation is key; often the “obvious” plot is not the best plot
  • Iterate: build in layers; get the basics down; then polish
  • Use small multiples, annotations, and thoughtful scales to guide the viewer