Data Visualization: Day 2

Erik Westlund

2025-06-11

Day 2: AI & LLMs, Data Preparation, and the Grammar of Graphics

Overview For Day 2

  • AI & LLMs: An Illustration
  • Data preparation fundamentals
  • Introduction to the grammar of graphics
  • Colors & accessibility

AI & LLMs: An Illustration

  • We’ll use an LLM to help us prepare data for a visualization
  • We’ll work with the ai_example.qmd file in the examples directory

Data Preparation

Getting Data Ready For Visualization

  • Before we can do any visualization, we need to load in data and prepare it

  • Take this step seriously, especially your variable names, as your code depends on it

  • Workflow options:

    1. Clean up the data in a block at the top of your notebook
    2. Separate out the data preparation from the visualization code
  • There are no universal answers to the best approach here; friction as you work often tells you what to do

Option 1: Clean up the data in a block at the top of your notebook

Benefits

  • Quick and straightforward
  • All code in one place
  • Easy for you and others to follow the workflow
  • Good for exploratory analysis
  • Minimal file management

Drawbacks

  • Can become messy with large datasets
  • Expensive transformations take time to run every time you run the notebook
  • Harder to reuse code, especially in analyses with multiple steps
  • Your colleagues may not care about the data preparation

Option 2: Separate out the data preparation from the visualization code

  • In this approach, you prepare the data in a separate script or notebook

  • You can then load the data into your visualization notebook

  • Two options:

    1. Run the data preparation script before running the visualization script
    2. Run the data preparation script as part of the visualization script
  • Example of (1) above provided in our examples/project-example directory

Benefits/Drawbacks of separating scripts

Benefits

  • Data preparation notebook serves as documentation of your data
  • Data preparation can be reused across projects
  • Better separation of concerns
  • Often asier to maintain and update

Drawbacks

  • More files to manage
  • Need to ensure data preparation is run in the right order
  • Can be overkill for simple projects
  • More complex project structure

Saving Data To Load Later

  • When using separate scripts, you’ll often want to save a copy of your cleaned data to load later scripts/notebooks

  • A good option here is use the saveRDS function to save your data as an R data object

  • This has a few benefits

    • RDS files maintain all R metadata
    • You can quickly load the data into R with readRDS, which can save you a lot of time when working with large datasets
    • RDS files are not easily to accidentally modify

Preparing Data Example

  • I have created an example of data preparation in the examples/prams_1_data_prep.qmd file
  • Let’s load this up in RStudio and step through it.
  • Note that we’ll be opening the entire course repo as a project, so the root directory is one level up from the examples directory
  • We will return to this data file later to visualize it

The Grammar of Graphics

ggplot2 and the Grammar of Graphics

  • The “gg” in ggplot2 stands for the grammar of graphics, a concept coined by Leland Wilkinson and popularized by Hadley Wickham.
  • Grammar in language is a set of rules for how language is structured.
  • Grammar in graphics is a set of rules for how visual representations of data are structured.

Key Concepts in ggplot2

ggplot2 workflow

ggplot2 Workflow

  1. Start with data
  2. Pick an aesthetic mapping
  3. Choose a geometric object
  4. Add statistical transformations
  5. Adjust finer details: scales, coordinate systems, faceting, etc.

Aesthetic Mappings

  • This is how your variables map onto the aesthetics of your figure
  • It sounds highfalutin, but it usually just means:
    • What is your x?
    • What is your y?

Geometric Objects (“geoms”)

  • Geoms are the elements used to represent the data
  • Geoms have associated stats/functions/statistical transformations
  • Stats such as counts for groups often involve aggregations

Geoms (cont.)

Geom R Function Stat/Transformation Use for
Point geom_point() Identity Scatter plots
Bar geom_bar() Count Bar plots
AB Line geom_abline() Slope Line plots
Horizontal line geom_hline() Y intercept Reference lines
Vertical line geom_vline() X intercept Reference lines
Smoother geom_smooth() Smoothing function (GAM, Loess, etc.) Showing patterns/trends

Position/Fill

  • You can use position/fill to move elements around and color code geoms by aspects of the data, such as categories.

  • For example, with a bar chart:

    • stack: Stack bars on top of each other
    • fill: Stack bars, but make them always fill up vertical space to 1
    • dodge: Put the bars next to each other
  • Tip: When you reach for fill, always consider faceting in small multiples. The brain struggles with fills. More on this later.

Aggregations

  • To make a bar chart, one groups data and takes a mean, proportion, or count
  • Software like ggplot will try to do this for you and it works well for simple cases
  • This can fail with more difficult aggregations or when doing things like applying annotation labels
  • A powerful and frustration-reducing technique is to do the aggregation yourself and then use the identity mapping
  • This separates the statistical logic from the visual logic and is often what makes a tricky figure easier to make

Examples

Illustration: ggplot2 Concepts

  • Let’s load the prams_2_ggplot_concepts.qmd file
  • In this file we will load the data we prepared in the previous file
  • We will then work through the key ggplot2 concepts with examples.
  • We will iterate on the figure to make it more informative and visually appealing

Illustration: Iteration & Aggregation

  • The PRAMS data we used comes pre-aggregated

  • In prams_3_iteration_aggregation.qmd we will illustrate how statistical transformations work by showing:

    • The identity transformation with pre-aggregated data
    • Using aggregations with simulated data

We will also demonstrate the process of iterating on a figure to make it more informative and visually appealing.

Colors & Accessibility

Colors

  • Colors are a powerful tool for communicating information
  • They can be used to highlight important information and create a visual hierarchy

RColorBrewer

  • The RColorBrewer package gives good default color schemes that are accessible (i.e., easy to read for everyone) and aesthetically pleasing.
  • ggplot by default picks out pleasing colors that are distinguishable from each other by the human eye.

Accessible color swatches

Using a Color Design System

  • A design system is a set of rules for how to create a consistent visual language
  • If you are really polishing a figure, you might want to use a color design system
  • To the right is an example drawn from the colors.R file in the examples directory, which implements TailwindCSS’s color palette

Tailwind Color Palette

Example

  • In the examples/colors_and_accessibility.qmd file, we will explore the use of color design systems and perceptually uniform colors