Data Visualization: Day 2

Erik Westlund

2025-06-11

Day 2: AI & LLMs, Data Preparation, and the Grammar of Graphics

Overview For Day 2

AI & LLMs: An Illustration
Data preparation fundamentals
Introduction to the grammar of graphics
Colors & accessibility

AI & LLMs: An Illustration

We’ll use an LLM to help us prepare data for a visualization
We’ll work with the ai_example.qmd file in the examples directory

Data Preparation

Getting Data Ready For Visualization

Before we can do any visualization, we need to load in data and prepare it
Take this step seriously, especially your variable names, as your code depends on it
Workflow options:
1. Clean up the data in a block at the top of your notebook
2. Separate out the data preparation from the visualization code
There are no universal answers to the best approach here; friction as you work often tells you what to do

Option 1: Clean up the data in a block at the top of your notebook

Benefits

Quick and straightforward
All code in one place
Easy for you and others to follow the workflow
Good for exploratory analysis
Minimal file management

Drawbacks

Can become messy with large datasets
Expensive transformations take time to run every time you run the notebook
Harder to reuse code, especially in analyses with multiple steps
Your colleagues may not care about the data preparation

Option 2: Separate out the data preparation from the visualization code

In this approach, you prepare the data in a separate script or notebook
You can then load the data into your visualization notebook
Two options:
1. Run the data preparation script before running the visualization script
2. Run the data preparation script as part of the visualization script
Example of (1) above provided in our examples/project-example directory

Benefits/Drawbacks of separating scripts

Benefits

Data preparation notebook serves as documentation of your data
Data preparation can be reused across projects
Better separation of concerns
Often asier to maintain and update

Drawbacks

More files to manage
Need to ensure data preparation is run in the right order
Can be overkill for simple projects
More complex project structure

Saving Data To Load Later

When using separate scripts, you’ll often want to save a copy of your cleaned data to load later scripts/notebooks
A good option here is use the saveRDS function to save your data as an R data object
This has a few benefits
- RDS files maintain all R metadata
- You can quickly load the data into R with readRDS, which can save you a lot of time when working with large datasets
- RDS files are not easily to accidentally modify

Preparing Data Example

I have created an example of data preparation in the examples/prams_1_data_prep.qmd file
Let’s load this up in RStudio and step through it.
Note that we’ll be opening the entire course repo as a project, so the root directory is one level up from the examples directory
We will return to this data file later to visualize it

The Grammar of Graphics

`ggplot2` and the Grammar of Graphics

The “gg” in ggplot2 stands for the grammar of graphics, a concept coined by Leland Wilkinson and popularized by Hadley Wickham.
Grammar in language is a set of rules for how language is structured.
Grammar in graphics is a set of rules for how visual representations of data are structured.

Key Concepts in ggplot2

ggplot2 workflow

`ggplot2` Workflow

Start with data
Pick an aesthetic mapping
Choose a geometric object
Add statistical transformations
Adjust finer details: scales, coordinate systems, faceting, etc.

Aesthetic Mappings

This is how your variables map onto the aesthetics of your figure
It sounds highfalutin, but it usually just means:
- What is your x?
- What is your y?

Geometric Objects (“geoms”)

Geoms are the elements used to represent the data
Geoms have associated stats/functions/statistical transformations
Stats such as counts for groups often involve aggregations

Geoms (cont.)

Geom	R Function	Stat/Transformation	Use for
Point	`geom_point()`	Identity	Scatter plots
Bar	`geom_bar()`	Count	Bar plots
AB Line	`geom_abline()`	Slope	Line plots
Horizontal line	`geom_hline()`	Y intercept	Reference lines
Vertical line	`geom_vline()`	X intercept	Reference lines
Smoother	`geom_smooth()`	Smoothing function (GAM, Loess, etc.)	Showing patterns/trends

Position/Fill

You can use position/fill to move elements around and color code geoms by aspects of the data, such as categories.
For example, with a bar chart:
- stack: Stack bars on top of each other
- fill: Stack bars, but make them always fill up vertical space to 1
- dodge: Put the bars next to each other
Tip: When you reach for fill, always consider faceting in small multiples. The brain struggles with fills. More on this later.

Aggregations

To make a bar chart, one groups data and takes a mean, proportion, or count
Software like ggplot will try to do this for you and it works well for simple cases
This can fail with more difficult aggregations or when doing things like applying annotation labels
A powerful and frustration-reducing technique is to do the aggregation yourself and then use the identity mapping
This separates the statistical logic from the visual logic and is often what makes a tricky figure easier to make

Examples

Illustration: `ggplot2` Concepts

Let’s load the prams_2_ggplot_concepts.qmd file
In this file we will load the data we prepared in the previous file
We will then work through the key ggplot2 concepts with examples.
We will iterate on the figure to make it more informative and visually appealing

Illustration: Iteration & Aggregation

The PRAMS data we used comes pre-aggregated
In prams_3_iteration_aggregation.qmd we will illustrate how statistical transformations work by showing:
- The identity transformation with pre-aggregated data
- Using aggregations with simulated data

We will also demonstrate the process of iterating on a figure to make it more informative and visually appealing.

Colors & Accessibility

Colors

Colors are a powerful tool for communicating information
They can be used to highlight important information and create a visual hierarchy

RColorBrewer

The RColorBrewer package gives good default color schemes that are accessible (i.e., easy to read for everyone) and aesthetically pleasing.
ggplot by default picks out pleasing colors that are distinguishable from each other by the human eye.

Using a Color Design System

A design system is a set of rules for how to create a consistent visual language
If you are really polishing a figure, you might want to use a color design system
To the right is an example drawn from the colors.R file in the examples directory, which implements TailwindCSS’s color palette

Example

In the examples/colors_and_accessibility.qmd file, we will explore the use of color design systems and perceptually uniform colors

Data Visualization: Day 2

Day 2: AI & LLMs, Data Preparation, and the Grammar of Graphics

Overview For Day 2

AI & LLMs: An Illustration

Data Preparation

Getting Data Ready For Visualization

Option 1: Clean up the data in a block at the top of your notebook

Benefits

Drawbacks

Option 2: Separate out the data preparation from the visualization code

Benefits/Drawbacks of separating scripts

Benefits

Drawbacks

Saving Data To Load Later

Preparing Data Example

The Grammar of Graphics

ggplot2 and the Grammar of Graphics

Key Concepts in ggplot2

ggplot2 Workflow

Aesthetic Mappings

Geometric Objects (“geoms”)

Geoms (cont.)

Position/Fill

Aggregations

Examples

Illustration: ggplot2 Concepts

Illustration: Iteration & Aggregation

Colors & Accessibility

Colors

RColorBrewer

Using a Color Design System

Example

`ggplot2` and the Grammar of Graphics

`ggplot2` Workflow

Illustration: `ggplot2` Concepts