Data Visualization: Day 1

Erik Westlund

2025-06-10

Welcome to Data Visualization

Housekeeping

Who am I?
Contact information
Course materials

Who am I?

I am a data scientist at the Johns Hopkins Bloomberg School of Public Health
I work out of the Johns Hopkins Biostatistics Center in the Department of Biostatistics
I was trained in the social sciences and have worked professionally as a data scientist and software developer for over 10 years

Contact information

Email: ewestlund@jhu.edu

Course materials

Course GitHub repository:
https://github.com/erikwestlund/data-viz-summer-25
CoursePlus Website:
https://courseplus.jhu.edu/core/index.cfm/go/course.home/coid/23902/

Course Goals

Understanding of core data visualization concepts
Develop strong data science & data visualization workflows
Learn to produce high-quality data visualizations
Learn to communicate effectively with and about data visualizations

Course Outline

Introduction to data visualization
Tooling & workflow
Data preparation
Grammar of graphics
Making good, honest graphics
Dashboards

Introduction to Data Visualization

Edward Tufte: Graphical Excellence is…

“Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design…”

Edward Tufte, The Visual Display of Quantitative Information, 1983

Dense

“It is that which gives to the view the great number of ideas in the shortest time with the least ink in the smallest space…”

Edward Tufte, The Visual Display of Quantitative Information, 1983

Multivariate

“It is nearly always multivariate…”

Edward Tufte, The Visual Display of Quantitative Information, 1983

Truthful

“Graphical excellence requires telling the truth about the data…”

Edward Tufte, The Visual Display of Quantitative Information, 1983

Exemplar: Napoleon’s March

Charles Minard’s Napoleon’s March

Achieving Minard’s Graphical Excellence

“[Minard’s classic image] can be described and admired, but there are no compositional principles on how to create that one wonder graphic in a million.”

Edward Tufte, The Visual Display of Quantitative Information, 1983

For The Rest of Us

Instead, Tufte suggests:

“[For] more routine, workaday designs”
“[Have] a properly chosen format and design”
“Use words, numbers, and drawing together”
“[D]isplay an accessible complexity of detail”
“Avoid content-free decoration, including chartjunk”

We will revisit more of Tufte’s principles throughout the course.

Tooling & Workflow

It is worth investing in learning your tools
A good data visualization workflow depends on good tooling and good organization
Below we will discuss some of the tools we will use in this course and why we use them

Required Software For This Course

R

We will rely mostly on R for this course
R can be downloaded from r-project.org

`ggplot2`

ggplot2 is a powerful package for creating data visualizations
It is built on the grammar of graphics
It is a declarative grammar for data visualization
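As a rough illustration of what "declarative" means here, the sketch below describes a plot as data, aesthetic mappings, and a geometry, then adds one more mapping without rewriting anything. It uses the built-in mtcars data; the object names (`p`, `p_by_cyl`) are made up for this example.

```r
library(ggplot2)

# Declare the data, the aesthetic mappings, and a geometry layer
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Change what the plot shows by adding a mapping, not by redrawing it
p_by_cyl <- p + aes(colour = factor(cyl))

p_by_cyl
```

We will cover what each piece means when we get to the grammar of graphics.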

git

git is a powerful tool for version control
It allows you to
- track changes to your code
- revert to previous versions of your code
- collaborate with others on your code
- maintain multiple branches/versions of your code
- and more

GitHub

git is not GitHub
GitHub is a web-based platform for hosting and collaborating on code
Technically, it hosts remote repositories for git
It gives you a place to store your code and collaborate with others
It is free for open source projects

Scientific Notebooks

Notebooks are a powerful way to work with data and do data visualization
They allow you to embed code, text, and visualizations in a single document
They thus allow you to easily share both the process and the results of your work
I do not require a specific notebook system for this course, but I will be using Quarto for examples

Notebooks: Quarto

Quarto is an open source scientific and technical publishing system
You can create reports, websites, presentations, and books with Quarto
This presentation is built with Quarto
You can embed Python, R, and other code in plaintext Quarto documents
Quarto renders down to a document in HTML, PDF, or Word format

RMarkdown

RMarkdown is a way to create documents that mix R code and text
It integrates with RStudio well and has a very similar workflow to Quarto
RMarkdown renders down to a document in HTML, PDF, or Word format; the files themselves are plain text
Easy to store notebooks in version control with git

Jupyter

Jupyter is a notebook system popular with Python users
Jupyter stores code and results in the same document (Quarto/RMarkdown render into a separate document)
Jupyter supports R and other languages
Jupyter notebooks are stored as JSON (JavaScript Object Notation) files, which are not as easy to diff in git

Optional/Popular Software

RStudio

RStudio is a powerful IDE for R
It is free and open source
It helps you understand what is in your environment (e.g., variables, functions, packages, etc.)
It also makes it easy to view your visualizations as you make them

Python

Python is a powerful general purpose programming language that is very popular in the data science community, especially in machine learning
It has A-tier data management and scientific computing libraries, such as pandas and numpy
It has a large ecosystem of packages for data visualization, including matplotlib and seaborn

LLMs

LLMs are commonly used to help with code
Common ones used in data science are ChatGPT, Claude, Gemini, and GitHub’s Copilot
They can help you write code, debug code, and write documentation
They can also make mistakes, so you cannot blindly trust their work

AI, LLMs, and Data Visualization

AI and Data Visualization

AI and LLMs are becoming more and more powerful
They can help you with many data-related tasks, but require care
They are allowed in this course, but you are responsible for checking their work

My Philosophy

I use LLMs in nearly all aspects of my work
I have found that there is now less value in being able to “make computer do something” and more in high level concepts
To that end, in this course we will try to focus a little more on concepts and less on ggplot2 syntax, since LLMs really can mostly solve technical visualization problems

Prerequisites for Today

Tools To Install

R: Install at r-project.org
RStudio: Install at rstudio.com
Git: Install at git-scm.com

Accounts To Create

GitHub: Create an account at github.com

Setup Tasks

You may need to configure your git username and email.

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

On Windows, you can run this in “Git Bash”. On Mac, you can run this in the terminal.

Project Organization & Tooling

Before We Do Actual Visualization

I am spending a lot of time on tools and organization
This may seem off-topic, but I assure you it is not
Efficient workflows and reproducible work are key to success in data science

Benefits Of Being Organized

Efficiency: Find files quickly
Reproducibility: Others can follow your work
Collaboration: Team members understand your structure
Maintenance: Easier to update and fix issues
Scalability: Structure grows with your project

Naming Convention Challenges

Files need to be:
- Easy to find
- Easy to understand
- Easy to sort
- Easy to version control
Common problems:
- Unclear file purposes
- Inconsistent naming
- Missing execution order
- Lost files in nested directories
- Confusion about latest versions

Course Naming Conventions

I recommend these conventions:

work/
├── example_*         # Learning examples
├── ps1_*             # Problem set 1
├── ps2_*             # Problem set 2
├── ps3_*             # Problem set 3
├── final_project_*   # Final project
├── shared_*          # Shared resources
└── data/             # Data directory

Prefixes indicate purpose
Numbers show execution order
It is a best practice to document your structure in a README
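A small sketch of why prefixes and numbers help: plain alphabetical sorting already groups files by purpose and puts them in execution order. The file names below are hypothetical examples that follow the convention; they are not files from the course repository.

```r
# Hypothetical file names following the prefix + number convention
files <- c(
  "example_cars_2_analysis.qmd",
  "ps1_2_plots.qmd",
  "example_cars_1_data_prep.qmd",
  "ps1_1_data_prep.qmd"
)

# Alphabetical sorting groups by prefix and orders by step number
sorted_files <- sort(files)

# Pull out everything belonging to one piece of work
example_files <- grep("^example_", sorted_files, value = TRUE)
example_files
```

In a real project, `list.files()` would take the place of the hand-written vector.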

Real-World Flexibility

These conventions are a starting point
Real projects often need:
- Different structures for different teams
- Adaptation as projects grow
- Balance between consistency and flexibility
- Documentation of changes

Directory Structure Options

Flat Structure:

work/
├── data_prep.qmd
├── analysis.qmd
└── data/

Nested Structure:

work/
├── data_prep/
│   └── data_prep.qmd
├── analysis/
│   └── analysis.qmd
└── data/

The Challenge of Subdirectories

Relative Paths are Tricky!

Files in subdirectories need to reference parent directories
Paths like ../../data/file.csv are:
- Error-prone
- Hard to maintain
- Break when moving files
- Confusing to read
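The problem can be reproduced in a few lines of base R. This sketch builds a throwaway project in a temporary directory (the folder names `demo_project`, `data`, and `analysis` are made up) and shows that the same relative path works or fails depending on the working directory.

```r
# Build a throwaway project: <root>/data/file.csv and <root>/analysis/
root <- file.path(tempdir(), "demo_project")
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "analysis"), showWarnings = FALSE)
write.csv(head(mtcars), file.path(root, "data", "file.csv"))

# From the project root, the short relative path works
old_wd <- setwd(root)
found_from_root <- file.exists("data/file.csv")

# From a subdirectory, the same path no longer points at the file
setwd(file.path(root, "analysis"))
found_from_subdir <- file.exists("data/file.csv")
found_with_dots <- file.exists("../data/file.csv")

# Restore the original working directory
setwd(old_wd)

c(found_from_root, found_from_subdir, found_with_dots)
```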

The `.Rproj` File

Defines your project root
Sets the working directory
Stores project settings
Makes paths relative to project root
Essential for project portability

The `here` Package

library(here)
library(readr)  # provides read_csv()

# Instead of "../../data/file.csv"
read_csv(here("data", "file.csv"))

Builds paths from project root
Works from any subfolder
More readable than ../
More maintainable
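To see what `here()` is doing, it can help to print the paths it builds. A minimal sketch, assuming the here package is installed; the file `data/file.csv` does not need to exist for the path to be constructed.

```r
library(here)

# The project root that here() detected for this session
root <- here()

# A path built from the root, independent of the current working directory
csv_path <- here("data", "file.csv")

root
csv_path
```

If the root is not what you expect, `here::dr_here()` prints the reason that directory was chosen.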

Benefits of Good Path Management

Paths work from anywhere
No more ../ counting
More maintainable
More portable
Easier collaboration

When to Change Structure?

As project complexity grows
When working with multiple datasets
When collaborating with others
When sharing code with different audiences

Getting Started with RStudio, Projects, Quarto, and `git`

Goals

Learn the data science workflow: RStudio, projects, Quarto, and git
Set up your course workspace
Create your first notebook

Fork/Open the Course Repository

Fork the course repo on GitHub
In RStudio: File -> New Project -> From Version Control
Paste your fork URL and create

Exploring the Example Project

Open examples/project-example/
Notice the structure:
- _quarto.yml for configuration
- Numbered notebooks
- data/ directory
- .gitignore

Setting Up Your Work Directory

Create new project in work/ using RStudio (File -> New Project -> Existing Directory)
Select the “work” directory in the repository we just forked

I have tested this, and RStudio correctly handles the git repository located one directory up.

Your Work Directory

Important!

The work/ directory is your personal workspace for everything in this course:

All homework assignments
Course projects
Learning examples
Your final project

You are responsible for:

Keeping your work organized
Following the naming conventions
Maintaining a clean project structure
Documenting your organization in README

This is your space - keep it clean and organized!

Configuring Your Project (1/2)

Let’s get a file set up to work with Quarto and have data to read from.

Copy _quarto.yml from examples to _quarto.yml in your new project
Create data/ directory

Configuring Your Project (2/2)

We are now going to create a file called .gitignore to tell git to ignore certain files.

Create a file called .gitignore (if it doesn’t already exist)

Understanding `.gitignore`

Never Commit Sensitive Data!

.gitignore tells Git which files to ignore
Critical for:
- Protecting sensitive data
- Preventing accidental commits
- Keeping repositories clean

Why `.gitignore` Matters

Data Privacy:
- Health data is sensitive
- Patient information must be protected
- Legal requirements (HIPAA, etc.)
- Ethical obligations
Repository Health:
- Prevents large binary files
- Avoids temporary files
- Keeps repository size manageable
- Makes collaboration easier

Our `.gitignore` Setup

In the course-wide .gitignore file in the repository, you will see these lines:

work/data/*


work/.Rproj.user/
work/.Rhistory
work/.RData

work/data/*: Keeps all data files in the work directory local
work/.Rproj.user: RStudio temporary files
work/.Rhistory: Command history
work/.RData: R workspace files

Creating Your First Notebook (1/4)

In RStudio:

Click File -> New File -> Quarto Document
In the dialog:
- Title: “Data Preparation Example”
- Author: Your Name
- Format: HTML
- Template: Default
Click “Create”
Save as example_cars_1_data_prep.qmd in your work/ directory

Notebook Setup (2/4)

In RStudio:

The YAML header will be automatically created at the top of your file
It will look like this:

---
title: "Data Preparation Example"
format: html
---

If your header also contains an editor: visual line, you can remove it – we’re going to try to work with text.

Data Preparation (3/4)

Let’s create the data preparation setup file.

Include this as a setup block:

```{r setup-prep}
#| echo: false
#| message: false
library(dplyr)
library(readr)
library(stringr)
library(ggplot2)
```

And then this to load and prepare the data:

```{r load-data}
# Load and prepare data
mtcars_clean <- mtcars |>
  mutate(
    car_name = rownames(mtcars),
    make = word(car_name, 1),  # First word is make
    model = str_remove(car_name, paste0(make, " ")),  # Rest is model
    efficiency = mpg / wt
  )

# Save processed data
write_csv(mtcars_clean, "data/mtcars_clean.csv")
```
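To make the make/model split less mysterious, here is a small sketch of what `word()` and `str_remove()` do on three car names taken from mtcars. The vector names are only for illustration.

```r
library(stringr)

car_name <- c("Mazda RX4 Wag", "Hornet 4 Drive", "Valiant")

# word(x, 1) returns the first space-separated word
make <- word(car_name, 1)

# Removing "<make> " from the start leaves the rest of the name;
# one-word names have nothing to remove, so they are left unchanged
model <- str_remove(car_name, paste0(make, " "))

make   # "Mazda"   "Hornet"  "Valiant"
model  # "RX4 Wag" "4 Drive" "Valiant"
```

Note that a one-word name such as "Valiant" ends up as both make and model, which may or may not be what you want in a real analysis.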

Render the File (4/4)

Render the file to see the results (click the “Render” button above the editor)
We will get two outputs:
1. A rendered HTML file
2. A data file stored in the data/ directory

Create an Analysis Notebook (1/2)

In RStudio:

Click File -> New File -> Quarto Document
In the dialog:
- Title: “Data Analysis Example”
- Author: Your Name
- Format: HTML
- Template: Default
Click “Create”
Save as example_cars_2_analysis.qmd in your work/ directory

Set Up the Analysis Notebook (2/2)

```{r setup-analysis}
#| echo: false
#| message: false
library(dplyr)
library(readr)
library(ggplot2)
library(forcats)
```

```{r load-processed}
# Load processed data
df <- read_csv("data/mtcars_clean.csv")
df |> head()
```
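One reason the preparation step copies the row names into a car_name column is that `write_csv()` does not write row names. The sketch below shows this with a round trip through a temporary file, so it does not depend on the course data/ directory; `tmp`, `prepared`, and `loaded` are illustrative names.

```r
library(readr)

# Keep the car names as a real column before writing
prepared <- mtcars
prepared$car_name <- rownames(mtcars)

tmp <- tempfile(fileext = ".csv")
write_csv(prepared, tmp)
loaded <- read_csv(tmp, show_col_types = FALSE)

# The row names are gone after the round trip, but the column survives
names_kept_as_rownames <- "Mazda RX4" %in% rownames(loaded)
names_kept_as_column <- "Mazda RX4" %in% loaded$car_name

# A quick sanity check that the expected columns and rows arrived
stopifnot(all(c("car_name", "mpg", "wt") %in% names(loaded)))
stopifnot(nrow(loaded) == nrow(mtcars))
```

A check like the last two lines at the top of an analysis notebook can catch a missing or stale data file early.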

Version Control – Save our Work (1/2)

You have:

Forked the course repository from GitHub
Cloned your fork to your local machine

Now, let’s set up version control in your project.

Option 1: Using RStudio’s Git Interface

In RStudio, go to Tools → Version Control → Project Setup
Select “Git” as the version control system
Click “Create Repository”
In the Git tab (usually in the top-right panel):
- Click “Add” to stage all files (or select individual files)
- Enter commit message: “Initial project setup”
- Click “Commit”
- Click “Push” to sync with your fork

Option 2: Using Terminal Commands

In the terminal:

git add .
git commit -m "Initial project setup"
git push

Version Control Workflow

Option A: Using RStudio’s Git Interface

Stage changes: Click “Add” in the Git tab
Commit: Enter message and click “Commit”
Push: Click “Push” to sync with your fork

Option B: Using Terminal Commands

Stage changes: git add .
Commit: git commit -m "Description of changes"
Push: git push

Choose whichever method you’re most comfortable with as both accomplish the same thing!

Summarize

We are now at a point where we have a clean, reproducible workflow
We’ve created a two-step workflow for data preparation and analysis
We’ve committed our code and pushed it to GitHub
This workflow puts you ahead of the curve

Homework Assignment

Problem Set 1 is due tomorrow; see the assignment
Your job is to create a bar chart, scatter plot, and histogram of data from any data set (you can use the mtcars data set if you want)
You should comment on what the purpose of the plot is and what it communicates (a few sentences each is fine)
You should follow the workflow you learned today
Give the files sensible names, commit them to GitHub, and email me the link to your repository
Problems with GitHub? Email me your answers.

Visualization Workflow Preview

Let’s create a few very, very simple plots
We’re not going to get into the details of the code until tomorrow
For now, let’s just put a couple points on the scoreboard

Code Blocks

From here on out, it’s up to you to create the code blocks, such as below:

```{r}

# Code goes here

```

Histogram (Base R)

Histograms are easy to plot using base R or ggplot2
Base R:

hist(df$mpg)

Histogram (ggplot2)

In our course, we will focus on ggplot2
Tomorrow we will explicate the logic behind the “gg”, or grammar of graphics

ggplot(df, aes(x = mpg)) +
  geom_histogram()

Histogram (ggplot2) notes

Note how the Base R and ggplot2 versions differ: what do you notice?
What might we do to improve the ggplot2 version?
A histogram is a workhorse “utility” plot. When is it worth “polishing” it?

Histogram polishing (but only a little bit)

ggplot(df, aes(x = mpg)) +
  geom_histogram(binwidth = 5) +
  theme_minimal() + 
  labs(
    title = "Distribution of MPG",
    x = "MPG",
    y = "Count"
  )
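If you are curious what `binwidth = 5` actually does, you can ask ggplot2 for the bins it computed. This is a sketch that uses the built-in mtcars data in place of df so it runs on its own; `layer_data()` is a ggplot2 helper that returns the data behind a layer.

```r
library(ggplot2)

hist_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)

# One row per bin: its lower edge, upper edge, and the number of cars in it
bins <- layer_data(hist_plot, 1)
bins[, c("xmin", "xmax", "count")]
```

Changing binwidth and re-running this is a quick way to see how much the picture depends on that one choice.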

Bar Chart v1

# Count cars by make
bar_plot_v1 <- df |>
  ggplot(aes(x = make)) +
  geom_bar()

bar_plot_v1

What’s wrong with this?

Possible Bar Chart Improvements

coord_flip(): Flip the x and y axes
theme_minimal(): Use a minimal theme
labs(): Add labels to the plot

bar_plot_v2 <- df |>
  ggplot(aes(x = make)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Number of Cars by Make",
    x = "Make",
    y = "Count"
  )

Bar Chart v2

bar_plot_v2

Possible Bar Chart v2 Improvements

Sort by count

bar_plot_v3 <- df |>
  count(make) |>
  mutate(make = fct_reorder(make, n)) |>
  ggplot(aes(x = make, y = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Number of Cars by Make",
    x = "Make",
    y = "Count"
  )
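The sorting comes from `fct_reorder()`, which reorders the factor levels of one variable by the values of another. A tiny sketch with made-up counts:

```r
library(forcats)

make <- c("Merc", "Fiat", "Toyota")
n <- c(7, 2, 3)  # made-up counts for illustration

# Levels are ordered from the smallest n to the largest
make_sorted <- fct_reorder(make, n)
levels(make_sorted)  # "Fiat" "Toyota" "Merc"
```

ggplot2 draws bars in level order, so after coord_flip() the largest count ends up at the top.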

Bar Chart v3

bar_plot_v3

Bar Chart v3 Improvements

The title is not very informative
We don’t need Y axis labels
Remove the horizontal lines
We can annotate the plot with the number of cars and remove the vertical lines

bar_plot_v4 <- bar_plot_v3 +
  labs(
    title = "Distribution of Car Makes in the mtcars Dataset",
    x = NULL, # Note that because of coord_flip(), x is now the y axis
    y = "Number of Cars in Fleet"
  ) +
  # theme() elements follow the displayed axes, so the horizontal gridlines are "y"
  theme(panel.grid.major.y = element_blank())

Bar Chart v4

bar_plot_v4

Bar Chart Thoughts

We see the workflow of building good visualizations with ggplot2
We also see that this figure is still not perfect. We could consider:
- Making it colorful
- Making the title bold
- Adding a subtitle
- Using text annotations
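One possible take on those ideas is sketched below. It is an illustration rather than the "right" answer: the colour, the subtitle wording, and the name `bar_plot_v5` are my own choices, and the data is rebuilt from mtcars so the block runs on its own.

```r
library(dplyr)
library(stringr)
library(forcats)
library(ggplot2)

# Rebuild the counts by make directly from mtcars
make_counts <- mtcars |>
  mutate(make = word(rownames(mtcars), 1)) |>
  count(make) |>
  mutate(make = fct_reorder(make, n))

bar_plot_v5 <- make_counts |>
  ggplot(aes(x = make, y = n)) +
  geom_col(fill = "steelblue") +
  # Text annotations replace the count axis
  geom_text(aes(label = n), hjust = -0.3) +
  # Leave room on the right so the labels are not clipped
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  coord_flip() +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid = element_blank(),
    axis.text.x = element_blank()
  ) +
  labs(
    title = "Distribution of Car Makes in the mtcars Dataset",
    subtitle = "Number of cars per make",
    x = NULL,
    y = NULL
  )

bar_plot_v5
```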

Scatterplot v1

scatter_plot <- df |>
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()

scatter_plot

Scatterplot Improvements

We can add a trend line
We could soften up the gray background
We could add a title and labels

Scatterplot v2

scatter_plot_v2 <- scatter_plot +
  geom_smooth(method = "lm") + 
  theme_minimal() + 
  labs(
    title = "Relationship between Weight and MPG",
    x = "Weight (1000 lbs)",
    y = "MPG"
  )

Scatterplot v2

scatter_plot_v2

Scatterplot thoughts

This is looking better
In a real analysis, we may want to annotate certain points or facet by some other variable
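As a sketch of both ideas, the block below facets the scatterplot by number of cylinders and labels the single car with the highest mpg. It uses mtcars directly so it runs on its own; the choice of which point to annotate is arbitrary.

```r
library(ggplot2)

cars <- mtcars
cars$car_name <- rownames(mtcars)

# The one point we want to call out
most_efficient <- cars[which.max(cars$mpg), ]

scatter_plot_v3 <- ggplot(cars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Label only the highlighted car; it appears in its own cylinder panel
  geom_text(data = most_efficient, aes(label = car_name), vjust = -1) +
  facet_wrap(~ cyl) +
  theme_minimal() +
  labs(
    title = "Relationship between Weight and MPG, by Cylinders",
    x = "Weight (1000 lbs)",
    y = "MPG"
  )

scatter_plot_v3
```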

Fancier Plotting

If you look at our data preparation code, we created an efficiency variable
What if we tried to compare efficiency across makes? How might we proceed?

efficiency_by_make <- df |>
  group_by(make) |>
  summarise(avg_efficiency = mean(efficiency)) |>
  mutate(make = fct_reorder(make, avg_efficiency)) |>
  ggplot(aes(x = make, y = avg_efficiency)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank()) +
  labs(
    title = "Average Fuel Efficiency by Make",
    x = NULL,
    y = "Average Efficiency (MPG/1000 lbs)"
  )

Efficiency By Make Plot

efficiency_by_make
