Data Visualization: Day 1

Erik Westlund

2025-06-10

Welcome to Data Visualization

Housekeeping

  • Who am I?
  • Contact information
  • Course materials

Who am I?

  • I am a data scientist at the Johns Hopkins Bloomberg School of Public Health
  • I work out of the Johns Hopkins Biostatistics Center in the Department of Biostatistics
  • I was trained in the social sciences and have worked profesionally as a data scientist and software developer for over 10 years

Contact information

  • Email: ewestlund@jhu.edu

Course materials

Course Goals

  • Understanding of core data visualization concepts
  • Develop strong data science & data visualization workflows
  • Learn to produce high-quality data visualizations
  • Learn to communicate effectively with and about data visualizations

Course Outline

  • Introduction to data visualization
  • Tooling & worfklow
  • Data preparation
  • Grammar of graphics
  • Making good, honest graphics
  • Dashboards

Introduction to Data Visualization

Edward Tufte: Graphical Excellence is….

Edward Tufte

“Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design…”

Edward Tufte, The Visual Display of Quantitative Information, 1983

Dense

Edward Tufte

“It is that which gives to the view the great number of ideas in the shortest time with the least ink in the smallest space…”

Edward Tufte, The Visual Display of Quantitative Information, 1983

Multivariate

Edward Tufte

“It is nearly always multivariate…”

Edward Tufte, The Visual Display of Quantitative Information, 1983

Truthful

Edward Tufte

“Graphical excellence requires telling the truth about the data…”

Edward Tufte, The Visual Display of Quantitative Information, 1983

Exemplar: Napoleon’s March

Charles Minard’s Napoleon’s March

Achieving Minard’s Graphical Excellence

“[Minard’s classic image] can be described and admired, but there are no compositional principles on how to create that one wonder graphic in a million.””

Edward Tufte, The Visual Display of Quantitative Information, 1983

For The Rest of Us

Instead, Tufte suggests:

  • “[For] more routine, workaday designs”
  • “[Have] a properly chosen format and design”
  • “Use words, numbers, and drawing together”
  • “[D]isplay an accessible complexity of detail”
  • “Avoid content-free decoration, including chartjunk”

We will revisit more of Tufte’s principles throughout the course.

Tooling & Workflow

  • It is worth investing in learning your tools
  • A good data visualization workflow requires good tooling and workflow
  • Below we will discuss some of the tools we will use in this course and why we use them

Required Software For This Course

R

  • We will rely mostly on R for this course
  • R can be downloaded from r-project.org

R Logo

ggplot2

  • ggplot2 is a powerful package for creating data visualizations
  • It is built on the grammar of graphics
  • It is a declarative grammar for data visualization

ggplot2 Logo

git

  • git is a powerful tool for version control
  • It allows you to
    • track changes to your code
    • revert to previous versions of your code
    • collaborate with others on your code
    • maintain multiple branches/versions of your code
    • and more

git Logo

GitHub

  • git is not GitHub
  • GitHub is a web-based platform for hosting and collaborating on code
  • It is technically a remote repository for git
  • It gives you a place to store your code and collaborate with others
  • It is free for open source projects

GitHub Logo

Scientific Notebooks

  • Notebooks are a powerful way to work with data and do data visualization
  • They allow you to embed code, text, and visualizations in a single document
  • They thus allow you to easily share both the process and the results of your work
  • I do not require a specific notebook system for this course, but I will be using Quarto for examples

Notebook Logo

Notebooks: Quarto

  • Quarto is an open source scientific and technical publishing system
  • You can create reports, websites, presentations, and books with Quarto
  • This presentation is built with Quarto
  • You can embed Python, R, and other code in in plaintext Quarto documents
  • Quarto renders down to a document in HTML, PDF, or Word format

Quarto Logo

RMarkdown

  • RMarkdown is a way to create documents that mix R code and text
  • It integrates with RStudio well and has a very similar workflow to Quarto
  • RMarkdown renders down to a document in HTML, PDF, or Word format; the files them selves are plain text
  • Easy to store notebooks in version control with git

RMarkdown Logo

Jupyter

  • Jupyter is a notebook system popular with Python users
  • Jupyter stores code and results in the same document (Quarto/RMarkdown render into a separate document)
  • Jupyter supports R and other languages
  • Jupyter stores itself as JSON (javascript object notation) files and are not as easy to diff in git

Jupyter Logo

Optional/Popular Software

RStudio

  • RStudio is a powerful IDE for R
  • It is free and open source
  • It helps you understand what is in your environment (e.g., variables, functions, packages, etc.)
  • It also makes it easy to view your visualizations as you make them

RStudio Logo

Python

  • Python is a powerful general purpose programming language that is very popular in the data science community, especially in machine learning
  • It has A-tier data management and scientific computing libraries, such as pandas and numpy
  • It has a large ecosystem of packages for data visualization, including matplotlib and seaborn

Python Logo

LLMs

  • LLMs are commonly used to help with code
  • Common ones used in data science are ChatGPT, Claude, Gemini, and GitHub’s Copilot
  • They can help you write code, debug code, and write documentation
  • They can also make mistakes, so you cannot blindly trust their work

GitHub Copilot Logo

AI, LLMs, and Data Visualization

AI and Data Visualization

  • AI and LLMs are becoming more and more powerful
  • They can help you with many data-related tasks, but require care
  • They are allowed in this course, but you are responsible for checking their work

My Philosophy

  • I use LLMs in nearly all aspects of my work
  • I have found that there is now less value in being able to “make computer do something” and more in high level concepts
  • To that extent, in this course we will try to focus a little more on concepts and less on ggplot2 syntax, since LLMs really can mostly solve technical visualization problems

Prerequisites for Today

Tools To Install

Accounts To Create

Setup Tasks

You may need to configure your git username and email.

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

On Windows, you can run this in “Git Bash”. On Mac, you can run this in the terminal.

Project Organization & Tooling

Before We Do Actual Visualization

  • I am spending a lot of time on tools and organization
  • This may seem off-topic, but I assure you it is not
  • Efficient workflows and reproducible work are key to success in data science

Benefits Of Being Organized

  • Efficiency: Find files quickly
  • Reproducibility: Others can follow your work
  • Collaboration: Team members understand your structure
  • Maintenance: Easier to update and fix issues
  • Scalability: Structure grows with your project

Naming Convention Challenges

  • Files need to be:
    • Easy to find
    • Easy to understand
    • Easy to sort
    • Easy to version control
  • Common problems:
    • Unclear file purposes
    • Inconsistent naming
    • Missing execution order
    • Lost files in nested directories
    • Confusion about latest versions

Course Naming Conventions

I recommend these conventions:

work/
├── example_*         # Learning examples
├── ps1_*             # Problem set 1
├── ps2_*             # Problem set 2
├── ps3_*             # Problem set 3
├── final_project_*   # Final project
├── shared_*          # Shared resources
└── data/             # Data directory
  • Prefixes indicate purpose
  • Numbers show execution order
  • It is a best practice to document your structure in a README

Real-World Flexibility

  • These conventions are a starting point
  • Real projects often need:
    • Different structures for different teams
    • Adaptation as projects grow
    • Balance between consistency and flexibility
    • Documentation of changes

Directory Structure Options

  1. Flat Structure:

    work/
    ├── data_prep.qmd
    ├── analysis.qmd
    └── data/
  2. Nested Structure:

    work/
    ├── data_prep/
    │   └── data_prep.qmd
    ├── analysis/
    │   └── analysis.qmd
    └── data/

The Challenge of Subdirectories

Relative Paths are Tricky!

  • Files in subdirectories need to reference parent directories
  • Paths like ../../data/file.csv are:
    • Error-prone
    • Hard to maintain
    • Break when moving files
    • Confusing to read

The .Rproj File

  • Defines your project root
  • Sets the working directory
  • Stores project settings
  • Makes paths relative to project root
  • Essential for project portability

The here Package

library(here)

# Instead of "../../data/file.csv"
read_csv(here("data", "file.csv"))
  • Builds paths from project root
  • Works from any subfolder
  • More readable than ../
  • More maintainable

Benefits of Good Path Management

  • Paths work from anywhere
  • No more ../ counting
  • More maintainable
  • More portable
  • Easier collaboration

When to Change Structure?

  • As project complexity grows
  • When working with multiple datasets
  • When collaborating with others
  • When sharing code with different audiences

Getting Started with RStudio, Projects, Quarto, and git

Goals

  • Learn the data science workflow: RStudio, projects, Quarto, and git
  • Set up your course workspace
  • Create your first notebook

Fork/Open the Course Repository

  1. Fork the course repo on GitHub
  2. In RStudio: File -> New Project -> From Version Control
  3. Paste your fork URL and create

Exploring the Example Project

  1. Open examples/project-example/
  2. Notice the structure:
    • _quarto.yml for configuration
    • Numbered notebooks
    • data/ directory
    • .gitignore

Setting Up Your Work Directory

  1. Create new project in work/ using RStudio (File -> New Project -> Existing Directory)
  2. Select the “work” directory in the repository we just forked

I have tested this and RStudio handles the remote repository in the directory one higher up.

Your Work Directory

Important!

The work/ directory is your personal workspace for everything in this course:

  • All homework assignments
  • Course projects
  • Learning examples
  • Your final project

You are responsible for:

  • Keeping your work organized
  • Following the naming conventions
  • Maintaining a clean project structure
  • Documenting your organization in README

This is your space - keep it clean and organized!

Configuring Your Project (1/2)

Let’s get a file set up to work with Quarto and have data to read from.

  1. Copy _quarto.yml from examples to _quarto.yml in your new project
  2. Create data/ directory

Configuring Your Project (2/2)

We are now going to create a file called .gitignore to tell git to ignore certain files.

  1. Create a file called .gitignore (if it doesn’t already exist)

Understanding .gitignore

Never Commit Sensitive Data!

  • .gitignore tells Git which files to ignore
  • Critical for:
    • Protecting sensitive data
    • Preventing accidental commits
    • Keeping repositories clean

Why .gitignore Matters

  • Data Privacy:
    • Health data is sensitive
    • Patient information must be protected
    • Legal requirements (HIPAA, etc.)
    • Ethical obligations
  • Repository Health:
    • Prevents large binary files
    • Avoids temporary files
    • Keeps repository size manageable
    • Makes collaboration easier

Our .gitignore Setup

With the course-wide .gitiginore repository file, you will see these lines:

work/data/*


work/.Rproj.user/
work/.Rhistory
work/.RData
  • work/data/*: Keeps all data files in the work directory local
  • work/.Rproj.user: RStudio temporary files
  • work/.Rhistory: Command history
  • work/.RData: R workspace files

Creating Your First Notebook (1/4)

In RStudio:

  • Click File -> New File -> Quarto Document
  • In the dialog:
    • Title: “Data Preparation Example”
    • Author: Your Name
    • Format: HTML
    • Template: Default
  • Click “Create”
  • Save as example_cars_1_data_prep.qmd in your work/ directory

Notebook Setup (2/4)

In RStudio:

  • The YAML header will be automatically created at the top of your file
  • It will look like this:
---
title: "Data Preparation Example"
format: html
---

You can remove the editor: visual line – we’re going to try to work with text.

Data Preparation (3/4)

Let’s create the data preparation setup file.

Include this as a setup block:

```{r setup-prep}
#| echo: false
#| message: false
library(dplyr)
library(readr)
library(stringr)
library(ggplot2)
```

And then this to load and prepare the data:

```{r load-data}
# Load and prepare data
mtcars_clean <- mtcars |>
  mutate(
    car_name = rownames(mtcars),
    make = word(car_name, 1),  # First word is make
    model = str_remove(car_name, paste0(make, " ")),  # Rest is model
    efficiency = mpg / wt
  )

# Save processed data
write_csv(mtcars_clean, "data/mtcars_clean.csv")
```

Render the File (4/4)

  • Render the file to see the results (click the “Render” button above the editor)

  • We will get two outputs:

    1. A rendered HTML file
    2. A data file stored in the data/ directory

Create an Analysis Notebook (1/2)

In RStudio:

  • Click File -> New File -> Quarto Document
  • In the dialog:
    • Title: “Data Analysis Example”
    • Author: Your Name
    • Format: HTML
    • Template: Default
  • Click “Create”
  • Save as example_cars_2_analysis.qmd in your work/ directory

Set Up the Analysis Notebook (2/2)

```{r setup-analysis}
#| echo: false
#| message: false
library(dplyr)
library(readr)
library(ggplot2)
library(forcats)
```
```{r load-processed}
# Load processed data
df <- read_csv("data/mtcars_clean.csv")
df |> head()
```

Version Control – Save our Work (1/2)

You have:

  1. Forked the course repository from GitHub
  2. Cloned your fork to your local machine

Now, let’s set up version control in your project.

Option 1: Using RStudio’s Git Interface

  1. In RStudio, go to Tools → Version Control → Project Setup
  2. Select “Git” as the version control system
  3. Click “Create Repository”
  4. In the Git tab (usually in the top-right panel):
    • Click “Add” to stage all files (or select individual files)
    • Enter commit message: “Initial project setup”
    • Click “Commit”
    • Click “Push” to sync with your fork

Option 2: Using Terminal Commands

  1. In the terminal:

    git add .
    git commit -m "Initial project setup"
    git push

Version Control Workflow

Option A: Using RStudio’s Git Interface

  1. Stage changes: Click “Add” in the Git tab
  2. Commit: Enter message and click “Commit”
  3. Push: Click “Push” to sync with your fork

Option B: Using Terminal Commands

  1. Stage changes: git add .
  2. Commit: git commit -m "Description of changes"
  3. Push: git push

Choose whichever method you’re most comfortable with as both accomplish the same thing!

Summarize

  • We are now at a point where we have a clean, reproducible workflow
  • We’ve created a two-step workflow for data preparation and analysis
  • We’ve committed our code to GitHub
  • This workflow puts you ahead of the curve

Homework Assignment

  • Problem Set 1 is due tomorrow; see the assignment
  • Your job is to create a bar chart, scatter plot, and histogram of data from any data set (you can use the mtcars data set if you want)
  • You should comment on what the purpose of the plot is and what it communicates (a few sentences each is fine)
  • You should follow the workflow you learned today
  • Give the files sensible names, commit them to GitHub, and email me the link to your repository
  • Problems with GitHub? Email me your answers.

Visualization Workflow Preview

  • Let’s create a few very, very simple plots
  • We’re not going to get into the details of the code until tomorrow
  • For now, let’s just put a couple points on the scoreboard

Code Blocks

From here on out, it’s up to you to create the code blocks, such as below:

```{r}

# Code goes here

```

Histogram (Base R)

  • Histograms are easy to plot using base R or ggplot2
  • Base R:
hist(df$mpg)

Histogram (ggplot2)

  • In our course, we will focus on ggplot2
  • Tomorrow we will explicate the logic behind the gg or grammar of graphics
ggplot(df, aes(x = mpg)) +
  geom_histogram()

Histogram (ggplot2) notes

  • Note how the Base R and ggplot2 versions differ: what do you notice?
  • What might we do to improve the ggplot2 version?
  • A histogram is a workhourse “utility” plot. When is it worth “polishing” it?

Histogram polishing (but only a little bit)

ggplot(df, aes(x = mpg)) +
  geom_histogram(binwidth = 5) +
  theme_minimal() + 
  labs(
    title = "Distribution of MPG",
    x = "MPG",
    y = "Count"
  )

Bar Chart v1

# Count cars by make
bar_plot_v1 <- df |>
  ggplot(aes(x = make)) +
  geom_bar()

bar_plot_v1

What’s wrong with this?

Possible Bar Chart Improvements

  • coord_flip(): Flip the x and y axes
  • theme_minimal(): Use a minimal theme
  • labs(): Add labels to the plot
bar_plot_v2 <- df |>
  ggplot(aes(x = make)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Number of Cars by Make",
    x = "Make",
    y = "Count"
  )

Bar Chart v2

bar_plot_v2

Possible Bar Chart v2 Improvements

  • Sort by count
bar_plot_v3 <- df |>
  count(make) |>
  mutate(make = fct_reorder(make, n)) |>
  ggplot(aes(x = make, y = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Number of Cars by Make",
    x = "Make",
    y = "Count"
  )

Bar Chart v3

bar_plot_v3

Bar Chart v3 Improvements

  • The title is not very informative
  • We don’t need Y axis labels
  • Remove the horizontal lines
  • We can annotate the plot with the number of cars and remove the vertical lines
bar_plot_v4 <- bar_plot_v3 +
  labs(
    title = "Distribution of Car Makes in the mtcars Dataset",
    x = NULL, # Note that because of coord_flip(), x is now the y axis
    y = "Number of Cars in Fleet"
  ) +
  theme(panel.grid.major.x = element_blank())

Bar Chart v4

bar_plot_v4

Bar Chart Thoughts

  • We see the workflow of building good visualizations with ggplot2
  • We also see that this figure is still not perfect. We could consider:
    • Making it colorful
    • Making the title bold
    • Adding a subtitle
    • Using text annotations

Scatterplot v1

scatter_plot <- df |>
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()

scatter_plot

Scatterplot Improvements

  • We can add a trend line
  • We could soften up the gray background
  • We could add a title and labels

Scatterplot v2

scatter_plot_v2 <- scatter_plot +
  geom_smooth(method = "lm") + 
  theme_minimal() + 
  labs(
    title = "Relationship between Weight and MPG",
    x = "Weight (1000 lbs)",
    y = "MPG"
  )

Scatterplot v2

scatter_plot_v2

Scatterplot thoughts

  • This is looking better
  • In a real analysis, we may want to annotate certain points or facet by some other variable

Fancier Plotting

  • If you look at our code set up, we created an efficiency variable
  • What if we tried to compare efficiency across makes? How might we proceed?
efficiency_by_make <- df |>
  group_by(make) |>
  summarise(avg_efficiency = mean(efficiency)) |>
  mutate(make = fct_reorder(make, avg_efficiency)) |>
  ggplot(aes(x = make, y = avg_efficiency)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank()) +
  labs(
    title = "Average Fuel Efficiency by Make",
    x = NULL,
    y = "Average Efficiency (MPG/1000 lbs)"
  )

Efficiency By Make Plot

efficiency_by_make