Introduction to ggplot and data tables
A PIG-PARADIGM workshop

Mani Arumugam

University of Copenhagen

2025-05-06

Visualizing data from tables

Graphs from tables

Imagine an experiment:

treatment and control groups
protein concentration as time-series

Graphs from tables

And you want to plot it as two lines using Excel

Graphs from tables

R gives you a professional-looking figure

How do we get there?

Tables in R

Read a table in R

If this table is in dummy.csv:

[bash]$ cat dummy.csv
sample_id,condition,time,concentration
S1,control,1,5.1
S2,control,2,4.8
S3,control,3,5.3
S4,control,4,5.0
S5,control,5,5.2
S6,treated,1,6.7
S7,treated,2,6.9
S8,treated,3,7.0
S9,treated,4,6.8
S10,treated,5,6.6
[bash]$

Here’s how we can read this table into an R object.

df = read.csv("dummy.csv")
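To try this without creating a file first, the same table can be read from an inline string. This is a minimal sketch: `csv_text` is a name made up for this example, passed to `read.csv()` through its `text` argument.

```r
# Hypothetical inline copy of dummy.csv, so the example runs without the file
csv_text <- "sample_id,condition,time,concentration
S1,control,1,5.1
S2,control,2,4.8
S3,control,3,5.3
S4,control,4,5.0
S5,control,5,5.2
S6,treated,1,6.7
S7,treated,2,6.9
S8,treated,3,7.0
S9,treated,4,6.8
S10,treated,5,6.6"

# read.csv() parses the string the same way it would parse the file
df <- read.csv(text = csv_text)

# Column types are guessed from the values in each column
sapply(df, class)
```

Reading from a string is handy for small reproducible examples; for real data, point `read.csv()` at the file as shown above.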

Peek into tables

Let us look at the contents of this table

df

##    sample_id condition time concentration
## 1         S1   control    1           5.1
## 2         S2   control    2           4.8
## 3         S3   control    3           5.3
## 4         S4   control    4           5.0
## 5         S5   control    5           5.2
## 6         S6   treated    1           6.7
## 7         S7   treated    2           6.9
## 8         S8   treated    3           7.0
## 9         S9   treated    4           6.8
## 10       S10   treated    5           6.6

Peek into tables

We can also get useful information about it.

# Dimensions: rows and columns
dim(df)

## [1] 10  4

# Structure: variable names and types
str(df)

## 'data.frame':    10 obs. of  4 variables:
##  $ sample_id    : chr  "S1" "S2" "S3" "S4" ...
##  $ condition    : chr  "control" "control" "control" "control" ...
##  $ time         : int  1 2 3 4 5 1 2 3 4 5
##  $ concentration: num  5.1 4.8 5.3 5 5.2 6.7 6.9 7 6.8 6.6

Peek into tables

We can also get useful information about it.

# Summary stats
summary(df)

##   sample_id          condition              time   concentration  
##  Length:10          Length:10          Min.   :1   Min.   :4.800  
##  Class :character   Class :character   1st Qu.:2   1st Qu.:5.125  
##  Mode  :character   Mode  :character   Median :3   Median :5.950  
##                                        Mean   :3   Mean   :5.940  
##                                        3rd Qu.:4   3rd Qu.:6.775  
##                                        Max.   :5   Max.   :7.000

Peek into tables

Or peek into parts of the table.

# First few rows
head(df)

##   sample_id condition time concentration
## 1        S1   control    1           5.1
## 2        S2   control    2           4.8
## 3        S3   control    3           5.3
## 4        S4   control    4           5.0
## 5        S5   control    5           5.2
## 6        S6   treated    1           6.7

# Last few rows
tail(df)

##    sample_id condition time concentration
## 5         S5   control    5           5.2
## 6         S6   treated    1           6.7
## 7         S7   treated    2           6.9
## 8         S8   treated    3           7.0
## 9         S9   treated    4           6.8
## 10       S10   treated    5           6.6

Peek into tables

Or summarize values in columns.

# Count values per group
table(df$condition)

## 
## control treated 
##       5       5

# Count values per time point
table(df$time)

## 
## 1 2 3 4 5 
## 2 2 2 2 2
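The two counts can also be combined into a single cross-tabulation, which is a quick way to check that every condition was measured at every time point. A small sketch, rebuilding the table from the slides so that it runs on its own; `design` is just an illustrative name.

```r
# Same values as dummy.csv, typed in directly
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6)
)

# Two-way count: rows are conditions, columns are time points
design <- table(df$condition, df$time)
design
```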

From tables to plots in R

Plotting in R is not so different from Excel

Excel and ggplot2 speak the same language
Map columns to axes: time is x-axis and concentration is y-axis

Line graph

library(ggplot2)

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line()

How about just points?

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_point()

How about lines AND points?

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point()

Can we start the y-axis from 0?

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point() + ylim(c(0,10))
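`ylim()` is one of several ways to control the y-axis range. The sketch below puts three options side by side, assuming the same table as in the slides. The object names (`base_plot`, `p_limits`, `p_zoom`, `p_expand`) are made up for illustration.

```r
library(ggplot2)

# Same values as dummy.csv, typed in directly
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6)
)

base_plot <- ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point()

# Option 1: fix the scale limits; values outside 0-10 would be dropped
p_limits <- base_plot + ylim(c(0, 10))

# Option 2: zoom the view only; all values stay available to the geoms
p_zoom <- base_plot + coord_cartesian(ylim = c(0, 10))

# Option 3: only make sure 0 is included; the upper end follows the data
p_expand <- base_plot + expand_limits(y = 0)

# Type an object name (e.g. p_zoom) at the console to draw it
```

For this dataset the first two options should look the same, because no value falls outside 0 to 10. The difference matters once limits cut into the data.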

Can we fix the weird y-axis ticks?

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point() + scale_y_continuous(breaks=seq(0,10,1), limits=c(0,10))

What about that professional look?

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point() + scale_y_continuous(breaks=seq(0,10,1), limits=c(0,10)) +
  theme_classic()
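A finished figure usually also needs readable labels and a file on disk. The sketch below adds both with `labs()` and `ggsave()`. The label texts, the file name and the figure size are assumptions chosen for illustration, not part of the workshop data.

```r
library(ggplot2)

# Same values as dummy.csv, typed in directly
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6)
)

p <- ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point() +
  scale_y_continuous(breaks = seq(0, 10, 1), limits = c(0, 10)) +
  # Hypothetical axis and legend titles; add units if you know them
  labs(x = "Time", y = "Protein concentration", color = "Condition") +
  theme_classic()

# Write the plot to a PDF in the session's temporary folder
out_file <- file.path(tempdir(), "timeseries.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)  # inches by default
```

Changing the file extension (for example to `.png`) changes the output format; `ggsave()` picks the device from the file name.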

🎨 The basics of ggplot2

Anatomy of the `ggplot` function

ggplot2 builds plots layer by layer.
You always start with:

ggplot(data = <data>, aes(x = <x-column>, y = <y-column>))

or

ggplot(data = <data>) + aes(x = <x-column>, y = <y-column>)

Then, you add a geometry layer: geom_col(), geom_point(), etc.

geom_point(): for points
geom_line(): for lines
geom_bar(): for barplot
geom_boxplot(): for boxplot
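Switching the geometry is often all it takes to get a different kind of plot from the same table. A short sketch with two of the geoms listed above, using the example table from the slides; `p_box` and `p_bar` are illustrative names.

```r
library(ggplot2)

# Same values as dummy.csv, typed in directly
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6)
)

# Box plot: distribution of concentration within each condition
p_box <- ggplot(data = df) +
  aes(x = condition, y = concentration) +
  geom_boxplot()

# Bar plot: geom_bar() counts the rows in each condition, so no y is mapped
p_bar <- ggplot(data = df) +
  aes(x = condition) +
  geom_bar()

# Type p_box or p_bar at the console to draw the plot
```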

Recap: Key components of the `ggplot` function

ggplot() sets up the plot object and attaches the data.
aes() maps variables to visual properties (x-axis, y-axis, colors, etc.).
geom_*() draws the mapped data as shapes (points, lines, bars, etc.).
You can customize it with themes, labels, and colors.
More in the hands-on sessions.
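Because a plot is an ordinary R object, the components above can be added one step at a time and stored under a name. A minimal sketch with the example table; the names `p_empty`, `p_points` and `p_final` are made up here.

```r
library(ggplot2)

# Same values as dummy.csv, typed in directly
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6)
)

# Step 1: data and mappings only; drawing this gives empty axes
p_empty <- ggplot(data = df, aes(x = time, y = concentration, color = condition))

# Step 2: add a geometry so the data become visible
p_points <- p_empty + geom_point()

# Step 3: add another geometry and a theme on top of the stored plot
p_final <- p_points + geom_line() + theme_classic()

# Type p_final at the console to draw the plot
```

Storing intermediate plots like this makes it easy to try a different theme or geometry without retyping the whole call.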

Why use `ggplot`?

Feature	Excel	ggplot2
Data size	Small–Medium	Small to Very Large
Reproducibility	Manual clicks	Fully scriptable and reproducible
Styling	Limited templates	Full control via `theme()`, colors, facets
Automation	Hard	Easy: loop or script across datasets
Publication-ready	Time-consuming	High-quality out of the box
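The "Automation" row can be illustrated with a small loop: one function that draws a plot, applied to each subset of the data. This is only a sketch; `plot_timeseries()` is a hypothetical helper written for this example, not a ggplot2 function.

```r
library(ggplot2)

# Same values as dummy.csv, typed in directly
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6)
)

# Hypothetical helper: one time-series plot for one piece of the table
plot_timeseries <- function(data, title) {
  ggplot(data = data) +
    aes(x = time, y = concentration) +
    geom_line() + geom_point() +
    ggtitle(title) +
    theme_classic()
}

# Split the table by condition and build one plot per piece
pieces <- split(df, df$condition)
plots <- lapply(names(pieces), function(nm) plot_timeseries(pieces[[nm]], nm))
names(plots) <- names(pieces)

# plots$control and plots$treated can now be drawn or saved one by one
```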

Joining data tables

Why Join Tables?

Biologists often have different pieces of information in separate tables:

One table has sample metadata (e.g., treatment).
Another has experimental results (e.g., protein levels).

We want to combine them using a common column (like Sample ID).

Example: Data spread in three tables

Sample Metadata

sample_id	tissue	animal_id	date
S1	blood	F01	2020-01-29
S2	blood	F01	2020-02-04
S3	blood	F03	2020-01-29
S4	blood	F03	2020-02-04

Animal Metadata

animal_id	condition	dob	sex
F01	control	2019-10-31	M
F03	treated	2019-10-27	F

Protein Results

sample_id	protein_X	protein_Y	protein_Z
S1	5.2	12.2	9.5
S2	5.3	13.1	10.2
S3	6.8	18.9	15.8
S4	7.0	19.2	17.1

Connect protein concentrations to samples

library(dplyr)

left_join(sample_metadata, protein_conc, by="sample_id")

sample_id	tissue	animal_id	date	protein_X	protein_Y	protein_Z
S1	blood	F01	2020-01-29	5.2	12.2	9.5
S2	blood	F01	2020-02-04	5.3	13.1	10.2
S3	blood	F03	2020-01-29	6.8	18.9	15.8
S4	blood	F03	2020-02-04	7.0	19.2	17.1

Each row now has both sample metadata and result.
We used a common key (sample_id) to match.
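To run this join yourself, the two tables can be typed in as data frames. A minimal sketch that copies the values shown on the slides; `joined` is an illustrative name.

```r
library(dplyr)

# Sample metadata, as shown on the slide
sample_metadata <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  tissue = "blood",
  animal_id = c("F01", "F01", "F03", "F03"),
  date = as.Date(c("2020-01-29", "2020-02-04", "2020-01-29", "2020-02-04"))
)

# Protein results, as shown on the slide
protein_conc <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  protein_X = c(5.2, 5.3, 6.8, 7.0),
  protein_Y = c(12.2, 13.1, 18.9, 19.2),
  protein_Z = c(9.5, 10.2, 15.8, 17.1)
)

# Keep every row of sample_metadata and attach the matching protein values
joined <- left_join(sample_metadata, protein_conc, by = "sample_id")
joined
```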

Connect protein concentrations to samples to animals

left_join(left_join(sample_metadata, animal_metadata, by="animal_id"), protein_conc, by="sample_id")

sample_id	tissue	animal_id	date	condition	dob	sex	protein_X	protein_Y	protein_Z
S1	blood	F01	2020-01-29	control	2019-10-31	M	5.2	12.2	9.5
S2	blood	F01	2020-02-04	control	2019-10-31	M	5.3	13.1	10.2
S3	blood	F03	2020-01-29	treated	2019-10-27	F	6.8	18.9	15.8
S4	blood	F03	2020-02-04	treated	2019-10-27	F	7.0	19.2	17.1
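In the example every sample has a protein result, so nothing is lost or left empty. When a key has no match, the type of join decides what happens. The sketch below uses a made-up sample `S5` without protein data to compare `left_join()` and `inner_join()`; the shortened tables are hypothetical.

```r
library(dplyr)

# Hypothetical shortened tables: S5 has metadata but no protein result
sample_metadata <- data.frame(
  sample_id = c("S1", "S2", "S5"),
  animal_id = c("F01", "F01", "F03")
)
protein_conc <- data.frame(
  sample_id = c("S1", "S2"),
  protein_X = c(5.2, 5.3)
)

# left_join() keeps all rows of the first table; missing values become NA
kept <- left_join(sample_metadata, protein_conc, by = "sample_id")

# inner_join() keeps only the rows whose key is found in both tables
matched <- inner_join(sample_metadata, protein_conc, by = "sample_id")

kept
matched
```

Comparing `nrow()` before and after a join is a cheap check that no samples were dropped or duplicated by accident.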

🔗 The pipe operator `%>%`

What is the pipe operator?

written as `%>%`
comes from the magrittr package (also available after loading dplyr).
chains operations in a readable, step-by-step way.

E.g.

df %>% something() %>% somethingelse()

Think: “and then”.
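A tiny example makes the "and then" reading concrete: `x %>% f(y)` is the same as `f(x, y)`. The numbers below are simply the first three control values from the table. The last line uses R's built-in pipe `|>`, which assumes R 4.1 or newer.

```r
library(dplyr)  # loading dplyr makes %>% available

x <- c(5.1, 4.8, 5.3)

# Nested call: read from the inside out
nested <- round(mean(x), 1)

# Piped call: take x, and then take the mean, and then round it
piped <- x %>% mean() %>% round(1)

# Built-in pipe (assumes R >= 4.1); no package needed
native <- x |> mean() |> round(1)

identical(nested, piped)
```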

Remember our `left_join()`?

Without the pipe

left_join(left_join(sample_metadata, animal_metadata, by="animal_id"), protein_conc, by="sample_id")

With the pipe

sample_metadata %>%
  left_join(animal_metadata, by="animal_id") %>%
  left_join(protein_conc, by="sample_id")

sample_id	tissue	animal_id	date	condition	dob	sex	protein_X	protein_Y	protein_Z
S1	blood	F01	2020-01-29	control	2019-10-31	M	5.2	12.2	9.5
S2	blood	F01	2020-02-04	control	2019-10-31	M	5.3	13.1	10.2
S3	blood	F03	2020-01-29	treated	2019-10-27	F	6.8	18.9	15.8
S4	blood	F03	2020-02-04	treated	2019-10-27	F	7.0	19.2	17.1

Easier to read: do this, then this, then this.
Reduces nested parentheses and clutter.

Drop the `sample_id` column and sort by concentration

Without the pipe

arrange(select(df, -sample_id), concentration)

##    condition time concentration
## 1    control    2           4.8
## 2    control    4           5.0
## 3    control    1           5.1
## 4    control    5           5.2
## 5    control    3           5.3
## 6    treated    5           6.6
## 7    treated    1           6.7
## 8    treated    4           6.8
## 9    treated    2           6.9
## 10   treated    3           7.0

With the pipe

df %>%
  select(-sample_id) %>%
  arrange(concentration)

##    condition time concentration
## 1    control    2           4.8
## 2    control    4           5.0
## 3    control    1           5.1
## 4    control    5           5.2
## 5    control    3           5.3
## 6    treated    5           6.6
## 7    treated    1           6.7
## 8    treated    4           6.8
## 9    treated    2           6.9
## 10   treated    3           7.0

Estimate mean concentration per condition

Without the pipe

summarise(
  group_by(df, condition),
  mean_conc = mean(concentration)
)

## # A tibble: 2 × 2
##   condition mean_conc
##   <chr>         <dbl>
## 1 control        5.08
## 2 treated        6.8

With the pipe

df %>%
  group_by(condition) %>%
  summarise(mean_conc = mean(concentration))

## # A tibble: 2 × 2
##   condition mean_conc
##   <chr>         <dbl>
## 1 control        5.08
## 2 treated        6.8
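The same pattern extends to several summaries at once, and the result can be handed straight to ggplot. A sketch under the assumption that a bar per condition is the plot you want; `stats` and `p_means` are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Same values as dummy.csv, typed in directly
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6)
)

# Mean, standard deviation and number of samples per condition
stats <- df %>%
  group_by(condition) %>%
  summarise(
    mean_conc = mean(concentration),
    sd_conc = sd(concentration),
    n = n()
  )

# Pipe the summary into ggplot(); from there on, layers are added with +
p_means <- stats %>%
  ggplot(aes(x = condition, y = mean_conc)) +
  geom_col()

stats
```

Note the switch of operator: `%>%` passes a table from one step to the next, while `+` adds layers to a plot.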


Thanks! Any Questions?
