Introduction to ggplot and data tables
A PIG-PARADIGM workshop

Mani Arumugam

University of Copenhagen

2025-05-06

Visualizing data from tables

Graphs from tables

Imagine an experiment:

Graphs from tables

And you want to plot it as two lines using Excel

Graphs from tables

R gives you a professional-looking figure

How do we get there?

Tables in R

Read a table in R

If this table is in dummy.csv:

[bash]$ cat dummy.csv
sample_id,condition,time,concentration
S1,control,1,5.1
S2,control,2,4.8
S3,control,3,5.3
S4,control,4,5.0
S5,control,5,5.2
S6,treated,1,6.7
S7,treated,2,6.9
S8,treated,3,7.0
S9,treated,4,6.8
S10,treated,5,6.6
[bash]$

Here’s how we can read this table into an R object.

df = read.csv("dummy.csv")
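To try this step without the file at hand, the sketch below first writes the same ten rows to a temporary file and then reads them back. The temporary path created by tempfile() is an assumption for illustration only; in the workshop the file is simply dummy.csv in the working directory.

```r
# Recreate the contents of dummy.csv as text (same rows as shown above)
csv_lines <- c(
  "sample_id,condition,time,concentration",
  "S1,control,1,5.1",
  "S2,control,2,4.8",
  "S3,control,3,5.3",
  "S4,control,4,5.0",
  "S5,control,5,5.2",
  "S6,treated,1,6.7",
  "S7,treated,2,6.9",
  "S8,treated,3,7.0",
  "S9,treated,4,6.8",
  "S10,treated,5,6.6"
)

# Write to a temporary file (illustrative path, not part of the workshop files)
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

# Read the table into a data frame, just as with read.csv("dummy.csv")
df <- read.csv(csv_path)

nrow(df)   # number of samples
names(df)  # column names taken from the header line
```

read.csv() uses the first line as column names by default and guesses each column's type from its values.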

Peek into tables

Let us look at the contents of this table

df
##    sample_id condition time concentration
## 1         S1   control    1           5.1
## 2         S2   control    2           4.8
## 3         S3   control    3           5.3
## 4         S4   control    4           5.0
## 5         S5   control    5           5.2
## 6         S6   treated    1           6.7
## 7         S7   treated    2           6.9
## 8         S8   treated    3           7.0
## 9         S9   treated    4           6.8
## 10       S10   treated    5           6.6

Peek into tables

We can also get useful information about it.

# Dimensions: rows and columns
dim(df)
## [1] 10  4
# Structure: variable names and types
str(df)
## 'data.frame':    10 obs. of  4 variables:
##  $ sample_id    : chr  "S1" "S2" "S3" "S4" ...
##  $ condition    : chr  "control" "control" "control" "control" ...
##  $ time         : int  1 2 3 4 5 1 2 3 4 5
##  $ concentration: num  5.1 4.8 5.3 5 5.2 6.7 6.9 7 6.8 6.6

Peek into tables

We can also get useful information about it.

# Summary stats
summary(df)
##   sample_id          condition              time   concentration  
##  Length:10          Length:10          Min.   :1   Min.   :4.800  
##  Class :character   Class :character   1st Qu.:2   1st Qu.:5.125  
##  Mode  :character   Mode  :character   Median :3   Median :5.950  
##                                        Mean   :3   Mean   :5.940  
##                                        3rd Qu.:4   3rd Qu.:6.775  
##                                        Max.   :5   Max.   :7.000

Peek into tables

Or peek into parts of the table.

# First few rows
head(df)
##   sample_id condition time concentration
## 1        S1   control    1           5.1
## 2        S2   control    2           4.8
## 3        S3   control    3           5.3
## 4        S4   control    4           5.0
## 5        S5   control    5           5.2
## 6        S6   treated    1           6.7
# Last few rows
tail(df)
##    sample_id condition time concentration
## 5         S5   control    5           5.2
## 6         S6   treated    1           6.7
## 7         S7   treated    2           6.9
## 8         S8   treated    3           7.0
## 9         S9   treated    4           6.8
## 10       S10   treated    5           6.6

Peek into tables

Or summarize values in columns.

# Count values per group
table(df$condition)
## 
## control treated 
##       5       5
# Count values per time point
table(df$time)
## 
## 1 2 3 4 5 
## 2 2 2 2 2

From tables to plots in R

Plotting in R is not so different from Excel

Line graph

# Load ggplot2 once per session before plotting
library(ggplot2)

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line()

How about just points?

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_point()

How about lines AND points?

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point()

Can we start the y-axis from 0?

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point() + ylim(c(0,10))

Can we fix the weird y-axis ticks?

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point() + scale_y_continuous(breaks=seq(0,10,1), limits=c(0,10))

What about that professional look?

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point() + scale_y_continuous(breaks=seq(0,10,1), limits=c(0,10)) +
  theme_classic()
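The steps above can be collected into one self-contained sketch. It assumes the ggplot2 package is installed and types the values of dummy.csv in directly so that it runs without the file; the object name p is chosen here for illustration.

```r
library(ggplot2)

# Same values as dummy.csv, typed in directly so the example runs on its own
df <- data.frame(
  sample_id     = paste0("S", 1:10),
  condition     = rep(c("control", "treated"), each = 5),
  time          = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6)
)

# Build the plot layer by layer and keep it in an object
p <- ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() +                                                   # one line per condition
  geom_point() +                                                  # one point per sample
  scale_y_continuous(breaks = seq(0, 10, 1), limits = c(0, 10)) + # y-axis from 0 to 10, ticks every 1
  theme_classic()                                                 # plain background, no grid

# Printing the object draws the figure
print(p)
```

Storing the plot in an object is optional, but it makes it easy to add further layers later or to draw the same figure again.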

🎨 The basics of ggplot2

Anatomy of the ggplot function

ggplot(data = <data>, aes(x = <x-column>, y = <y-column>))

or

ggplot(data = <data>) + aes(x = <x-column>, y = <y-column>)

Then, you add a geometry layer: geom_col(), geom_point(), etc.
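As a minimal sketch of the two forms, assuming ggplot2 is installed: the small table below reuses the control values from dummy.csv, and the names small, p1, p2 and p3 are chosen here for illustration.

```r
library(ggplot2)

# Small illustrative table (control rows of dummy.csv)
small <- data.frame(
  time          = 1:5,
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2)
)

# Form 1: the mapping is given inside ggplot()
p1 <- ggplot(data = small, aes(x = time, y = concentration)) +
  geom_point()

# Form 2: the mapping is added as a separate step
p2 <- ggplot(data = small) +
  aes(x = time, y = concentration) +
  geom_point()

# Swapping the geometry layer changes the plot type, not the data or the mapping
p3 <- ggplot(data = small) +
  aes(x = time, y = concentration) +
  geom_col()
```

Both forms describe the same plot; which one to use is a matter of taste.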

Recap: Key components of the ggplot function

Why use ggplot?

Feature            Excel              ggplot2
Data size          Small–Medium       Small to Very Large
Reproducibility    Manual clicks      Fully scriptable and reproducible
Styling            Limited templates  Full control via theme(), colors, facets
Automation         Hard               Easy: loop or script across datasets
Publication-ready  Time-consuming     High-quality out of the box

Joining data tables

Why Join Tables?

Biologists often have different pieces of information in separate tables, for example sample metadata, animal metadata, and measurement results.

We want to combine them using a common column (like Sample ID).

Example: Data spread in three tables

Sample Metadata

sample_id tissue animal_id date
S1 blood F01 2020-01-29
S2 blood F01 2020-02-04
S3 blood F03 2020-01-29
S4 blood F03 2020-02-04

Animal Metadata

animal_id condition dob sex
F01 control 2019-10-31 M
F03 treated 2019-10-27 F

Protein Results

sample_id protein_X protein_Y protein_Z
S1 5.2 12.2 9.5
S2 5.3 13.1 10.2
S3 6.8 18.9 15.8
S4 7.0 19.2 17.1
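The three tables above can be recreated as data frames with the sketch below. The object names sample_metadata, animal_metadata and protein_conc follow the join calls in this deck; storing the dates as Date values is an assumption made here.

```r
# Sample metadata: one row per sample
sample_metadata <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  tissue    = "blood",
  animal_id = c("F01", "F01", "F03", "F03"),
  date      = as.Date(c("2020-01-29", "2020-02-04", "2020-01-29", "2020-02-04"))
)

# Animal metadata: one row per animal
animal_metadata <- data.frame(
  animal_id = c("F01", "F03"),
  condition = c("control", "treated"),
  dob       = as.Date(c("2019-10-31", "2019-10-27")),
  sex       = c("M", "F")
)

# Protein results: one row per sample
protein_conc <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  protein_X = c(5.2, 5.3, 6.8, 7.0),
  protein_Y = c(12.2, 13.1, 18.9, 19.2),
  protein_Z = c(9.5, 10.2, 15.8, 17.1)
)

# Columns shared between two tables are the candidates for joining
intersect(names(sample_metadata), names(protein_conc))     # "sample_id"
intersect(names(sample_metadata), names(animal_metadata))  # "animal_id"
```

Checking the shared column names first shows which column can serve as the key for each join.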

Connect protein concentrations to samples

# left_join() comes from the dplyr package
library(dplyr)

left_join(sample_metadata, protein_conc, by="sample_id")
sample_id tissue animal_id date protein_X protein_Y protein_Z
S1 blood F01 2020-01-29 5.2 12.2 9.5
S2 blood F01 2020-02-04 5.3 13.1 10.2
S3 blood F03 2020-01-29 6.8 18.9 15.8
S4 blood F03 2020-02-04 7.0 19.2 17.1

Connect protein concentrations to samples to animals

left_join(left_join(sample_metadata, animal_metadata, by="animal_id"), protein_conc, by="sample_id")
sample_id tissue animal_id date condition dob sex protein_X protein_Y protein_Z
S1 blood F01 2020-01-29 control 2019-10-31 M 5.2 12.2 9.5
S2 blood F01 2020-02-04 control 2019-10-31 M 5.3 13.1 10.2
S3 blood F03 2020-01-29 treated 2019-10-27 F 6.8 18.9 15.8
S4 blood F03 2020-02-04 treated 2019-10-27 F 7.0 19.2 17.1

🔗 The pipe operator %>%

What is the pipe operator?

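The pipe takes the result on its left and passes it as the first argument to the function on its right. E.g.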

df %>% something() %>% somethingelse()

Think: “and then”.
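A small sketch of the "and then" reading, assuming the dplyr package is installed (it makes %>% available). The vector x and the names nested and piped are chosen here for illustration.

```r
library(dplyr)  # loading dplyr makes the %>% operator available

x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: take x, and then sqrt(), and then sum()
piped <- x %>% sqrt() %>% sum()

# Both ways give the same result
identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is easier to follow once there are more than two of them.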

Remember our left_join()?

Without the pipe

left_join(left_join(sample_metadata, animal_metadata, by="animal_id"), protein_conc, by="sample_id")

With the pipe

sample_metadata %>%
  left_join(animal_metadata, by="animal_id") %>%
  left_join(protein_conc, by="sample_id")
sample_id tissue animal_id date condition dob sex protein_X protein_Y protein_Z
S1 blood F01 2020-01-29 control 2019-10-31 M 5.2 12.2 9.5
S2 blood F01 2020-02-04 control 2019-10-31 M 5.3 13.1 10.2
S3 blood F03 2020-01-29 treated 2019-10-27 F 6.8 18.9 15.8
S4 blood F03 2020-02-04 treated 2019-10-27 F 7.0 19.2 17.1

Drop the sample_id column and sort by concentration

Without the pipe

arrange(select(df, -sample_id), concentration)
##    condition time concentration
## 1    control    2           4.8
## 2    control    4           5.0
## 3    control    1           5.1
## 4    control    5           5.2
## 5    control    3           5.3
## 6    treated    5           6.6
## 7    treated    1           6.7
## 8    treated    4           6.8
## 9    treated    2           6.9
## 10   treated    3           7.0

With the pipe

df %>%
  select(-sample_id) %>%
  arrange(concentration)
##    condition time concentration
## 1    control    2           4.8
## 2    control    4           5.0
## 3    control    1           5.1
## 4    control    5           5.2
## 5    control    3           5.3
## 6    treated    5           6.6
## 7    treated    1           6.7
## 8    treated    4           6.8
## 9    treated    2           6.9
## 10   treated    3           7.0

Estimate mean concentration per condition

Without the pipe

summarise(
  group_by(df, condition),
  mean_conc = mean(concentration)
)
## # A tibble: 2 × 2
##   condition mean_conc
##   <chr>         <dbl>
## 1 control        5.08
## 2 treated        6.8

With the pipe

df %>%
  group_by(condition) %>%
  summarise(mean_conc = mean(concentration))
## # A tibble: 2 × 2
##   condition mean_conc
##   <chr>         <dbl>
## 1 control        5.08
## 2 treated        6.8
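The summarised table can also be piped straight into ggplot(). The sketch below is one possible way to do it, assuming dplyr and ggplot2 are installed; the bar chart and the object name mean_plot are choices made here for illustration.

```r
library(dplyr)
library(ggplot2)

# Condition and concentration columns of dummy.csv, typed in directly
df <- data.frame(
  condition     = rep(c("control", "treated"), each = 5),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6)
)

# Data steps are chained with %>%; plot layers are chained with +
mean_plot <- df %>%
  group_by(condition) %>%
  summarise(mean_conc = mean(concentration)) %>%
  ggplot() +
  aes(x = condition, y = mean_conc) +
  geom_col() +
  theme_classic()

print(mean_plot)
```

Note the switch from %>% to + once ggplot() is called: the pipe hands data from one function to the next, while + adds layers to a plot.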

Thanks! Any Questions?