2025-05-06
Imagine an experiment:
And you want to plot it as two lines using Excel
R gives you a professional-looking figure
If this table is in dummy.csv:
[bash]$ cat dummy.csv
sample_id,condition,time,concentration
S1,control,1,5.1
S2,control,2,4.8
S3,control,3,5.3
S4,control,4,5.0
S5,control,5,5.2
S6,treated,1,6.7
S7,treated,2,6.9
S8,treated,3,7.0
S9,treated,4,6.8
S10,treated,5,6.6
[bash]$
Here’s how we can read this table into an R object.
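A minimal sketch of that step, assuming the file is called dummy.csv as above. So that the example runs on its own, it first writes the same ten rows to a temporary file; the variable name csv_path is only for illustration, and with a real file you would pass its path straight to read.csv().

```r
# Write the table shown above to a temporary dummy.csv (illustration only).
csv_path <- file.path(tempdir(), "dummy.csv")
writeLines(c(
  "sample_id,condition,time,concentration",
  "S1,control,1,5.1",
  "S2,control,2,4.8",
  "S3,control,3,5.3",
  "S4,control,4,5.0",
  "S5,control,5,5.2",
  "S6,treated,1,6.7",
  "S7,treated,2,6.9",
  "S8,treated,3,7.0",
  "S9,treated,4,6.8",
  "S10,treated,5,6.6"
), csv_path)

# read.csv() returns a data.frame; the first line of the file
# becomes the column names.
df <- read.csv(csv_path)
```

Typing df (or print(df)) at the R console then prints the whole table.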
Let us look at the contents of this table
## sample_id condition time concentration
## 1 S1 control 1 5.1
## 2 S2 control 2 4.8
## 3 S3 control 3 5.3
## 4 S4 control 4 5.0
## 5 S5 control 5 5.2
## 6 S6 treated 1 6.7
## 7 S7 treated 2 6.9
## 8 S8 treated 3 7.0
## 9 S9 treated 4 6.8
## 10 S10 treated 5 6.6
We can also get useful information about it.
## [1] 10 4
## 'data.frame': 10 obs. of 4 variables:
## $ sample_id : chr "S1" "S2" "S3" "S4" ...
## $ condition : chr "control" "control" "control" "control" ...
## $ time : int 1 2 3 4 5 1 2 3 4 5
## $ concentration: num 5.1 4.8 5.3 5 5.2 6.7 6.9 7 6.8 6.6
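The two listings above look like the output of dim() and str(). A small sketch, assuming df holds the dummy.csv data (typed in directly here so the example runs on its own):

```r
# Same values as dummy.csv, typed in directly.
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6),
  stringsAsFactors = FALSE
)

dim(df)  # number of rows, then number of columns
str(df)  # each column's name, type, and first few values
```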
We can also get summary statistics for each column.
## sample_id condition time concentration
## Length:10 Length:10 Min. :1 Min. :4.800
## Class :character Class :character 1st Qu.:2 1st Qu.:5.125
## Mode :character Mode :character Median :3 Median :5.950
## Mean :3 Mean :5.940
## 3rd Qu.:4 3rd Qu.:6.775
## Max. :5 Max. :7.000
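A table like the one above is what summary() prints for a data frame. A minimal sketch, assuming df holds the dummy.csv data (recreated here so the example is self-contained):

```r
# Same values as dummy.csv, typed in directly.
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6),
  stringsAsFactors = FALSE
)

# Numeric columns get min, quartiles, median, mean and max;
# character columns only report length, class and mode.
summary(df)

# summary() also works on a single column.
summary(df$concentration)
```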
Or peek into parts of the table.
## sample_id condition time concentration
## 1 S1 control 1 5.1
## 2 S2 control 2 4.8
## 3 S3 control 3 5.3
## 4 S4 control 4 5.0
## 5 S5 control 5 5.2
## 6 S6 treated 1 6.7
## sample_id condition time concentration
## 5 S5 control 5 5.2
## 6 S6 treated 1 6.7
## 7 S7 treated 2 6.9
## 8 S8 treated 3 7.0
## 9 S9 treated 4 6.8
## 10 S10 treated 5 6.6
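The first listing matches head() and the second matches tail(), which by default show the first and last six rows. A short sketch, assuming df holds the dummy.csv data (recreated here so it runs on its own):

```r
# Same values as dummy.csv, typed in directly.
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6),
  stringsAsFactors = FALSE
)

head(df)     # first six rows
tail(df)     # last six rows
head(df, 3)  # the second argument sets how many rows to show
```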
Or summarize values in columns.
##
## control treated
## 5 5
##
## 1 2 3 4 5
## 2 2 2 2 2
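Counts like these come from table(), which tallies how often each value occurs in a column. A minimal sketch, assuming df holds the dummy.csv data (recreated here so the example is self-contained):

```r
# Same values as dummy.csv, typed in directly.
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6),
  stringsAsFactors = FALSE
)

table(df$condition)  # samples per condition
table(df$time)       # samples per time point
```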
time is the x-axis and concentration is the y-axis.

library(ggplot2)

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line()

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_point()

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point()

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point() + ylim(c(0, 10))

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point() +
  scale_y_continuous(breaks = seq(0, 10, 1), limits = c(0, 10))

ggplot(data = df) +
  aes(x = time, y = concentration, group = condition, color = condition) +
  geom_line() + geom_point() +
  scale_y_continuous(breaks = seq(0, 10, 1), limits = c(0, 10)) +
  theme_classic()
ggplot function
or

Then, you add a geometry layer: geom_col(), geom_point(), etc.
geom_point(): for points
geom_line(): for lines
geom_bar(): for barplot
geom_boxplot(): for boxplot

ggplot function
ggplot() sets up the plot object and links data to axes.
aes() maps variables to visual properties (x-axis, y-axis, colors, etc.).
geom_*() draws the data using the mapped x and y values.

Why ggplot?
Feature | Excel | ggplot2 |
---|---|---|
Data size | Small–Medium | Small to Very Large |
Reproducibility | Manual clicks | Fully scriptable and reproducible |
Styling | Limited templates | Full control via theme(), colors, facets |
Automation | Hard | Easy: loop or script across datasets |
Publication-ready | Time-consuming | High-quality out of the box |
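To make the idea of swapping geometry layers concrete, here is a small sketch that keeps the same data and changes only the mapping and the geom. It assumes the ggplot2 package is installed and that df holds the dummy.csv data; the names p_box and p_col are only for illustration.

```r
library(ggplot2)

# Same values as dummy.csv, typed in directly.
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6),
  stringsAsFactors = FALSE
)

# One box per condition, summarizing the five time points.
p_box <- ggplot(data = df) +
  aes(x = condition, y = concentration, color = condition) +
  geom_boxplot()

# One bar per sample; geom_col() uses the y values as bar heights,
# and position = "dodge" puts the two conditions side by side.
p_col <- ggplot(data = df) +
  aes(x = time, y = concentration, fill = condition) +
  geom_col(position = "dodge")

# Printing a plot object draws it.
print(p_box)
print(p_col)
```

Because a plot is an ordinary R object, the same few lines can be rerun on another data frame with the same columns, which is what makes scripting across datasets straightforward.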
Biologists often have different pieces of information in separate tables:
We want to combine them using a common column (like Sample ID).
Sample Metadata
sample_id | tissue | animal_id | date |
---|---|---|---|
S1 | blood | F01 | 2020-01-29 |
S2 | blood | F01 | 2020-02-04 |
S3 | blood | F03 | 2020-01-29 |
S4 | blood | F03 | 2020-02-04 |
Animal Metadata
animal_id | condition | dob | sex |
---|---|---|---|
F01 | control | 2019-10-31 | M |
F03 | treated | 2019-10-27 | F |
Protein Results
sample_id | protein_X | protein_Y | protein_Z |
---|---|---|---|
S1 | 5.2 | 12.2 | 9.5 |
S2 | 5.3 | 13.1 | 10.2 |
S3 | 6.8 | 18.9 | 15.8 |
S4 | 7.0 | 19.2 | 17.1 |
sample_id | tissue | animal_id | date | protein_X | protein_Y | protein_Z |
---|---|---|---|---|---|---|
S1 | blood | F01 | 2020-01-29 | 5.2 | 12.2 | 9.5 |
S2 | blood | F01 | 2020-02-04 | 5.3 | 13.1 | 10.2 |
S3 | blood | F03 | 2020-01-29 | 6.8 | 18.9 | 15.8 |
S4 | blood | F03 | 2020-02-04 | 7.0 | 19.2 | 17.1 |
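A combined table like the one above can be produced with left_join() from the dplyr package. The sketch below assumes dplyr is installed and types the two tables in by hand; storing the dates as plain text is an assumption made for brevity, and sample_with_protein is a name chosen only for this example.

```r
library(dplyr)

# Sample Metadata, as shown above (dates kept as text for simplicity).
sample_metadata <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  tissue = "blood",
  animal_id = c("F01", "F01", "F03", "F03"),
  date = c("2020-01-29", "2020-02-04", "2020-01-29", "2020-02-04"),
  stringsAsFactors = FALSE
)

# Protein Results, as shown above.
protein_conc <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  protein_X = c(5.2, 5.3, 6.8, 7.0),
  protein_Y = c(12.2, 13.1, 18.9, 19.2),
  protein_Z = c(9.5, 10.2, 15.8, 17.1),
  stringsAsFactors = FALSE
)

# Keep every row of sample_metadata and attach the protein columns
# from the row of protein_conc that has the same sample_id.
sample_with_protein <- left_join(sample_metadata, protein_conc, by = "sample_id")
sample_with_protein
```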
Use a common column (sample_id) to match.
left_join(left_join(sample_metadata, animal_metadata, by="animal_id"), protein_conc, by="sample_id")
sample_id | tissue | animal_id | date | condition | dob | sex | protein_X | protein_Y | protein_Z |
---|---|---|---|---|---|---|---|---|---|
S1 | blood | F01 | 2020-01-29 | control | 2019-10-31 | M | 5.2 | 12.2 | 9.5 |
S2 | blood | F01 | 2020-02-04 | control | 2019-10-31 | M | 5.3 | 13.1 | 10.2 |
S3 | blood | F03 | 2020-01-29 | treated | 2019-10-27 | F | 6.8 | 18.9 | 15.8 |
S4 | blood | F03 | 2020-02-04 | treated | 2019-10-27 | F | 7.0 | 19.2 | 17.1 |
%>%
'%>%'
E.g.
Think: “and then”.
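A tiny example of reading the pipe as "and then", assuming the dplyr package is installed (it makes %>% available). The pipe passes the value on its left as the first argument of the function on its right, so both versions below compute the same thing.

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

x <- c(1, 4, 9)

# Without the pipe: read from the inside out.
nested <- sum(sqrt(x))

# With the pipe: take x, and then take square roots, and then sum.
piped <- x %>% sqrt() %>% sum()

nested
piped
```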
left_join()?
Without the pipe
left_join(left_join(sample_metadata, animal_metadata, by="animal_id"), protein_conc, by="sample_id")
With the pipe
sample_metadata %>%
left_join(animal_metadata, by="animal_id") %>%
left_join(protein_conc, by="sample_id")
sample_id | tissue | animal_id | date | condition | dob | sex | protein_X | protein_Y | protein_Z |
---|---|---|---|---|---|---|---|---|---|
S1 | blood | F01 | 2020-01-29 | control | 2019-10-31 | M | 5.2 | 12.2 | 9.5 |
S2 | blood | F01 | 2020-02-04 | control | 2019-10-31 | M | 5.3 | 13.1 | 10.2 |
S3 | blood | F03 | 2020-01-29 | treated | 2019-10-27 | F | 6.8 | 18.9 | 15.8 |
S4 | blood | F03 | 2020-02-04 | treated | 2019-10-27 | F | 7.0 | 19.2 | 17.1 |
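To check that the two ways of writing the join really give the same table, here is a self-contained sketch. It assumes dplyr is installed and types the three tables in by hand, with dates stored as plain text for brevity; the names nested and piped are only for illustration.

```r
library(dplyr)

sample_metadata <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  tissue = "blood",
  animal_id = c("F01", "F01", "F03", "F03"),
  date = c("2020-01-29", "2020-02-04", "2020-01-29", "2020-02-04"),
  stringsAsFactors = FALSE
)

animal_metadata <- data.frame(
  animal_id = c("F01", "F03"),
  condition = c("control", "treated"),
  dob = c("2019-10-31", "2019-10-27"),
  sex = c("M", "F"),
  stringsAsFactors = FALSE
)

protein_conc <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  protein_X = c(5.2, 5.3, 6.8, 7.0),
  protein_Y = c(12.2, 13.1, 18.9, 19.2),
  protein_Z = c(9.5, 10.2, 15.8, 17.1),
  stringsAsFactors = FALSE
)

# Without the pipe: the inner join runs first.
nested <- left_join(
  left_join(sample_metadata, animal_metadata, by = "animal_id"),
  protein_conc,
  by = "sample_id"
)

# With the pipe: the same steps, read top to bottom.
piped <- sample_metadata %>%
  left_join(animal_metadata, by = "animal_id") %>%
  left_join(protein_conc, by = "sample_id")

identical(nested, piped)
```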
Remove the sample_id column and sort by concentration
Without the pipe
## condition time concentration
## 1 control 2 4.8
## 2 control 4 5.0
## 3 control 1 5.1
## 4 control 5 5.2
## 5 control 3 5.3
## 6 treated 5 6.6
## 7 treated 1 6.7
## 8 treated 4 6.8
## 9 treated 2 6.9
## 10 treated 3 7.0
With the pipe
## condition time concentration
## 1 control 2 4.8
## 2 control 4 5.0
## 3 control 1 5.1
## 4 control 5 5.2
## 5 control 3 5.3
## 6 treated 5 6.6
## 7 treated 1 6.7
## 8 treated 4 6.8
## 9 treated 2 6.9
## 10 treated 3 7.0
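One way to get the two listings above is with select() and arrange() from dplyr. This is a sketch under the assumption that dplyr is installed and df holds the dummy.csv data; the slides may have used different calls.

```r
library(dplyr)

# Same values as dummy.csv, typed in directly.
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6),
  stringsAsFactors = FALSE
)

# Without the pipe: drop sample_id first (inner call), then sort.
nested <- arrange(select(df, -sample_id), concentration)

# With the pipe: take df, and then drop sample_id, and then sort.
piped <- df %>%
  select(-sample_id) %>%
  arrange(concentration)

piped
```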
Without the pipe
## # A tibble: 2 × 2
## condition mean_conc
## <chr> <dbl>
## 1 control 5.08
## 2 treated 6.8
With the pipe
## # A tibble: 2 × 2
## condition mean_conc
## <chr> <dbl>
## 1 control 5.08
## 2 treated 6.8
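The mean_conc tables above are consistent with grouping by condition and averaging concentration. A minimal sketch using group_by() and summarise() from dplyr, assuming dplyr is installed and df holds the dummy.csv data; the exact calls used on the slides are not shown, so treat this as one possible version.

```r
library(dplyr)

# Same values as dummy.csv, typed in directly.
df <- data.frame(
  sample_id = paste0("S", 1:10),
  condition = rep(c("control", "treated"), each = 5),
  time = rep(1:5, times = 2),
  concentration = c(5.1, 4.8, 5.3, 5.0, 5.2, 6.7, 6.9, 7.0, 6.8, 6.6),
  stringsAsFactors = FALSE
)

# Without the pipe: group first (inner call), then summarise.
nested <- summarise(group_by(df, condition), mean_conc = mean(concentration))

# With the pipe: take df, and then group by condition,
# and then compute the mean concentration in each group.
piped <- df %>%
  group_by(condition) %>%
  summarise(mean_conc = mean(concentration))

piped
```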