This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(readr)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.3.3
For the first part of this exercise, we are going to use metadata from a study conducted by Kashani et al. (2019) on leptin-deficient (ob/ob) mouse. They compared the change in body mass and glucose-response in ad-libitum fed ob/ob mice (adlib) compared to wild-type (con), and compared to ob/ob mice fed the same amount as the control mice (pair).
The data is in a csv and we use a function from the
readr
package to read it into a data frame. Data frames are
R data objects, where each column is one specific type (fx. character,
numeric, logic). The read_csv
function can choose the
correct type for each column based on the data it reads in, but it can
also be specified via the col_types
parameter.
ob_mice_df <- read_csv("Kashani2019.subset.csv")
## Rows: 323 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Sample_ID, Week, condition, team, fecal_date
## dbl (10): Sample_number, week_experiment, animal_id, body_fat, body_weight, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ob_mice_df
Hint: You can inspect the different objects in RStudio either by double-clicking on them in the Environment view, or by ctrl+click on the variable names in the code-chunk.
Visualisation of mean or median values, for example for replicates in
experimental groups, is often done with bar-plots. Bar-plots take
categorical data on the x-axis and numeric data on the y-axis. The
variance could be denoted by adding error-bars. Here, we shall look at
the change in body weight in each experimental group of mice.
The categories on the axises of plots are by default alphabetically
ordered. If we would like to change the order, we can encode the string
variable into factors, and change their underlying levels with the
forcats
package.
f <- as.factor(ob_mice_df$condition)
# re-order the values of f by listing them in the desired order
f <- fct_relevel(f, c("con", "pair", "adlib"))
# re-label the values
ob_mice_df$cond_f <- fct_recode(f, Control = "con", `Ad-lib` = "adlib", Pair = "pair")
For ggplot2
, the means and standard deviation (SD) needs
to be pre-computed for the groups. First, let’s filter the data frame to
only contain data points from the last week of the experiment.
ob_mice_last_week_df <- ob_mice_df %>%
filter(! is.na(body_weight)) %>% # select only the weeks with measurements
group_by(cond_f, team) %>% # group for combination of condition and team
filter(week_experiment == max(week_experiment)) %>% # take only the last measurements
ungroup()
After grouping the values by condition, we use summarise
function from dplyr to calculate values for the last grouping variable.
For example, if we were to group by condition and team here, the mean
and SD in the returned data frame would be for the mice in each team in
each condition.
mean_wg_end_df <- ob_mice_last_week_df %>%
group_by(cond_f) %>%
summarise(mean_wg_wk=mean(body_weight_change, na.rm = T), se_wg_wk=sd(body_weight_change, na.rm = T)) %>% ungroup()
mean_wg_end_df
Now we can draw the bar plot, using geom_bar
:
ggplot(mean_wg_end_df, aes(x = cond_f, y = mean_wg_wk, fill = cond_f)) +
geom_bar(position = position_dodge(), stat = "identity") + # not stacked
geom_errorbar(aes(ymin = mean_wg_wk-se_wg_wk, ymax = mean_wg_wk+se_wg_wk), # define min/max y-value
width = .2,
position = position_dodge(.9)) +
theme_classic() +
xlab("Condition") +
ylab("Mean absolute weight change") +
guides(fill = "none") # remove label for fill option as they are noted on y-axis
On the other hand, ggpubr
has built-in functions to
derive mean and SD from the data: add = "mean_sd"
in
ggbarplot
. It can also add the individual data points, as
required by some journals, to the plot. Try dropping the faceting
argument!
ggbarplot(ob_mice_last_week_df,
x = "cond_f", y = "body_weight_change",
fill = "cond_f", # use the formatted labels for the groups
add = c("mean_sd", "jitter"), # add error bar and individual points
palette = c("#c2a5cf", "#a6dba0", "#e08214"), # manually define color palette for fill
position = position_dodge(),
facet.by = "team", # separate the two teams of mice
xlab="Condition",
ylab="Mean weight change") +
guides(fill = "none")
Let’s look at how lean mass changes during the experiment.
ob_mice_lm_week_df <- ob_mice_df %>%
filter(! is.na(lean_body_mass)) %>% # select only the weeks with measurements
group_by(cond_f, team) %>% # group for combination of condition and team
filter(week_experiment == max(week_experiment)) %>% # take only the last measurements
ungroup()
ob_mice_lm_week_df
Task: Create a new code-chunk by pressing
ctrl+alt+i and write the code to plot the mean and sd of
lean_body_mass_change
for each cond_f
per
team
using ggpubr
. Color by condition, using
any palette from https://github.com/nanxstats/ggsci by just passing the
palette name via the palette = NULL
parameter.
ggbarplot(ob_mice_lm_week_df,
x = "cond_f", y = "lean_body_mass_change",
color = "cond_f",
add = "mean_sd", palette = "npg",
position = position_dodge(),
facet.by = "team",
xlab="Condition",
ylab="Lean mass change") +
labs(color = "Condition")
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_summary()`).
The distribution of certain attributes is usually more complex, than
it could be conveyed using a simple bar-plot. Histograms are for
visualising counts of continuous values grouped into bins. Let’s use
geom_histogram
from ggplot2 to plot the distribution of
weight growth pace during the first year in our cohorts.
ggplot(ob_mice_df, aes(auc_glucose)) +
geom_histogram(aes(fill = cond_f), bins = 50) + # divide data range into 50 bins
ylab("Count") +
xlab("Glucose level") +
theme_bw() +
theme(legend.position="bottom") +
labs(fill = NULL) +
ggtitle("Glucose response")
## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_bin()`).
Is it possible to tell from a glance at the above plot, if there is difference in glucose tolerance between the mice?
Task: facet_grid(rows=NULL, cols=NULL)
is for faceting the plot onto a grid, where rows and columns could be
specified using the vars(column-name)
construct. Add it to
the plot above, to visualise the groups (and teams) separately. To allow
different ranges on the y-axis in the grids, facet_grid
also takes the scales = "free_y"
argument.
ggplot(ob_mice_df, aes(auc_glucose)) +
geom_histogram(aes(fill = cond_f), bins = 20) +
facet_grid(vars(cond_f), vars(team), scales = "free_y") +
ylab("Count") +
xlab("Glucose response AUC") +
theme_bw() +
theme(legend.position="bottom") +
labs(fill = NULL)
## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_bin()`).
Instead of stacking the bars of the histogram on top of each other,
ggpubr
shows them overlapping with partial transparency. In
addition, markers for mean or median values could be added.
gghistogram(ob_mice_df,
x = "auc_glucose",
bins = 30,
add = "median", # add vertical line to median value
rug = FALSE, # switch this to TRUE to see what it does
fill = "cond_f", # groups for color
palette = c("#a6dba0", "#c2a5cf", "#e08214"),
facet.by = "team",
ylab="Count",
xlab="Glucose response") +
labs(fill = "Condition") +
theme(legend.position="bottom") +
guides(colour = "none") # remove legend for median line
## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_bin()`).
While now we can see, that there is a difference in the median values for the glucose response between mice, we don’t know, if it is a significant difference. The graph is also quite busy, with multiple overlaps.
To offer a solution for both of these issues, we can turn to box and violin plots, that display categorical values on the x-axis and numeric values on the y-axis. A boxplot displays the quartiles of values for each category: the mid-line is drawn at 50% probability, and the sides of the box represent 25% and 75%. Outlier values are shown as individual dots.
Meanwhile a violin plot draws the kernel density of the values. The
two can be combined into one plot with ggplot, by providing the shared
aesthetics values in the main function. stat_compare_means
function from ggpubr
can add a p-value from selected
statistical test, such as Kruskal-Wallis (default), t-test,
Wilcoxon-test, ANOVA, etc.
ggplot(ob_mice_df, aes(cond_f, auc_glucose)) +
geom_violin() + # draw a violin plot
geom_boxplot(aes(color= cond_f), width = 0.1) + # add a thinner boxplot
xlab("Condition") +
ylab("Glucose response") +
theme_bw() +
theme(legend.position="bottom") +
guides(color = "none") +
stat_compare_means(method = "anova")
## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_compare_means()`).
Task: Use ggplot2 to visualise the distribution of
absolute weight change for each condition (factor!) in
ob_mice_last_week_df
.
ggplot(ob_mice_last_week_df, aes(cond_f, body_weight_change)) +
geom_violin() +
geom_boxplot(aes(color= cond_f), width = 0.1) +
xlab("Condition") +
ylab("Weight change") +
theme_bw() +
theme(legend.position="bottom") +
guides(color = "none")
ggpubr
can also combine box and violin plots, the
ggviolin
is the base-plot in this case, and the boxes can
be added on top either with add = NULL
or as additional
ggplot2
component.
ggviolin(ob_mice_df, x = "cond_f", y = "auc_glucose",
color = "cond_f",
add = "boxplot",
palette = c("#a6dba0","#c2a5cf", "#e08214"),
ylab="Weight change",
xlab="Condition")
## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Next, we define the base-plot with ggpubr
that we are
going to use in the next few steps:
boxplot_wg_p <- ggboxplot(ob_mice_last_week_df, x = "cond_f", y = "body_weight_change",
fill = "cond_f",
palette = c("#a6dba0","#c2a5cf", "#e08214"),
ylab="Weight change",
xlab="Condition") +
labs(fill = NULL) +
theme(legend.position="bottom")
show(boxplot_wg_p)
The stat_compare_means
function adds p-values from
different statistical tests. One can replace the numeric p-values with
symbols to denote significance by passing
label = after_stat(p.signif)
in the aesthetic argument. You
can remove this, to see how it looks without! Furthermore, we can
specify which group should be used as the reference for each comparison:
ref.group = "Control"
.
boxplot_wg_p + stat_compare_means(aes(label = after_stat(p.signif)),
ref.group = "Control",
method = "t.test")
Individual pairs of categories could also be specified, and supplied
to the comparisons
argument in the function.
wg_comparisons <- list( c("Control", "Pair"), c("Control", "Ad-lib"))
boxplot_wg_p + stat_compare_means(comparisons = wg_comparisons, method = "t.test")
Let’s make another plot, comparing the lean mass change for each
condition, in the two teams of animals. We are adding pairwise
comparisons to the plot using t-tests with geom_pwc
.
# create base plot object
bxplot_multi_p <- ggboxplot(ob_mice_lm_week_df, x = "cond_f", y = "lean_body_mass_change",
fill = "cond_f",
palette = c("#a6dba0", "#c2a5cf", "#e08214"),
ylab="Lean mass change",
xlab="Teams",
facet.by="team") +
geom_pwc(method = "t.test")
bxplot_multi_p
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_pwc()`).
And if you go “Wait, there are too many comparisons, shouldn’t we
account for multiple testing?”. You are completely right, and there is a
function in ggpubr
for that too:
ggadjust_pvalue
. Correction methods available are “holm”,
“hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”. Note: this
function is not applied to a plot with +
. Instead, the plot
object needs to be passed to it.
# pass plot to function
ggadjust_pvalue(
bxplot_multi_p,
p.adjust.method = "fdr",
label = "{p.adj.format}{p.adj.signif}", # formatting of the text label
hide.ns = T #hide non-significant values
)
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_pwc()`).
Task: Plot and compare the glucose response (AUC) in each condition, using both box and violin plots. Use symbols to show significance.
ggplot(ob_mice_df %>% filter(! is.na(auc_glucose)), aes(cond_f, auc_glucose)) +
geom_violin() +
geom_boxplot(aes(color= cond_f), width = 0.1) +
xlab("Condition") +
ylab("Glucose response") +
theme_bw() +
theme(legend.position="bottom") +
guides(color = "none") +
stat_compare_means(aes(label = after_stat(p.signif)), method = "t.test", comparisons = list( c("Control", "Pair"), c("Control", "Ad-lib"), c("Pair", "Ad-lib")))
For the next set of plots, we are going to use metadata from the DIABIMMUNE cohorts, stored in an Excel spreadsheet. The project aimed to explore the “hygiene hypothesis” and built cohorts of newborns from Estonia, Finland and Russia.
The data is organised over multiple sheets, which need to be read
into separate data frames. Like the parsers used previously, the
read_xlxs
function can choose the correct type for each
column, but it can also be specified via the col_types
parameter.
dia_metadata_samples_df <- read_xlsx("diabimmune_combined_metadata.xlsx", sheet="Samples")
dia_metadata_birth_df <- read_xlsx("diabimmune_combined_metadata.xlsx", sheet="Pregnancy, birth")
Task: Parse the “Growth” sheet from the same
spreadsheet into a variable called
dia_metadata_growth_df
!
dia_metadata_growth_df <- read_xlsx("diabimmune_combined_metadata.xlsx", sheet="Growth")
The metadata contains subjects from three different cohorts. The
assignments are in the “Samples” sheet, which we read into R already. As
there are multiple samples per subject, we need to extract the distinct
values for “subjectID” to get an overview. Despite them being unique, we
should in this case also include the variables “country” and “cohort” in
the function, in order to preserve these columns in the returned data
frame. Note: We can also specify .keep_all = TRUE
to keep
all columns.
dia_subjects_df <- dia_metadata_samples_df %>% distinct(subjectID, country, cohort)
The simplest way to piece together data frames that share columns
(typically containing some kind of unique identifier), is to use
mutating joins (https://dplyr.tidyverse.org/reference/mutate-joins.html).
Here, we are going to use full_join
, which keeps all
records from both data frame, and left_join
, which keeps
only the records from the data frame on the right that have matching
counterparts in the left. The shared column can be specified either as
by = "subjectID"
if the column name is also identical in
the two data frames, or as by = c("subjectID" = "ID")
if
different.
dia_metadata_combi_df <- full_join(dia_metadata_birth_df, dia_subjects_df, by = "subjectID")
Task: Add code to also left-join
dia_metadata_growth_df
to
dia_metadata_combi_df
, and inspect the resulting data frame
in the viewer.
dia_metadata_combi_df <- dia_metadata_combi_df %>% left_join(dia_metadata_growth_df, by = "subjectID")
Two continuous values are typically plotted against each other using a scatter plot. In addition, marker colors and shapes could be used to display categorical values. There are a limited number of colors, but even more limited number of shapes to use in R.
Task: Switch the attributes for shape and color! Does the resulting plot confer a different message, than before?
ggplot(dia_metadata_combi_df, aes(x = length_growth_pace_during_year_one, y = weight_growth_pace_during_year_one)) +
geom_point(aes(shape = csection, color = gender)) +
scale_shape_manual(values = c(1,3)) +
xlab("Length growth") +
ylab("Weight growth")
## Warning: Removed 23 rows containing missing values or values outside the scale range
## (`geom_point()`).
Sometimes, we are interested to see if two attributes correlate, or not. Calculating the Pearson or Spearman correlation between values gives us numerical answers to this question, but visualising it with a fitted line helps in interpreting the association.
cor.test(dia_metadata_combi_df$length_growth_pace_during_year_one, dia_metadata_combi_df$weight_growth_pace_during_year_one, method = "spearman")
## Warning in
## cor.test.default(dia_metadata_combi_df$length_growth_pace_during_year_one, :
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: dia_metadata_combi_df$length_growth_pace_during_year_one and dia_metadata_combi_df$weight_growth_pace_during_year_one
## S = 1653045, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4960927
Basic R functions lm
, for fitting a linear model, and
plot
to plot it, give lot of control over the parameters of
the fitted model and confidence interval, but a very basic plot.
# fit a linear model with lm(y ~ x, data_frame)
wl_lm <- lm(weight_growth_pace_during_year_one ~ length_growth_pace_during_year_one, dia_metadata_combi_df)
# initialize new x points for calculating y' and confidence interval on
conf_x = seq(min(dia_metadata_combi_df$length_growth_pace_during_year_one, na.rm = T),max(dia_metadata_combi_df$length_growth_pace_during_year_one, na.rm = T),by = 0.05)
conf_x_df = data.frame(length_growth_pace_during_year_one=conf_x)
# predict fit and 95% conf. interval
fit_conf_interval <- predict(wl_lm, newdata=conf_x_df, interval="confidence",
level = 0.95)
# plot data
plot(dia_metadata_combi_df$length_growth_pace_during_year_one, dia_metadata_combi_df$weight_growth_pace_during_year_one,
xlab="Length growth", ylab="Weight growth")
# add fitted regression to plot
abline(wl_lm, col="blue")
# add lower and upper confidence interval values from fit_conf_interval
lines(conf_x, fit_conf_interval[,2], col="darkgray", lty=2)
lines(conf_x, fit_conf_interval[,3], col="darkgray", lty=2)
ggpubr
function for scatter plot has an add-on argument
for adding a regression line, but not a confidence interval. On the
other hand, the correlation coefficient (stat_cor
) and the
equation for the regression line (stat_regline_equation
)
can be shown on the diagram at specific coordinates with
label.x
and label.y
.
ggscatter(dia_metadata_combi_df,
x = "length_growth_pace_during_year_one", y = "weight_growth_pace_during_year_one",
add = "reg.line",
xlab="Length growth", ylab="Weight growth"
) +
stat_cor(label.x = c(12), label.y = c(4)) +
stat_regline_equation(label.x = c(12), label.y = c(4.3))
## Warning: Removed 23 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 23 rows containing non-finite outside the scale range
## (`stat_cor()`).
## Warning: Removed 23 rows containing non-finite outside the scale range
## (`stat_regline_equation()`).
## Warning: Removed 23 rows containing missing values or values outside the scale range
## (`geom_point()`).
The ggplot2 package has the geom_smooth
function to fit
curves to the data. Here, we can specify, that we want a linear model
and a confidence interval of 95%. It can be combined with the
ggpubr
functions to show the correlation coefficient,
etc.
ggplot(dia_metadata_combi_df, aes(length_growth_pace_during_year_one, weight_growth_pace_during_year_one)) +
geom_point(aes(shape = csection, color = gender)) +
geom_smooth(method = "lm", formula = 'y ~ x', level = 0.95) +
xlab("Length growth (cm/year)") +
ylab("Weight growth (kg/year)") +
labs(color = "Gender", shape = "C-section")
## Warning: Removed 23 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 23 rows containing missing values or values outside the scale range
## (`geom_point()`).
stat_cor(label.x = c(32), label.y = c(7))
## geom_text: parse = TRUE, na.rm = FALSE
## stat_cor: label.x.npc = left, label.y.npc = top, label.x = 32, label.y = 7, label.sep = , , method = pearson, alternative = two.sided, output.type = expression, digits = 2, r.digits = 2, p.digits = 2, r.accuracy = NULL, p.accuracy = NULL, cor.coef.name = R, na.rm = FALSE
## position_identity
Task: One can fit separate regression lines to
different groups in the data by passing the grouping variables, for
example as color
, in the aesthetic argument for
geom_smooth
. Plot a diagram that shows how the 3 year
weight ~ 3 year length growth pace correlates for the different genders
based on data in dia_metadata_combi_df
. Display the
correlation coefficients and line equations.
ggplot(dia_metadata_combi_df, aes(length_growth_pace_during_three_years, weight_growth_pace_during_three_years, color = gender)) +
geom_point(aes(shape = csection)) +
geom_smooth(aes(color = gender), method = "lm", formula = 'y ~ x', level = 0.95) +
stat_cor(label.x = c(16.5,12.1), label.y = c(2.9,4.1)) +
stat_regline_equation(label.x = c(16.5,12.1), label.y = c(3.1,4.4)) +
xlab("Length growth (cm/year)") +
ylab("Weight growth (kg/year)")
## Warning: Removed 76 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 76 rows containing non-finite outside the scale range
## (`stat_cor()`).
## Warning: Removed 76 rows containing non-finite outside the scale range
## (`stat_regline_equation()`).
## Warning: Removed 76 rows containing missing values or values outside the scale range
## (`geom_point()`).
Task: Using the ob/ob mice data, plot the body fat
and lean mass change from ob_mice_lm_week_df
against each
other, faceting on experimental groups. Fit regression lines, display
the correlation coefficients and line equations.
ggplot(ob_mice_lm_week_df, aes(body_fat_change, lean_body_mass_change, color = cond_f)) +
geom_point() +
geom_smooth(aes(color = cond_f), method = "lm", formula = 'y ~ x', level = 0.95) +
facet_wrap(vars(cond_f), scales = "free") +
stat_cor(label.y = c(17,-0.25,5.5)) +
stat_regline_equation() +
xlab("Body fat change") +
ylab("Lean mass change")
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_cor()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_regline_equation()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).