Tidyverse, ggplot2 and ggpubr

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

## Warning: package 'stringr' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)
library(readr)
library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.3.3

Parsing metadata from comma separated file

For the first part of this exercise, we are going to use metadata from a study conducted by Kashani et al. (2019) on leptin-deficient (ob/ob) mouse. They compared the change in body mass and glucose-response in ad-libitum fed ob/ob mice (adlib) compared to wild-type (con), and compared to ob/ob mice fed the same amount as the control mice (pair).

The data is in a csv and we use a function from the readr package to read it into a data frame. Data frames are R data objects, where each column is one specific type (fx. character, numeric, logic). The read_csv function can choose the correct type for each column based on the data it reads in, but it can also be specified via the col_types parameter.

ob_mice_df <- read_csv("Kashani2019.subset.csv")

## Rows: 323 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Sample_ID, Week, condition, team, fecal_date
## dbl (10): Sample_number, week_experiment, animal_id, body_fat, body_weight, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ob_mice_df

Hint: You can inspect the different objects in RStudio either by double-clicking on them in the Environment view, or by ctrl+click on the variable names in the code-chunk.

Bar-plots and error-bars

Visualisation of mean or median values, for example for replicates in experimental groups, is often done with bar-plots. Bar-plots take categorical data on the x-axis and numeric data on the y-axis. The variance could be denoted by adding error-bars. Here, we shall look at the change in body weight in each experimental group of mice.

The categories on the axises of plots are by default alphabetically ordered. If we would like to change the order, we can encode the string variable into factors, and change their underlying levels with the forcats package.

f <- as.factor(ob_mice_df$condition)
# re-order the values of f by listing them in the desired order
f <- fct_relevel(f, c("con", "pair", "adlib"))
# re-label the values
ob_mice_df$cond_f <- fct_recode(f, Control = "con", `Ad-lib` = "adlib", Pair = "pair")

For ggplot2, the means and standard deviation (SD) needs to be pre-computed for the groups. First, let’s filter the data frame to only contain data points from the last week of the experiment.

ob_mice_last_week_df <- ob_mice_df %>%
  filter(! is.na(body_weight)) %>%  # select only the weeks with measurements
  group_by(cond_f, team) %>% # group for combination of condition and team
  filter(week_experiment == max(week_experiment)) %>% # take only the last measurements
  ungroup()

After grouping the values by condition, we use summarise function from dplyr to calculate values for the last grouping variable. For example, if we were to group by condition and team here, the mean and SD in the returned data frame would be for the mice in each team in each condition.

mean_wg_end_df <- ob_mice_last_week_df %>% 
  group_by(cond_f) %>% 
  summarise(mean_wg_wk=mean(body_weight_change, na.rm = T), se_wg_wk=sd(body_weight_change, na.rm = T)) %>% ungroup()

mean_wg_end_df

Now we can draw the bar plot, using geom_bar:

ggplot(mean_wg_end_df, aes(x = cond_f, y = mean_wg_wk, fill = cond_f)) + 
  geom_bar(position = position_dodge(), stat = "identity") + # not stacked
  geom_errorbar(aes(ymin = mean_wg_wk-se_wg_wk, ymax = mean_wg_wk+se_wg_wk), # define min/max y-value
                  width = .2,
                  position = position_dodge(.9)) +
  theme_classic() +
  xlab("Condition") +
  ylab("Mean absolute weight change") +
  guides(fill = "none") # remove label for fill option as they are noted on y-axis

On the other hand, ggpubr has built-in functions to derive mean and SD from the data: add = "mean_sd" in ggbarplot. It can also add the individual data points, as required by some journals, to the plot. Try dropping the faceting argument!

ggbarplot(ob_mice_last_week_df,
 x = "cond_f", y = "body_weight_change",
 fill = "cond_f", # use the formatted labels for the groups
 add = c("mean_sd", "jitter"), # add error bar and individual points
 palette = c("#c2a5cf", "#a6dba0", "#e08214"), # manually define color palette for fill
 position = position_dodge(),
 facet.by = "team", # separate the two teams of mice
 xlab="Condition",
 ylab="Mean weight change") +
guides(fill = "none")

Let’s look at how lean mass changes during the experiment.

ob_mice_lm_week_df <- ob_mice_df %>%
  filter(! is.na(lean_body_mass)) %>%  # select only the weeks with measurements
  group_by(cond_f, team) %>% # group for combination of condition and team
  filter(week_experiment == max(week_experiment)) %>% # take only the last measurements
  ungroup()
ob_mice_lm_week_df

Task: Create a new code-chunk by pressing ctrl+alt+i and write the code to plot the mean and sd of lean_body_mass_change for each cond_f per team using ggpubr. Color by condition, using any palette from https://github.com/nanxstats/ggsci by just passing the palette name via the palette = NULL parameter.

ggbarplot(ob_mice_lm_week_df,
 x = "cond_f", y = "lean_body_mass_change",
 color = "cond_f",
 add = "mean_sd", palette = "npg",
 position = position_dodge(),
 facet.by = "team",
 xlab="Condition",
 ylab="Lean mass change") +
labs(color = "Condition")

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_summary()`).

Histograms, box-plots and violin-plots

The distribution of certain attributes is usually more complex, than it could be conveyed using a simple bar-plot. Histograms are for visualising counts of continuous values grouped into bins. Let’s use geom_histogram from ggplot2 to plot the distribution of weight growth pace during the first year in our cohorts.

ggplot(ob_mice_df, aes(auc_glucose)) +
  geom_histogram(aes(fill = cond_f), bins = 50) + # divide data range into 50 bins
  ylab("Count") +
  xlab("Glucose level") +
  theme_bw() +
  theme(legend.position="bottom") +
  labs(fill = NULL) +
  ggtitle("Glucose response")

## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_bin()`).

Is it possible to tell from a glance at the above plot, if there is difference in glucose tolerance between the mice?

Task: facet_grid(rows=NULL, cols=NULL) is for faceting the plot onto a grid, where rows and columns could be specified using the vars(column-name) construct. Add it to the plot above, to visualise the groups (and teams) separately. To allow different ranges on the y-axis in the grids, facet_grid also takes the scales = "free_y" argument.

ggplot(ob_mice_df, aes(auc_glucose)) +
  geom_histogram(aes(fill = cond_f), bins = 20) +
  facet_grid(vars(cond_f), vars(team), scales = "free_y") +
  ylab("Count") +
  xlab("Glucose response AUC") +
  theme_bw() +
  theme(legend.position="bottom") +
  labs(fill = NULL)

## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_bin()`).

Instead of stacking the bars of the histogram on top of each other, ggpubr shows them overlapping with partial transparency. In addition, markers for mean or median values could be added.

gghistogram(ob_mice_df,
   x = "auc_glucose",
   bins = 30,
   add = "median", # add vertical line to median value
   rug = FALSE, # switch this to TRUE to see what it does
   fill = "cond_f", # groups for color
   palette = c("#a6dba0", "#c2a5cf", "#e08214"),
   facet.by = "team",
   ylab="Count",
   xlab="Glucose response") +
labs(fill = "Condition") +
theme(legend.position="bottom") +
guides(colour = "none") # remove legend for median line

## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_bin()`).

While now we can see, that there is a difference in the median values for the glucose response between mice, we don’t know, if it is a significant difference. The graph is also quite busy, with multiple overlaps.

To offer a solution for both of these issues, we can turn to box and violin plots, that display categorical values on the x-axis and numeric values on the y-axis. A boxplot displays the quartiles of values for each category: the mid-line is drawn at 50% probability, and the sides of the box represent 25% and 75%. Outlier values are shown as individual dots.

Meanwhile a violin plot draws the kernel density of the values. The two can be combined into one plot with ggplot, by providing the shared aesthetics values in the main function. stat_compare_means function from ggpubr can add a p-value from selected statistical test, such as Kruskal-Wallis (default), t-test, Wilcoxon-test, ANOVA, etc.

ggplot(ob_mice_df, aes(cond_f, auc_glucose)) +
  geom_violin() + # draw a violin plot
  geom_boxplot(aes(color= cond_f), width = 0.1) + # add a thinner boxplot
  xlab("Condition") +
  ylab("Glucose response") +
  theme_bw() +
  theme(legend.position="bottom") +
  guides(color = "none") +
stat_compare_means(method = "anova")

## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_compare_means()`).

Task: Use ggplot2 to visualise the distribution of absolute weight change for each condition (factor!) in ob_mice_last_week_df.

ggplot(ob_mice_last_week_df, aes(cond_f, body_weight_change)) +
  geom_violin() +
  geom_boxplot(aes(color= cond_f), width = 0.1) +
  xlab("Condition") +
  ylab("Weight change") +
  theme_bw() +
  theme(legend.position="bottom") +
  guides(color = "none")

ggpubr can also combine box and violin plots, the ggviolin is the base-plot in this case, and the boxes can be added on top either with add = NULL or as additional ggplot2 component.

ggviolin(ob_mice_df, x = "cond_f", y = "auc_glucose",
   color = "cond_f",
   add = "boxplot",
   palette = c("#a6dba0","#c2a5cf", "#e08214"),
   ylab="Weight change",
   xlab="Condition")

## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

## Warning: Removed 291 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Next, we define the base-plot with ggpubr that we are going to use in the next few steps:

boxplot_wg_p <- ggboxplot(ob_mice_last_week_df, x = "cond_f", y = "body_weight_change",
   fill = "cond_f",
   palette = c("#a6dba0","#c2a5cf", "#e08214"),
   ylab="Weight change",
   xlab="Condition") +
labs(fill = NULL) +
theme(legend.position="bottom")
show(boxplot_wg_p)

The stat_compare_means function adds p-values from different statistical tests. One can replace the numeric p-values with symbols to denote significance by passing label = after_stat(p.signif) in the aesthetic argument. You can remove this, to see how it looks without! Furthermore, we can specify which group should be used as the reference for each comparison: ref.group = "Control".

boxplot_wg_p + stat_compare_means(aes(label = after_stat(p.signif)),
                                  ref.group = "Control",
                                  method = "t.test")

Individual pairs of categories could also be specified, and supplied to the comparisons argument in the function.

wg_comparisons <- list( c("Control", "Pair"), c("Control", "Ad-lib"))
boxplot_wg_p + stat_compare_means(comparisons = wg_comparisons, method = "t.test")

Let’s make another plot, comparing the lean mass change for each condition, in the two teams of animals. We are adding pairwise comparisons to the plot using t-tests with geom_pwc.

# create base plot object
bxplot_multi_p <- ggboxplot(ob_mice_lm_week_df, x = "cond_f", y = "lean_body_mass_change",
   fill = "cond_f",
   palette = c("#a6dba0", "#c2a5cf", "#e08214"),
   ylab="Lean mass change",
   xlab="Teams", 
   facet.by="team") +
  geom_pwc(method = "t.test")
bxplot_multi_p

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_pwc()`).

And if you go “Wait, there are too many comparisons, shouldn’t we account for multiple testing?”. You are completely right, and there is a function in ggpubr for that too: ggadjust_pvalue. Correction methods available are “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”. Note: this function is not applied to a plot with +. Instead, the plot object needs to be passed to it.

# pass plot to function
ggadjust_pvalue(
  bxplot_multi_p,
  p.adjust.method = "fdr",
  label = "{p.adj.format}{p.adj.signif}", # formatting of the text label
  hide.ns = T #hide non-significant values
)

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_pwc()`).

Task: Plot and compare the glucose response (AUC) in each condition, using both box and violin plots. Use symbols to show significance.

ggplot(ob_mice_df %>% filter(! is.na(auc_glucose)), aes(cond_f, auc_glucose)) +
  geom_violin() +
  geom_boxplot(aes(color= cond_f), width = 0.1) +
  xlab("Condition") +
  ylab("Glucose response") +
  theme_bw() +
  theme(legend.position="bottom") +
  guides(color = "none") +
stat_compare_means(aes(label = after_stat(p.signif)), method = "t.test", comparisons = list( c("Control", "Pair"), c("Control", "Ad-lib"), c("Pair", "Ad-lib")))

Data from Excel spreadsheet

For the next set of plots, we are going to use metadata from the DIABIMMUNE cohorts, stored in an Excel spreadsheet. The project aimed to explore the “hygiene hypothesis” and built cohorts of newborns from Estonia, Finland and Russia.

The data is organised over multiple sheets, which need to be read into separate data frames. Like the parsers used previously, the read_xlxs function can choose the correct type for each column, but it can also be specified via the col_types parameter.

dia_metadata_samples_df <- read_xlsx("diabimmune_combined_metadata.xlsx", sheet="Samples")
dia_metadata_birth_df <- read_xlsx("diabimmune_combined_metadata.xlsx", sheet="Pregnancy, birth")

Task: Parse the “Growth” sheet from the same spreadsheet into a variable called dia_metadata_growth_df!

dia_metadata_growth_df <- read_xlsx("diabimmune_combined_metadata.xlsx", sheet="Growth")

The metadata contains subjects from three different cohorts. The assignments are in the “Samples” sheet, which we read into R already. As there are multiple samples per subject, we need to extract the distinct values for “subjectID” to get an overview. Despite them being unique, we should in this case also include the variables “country” and “cohort” in the function, in order to preserve these columns in the returned data frame. Note: We can also specify .keep_all = TRUE to keep all columns.

dia_subjects_df <- dia_metadata_samples_df %>% distinct(subjectID, country, cohort)

The simplest way to piece together data frames that share columns (typically containing some kind of unique identifier), is to use mutating joins (https://dplyr.tidyverse.org/reference/mutate-joins.html).
Here, we are going to use full_join, which keeps all records from both data frame, and left_join, which keeps only the records from the data frame on the right that have matching counterparts in the left. The shared column can be specified either as by = "subjectID" if the column name is also identical in the two data frames, or as by = c("subjectID" = "ID") if different.

dia_metadata_combi_df <- full_join(dia_metadata_birth_df, dia_subjects_df, by = "subjectID")

Task: Add code to also left-join dia_metadata_growth_df to dia_metadata_combi_df, and inspect the resulting data frame in the viewer.

dia_metadata_combi_df <-  dia_metadata_combi_df %>% left_join(dia_metadata_growth_df, by = "subjectID")

Scatter plots and regression

Two continuous values are typically plotted against each other using a scatter plot. In addition, marker colors and shapes could be used to display categorical values. There are a limited number of colors, but even more limited number of shapes to use in R.

Task: Switch the attributes for shape and color! Does the resulting plot confer a different message, than before?

ggplot(dia_metadata_combi_df, aes(x = length_growth_pace_during_year_one, y = weight_growth_pace_during_year_one)) +
  geom_point(aes(shape = csection, color = gender)) +
  scale_shape_manual(values = c(1,3)) +
  xlab("Length growth") +
  ylab("Weight growth")

## Warning: Removed 23 rows containing missing values or values outside the scale range
## (`geom_point()`).

Sometimes, we are interested to see if two attributes correlate, or not. Calculating the Pearson or Spearman correlation between values gives us numerical answers to this question, but visualising it with a fitted line helps in interpreting the association.

cor.test(dia_metadata_combi_df$length_growth_pace_during_year_one, dia_metadata_combi_df$weight_growth_pace_during_year_one, method = "spearman")

## Warning in
## cor.test.default(dia_metadata_combi_df$length_growth_pace_during_year_one, :
## Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  dia_metadata_combi_df$length_growth_pace_during_year_one and dia_metadata_combi_df$weight_growth_pace_during_year_one
## S = 1653045, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4960927

Basic R functions lm, for fitting a linear model, and plot to plot it, give lot of control over the parameters of the fitted model and confidence interval, but a very basic plot.

# fit a linear model with lm(y ~ x, data_frame)
wl_lm <- lm(weight_growth_pace_during_year_one ~ length_growth_pace_during_year_one, dia_metadata_combi_df)
# initialize new x points for calculating y' and confidence interval on
conf_x = seq(min(dia_metadata_combi_df$length_growth_pace_during_year_one, na.rm = T),max(dia_metadata_combi_df$length_growth_pace_during_year_one, na.rm = T),by = 0.05)
conf_x_df = data.frame(length_growth_pace_during_year_one=conf_x)
# predict fit and 95% conf. interval
fit_conf_interval <- predict(wl_lm, newdata=conf_x_df, interval="confidence",
                         level = 0.95)

# plot data
plot(dia_metadata_combi_df$length_growth_pace_during_year_one, dia_metadata_combi_df$weight_growth_pace_during_year_one,
     xlab="Length growth", ylab="Weight growth")
# add fitted regression to plot
abline(wl_lm, col="blue")
# add lower and upper confidence interval values from fit_conf_interval
lines(conf_x, fit_conf_interval[,2], col="darkgray", lty=2)
lines(conf_x, fit_conf_interval[,3], col="darkgray", lty=2)

ggpubr function for scatter plot has an add-on argument for adding a regression line, but not a confidence interval. On the other hand, the correlation coefficient (stat_cor) and the equation for the regression line (stat_regline_equation) can be shown on the diagram at specific coordinates with label.x and label.y.

ggscatter(dia_metadata_combi_df,
  x = "length_growth_pace_during_year_one", y = "weight_growth_pace_during_year_one",
  add = "reg.line",
  xlab="Length growth", ylab="Weight growth"
  ) +
  stat_cor(label.x = c(12), label.y = c(4)) +
  stat_regline_equation(label.x = c(12), label.y = c(4.3))

## Warning: Removed 23 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 23 rows containing non-finite outside the scale range
## (`stat_cor()`).

## Warning: Removed 23 rows containing non-finite outside the scale range
## (`stat_regline_equation()`).

## Warning: Removed 23 rows containing missing values or values outside the scale range
## (`geom_point()`).

The ggplot2 package has the geom_smooth function to fit curves to the data. Here, we can specify, that we want a linear model and a confidence interval of 95%. It can be combined with the ggpubr functions to show the correlation coefficient, etc.

ggplot(dia_metadata_combi_df, aes(length_growth_pace_during_year_one, weight_growth_pace_during_year_one)) +
  geom_point(aes(shape = csection, color = gender)) +
  geom_smooth(method = "lm", formula = 'y ~ x', level = 0.95)  +
  xlab("Length growth (cm/year)") +
  ylab("Weight growth (kg/year)") +
  labs(color = "Gender", shape = "C-section")

## Warning: Removed 23 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 23 rows containing missing values or values outside the scale range
## (`geom_point()`).

  stat_cor(label.x = c(32), label.y = c(7))

## geom_text: parse = TRUE, na.rm = FALSE
## stat_cor: label.x.npc = left, label.y.npc = top, label.x = 32, label.y = 7, label.sep = , , method = pearson, alternative = two.sided, output.type = expression, digits = 2, r.digits = 2, p.digits = 2, r.accuracy = NULL, p.accuracy = NULL, cor.coef.name = R, na.rm = FALSE
## position_identity

Task: One can fit separate regression lines to different groups in the data by passing the grouping variables, for example as color, in the aesthetic argument for geom_smooth. Plot a diagram that shows how the 3 year weight ~ 3 year length growth pace correlates for the different genders based on data in dia_metadata_combi_df. Display the correlation coefficients and line equations.

ggplot(dia_metadata_combi_df, aes(length_growth_pace_during_three_years, weight_growth_pace_during_three_years, color = gender)) +
  geom_point(aes(shape = csection)) +
  geom_smooth(aes(color = gender), method = "lm", formula = 'y ~ x', level = 0.95) +
  stat_cor(label.x = c(16.5,12.1), label.y = c(2.9,4.1)) +
  stat_regline_equation(label.x = c(16.5,12.1), label.y = c(3.1,4.4))  +
  xlab("Length growth (cm/year)") +
  ylab("Weight growth (kg/year)")

## Warning: Removed 76 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 76 rows containing non-finite outside the scale range
## (`stat_cor()`).

## Warning: Removed 76 rows containing non-finite outside the scale range
## (`stat_regline_equation()`).

## Warning: Removed 76 rows containing missing values or values outside the scale range
## (`geom_point()`).

Task: Using the ob/ob mice data, plot the body fat and lean mass change from ob_mice_lm_week_df against each other, faceting on experimental groups. Fit regression lines, display the correlation coefficients and line equations.

ggplot(ob_mice_lm_week_df, aes(body_fat_change, lean_body_mass_change, color = cond_f)) +
  geom_point() +
  geom_smooth(aes(color = cond_f), method = "lm", formula = 'y ~ x', level = 0.95) +
  facet_wrap(vars(cond_f), scales = "free") +
  stat_cor(label.y = c(17,-0.25,5.5)) +
  stat_regline_equation()  +
  xlab("Body fat change") +
  ylab("Lean mass change")

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_cor()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_regline_equation()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).