Visualising multivariate data

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

## Warning: package 'stringr' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(magrittr)

## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(readr)
library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.3.3

library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.3.3

library(phyloseq)
library(umap)

## Warning: package 'umap' was built under R version 4.3.2

library(Rtsne)

## Warning: package 'Rtsne' was built under R version 4.3.3

library(vegan)

## Warning: package 'vegan' was built under R version 4.3.3

## Loading required package: permute
## Loading required package: lattice

## Warning: package 'lattice' was built under R version 4.3.3

## This is vegan 2.6-6.1

library(microbiome)

## 
## microbiome R package (microbiome.github.com)
##     
## 
## 
##  Copyright (C) 2011-2022 Leo Lahti, 
##     Sudarshan Shetty et al. <microbiome.github.io>
## 
## 
## Attaching package: 'microbiome'
## 
## The following object is masked from 'package:vegan':
## 
##     diversity
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## 
## The following object is masked from 'package:base':
## 
##     transform

library(FactoMineR)

## Warning: package 'FactoMineR' was built under R version 4.3.3

library(factoextra)

## Warning: package 'factoextra' was built under R version 4.3.3

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(ggheatmapper)

## Loading required package: patchwork

## Warning: package 'patchwork' was built under R version 4.3.3

set.seed(1234)

In the following exercises, we are going to use 16s amplicon sequencing data from Gnanasekaran et al. (2023), wherein they inoculated two types of media in automated anaerobic multi-stage in vitro gut fermentors with synthetic human gut microbial community in two quantitative compositions (Equal and Fecal). First, we are only going to look at the samples from the BSC medium. Let’s parse the tables containing the amplicon-sequencing variants’ abundance data (OTU table) and the metadata.

sym_otu_bsc_df <- read.delim("otu_table_bsc.tsv", sep = "\t")
sym_metadata_bsc_df <- read.delim("metadata_bsc.tsv", sep = "\t")

Thereafter, we need to do some data wrangling, to gain data objects that contain only numerical values. We also calculate relative abundances from the count data, while we are there.

# separate the abundance count part from the taxonomy part in otu_table
sym_abund_df <- sym_otu_bsc_df %>% select(starts_with("Final."))
sym_tax_df <- sym_otu_bsc_df %>% select(-starts_with("Final."))

# calculate relative abundance
calc_relab <- function(x) x / sum(x)
sym_relab_df <- sym_abund_df %>% mutate(across(colnames(sym_abund_df), ~ calc_relab(.x)))

# add tax_ID as rownames to the new data frames 
rownames(sym_abund_df) <- sym_otu_bsc_df$tax_ID
rownames(sym_relab_df) <- sym_otu_bsc_df$tax_ID

# transpose the data frame, resulting in a matrix compatible with the vegan package functions
comm_df <- t(sym_abund_df)
relAb_df <- t(sym_relab_df)

The vegan R package contains widely used functions for community ecology.

Log-ratio transformed values

Amplicon and shotgun sequencing data are compositional, and to transform them to the Euclidean space, log-ratio transformations are used. Abundance data is also frequently sparse, i.e. there are taxa and sample combinations, where the observed abundance is zero. Naturally, these values need to be handled in some manner, before their log-ratios can be computed. They could be replaced with a small value, for example by taking the half of the smallest value in the table, or with a value sampled from an appropriate distribution. The latter is what cmultRepl function does, while also removing taxa that are zero in more than 80% of the samples. This is going to leave us with only approx. half of the taxa in our original dataset, highlighting how sparse this data is. (Note: this function also modifies the non-zero values to a small degree, but their ratios should be preserved.)

# fill zeros using Square root (SQ) Bayesian Replacement method
# samples - attribute
z_comm_df <- zCompositions::cmultRepl(relAb_df, output = "prop", method = "SQ", z.warning=0.8 )

## Warning in zCompositions::cmultRepl(relAb_df, output = "prop", method = "SQ", : Column no. 14 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 22 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 25 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 29 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 36 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 38 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 39 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 42 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 45 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 49 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 55 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 56 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 57 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 58 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 59 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 62 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 63 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 64 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 65 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 66 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 67 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 68 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 69 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 70 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 71 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 72 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 73 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 74 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 75 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 76 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 77 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 78 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 79 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 80 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 81 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 82 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 83 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 84 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 85 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 86 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 87 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 88 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 89 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 90 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 91 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 92 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 93 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 94 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).

## No. adjusted imputations:  486

z_comm_matrix <- as.matrix(z_comm_df)

The centered log-ratio is calculated for each element of a (sample) vector by taking the log of the value divided by the geometric mean of the elements in the vector. The resulting data is centered and scaled, but the variables are not independent, which limits the type of analyses one can perform on CLR values.

# centered log-ratio transformation
?clr

## No documentation for 'clr' in specified packages and libraries:
## you could try '??clr'

zclr_comm_obj <- compositions::clr(z_comm_matrix)

Heatmaps

Heatmaps are useful to visualise numerical data on a grid, typically containing observations (samples) on one axis and attributes on the other. The stats package has a heatmap function. It also draws a dendogram based on complete hierarchical clustering on the Euclidean distance of the sample vectors.

heatmap(zclr_comm_obj, Colv = NA)

Let’s turn to ggplot2 to plot a more customised heatmap. First, we need to convert the rmult object into a data frame, convert row names into a column, and join the metadata table via the shared sample column!

zclr_comm_df <- as.data.frame(zclr_comm_obj)
zclr_comm_metadata_df <- zclr_comm_df %>% 
  rownames_to_column("sample") %>% 
  left_join(sym_metadata_bsc_df, by = "sample")

Unfortunately, the heatmap function in ggplot2 does not have an option for clustering, thus if we would like to have the labels on the axes in some order, we need to set it manually by using factors. Here, we first order the samples by SHIME compartment and experiment day, then use that in the factor definition.

xaxis_order <- {zclr_comm_metadata_df %>% arrange(Compartment_unit, Day)}$Sample_ID
?factor

## starting httpd help server ... done

zclr_comm_metadata_df$sampleid_f <- factor(zclr_comm_metadata_df$Sample_ID, levels = xaxis_order)

Thereafter, the data frame needs to be converted into a long format with pivot_longer: the columns containing the CLR values (selected with starts_with("Seq")) are melted into two columns: one containing the taxon identifier (from the column name), and the other the CLR value. All other columns remain.

zclr_comm_metadata_long <- pivot_longer(zclr_comm_metadata_df, cols = starts_with("Seq"), names_to = "tax_ID", values_to = "clr_v")
zclr_comm_metadata_long

Next, we are going to construct labels for the y-axis. In order to have unique names, we are going to concatenate the genus to the taxon identifier.

# replace NA values with empty strings
taxonomy_labels_df <- sym_tax_df %>% mutate(Genus=ifelse(is.na(Genus), "", Genus))
# concatenate genus and taxon identifier
taxonomy_labels_df %<>% mutate(taxa_name=paste(Genus, tax_ID, sep=" "))
# join to the long data frame via "tax_ID"
zclr_comm_metadata_long <- left_join(zclr_comm_metadata_long, taxonomy_labels_df)

## Joining with `by = join_by(tax_ID)`

zclr_comm_metadata_long

After these preparations, we can finally call ggplot2 on the long data frame. The function for adding the heatmap geometry is geom_tile, and (surprisingly) we can pass numerical values to the fill option. With this palette, the brighter the colors, the higher the abundance.

ggplot(zclr_comm_metadata_long) +
  geom_tile(aes(sampleid_f, taxa_name, fill = clr_v)) +
  scale_fill_viridis_c(name = "CLR abundance", option = "A") + # use the magma palette from viridis
  scale_y_discrete(limits = rev) + # reverse the y-axis
  theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) + # rotate the x-axis labels
  xlab("Sample") +
  ylab("ASV")

As noted before, there is no native option in ggplot2 to add dendograms to the heatmap. However, another package, called ggheatmapper, offers this, and many other features. The warmer tones are above average in abundance, and the cooler ones are below average.

# generate a vector of the column names we are going to use on the x-axis
otus <- colnames(zclr_comm_df) 
# group the data frame rows by SHIME compartment
zclr_comm_metadata_compartmentGrouped <- zclr_comm_metadata_df %>% group_by(Compartment)
p <- ggheatmap(zclr_comm_metadata_compartmentGrouped,
    colv = "Sample_ID", # column to use for y-axis
    rowv = otus, # column list to use in x-axis
    hm_colors = 'RdYlBu', # heatmap palette, from ColorBrewer
    show_dend_row = FALSE, # don't show ASV clustering
    show_colnames = TRUE,
    show_rownames = FALSE,
    group_colors = c(`PC` = "#c51b8a", `DC` = "#7d5ba6"),
    group_prop = 0.02,
    colors_title = "CLR abundance")

## Adding missing grouping variables: `Compartment`

show(p)

We can build onto the previous plot object (p), for example adding tracks showing categorical data with add_tracks.

p2 <- add_tracks(p,
           track_columns = c("Day"),
           track_colors = list(Day = 'Greys'),
           track_prop = 0.05)

## Adding missing grouping variables: `Compartment`

p2 +
  plot_layout(guides = 'collect') # collar the guides

Task: Add another track, showing the Community_Type, to the existing plot. Call the new plot p2! Choose an appropriate palette from ColorBrewer.

p2 <- add_tracks(p2,
           track_columns = c("Community_Type"),
           track_colors = list(Community_Type = 'RdPu'),
           track_prop = 0.05)

## Adding missing grouping variables: `Compartment`

p2 +
  plot_layout(guides = 'collect')

The ggheatmapper package also makes it possible to extract the data from a clustered heatmap, and attach a diagram displaying numerical data for the same x-axis via the observations column. Here, we just select three random ASVs and visualise their CLR values over the x-axis.

table_p2 <- get_data(p2) # extract data
table_p2_long <- table_p2 %>%
  ungroup() %>% #ungroup data frame
  select(observations, Compartment, Seq2, Seq4, Seq11) %>% #select relevant columns
  pivot_longer(cols = c(-observations, -Compartment), names_to = "taxa_ID", values_to = "clr_v") # melt abundance columns into long format, keeping observations and Compartment columns
plt_abund <- ggplot(table_p2_long, aes(observations, clr_v, color = taxa_ID, group = taxa_ID)) + # use ggplot to plot diagram
  geom_line() +
  facet_wrap(vars(Compartment), scales = "free_x") +
  scale_y_continuous(position = "right") +
  guides(color = FALSE) +
  ylab("Abundance") +
  theme_sparse()

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

plt_abund

The additional plots sharing an axis with the heatmap can be aligned to the central heatmap with align_to_hm:

gghm_complete <- p2 %>%
  align_to_hm(plt_abund, newplt_size_prop = 0.3)

gghm_complete +
  plot_layout(guides = 'collect')

Dimensionality reduction methods

Omics data is multi-dimensional and multivariate analysis methods are required to interrogate it. In order to visualise or cluster the data points, we need to reduce the number of variables, while preserving the structure and characteristics of the data. Various methods exist for reducing dimensionality, and they can be divided into two categories, based on whether they faithfully preserve the distances between the input vectors or not.

Note: These methods are just visualization techniques. They do not assess statistically separation or correlation. For that, statistical test need to be performed.

Distance-preserving methods

Distances between samples

Typically, these methods either require a distance or dissimilarity matrix as input, or they calculate it themselves. For count or abundance data, the vegdist function from the vegan package can calculate distances using several methods used in ecology. The function takes a data frame with samples in rows and taxa in columns.

# calculate dissimilarity of the samples using Bray-Curtis distance
sym_otu_bray_dist <- vegdist(comm_df,method="bray", binary=FALSE)
# also calculate for a subset of OTUs
sym_otu_bray_s20_dist <- vegdist(comm_df[,1:20],method="bray", binary=FALSE)

Principal Co-ordinates Analysis or Metric Multidimensional scaling

This method projects the dissimilarity matrix into a lower-dimensional space, while preserving the relative distances between the samples. cmdscale is part of the stats module in R, and the resulting matrix converted into a data frame can be visualised on a scatterplot with ggplot2.

Change the coloring variable to Compartment . Now, do the groups separate?

# perform the MDS on the sym_otu_bray_dist matrix
?cmdscale
sym_otu_bray_MDS <- data.frame(cmdscale(sym_otu_bray_dist), k = 2)

# join the metadata to the resulting data frame
sym_otu_bray_MDS$sample <- rownames(sym_otu_bray_MDS)
sym_otu_bray_MDS <- left_join(sym_otu_bray_MDS, sym_metadata_bsc_df, by = "sample")

ggplot(sym_otu_bray_MDS, aes(X1, X2)) +
  geom_point(aes(color=Community_Type)) +
  ggtitle("Synthetic community MDS", subtitle = "Bray-Curtis") +
  xlab("Dim 1") +
  ylab("Dim 2") +
  geom_text_repel(aes(label = Day), size = 3.0) # place non-overlapping text labels

## Warning: ggrepel: 24 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Task: Use the jaccard method with binary=TRUE in vegdist() to compute the Jaccard-distance between the samples, and plot the PCoA of the resulting distance matrix. Color the points based on their community type and shape them for the SHIME compartment. Add labels from the Day column to the plot with geom_text_repel()!

sym_otu_manhattan_dist <- vegdist(comm_df,method="jaccard", binary=T)
sym_otu_manhattan_MDS <- data.frame(cmdscale(sym_otu_manhattan_dist))
sym_otu_manhattan_MDS$sample <- rownames(sym_otu_manhattan_MDS)
sym_otu_manhattan_MDS <- left_join(sym_otu_manhattan_MDS, sym_metadata_bsc_df, by = "sample")

ggplot(sym_otu_manhattan_MDS, aes(X1, X2)) +
  geom_point(aes(shape=Compartment, color = Community_Type)) +
  ggtitle("Synthetic community MDS", subtitle = "Jaccard") +
  xlab("Dim 1") +
  ylab("Dim 2") +
  geom_text_repel(aes(label = Day), size = 3.0)

Non-metric Multidimensional Scaling

NMDS generally takes any data-suitable distance matrix, and rank orders the distances. This makes it a non-parametric method, suitable for non-normal distributions. Here, the metaMDS function from the vegan package accepts community matrix as input, and calculates the Bray-Curtis distance from it. Thereafter, envfit is used to fit loading vectors onto the sym_otu_NMDS ordination object. Loading vectors show how samples along that vector have increased values of that particular attribute. In this case, the longer the vector, the stronger the association.

?metaMDS
sym_otu_NMDS <- metaMDS(comm_df[,1:20], distance="bray", engine = "monoMDS", trace=FALSE, trymax=100, tidy = T)
sym_otu_NMDS_fit <- envfit(sym_otu_NMDS, comm_df[, 1:20])

In order to visualise the ordination with ggplot2, coordinate vectors for the points (called sites) and loadings (vectors) need to be extracted from the objects.

sym_otu_NMDS_scores <- data.frame(scores(sym_otu_NMDS, display = "sites"), Compartment = as.factor(sym_metadata_bsc_df$Compartment)) #also add the column with the compartment information

sym_otu_NMDS_vectors <- as.data.frame(scores(sym_otu_NMDS_fit, display = "vectors"))

# add tax_ID column to the loadings data frame
sym_otu_NMDS_vectors <- cbind(sym_otu_NMDS_vectors, tax_ID = rownames(sym_otu_NMDS_vectors))

# construc the base scatterplot
nmds_plot <- ggplot(sym_otu_NMDS_scores) + # points data frame is the base
  geom_point(aes(x = NMDS1, y = NMDS2, colour = Compartment)) +
  coord_fixed()

# add the loadings
?geom_segment
nmds_plot +
  geom_segment(data = sym_otu_NMDS_vectors, # loadings are layered on top
               aes(x = 0, xend = NMDS1, y = 0, yend = NMDS2),
               arrow = arrow(length = unit(0.25, "cm")), colour = "grey") +
  geom_text_repel(data = sym_otu_NMDS_vectors, # loadings are labeled
                  aes(x = NMDS1, y = NMDS2, label = tax_ID),
                  size = 3)

Task: Perform a left join on the sym_otu_NMDS_vectors data frame, to add the taxonomical information from taxonomy_labels_df to it. Re-plot the biplot above, label with the unique labels of the OTUs.

sym_otu_NMDS_vectors <- left_join(sym_otu_NMDS_vectors, taxonomy_labels_df, by = c("tax_ID"))

nmds_plot + ## need aspect ratio of 1!
  geom_segment(data = sym_otu_NMDS_vectors,
               aes(x = 0, xend = NMDS1, y = 0, yend = NMDS2),
               arrow = arrow(length = unit(0.25, "cm")), colour = "grey") +
  geom_text_repel(data = sym_otu_NMDS_vectors, aes(x = NMDS1, y = NMDS2, label = taxa_name),
            size = 3)

## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

If we are only interested in the top loadings, the data frame containing them could be filtered to only include the longest ones.

# calculate the length of the loading vectors and rank them
sym_otu_NMDS_vectors_sub <- sym_otu_NMDS_vectors %>%
  mutate(eucdist=(NMDS1**2+NMDS2**2)**0.5) %>%
  mutate(assoc_rank = row_number(desc(eucdist))) %>%
  filter(assoc_rank < 11)

nmds_loadings_plot <- nmds_plot + 
  geom_segment(data = sym_otu_NMDS_vectors_sub,
               aes(x = 0, xend = NMDS1, y = 0, yend = NMDS2),
               arrow = arrow(length = unit(0.25, "cm")), colour = "grey") +
  geom_text_repel(data = sym_otu_NMDS_vectors_sub, aes(x = NMDS1, y = NMDS2, label = taxa_name),
            size = 3)
nmds_loadings_plot

If we would like to highlight how the samples in the different categories separate, ellipses could be added to show the distribution of the points for the different groups.

nmds_loadings_plot +
  stat_ellipse(aes(x = NMDS1, y = NMDS2, colour = Compartment))

Another method to visualise the spread of the samples is the convex hull method, which applies a convex polygon encompassing all points within its border.

nmds_loadings_plot +
  stat_chull(aes(x = NMDS1, y = NMDS2, fill = Compartment), 
             alpha = 0.3, 
             geom = "polygon") +
  scale_fill_brewer(palette = "Set1",
                      name = '')

Distance non-preserving methods

These methods aim to recapture data structure locally and/or globally.

Uniform Manifold Approximation and Projection (UMAP)

The umap function takes or calculates a distance matrix, which then it uses to determine a graph of the local structure (based on nearest neighbors) and projects this into two dimensions while trying to preserve global structure. Note: the `n_neighbors` parameter needs to be set according to the number of data points. Try to set it to different values below 15!

?umap
sym_umap_est <- umap(comm_df, n_neighbors=15, input="data", preserve.seed = T)

# extract the points from the returned object
sym_umap_layout_df <- data.frame(sym_umap_est$layout)
sym_umap_layout_df$sample <- rownames(sym_umap_layout_df)
sym_umap_layout_df <- left_join(sym_umap_layout_df, sym_metadata_bsc_df)

## Joining with `by = join_by(sample)`

ggplot(sym_umap_layout_df, aes(x=X1, y=X2, color = Compartment, shape=Community_Type)) +
  geom_point() +
    xlab("Dim 1") +
    ylab("Dim 2") +
  geom_text_repel(aes(label = Day), size = 3) +
  ggtitle("Synthetic community UMAP", subtitle = "Community matrix") +
  labs(shape = "Community type")

Task: Use the previously computed Jaccard-distance matrix as an input to the umap() function. For this, set input="dist". Do the samples show a different clustering, than before?

sym_umap_est <- umap(as.matrix(sym_otu_manhattan_dist), input="dist", preserve.seed = T, n_neighbors=4)
sym_umap_layout_df <- data.frame(sym_umap_est$layout)
sym_umap_layout_df$sample <- rownames(sym_umap_layout_df)
sym_umap_layout_df <- left_join(sym_umap_layout_df, sym_metadata_bsc_df)

## Joining with `by = join_by(sample)`

ggplot(sym_umap_layout_df, aes(x=X1, y=X2, color = Compartment, shape=Community_Type)) +
  geom_point() +
    xlab("Dim 1") +
    ylab("Dim 2") +
  geom_text_repel(aes(label = Day), size = 3) +
  ggtitle("Synthetic community UMAP", subtitle = "Jaccard") +
  labs(shape = "Community type")

## Warning: ggrepel: 11 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

t-SNE

t-SNE is a variation of Stochastic Neighbor Embedding, which uses a neighborhood graph to embed the data points onto low-dimensional planes while ensuring that the distribution of the data is similar to the original. It has a parameter, called perplexity that relates to the number of neighbors in the graph, and controls how the local and global structure information is balanced in the optimization. There is a web-page highlighting the importance of choosing this value and the number of optimization steps.

set.seed(1234)
sym_tsne_obj <- Rtsne::Rtsne(comm_df, perplexity = 2)

sym_tsne_df <- data.frame(DIM1 = sym_tsne_obj$Y[,1], DIM2 = sym_tsne_obj$Y[,2])
sym_tsne_df$sample <- rownames(comm_df)
sym_tsne_df %<>% left_join(sym_metadata_bsc_df)

## Joining with `by = join_by(sample)`

ggplot(sym_tsne_df, aes(DIM1, DIM2)) +
  geom_point(aes(color = Compartment)) +
  geom_text_repel(aes(label=Day), size = 2)

Task: Gradually increase the value of perplexity up to 15! Does the plot change?

Principal Component Analysis (PCA)

PCA takes set of multivariate vectors and identifies the principal components that capture the most variance. These principal components are linearly independent, and arranged in the order of decreasing variance. The variance captured by each component is usually displayed on the axis label. The function prcomp takes a “(centered and possibly scaled) data matrix”, which in our case is the relative abundance matrix. Note: CLR transformed data is mathematically not suitable for PCA.

# compute the PCA object
?prcomp
sym_rda <- prcomp(relAb_df)

# the center list contains the loading vectors
# we can change their names in the rda object to the previously constructed unique labels
loadings_annot <- left_join(data.frame(tax_ID = names(sym_rda$center)), taxonomy_labels_df)

## Joining with `by = join_by(tax_ID)`

dimnames(sym_rda$rotation)[[1]] <- loadings_annot$taxa_name
names(sym_rda$center) <- loadings_annot$taxa_name

The factoextra package has functions for visualising a prcomp returned object.

# plot the top contributing variables only
fviz_pca_var(sym_rda,
             col.var = "black", # color of the arrows
             select.var = list(name = NULL, cos2 = NULL, contrib = 5), # show only first 5 variables with highest contribution
             label = "var", repel = T) # add labels for variables

# plot the points on the first two principal components
fviz_pca_ind(sym_rda,
             label="none", # no labels
             habillage=sym_metadata_bsc_df$Community_Type, # data groups to color
             #pointsize = 1,
             addEllipses=TRUE, ellipse.level=0.95, # add ellipses
             ellipse.type="t", # "convex" or multivariate distribution
             mean.point = FALSE, # do not put a slightly larger points as the group average
             palette = "Dark2") # use ColorBrewer palette

We can also combine both the loadings and the points on a biplot.

# plot the whole biplot
fviz_pca_biplot(sym_rda,
                repel = T,
                select.var = list(name = NULL, cos2 = NULL, contrib = 3), # first 3 loading vectors
                label = "var",
                habillage=sym_metadata_bsc_df$Compartment, mean.point = FALSE)

Task: Isometric log-ratio transformation is appropriate for use in PCA, however the transformed data vectors do not correspond to the original ones. Thus, interpretation is challenging. Compute the ILR of abundances in z_comm_matrix, perform a PCA on it, and plot the first two components on a biplot, samples colored by the Compartment!

sym_ilr_rda <- prcomp(compositions::ilr(z_comm_matrix))

fviz_pca_biplot(sym_ilr_rda,
                select.var = list(name = NULL, cos2 = NULL, contrib = 3), label = "var",
                habillage=sym_metadata_bsc_df$Compartment, mean.point = FALSE)

Task: An expanded dataset from the SHIME experiment is loaded below into RStudio. The column Setup contains the joint type for the fermentation medium and colon compartment. Create a heatmap, and a figure with a dimensionality reduction method to demonstrate that the samples do not cluster by Community_Type, but do by Medium.

sym_otu_shime_df <- read.delim("otu_table_shime.tsv", sep = "\t")
sym_metadata_shime_df <- read.delim("metadata_shime.tsv", sep = "\t")