library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(magrittr)
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
library(readr)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.3.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.3
library(phyloseq)
library(umap)
## Warning: package 'umap' was built under R version 4.3.2
library(Rtsne)
## Warning: package 'Rtsne' was built under R version 4.3.3
library(vegan)
## Warning: package 'vegan' was built under R version 4.3.3
## Loading required package: permute
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.3.3
## This is vegan 2.6-6.1
library(microbiome)
##
## microbiome R package (microbiome.github.com)
##
##
##
## Copyright (C) 2011-2022 Leo Lahti,
## Sudarshan Shetty et al. <microbiome.github.io>
##
##
## Attaching package: 'microbiome'
##
## The following object is masked from 'package:vegan':
##
## diversity
##
## The following object is masked from 'package:ggplot2':
##
## alpha
##
## The following object is masked from 'package:base':
##
## transform
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.3.3
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggheatmapper)
## Loading required package: patchwork
## Warning: package 'patchwork' was built under R version 4.3.3
set.seed(1234)
In the following exercises, we are going to use 16s amplicon sequencing data from Gnanasekaran et al. (2023), wherein they inoculated two types of media in automated anaerobic multi-stage in vitro gut fermentors with synthetic human gut microbial community in two quantitative compositions (Equal and Fecal). First, we are only going to look at the samples from the BSC medium. Let’s parse the tables containing the amplicon-sequencing variants’ abundance data (OTU table) and the metadata.
sym_otu_bsc_df <- read.delim("otu_table_bsc.tsv", sep = "\t")
sym_metadata_bsc_df <- read.delim("metadata_bsc.tsv", sep = "\t")
Thereafter, we need to do some data wrangling, to gain data objects that contain only numerical values. We also calculate relative abundances from the count data, while we are there.
# separate the abundance count part from the taxonomy part in otu_table
sym_abund_df <- sym_otu_bsc_df %>% select(starts_with("Final."))
sym_tax_df <- sym_otu_bsc_df %>% select(-starts_with("Final."))
# calculate relative abundance
calc_relab <- function(x) x / sum(x)
sym_relab_df <- sym_abund_df %>% mutate(across(colnames(sym_abund_df), ~ calc_relab(.x)))
# add tax_ID as rownames to the new data frames
rownames(sym_abund_df) <- sym_otu_bsc_df$tax_ID
rownames(sym_relab_df) <- sym_otu_bsc_df$tax_ID
# transpose the data frame, resulting in a matrix compatible with the vegan package functions
comm_df <- t(sym_abund_df)
relAb_df <- t(sym_relab_df)
The vegan
R package contains widely used functions for
community ecology.
Amplicon and shotgun sequencing data are compositional, and to
transform them to the Euclidean space, log-ratio transformations are
used. Abundance data is also frequently sparse, i.e. there are taxa and
sample combinations, where the observed abundance is zero. Naturally,
these values need to be handled in some manner, before their log-ratios
can be computed. They could be replaced with a small value, for example
by taking the half of the smallest value in the table, or with a value
sampled from an appropriate distribution. The latter is what
cmultRepl
function does, while also removing taxa that are
zero in more than 80% of the samples. This is going to leave us with
only approx. half of the taxa in our original dataset, highlighting how
sparse this data is. (Note: this function also modifies the non-zero
values to a small degree, but their ratios should be preserved.)
# fill zeros using Square root (SQ) Bayesian Replacement method
# samples - attribute
z_comm_df <- zCompositions::cmultRepl(relAb_df, output = "prop", method = "SQ", z.warning=0.8 )
## Warning in zCompositions::cmultRepl(relAb_df, output = "prop", method = "SQ", : Column no. 14 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 22 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 25 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 29 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 36 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 38 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 39 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 42 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 45 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 49 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 55 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 56 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 57 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 58 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 59 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 62 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 63 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 64 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 65 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 66 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 67 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 68 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 69 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 70 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 71 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 72 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 73 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 74 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 75 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 76 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 77 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 78 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 79 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 80 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 81 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 82 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 83 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 84 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 85 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 86 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 87 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 88 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 89 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 90 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 91 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 92 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 93 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## Column no. 94 containing >80% zeros/unobserved values deleted (see arguments z.warning and z.delete).
## No. adjusted imputations: 486
z_comm_matrix <- as.matrix(z_comm_df)
The centered log-ratio is calculated for each element of a (sample) vector by taking the log of the value divided by the geometric mean of the elements in the vector. The resulting data is centered and scaled, but the variables are not independent, which limits the type of analyses one can perform on CLR values.
# centered log-ratio transformation
?clr
## No documentation for 'clr' in specified packages and libraries:
## you could try '??clr'
zclr_comm_obj <- compositions::clr(z_comm_matrix)
Heatmaps are useful to visualise numerical data on a grid, typically
containing observations (samples) on one axis and attributes on the
other. The stats
package has a heatmap function. It also
draws a dendogram based on complete hierarchical clustering on the
Euclidean distance of the sample vectors.
heatmap(zclr_comm_obj, Colv = NA)
Let’s turn to ggplot2
to plot a more customised heatmap.
First, we need to convert the rmult
object into a data
frame, convert row names into a column, and join the metadata table via
the shared sample
column!
zclr_comm_df <- as.data.frame(zclr_comm_obj)
zclr_comm_metadata_df <- zclr_comm_df %>%
rownames_to_column("sample") %>%
left_join(sym_metadata_bsc_df, by = "sample")
Unfortunately, the heatmap function in ggplot2
does not
have an option for clustering, thus if we would like to have the labels
on the axes in some order, we need to set it manually by using factors.
Here, we first order the samples by SHIME compartment and experiment
day, then use that in the factor definition.
xaxis_order <- {zclr_comm_metadata_df %>% arrange(Compartment_unit, Day)}$Sample_ID
?factor
## starting httpd help server ... done
zclr_comm_metadata_df$sampleid_f <- factor(zclr_comm_metadata_df$Sample_ID, levels = xaxis_order)
Thereafter, the data frame needs to be converted into a long format
with pivot_longer
: the columns containing the CLR values
(selected with starts_with("Seq")
) are melted into two
columns: one containing the taxon identifier (from the column name), and
the other the CLR value. All other columns remain.
zclr_comm_metadata_long <- pivot_longer(zclr_comm_metadata_df, cols = starts_with("Seq"), names_to = "tax_ID", values_to = "clr_v")
zclr_comm_metadata_long
Next, we are going to construct labels for the y-axis. In order to have unique names, we are going to concatenate the genus to the taxon identifier.
# replace NA values with empty strings
taxonomy_labels_df <- sym_tax_df %>% mutate(Genus=ifelse(is.na(Genus), "", Genus))
# concatenate genus and taxon identifier
taxonomy_labels_df %<>% mutate(taxa_name=paste(Genus, tax_ID, sep=" "))
# join to the long data frame via "tax_ID"
zclr_comm_metadata_long <- left_join(zclr_comm_metadata_long, taxonomy_labels_df)
## Joining with `by = join_by(tax_ID)`
zclr_comm_metadata_long
After these preparations, we can finally call ggplot2
on
the long data frame. The function for adding the heatmap geometry is
geom_tile
, and (surprisingly) we can pass numerical values
to the fill option. With this palette, the brighter the colors, the
higher the abundance.
ggplot(zclr_comm_metadata_long) +
geom_tile(aes(sampleid_f, taxa_name, fill = clr_v)) +
scale_fill_viridis_c(name = "CLR abundance", option = "A") + # use the magma palette from viridis
scale_y_discrete(limits = rev) + # reverse the y-axis
theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) + # rotate the x-axis labels
xlab("Sample") +
ylab("ASV")
As noted before, there is no native option in ggplot2
to
add dendograms to the heatmap. However, another package, called
ggheatmapper
, offers this, and many other features. The
warmer tones are above average in abundance, and the cooler ones are
below average.
# generate a vector of the column names we are going to use on the x-axis
otus <- colnames(zclr_comm_df)
# group the data frame rows by SHIME compartment
zclr_comm_metadata_compartmentGrouped <- zclr_comm_metadata_df %>% group_by(Compartment)
p <- ggheatmap(zclr_comm_metadata_compartmentGrouped,
colv = "Sample_ID", # column to use for y-axis
rowv = otus, # column list to use in x-axis
hm_colors = 'RdYlBu', # heatmap palette, from ColorBrewer
show_dend_row = FALSE, # don't show ASV clustering
show_colnames = TRUE,
show_rownames = FALSE,
group_colors = c(`PC` = "#c51b8a", `DC` = "#7d5ba6"),
group_prop = 0.02,
colors_title = "CLR abundance")
## Adding missing grouping variables: `Compartment`
show(p)
We can build onto the previous plot object (p
), for
example adding tracks showing categorical data with
add_tracks
.
p2 <- add_tracks(p,
track_columns = c("Day"),
track_colors = list(Day = 'Greys'),
track_prop = 0.05)
## Adding missing grouping variables: `Compartment`
p2 +
plot_layout(guides = 'collect') # collar the guides
Task: Add another track, showing the
Community_Type
, to the existing plot. Call the new plot
p2
! Choose an appropriate palette from ColorBrewer.
p2 <- add_tracks(p2,
track_columns = c("Community_Type"),
track_colors = list(Community_Type = 'RdPu'),
track_prop = 0.05)
## Adding missing grouping variables: `Compartment`
p2 +
plot_layout(guides = 'collect')
The ggheatmapper
package also makes it possible to
extract the data from a clustered heatmap, and attach a diagram
displaying numerical data for the same x-axis via the
observations
column. Here, we just select three random ASVs
and visualise their CLR values over the x-axis.
table_p2 <- get_data(p2) # extract data
table_p2_long <- table_p2 %>%
ungroup() %>% #ungroup data frame
select(observations, Compartment, Seq2, Seq4, Seq11) %>% #select relevant columns
pivot_longer(cols = c(-observations, -Compartment), names_to = "taxa_ID", values_to = "clr_v") # melt abundance columns into long format, keeping observations and Compartment columns
plt_abund <- ggplot(table_p2_long, aes(observations, clr_v, color = taxa_ID, group = taxa_ID)) + # use ggplot to plot diagram
geom_line() +
facet_wrap(vars(Compartment), scales = "free_x") +
scale_y_continuous(position = "right") +
guides(color = FALSE) +
ylab("Abundance") +
theme_sparse()
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plt_abund
The additional plots sharing an axis with the heatmap can be aligned
to the central heatmap with align_to_hm
:
gghm_complete <- p2 %>%
align_to_hm(plt_abund, newplt_size_prop = 0.3)
gghm_complete +
plot_layout(guides = 'collect')
Omics data is multi-dimensional and multivariate analysis methods are required to interrogate it. In order to visualise or cluster the data points, we need to reduce the number of variables, while preserving the structure and characteristics of the data. Various methods exist for reducing dimensionality, and they can be divided into two categories, based on whether they faithfully preserve the distances between the input vectors or not.
Note: These methods are just visualization techniques. They do not assess statistically separation or correlation. For that, statistical test need to be performed.
Typically, these methods either require a distance or dissimilarity
matrix as input, or they calculate it themselves. For count or abundance
data, the vegdist
function from the vegan
package can calculate distances using several methods used in ecology.
The function takes a data frame with samples in rows and taxa in
columns.
# calculate dissimilarity of the samples using Bray-Curtis distance
sym_otu_bray_dist <- vegdist(comm_df,method="bray", binary=FALSE)
# also calculate for a subset of OTUs
sym_otu_bray_s20_dist <- vegdist(comm_df[,1:20],method="bray", binary=FALSE)
This method projects the dissimilarity matrix into a
lower-dimensional space, while preserving the relative distances between
the samples. cmdscale
is part of the stats
module in R, and the resulting matrix converted into a data frame can be
visualised on a scatterplot with ggplot2
.
Change the coloring variable to Compartment
. Now, do
the groups separate?
# perform the MDS on the sym_otu_bray_dist matrix
?cmdscale
sym_otu_bray_MDS <- data.frame(cmdscale(sym_otu_bray_dist), k = 2)
# join the metadata to the resulting data frame
sym_otu_bray_MDS$sample <- rownames(sym_otu_bray_MDS)
sym_otu_bray_MDS <- left_join(sym_otu_bray_MDS, sym_metadata_bsc_df, by = "sample")
ggplot(sym_otu_bray_MDS, aes(X1, X2)) +
geom_point(aes(color=Community_Type)) +
ggtitle("Synthetic community MDS", subtitle = "Bray-Curtis") +
xlab("Dim 1") +
ylab("Dim 2") +
geom_text_repel(aes(label = Day), size = 3.0) # place non-overlapping text labels
## Warning: ggrepel: 24 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Task: Use the jaccard
method with
binary=TRUE
in vegdist()
to compute the
Jaccard-distance between the samples, and plot the PCoA of the resulting
distance matrix. Color the points based on their community type and
shape them for the SHIME compartment. Add labels from the
Day
column to the plot with
geom_text_repel()
!
sym_otu_manhattan_dist <- vegdist(comm_df,method="jaccard", binary=T)
sym_otu_manhattan_MDS <- data.frame(cmdscale(sym_otu_manhattan_dist))
sym_otu_manhattan_MDS$sample <- rownames(sym_otu_manhattan_MDS)
sym_otu_manhattan_MDS <- left_join(sym_otu_manhattan_MDS, sym_metadata_bsc_df, by = "sample")
ggplot(sym_otu_manhattan_MDS, aes(X1, X2)) +
geom_point(aes(shape=Compartment, color = Community_Type)) +
ggtitle("Synthetic community MDS", subtitle = "Jaccard") +
xlab("Dim 1") +
ylab("Dim 2") +
geom_text_repel(aes(label = Day), size = 3.0)
NMDS generally takes any data-suitable distance matrix, and rank
orders the distances. This makes it a non-parametric method, suitable
for non-normal distributions. Here, the metaMDS
function
from the vegan
package accepts community matrix as input,
and calculates the Bray-Curtis distance from it. Thereafter,
envfit
is used to fit loading vectors onto the
sym_otu_NMDS
ordination object. Loading vectors show how
samples along that vector have increased values of that particular
attribute. In this case, the longer the vector, the stronger the
association.
?metaMDS
sym_otu_NMDS <- metaMDS(comm_df[,1:20], distance="bray", engine = "monoMDS", trace=FALSE, trymax=100, tidy = T)
sym_otu_NMDS_fit <- envfit(sym_otu_NMDS, comm_df[, 1:20])
In order to visualise the ordination with ggplot2
,
coordinate vectors for the points (called sites) and loadings (vectors)
need to be extracted from the objects.
sym_otu_NMDS_scores <- data.frame(scores(sym_otu_NMDS, display = "sites"), Compartment = as.factor(sym_metadata_bsc_df$Compartment)) #also add the column with the compartment information
sym_otu_NMDS_vectors <- as.data.frame(scores(sym_otu_NMDS_fit, display = "vectors"))
# add tax_ID column to the loadings data frame
sym_otu_NMDS_vectors <- cbind(sym_otu_NMDS_vectors, tax_ID = rownames(sym_otu_NMDS_vectors))
# construc the base scatterplot
nmds_plot <- ggplot(sym_otu_NMDS_scores) + # points data frame is the base
geom_point(aes(x = NMDS1, y = NMDS2, colour = Compartment)) +
coord_fixed()
# add the loadings
?geom_segment
nmds_plot +
geom_segment(data = sym_otu_NMDS_vectors, # loadings are layered on top
aes(x = 0, xend = NMDS1, y = 0, yend = NMDS2),
arrow = arrow(length = unit(0.25, "cm")), colour = "grey") +
geom_text_repel(data = sym_otu_NMDS_vectors, # loadings are labeled
aes(x = NMDS1, y = NMDS2, label = tax_ID),
size = 3)
Task: Perform a left join on the
sym_otu_NMDS_vectors
data frame, to add the taxonomical
information from taxonomy_labels_df
to it. Re-plot the
biplot above, label with the unique labels of the OTUs.
sym_otu_NMDS_vectors <- left_join(sym_otu_NMDS_vectors, taxonomy_labels_df, by = c("tax_ID"))
nmds_plot + ## need aspect ratio of 1!
geom_segment(data = sym_otu_NMDS_vectors,
aes(x = 0, xend = NMDS1, y = 0, yend = NMDS2),
arrow = arrow(length = unit(0.25, "cm")), colour = "grey") +
geom_text_repel(data = sym_otu_NMDS_vectors, aes(x = NMDS1, y = NMDS2, label = taxa_name),
size = 3)
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
If we are only interested in the top loadings, the data frame containing them could be filtered to only include the longest ones.
# calculate the length of the loading vectors and rank them
sym_otu_NMDS_vectors_sub <- sym_otu_NMDS_vectors %>%
mutate(eucdist=(NMDS1**2+NMDS2**2)**0.5) %>%
mutate(assoc_rank = row_number(desc(eucdist))) %>%
filter(assoc_rank < 11)
nmds_loadings_plot <- nmds_plot +
geom_segment(data = sym_otu_NMDS_vectors_sub,
aes(x = 0, xend = NMDS1, y = 0, yend = NMDS2),
arrow = arrow(length = unit(0.25, "cm")), colour = "grey") +
geom_text_repel(data = sym_otu_NMDS_vectors_sub, aes(x = NMDS1, y = NMDS2, label = taxa_name),
size = 3)
nmds_loadings_plot
If we would like to highlight how the samples in the different categories separate, ellipses could be added to show the distribution of the points for the different groups.
nmds_loadings_plot +
stat_ellipse(aes(x = NMDS1, y = NMDS2, colour = Compartment))
Another method to visualise the spread of the samples is the convex hull method, which applies a convex polygon encompassing all points within its border.
nmds_loadings_plot +
stat_chull(aes(x = NMDS1, y = NMDS2, fill = Compartment),
alpha = 0.3,
geom = "polygon") +
scale_fill_brewer(palette = "Set1",
name = '')
These methods aim to recapture data structure locally and/or globally.
The umap
function takes or calculates a distance matrix,
which then it uses to determine a graph of the local structure (based on
nearest neighbors) and projects this into two dimensions while trying to
preserve global structure. Note: the `n_neighbors` parameter needs to be
set according to the number of data points. Try to set it to different
values below 15!
?umap
sym_umap_est <- umap(comm_df, n_neighbors=15, input="data", preserve.seed = T)
# extract the points from the returned object
sym_umap_layout_df <- data.frame(sym_umap_est$layout)
sym_umap_layout_df$sample <- rownames(sym_umap_layout_df)
sym_umap_layout_df <- left_join(sym_umap_layout_df, sym_metadata_bsc_df)
## Joining with `by = join_by(sample)`
ggplot(sym_umap_layout_df, aes(x=X1, y=X2, color = Compartment, shape=Community_Type)) +
geom_point() +
xlab("Dim 1") +
ylab("Dim 2") +
geom_text_repel(aes(label = Day), size = 3) +
ggtitle("Synthetic community UMAP", subtitle = "Community matrix") +
labs(shape = "Community type")
Task: Use the previously computed Jaccard-distance
matrix as an input to the umap()
function. For this, set
input="dist"
. Do the samples show a different clustering,
than before?
sym_umap_est <- umap(as.matrix(sym_otu_manhattan_dist), input="dist", preserve.seed = T, n_neighbors=4)
sym_umap_layout_df <- data.frame(sym_umap_est$layout)
sym_umap_layout_df$sample <- rownames(sym_umap_layout_df)
sym_umap_layout_df <- left_join(sym_umap_layout_df, sym_metadata_bsc_df)
## Joining with `by = join_by(sample)`
ggplot(sym_umap_layout_df, aes(x=X1, y=X2, color = Compartment, shape=Community_Type)) +
geom_point() +
xlab("Dim 1") +
ylab("Dim 2") +
geom_text_repel(aes(label = Day), size = 3) +
ggtitle("Synthetic community UMAP", subtitle = "Jaccard") +
labs(shape = "Community type")
## Warning: ggrepel: 11 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
t-SNE is a variation of Stochastic Neighbor Embedding, which uses a
neighborhood graph to embed the data points onto low-dimensional planes
while ensuring that the distribution of the data is similar to the
original. It has a parameter, called perplexity
that
relates to the number of neighbors in the graph, and controls how the
local and global structure information is balanced in the optimization.
There is a web-page
highlighting the importance of choosing this value and the number of
optimization steps.
set.seed(1234)
sym_tsne_obj <- Rtsne::Rtsne(comm_df, perplexity = 2)
sym_tsne_df <- data.frame(DIM1 = sym_tsne_obj$Y[,1], DIM2 = sym_tsne_obj$Y[,2])
sym_tsne_df$sample <- rownames(comm_df)
sym_tsne_df %<>% left_join(sym_metadata_bsc_df)
## Joining with `by = join_by(sample)`
ggplot(sym_tsne_df, aes(DIM1, DIM2)) +
geom_point(aes(color = Compartment)) +
geom_text_repel(aes(label=Day), size = 2)
Task: Gradually increase the value of
perplexity
up to 15! Does the plot change?
PCA takes set of multivariate vectors and identifies the principal
components that capture the most variance. These principal components
are linearly independent, and arranged in the order of decreasing
variance. The variance captured by each component is usually displayed
on the axis label. The function prcomp
takes a “(centered
and possibly scaled) data matrix”, which in our case is the relative
abundance matrix. Note: CLR transformed data is mathematically not
suitable for PCA.
# compute the PCA object
?prcomp
sym_rda <- prcomp(relAb_df)
# the center list contains the loading vectors
# we can change their names in the rda object to the previously constructed unique labels
loadings_annot <- left_join(data.frame(tax_ID = names(sym_rda$center)), taxonomy_labels_df)
## Joining with `by = join_by(tax_ID)`
dimnames(sym_rda$rotation)[[1]] <- loadings_annot$taxa_name
names(sym_rda$center) <- loadings_annot$taxa_name
The factoextra
package has functions for visualising a
prcomp
returned object.
# plot the top contributing variables only
fviz_pca_var(sym_rda,
col.var = "black", # color of the arrows
select.var = list(name = NULL, cos2 = NULL, contrib = 5), # show only first 5 variables with highest contribution
label = "var", repel = T) # add labels for variables
# plot the points on the first two principal components
fviz_pca_ind(sym_rda,
label="none", # no labels
habillage=sym_metadata_bsc_df$Community_Type, # data groups to color
#pointsize = 1,
addEllipses=TRUE, ellipse.level=0.95, # add ellipses
ellipse.type="t", # "convex" or multivariate distribution
mean.point = FALSE, # do not put a slightly larger points as the group average
palette = "Dark2") # use ColorBrewer palette
We can also combine both the loadings and the points on a biplot.
# plot the whole biplot
fviz_pca_biplot(sym_rda,
repel = T,
select.var = list(name = NULL, cos2 = NULL, contrib = 3), # first 3 loading vectors
label = "var",
habillage=sym_metadata_bsc_df$Compartment, mean.point = FALSE)
Task: Isometric log-ratio transformation is
appropriate for use in PCA, however the transformed data vectors do not
correspond to the original ones. Thus, interpretation is challenging.
Compute the ILR of abundances in z_comm_matrix
, perform a
PCA on it, and plot the first two components on a biplot, samples
colored by the Compartment!
sym_ilr_rda <- prcomp(compositions::ilr(z_comm_matrix))
fviz_pca_biplot(sym_ilr_rda,
select.var = list(name = NULL, cos2 = NULL, contrib = 3), label = "var",
habillage=sym_metadata_bsc_df$Compartment, mean.point = FALSE)
Task: An expanded dataset from the SHIME experiment
is loaded below into RStudio. The column Setup
contains the
joint type for the fermentation medium and colon compartment. Create a
heatmap, and a figure with a dimensionality reduction method to
demonstrate that the samples do not cluster by
Community_Type
, but do by Medium
.
sym_otu_shime_df <- read.delim("otu_table_shime.tsv", sep = "\t")
sym_metadata_shime_df <- read.delim("metadata_shime.tsv", sep = "\t")