2 Packages

2.1 Introduction

Recent advances in ‘omics’ technologies have revolutionized biological research providing vast amounts of heterogeneous data and have facilitated the study of gene expression, proteins, metagenomics and metabolites, and the evaluation of their relationship with health and diseases status or any target trait. However, the integration of this heterogeneous information is not trivial and several statistical methods have been developed to deal with this problem (Mariette and Villa-Vialaneix 2017; Rohart et al. 2017; Argelaguet et al. 2018).

STATIS-ACT (Plantes 1976 ; Escoufier, Cazes, and others 1976) suitable for analyzing multiple data and comparing multiple subspaces. It is nothing more than a PCA extension to analyze multiple datasets of variables collected from the same individual or the same variables collected from multiple individuals (Lavit et al. 1994). The original method has some constraints, since can only be used with continuous data and the subspace generated (i.e. the compromise) can only set relationships between common elements in all datasets, i. e. observations (samples) or variables (genes/transcripts/proteins/metabolites/microbial communities). With the original method, it is not possible to establish relationships between observations and variables, nor to select variables.

Here, we present Link-HD, an R package to integrate multiple heterogeneous datasets based on STATIS. Our approach is a generalization of multidimensional scaling, which uses distance matrices for numerical, categorical or compositional variables. It was enhanced with Regression Biplot (Gabriel 1971; Gower and Hand 1995) and differential abundance testing to perform variable selection and to identify OTUs that differ between two or more categories. We also provide a “taxa set enrichment analysis” to identify enriched genera/families based on selected variables.

2.2 Methods

Our package was designed to integrate multiple source of “omics” data from a holistic view. However, it was specially tuned for integrate multiple microbial communities and can deal with compositional data: including centered log ratio and cumulative-sum scaling (CSS) (Paulson et al. 2013) transformations. Link-HD also incorporates variable selection and visualization tools that provide a better insight into the structure of the communities and their relationship with the target traits.

Link-HD pipeline overview. First, raw data are transformed using cumulative sum scaling (CSS) (Paulson et al. 2013) or centered log ratio transformation (clr) (Aitchison 1982). Second, distance and cross -product matrices are obtained to assess the similarity between datasets. Third, the compromise uses the first eigenvectors from RV to build the common configuration W matrix. The intra-structure step is the eigen-decomposition of W, where each observation is projected onto the common configuration. In this step, clusters of observations can be obtained. Finally, relationships between clusters and targets can identify selected variables responsible for the attained structure. (B) Link-HD includes visualization tools to explore relationships between data. In this example from the rumen microbiota it is shown: cross-product matrices relationships (upper left panel), ‘ruminotypes’ from the consensus structure (upper right panel) and the abundance between variables responsible of the obtained structure (lower panel).

2.2.1 STATIS: a general overview

STATIS-ACT (Plantes 1976; Escoufier, Cazes, and others 1976) aims is to compare and analyze the relationships between data that share a common collection of observations, but each may have different variables. The data consists of \(k\) matrices \(X_{[k]}\) of \(n \times j_{[k]}\) , where \(n\) is the number of observations and \(j_{[k]}\) the number of variables measured in \(k-th\) matrix. The idea behind STATIS is to obtain a compromise matrix, which is a linear combination of the matrices from the cross-prod tables (i.e.\(W_k=X_{[k]} \times X_{[k]}^T\)) . The analysis of this matrix gives a two/three -dimensional representation (principal axes maps) that can be used to interpret its structure. The methodology is implemented in three main phases [2]:

Inter-structure: It measures the similarity between the \(W_k\) configurations of the matrices. This phase is equivalent to defining a distance among the corresponding scalar product matrices.
Compromise: It generates an optimal weight coefficient for each \(k\) cross-prod matrix (\(W_k\)), in order to construct a common structure for all configurations by solving an optimization problem.
Intra-structure: It is a generalized PCA analysis of the structure obtained in the previous stage.

STATIS overview. This figure shows the three steps of the STATIS: Inter-structure, Compromise and Intra-structure. In the inter-structure, the similarity dataset structure is analyzed and each \(W_{[k]}\) matrix is then projected onto the first and second components of the \(Q\) matrix. The compromise uses the first eigenvectors from \(Q\) to build the common configuration \(W\) matrix, which is nothing more than a weighted sum of all \(W_{[k]}\). Finally, the intra-structure step is the eigen-decomposition of \(W\) and each observation is projected onto the first and second components of the \(A\) matrix.

The mathematics behind the methodology can be described as follows. The first task requires computations of \(k\) configurations of symmetrical matrices \(W_{[k]}\) for each \(X_{[k]}\) matrix of dimension \(n \times j_{[k]}\) according to equation ((1)):

\[\begin{equation} \tag{1} W_{[k]}=X_{[k]}M_{[k]}X^\top_{[k]} \end{equation}\]

where \(W_{[k]}\) is a square matrix of size \(n\); \(M_{[k]}\) is a diagonal matrix including the variable weights, which is generally the same matrix for all the \(k\) matrices \(M=M_{[k]}=\frac{1}{j_{[k]}}I_{j_{[k]}}\) where \(I_{j_{[k]}}\) is the identity matrix of size \(j_{[k]}\); and \(X^\top_{[k]}\) denotes the transpose of \(X_{[k]}\) matrix. Then, the similarity between the two configurations \(k\) and \(k^\prime\) can be established computing the Hilbert-Smith product of equation (\(<.>_{HS}\), (Robert and Escoufier 1976)) presented in equation (2) or in its normalized version \(\rho_{k,k^\prime}\).

\[\begin{eqnarray} \tag{2} \langle W_{[k]} \arrowvert W_{[k^\prime]}\rangle_{HS} = \left\{ \begin{array}{cl} tr(DW_{[k]}DW_{[k^\prime]}) & if~k \neq k^\prime\\ ||W_{[k]}||^2 & if~k = k^\prime \end{array} \right.\\ \tag{3} \rho_{k,k^\prime}=\frac{\langle W_{[k]} \arrowvert W_{[k^\prime]} \rangle_{HS}}{\sqrt{\langle W_{[k]} \arrowvert W_{[k]} \rangle_{HS}\langle W_{[k^\prime]} \arrowvert W_{[k^\prime]} \rangle_{HS}}} \end{eqnarray}\]

\(tr(.)\) is the trace of the argument matrix; \(D\) is the square weighted observation matrix of size \(n\), which in general is \(D=\frac{1}{n}I_n\); and \(||.||^2\) is the classic Euclidean norm-2 of the argument matrix. In this context, equation (2) can be geometrically interpreted as the scalar product between two positive semi-definite matrices, which is proportional to the cosine of the angle between them.

The inter-structure’s objective is to decide whether or not there is a common structure to all representative \(W_{[k]}\)’s objects. The scalar products \(\rho_{k,k^\prime}\) can be organized in a \(k \times k\) positive semi-definite matrix \(R\) and their spectral decomposition obtains the similarity analysis among the configurations as a generalized correlation coefficient.

However, the formulation from the (1) has some restrictions, i.e., it only can be computed when all observations are continuous. This constraint can be averted by replacing the scalar product (1) between observations by a generalized distance cross-product matrix. In Link-HD, we implemented euclidean, canberra, pearson, pearsonabs, spearman, spearmanabs, mahalanobis, jaccard, simple matching and Aitchison distances. Then, each \(W_k\) is obtained as follows:

\[\begin{eqnarray} \tag{4} W_{k}=\frac{1}{2} H_k D^2 H'_k \\ \tag{5} H=I-\frac{1}{n} 11' \end{eqnarray}\]

whre \(D^2\) in (4) is a distance matrix computed from the raw data (\(X\)) and \(W_k\) is the cross-product transformation of \(D\).

The compromise, \(W\), is the matrix that summarize all of \(W_{k}\) can be obtained by a weighted mean of the \(W_{[k]}\) configurations (Lavit 1988):

\[\begin{eqnarray} \tag{6} W=\sum_{k=1}^{K}\alpha_{[k]}W_{[k]}\\ \tag{7} \alpha=(1^{T}q_{1})^{-1} \times q_{1} \end{eqnarray}\]

where, \(\alpha\), is the vector with optimal weights, which is obtained by solving an optimization problem consisting in find \(\alpha\)’s that minimize the distance between each \(W_k\) and the common configuration \(W\) (\(D=\sum_{i=1}^k \lVert W_i-\alpha_iW\rVert\))restricted to \(\alpha'\alpha=1\). The solution is quite easy and is given by the first eigenvector of the similarity matrix between matrices (\(R\)), being \(\alpha_{[k]}\), the weight coefficient for the \(k\) configuration.

As the compromise is a cross-product matrix, it can be eigen-decomposed. Thus, it is possible to obtain a graphical representation from their Principal Components Analysis (PCA) where the scores are the best common representation for the set of \(K\) matrices. This step is called the intra-structure. Moreover, the scores can be used to carry out a cluster analysis.

2.3 Selecting variables

2.3.1 Regression Biplot approach

Classical STATIS is neither designed to establish relationships between observations and variables, nor to perform variable selection. However, the general regression biplot formulation from Gower and Hand (1995) averts these limitations.

\[\begin{equation} X\approx AB^\top+\varepsilon \tag{8} \end{equation}\]

The equation (8) can be understood as a multivariate regression of \(X\) on the coordinates of rows \(A\), when they are fixed, or a multivariate regression of \(X^T\) on the coordinates of the variables \(B\), if they are fixed. Then, we could easily define a regression biplot through a classical linear model, i. e., \(E(X)=AB^\top\) if the response has a normal distribution, or using a generalized linear model \(E(X)=g(\mu)=AB^\top\) when the response variable has an exponential family distribution and the link function used is \(g(.)\) (Greenacre 2010). Here, we used the ideas from (8) to project all variables into the common subspace obtained by the interstructure step. Variables are selected using the accuracy of the regression biplot, which is measured as the proportion of explained variance by each regression (adjusted r squared from the model).

2.3.2 Differential abundance approach

Link-HD also implements a differential abundance analysis which is computed using a rank-based nonparametric Kruskal-Wallis test (Kruskal and Wallis 1952). Variables are selected after applying a procedure that controls the false discovery rate (FDR) (Benjamini and Hochberg 1995).

2.4 Usage

2.5 Software Installation

You can install LinkHD from Bioconductor

Congrats! The software is ready to use! To find the help for each function, just execute:

 ?? LinKData()

2.6 Software generalities

The software requires a list of data frames as input that include the common elements in rows. The Read_Data function facilitates the proccess of reading data. This function has only one argument,i.e. the path to the parent folder which contains all the data to be used in the analysis.

Otherwise, you can read all your dataset as usual (using read.table, read.csv or any read function) and store them into a list.

Warning! Please remember to name the list with significant names for the experiment. Please make sure all data has the same row_names, since the main LinkHD function checks row names before starting the analysis and it stops if they don’t match.

2.7 Examples

We illustrate Link-HD using a public data from the TARA Ocean expedition (https://oceans.taraexpeditions.org/en/m/about-tara/les-expeditions/tara-oceans/) and the ruminal metataxonomic communities (including bacteria, archaea and protozoa data).

2.8 Results

We have integrated environmental, structural and funcional information from prokaryotic communities for the TARA Ocean data. In concordance with the original publication and results obtained with mixKernel (Mariette and Villa-Vialaneix 2017), Link-HD reproduces the relevant role of temperature of the Proteobacteria phyla on the structure of this ecosystem. The next lines show how to perform Tara Ocean analysis:

library(LinkHD)

data("Taraoceans")
data(Taraoceans)
pro.phylo <- Taraoceans$taxonomy[ ,"Phylum"]
TaraOc<-list(Taraoceans$phychem,as.data.frame(Taraoceans$pro.phylo),as.data.frame(Taraoceans$pro.NOGs))
TaraOc_1<-scale(TaraOc[[1]])
Normalization<-lapply(list(TaraOc[[2]],TaraOc[[3]]),function(x){DataProcessing(x,Method="Compositional")})
colnames(Normalization[[1]])=pro.phylo
colnames(Normalization[[2]])=Taraoceans$GO
TaraOc<-list(TaraOc_1,Normalization[[1]],Normalization[[2]])
names(TaraOc)<-c("phychem","pro_phylo","pro_NOGs")
TaraOc<-lapply(TaraOc,as.data.frame)
TaraOc<-lapply(TaraOc,as.data.frame)
Tara_Out<-LinkData(TaraOc,Scale =FALSE,Center=FALSE,Distance = c("ScalarProduct","Euclidean","Euclidean"))

To obtain the compromise plot, execute:

CompromisePlot(Tara_Out)+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
                                                               panel.background = element_blank(), axis.line = element_line(colour = "black"))

Correlation plot between the three different source of information:

CorrelationPlot(Tara_Out) +
 theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"))

The correlation between pro_NOGs and pro_phylo is higher than with phychemestry variables.

Selecting the most important variables:

Selection<-VarSelection(Tara_Out,TaraOc,Crit="Rsquare",perc=0.95)

Frecuency of selected variables by type:

library(reshape2)


Ds<-as.data.frame(table(Selection@Variables))
ggplot(Ds, aes(x=Var1, y = Freq)) +
  geom_bar(stat="identity", fill="steelblue") +
  ggpubr::rotate_x_text(45)

The manual of our package has further analysis for this dataset.

2.9 Ruminotypes analysis

The heart of our work is the integration of multiple microbial communities in livestock and to study their relation with methane emission level. The datasets are available in the Ruminotypes object.

library(LinkHD)
data("Ruminotypes")

Ruminotypes has bacteria, archaea and protozoa data from 65 Holstein cows. In addition, the object has several phenotypes which were measured on the same animals. We will focus here on the evaluation between the subspace generated by the 3 communities and their role in methane emmision yield (\(CH_{4}y\), g \(CH_{4}\)/kg feed intake).

library("LinkHD")
library("ggplot2")
options(width =50)

data("Ruminotypes")

2.10 Data Processing

The DataProcessing function of Link-HD, may be implemented in two differents ways. The ‘Standard’ option normalizes the data as usual (i.e., mean-scale). The ‘Compositional’ option is recommended for microbiome data, as it applies a centered log-ratios transformation on the dataset (Aitchison 1982). Note that you can use any custom normalization and skip this step.

Normalization<-lapply(list(Ruminotypes$`16_S`,Ruminotypes$Archaea,Ruminotypes$`18_S`),function(x){DataProcessing(x,Method="Compositional")})
Dataset<-Normalization
names(Dataset)<-c("16_S","Archaea","18_S")

Link-HD enables as input MultiAssayExperiment data, a Bioconductor constructor that combines multiple data elements from different experiment.

library(LinkHD)
library(MultiAssayExperiment)

## Warning: package 'MultiAssayExperiment' was built
## under R version 3.6.3

## Loading required package: SummarizedExperiment

## Loading required package: GenomicRanges

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB,
##     clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply,
##     parCapply, parLapply, parLapplyLB,
##     parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame,
##     basename, cbind, colnames, dirname,
##     do.call, duplicated, eval, evalq,
##     Filter, Find, get, grep, grepl,
##     intersect, is.unsorted, lapply, Map,
##     mapply, match, mget, order, paste, pmax,
##     pmax.int, pmin, pmin.int, Position,
##     rank, rbind, Reduce, rownames, sapply,
##     setdiff, sort, table, tapply, union,
##     unique, unsplit, which, which.max,
##     which.min

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## Loading required package: GenomeInfoDb

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material;
##     view with 'browseVignettes()'. To cite
##     Bioconductor, see 'citation("Biobase")',
##     and for packages 'citation("pkgname")'.

## Loading required package: DelayedArray

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## Loading required package: BiocParallel

## 
## Attaching package: 'DelayedArray'

## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs,
##     rowMins, rowRanges

## The following objects are masked from 'package:base':
## 
##     aperm, apply, rowsum

data("Ruminotypes")
Normalization<-lapply(list(Ruminotypes$`16_S`,Ruminotypes$Archaea,Ruminotypes$`18_S`),function(x){DataProcessing(x,Method="Compositional")})
Dataset<-Normalization
names(Dataset)<-c("16_S","Archaea","18_S")
Dataset<-MultiAssayExperiment(experiments = Dataset)

## ExperimentList contains data.frame or DataFrame,
##   potential for errors with mixed data types

2.11 Linking all source of variations: obtaining the compromise

LinkData is the main function of our package, which carries out the integration step. After applying the centered log-ratio transformation to the compositional data (i.e., microbiome) in the preprocessing step, using the Euclidean distance means that the retrieved distance is Aitchison, a suitable and interpretable metric for the compositing data (Aitchison 1982). Our method also enables other popular metric in microbial ecology studies, such as Bray-Curtis.

Upon obtaining the integrative sub-space, the package can be used to explore the relationship with external phenotypes, as well as to apply unsupervised analysis to explore the whole- data structure. In our example, the samples were stratified in the same way as the Ruminotype structure outlined in the literature (Kittelmann et al. 2013; Danielsson et al. 2017; Ramayo-Caldas et al. 2019). The algorithm used in this step was the partitioning around medioids (pam), which is a more robust version of Kmeans (Park and Jun 2009).

The Scale and Center arguments are applicable if the datasets have not been previously normalized. The nCluster argument is used to automatically cluster the data in \(W\).

Output<-LinkData(Dataset,Distance=rep("euclidean",3),Scale = FALSE,Center=FALSE,nCluster = 3)
#variance explained by compromise coordinates
CompromisePlot(Output)+theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour="black"))

The color indicates the quality of representation of each individual into the two dimensional space from the compromise PCA (more light blue means that the observation is better represented in this space).

The data could be optionally a list of S4 ExpressionSet objects. The advantage of ExpressionSet is their capability to describe samples and features in the experiment, as well as to provide more information related to the whole experiment and data-processing. To use ExpressionSet class, please run:

B<-ExpressionSet()
Dataset1<-Normalization
A<-lapply(Dataset1,as.matrix)
B<-lapply(A,ExpressionSet)
Output_<-LinkData(B,Distance=rep("euclidean",3),Scale = FALSE,Center=FALSE,nCluster = 3)

2.12 Obtaining the projection of each study into the common configuration

The distribution of the samples within each input dataset (equivalent to coordinates principal analysis) can be viewed trough the Globalplot option. This plot is the projection of each individual table into the compromise space.

GlobalPlot(Output)

2.13 Relationships between communities

The Multivariate Correlation Coefficient (RV) estimates the relationship between each microbial community. The results suggest a strong relationship between the prokaryotic communities of Archaea and Bacteria (16_S).

CorrelationPlot(Output)

2.14 Sample stratification in ruminotypes-like clusters

Link-HD recovers a Ruminotypes-like cluster structure (including archaea, bateria and protozoa) of the rumen ecosystem as previously reported (Ramayo-Caldas et al. 2019).

Note that in order to discover the underlying structure in any dataset, you should explore several number of clusters and use a method to obtain the optimum number of groups like e.g., the Silhouette coefficient (Rousseeuw 1987). It could be easily done using extant R functions.

mydata=Output@Compromise.Coords
#set colors
cols <- c("1" = "red", "2" = "blue", "3" = "orange")
mydata$fit.cluster<-as.factor(mydata$fit.cluster)
p<-ggplot(mydata, aes(Dim.1,Dim.2)) +
geom_point(aes(colour = fit.cluster))
p + scale_colour_manual(values = cols) + theme_classic()

It is possible to evaluate the relationship between the structure obtained and external variables (i.e., host phenotype information). Since previous studies (Kittelmann et al. 2013; Danielsson et al. 2017; Ramayo-Caldas et al. 2019) reported a link between the structure of the rumen microbiome and \(CH_4\) emission, here we explore the differences on \(CH_4y\) emission levels between the three clusters of samples. We find significant differences between C1 and C2 and C2 and C3, i.e., the cows clustered within C2 and C3 and C1 and C2 differ in their \(CH_4\) emission levels.

library(lsmeans)
Dat<-data.frame("y"=Ruminotypes$phenotype$ch4y,"E"=as.factor(mydata$fit.cluster))
s=lm(y~E,data=Dat)
#summary(s)
lsmeans(s, "E")

##  E lsmean    SE df lower.CL upper.CL
##  1   23.6 0.567 58     22.5     24.7
##  2   26.3 0.688 58     25.0     27.7
##  3   23.1 0.650 58     21.8     24.4
## 
## Confidence level used: 0.95

pairs(lsmeans(s, "E"))

##  contrast estimate    SE df t.ratio p.value
##  1 - 2      -2.734 0.891 58 -3.068  0.0091 
##  1 - 3       0.485 0.863 58  0.562  0.8405 
##  2 - 3       3.219 0.946 58  3.401  0.0034 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

2.15 Variable selection

We deliver two alternative methods for variable selection: a regression biplot and a differential abundance testing.

As we have explained in the regression biplot section, a linear regression approach can be used to obtain the projection coefficients of all the variables (i.e., OTUs) in the compromise structure. The proportion of explained variance by each regression (\(R^2\)) can be used as a metric to select the most important variables (i.e., those that explain most of the variance).

The second approach is based on differential abundance analysis, which is computed through a non-parametric Kruskall-Wallis test.

Note that the data projected into the common structure are not the raw, but the clr-transformed.

The biplot procedure is implemented through the Select_Var function. The Crit argument refers to the criteria used to select the variables and the perc is to establish the percentage of variables to be selected. Intercept is set to FALSE.

Select_Var<-VarSelection(Output,Data=Dataset,Crit = "Rsquare",perc=0.9)

The differential abundance procedure is implemented in the dAB function as ifollows.

M<-dAB(Output,Data=Dataset)

2.16 Taxonomic aggregation and Taxon Set Enrichment Analysis

The taxonomic classification of the OTUs can be used to aggregate the OTUs at the user defined taxonomic level (i.e., family, genus). We have developed a function to establish whether a group of selected OTUs is ‘enriched’ for a particular taxonomic level by using an hypergeometric test. This function is called OTU2Taxa and can be used in the list of OTUs selected by either approaches (biplot or differential abundance)

The following lines partition the selected variables from the Biplot methodology by each table, i.e., which selected variables come from Archaea, Bacteria and protozoa.

#Select only the selected variables form Select_Var object
SelectedB_16S<-Select_Var@Variables[Select_Var@VarTable=="16_S"]
SelectedB_Archaea<-Select_Var@Variables[Select_Var@VarTable=="Archaea"]
SelectedB_18S<-Select_Var@Variables[Select_Var@VarTable=="18_S"]

While this one does the same for the variables selected from dAB:

Selec_16S<-as.matrix(Ruminotypes[[1]][,colnames(Dataset[[1]])%in%names(M[[1]])])
Selec_Archaea<-as.matrix(Ruminotypes[[2]][,colnames(Dataset[[2]])%in%names(M[[2]])])
Selec_18S<-as.matrix(Ruminotypes[[3]][,colnames(Dataset[[3]])%in%names(M[[3]])])

The following figure represents the bacterial profile of the most important (enriched) variables according to the Biplot procedure aggregated at the family level.

library(reshape2)
SignTaxa<-OTU2Taxa(Selection=Select_Var@VarTable,TaxonInfo=Ruminotypes$Taxa_16S,tableName="16_S",AnalysisLev = "Family")
melted_Table <- melt(SignTaxa$TotalUp1)
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#63D2E4",na.value="steelblue") +
  geom_bar(stat="identity", fill="#63D2E4")+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(1.5)))

The following figure shows the enriched bacterial from the bacteria selected by the dAb analysis

SignTaxa<-OTU2Taxa(Selection=M,TaxonInfo=Ruminotypes$Taxa_16S,tableName="16_S",AnalysisLev = "Family")
melted_Table <- melt(SignTaxa$TotalUp1)
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  theme_classic()+
  geom_bar(stat="identity", fill=  "#56B4E9" )+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90) +
  theme(axis.text.x=element_text(size=rel(0.9)))

After aggregating the selected variables we can also explore their relative abundance patterns (i.e., most relevant genera/family) and their distribution between clusters.

In our example, it is reasonable to expect the highest differences between samples classified as C1 and C2 or C2 and C3 because these samples have opposite patterns along the first dimension of the common structure. The next examples are focused on members of Ruminococcaceae, Christensenellaceae and Succinivibrionaceae family, but the user can extend it to other family or taxonomic levels:

Rumin<-which(Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%SelectedB_16S,]$Family=="Ruminococcaceae")
OTUs<-Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%SelectedB_16S,][Rumin,]$OTUID
Rumin_Tot<-Dataset[['16_S']][,colnames(Dataset[['16_S']])%in%OTUs]
RuminSum<-apply(Rumin_Tot,1,sum)

Total<-data.frame(RuminSum,"Group"=as.factor(Output@Compromise.Coords$fit.cluster))
p <- ggplot(Total, aes(x=Group, y=RuminSum, fill=Group)) +
  geom_boxplot()+
  scale_fill_manual(values=c("Tomato", "Orange", "DodgerBlue"))+
  theme_classic()+ ggtitle(paste("Differences at level",rownames(melted_Table)[16]))
p

Christ<-which(Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%SelectedB_16S,]$Family==rownames(melted_Table)[14])
OTUs<-Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%SelectedB_16S,][Christ,]$OTUID
Christ_Tot<-Dataset[['16_S']][,colnames(Dataset[['16_S']])%in%OTUs]
ChristSum<-apply(Christ_Tot,1,sum)
Total<-data.frame(ChristSum,"Group"=as.factor(Output@Compromise.Coords$fit.cluster))
p <- ggplot(Total, aes(x=Group, y=ChristSum, fill=Group)) +
  geom_boxplot()+
  scale_fill_manual(values=c("Tomato", "Orange", "DodgerBlue"))+
  theme_classic()+ ggtitle(paste("Differences at level",rownames(melted_Table)[14]))
p

The following example represents an alternative view to the previous plot:

ggplot(Total, aes(x=Group, y=rownames(Total), fill=ChristSum)) +
  geom_tile() + scale_fill_gradient(low = "steelblue",
                                    high = "#FF0066",na.value="steelblue") +
  ggpubr::rotate_x_text(45)+
  theme_classic() +
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  theme(axis.text.y=element_text(size=rel(0.6)))

As illustrated in the previous figure, samples from the cluster 2 have a higher frequency of OTUs members of the family Christensenellaceae.

Succi<-which(Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%SelectedB_16S,]$Family=="Succinivibrionaceae")
OTUs<-Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%SelectedB_16S,][Succi,]$OTUID
Succi_Tot<-Dataset[['16_S']][,colnames(Dataset[['16_S']])%in%OTUs]
SucciSum<-apply(Succi_Tot,1,sum)

Total<-data.frame(SucciSum,"Group"=as.factor(Output@Compromise.Coords$fit.cluster))
Total<-Total[- which(Total[,1]>1),]
p <- ggplot(Total, aes(x=Group, y=SucciSum, fill=Group)) +
  geom_boxplot()+
  scale_fill_manual(values=c("#e46563", "#357fe3", "#75e463"))+
  theme_classic()+ ggtitle(paste("Differences at level","Succinivibrionaceae"))+
   theme(axis.text.x=element_text(size=rel(1.5)),legend.title=element_text(size=rel(1.5)), 
        legend.text=element_text(size=rel(1.5)))
p

The members of Succinivibrionaceae showed a higher abundance on the lower CH4 emitting cows as previously reported (Kittelmann et al. 2013; Danielsson et al. 2017; Ramayo-Caldas et al. 2019).

Note that both approaches show the same patterns of enriched families: the three most relevant families associated with Ch4y emission (Ruminococcaceae, Anaerolineaceae and Succinivibrionaceae) are the same.

The following plots show the results for protozoa. The first corresponds to biplot selection strategy and the second one to dAB.

SignTaxa<-OTU2Taxa(Selection=Select_Var@VarTable,TaxonInfo=Ruminotypes$Taxa_18S,tableName="18_S")
melted_Table <- melt(SignTaxa$TotalUp1)
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#B848DB",na.value="steelblue") +
  geom_bar(stat="identity", fill="#B848DB")+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(1.5)))

SignTaxa<-OTU2Taxa(Selection=M,TaxonInfo=Ruminotypes$Taxa_18S,tableName="18_S")
melted_Table <- melt(SignTaxa$TotalUp1[SignTaxa$TotalUp1>1])
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  geom_bar(stat="identity", fill=  "#56B4E9" )+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(0.8)))

In the following boxplot we show the distribution of the Diplodinim genus relative abundance along the clusters.

Dip<-which(Ruminotypes$Taxa_18S[Ruminotypes$Taxa_18S$OTUID%in%SelectedB_18S,]$Genus=="Diplodinium")
OTUs<-Ruminotypes$Taxa_18S[Ruminotypes$Taxa_18S$OTUID%in%SelectedB_18S,][Dip,]$OTUID
Dip_Tot<-Dataset[['18_S']][,colnames(Dataset[['18_S']])%in%OTUs]
DipSum<-apply(Dip_Tot,1,sum)

Total<-data.frame(DipSum,"Group"=as.factor(Output@Compromise.Coords$fit.cluster))


p <- ggplot(Total, aes(x=Group, y=DipSum, fill=Group)) +
  geom_boxplot()+
  scale_fill_manual(values=c("Tomato", "Orange", "DodgerBlue"))+
  theme_classic()+ ggtitle(paste("Differences at level","Diplodinium"))
p

Alternativelly,

ggplot(Total, aes(x=Group, y=rownames(Total), fill=DipSum)) +
  geom_tile() + scale_fill_gradient(low = "steelblue",
                                    high = "#FF0066",na.value="steelblue") +
  ggpubr::rotate_x_text(45)+
  theme_classic() +
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  theme(axis.text.y=element_text(size=rel(0.6)))

We noted similar pattern to the one observed at 16S rRNA gene level.

The following plots show the results for archaeas:

SignTaxa<-OTU2Taxa(Selection=Select_Var@VarTable,TaxonInfo=Ruminotypes$Taxa_Archaea,tableName="Archaea")
melted_Table <- melt(SignTaxa$TotalUp1)
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#46D6AD",na.value="steelblue") +
  geom_bar(stat="identity", fill="#46D6AD")+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(1.5)))

SignTaxa<-OTU2Taxa(Selection=M,TaxonInfo=Ruminotypes$Taxa_Archaea,tableName="Archaea")
melted_Table <- melt(SignTaxa$TotalUp1)
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  geom_bar(stat="identity", fill=  "#56B4E9" )+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(0.9)))

Differences in the relative abundance of Methanobrevibacter between the clusters of samples are represented in the next figure. As expected, lower relative abundance of Methanobrevibacter was observed in the clusters of the lower emitting samples.

Met<-which(Ruminotypes$Taxa_Archaea[Ruminotypes$Taxa_Archaea$OTU%in%SelectedB_Archaea,]$Genus==levels(Ruminotypes$Taxa_Archaea$Genus)[4])
OTUs<-Ruminotypes$Taxa_Archaea[Ruminotypes$Taxa_Archaea$OTU%in%SelectedB_Archaea,][Met,]$OTU
Met_Tot<-Dataset[['Archaea']][,colnames(Dataset[['Archaea']])%in%OTUs]
MetSum<-apply(Met_Tot,1,sum)
Total<-data.frame(MetSum,"Group"=as.factor(Output@Compromise.Coords$fit.cluster))
p <- ggplot(Total, aes(x=Group, y=MetSum, fill=Group)) +
  geom_boxplot()+
  scale_fill_manual(values=c("Tomato", "Orange", "DodgerBlue"))+
  theme_classic()+ ggtitle(paste("Differences at level","Methanobrevibacter"))
p

ggplot(Total, aes(x=Group, y=rownames(Total), fill=MetSum)) +
  geom_tile() + scale_fill_gradient(low = "steelblue",
                                    high = "#FF0066",na.value="steelblue") +
  ggpubr::rotate_x_text(45)+
  theme_classic() +
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  theme(axis.text.y=element_text(size=rel(0.6)))

2.17 Taxon set enrichment analysis from a user-defined OTUs’ list

OTU2Taxa function can be used to aggregate OTUs into their taxonomic characteristics and to compute the most significant (“enrichment”) genera/families to any OTUs’ list, i.e., a list obtained by another methodology. For instance, in Ramayo-Caldas et al. (2019) the authors selected two OTUs list by applying SPLS and Discriminant SPLS (SPLS DA). Here, we show how to use these external dataset with OTU2Taxa:

#library("readxl")
#library("XLConnect")
#library("openxlsx")

# Download the list from supplementary material: 
#temp = tempfile(fileext = ".xls")
#dataURL <- "https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1111%2Fjbg.12427&file=jbg12427-sup-0004-TableS4.xls"
#download.file(dataURL, destfile=temp, mode='wb')
#SPLS_DA<-read_excel(temp,sheet=1)
#SPLS<-read_excel(temp,sheet=2)

#Create a list with selected OTU. Note that if you have only one table, the list should have lenght equal to 1. 

#Selected_OTU<-list("SPLS"=SPLS$OTU,"SPLS_DA"=SPLS_DA$OTU)
# OTUs have to be the lists names:
#names(Selected_OTU$SPLS)<-SPLS$OTU
#names(Selected_OTU$SPLS_DA)<-SPLS_DA$OTU


#TaxonInfo is the same than 16_S taxon info into Ruminotypes dataset: 
#Apply OTU2Taxa
#SignTaxa<-OTU2Taxa(Selection=Selected_OTU,TaxonInfo=Ruminotypes$Taxa_16S,tableName="SPLS",AnalysisLev = "Family")
#SignTaxa2<-OTU2Taxa(Selection=Selected_OTU,TaxonInfo=Ruminotypes$Taxa_16S,tableName="SPLS_DA",AnalysisLev = "Family")



#melted_Table <- melt(SignTaxa2$TotalUp1)
#ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
#scale_fill_gradient(low = "steelblue",high = "#FF0066",na.value="steelblue") +
#  theme_classic()+
#  geom_bar(stat="identity", fill=  "#ff661a" )+
#  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
#  ggpubr::rotate_x_text(90) +
#  theme(axis.text.x=element_text(size=rel(0.9)))

2.18 Alternative omics layers: NCI-60 transcriptome datasets

The following example consists in the illustrative analysis of a subset of NCI-60 cell line panel (Shankavaram et al. 2009; Reinhold et al. 2012), which is widely used in anti-cancer drug screening. The whole dataset contains 60 human tumour cell lines derived from patients with Leukemia, Melanoma, Lung, Colon, Central Nervous System, Ovarian, Renal, Breast and Prostate cancers. The gene expression of these cell lines was measured using four different microarrays platforms: Agilent mRNA (Whole Human Genome Microarray, \(4 \times 44K\)) and Affymetrix (HGU 95, HGU 133 and HGU 133plus 2.0). The toy dataset we use here is freely available in the omicade4 package (Meng et al. 2013).

library(omicade4)
data(NCI60_4arrays)
#check data dimensions
lapply(NCI60_4arrays, dim)

## $agilent
## [1] 300  60
## 
## $hgu133
## [1] 298  60
## 
## $hgu133p2
## [1] 268  60
## 
## $hgu95
## [1] 288  60

As Link-HD requires common elements in rows, matrices need to be transpose prior to apply LinkData function:

data<-lapply(NCI60_4arrays, function(x){t(x)})
lapply(data,dim)

## $agilent
## [1]  60 300
## 
## $hgu133
## [1]  60 298
## 
## $hgu133p2
## [1]  60 268
## 
## $hgu95
## [1]  60 288

Output<-LinkData(data,Distance=rep("euclidean",3),Scale = TRUE,Center=TRUE,nCluster = 9)
#variance explained by compromise coordinates
CompromisePlot(Output)+theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour="black"))

Note that Link-HD proportion of explained variance by the two first components (PC1=17.11% and PC=12.73%) is simmilar to the one reported by (Meng et al. 2013).

To better undestand the samples structure, it is possible to color them by tumoral tissues:

cancer_type <- rownames(data$agilent)
#To select strings before the "."
cancer_type <- sapply(strsplit(cancer_type, split="\\."),function(x){x[1]})

mydata=Output@Compromise.Coords
#set colors
cols <- c("#ffcc00","#ff0066","#00cc99","#0000ff","#ffff00","#ffb3b3","#1affa3","#660033","#26004d")
mydata$fit.cluster<-as.factor(mydata$fit.cluster)
p<-ggplot(mydata, aes(Dim.1,Dim.2)) +
geom_point(aes(colour = cancer_type))
p + scale_colour_manual(values = cols) + theme_classic()

Meanwhile, the following figure represents the color according to the Link-HD clusters assignation:

mydata=Output@Compromise.Coords
#set colors
cols <- c("#ffcc00","#ff0066","#00cc99","#0000ff","#ffff00","#ffb3b3","#1affa3","#660033","#26004d")
mydata=Output@Compromise.Coords
mydata$fit.cluster<-as.factor(mydata$fit.cluster)
p<-ggplot(mydata, aes(Dim.1,Dim.2)) +
geom_point(aes(colour = fit.cluster))
p + scale_colour_manual(values = cols) + theme_classic()

Link-HD obtains the underlying structure from the consensus subspace. We have implemented our method picking out nine (the number of tissues in NCI-60 dataset) and three clusters, respectively. In the first case, Link-HD recovers the structure of melanoma, CNS and Renal tissues. For the three clusters structure, the first axis divides the tissues into two groups with leukemia, colon and two breast cancer tissues (yellow in the plot below) and the rest of carcinoma lines (pink).

Finally, the second axis splits these carcinoma and leukemia cell-lines from melanoma. The structure of the first axis (which explains the greatest variation (17,11%),Dim.1) is associated with the separation of epithelial and mesenchymal mechanisms, which in turn, is related with the cancer prognosis (Meng et al. 2013).

Output<-LinkData(data,Distance=rep("euclidean",3),Scale = TRUE,Center=TRUE,nCluster = 3)
mydata=Output@Compromise.Coords
#set colors
cols <- c("#ffcc00","#ff0066","#00cc99")
mydata=Output@Compromise.Coords
mydata$fit.cluster<-as.factor(mydata$fit.cluster)
p<-ggplot(mydata, aes(Dim.1,Dim.2)) +
geom_point(aes(colour = fit.cluster))
p + scale_colour_manual(values = cols) + theme_classic()

The RV plot shows that agilent data has a different profile with respect with the other 3 microarrays platforms (all the RV correlations between agilent and the other 3 datasets are less than 0.8). I.e., it should be expected than the information from agilent will be different from that provided by the other three platforms.

CorrelationPlot(Output)

We also apply Select_Var function in order to select the most important variables, responsible of the obtained structure.

Select_Var<-VarSelection(Output,Data=data,Crit = "Rsquare",perc=0.9)

Select_Var@Variables

##   [1] "LHFP"       "WWC2"       "LEPREL1"   
##   [4] "PVR"        "VAMP3"      "LOC344978" 
##   [7] "BACE1"      "FERMT1"     "FKBP10"    
##  [10] "TBC1D2"     "SYNM"       "KRT8P23"   
##  [13] "HERC1"      "SC65"       "ANXA3"     
##  [16] "MLF1"       "PTGES"      "TOR2A"     
##  [19] "OCLN"       "PPIC"       "GPC6"      
##  [22] "GMFG"       "ETV3"       "C9orf30"   
##  [25] "C19orf39"   "S100B"      "PTPN21"    
##  [28] "SMAD3"      "BACE2"      "GNPTAB"    
##  [31] "CA12"       "HSPBAP1"    "ARAP3"     
##  [34] "ZFP36"      "PDLIM2"     "DUSP4"     
##  [37] "FAM57A"     "HELZ"       "ECHDC2"    
##  [40] "SPEG"       "C1orf131"   "ATP1A3"    
##  [43] "C10orf90"   "C3orf59"    "PRKD1"     
##  [46] "F3"         "CA14"       "CPNE7"     
##  [49] "FGD3"       "RAB32"      "KAT2B"     
##  [52] "COPS7A"     "S100A1"     "ERBB2"     
##  [55] "GP9"        "SERPINF1"   "DUSP3"     
##  [58] "THAP10"     "LOC729732"  "CTNS"      
##  [61] "STEAP2"     "EPHA4"      "HDGFRP3"   
##  [64] "FGF1"       "ETS1"       "MEF2C"     
##  [67] "VEPH1"      "TCEAL2"     "STX5"      
##  [70] "VGF"        "ST8SIA1"    "MORN2"     
##  [73] "LOC442157"  "FAM69B"     "GDPD5"     
##  [76] "ANKRD9"     "NAAA"       "EDC4"      
##  [79] "C4orf32"    "STX1A"      "LOC441862" 
##  [82] "LOC728855"  "ALDH3A2"    "2.HOXB2"   
##  [85] "VAV1"       "GPNMB"      "2.C6orf218"
##  [88] "S100B"      "S100A1"     "GPR137B"   
##  [91] "2.CSPG4"    "EGFR"       "2.DKK3"    
##  [94] "CD3D"       "2.LAMA4"    "GMFG"      
##  [97] "2.SPARC"    "2.KLHL6"    "2.RHOH"    
## [100] "CPN1"       "2.GAPDHS"   "EDNRB"     
## [103] "2.POU3F2"   "2.EGFR"     "LYST"      
## [106] "SERPINF1"   "2.C1orf228" "HOXB2"     
## [109] "C6orf218"   "CSPG4"      "DKK3"      
## [112] "LAMA4"      "SPARC"      "KLHL6"     
## [115] "RHOH"       "GAPDHS"     "POU3F2"    
## [118] "EGFR"       "C1orf228"   "PLP1"      
## [121] "SOX10"      "ACP5"       "S100A1"    
## [124] "TRPV2"      "CITED1"     "FABP7"     
## [127] "MAGEA3"

We have compared the list of selected genes against the cancer related markers reported by (Meng et al. 2013). 37.61% (41/109) of the Link-HD selected genes were also reported in (Meng et al. 2013). Between them, we remark 20 melanoma-related markers (MAGEA3, S100A1, S100B, GPNMB, CSPG4,POU3F2,C10orf90,VGF,SERPINF1,GAPDHS,ACP5,CA14,ST8SIA1,FABP7,C6orf218,GPR137B,EDNRB,LYST,SOX10,TRPV2) and five leukemia markers (GMFG, FGD3, VAV1, ATP1A3, RHOH).

Finally, to gain insight into the overrepresented biological processes, we carried out a classical set enrichment analyses, Gene Ontology (GO) approach by using the Functional Gene Networks (FGNet) Bioconductor package (Aibar et al. 2015).

library(FGNet)
library("org.Hs.eg.db")
List<-union(union(union(toupper(colnames(data[[1]])),toupper(colnames(data[[2]]))),toupper(colnames(data[[3]]))),toupper(colnames(data[[4]])))

## we have 22756 different genes

geneLabels<-List
feaResults_topGO <- fea_topGO(Select_Var@Variables,genesUniverse = List ,geneIdType="SYMBOL", organism="Hs") 
feaResults <- feaResults_topGO
incidMat <- fea2incidMat(feaResults)
functionalNetwork(incidMat, plotTitleSub="Selected Genes Relationships",legendText = FALSE)

The Gene Enrichment Analysis returns a list with the top Gene Ontology (GO) terms. The first enriched proccess is tube morphogenesis, which can lead to catastrophic dysfunction if it occurs at the wrong time or place. Cancer mortality is associated with metastasis, which can be considered as a case of morphogenesis occurring at the wrong time and in the wrong place (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2854166/).

Angiogenesis Regulation, which is essential for tumor development and metastasis, Epithelial cell proliferation and Regulation of epithelial cell proliferation, that are involve in colonorectal neoplasms is also one of the top-5 pathways enriched here.

External tools for Gene Enrichment Analysis are also available. For instance,the results from IPA (Ingenuity Pathway Analysis) reveals several cancer - related pathways, such as the Regulation of the Epithelial-Mesenchymal Transition (EMT), which is crucial in the metastasis development. Moreover, we also identified biological functions related with Carcinomas and digestive cancer (Organismal Injury and Abnormalities Carcinoma,Gastrointestinal Disease, Organismal Injury and Abnormalities Digestive system cancer).

2.19 Final Remarks

In summary, we have developed Link-HD, an R package that combines some well-known multivariate methods to integrate heterogeneous datasets. We show the utility of Link-HD in three different datasets. Applied to data from the TARA Ocean expedition, Link-HD reproduces the relevance of members of Proteobacteria phyla and temperature on the functionality and stratification of this ecosystem as previously reported by mixKernel analysis. As for the rumen microbiota, the structure of the rumen microbiome obtained by Link-HD subspace is associated with \(CH_4\) emission. The method also provides more biological knowledge because it allowed the selection of variables and enrichment analysis. Our approach can also be used to integrate and explore others kind of ‘omics’ data as shown in the last example. A more detailed user guide is available in: https://lauzingaretti.github.io/LinkHD/ and Bioconductor

2.20 Appendix

List of LinkHD dependencies:

methods,
scales,
ggplot2,
dplyr,
cluster,
graphics, 
ggpubr, 
reshape,
gridExtra,
vegan,
data.table,
rio, 
MultiAssayExperiment

References

Aibar, Sara, Celia Fontanillo, Conrad Droste, and Javier De Las Rivas. 2015. “Functional Gene Networks: R/Bioc Package to Generate and Analyze Gene Networks Derived from Functional Enrichment and Clustering.” Bioinformatics 31 (10): 1686–8. https://doi.org/10.1093/bioinformatics/btu864.

Aitchison, John. 1982. “The Statistical Analysis of Compositional Data.” Journal of the Royal Statistical Society: Series B (Methodological) 44 (2). Wiley Online Library: 139–60.

Argelaguet, Ricard, Britta Velten, Damien Arnol, Sascha Dietrich, Thorsten Zenz, John C Marioni, Florian Buettner, Wolfgang Huber, and Oliver Stegle. 2018. “Multi-Omics Factor Analysis—a Framework for Unsupervised Integration of Multi-Omics Data Sets.” Molecular Systems Biology 14 (6). EMBO Press: e8124.

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B (Methodological) 57 (1). Wiley Online Library: 289–300.

Danielsson, Rebecca, Johan Dicksved, Li Sun, Horacio Gonda, Bettina Müller, Anna Schnürer, and Jan Bertilsson. 2017. “Methane Production in Dairy Cows Correlates with Rumen Methanogenic and Bacterial Community Structure.” Frontiers in Microbiology 8. Frontiers: 226.

Escoufier, Y, P Cazes, and others. 1976. “Opérateurs et Analyse Des Tableaux à Plus de Deux Dimensions.” Cahiers Du Bureau Universitaire de Recherche Opérationnelle 25: 61–89.

Gabriel, K. R. 1971. “The Biplot Graphic Display of Matrices with Application to Principal Component Analysis.” Biometrika 58 (3): 453. https://doi.org/10.2307/2334381.

Gower, John C, and David J Hand. 1995. Biplots. Vol. 54. CRC Press.

Greenacre, Michael J. 2010. Biplots in Practice. Fundacion BBVA.

Kittelmann, Sandra, Henning Seedorf, William A Walters, Jose C Clemente, Rob Knight, Jeffrey I Gordon, and Peter H Janssen. 2013. “Simultaneous Amplicon Sequencing to Explore Co-Occurrence Patterns of Bacterial, Archaeal and Eukaryotic Microorganisms in Rumen Microbial Communities.” PloS One 8 (2). Public Library of Science: e47879.

Kruskal, William H, and W Allen Wallis. 1952. “Use of Ranks in One-Criterion Variance Analysis.” Journal of the American Statistical Association 47 (260). Taylor & Francis Group: 583–621.

Lavit, Christine. 1988. “Presentation de La Méthode Statis Permettant L’analyse Conjointe de Plusieurs Tableaux de Données Quantitatives.” Cachiers de La Recherche Dévelopment 18: 49–60.

Lavit, Christine, Yves Escoufier, Robert Sabatier, and Pierre Traissac. 1994. “The ACT (STATIS method).” Computational Statistics & Data Analysis 18 (1): 97–119. https://doi.org/10.1016/0167-9473(94)90134-1.

Mariette, Jérôme, and Nathalie Villa-Vialaneix. 2017. “Unsupervised Multiple Kernel Learning for Heterogeneous Data Integration.” Bioinformatics 34 (6). Oxford University Press: 1009–15.

Meng, Chen, Bernhard Kuster, Aedin Culhane, and Amin Moghaddas Gholami. 2013. “A Multivariate Approach to the Integration of Multi-Omics Datasets.” BMC Bioinformatics.

Park, Hae-Sang, and Chi-Hyuck Jun. 2009. “A Simple and Fast Algorithm for K-Medoids Clustering.” Expert Systems with Applications 36 (2). Elsevier: 3336–41.

Paulson, Joseph N, O Colin Stine, Héctor Corrada Bravo, and Mihai Pop. 2013. “Differential Abundance Analysis for Microbial Marker-Gene Surveys.” Nature Methods 10 (12). Nature Publishing Group: 1200.

Plantes, Henri L’Hermier des. 1976. “Structuration Des Tableaux à Trois Indices de La Statistique: Théorie et Application d’une Méthode d’analyse Conjointe.” PhD thesis, Université des sciences et techniques du Languedoc.

Ramayo-Caldas, Yuliaxis, Laura Zingaretti, Milka Popova, Jordi Estellé, Aurelien Bernard, Nicolas Pons, Pau Bellot, et al. 2019. “Identification of Rumen Microbial Biomarkers Linked to Methane Emission in Holstein Dairy Cows.” Journal of Animal Breeding and Genetics. Wiley Online Library.

Reinhold, William C, Margot Sunshine, Hongfang Liu, Sudhir Varma, Kurt W Kohn, Joel Morris, James Doroshow, and Yves Pommier. 2012. “CellMiner: a web-based suite of genomic and pharmacologic tools to explore transcript and drug patterns in the NCI-60 cell line set.” Cancer Research 72 (14): 3499–3511. https://doi.org/10.1158/0008-5472.CAN-12-1370.

Robert, Paul, and Yves Escoufier. 1976. “A unifying tool for linear multivariate statistical methods: the RV-coefficient.” Applied Statistics. JSTOR, 257–65.

Rohart, Florian, Benoit Gautier, Amrit Singh, and Kim-Anh Lê Cao. 2017. “MixOmics: An R Package for ‘Omics Feature Selection and Multiple Data Integration.” PLoS Computational Biology 13 (11). Public Library of Science: e1005752.

Rousseeuw, Peter J. 1987. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics 20. Elsevier: 53–65.

Shankavaram, Uma T, Sudhir Varma, David Kane, Margot Sunshine, Krishna K Chary, William C Reinhold, Yves Pommier, and John N Weinstein. 2009. “CellMiner: a relational database and query tool for the NCI-60 cancer cell lines.” BMC Genomics 10 (January): 277. https://doi.org/10.1186/1471-2164-10-277.

Link-HD: a versatile framework to explore and integrate heterogeneous data

March 23th, 2021

Contents

1 Version Info