4 Integrated/Quasi-Mediation

4.1 Load data and packages

Before starting, load required packages. The code in this section relies heavily on the functions from the R package LUCIDus (Peng et al. 2020).

Code

# General Packages:
library(EnvirOmix)
library(tidyverse)
library(table1)
# Load plotting packages:
library(ggplot2)
library(cowplot)
library(ComplexHeatmap)
library(ggh4x)
# Packages for Quasi-mediation:
#LUCIDus Version 3.0.1 is required for this analysis:
library(LUCIDus)
library(networkD3)
library(plotly)
library(htmlwidgets)
library(jsonlite)

And load the data:

Code

# Load R package
library(EnvirOmix)

# Load simulated data
data(simulated_data)

# Define exposure and outcome name
exposure <- simulated_data[["phenotype"]]$hs_hg_m_scaled
outcome  <- simulated_data[["phenotype"]]$ck18_scaled

# Get numeric matrix of covariates 
covs <- simulated_data[["phenotype"]][c("e3_sex_None", "hs_child_age_yrs_None")] 
covs$e3_sex_None <- ifelse(covs$e3_sex_None == "male", 1, 0)

# create list of omics data 
omics_lst <- simulated_data[-which(names(simulated_data) == "phenotype")]

# Create data frame of omics data
omics_df <- omics_lst |> 
  purrr::map(~as_tibble(.x, rownames = "name")) |>
  purrr::reduce(left_join, by = "name") |>
  column_to_rownames("name")

For this analysis, we will subset to only three omics layers to reduce the computational burden.

Code

# Three omics layers, Methylation (CpG), Transcriptome and miRNA
omics_df_analysis <- omics_df %>% 
  dplyr::select(contains("cg"), contains("TC"), contains("miR"))

The figures for quasi-mediation rely on several custom functions. The code for functions are provided in Chapter 6.

4.2 Early Integration

In the Early Integration model, genomic/exposomic exposures \(G\), other omics data \(Z\) and phenotype trait \(Y\) are integrated through a latent categorical variable \(X\), as implemented in the R package LUCIDus (Peng et al. 2020). Because \(X\) is an unobserved categorical variable, each category of \(X\) is interpreted as a latent cluster in the data, jointly defined by \(G\), \(Z\) and \(Y\). Let \(G\) be a \(N \times P\) matrix with columns representing genetic/environmental exposures, and rows representing the observations; \(Z\) be a \(N \times M\) matrix of omics data (for example, gene expression data, DNA methylation profiles and metabolomic data etc.) and \(Y\) be a \(N\)-length vector of phenotype trait. We further assume \(G\), \(Z\) and \(Y\) are measured through a prospective sampling procedure so we do not model the distribution of \(G\). All three measured components (\(G\), \(Z\) and \(Y\)) are linked by a latent variable \(X\) consisting of \(K\) categories. The distributions of \(X\) given \(G\), \(Z\) given \(X\) and \(Y\) given \(X\) are conditionally independent with each other. Let \(f(\cdot)\) denote the probability mass functions (PMF) for categorical random variables or the probability density functions (PDF) for continuous random variables. The joint log-likelihood of the LUCID model is constructed as:

\[ \begin{aligned} \log L({\Theta}) & = \sum_{i = 1}^N \log f({Z}_i, Y_i|{G}_i;{\Theta}) \\ & = \sum_{i = 1}^N \log \sum_{j = 1}^K f(X_i = j|{G}_i;{\Theta}) f({Z}_i| X_i = j; {\Theta}) f(Y_i|X_i = j; {\Theta}) \end{aligned} \] where \(\Theta\) is a generic notation for all parameters in Early Integration LUCID model. EM algorithm is implemented to estimate all parameters \(\Theta\) iteratively until convergence, and \(\Theta\) represents the \(G\) to \(X\), \(X\) to \(Z\), and \(X\) to \(Y\) associations,

We a-priori assigned the number of clusters to two. Early Integration LUCID estimates two omics specific clusters which represent an differential risks for the outcome. These omic profiles confer different risks of the outcome and represent each omics feature’s contribution to the exposure-outcome association.

4.2.1 Analysis: Early Integration

When using estimate_lucid() to fit the Early Integration LUCID model, we specify lucid_model = "early", and K is an integer representing number of latent clusters. G, Z, Y are the inputs for exposure, omics data matrix, and the outcome, respectively.CoY are the covariates to be adjusted for the \(X\) to \(Y\) association and CoG are the covariates to be adjusted for the \(G\) to \(X\) association. We specify useY = TRUE to construct supervised LUCID model. Otherwise, useY = FALSE will construct unsupervised LUCID model. init_par = "random" means that we initiate the parameters with random guess. We specify family = "normal" since the outcome is continuous. If the outcome is binary, we would specify family = "binary".The full code for the sankey_early_integration function is provided in Section 6.1.1

Code

G = exposure |> as.matrix()
Z = omics_df_analysis |> as.matrix()

fit1 <- estimate_lucid(lucid_model = "early",
                       G = G,
                       Z = Z,
                       Y = outcome, 
                       K = 2,
                       CoY = covs,
                       CoG = covs,
                       useY = TRUE,
                       init_par = "random",
                       family = "normal")
## .....

4.2.2 Sankey Diagram

Code

p1 <- sankey_early_integration(fit1, text_size = 20)

Figure 4.1: The Sankey Diagram for LUCID (Early Integration).

4.2.3 Early Integration Results

Omics profiles for each cluster determined using Early Integration LUCID. The full code for the plot_omics_profiles function is provided in Section 6.1.4

Code

plot_omics_profiles(fit1, "Early", omics_lst)

4.3 Intermediate Integration

In LUCID in parallel conducting intermediate integration, latent clusters \(X_a\) are estimated in each omics layer separately while integrating information from the genetic/environmental exposure \(G\) and the outcome \(Y\) by assuming no correlations across different omics layers. Let \(G\) be a \(N \times P\) matrix with columns representing genetic/environmental exposures, and rows representing the observations; there is a collection of \(m\) omics data, denoted by \(Z_1\), . . . , \(Z_a\), . . . , \(Z_m\) with corresponding dimensions \(p_1\), . . . , \(p_a\), . . . , \(p_m\). Each omics data \(Z_a\) is summarized by a latent categorical variable \(X_a\), which contains \(K_a\) categories. Each category is interpreted as a latent cluster (or subgroup) for that particular omics layer and \(Y\) be a \(N\)-length vector of phenotype trait. Let \(D\) be the generic notation for all observed data. The log likelihood of LUCID with multiple latent variables is constructed below,

\[ \begin{aligned} l(\boldsymbol{\Theta} \mid \boldsymbol{D}) & =\sum_{i=1}^{n} \log f\left(\boldsymbol{Z}_{1 i}, \ldots, \boldsymbol{Z}_{m i}, \boldsymbol{Y}_{i} \mid \boldsymbol{G}_{i} ; \boldsymbol{\Theta}\right) \\ & =\sum_{i=1}^{n} \log \left[\prod_{j_{1}=1}^{k_{1}} \cdots \prod_{j_{m}=1}^{k_{m}} f\left(\boldsymbol{Z}_{1 i}, \ldots, \boldsymbol{Z}_{m i}, X_{1 i}, \ldots, X_{m i}, Y_{i} \mid \boldsymbol{G}_{i} ; \boldsymbol{\Theta}\right)^{I\left(X_{1 i}=j_{1}, \ldots, X_{m i}=j_{m}\right)}\right] \\ & =\sum_{i=1}^{n} \sum_{j_{1}=1}^{k_{1}} \ldots \sum_{j_{m}=1}^{k_{m}} I\left(X_{1 i}=j_{1}, \ldots, X_{m i}=j_{m}\right) \log f\left(\boldsymbol{Z}_{1 i}, \ldots, \boldsymbol{Z}_{m i}, X_{1 i}, \ldots, X_{m i}, Y_{i} \mid \boldsymbol{G}_{i} ; \boldsymbol{\Theta}\right) \\ & =\sum_{i=1}^{n} \sum_{j_{1}=1}^{k_{1}} \ldots \sum_{j_{m}=1}^{k_{m}} I\left(X_{1 i}=j_{1}, \ldots, X_{m i}=j_{m}\right) \log \phi\left(Y_{i} \mid X_{1 i}, \ldots, X_{m i}, \boldsymbol{\delta}, \sigma^{2}\right) \\ & +\sum_{i=1}^{n} \sum_{a=1}^{m} \sum_{j_{1}=1}^{k_{1}} \ldots \sum_{j_{m}=1}^{k_{m}} I\left(X_{1 i}=j_{1}, \ldots, X_{m i}=j_{m}\right) \log \phi\left(\boldsymbol{Z}_{a i} \mid X_{a i}=j_{a}, \boldsymbol{\mu}_{a, j_{a}}, \boldsymbol{\Sigma}_{a, j_{a}}\right) \\ & +\sum_{i=1}^{n} \sum_{a=1}^{m} \sum_{j_{1}=1}^{k_{1}} \ldots \sum_{j_{m}=1}^{k_{m}} I\left(X_{1 i}=j_{1}, \ldots, X_{m i}=j_{m}\right) \log S\left(\boldsymbol{X}_{a i}=j_{a} \mid \boldsymbol{G}_{i}, \boldsymbol{\beta}_{a}\right) \end{aligned} \]

The log likelihood of LUCID in parallel is similar to that with Early integration LUCID. It is natural to follow the same principles of EM algorithm for Early integration LUCID with single intermediate variable.

We a-priori assigned the number of clusters for each omic layer to two. LUCID in parallel estimates omics specific clusters which represent an differential risks for the outcome within the layer. For each set of latent clusters within each omics layer, the corresponding omics profile was computed to identify the independent contribution of each omics layer to the exposure-outcome association.

4.3.1 Analysis: LUCID with 3 omics layers in parallel

When using estimate_lucid() to fit LUCID in parallel model, we specify lucid_model = "parallel", and K is a list with the same length as the number of omics layers and each element in the list is an integer representing number of latent clusters for the corresponding omics layer. max_itr = 200 means that the algorithm will stop when the iterations reaches 200 if still not reaching convergence. We specify useY = TRUE to construct supervised LUCID model. modelName is a vector of strings specifies the geometric model of omics data, the default modelName is “VVV”, but here “EEV” is more suitable.

Code

G = exposure %>% as.matrix()
Z = omics_lst[c(1:3)]
Y = outcome 

fit <- estimate_lucid(lucid_model = "parallel",
                      G = G, 
                      Z = Z, 
                      Y = Y,
                      K = rep(2, length(Z)), 
                      family = "normal", 
                      max_itr = 200, 
                      useY = TRUE, 
                      init_omic.data.model  = "EEV")

# Reorder the clusters
fit_reordered <- reorder_lucid_parallel(fit, reference = c(2,2,2))

4.3.2 Sankey Diagram

The full code for the plot_lucid_in_parellel_plotly function is provided in Section 6.1.2

Code

p2 <- plot_lucid_in_parallel_plotly(fit_reordered,
                                    sankey_colors = sankey_colors,
                                    text_size = 20,
                                    n_z_ftrs_to_plot = c(7,7,7))

Figure 4.2: The Sankey Diagram for LUCID in Parallel (Intermediate Integration).

4.3.3 Omics profiles for each cluster predicted by LUCID

The full code for the plot_omics_profiles function is provided in Section 6.1.4

Code

plot_omics_profiles(fit_reordered, "Intermediate", omics_lst)

4.4 Late Integration

For late integration, LUCID in serial is implemented, which successively links multiple single omics LUCID models using the Early Integration LUCID model. Let \(G\) be a \(N \times P\) matrix with columns representing genetic/environmental exposures, and rows representing the observations; there is an ordered collection of \(m\) omics data, denoted by \(Z_1\), . . . , \(Z_a\), . . . , \(Z_m\) with corresponding dimensions \(p_1\), . . . , \(p_a\), . . . , \(p_m\). Successive omics layers \(Z_a\) are linked by using each observations’ posterior inclusion probability (PIP) for latent clusters \(X_a\) in the initial LUCID model to be the “exposure” variable for each successive model. The omics layers are ordered in a sequential fashion based on the biological relationships between omics layers. For the first model, an unsupervised Early Integration LUCID model using \(G\) as the exposure and \(Z_1\) as the omics layer. The PIPs of the non-reference clusters were extracted and used as the input for the exposure for the following unsupervised Early Integration LUCID model. This procedure is iterated until the last omics layer \(Z_m\), for which a supervised Early Integration LUCID model is used to conduct integrated clustering while using PIPs from the previous LUCID model and information on the outcome \(Y\).

4.4.1 Analysis: LUCID with 3 omics layers in serial

When using estimate_lucid() to fit LUCID in serial model, we specify lucid_model = "serial", and K is a list with the same length as the number of ordered omics layers and each element in the list is an integer representing number of latent clusters for the corresponding omics layer. G, Z, Y are the inputs for exposure, omics data matrix, and the outcome, respectively. Note that here Z is list of omics layers (vectors). CoY are the covariates to be adjusted for the \(X\) to \(Y\) association and CoG are the covariates to be adjusted for the \(G\) to \(X\) association. We specify useY = TRUE to construct supervised LUCID in serial model. We specify family = "normal" since the outcome is continuous. If the outcome is binary, we would specify family = "binary". We specify Rho_Z_Mu = 10 and Rho_Z_Cov = .3 to use LASSO penalty to regularize cluster-specific means and variance for \(Z\). We will get a selection of \(Z\) features for each layer, and then we refit the LUCID in serial model with only selected \(Z\) features to construct the final model.

Code

set.seed(100)
G_Hg = exposure %>% as.matrix()
Z = list(omics_lst$methylome,
         omics_lst$transcriptome,
         omics_lst$miRNA)
Y_liv_inj = scale(outcome)

# Run the LUCID Model  
fit <- estimate_lucid(lucid_model = "serial",
                      G = G_Hg,
                      Z = Z,
                      Y = Y_liv_inj, 
                      CoY = covs,
                      CoG = covs,
                      K = list(2,2,2),
                      useY = TRUE,
                      family = "normal", 
                      Rho_Z_Mu = 10,
                      Rho_Z_Cov = .3)

#extract selected Z features to to refit lUCID in serial
selected_Z = vector(mode = "list", length = length(Z))
for (i in (1:length(Z))){
  fiti_select_Z = fit$submodel[[i]]$select$selectZ
  selected_Z[[i]] = Z[[i]][,fiti_select_Z]
}

#Refit ther LUCID in serial model with selected Z
fit <- estimate_lucid(lucid_model = "serial",
                      G = G_Hg,
                      Z = selected_Z,
                      Y = Y_liv_inj, 
                      CoY = covs,
                      CoG = covs,
                      K = list(2,2,2),
                      useY = TRUE,
                      init_par = "random",
                      family = "normal")

# Rename Exposure
fit1 <- fit$submodel[[1]]
fit1$var.names$Gnames[1] <- "Hg"
fit2 <- fit$submodel[[2]]
fit2$var.names$Gnames[1] <- "<b>Methylation\nProfile 1</b>"
fit3 <- fit$submodel[[3]]
fit3$var.names$Gnames[1] <- "<b>miRNA\nProfile 1</b>"

4.4.2 Sankey Diagram

The full code for the sankey_in_serial function is provided in Section 6.1.3.

Code

col_pal <- RColorBrewer::brewer.pal(n = 8, name = "Dark2")
color_pal_sankey <- matrix(c("exposure", "red",
                             "lc"      , "#b3d8ff",
                             "TC"      , col_pal[2],
                             "CpG"     , col_pal[1],
                             "miRNA"   , col_pal[3],
                             "outcome" , "grey"), 
                           ncol = 2, byrow = TRUE) %>%
  as_tibble(.name_repair = "unique") %>% 
  janitor::clean_names() %>%
  dplyr::rename(group = x1, color = x2)

p3<- sankey_in_serial(fit1, 
                      fit2, 
                      fit3, 
                      color_pal_sankey,
                      text_size = 24)

Figure 4.3: The Sankey Diagram for LUCID in Serial (Late Integration).