library(mofaCLL)
library(glmnet)
library(DESeq2)
This document illustrates how to use the function in the mofaCLL package to estimate the CLL-PD values for samples in a CLL cohort by using gene expression dataset.
The only input data needs is the gene expression profiling data from either mRNA sequencing or microarray. The rownames are gene identifiers and column names are sample identifiers. Currently, of Ensembl Gene ID or HGNC Gene Symbol are supported as gene identifiers.
For RNA sequencing data, either the variance stabilized counts generated by the from DESeq2 package or the transformed counts using from Limma package are recommended. For microarray data, the log2 transformed intensities are recommended.
Here’s an example of input dataset
filePath <- system.file("externalData/testCohort.tsv", package = "mofaCLL")
exprMat <- read.table(filePath, header = TRUE)
head(exprMat)
## CLL1 CLL2 CLL3 CLL4 CLL5 CLL6 CLL7 CLL8 CLL9 CLL10
## ENSG00000067048 16.17 8.69 16.31 15.58 15.96 8.92 16.57 16.56 8.37 15.81
## ENSG00000012817 15.79 8.58 15.53 15.40 15.36 8.60 15.91 16.18 8.33 16.21
## ENSG00000125740 14.46 17.97 14.99 10.21 10.88 13.14 17.76 14.78 8.28 8.73
## ENSG00000129824 15.18 8.61 15.58 15.00 15.29 8.54 15.51 14.82 7.81 15.34
## ENSG00000114374 14.85 8.57 14.93 14.80 14.83 8.60 14.85 15.30 7.81 15.30
## ENSG00000157654 14.28 7.81 7.81 14.39 14.87 7.81 15.51 7.81 7.81 7.81
## CLL11 CLL12 CLL13 CLL14 CLL15 CLL16 CLL17 CLL18 CLL19 CLL20
## ENSG00000067048 17.65 17.09 16.65 7.81 15.71 8.80 16.56 17.26 15.89 15.72
## ENSG00000012817 15.31 16.25 15.78 8.45 15.67 8.77 16.56 16.58 16.31 15.50
## ENSG00000125740 18.13 17.16 17.35 9.37 9.89 13.36 10.58 12.11 9.34 9.45
## ENSG00000129824 16.04 15.15 16.01 8.38 15.36 8.37 15.43 16.27 15.53 15.17
## ENSG00000114374 14.83 15.43 14.95 8.21 14.84 8.33 15.54 14.74 15.10 14.94
## ENSG00000157654 7.81 7.81 7.81 7.81 7.81 7.81 7.81 7.81 7.81 15.07
## CLL21 CLL22 CLL23 CLL24 CLL25 CLL26 CLL27 CLL28 CLL29 CLL30
## ENSG00000067048 17.43 11.03 17.05 16.20 16.53 17.34 16.08 16.18 16.08 7.81
## ENSG00000012817 15.85 9.61 16.15 15.75 15.97 15.64 15.91 16.52 15.80 7.81
## ENSG00000125740 17.61 15.08 13.36 9.22 16.25 14.02 10.16 9.27 9.75 9.12
## ENSG00000129824 15.14 9.50 14.88 15.57 15.26 14.26 15.36 15.54 14.78 7.81
## ENSG00000114374 15.60 9.31 15.41 14.94 14.92 15.35 15.23 15.47 14.99 7.81
## ENSG00000157654 12.23 7.81 7.81 7.81 7.81 13.80 7.81 7.81 15.26 7.81
## CLL31 CLL32 CLL33 CLL34 CLL35 CLL36 CLL37 CLL38 CLL39 CLL40
## ENSG00000067048 8.60 8.99 15.86 11.74 16.12 17.43 16.01 16.54 8.59 15.99
## ENSG00000012817 8.26 7.81 16.24 9.99 15.97 15.94 16.64 16.31 8.69 15.83
## ENSG00000125740 9.55 17.96 9.46 17.68 11.32 18.05 12.61 9.16 12.70 9.40
## ENSG00000129824 7.97 8.61 15.24 9.85 15.31 15.09 15.35 15.87 8.56 15.68
## ENSG00000114374 8.14 8.51 15.03 9.44 15.27 15.06 15.17 15.19 8.39 14.94
## ENSG00000157654 14.46 7.81 14.62 7.81 15.19 7.81 7.81 7.81 7.81 7.81
## CLL41 CLL42 CLL43 CLL44 CLL45 CLL46 CLL47 CLL48 CLL49 CLL50
## ENSG00000067048 15.08 8.74 16.14 8.09 16.16 8.31 16.34 15.85 8.08 16.36
## ENSG00000012817 15.02 8.41 15.56 7.81 15.97 8.28 15.94 15.75 7.81 16.23
## ENSG00000125740 11.25 17.13 16.47 11.28 16.57 10.99 12.70 8.89 18.44 9.47
## ENSG00000129824 15.24 7.81 15.61 7.81 15.40 8.17 15.17 15.45 7.81 15.72
## ENSG00000114374 14.15 8.60 14.82 7.81 14.94 8.05 15.24 14.81 7.81 15.14
## ENSG00000157654 15.08 7.81 13.60 7.81 7.81 13.90 7.81 13.91 7.81 7.81
The expression matrix uses Ensembl Gene ID as row names, and the values are variance stabilized counts by using DESeq2.
The CLL-PD values for the samples in the expression matrix can be simply estimated by the function CLLPDestimate.
estimateResult <- CLLPDestimate(exprMatrix = exprMat, identifier = "ensembl_gene_id",
topVariant = 5000, normalize = TRUE, repeats =20)
## Preparing model.
## Running repeated cross-validation to select features.
## Done.
The parameter identifier can be “ensembl_gene_id” or “gene_symbol”, depends on the type of gene identifiers used for the expression matrix.
Sometimes, filtering out the genes with very low variance can increase the model performance. Users can either filter out the gene with low variance by themselves or specifying a number to the topVariant parameter. The a number is specified, the rows of the expression matrix will be firstly order by their variance, decreasing and only the top n rows will be used.
The parameter normazlied specifies whether each row-wise z-score for the input matrix should be used. The normalization is generally recommended, unless the input matrix is already a z-score matrix.
repeats specifies the number of repetitions for the cross-validation. Higher number will generally lead to better stability but takes longer time. Normally a number between 20 to 100 should be adequate.
The CLLPDestimate function uses the same process described in method part of the paper, “Multi-omics data integration identifies mTOR-MYC-OXPHOS as a driver of aggressive chronic lymphocytic leukemia” by Lu et al.
Briefly, CLLPDestimate will firstly subset the built-in expression matrix to a reduced one that contains the gene/probes presented in both the built-in and user-specified matrix. Then a repeated cross-validation on LASSO linear regression models will be performed to select a best linear model to predict CLL-PD in the built-in cohort using reduced expression matrix. Finally, the selected model will be apply to the user-specified expression matrix and predict CLL-PD for the samples in user-specified data.
The output of CLLPDestimate is a list object contain three elements.
estimated_CLLPD is a numeric vector that contains the estimated CLL-PD values for the samples in the user-specified expression matrix
estimateResult$estimated_CLLPD
## CLL1 CLL2 CLL3 CLL4 CLL5 CLL6
## -1.11275846 0.49445746 0.27993290 0.46206060 0.91827085 0.35717329
## CLL7 CLL8 CLL9 CLL10 CLL11 CLL12
## 0.84197487 -1.49763544 0.68789266 -0.14921374 0.89116942 -0.99260430
## CLL13 CLL14 CLL15 CLL16 CLL17 CLL18
## -0.67288180 -0.43644256 0.76418465 0.32113403 -0.61010454 -0.08169747
## CLL19 CLL20 CLL21 CLL22 CLL23 CLL24
## -0.35581581 0.73826306 -0.99403700 0.64542647 -0.13791426 1.16847419
## CLL25 CLL26 CLL27 CLL28 CLL29 CLL30
## -0.22265733 -0.11725065 0.14242524 0.40714166 0.01639723 -0.31305876
## CLL31 CLL32 CLL33 CLL34 CLL35 CLL36
## -0.12477571 -0.57128203 -1.05719401 -0.30242835 -0.05853350 -0.96496883
## CLL37 CLL38 CLL39 CLL40 CLL41 CLL42
## -0.44297343 -0.16093513 -0.21049599 0.54082550 0.80756548 -0.38374740
## CLL43 CLL44 CLL45 CLL46 CLL47 CLL48
## 1.24205477 -0.48091045 0.50723385 -0.26793221 -0.10127222 0.74941138
## CLL49 CLL50
## -1.19545672 -0.36503360
featureCoefficient is a table that contains the features with non-zero coefficients used by the selected model to estimate CLL-PD
head(estimateResult$featureCoefficient)
## coefficient
## ENSG00000198046 -0.006585768
## ENSG00000198142 -0.011415088
## ENSG00000196436 -0.027711229
## ENSG00000145908 -0.009570052
## ENSG00000101493 -0.014926453
## ENSG00000171105 -0.018273277
featureCoefficient is numeric vector that contains the variance explained values (R2 values) of CLL-PD from repeated cross-validations in the built-in cohort. Because the expression data can be generate by using different platforms with different number of genes measured. This value can be a quality check to see whether the genes provided in the user-specified matrix are good enough to re-capture CLL-PD in the original built-in cohort.
estimateResult$trainingR2
## [1] 0.9934208 0.9938126 0.9887950 0.9916307 0.9930227 0.9866115 0.9841956
## [8] 0.9841956 0.9873871 0.9925591 0.9894107 0.9911594 0.9858620 0.9841956
## [15] 0.9873871 0.9841956 0.9873871 0.9736088 0.9873871 0.9916307
As we mentioned in our paper, we caution that our current operationalization of the CLL-PD either from the MOFA analysis or from gene expression data CLLPDestimate is unlikely to be optimal. Due to the fact that the platforms the user may use can be very different to the RNAseq platform we used for the training cohort. Furthermore, the estimated CLL-PD values in the user-specified cohort are relative numbers that indicates the relative aggressiveness of the CLL samples within the user-specified cohort. Therefore, we consider the current state of this work as providing a proof of concept that will allow further refinement into a robustly measurable biomarker.