Tutorial for the randomGLM R package: prediction of gene traits

Lin Song, Steve Horvath
University of California, Los Angeles

1. Binary outcome prediction

1.1 Data preparation and RGLM prediction

Here we use an empirical gene expression data set from the package to explain how RGLM [1] works. We start by loading the required libraries and preparing the data. The brain cancer data set contains a training set (55 samples across 5000 gene features) and a test set (65 samples across the same 5000 features), but no outcomes. We therefore use individual genes as outcomes instead.

# load required library
library(randomGLM)

# load data
data(brainCancer)

# check data
dim(brainCancer$train)
dim(brainCancer$test)
N = ncol(brainCancer$train)

# sample a quantitative gene trait from all N genes (features), and exclude it
# from the feature space
set.seed(1)
traitIndx = sample(1:N, 1)
y = brainCancer$train[, traitIndx]
ytest = brainCancer$test[, traitIndx]
x = brainCancer$train[, -traitIndx]
xtest = brainCancer$test[, -traitIndx]

# since some of the column names are not unique, we change them
colnames(x) = make.names(colnames(x), unique = TRUE)
colnames(xtest) = make.names(colnames(xtest), unique = TRUE)

# generate binary outcomes by dichotomizing the continuous outcomes at the
# median; binary outcomes can be factors or numbers
y = as.factor(ifelse(y > median(y), 1, 0))
ytest = as.factor(ifelse(ytest > median(ytest), 1, 0))

Above, we randomly select one of the 5000 genes as the outcome. In practice, users can change the random seed to generate different gene traits and thus different data sets to experiment with. We now have training data x, training outcome y, test data xtest and test outcome ytest. Prediction is done as follows. Make sure to set classify=TRUE for binary outcome prediction. This example should take less than 2 minutes.

# RGLM prediction using default parameter choices. To learn how to change
# parameter values, see the tutorial RGLMparameterTuningTutorial.docx.
# nThreads=1 assumes that the computer has only 1 core.
RGLM = randomGLM(x, y, xtest, classify = TRUE, keepModels = TRUE, nThreads = 1)

# the following alternative could be faster if your machine has 6 or more cores
# RGLM = randomGLM(x, y, xtest, classify = TRUE, keepModels = TRUE, nThreads = 6)

Comment: here we ignore the warning messages. The warnings indicate that there were problems with fitting some of the GLMs.

1.2 Predictor outputs

# check out-of-bag prediction in the training data
predictedOOB = RGLM$predictedOOB
table(predictedOOB, y)
#             y
# predictedOOB  0  1
#            0 24  6
#            1  4 21

# check test set prediction
predictedTest = RGLM$predictedTest
table(predictedTest, ytest)
#              ytest
# predictedTest  0  1
#             0 28  3
#             1  5 29

# the test set prediction can also be obtained as follows; this only works
# when keepModels=TRUE was set in the randomGLM call
predictedTest = predict(RGLM, newdata = xtest, type = "class")

Accuracy can be calculated by hand or with the accuracyMeasures function in the WGCNA R package. Here, the test set prediction accuracy is 0.88, the sensitivity is 0.91 and the specificity is 0.85.
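To see where these numbers come from, here is a minimal by-hand sketch (our own addition, not part of the original tutorial code) that derives accuracy, sensitivity and specificity directly from the confusion table; it assumes that class 1 is treated as the positive class, which is our labeling choice.

# minimal by-hand sketch (our addition): derive the accuracy measures from the
# confusion table; we treat class "1" as the positive class
confusion = table(predictedTest, ytest)
TP = confusion["1", "1"]   # true positives
TN = confusion["0", "0"]   # true negatives
FP = confusion["1", "0"]   # false positives
FN = confusion["0", "1"]   # false negatives
c(accuracy = (TP + TN) / sum(confusion),
  sensitivity = TP / (TP + FN),
  specificity = TN / (TN + FP))
#    accuracy sensitivity specificity
#   0.8769231   0.9062500   0.8484848

These values agree with the accuracyMeasures output below.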
# test set accuracy measures
library(WGCNA)
# Note: some computer systems support multi-threading, but it is not enabled
# within WGCNA in R. To allow multi-threading within WGCNA with all available
# cores, use allowWGCNAThreads(). If you get an error message, please ignore it.
out = accuracyMeasures(table(predictedTest == 0, ytest == 0))
out
#                    Measure     Value
# 1               Error.Rate 0.1230769
# 2                 Accuracy 0.8769231
# 3              Specificity 0.8484848
# 4              Sensitivity 0.9062500
# 5  NegativePredictiveValue 0.9032258
# 6  PositivePredictiveValue 0.8529412
# 7        FalsePositiveRate 0.1515152
# 8        FalseNegativeRate 0.0937500
# 9                    Power 0.9062500
# 10 LikelihoodRatioPositive 5.9812500
# 11 LikelihoodRatioNegative 0.1104911
# 12          NaiveErrorRate 0.4923077

Users can extract the GLMs constructed in each bag. For example, the following code shows that the 119th, 973rd and 2974th features were selected into the model of bag 1, and that this model is log(p/(1-p)) = 7979.74 + 2.00*Feature119 + 5.22*Feature973 - 36.52*Feature2974, where p denotes the predicted probability that the outcome equals 1.

# indices of features selected into the model of bag 1
RGLM$featuresInForwardRegression[[1]]
#           X200660_at X202291_s_at X212145_at
# Feature.1        119          973       2974

# model coefficients of bag 1
coef(RGLM$models[[1]])
#  (Intercept)   X200660_at X202291_s_at   X212145_at
#  7979.738216     2.002009     5.220940   -36.515803

We recommend the following variable importance measure for RGLM: the number of times each feature is selected into a GLM across the bags. In this example we use the default number of bags, 100. Most genes (4830 out of 4999) were not used for prediction in any bag. One gene was selected 10 times across the 100 bags, and 3 genes were each selected 9 times. These genes have the largest variable importance and contribute most to outcome prediction.

# check variable importance
varImp = RGLM$timesSelectedByForwardRegression
table(varImp)
# varImp
#    0    1    2    3    4    5    6    7    9   10
# 4830  108   37    8    4    6    1    1    3    1

# number of features used for prediction
sum(varImp > 0)
# [1] 169

# create a data frame that reports the variable importance measure of each feature
datvarImp = data.frame(
  Feature = as.character(dimnames(RGLM$timesSelectedByForwardRegression)[[2]]),
  timesSelectedByForwardRegression = as.numeric(RGLM$timesSelectedByForwardRegression))

# report the 20 most important features
datvarImp[rank(-datvarImp[, 2], ties.method = "first") <= 20, ]

# barplot of the importance measures for the 20 most important features
datVarImpSelected = datvarImp[rank(-datvarImp[, 2], ties.method = "first") <= 20, ]
par(mfrow = c(1, 1), mar = c(4, 8, 3, 1))
barplot(datVarImpSelected[, 2], horiz = TRUE,
        names.arg = datVarImpSelected[, 1],
        xlab = "Feature Importance", las = 2,
        main = "Most significant features for the RGLM",
        cex.axis = 1, cex.main = 1.5, cex.lab = 1.5)
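Before thinning the predictor (section 1.3 below), it can help to see how many features would survive different thinning thresholds. The following exploratory sketch is our own addition, not part of the package; it uses only the varImp object defined above.

# exploratory sketch (our addition): how many features would be retained for
# each candidate thinning threshold, i.e. how many features were selected more
# than 'th' times across the 100 bags
thresholds = 0:5
data.frame(threshold = thresholds,
           featuresRetained = sapply(thresholds, function(th) sum(varImp > th)))
# consistent with the table of selection counts above: threshold 0 retains
# 169 features and threshold 1 retains 61, the value used in section 1.3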
1.3 RGLM thinning

So far, RGLM achieved a test set accuracy of 0.88 using 169 features. Here we want to reduce the number of features used for prediction while retaining good predictive accuracy. RGLM thinning does exactly this: it removes features with small variable importance and keeps only the important features for prediction. As shown below, if we set the thinning threshold to 1, only the 61 features with variable importance > 1 are kept, but the prediction accuracy remains almost the same (0.86).

threshold = 1
# thinned RGLM
thinRGLM = thinRandomGLM(RGLM, threshold)

# number of features remaining in the predictor
thinN = sum(thinRGLM$timesSelectedByForwardRegression > 0)
thinN
# [1] 61

# test set prediction after thinning
predicted = predict(thinRGLM, xtest, type = "class")
accuracyMeasures(table(predicted, ytest))
#                    Measure     Value
# 1               Error.Rate 0.1384615
# 2                 Accuracy 0.8615385
# 3              Specificity 0.9062500
# 4              Sensitivity 0.8181818
# 5  NegativePredictiveValue 0.8285714
# 6  PositivePredictiveValue 0.9000000
# 7        FalsePositiveRate 0.0937500
# 8        FalseNegativeRate 0.1818182
# 9                    Power 0.8181818
# 10 LikelihoodRatioPositive 8.7272727
# 11 LikelihoodRatioNegative 0.2006270
# 12          NaiveErrorRate 0.4923077

2. Continuous outcome prediction

We use the same brain cancer data set for illustration. Note that the gene trait is not dichotomized this time, and that we set classify=FALSE in the randomGLM function.

# start a new R session

# load required library
library(randomGLM)

# load data
data(brainCancer)

# check data
dim(brainCancer$train)
dim(brainCancer$test)
N = ncol(brainCancer$train)

# sample a quantitative gene trait from all N genes (features) and exclude it
# from the feature space
set.seed(1)
traitIndx = sample(1:N, 1)
y = brainCancer$train[, traitIndx]
ytest = brainCancer$test[, traitIndx]
x = brainCancer$train[, -traitIndx]
xtest = brainCancer$test[, -traitIndx]

# RGLM prediction with 1 thread.
# Consider choosing nThreads=NULL or a larger integer to speed it up.
RGLM = randomGLM(x, y, xtest, classify = FALSE, keepModels = TRUE, nThreads = 1)

# define the out-of-bag prediction
predictedOOB = RGLM$predictedOOB.response

# What is the OOB estimate of the accuracy? Recall that for a continuous trait,
# the accuracy is defined as the correlation between the OOB prediction and the
# true, observed training set outcome.
as.numeric(cor(predictedOOB, y, use = "p"))
# [1] 0.8712458

# define the test set prediction
predictedTest = RGLM$predictedTest.response

# What is the test set estimate of the accuracy? Recall that for a continuous
# trait, the accuracy is defined as the correlation between the test set
# prediction and the true, observed test set outcome.
as.numeric(cor(predictedTest, ytest, use = "p"))
# [1] 0.870373

Both the OOB estimate and the test set estimate of the prediction accuracy show that the RGLM is quite accurate (both around 0.87).
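Beyond the single correlation number, a scatterplot of predicted versus observed values gives a quick visual check of the continuous prediction. The following plot is our own illustration, not part of the original tutorial; it uses only base R graphics and the objects defined above.

# our own illustration: scatterplot of the test set prediction against the
# observed test set outcome; points near the dashed 45-degree line correspond
# to accurate predictions
plot(ytest, predictedTest,
     xlab = "Observed gene trait (test set)",
     ylab = "RGLM prediction (test set)",
     main = paste0("Test set correlation = ",
                   signif(cor(predictedTest, ytest, use = "p"), 2)))
abline(0, 1, lty = 2)   # reference line: prediction equals observation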
References

1. Song L, Langfelder P, Horvath S (2013) Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics 14:5. PMID: 23323760. DOI: 10.1186/1471-2105-14-5.