


Tutorial for the randomGLM R package: Interpretation of the RGLM predictor

Lin Song, Steve Horvath

In this tutorial, we show how to select important features from RGLM and how to interpret the ensemble predictor. We use the small, round blue cell tumors (srbct) data set [1,2] as the example training data set. It consists of gene expression profiles of 2308 genes across 63 observations. The data can be found on our webpage. No test set is needed.

1. Data preparation

# load required package
library(randomGLM)

# download the data from the webpage and load it.
# Importantly, change the path to the data and use forward slashes /
setwd("C:/Users/Horvath/Documents/CreateWebpage/RGLM/Tutorials/")
load("srbct.rda")

# check data
dim(srbct$x)
table(srbct$y)
#  1  2 
# 40 23 

x = srbct$x
y = srbct$y

# number of features
N = ncol(x)

# define the function misclassification.rate for accuracy calculations
if (exists("misclassification.rate")) rm(misclassification.rate)
misclassification.rate = function(tab) {
  num1 = sum(diag(tab))
  denom1 = sum(tab)
  signif(1 - num1/denom1, 3)
}

2. RGLM prediction

First, we carry out RGLM prediction with default parameter settings. The prediction accuracy is 0.984: 1 out of the 63 observations is misclassified.

RGLM = randomGLM(x, y, classify=TRUE, keepModels=TRUE)
tab1 = table(y, RGLM$predictedOOB)
tab1
#     1  2
#  1 40  0
#  2  1 22

# accuracy
1 - misclassification.rate(tab1)
# [1] 0.9841

3. Feature selection

We define the variable importance measure of a feature as the number of times it is selected by forward regression across all bags (here nBags=100). In this application, a total of 83 features were used for prediction. Among them, the features that are selected repeatedly (large varImp values) are the most important. Here we keep the top 10 most important features, namely those selected at least 5 times in forward regression across the 100 bags. These 10 features form the basis of the RGLM interpretation. Note that users can decide how many of the most important features to keep according to their needs.

# variable importance measure
varImp = RGLM$timesSelectedByForwardRegression
sum(varImp > 0)
# 83

table(varImp)
# varImp
#    0    1    2    3    4    5    6    7    8    9   10   14   15   17 
# 2225   52   12    6    3    1    2    1    1    1    1    1    1    1 

# select the most important features
impF = colnames(x)[varImp >= 5]
impF
#  [1] "G246"  "G545"  "G566"  "G1074" "G1319" "G1327" "G1389" "G1954" "G2050"
# [10] "G2117"
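Since varImp is indexed by the columns of x (as the indexing above implies), the selection counts of the top features can also be visualized directly. The following is a minimal base-R sketch; the axis label and title are illustrative choices, not part of the package:

# attach feature names to the importance measure
names(varImp) = colnames(x)

# bar plot of the selection counts of the most important features,
# ordered from most to least frequently selected
barplot(sort(varImp[impF], decreasing = TRUE),
        las = 2,
        ylab = "times selected by forward regression",
        main = "Most important RGLM features")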
4. RGLM interpretation

We build a single GLM that explains the outcome using only the 10 most important features. G566 and G1327 are negatively associated with the outcome, while the other features are positively associated with it.

# build a single GLM model with the most important features
model1 = glm(y ~ ., data = as.data.frame(x[, impF]), family = binomial(link='logit'))
model1
# Coefficients:
# (Intercept)         G246         G545         G566        G1074        G1319  
#    -29.2645       7.1445       5.0429      -6.9307       3.1406       0.7925  
#       G1327        G1389        G1954        G2050        G2117  
#     -2.1011       4.9900       9.3048       1.3402       3.8649  
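To attach a rough measure of uncertainty to these coefficients, one can inspect the standard errors and Wald tests reported by summary(), and view the coefficients on the odds-ratio scale. A short base-R sketch follows; note that with only 63 observations and nearly separated classes, glm may warn that fitted probabilities of 0 or 1 occurred, so these within-sample p-values should be read with caution:

# standard errors and Wald tests for the single-model coefficients
summary(model1)$coefficients

# coefficients on the odds-ratio scale
exp(coef(model1))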
5. Compare single model prediction with original RGLM prediction

In this section, we examine how well the predictions of the above single model, which uses only the top 10 most important features, correspond to the original RGLM predictions. In other words, how well does a single model pick up the signal of the RGLM ensemble? To ensure a fair comparison, we use the unbiased out-of-bag (OOB) predictions of the original RGLM. For the single model, we should not use model1 directly to make predictions for the srbct data, because the features in model1 were selected from this same data set, which would bias the predictions. Instead, we use leave-one-out (LOO) prediction.

# compare the performance of the single model based on the most important
# features with that of the original RGLM

# out-of-bag predictive probabilities from RGLM
predRGLM = RGLM$predictedOOB.response[, 2]

# function that calculates the leave-one-out predictions of a single model
LOOlogistic = function(y, x, impF) {
  nLoops = length(y)
  predLOO = rep(NA, nLoops)
  for (ind in 1:nLoops) {
    model = glm(y[-ind] ~ ., data = as.data.frame(x[-ind, impF]),
                family = binomial(link='logit'))
    predLOO[ind] = predict(model, newdata = as.data.frame(x[ind, impF, drop=FALSE]),
                           type = "response")
    rm(model)
  }
  predLOO
}

# leave-one-out predictive probabilities of the single model
predLOO = LOOlogistic(y, x, impF)

# leave-one-out prediction accuracy of the single model
1 - misclassification.rate(table(y, round(predLOO)))
# [1] 0.9841
# The single model LOO prediction achieves the same accuracy as RGLM,
# and it misclassifies the same observation as RGLM does.

# plot; adjust the output file path as needed
library(WGCNA)
pdf("interpret.pdf")
verboseScatterplot(predLOO, predRGLM,
  xlab = paste("LOO predictive prob of a single model with", length(impF),
               "most important features"),
  ylab = "RGLM OOB predictive prob",
  cex.lab = 1.2, cex.axis = 1.2)
abline(lm(predRGLM ~ predLOO), lwd = 2)
dev.off()

The resulting figure plots, for each observation, the LOO predictive probability of outcome "2" from the single model with the 10 most important features (x-axis) against the RGLM OOB predictive probability (y-axis). The single model makes very similar predictions to the original RGLM (cor = 0.97, p-value = 3.6e-39). Therefore, in this application, a single model built after RGLM feature selection achieves good prediction accuracy and is easy and straightforward to interpret.
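The correlation statistics displayed in the figure can also be computed directly from predLOO and predRGLM; a minimal base-R sketch:

# Pearson correlation between the single-model LOO and RGLM OOB
# predictive probabilities, with a t-based p-value
cor.test(predLOO, predRGLM)

# agreement of the binary class calls at the 0.5 threshold
table(round(predLOO), round(predRGLM))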

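As a usage note, verboseScatterplot from the WGCNA package annotates the plot title with the Pearson correlation and a Student-t-based p-value, so the cor.test call above reproduces essentially the same statistics without opening a graphics device.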
6. RGLM model coefficients

Users may also want to look at the RGLM model coefficients and follow up on features with large coefficients on average. This can be done as follows.

# check the coefficients of the GLM model in bag 1
coef(RGLM$models[[1]])
# (Intercept)       G1954 
#   -61.53254   158.88250 

# create a matrix of feature coefficients across bags
nBags = length(RGLM$featuresInForwardRegression)
coefMat = matrix(0, nBags, RGLM$nFeatures)
for (i in 1:nBags) {
  coefMat[i, RGLM$featuresInForwardRegression[[i]]] = RGLM$coefOfForwardRegression[[i]]
}

# mean coefficient of each feature across bags
coefMean = apply(coefMat, 2, mean)
names(coefMean) = colnames(x)
summary(coefMean)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
# -44.67000   0.00000   0.00000   0.07888   0.00000  31.27000 

coefMean[impF]
#       G246       G545       G566      G1074      G1319      G1327      G1389 
#   8.122109   7.947393  21.349161   4.599621  18.207984  31.269950   7.282620 
#      G1954      G2050      G2117 
#   7.522164 -14.200134  24.644747 
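Building on coefMat, one can also rank all features by the magnitude of their mean coefficient, or count in how many bags each important feature receives a nonzero coefficient. A short sketch, assuming (as in the loop above) that the columns of coefMat follow the column order of x:

# features with the largest mean absolute coefficients across bags
head(sort(abs(coefMean), decreasing = TRUE), 10)

# number of bags in which each important feature has a nonzero coefficient;
# this should roughly track the varImp selection counts from Section 3
nonzeroCounts = colSums(coefMat != 0)
names(nonzeroCounts) = colnames(x)
nonzeroCounts[impF]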

References

1. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 2001, 7(6):673-679.

2. Song L, Langfelder P, Horvath S: Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics 2013, 14:5. PMID: 23323760, DOI: 10.1186/1471-2105-14-5.