WordPress.com



#REGRESSION SCRIPT ----#From:#A practical guide to machine learning for neurosurgeons #Acta Neurochirurgica#Authors: Victor E. Staartjes (Zurich), Julius M. Kernbach (Aachen)#Contact for questions or clarification:#victoregon.staartjes@usz.ch#January 2020#Executed on: R Version 3.6.2 "Dark and Stormy Night" (64 bit)#OS = Windows 10; CPU = 16 physical cores @ 3.40 GHz; GPU = NVIDIA 980 Ti; RAM = 64 GB#Total Runtime (4 CPU Cores): 7 minutes; 26 seconds#1 SETUP & PRE-PROCESSING DATA ----#1.1 Load packages ----if (!require("pacman")) install.packages("pacman")pacman::p_load("readxl", "caret", "Hmisc", "doParallel", "mlbench", "rlang", "dplyr", "VIM", "randomForest", "devtools", "gam", "elasticnet") #These packages are required to execute this code. If unavailable and you have an internet connection, they will be installed automatically.#If you desire to update R to the latest version, delete the hashtag (#) in the next line of code, and run the code. You will then be guided through the update process.#pacman::p_load("installr"); updateR(fast = T, GUI = T, quit_R = T, install_R = T, keep_old_packages = T, copy_packages = T, start_new_R = T, silent = T, print_R_versions = T)#1.2 Import data from .xlsx ----#In Excel, make sure that your dependent (Target/Endpoint) variable is at the very last column of your Excel database#Also, convert any categorical variables into numerical form, i.e. 1/0 or 1/2/3/4/etc. instead of strings/text#Both are the case in the Glioblastoma database#Import the dataset from the downloaded Excel using the RStudio GUI, OR runt this code: df <- as.data.frame(read_excel("GlioblastomaData.xlsx")) #Replace with the path to the downloaded Glioblastoma dataset!!!#1.3 Check data format ----#Continous variables should be "numeric" in R, and categorical ones should be "factor" in R str(df) #For the Glioblastoma dataset, we see that we have 22 variables, but that IDH, MGMT, Male, TERTp, Midline, Comorbidity, Epilepsy, PriorSurgery, Married, ActiveWorker, Chemotherapy, HigherEducation, and the binary outcome TwelveMonths are still handled as "numeric".#1.4 Reformat the categorical variables to "factor" ----cols <- c("IDH", "MGMT", "Male", "TERTp", "Midline", "Comorbidity", "Epilepsy", "PriorSurgery", "Married", "ActiveWorker", "Chemotherapy", "HigherEducation", "TwelveMonths")df[cols] <- lapply(df[cols], factor)str(df) #We now see that all categorical variables are correctly labeled as "factor", and R will handle them correctly now#1.5 Remove unncessary columns ---- #In this Glioblastoma survival regression example, the column "TwelveMonths" is a binary version of our endpoint of interest - Survival (Survival in months)) - and would thus be highly correlated with our endpoint. We must therefore delete it and any other highly multicollinear variables.df <- df[,-22] #TwelveMonths was in the 22nd column, so we remove itstr(df) #And check the data again - We still have 10'000 patients, but now 21 instead of 22 variables.#1.6 OPTIONAL Enable multicore processing ----#If you have a device with multiple CPU cores, you can enable multicore parallel processing by entering the number of cores to be allocated. For example, on our test PC, we have 16 physical cores with 32 virtual cores, so we could choose to allocate 28 to model training. Here we choose 4 as a common number of cores.#If you want to enable multicore processing, delete the two hashtags below!cl <- makePSOCKcluster(4) #Enter number of cores to be allocatedregisterDoParallel(cl)#1.7 Split training and testing set. Here, we choose a 80%/20% random split ----set.seed(123) #Setting seeds allows reproducibility dt = sort(sample(nrow(df), nrow(df)*.80)) #Allocate 80% split train <-df[dt,] #Split test <-df[-dt,]train <- train[sample(nrow(train)),] #Shuffletest <- test[sample(nrow(test)),]#Check approximate equal distribution of the endpoint - Here we see that survival was virtually equal in training and testing setshist(train$Survival)hist(test$Survival)#1.8 If missing data is present, impute missing data using a KNN Imputer (K = 5) ----#Choose a KNN imputer because later on, a KNN Imputer will be co-trained with the prediction model to impute missing data for future predictions#Again, make sure that your dependent (Target/Endpoint) variable is at the very last column of your Excel databasetrainx <- train[,1:(ncol(train)-1)]Survival <- train[,ncol(train)]trainx <- VIM::kNN(trainx, k = 5, imp_var = F) #We see that the Glioblastoma database contains no missing data - thus, no imputation was performedtrain <- cbind(trainx, Survival) #1.9 Variable Selection using Recursive Feature Elimination ----set.seed(123) #Set a seedrfecontrol <- rfeControl(functions = caretFuncs, method = "boot", number = 25, rerank = F, verbose = T, allowParallel = T) #RFE using generalized linear models (GLM), in 25-fold bootstrapset.seed(123)RFE <- rfe(Survival ~ ., data = train, sizes=c(10:ncol(train)-1), rfeControl=rfecontrol, method = "glm", trControl = trainControl(method = "boot"))print(RFE) #Resultsplot(RFE, type=c("g", "o")) #Plot performance over #varspredictors(RFE) #Variables to be kept#Apparently, this combination of 16 selected variables explains the highest amoung of variance in our endpoint. #Thus, we should keep only these 16 independent variables, and our dependent variable, and remove the 4 "garbage"/"low-impact"/"multicollinear" independent variables that were not selected by RFE#Keep only the RFE-selected 16 variables & our endpoint "TwelveMonths" keepvars <- c("Size", "Comorbidity", "Age", "Chemotherapy", "TERTp", "Caseload", "Midline", "KPS", "MGMT", "IDH", "RadiotherapyDose", "PriorSurgery", "Male", "Epilepsy", "Income", "Height", "Survival")train <- train[keepvars]#1.10 Get a final overview ----summary(train)summary(test)#2 MODEL TRAINING ----#2.1 Set Up the Training and Resampling Structure ----#As a standard, we choose 5-fold cross validation as the resampling method. ctrl <- trainControl(method = "cv", number = 5, allowParallel = T, returnResamp = "all")#2.2.1 Train Model #1: (Generalized) Linear Model (LM) ----set.seed(123)lmfit <- caret::train(Survival ~ ., data = train, method = "glm", trControl = ctrl, tuneLength = 25, metric = "RMSE", preProcess = "knnImpute", na.action = na.pass)#We automatically tune our models based on RMSE, and co-train a KNN imputer that then automatically imputes any potential missing data for future samples.LM <- as.data.frame(cbind(min(lmfit$results$RMSE), min(lmfit$results$MAE), max(lmfit$results$Rsquared)))colnames(LM) <- c("RMSE", "MAE", "R2")#Summaryprint(lmfit)print(LM)#Save Model Filesave(lmfit, file= "LM.Rdata")#2.2.2 Train Model #2: Generalized Additive Model (GAM) ----set.seed(123)gamfit <- caret::train(Survival ~ ., data = train, method = "gamLoess", trControl = ctrl, tuneLength = 25, metric = "RMSE", preProcess = "knnImpute", na.action = na.pass)GAM <- as.data.frame(cbind(min(gamfit$results$RMSE), min(gamfit$results$MAE), max(gamfit$results$Rsquared)))colnames(GAM) <- c("RMSE", "MAE", "R2")#Summaryprint(gamfit)print(GAM)#Save Model Filesave(gamfit, file= "GAM.Rdata")#2.2.3 Train Model #3: Lasso regression ----set.seed(123)lassofit <- caret::train(Survival ~ ., data = train, method = "lasso", trControl = ctrl, tuneLength = 25, metric = "RMSE", preProcess = "knnImpute", na.action = na.pass)LASSO <- as.data.frame(cbind(min(lassofit$results$RMSE), min(lassofit$results$MAE), max(lassofit$results$Rsquared)))colnames(LASSO) <- c("RMSE", "MAE", "R2")#Summaryprint(lassofit)print(LASSO)#Save Model Filesave(lassofit, file= "LASSO.Rdata")#2.2.4 Train Model #4: Ridge regression ----set.seed(123)ridgefit <- caret::train(Survival ~ ., data = train, method = "ridge", trControl = ctrl, tuneLength = 25, metric = "RMSE", preProcess = "knnImpute", na.action = na.pass)RIDGE <- as.data.frame(cbind(min(ridgefit$results$RMSE), min(ridgefit$results$MAE), max(ridgefit$results$Rsquared)))colnames(RIDGE) <- c("RMSE", "MAE", "R2")#Summaryprint(ridgefit)print(RIDGE)#Save Model Filesave(ridgefit, file= "RIDGE.Rdata")#2.2.5 Train Model #5: Random Forest (RF) ----set.seed(123)rffit <- caret::train(Survival ~ ., data = train, method = "rf", trControl = ctrl, tuneLength = 5, metric = "RMSE", preProcess = "knnImpute", na.action = na.pass)RF <- as.data.frame(cbind(min(rffit$results$RMSE), min(rffit$results$MAE), max(rffit$results$Rsquared)))colnames(RF) <- c("RMSE", "MAE", "R2")#Summaryprint(rffit)print(RF)#Save Model Filesave(rffit, file= "RF.Rdata")#3 MODEL EVALUATION AND SELECTION ----#3.1 Evaluate the models in comparison ----#Print Results Summaryresults <- as.data.frame(rbind(LM, GAM, LASSO, RIDGE, RF))rownames(results) <- c("Linear Model", "GAM", "LASSO", "RIDGE", "Random Forest")print(results)#Plot Resultsdev.new(width = 12, height = 7, noRStudioGD = T);par(mfrow=c(1,2)) ; barplot(as.matrix(results[,1:2]), cex.axis = 1, cex.names = 0.7, beside = T, col = c("darkgray", "black", "darkorange", "darkblue", "darkgreen")) ; abline(a = 0, b = 0) ; legend("topright", legend = rownames(results), col = c("darkgray", "black", "darkorange", "darkblue", "darkgreen"), pch = c(19), bty = "n", pt.cex = 0.7, cex = 0.7, text.col = "black", horiz = F); barplot(abs(as.matrix(results[,3])), xlab = "R2", beside = T, col = c("darkgray", "black", "darkorange", "darkblue", "darkgreen"), ylim = c(0,1)) ; abline(a = 1, b = 0) ;legend("top", legend = rownames(results), col = c("darkgray", "black", "darkorange", "darkblue", "darkgreen"), pch = c(19), bty = "n", pt.cex = 0.7, cex = 0.7, text.col = "black", horiz = F)dev.off()#Select a suitable model based on these performance measures derived from resampling during training#In this example, we might choose the GLM, GAM, or Ridge regression models - All have similarly low RMSE and MAE, and high R2. For this example, we chose the ridge regression model.#3.2 Select the model ----finalmodel <- ridgefit #Here we chose the Ridge regression model finalmodelstats <- RIDGE #And its training evaluation#3.3 Internally validate the selected final model on the hitherto unseen test set ----pred <- predict(finalmodel, test)RMSE <- RMSE(pred, test$Survival)MAE <- MAE(pred, test$Survival)R2 <- R2(pred, test$Survival)Final <- as.data.frame(cbind(RMSE, MAE, R2))#Summaryprint(finalmodel)print(Final)#Save Final Model Filesave(finalmodel, file= "FINALMODEL.Rdata")#4 Compile all evaluation data and figures required for reporting ----#4.1.1 Training performance (cross validation) measures ----print(finalmodelstats)#4.1.2 Training Quantile-Quantile Plot ----pred <- predict(finalmodel, train)qqplot(pred, train$Survival, xlab = "Prediction", ylab = "True Survival")#4.2.1 Testing (Internal Validation) performance measures ----print(Final)#4.2.2 Testing Quantile-Quantile plot ----pred <- predict(finalmodel, test) qqplot(pred, test$Survival, xlab = "Prediction", ylab = "True Survival")#4.3 Assess variable importance ----imp <- varImp(finalmodel, scale = T, nonpara = T)imp #Calculated variable importance based on R2plot(imp) #Plot variable importance ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download