


Supplementary Information on Methodology

In this section, we provide more technical details on how our three models are estimated.

Regularized logistic regression: In L1 regularized logistic regression, the key parameter to tune is λ, the coefficient weighting the regularization term in the estimation procedure; it controls the degree of shrinkage and sparsity, with larger values inducing more sparsity and shrinkage. This parameter is typically tuned using k-fold cross-validation, a standard technique for tuning the hyperparameters of machine learning models. In k-fold cross-validation, the training data set is divided into k equally sized subsets ("folds"). The overall procedure then works as follows:

1. Select one of the k folds to serve as the test set, and the remaining k – 1 folds to serve as the training set.
2. Use the k – 1 folds to estimate the model for different values of the hyperparameter. (In the case of regularized logistic regression, we estimate the regularized logistic regression model for different values of λ.)
3. For each value of the hyperparameter, compute the corresponding model's performance on the test set.

Steps 1 – 3 are repeated k times, with each of the k folds serving as the test set. For each value of the hyperparameter, the test set performance is averaged over the k held-out folds, and the final value of the hyperparameter is simply the value that gives the best performance averaged over the k folds. In our analysis, we tuned the regularization parameter λ using k-fold cross-validation with k = 5 folds. We estimated all regularized logistic regression models using the glmnet package in R.¹
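As a concrete illustration of this tuning procedure, a minimal R sketch using cv.glmnet is given below. The simulated data, the seed, and the choice of AUC as the performance measure are illustrative assumptions, not our analysis code:

# Illustrative sketch: 5-fold cross-validation for the L1 penalty lambda.
library(glmnet)

set.seed(123)
n <- 500; p <- 20
x <- matrix(rnorm(n * p), nrow = n)               # simulated feature matrix
y <- rbinom(n, 1, plogis(x[, 1] - 0.5 * x[, 2]))  # simulated binary outcome

# alpha = 1 requests the L1 (lasso) penalty; cv.glmnet carries out the
# k-fold cross-validation described above over a grid of lambda values.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    nfolds = 5, type.measure = "auc")

cv_fit$lambda.min               # lambda with the best cross-validated performance
coef(cv_fit, s = "lambda.min")  # sparse coefficient vector at that lambda

In an actual analysis, x and y would be replaced by the study's predictor matrix and outcome vector.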
Random forest: In a random forest, each classification tree is randomized by building it from a bootstrapped sample of the training data set and by restricting the set of features that may be selected for splitting to a random sample of all features.² The rationale for this randomization is two-fold. First, it is well known in machine learning that individual classification trees can have poor predictive performance and are often inferior to logistic regression.³ By taking bootstrap samples of the original data set, growing a classification tree on each bootstrapped sample, and aggregating the trees, one can improve accuracy significantly.⁴ Second, by restricting the tree induction procedure to consider random samples of the features, one can reduce the correlation between the trees in the forest, which also helps to improve accuracy.⁵

The main parameters of the random forest model are ntree, the number of trees to grow; mtry, the size of the random sample of features considered for splitting; and nodesize, the minimum number of observations in each leaf of each tree. It is common to set ntree to 500 trees, mtry to the square root of the number of features (for classification), and nodesize to 1 (for classification). For ntree, larger values are preferred so as to reduce the random variation in the predictions. For mtry and nodesize, tuning typically has a minimal effect on performance, and the default values above are often near-optimal.⁶,⁷ We thus use the default values above. We estimated all random forest models using the ranger package in R.⁸
Gradient boosted trees: We estimated all gradient boosted tree models using the xgboost package in R.⁹ In xgboost, the key parameters that need to be tuned are eta, the learning rate, and depth, the maximum allowable depth of each tree. We set the number of boosting iterations, nrounds, to 100. To tune eta and depth, we applied k-fold cross-validation on the training set with k = 5 folds, testing the values {0.001, 0.01, 0.02, 0.05, 0.1} for eta and {2, 3, 4, 5} for depth.
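The sketch below illustrates this grid search using xgb.cv. The simulated data, the seed, and the AUC evaluation metric are illustrative assumptions, not our analysis code; depth in the text corresponds to xgboost's max_depth argument:

# Illustrative sketch: 5-fold cross-validation over eta and tree depth.
library(xgboost)

set.seed(123)
n <- 500; p <- 20
x <- matrix(rnorm(n * p), nrow = n)               # simulated feature matrix
y <- rbinom(n, 1, plogis(x[, 1] - 0.5 * x[, 2]))  # simulated binary outcome
dtrain <- xgb.DMatrix(data = x, label = y)

grid <- expand.grid(eta = c(0.001, 0.01, 0.02, 0.05, 0.1), max_depth = 2:5)

cv_perf <- apply(grid, 1, function(g) {
  cv <- xgb.cv(params = list(objective = "binary:logistic", eval_metric = "auc",
                             eta = g[["eta"]], max_depth = g[["max_depth"]]),
               data = dtrain, nrounds = 100, nfold = 5, verbose = 0)
  tail(cv$evaluation_log$test_auc_mean, 1)        # mean held-out AUC after 100 rounds
})

grid[which.max(cv_perf), ]  # eta / depth combination with the best cross-validated AUC

The selected combination would then be used to refit the final model on the full training set (for example, with xgb.train).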
Supplemental References

1. Friedman J, Hastie T, Tibshirani R. glmnet: Lasso and elastic-net regularized generalized linear models. R package version 2009;1(4).
2. Breiman L. Random Forests. Mach Learn 2001;45:5–32.
3. James G, Witten D, Hastie T, Tibshirani R. Statistical Learning. In: Springer, 2013:15–57.
4. Breiman L. Bagging predictors. Mach Learn 1996;24:123–40.
5. Ho TK. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 1998;20:832–44.
6. Tang GH, Rabie ABM, Hägg U. Classification and regression by randomForest. R News 2004;83:434–8.
7. Biau G, Scornet E. A random forest guided tour. Test 2016;25:197–227.
8. Wright MN, Ziegler A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. 2015.
9. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016:785–94.