Chapter 6: Exercise 2

(a) (Lasso) iii. The lasso is less flexible than least squares and will give improved prediction accuracy when its increase in bias is smaller than its decrease in variance.

(b) (Ridge regression) iii. Same reasoning as for the lasso.

(c) (Non-linear methods) ii. Non-linear methods are more flexible than least squares and will give improved prediction accuracy when their decrease in bias is larger than their increase in variance.

Chapter 6: Exercise 3

(a) (iv) Steadily decreases. As we increase s from 0, each β grows from 0 toward its least squares estimate. The training RSS is largest when all β's are 0 and steadily decreases to the ordinary least squares RSS.

(b) (ii) Decreases initially, then eventually starts increasing in a U shape. When s = 0, all β's are 0, the model is extremely simple, and the test RSS is high. As we increase s, the β's take non-zero values and the model starts fitting the data better, so the test RSS decreases. Eventually, as the β's approach their full OLS values, the model starts overfitting the training data and the test RSS increases.

(c) (iii) Steadily increases. When s = 0, the model effectively predicts a constant and has almost no variance. As we increase s, more β's enter the model and their values grow; the fitted values become increasingly dependent on the particular training data, so the variance increases.

(d) (iv) Steadily decreases. When s = 0, the model effectively predicts a constant, so its predictions are far from the true values and the bias is high. As s increases, more β's become non-zero and the model can capture the underlying signal better, so the bias decreases.

(e) (v) Remains constant. By definition, the irreducible error does not depend on the model, so it is unaffected by the choice of s.

Chapter 6: Exercise 9

(a) Load and split the College data.

library(ISLR)
set.seed(11)
sum(is.na(College))

## [1] 0

train.size = dim(College)[1] / 2
train = sample(1:dim(College)[1], train.size)
test = -train
College.train = College[train, ]
College.test = College[test, ]

(b) The number of applications is the Apps variable. We fit a linear model by least squares on the training set.

lm.fit = lm(Apps ~ ., data = College.train)
lm.pred = predict(lm.fit, College.test)
mean((College.test[, "Apps"] - lm.pred)^2)

## [1] 1538442

The test MSE is 1538442.

(c) We pick λ by cross-validation on College.train and report the error on College.test.

library(glmnet)

## Warning: package 'glmnet' was built under R version 2.15.2
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 2.15.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 2.15.3
## Loaded glmnet 1.9-3

train.mat = model.matrix(Apps ~ ., data = College.train)
test.mat = model.matrix(Apps ~ ., data = College.test)
grid = 10^seq(4, -2, length = 100)
mod.ridge = cv.glmnet(train.mat, College.train[, "Apps"], alpha = 0, lambda = grid, thresh = 1e-12)
lambda.best = mod.ridge$lambda.min
lambda.best

## [1] 18.74

ridge.pred = predict(mod.ridge, newx = test.mat, s = lambda.best)
mean((College.test[, "Apps"] - ridge.pred)^2)

## [1] 1608859

The test MSE, 1608859, is slightly higher than that of OLS.

(d) Again we pick λ by cross-validation on College.train and report the error on College.test.

mod.lasso = cv.glmnet(train.mat, College.train[, "Apps"], alpha = 1, lambda = grid, thresh = 1e-12)
lambda.best = mod.lasso$lambda.min
lambda.best

## [1] 21.54

lasso.pred = predict(mod.lasso, newx = test.mat, s = lambda.best)
mean((College.test[, "Apps"] - lasso.pred)^2)

## [1] 1635280

Again, the test MSE, 1635280, is slightly higher than that of OLS.

The coefficients look like:

mod.lasso = glmnet(model.matrix(Apps ~ ., data = College), College[, "Apps"], alpha = 1)
predict(mod.lasso, s = lambda.best, type = "coefficients")

## 19 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) -6.038e+02
## (Intercept)  .
## PrivateYes  -4.235e+02
## Accept       1.455e+00
## Enroll      -2.004e-01
## Top10perc    3.368e+01
## Top25perc   -2.403e+00
## F.Undergrad  .
## P.Undergrad  2.086e-02
## Outstate    -5.782e-02
## Room.Board   1.246e-01
## Books        .
## Personal     1.833e-05
## PhD         -5.601e+00
## Terminal    -3.314e+00
## S.F.Ratio    4.479e+00
## perc.alumni -9.797e-01
## Expend       6.968e-02
## Grad.Rate    5.160e+00
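To see which predictors the lasso drops without scanning the printed matrix by eye, the coefficient vector can be inspected programmatically. This is a minimal sketch reusing mod.lasso and lambda.best from above; the helper names (lasso.coef, coef.mat, zeroed) are ours and not part of the original solution.

lasso.coef = predict(mod.lasso, s = lambda.best, type = "coefficients")
coef.mat = as.matrix(lasso.coef)  # dense copy of the sparse coefficient matrix
# predictors whose coefficient is exactly zero (the redundant intercept row is dropped)
zeroed = setdiff(rownames(coef.mat)[coef.mat[, 1] == 0], "(Intercept)")
zeroed

This matches the observation in (g) below that F.Undergrad and Books are shrunk exactly to zero.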
(e) We fit PCR, using cross-validation to choose the number of components.

library(pls)

## 
## Attaching package: 'pls'
## 
## The following object(s) are masked from 'package:stats':
## 
##     loadings

pcr.fit = pcr(Apps ~ ., data = College.train, scale = T, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred = predict(pcr.fit, College.test, ncomp = 10)
mean((College.test[, "Apps"] - data.frame(pcr.pred))^2)

## [1] 3014496

The test MSE for PCR is about 3014496.

(f) We fit PLS, using cross-validation to choose the number of components.

pls.fit = plsr(Apps ~ ., data = College.train, scale = T, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, College.test, ncomp = 10)
mean((College.test[, "Apps"] - data.frame(pls.pred))^2)

## [1] 1508987

The test MSE for PLS is about 1508987.

(g) The results for OLS, ridge, and the lasso are comparable. The lasso shrinks the coefficients of the other variables and sets those of F.Undergrad and Books exactly to zero. Here are the test R2 values for all models.

test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - lm.pred)^2) /
    mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) /
    mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) /
    mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((College.test[, "Apps"] - data.frame(pcr.pred))^2) /
    mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((College.test[, "Apps"] - data.frame(pls.pred))^2) /
    mean((College.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), col = "red",
    names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main = "Test R-squared")

The plot shows that the test R2 is around 0.9 for every model except PCR, with PLS slightly higher than the others. PCR has a smaller test R2 of less than 0.8. All models except PCR predict the number of college applications with high accuracy.

Chapter 7: Exercise 5

(a) We would expect g2 to have the smaller training RSS: because its penalty involves a higher-order derivative, it is the more flexible of the two fits (as λ → ∞ it can be a higher-degree polynomial) and can therefore track the training data more closely.

(b) We would expect g1 to have the smaller test RSS, because g2 could overfit the training data with its extra degree of freedom.

(c) Trick question: when λ = 0 the penalty term vanishes, so g1 = g2 and the two fits have identical training and test RSS.

Chapter 7: Exercise 9

Load the Boston data set.

set.seed(1)
library(MASS)
attach(Boston)

(a) We fit a cubic polynomial of dis to predict nox.

lm.fit = lm(nox ~ poly(dis, 3), data = Boston)
summary(lm.fit)

## 
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12113 -0.04062 -0.00974  0.02338  0.19490 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.55470    0.00276  201.02  < 2e-16 ***
## poly(dis, 3)1 -2.00310    0.06207  -32.27  < 2e-16 ***
## poly(dis, 3)2  0.85633    0.06207   13.80  < 2e-16 ***
## poly(dis, 3)3 -0.31805    0.06207   -5.12  4.3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0621 on 502 degrees of freedom
## Multiple R-squared: 0.715, Adjusted R-squared: 0.713
## F-statistic: 419 on 3 and 502 DF, p-value: <2e-16

dislim = range(dis)
dis.grid = seq(from = dislim[1], to = dislim[2], by = 0.1)
lm.pred = predict(lm.fit, list(dis = dis.grid))
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, lm.pred, col = "red", lwd = 2)

The summary shows that all polynomial terms are significant for predicting nox from dis, and the plot shows a smooth curve that fits the data fairly well.
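Before comparing degrees numerically in (b), it can help to overlay fits of a few different degrees on the scatterplot. A minimal sketch, reusing dis.grid from above; the particular degrees (1, 3, 5) and colours are our choices for illustration.

degrees = c(1, 3, 5)
cols = c("blue", "red", "darkgreen")
plot(nox ~ dis, data = Boston, col = "darkgrey")
for (k in seq_along(degrees)) {
    # fit a polynomial of the given degree and add its curve over the dis grid
    fit = lm(nox ~ poly(dis, degrees[k]), data = Boston)
    lines(dis.grid, predict(fit, list(dis = dis.grid)), col = cols[k], lwd = 2)
}
legend("topright", legend = paste("degree", degrees), col = cols, lwd = 2)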
(b) We fit polynomials of degrees 1 to 10 and record the training RSS.

all.rss = rep(NA, 10)
for (i in 1:10) {
    lm.fit = lm(nox ~ poly(dis, i), data = Boston)
    all.rss[i] = sum(lm.fit$residuals^2)
}
all.rss

##  [1] 2.769 2.035 1.934 1.933 1.915 1.878 1.849 1.836 1.833 1.832

As expected, the training RSS decreases monotonically with the degree of the polynomial.

(c) We use 10-fold cross-validation to pick the best polynomial degree.

library(boot)
all.deltas = rep(NA, 10)
for (i in 1:10) {
    glm.fit = glm(nox ~ poly(dis, i), data = Boston)
    all.deltas[i] = cv.glm(Boston, glm.fit, K = 10)$delta[2]
}
plot(1:10, all.deltas, xlab = "Degree", ylab = "CV error", type = "l", pch = 20, lwd = 2)

The CV error decreases as the degree increases from 1 to 3, stays almost constant up to degree 5, and then starts increasing for higher degrees. We pick 4 as the best polynomial degree.

(d) The range of dis is roughly 1 to 13. We split this range into four roughly equal intervals and place knots at [4, 7, 11]. Note: bs() accepts either a df argument or a knots argument; if both are given, the supplied knots are used and df is effectively ignored, which is why the fit below has six basis functions rather than four (see the sketch after this part).

library(splines)
sp.fit = lm(nox ~ bs(dis, df = 4, knots = c(4, 7, 11)), data = Boston)
summary(sp.fit)

## 
## Call:
## lm(formula = nox ~ bs(dis, df = 4, knots = c(4, 7, 11)), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1246 -0.0403 -0.0087  0.0247  0.1929 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             0.7393     0.0133   55.54  < 2e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))1  -0.0886     0.0250   -3.54  0.00044 ***
## bs(dis, df = 4, knots = c(4, 7, 11))2  -0.3134     0.0168  -18.66  < 2e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))3  -0.2662     0.0315   -8.46  3.0e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))4  -0.3980     0.0465   -8.56  < 2e-16 ***
## bs(dis, df = 4, knots = c(4, 7, 11))5  -0.2568     0.0900   -2.85  0.00451 ** 
## bs(dis, df = 4, knots = c(4, 7, 11))6  -0.3293     0.0633   -5.20  2.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0619 on 499 degrees of freedom
## Multiple R-squared: 0.718, Adjusted R-squared: 0.715
## F-statistic: 212 on 6 and 499 DF, p-value: <2e-16

sp.pred = predict(sp.fit, list(dis = dis.grid))
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, sp.pred, col = "red", lwd = 2)

The summary shows that all terms in the spline fit are significant, and the plot shows that the spline fits the data well except at the extreme values of dis (especially dis > 10).
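To check the note in (d) about how bs() treats df and knots, one can compare the dimensions of the resulting basis matrices. A minimal sketch (our check, not part of the original solution); with degree-3 splines and no intercept column, the number of basis functions is the number of interior knots plus 3.

dim(bs(dis, df = 4, knots = c(4, 7, 11)))  # 506 x 6: the three supplied knots are used
dim(bs(dis, knots = c(4, 7, 11)))          # 506 x 6: identical basis, so df had no effect above
dim(bs(dis, df = 4))                       # 506 x 4: with only df, bs() chooses one interior knot itself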
(e) We fit regression splines with df between 3 and 16 and record the training RSS.

all.cv = rep(NA, 16)
for (i in 3:16) {
    lm.fit = lm(nox ~ bs(dis, df = i), data = Boston)
    all.cv[i] = sum(lm.fit$residuals^2)
}
all.cv[-c(1, 2)]

##  [1] 1.934 1.923 1.840 1.834 1.830 1.817 1.826 1.793 1.797 1.789 1.782
## [12] 1.782 1.783 1.784

The training RSS generally decreases as df increases, apart from small bumps at df = 9 and df = 11, reaching its minimum around df = 13 and 14 and then increasing slightly for df = 15 and 16.

(f) Finally, we use 10-fold cross-validation to find the best df, trying every integer value of df between 3 and 16.

all.cv = rep(NA, 16)
for (i in 3:16) {
    lm.fit = glm(nox ~ bs(dis, df = i), data = Boston)
    all.cv[i] = cv.glm(Boston, lm.fit, K = 10)$delta[2]
}

## Warning: some 'x' values beyond boundary knots may cause ill-conditioned bases
## (this warning is repeated many times, once for each affected cross-validation fit)

plot(3:16, all.cv[-c(1, 2)], lwd = 2, type = "l", xlab = "df", ylab = "CV error")

The CV error is noisier here, but it attains its minimum at df = 10, so we pick 10 as the optimal degrees of freedom.
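The long run of warnings in (f) arises because, in some cross-validation folds, held-out dis values fall outside the boundary knots computed from that fold's training data. If the clutter is a concern, the same loop can be wrapped in suppressWarnings(); a minimal sketch of that variant (same computation, warnings silenced):

all.cv = rep(NA, 16)
for (i in 3:16) {
    lm.fit = glm(nox ~ bs(dis, df = i), data = Boston)
    # identical CV computation; suppressWarnings() only hides the boundary-knot messages
    all.cv[i] = suppressWarnings(cv.glm(Boston, lm.fit, K = 10)$delta[2])
}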
Chapter 7: Exercise 10

(a) We split the data and run forward stepwise selection on the training set.

set.seed(1)
library(ISLR)
library(leaps)
attach(College)
train = sample(length(Outstate), length(Outstate)/2)
test = -train
College.train = College[train, ]
College.test = College[test, ]
reg.fit = regsubsets(Outstate ~ ., data = College.train, nvmax = 17, method = "forward")
reg.summary = summary(reg.fit)
par(mfrow = c(1, 3))
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
min.cp = min(reg.summary$cp)
std.cp = sd(reg.summary$cp)
abline(h = min.cp + 0.2 * std.cp, col = "red", lty = 2)
abline(h = min.cp - 0.2 * std.cp, col = "red", lty = 2)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
min.bic = min(reg.summary$bic)
std.bic = sd(reg.summary$bic)
abline(h = min.bic + 0.2 * std.bic, col = "red", lty = 2)
abline(h = min.bic - 0.2 * std.bic, col = "red", lty = 2)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "l", ylim = c(0.4, 0.84))
max.adjr2 = max(reg.summary$adjr2)
std.adjr2 = sd(reg.summary$adjr2)
abline(h = max.adjr2 + 0.2 * std.adjr2, col = "red", lty = 2)
abline(h = max.adjr2 - 0.2 * std.adjr2, col = "red", lty = 2)

The Cp, BIC, and adjusted R2 plots all show that 6 is the smallest subset size whose score is within 0.2 standard deviations of the optimum. We therefore pick a subset size of 6 and find the best 6 variables using the entire data set.

reg.fit = regsubsets(Outstate ~ ., data = College, method = "forward")
coefi = coef(reg.fit, id = 6)
names(coefi)

## [1] "(Intercept)" "PrivateYes"  "Room.Board"  "PhD"         "perc.alumni"
## [6] "Expend"      "Grad.Rate"
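Rather than reading the subset size off the plots, the within-0.2-standard-deviations rule can be applied directly to the stored summary statistics. A minimal sketch (our helper code, not part of the original solution): each line returns the smallest subset size meeting the criterion for one measure.

# smallest size whose Cp / BIC is within 0.2 SD of the minimum,
# and whose adjusted R2 is within 0.2 SD of the maximum
min(which(reg.summary$cp <= min(reg.summary$cp) + 0.2 * sd(reg.summary$cp)))
min(which(reg.summary$bic <= min(reg.summary$bic) + 0.2 * sd(reg.summary$bic)))
min(which(reg.summary$adjr2 >= max(reg.summary$adjr2) - 0.2 * sd(reg.summary$adjr2)))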
(b) We fit a GAM on the training data, using the six predictors selected in (a) and smoothing splines for the quantitative ones.

library(gam)

## Loading required package: splines
## Loaded gam 1.09

gam.fit = gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) +
    s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2), data = College.train)
par(mfrow = c(2, 3))
plot(gam.fit, se = T, col = "blue")

(c) We evaluate the GAM on the test set.

gam.pred = predict(gam.fit, College.test)
gam.err = mean((College.test$Outstate - gam.pred)^2)
gam.err

## [1] 3745460

gam.tss = mean((College.test$Outstate - mean(College.test$Outstate))^2)
gam.test.r2 = 1 - gam.err/gam.tss
gam.test.r2

## [1] 0.7697

We obtain a test R2 of about 0.77 using a GAM with six predictors, a slight improvement over the test R2 of about 0.74 obtained using OLS (a sketch of one way to compute that baseline appears at the end of this exercise).

(d)

summary(gam.fit)

## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, 
##     df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, 
##     df = 2), data = College.train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -4977.7 -1184.5    58.3  1220.0  7688.3 
## 
## (Dispersion Parameter for gaussian family taken to be 3300711)
## 
##     Null Deviance: 6.222e+09 on 387 degrees of freedom
## Residual Deviance: 1.231e+09 on 373 degrees of freedom
## AIC: 6942 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                         Df   Sum Sq  Mean Sq F value  Pr(>F)    
## Private                  1 1.78e+09 1.78e+09   539.1 < 2e-16 ***
## s(Room.Board, df = 2)    1 1.22e+09 1.22e+09   370.2 < 2e-16 ***
## s(PhD, df = 2)           1 3.82e+08 3.82e+08   115.9 < 2e-16 ***
## s(perc.alumni, df = 2)   1 3.28e+08 3.28e+08    99.5 < 2e-16 ***
## s(Expend, df = 5)        1 4.17e+08 4.17e+08   126.2 < 2e-16 ***
## s(Grad.Rate, df = 2)     1 5.53e+07 5.53e+07    16.8 5.2e-05 ***
## Residuals              373 1.23e+09 3.30e+06                    
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df Npar F   Pr(F)    
## (Intercept)                                      
## Private                                          
## s(Room.Board, df = 2)        1   3.56   0.060 .  
## s(PhD, df = 2)               1   4.34   0.038 *  
## s(perc.alumni, df = 2)       1   1.92   0.167    
## s(Expend, df = 5)            4  16.86   1e-12 ***
## s(Grad.Rate, df = 2)         1   3.72   0.055 .  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA for non-parametric effects shows strong evidence of a non-linear relationship between the response and Expend, and moderate evidence (p-values near 0.05) of non-linear relationships between the response and PhD and Grad.Rate.
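Part (c) compares the GAM against a test R2 of about 0.74 for OLS, but that baseline is not computed above. A minimal sketch of one way to obtain it, assuming the same six predictors and the same train/test split; the ols.* names are ours.

# OLS on the six selected predictors, evaluated on the same test set
ols.fit = lm(Outstate ~ Private + Room.Board + PhD + perc.alumni + Expend + Grad.Rate,
    data = College.train)
ols.pred = predict(ols.fit, College.test)
ols.err = mean((College.test$Outstate - ols.pred)^2)
1 - ols.err / mean((College.test$Outstate - mean(College.test$Outstate))^2)  # test R2 for the OLS baseline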