
Efron's bootstrap

The bootstrap was introduced by Brad Efron in the late 1970s. It is a computer-intensive method for approximating the sampling distribution of any statistic derived from a random sample. Here Dennis Boos and Leonard Stefanski give simple examples to show how the bootstrap is used and help to explain its enormous success as a tool of statistical inference.

Bootstrap basics

A fundamental problem in statistics is assessing the variability of an estimate derived from sample data. Consider, for example, a simple survey in which a newspaper with a circulation of 300 000 (the population) randomly samples 100 of its subscribers (the sample) and asks their preference as to whether front-page stories should continue on the second page or on the back page of the section. Suppose that in the sample of 100 readers 64% favoured the back page. If this study were repeated with a new random sample of 100 readers, then the results would be unlikely to be 64% again, but would probably be something else, say 59%. And if the study were repeated over and over, the results would be a large set of percentages, say {64, 59, 65, 70, 52, ...}. This hypothetical set of possible study results represents the sampling distribution of the sample proportion statistic. With it one can assess the variability in the real-sample estimate (e.g., attach a margin of error to it, say 64% ± 9%), and rigorously address questions such as whether more than half the readers prefer stories to continue on the back page.

The catch is, of course, that it is impractical to repeat studies, and thus the set of possible percentages described above is never more than hypothetical. The solution to this dilemma, before the widespread availability of fast computing, was to derive the sampling distribution mathematically. This is easy to do for simple estimates such as the sample proportion, but not so easy for more complicated statistics.
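For the sample proportion that mathematical derivation is indeed short: the standard error is √(p(1 − p)/n), which for 64% and n = 100 is about 4.8 percentage points, consistent with the roughly ±9% margin of error quoted above (about two standard errors). A small simulation of the hypothetical repeated surveys gives the same answer. This sketch is ours, for illustration only; the seed and the assumption that the population proportion really is 64% are not part of the article:

```python
import math
import random

random.seed(1)

p, n = 0.64, 100  # assumed population proportion and survey sample size

# Closed-form standard error of a sample proportion
se = math.sqrt(p * (1 - p) / n)   # 0.048, i.e. 4.8 percentage points
margin = 1.96 * se                # about 0.094, the familiar +/- 9-10%

# Mimic the hypothetical repeated surveys: 1000 samples of 100 readers
# from a population in which 64% favour the back page
counts = [sum(random.random() < p for _ in range(n)) for _ in range(1000)]
sim_se = (sum((c / n - p) ** 2 for c in counts) / 1000) ** 0.5

print(round(se, 3), round(margin, 3), round(sim_se, 3))
```

With 1000 simulated surveys the Monte Carlo standard error lands within a fraction of a point of the formula value, which is exactly the logic the bootstrap reuses, with the sample standing in for the population.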

Fast computing opened a new door to the problem of determining the sampling distribution of a statistic. On the other side of that door was Efron's bootstrap, or what is now known simply as the bootstrap. In broad strokes, the bootstrap substitutes computing power for mathematical prowess in determining the sampling distribution of a statistic.

In practice, the bootstrap is a computer-based technique that mimics the core concept of random sampling from a set of numbers and thereby estimates the sampling distribution of virtually any statistic computed from the sample. The only way it differs from the hypothetical resampling described above is that the repeated samples are not drawn from the population, but rather from the sample itself, because the population is not accessible.
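In code, a resample is nothing more than n draws with replacement from the observed data, and the bootstrap distribution is the statistic applied to many such resamples. A minimal sketch (the function names and toy data are ours, not the authors'):

```python
import random

random.seed(2)

def resample(sample):
    """One bootstrap resample: len(sample) draws with replacement."""
    return [random.choice(sample) for _ in sample]

def bootstrap_distribution(sample, statistic, b=1000):
    """Approximate the sampling distribution of `statistic` by
    applying it to b resamples of the observed sample."""
    return [statistic(resample(sample)) for _ in range(b)]

# Toy illustration: the bootstrap distribution of the mean
data = [3, 7, 8, 12, 13, 14, 18, 21, 24, 40]
means = bootstrap_distribution(data, lambda s: sum(s) / len(s))
print(len(means))  # 1000 bootstrap means
```

The same two functions work unchanged for the median, a correlation, or any other statistic, which is the point the article develops below.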

Examples

To illustrate these ideas we use two simple examples where the statistics are the sample mean and median. Consider the data set in Table 1 of n = 25 adult male yearly incomes (in thousands of dollars) collected from a fictitious county in North Carolina.

The sample mean

The sample mean of the Table 1 data is Ȳ = 47.76. Statistical theory tells us that if these values were independently drawn from a population of incomes having mean μ and variance σ², then the sampling distribution of Ȳ has mean μ, variance σ²/n (here n = 25), and standard deviation σ/√n.

Table 1. Random sample of 25 yearly incomes in thousands of dollars (ordered from lowest to highest)

1 4 6 12 13 14 18 19 20 22 23 24 26 31 34 37 46 47 56 61 63 65 70 97 385
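As a check, the summary figures quoted in this article can be recomputed directly from the Table 1 values. A sketch using Python's statistics module (the standard error sₙ₋₁/√n = 14.8 is the non-bootstrap estimate discussed later in the text):

```python
import statistics

# Table 1: 25 yearly incomes in thousands of dollars
incomes = [1, 4, 6, 12, 13, 14, 18, 19, 20, 22, 23, 24, 26,
           31, 34, 37, 46, 47, 56, 61, 63, 65, 70, 97, 385]

mean = statistics.mean(incomes)                       # 47.76
median = statistics.median(incomes)                   # 26
se = statistics.stdev(incomes) / len(incomes) ** 0.5  # s_{n-1}/sqrt(n) = 14.8

print(round(mean, 2), median, round(se, 1))
```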

december 2010

© 2010 The Royal Statistical Society

The sampling distribution of a statistic computed from a random sample is the distribution of the statistic in repeated sampling from that population. Usually we do not know the population and cannot repeatedly sample, and thus we estimate μ with Ȳ and also estimate the standard deviation of Ȳ (often called the standard error) by sₙ₋₁/√n, where s²ₙ₋₁ = (n − 1)⁻¹ Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² is the unbiased version of the sample variance. Statistical inference proceeds by relying on the fact that Ȳ is approximately normally distributed due to the central limit theorem.

So that we know what the bootstrap should be estimating, we generated the data in Table 1 pseudo-randomly as Yᵢ = 30 exp(Zᵢ) (×$1000), i = 1, ..., 25, where Z₁, ..., Z₂₅ are independently distributed standard normal random variables. Thus our fictitious sample is known to come from a lognormal population or distribution. Since we know the population distribution, we can also generate the true sampling distribution of Ȳ by creating independent random samples in the same manner, and then computing Ȳ for each one. We did this for 1000 random samples and plotted a histogram of the Ȳ values in the left panel of Figure 1.

Figure 1. (Left) Histogram of 1000 sample means from repeated sampling of a theoretical lognormal population. (Right) Histogram of 1000 bootstrap sample means from randomly sampling with replacement from Table 1 data
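The left panel's construction can be imitated directly: draw many lognormal samples of size 25 and record each sample mean. A sketch (the seed is ours, and we print a summary rather than plot the histogram):

```python
import math
import random
import statistics

random.seed(3)

def lognormal_sample(n=25):
    """One pseudo-random sample: Y_i = 30 * exp(Z_i), Z_i standard normal."""
    return [30 * math.exp(random.gauss(0, 1)) for _ in range(n)]

# 1000 sample means from repeated sampling of the known lognormal
# population: a Monte Carlo picture of the true sampling distribution
true_means = [statistics.mean(lognormal_sample()) for _ in range(1000)]

print(round(min(true_means), 1), round(statistics.mean(true_means), 1),
      round(max(true_means), 1))
```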

The bootstrap can be used to approximate the sampling distribution of Ȳ when we do not know the population from which the sample was obtained (always the case with real data). The nonparametric bootstrap proceeds by treating the data in Table 1 as a population and drawing random samples from it. A bootstrap random sample (also called a resample) is drawn from the Table 1 pseudo-population by randomly choosing 25 values with replacement from the values in Table 1. Table 2 displays two such samples.

Note that repeated values of the original data appear within each resample because the sampling is with replacement (as opposed to without replacement). The only sample of size n = 25 that could be drawn without replacement is the original sample itself. The right panel in Figure 1 is the histogram of the 1000 sample means computed from 1000 resamples. It is the bootstrap estimate of the distribution in the left panel. Remember that we have the left panel in this case only because we generated the sample from a known probability distribution. In any real application we cannot produce the left panel, but the bootstrap can always produce the right panel. The two panels are similar, but there are differences resulting from the bootstrap's use of the sample as if it were the population.
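The right panel and the bootstrap standard error take only a few lines: resample the 25 incomes with replacement 1000 times and take the standard deviation of the resulting means. A sketch (our seed, so the result will wander a little around the 13.8 versus 14.8 values discussed in the text):

```python
import random
import statistics

random.seed(4)

incomes = [1, 4, 6, 12, 13, 14, 18, 19, 20, 22, 23, 24, 26,
           31, 34, 37, 46, 47, 56, 61, 63, 65, 70, 97, 385]

# 1000 nonparametric bootstrap resamples: 25 draws with replacement each
boot_means = [
    statistics.mean(random.choices(incomes, k=len(incomes)))
    for _ in range(1000)
]

# Bootstrap standard error of the mean: the standard deviation
# (denominator 1000 - 1) of the 1000 bootstrap means
boot_se = statistics.stdev(boot_means)
print(round(boot_se, 1))
```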

Table 2. Bootstrap resamples from Table 1

Sample 1: 1 4 4 6 18 22 22 23 23 23 24 26 31 37 46 47 47 56 56 61 61 63 65 65 65
Sample 2: 1 4 6 13 14 14 18 19 22 23 23 23 24 26 26 37 46 46 47 47 63 63 70 70 97

An important use of the bootstrap is calculation of the standard error of an estimate (the essential component of the margin of error associated with a statistical estimate). For our toy example, the bootstrap standard error of the mean estimate, 47.76, is

{(1000 − 1)⁻¹ Σᵢ₌₁¹⁰⁰⁰ (Ȳᵢ − Ȳ.)²}^(1/2) = 13.8,

where Ȳᵢ denotes the mean of the ith resample and Ȳ. the average of all 1000 resample means. (And of course the fact that means from 1000 resamples must be calculated explains the bootstrap's practical need for a computer. Happily, it was developed just as computing power became widely available.)

In this case we can also use the theoretically derived formula to get the non-bootstrap standard error estimate sₙ₋₁/√n = 14.8 for the Table 1 data. The difference between the two estimated standard errors (13.8 versus 14.8) has two components. The random component is due to Monte Carlo simulation, that is, using 1000 resamples rather than say 1 million resamples; had we used a much larger number of resamples, then the bootstrap standard error would approximate sₙ/√n = 14.5. The second component is due to the difference in the denominators between sₙ and sₙ₋₁. These are relatively minor discrepancies, and most analysts are usually willing to accept a small amount of variation in bootstrap standard errors due to using a limited number of resamples.

In some situations, we might feel comfortable making a guess at the type of distribution of the underlying population, that is, the basic shape of the data. For example, the data in Table 1 actually came from standard normal data that were then exponentiated to get lognormal data. Another way to do bootstrap sampling is to estimate the parameters of the assumed distribution and then generate bootstrap samples from the estimated population. This is called parametric bootstrapping, and is best used when the distribution type is reasonably well known.

The sample median
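A parametric bootstrap along the lines just described can be sketched as follows: assume the lognormal shape, estimate its two parameters from the logged data, and resample from the fitted distribution rather than from the data themselves. The fitting choice here (log-scale mean and standard deviation) is our assumption for illustration:

```python
import math
import random
import statistics

random.seed(5)

incomes = [1, 4, 6, 12, 13, 14, 18, 19, 20, 22, 23, 24, 26,
           31, 34, 37, 46, 47, 56, 61, 63, 65, 70, 97, 385]

# Step 1: estimate the parameters of the assumed lognormal population
# by taking logs and fitting a normal mean and standard deviation
logs = [math.log(y) for y in incomes]
mu_hat, sigma_hat = statistics.mean(logs), statistics.stdev(logs)

# Step 2: generate bootstrap samples from the *estimated* population
# instead of resampling the observed values themselves
def parametric_resample(n=25):
    return [math.exp(random.gauss(mu_hat, sigma_hat)) for _ in range(n)]

boot_medians = [statistics.median(parametric_resample()) for _ in range(1000)]
boot_se_median = statistics.stdev(boot_medians)
print(round(boot_se_median, 1))
```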

A histogram of the Table 1 data (not displayed) reveals that it is quite skewed to the right. This skewness is also clear from the fact that the sample mean, 47.8, is much larger than the sample median, 26. In situations with such skewness it is typical to use the median to measure central tendency instead of the mean. Not only is the median more representative of typical values, but the sampling distribution standard deviation is much smaller for the sample median than for the sample mean (small is good for standard deviations of estimators!). Thus, for example, the US Census Bureau routinely uses the median to summarize income data.

Figure 2. (Left) Histogram of 1000 sample medians from repeated sampling of a theoretical lognormal population. (Right) Histogram of 1000 bootstrap sample medians from Table 1 data

Unfortunately the sampling distribution of the sample median is difficult to analyse theoretically. In fact, there is no simple expression for the standard deviation of the sample median as there is for the mean. We can of course study the sampling distribution of the sample median by Monte Carlo sampling from a true population when it is known. The left panel of Figure 2 gives a histogram of sample median values from the same 1000 lognormal samples as in the left panel of Figure 1. The right panel of Figure 2 gives the histogram of 1000 sample medians computed from the same resamples used in the right panel of Figure 1.

Using the 1000 bootstrap medians M̂₁, ..., M̂₁₀₀₀ depicted in the right panel of Figure 2, the bootstrap standard error (of the median estimate 26) is

{(1000 − 1)⁻¹ Σᵢ₌₁¹⁰⁰⁰ (M̂ᵢ − M̂.)²}^(1/2) = 7.3,

where M̂. is the average of the 1000 bootstrap medians.
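The bootstrap-median calculations in this section (the standard error, the bias adjustment and the percentile interval) all use the same 1000 ordered bootstrap medians. A sketch (our seed, so the outputs will differ slightly from the 7.3, 2.4 and (19, 46) given in the article):

```python
import random
import statistics

random.seed(6)

incomes = [1, 4, 6, 12, 13, 14, 18, 19, 20, 22, 23, 24, 26,
           31, 34, 37, 46, 47, 56, 61, 63, 65, 70, 97, 385]

boot_medians = sorted(
    statistics.median(random.choices(incomes, k=len(incomes)))
    for _ in range(1000)
)

# Standard error: standard deviation of the 1000 bootstrap medians
se = statistics.stdev(boot_medians)

# Bias estimate: average bootstrap median minus the sample median 26,
# giving the bias-adjusted estimate 26 - bias
bias = statistics.mean(boot_medians) - 26
adjusted = 26 - bias

# 95% bootstrap percentile interval: 25th and 975th ordered medians
interval = (boot_medians[24], boot_medians[974])

print(round(se, 1), round(bias, 1), round(adjusted, 1), interval)
```

Because each resample has an odd size of 25, every bootstrap median is itself one of the original 25 data values, which is the discreteness discussed below.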

Comparing this bootstrap standard error, 7.3, to that for the sample mean, 13.8, empirically supports the claim that the median is a less variable statistic than the mean for skewed data.

Note that the vertical scales are different in Figure 2. Because of the discreteness of the bootstrap pseudo-population and the nature of the median, the estimated sampling distribution is very discrete, with most of the sample medians concentrated on the Table 1 central values 22, 23, 24, 26, 31 and 34. For most purposes this discreteness is not a problem. (In large samples the sampling distribution of the median approximates a normal distribution.)

The 1000 bootstrap median values are commonly used for other purposes as well. The plots in Figure 2 suggest that the sampling distribution of the median is mildly skewed to the right. Thus, we might be interested in the bias in the sample median, that is, the difference between the mean of the sampling distribution and the true population median. The bootstrap estimate of that bias is (1000)⁻¹ Σᵢ₌₁¹⁰⁰⁰ M̂ᵢ − 26 = 28.4 − 26 = 2.4, leading to the bootstrap bias-adjusted median estimate 26 − 2.4 = 23.6.

Comparing Figures 1 and 2 visually suggests that the bootstrap distribution for the sample mean is a better estimate of the true sampling distribution of the sample mean than the bootstrap distribution of the sample median is of its sampling distribution. This reflects the fact that the sampling distribution of the median is more difficult to estimate than the sampling distribution of the mean. However, the bootstrap still estimates the sampling distribution well enough, and in particular provides a valid standard error estimate for the median, whereas the best-known other computationally based method for estimating standard errors, the jackknife, does not.

Another important use of the bootstrap is confidence interval construction. The simplest bootstrap approach to confidence intervals is first to order the 1000 bootstrap medians displayed in the right panel of Figure 2, say M̂(1) ≤ M̂(2) ≤ ... ≤ M̂(1000). Then (M̂(25), M̂(975)) = (19, 46) is called the 95% bootstrap percentile interval. In this case, an exact nonparametric confidence interval for the median is available, given by (19, 47) with exact coverage probability 0.957. Efron pointed out the close similarity between the bootstrap percentile interval and this nonparametric confidence interval.

Conclusion

The power of the bootstrap lies in the fact that the method can be applied to (almost) any estimator, no matter how complicated it is. The only requirement is a computer program to calculate the estimator from a sample and a method to draw resamples. We have described only the case of simple random sampling. However, the bootstrap method applies to any type of probability-based data collection, provided that it can be imitated via a computer program to generate resamples that relate statistically to the real sample in the same way that the real sample relates to the population from which it was selected. For example, economic data is often in the form of time series where all the sample data are correlated. A parametric bootstrap would assume a specific model such as a normal autoregressive process. After estimating the unknown parameters of the model, many independent bootstrap time series would be generated from the estimated autoregressive process.

There are literally thousands of articles on the bootstrap and many expository reviews. For the interested reader, though, the book by Efron and Tibshirani² is a good introduction, and those by Efron¹ and Shao and Tu³ can be consulted for more technical accounts.

References

1. Efron, B. (1982) The Jackknife, the Bootstrap, and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics.
2. Efron, B. and Tibshirani, R. J. (1993) An Introduction to the Bootstrap. New York: Chapman and Hall.
3. Shao, J. and Tu, D. (1996) The Jackknife and Bootstrap. New York: Springer.

Dennis Boos and Leonard Stefanski are at the Department of Statistics, North Carolina State University.

