Sampling in SPSS and R



Robin Beaumont
robin@organplayers.co.uk

Options for demonstrating sampling variability and sampling distributions in teaching statistics

Tuesday, 11 October 2011

Contents

1 Using SPSS
1.1 Using SPSS syntax
1.1.1 One sample
1.1.2 Multiple samples all the same size and from the same distribution
1.1.3 Samples of different sizes
1.1.4 Sampling distributions
2 Online Apps
3 The standard error of the Mean
3.1.1 Effect of sample size upon SEM - formula appreciation
4 Using SPSS script
4.1.1 Alternative script - Distribution.sbs
5 In R
6 Online presentations and other tools

The aim of this handout is to describe the various options available for teaching the concept of sampling variability, along with some student material.

The process usually involves creating samples and then comparing them both with the parent population and amongst themselves (SEM demonstration).

I offer four ways of doing this below: SPSS (two methods), online apps and R.

1 Using SPSS

1.1 Using SPSS syntax

The traditional way of investigating random samples in SPSS is to use the SPSS syntax window.

1.1.1 One sample

A simple example that creates a single sample of 1000 cases from a Normal distribution with mean = 100 and SD = 15. Run the syntax, then use the Analyze menu to get the results.

SPSS syntax:

    * Example of creating a random sample.
    * Create 1000 cases for the sample.
    NEW FILE.
    INPUT PROGRAM.
    LOOP #I = 1 TO 1000.
    COMPUTE X = RV.NORMAL(100,15).
    END CASE.
    END LOOP.
    END FILE.
    END INPUT PROGRAM.
    EXECUTE.

And to get a boxplot:

[Figure: boxplot of the sample]

The next exercise is to produce several samples.

1.1.2 Multiple samples all the same size and from the same distribution

This creates variables called V20 to V30, all the same size. There are two versions of the syntax, depending on whether or not you have already run the syntax above.

If you have run the above syntax (the new variables are added to the active dataset, so each one has as many cases as that dataset):

    NUMERIC V20 TO V30.
    VECTOR v = V20 TO V30.
    * Loop over the samples (columns); SPSS runs this once for every existing case (row).
    LOOP #i = 1 TO 11.
    COMPUTE v(#i) = RV.NORMAL(100,15).
    END LOOP.
    EXECUTE.

If you have not run the above syntax:

    NEW FILE.
    INPUT PROGRAM.
    NUMERIC V20 TO V30.
    VECTOR v = V20 TO V30.
    * Loop for sample size (rows).
    LOOP #case = 1 TO 100.
    * Loop for each sample (columns).
    LOOP #i = 1 TO 11.
    * Fill the column (sample) #i of the current row (case).
    COMPUTE v(#i) = RV.NORMAL(100,15).
    END LOOP.
    END CASE.
    END LOOP.
    END FILE.
    END INPUT PROGRAM.
    EXECUTE.

Typical output:

Descriptive Statistics

    Variable   N     Mean       Std. Deviation
    V20        100   101.5421   14.20531
    V21        100   101.0039   15.53362
    V22        100    99.1124   14.14247
    V23        100    97.6240   14.07071
    V24        100    99.9382   14.43248
    V25        100   100.1818   13.80487
    V26        100   100.4502   15.45697
    V27        100   101.6055   15.04477
    V28        100   100.8888   14.05551
    V29        100   101.6523   14.24829
    V30        100    99.9043   14.19884

1.1.3 Samples of different sizes

There are two main ways to do this: you can create all the samples in a single variable and add a grouping variable, or alternatively create several variables with a different sample size in each. For various reasons the former strategy is best; however, just for interest, I include below the latter option of putting the samples of different sizes in separate variables:

    NEW FILE.
    INPUT PROGRAM.
    LOOP #count = 1 TO 500.
    DO IF (#count < 31).
    COMPUTE samp30 = RV.NORMAL(100,15).
    END IF.
    DO IF (#count < 51).
    COMPUTE samp50 = RV.NORMAL(100,15).
    END IF.
    DO IF (#count < 101).
    COMPUTE samp100 = RV.NORMAL(100,15).
    END IF.
    COMPUTE samp500 = RV.NORMAL(100,15).
    END CASE.
    END LOOP.
    END FILE.
    END INPUT PROGRAM.
    EXECUTE.

This approach (i.e. a separate variable for each sample) causes problems when analysing the data, as SPSS considers the smaller samples to have missing values!
Therefore the better solution is to use a grouping variable, that is, an identifier indicating the sample each observation (case) belongs to.

The next SPSS syntax script duplicates the above but creates just two variables (one called GROUP, the other VALUE):

    new file.
    input program.
    loop #i=1 to 30.
    compute group=1.
    compute value=rv.Normal(100,15).
    end case.
    end loop.
    loop #i=1 to 50.
    compute group=2.
    compute value=rv.Normal(100,15).
    end case.
    end loop.
    loop #i=1 to 100.
    compute group=3.
    compute value=rv.Normal(100,15).
    end case.
    end loop.
    loop #i=1 to 500.
    compute group=4.
    compute value=rv.Normal(100,15).
    end case.
    end loop.
    end file.
    end input program.
    execute.

The code above is not the most elegant; you could instead use one loop with a number of 'DO IF' statements:

    new file.
    input program.
    loop #i=1 to 500.
    DO IF (#i<31).
    compute group=1.
    compute value=rv.Normal(100,15).
    end case.
    END IF.
    DO IF (#i<51).
    compute group=2.
    compute value=rv.Normal(100,15).
    end case.
    END IF.
    DO IF (#i<101).
    compute group=3.
    compute value=rv.Normal(100,15).
    end case.
    END IF.
    compute group=4.
    compute value=rv.Normal(100,15).
    end case.
    end loop.
    end file.
    end input program.
    SORT CASES by group(a).
    execute.

Both the above SPSS syntax files do the same thing, that is, produce four samples of different sizes from a normal distribution with mean = 100 and SD = 15.

Obviously you could easily change the parameters of the distribution or even change the distribution itself. Two alternatives are:

    the uniform: rv.Uniform(lower, upper)
    the exponential: rv.exp(mean)

Using the Explore command in SPSS shows the SD for each group and also a box plot.

Having carried out the above tasks, it is then possible to complete the following table.
    Sample size                    Minimum value   Mean   Maximum value   Standard deviation
    30
    50
    100
    500
    Theoretical population value

The above exercise will demonstrate:

- Standard deviation varies little over sample size - there must be a sample adjustment factor in it!
- Mean also varies little from the population mean of 100 (repeated sampling for smaller samples produces wider variation - next exercise).

The above exercise can then be repeated, changing the sample sizes to 3, 10, 20 and 30:

    new file.
    input program.
    loop #i=1 to 3.
    compute group=1.
    compute value=rv.Normal(100,15).
    end case.
    end loop.
    loop #i=1 to 10.
    compute group=2.
    compute value=rv.Normal(100,15).
    end case.
    end loop.
    loop #i=1 to 20.
    compute group=3.
    compute value=rv.Normal(100,15).
    end case.
    end loop.
    loop #i=1 to 30.
    compute group=4.
    compute value=rv.Normal(100,15).
    end case.
    end loop.
    end file.
    end input program.
    execute.

Given these are random samples, each person will obtain a different result; however, what they should notice is that the means (medians in the above boxplot) vary less as the sample size gets larger. You could ask them to repeatedly create multiple random samples of varying size and then plot the means (technically, what we would produce is a sampling distribution of the mean), but at this stage it is probably better to revert to online simulations (see below).

1.1.4 Sampling distributions

Typical student explanation:

So far we have looked at the characteristics of one or more samples from a population, but what about the characteristics across samples? Why, you may well ask, would we bother with such additional complexity? Just consider this: I have a valuable substance (Guinness) and want to take as small a sample as possible to find an accurate mean value of substance X. So how can we work out how small a sample will still produce an accurate mean value?

To answer this question we obviously need to assess the variation of means across samples of a specific size. While we have done this for a small number of samples, we will now consider many samples to produce a distribution.
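As a minimal sketch of what "many samples" looks like in practice, the following R code draws repeated samples from the Normal(100, 15) population used in the SPSS exercises and records the mean of each one. It is not part of the original exercises: the seed, the object names, and the choice of 10,000 samples at sizes 5 and 25 are illustrative assumptions. It then sets the standard deviation of those means beside the population SD divided by the square root of the sample size.

```r
# Sketch: build a sampling distribution of the mean by brute force.
# All names and settings below are illustrative assumptions.
set.seed(2011)            # arbitrary seed, only so the run can be repeated
pop_mean  <- 100          # population mean used in the SPSS exercises
pop_sd    <- 15           # population SD used in the SPSS exercises
n_samples <- 10000        # number of repeated samples drawn at each size
sizes     <- c(5, 25)     # sample sizes to compare

# Draw n_samples random samples of size n and return the mean of each one
sample_means <- function(n) {
  replicate(n_samples, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))
}

means_by_size <- lapply(sizes, sample_means)
names(means_by_size) <- paste0("n", sizes)

# Observed spread of the sample means beside pop_sd / sqrt(n)
sem_table <- data.frame(
  size        = sizes,
  observed_sd = sapply(means_by_size, sd),
  theoretical = pop_sd / sqrt(sizes)
)
print(sem_table)

# Side-by-side boxplots of the two sets of sample means
boxplot(means_by_size, xlab = "Sample size", ylab = "Sample mean")
```

The means of the smaller samples should be visibly more spread out than those of the larger samples; changing `sizes` gives further variations on the same exercise.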
2 Online Apps

Go to the app at this website [URL missing], where we can ask for repeated samples of different sizes and then plot their means. I have done it for 10,000 samples of size 5 and also of size 25. Students should notice how much more spread out the means are for the smaller samples.

3 The standard error of the Mean

Student explanation:

The Standard Error of the Mean (SEM) provides a measure of the standard deviation of sample means. In other words, it is just another standard deviation, but now we are at the between-sample level rather than the within-sample level. Because we are working at a different level the name has changed, although the idea concerning spread is the same.

From the above exercise, we have both the population data and information about a set of samples from it. Interestingly, all we need to calculate the SEM is information from a single sample. We will now compare the observed answer (for the samples in the above screen shot = 2.23 for samples of size 5) with a specific formula. This formula is known as the SEM (Standard Error of the Mean):

$$\sigma_{\bar{x}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} = \frac{\text{standard deviation of sample}}{\text{square root of number in sample}} = \frac{5}{\sqrt{5}} = 2.236$$

and for the sample size of 25, SEM = 5/√25 = 1.

We can see from the above formula that the Standard Error of the Mean is equal to the standard deviation divided by the square root of the sample size. We have samples of size 5 and 25, so we can calculate the SEM for each one. You will notice that the observed SD of the sample means (2.23) is almost identical to the value given by the formula (2.236) - this is truly amazing. We can predict the distribution of means of random samples without carrying out the sampling, just by using the SEM formula.

3.1.1 Effect of sample size upon SEM - formula appreciation

We know that the formula for the standard error of the mean (SEM) is:

$$\sigma_{\bar{x}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} = \frac{\text{standard deviation of sample}}{\text{square root of number in sample}}$$

Let's consider what happens to the SEM as the sample size changes.
From the above equation, the top value (numerator) remains constant while the bottom value (denominator) increases. What happens in this instance, as with any fraction, is that the total value decreases; therefore, as sample size increases the variability of the sample means decreases. You can think of it in terms of accuracy: the larger the random sample, the more accurately its mean estimates the population mean. A statistician would say that this indicates the sample mean is a consistent estimator.

As N increases -> SEM decreases

To learn more about SPSS syntax see the excellent tutorial, including datasets and videos, at: [URL missing]

4 Using SPSS script

SPSS scripts allow users to create additional dialog boxes, and several people have produced scripts which provide dialog boxes for creating random samples. This is probably an easier alternative to learning SPSS syntax. [URL missing] provides three possible scripts.

Right mouse click on the "Generate Random variables EN SBS" link, select the "Save Link as" option to save the script file to your local drive, and change the default extension from txt to sbs.

Back in SPSS:

[Screenshot]

This allows you to create multiple samples of a specific size. You can also run the script several times to create many samples by un-checking the "Replace the working data file" option.

4.1.1 Alternative script - Distribution.sbs

You will then be presented with:

[Screenshot]

Type in the sample size you want:

Step 1 - click Next to allow you to select:
Step 2 - the distribution (I selected Normal).
Step 3 - you can change the mean and SD.

Once you have created one sample you can create up to 20 different ones, each time clicking Next.

To finish, click the Finish button!

Typical results using the menu option Explore:

Case Processing Summary

                      Valid            Missing          Total
    value   group     N     Percent    N     Percent    N     Percent
            1.00      30    100.0%     0     .0%        30    100.0%
            2.00      20    100.0%     0     .0%        20    100.0%
            3.00      15    100.0%     0     .0%        15    100.0%
            4.00      10    100.0%     0     .0%        10    100.0%

5 In R

R is not for the lazy! But it is amazingly versatile.
This section is for completeness.

    # this is a comment
    # create a plot: x axis = 0 to 62, y axis = 50 to 150
    # give the axes labels
    plot(c(0, 62), c(50, 150), type = "n", xlab = "Sample size", ylab = "mean")
    # sample size 3 to 61 in steps of 2 (= df)
    for (df in seq(3, 61, 2)) {
      # number of samples (= 60) at each size
      for (i in 1:60) {
        # create a random sample of size df from a normal distribution
        # and store it in the vector (column) x
        x <- rnorm(df, mean = 100, sd = 15)
        points(df, mean(x))
      } # end for each group of samples
    } # end for each sample size

You can see an animated version of the above at: [URL missing]. This site has a large number of animations, all written in R code using the free R animation package. To the casual visitor all the R code is hidden away; they just see the beautiful animations.

With more R knowledge one can create more complex examples. The following is taken from Maindonald & Braun, 3rd ed., 2010, p. 89. It produces 1000 simulated samples at each of several sample sizes from a skewed distribution. The code below can be used as the basis for a large number of similar exercises.
    ##############################
    ## from Maindonald & Braun p. 89-90
    ## CUP 2010
    ## uses the lattice library
    library(lattice)
    ## function to generate n sample values
    sampvals <- function(n) exp(rnorm(n, mean = 0.5, sd = 0.3))
    ## Means across rows of a dimension nsamp x sampsize matrix of
    ## sample values gives nsamp means of samples of size sampsize.
    samplingDist <- function(sampsize = 3, nsamp = 1000, FUN = mean)
      apply(matrix(sampvals(sampsize * nsamp), ncol = sampsize), 1, FUN)
    size <- c(3, 9, 30)
    ## Simulate means of samples of 3, 9 and 30; place in dataframe
    df <- data.frame(y3  = samplingDist(sampsize = size[1]),
                     y9  = samplingDist(sampsize = size[2]),
                     y30 = samplingDist(sampsize = size[3]))
    ## use strip.custom to customise the strip labelling
    doStrip <- strip.custom(strip.names = TRUE,
                            factor.levels = as.expression(size),
                            var.name = " sample size",
                            sep = expression(" = "))
    ## Then include the argument 'strip=doStrip' in the call to densityplot
    ## Simulate source population (sampsize = 1)
    y <- samplingDist(sampsize = 1)
    densityplot(~ y3 + y9 + y30, data = df, outer = TRUE, layout = c(3, 1),
                plot.points = FALSE,
                panel = function(x, ...) {
                  panel.densityplot(x, ..., col = "black")
                  panel.densityplot(y, col = "gray40", lty = 2, ...)
                },
                strip = doStrip)

6 Online presentations and other tools

The New Zealand Census at School website has a section on informal inference called "The eyes have it", which contains animated GIFs that people can use in their presentations, and also an excellent presentation concerning sampling variability and how this can informally relate to hypothesis testing; see: [URL missing]