Topic 2: Distributions and basic descriptive statistics for quantitative variables
Jeroen Claes

Contents
- How to load and view data from a CSV file
- Working with data.frames
- Measures of central tendency
- Normal distribution
- Dispersion measures
- What to do when your data is not distributed normally

1. How to load data from a CSV file

1. How to load data from a CSV file (1/2)
In this class, we will be working with data by Balota et al. (2007), provided by Levshina (2015):
- Data in CSV (comma-separated values) format
- Experiment in which subjects had to decide whether or not a word is an existing word
- Columns: Word, Word length, Word frequency, and Mean reaction time

1. How to load data from a CSV file (2/2)
To load data, we use the package readr and we load the data straight from the course website. This works the same for local data sets.

    library(readr)
    dataSet <- read_csv(";)

1. How to create a CSV file
In Excel:
- Save as ‘CSV’ (Comma-separated values)
- Make sure you choose UTF-8 encoding (on Excel 2016 for Mac: CSV UTF-8 Encoding)
- If your Excel is set to Belgian locale, the separator will not be the comma (,) but the semicolon (;). In that case, you have to use the read_csv2 function to load your data.

2. Working with data.frames

2.1 Structure of data.frames (1/3)
Now let’s check what the object dataSet looks like.
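For experimenting without the course file, a small stand-in can be parsed from inline text. The sketch below is an illustration under stated assumptions rather than part of the course material: it uses base R's read.csv() and read.csv2(), which follow the same comma versus semicolon convention as readr's read_csv() and read_csv2(), and the names csv_comma, csv_semicolon, toy_comma and toy_semicolon are invented here. It also assumes that a semicolon-separated export writes decimals with a comma, which is what read.csv2() expects by default.

```r
# Two rows of the data set, written out as comma-separated text.
csv_comma <- "Length,Freq,Mean_RT,Word
8,131,819.19,marveled
10,82,977.63,persuaders"

# The same rows as a semicolon-separated export would write them
# (assumption: semicolon as separator, comma as decimal mark).
csv_semicolon <- "Length;Freq;Mean_RT;Word
8;131;819,19;marveled
10;82;977,63;persuaders"

# read.csv() splits on commas; read.csv2() splits on semicolons
# and reads "819,19" as the number 819.19.
toy_comma     <- read.csv(text = csv_comma, stringsAsFactors = FALSE)
toy_semicolon <- read.csv2(text = csv_semicolon, stringsAsFactors = FALSE)

# Both calls should recover the same two-row, four-column data frame.
all.equal(toy_comma, toy_semicolon)
```

Reading the semicolon version with read.csv() instead would put everything into a single column, which is the symptom to look for when the wrong function is used. Either toy object can be inspected in the same way as dataSet.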
Recall that we can use str() for this purpose.

    str(dataSet)

2.1 Structure of data.frames (2/3)
We can also get a nice table-like view of it:

    View(dataSet)

2.1 Structure of data.frames (3/3)
With dim we get the dimensions of the data.frame (the number of rows, followed by the number of columns), with nrow the number of rows, and with ncol the number of columns.

    dim(dataSet)
    ## [1] 100 4
    nrow(dataSet)
    ## [1] 100
    ncol(dataSet)
    ## [1] 4

2.2 head
Print the first N lines from the data.frame

    # first 6 lines
    head(dataSet)
    ## # A tibble: 6 x 4
    ##   Length  Freq Mean_RT Word
    ##    <int> <int>   <dbl> <chr>
    ## 1      8   131  819.19 marveled
    ## 2     10    82  977.63 persuaders
    ## 3      7     0  908.22 midmost
    ## 4      6   592  766.30 crutch
    ## 5     12     2 1125.42 resuspension
    ## 6     12     9  948.33 efflorescent

2.3 tail
Print the last N lines from the data.frame

    # last 10 lines
    tail(dataSet, 10)
    ## # A tibble: 10 x 4
    ##    Length  Freq Mean_RT Word
    ##     <int> <int>   <dbl> <chr>
    ##  1      6   757  641.06 club's
    ##  2      6   775  775.59 minced
    ##  3      5 14860  602.09 drain
    ##  4      7 31679  576.58 justice
    ##  5     10   181  792.93 numerology
    ##  6      6   158  876.56 powwow
    ##  7      8   429  710.63 backdoor
    ##  8     11    93  875.27 maladjusted
    ##  9      6  2434  627.11 Elaine
    ## 10     11   162 1458.75 diacritical

2.4 glimpse
For data with many columns, RStudio will omit the columns that do not fit the console window. The function glimpse in the dplyr package prints every column on its own line, together with its type and its first few values.

    library(dplyr)
    glimpse(dataSet)
    ## Observations: 100
    ## Variables: 4
    ## $ Length  <int> 8, 10, 7, 6, 12, 12, 3, 11, 11, 5, 6, 6, 11, 4, 11, 8,...
    ## $ Freq    <int> 131, 82, 0, 592, 2, 9, 14013, 15, 48, 290, 3264, 3523,...
    ## $ Mean_RT <dbl> 819.19, 977.63, 908.22, 766.30, 1125.42, 948.33, 641.6...
    ## $ Word    <chr> "marveled", "persuaders", "midmost", "crutch", "resusp...

2.5 summary
The function summary prints a general overview of the values (minimum, maximum, median, mean, …) in a data.frame.

    summary(dataSet)
    ##      Length           Freq            Mean_RT           Word
    ##  Min.   : 3.00   Min.   :    0.0   Min.   : 564.2   Length:100
    ##  1st Qu.: 6.00   1st Qu.:   53.5   1st Qu.: 713.1   Class :character
    ##  Median : 8.00   Median :  310.5   Median : 784.9   Mode  :character
    ##  Mean   : 8.23   Mean   : 3350.3   Mean   : 808.3
    ##  3rd Qu.:10.00   3rd Qu.: 2103.2   3rd Qu.: 905.2
    ##  Max.   :15.00   Max.   :75075.0   Max.   :1458.8

2.6 Dollar-sign operator: extracting a single column
To extract a column (a single vector) from the data.frame, we use the dollar-sign operator, e.g., dataSet$Word

    # first 6 lines of just the column Word
    head(dataSet$Word)
    ## [1] "marveled"     "persuaders"   "midmost"      "crutch"
    ## [5] "resuspension" "efflorescent"
    # last 10 lines
    tail(dataSet$Word, 10)
    ##  [1] "club's"      "minced"      "drain"       "justice"     "numerology"
    ##  [6] "powwow"      "backdoor"    "maladjusted" "Elaine"      "diacritical"

2.7 Extracting multiple columns
To extract multiple columns from a data.frame, we put a character vector of the column names we want to extract between square brackets, preceded by a comma:

    # first 6 lines of Word and Freq
    head(dataSet[, c("Word", "Freq")])
    ## # A tibble: 6 x 2
    ##   Word          Freq
    ##   <chr>        <int>
    ## 1 marveled       131
    ## 2 persuaders      82
    ## 3 midmost          0
    ## 4 crutch         592
    ## 5 resuspension     2
    ## 6 efflorescent     9

2.8 Subsetting by rows
To extract rows from a data.frame, we put a numeric vector of row numbers between square brackets, followed by a comma:

    dataSet[c(1, 2, 5), ]
    ## # A tibble: 3 x 4
    ##   Length  Freq Mean_RT Word
    ##    <int> <int>   <dbl> <chr>
    ## 1      8   131  819.19 marveled
    ## 2     10    82  977.63 persuaders
    ## 3     12     2 1125.42 resuspension

2.9 Subsetting: the system
You will have noticed that the general pattern for subsetting data.frames is the following:

    dataframeName[rows, columns]

2.10 Conditional selection by row (1/2)
Instead of writing the row numbers ourselves, we can also ask R to compute the indices of the rows that fulfill some condition.
The following statement returns the first six of all rows for which Freq > 100:

    head(dataSet[dataSet$Freq > 100, ])
    ## # A tibble: 6 x 4
    ##   Length  Freq Mean_RT Word
    ##    <int> <int>   <dbl> <chr>
    ## 1      8   131  819.19 marveled
    ## 2      6   592  766.30 crutch
    ## 3      3 14013  641.67 row
    ## 4      5   290  654.12 jumpy
    ## 5      6  3264  583.81 enjoys
    ## 6      6  3523  667.18 screws

2.10 Conditional selection by row (2/2)
Other useful conditions are:
- is.na() (e.g., dataSet[is.na(dataSet$Freq), ]) - “is an empty value”
- !is.na() - “is NOT an empty value”
- < - “smaller than”
- <= - “smaller than or equal to”
- > - “larger than”
- >= - “larger than or equal to”
- %in% (e.g., dataSet[dataSet$Freq %in% c(1, 2, 4), ]) - “one of list”
- %in% with the NOT operator (e.g., dataSet[!dataSet$Freq %in% c(1, 2, 4), ]) - “not one of list”
Note the comma after the condition: without it, R would try to select columns instead of rows.

2.11 Adding columns
Adding columns is just as easy as selecting columns.
We type the data.frame name, followed by the dollar-sign operator and a name for the new column. Then we can assign whatever value we want to the new column.

    dataSet$LogFreq <- log(dataSet$Freq)
    glimpse(dataSet)
    ## Observations: 100
    ## Variables: 5
    ## $ Length  <int> 8, 10, 7, 6, 12, 12, 3, 11, 11, 5, 6, 6, 11, 4, 11, 8,...
    ## $ Freq    <int> 131, 82, 0, 592, 2, 9, 14013, 15, 48, 290, 3264, 3523,...
    ## $ Mean_RT <dbl> 819.19, 977.63, 908.22, 766.30, 1125.42, 948.33, 641.6...
    ## $ Word    <chr> "marveled", "persuaders", "midmost", "crutch", "resusp...
    ## $ LogFreq <dbl> 4.8751973, 4.4067192, -Inf, 6.3835066, 0.6931472, 2.19...

2.12 Exercises
Go to the exercises under data.frames

3. Distributions and measures of central tendency

3. Distributions and measures of central tendency (1/2)
- In statistics, a set of scores/values/numbers is called a distribution.
- The values in the columns Length, Freq, and Mean_RT each constitute a distribution.
- The typical values of a distribution are called its central tendency.
- Before we can establish relationships between variables, we have to understand their distributions first!

3. Distributions and measures of central tendency (2/2)
We can describe distributions by calculating the following statistics:
- Mean (average)
- Median
- Quantiles/quartiles
- Mode
- Minimum
- Maximum
Each of these statistics allows us to say something about the typical values (central tendency) of the distribution.

3.1 Mean
- The mean of a distribution equals the sum of the values of the distribution divided by the number of values in the distribution.
- Summarizes the typical values of the distribution in a single number

    # Mean word length
    mean(dataSet$Length, na.rm=TRUE)
    ## [1] 8.23

3.2 Median
- The median of a distribution is the value that marks the middle of the distribution.
- If a distribution has 101 values and we rank them from small to large, the median will be the 51st value. With an even number of values, such as 100, the median is the mean of the two middle values (the 50th and the 51st).

    median(dataSet$Length, na.rm=TRUE)
    ## [1] 8

3.3 Median
When combined with the mean, the median provides a good idea of the central tendency of a distribution:
- The mean is prone to shift upwards or downwards if there is even a single atypical value (called an outlier in statistics)
- The median is hardly affected; in the example below it remains identical

    # Compare
    median(c(1, 2, 3, 4, 5, 6, 7, 8000, 190, 1000000)) # mean = 100821.8
    ## [1] 5.5
    median(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)) # mean = 5.5
    ## [1] 5.5

3.4 Quantiles
- A related concept is the quantile (the median is the 0.5 quantile)
- The 0.25 quantile is also called the first quartile, the median is the second quartile, etc.
- A quantile expresses that a given percentage of the values of the distribution is lower than a particular value
- Useful to study the spread of the values in the distribution

    # One quantile
    quantile(dataSet$Length, 0.25)
    ## 25%
    ##   6
    # Multiple quantiles
    quantile(dataSet$Length, c(0.25, 0.5, 0.75))
    ## 25% 50% 75%
    ##   6   8  10

3.5 Mode
- Most frequent value in a distribution
- Not widely used

    # table() counts the number of times each value occurs in a vector
    tab <- table(dataSet$Length)
    # We sort the table by its counts, in decreasing order
    sorted <- sort(tab, decreasing = TRUE)
    # We extract the first entry: the most frequent value and its count
    mode <- sorted[1]
    # The number 8 is the mode. It occurs 16 times
    mode
    ##  8
    ## 16
    # In one line:
    sort(table(dataSet$Length), decreasing = TRUE)[1]
    ##  8
    ## 16

3.6 Minimum
Minimum: lowest value

    min(dataSet$Length, na.rm=TRUE)
    ## [1] 3

3.7 Maximum
Maximum: highest value

    max(dataSet$Length, na.rm=TRUE)
    ## [1] 15

3.8 Calculating most of these at once
By calling summary on a single column, you can calculate many of the statistics discussed here with one command.

    summary(dataSet$Length)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##    3.00    6.00    8.00    8.23   10.00   15.00

3.9 Describing a distribution
- Mean
- Median
- As a rule of thumb: if the mean is lower than the median, the distribution is negatively skewed. If the mean is higher than the median, it is positively skewed.

3.10 Exercises
Go to the exercises under 2. Measures of central tendency

4. The normal distribution
- Statisticians have identified different types of distributions
- The most important is the normal or Gaussian distribution
- Most statistical tests assume that your data approaches this theoretical distribution:
  - Important to check whether your data is normally distributed
  - Otherwise statistical tests may produce unreliable results!
- Many, but by no means all, variables approximate a normal distribution when the sample is sufficiently large, so this always has to be checked

4.1 Characteristics (1/2)
- The mean and the median are not very different (easy test)
- Most values occur around the mean; the fewest values occur near the minimum and the maximum
- The distribution is symmetrical: there are about as many values that are smaller than the mean as there are values that are larger than the mean

4.1 Characteristics (2/2)
This pattern shows up as a bell-shaped curve in density plots.

4.2 Checking for normality
- Plotting (preferred)
- Shapiro-Wilk test

4.2.1 Plotting

4.2.1.1 Plotting in ggplot
Let’s take a moment to learn the basic syntax of ggplot, a highly structured (almost language-like) plotting package.
This command sets up the basis of ggplot:
- It defines the data we are working with (dataSet)
- It defines the aesthetics (aes()) that are inherited by the other plotting functions that follow behind the + signs:
  - Our x axis (the horizontal axis) will plot the Length variable
  - Our y axis (the vertical axis) will plot the Mean_RT variable
  - If we don’t specify a y axis, its values have to be calculated by a plotting function (stat="bin" or stat="density")

    library(ggplot2)
    ggplot(dataSet, aes(x=Length, y=Mean_RT))

4.2.1.1 Plotting in ggplot
After the basic setup, we can add multiple plot layers by adding plotting functions separated by a + sign:
- geom_line() for line graphs
- geom_bar() for bar charts
- geom_vline() for vertical lines
- geom_hline() for horizontal lines (geom_abline() draws a line defined by a slope and an intercept)
- geom_point() for points (scatterplots)
- stat_summary() to summarise the y values in your data as a function of x before plotting (here we take the mean of all Mean_RT values that correspond to the same Length value)

    library(ggplot2)
    # In ggplot2 3.3.0 and later, the argument fun replaces the deprecated fun.y
    ggplot(dataSet, aes(x=Length, y=Mean_RT)) +
      stat_summary(fun = "mean", geom = "line")

4.2.1.1 Plotting in ggplot
[Figure: line graph of the mean Mean_RT for each value of Length, produced by the command above]

4.2.1.2 Density plots (1/2)
Suppose we want to know if our variable Length approaches a normal distribution:
- We draw up a density plot, a smoothed curve that visualizes how often values occur along the range of our distribution
- We add a red vertical line at the mean
- We look for the characteristic bell curve and symmetrical pattern
- Observe that geom_vline has its own aesthetic: xintercept, which defines the point where the line crosses/intercepts the x axis (the x-intercept).
We set this aesthetic to the mean of our Length variable.

    library(ggplot2)
    ggplot(dataSet, aes(x=Length)) +
      geom_line(stat="density") +
      geom_vline(aes(xintercept=mean(dataSet$Length)), color="red")

4.2.1.2 Density plots (2/2)
[Figure: density plot of Length with a red vertical line at the mean, produced by the command above]

4.2.1.3 Histograms (1/2)
Suppose we want to know if our variable Length approaches a normal distribution:
- We draw up a histogram, an ordered plot that visualizes the number of occurrences of the values in our distribution, grouped into bins
- We add a red vertical line at the mean
- We look for the characteristic bell curve and symmetrical pattern

    library(ggplot2)
    ggplot(dataSet, aes(x=Length)) +
      geom_histogram(bins=11) +
      geom_vline(aes(xintercept=mean(dataSet$Length)), color="red")

4.2.1.3 Histograms (2/2)
[Figure: histogram of Length with a red vertical line at the mean, produced by the command above]

4.2.1.4 QQ-plots (1/2)
- Another way to check whether your data are normally distributed is a QQ-plot
- If the dots follow the line closely, the data are approximately normally distributed

    qqnorm(dataSet$Length)
    qqline(dataSet$Length)

4.2.1.4 QQ-plots (2/2)
[Figure: normal QQ-plot of Length with reference line, produced by the commands above]

4.2.2 Shapiro-Wilk test of normality
- If p < 0.05 (don’t worry about what that means for now), the data deviate significantly from a normal distribution
- Plotting is preferred (especially QQ-plots), because the outcome of this test depends on sample size
- Here p = 0.08559, so the test gives no evidence that Length deviates from normality

    shapiro.test(dataSet$Length)
    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data:  dataSet$Length
    ## W = 0.97756, p-value = 0.08559

4.3 Exercises
Go to the exercises under 3. The normal distribution

5. Measures of dispersion

5. Measures of dispersion
If your data are normally distributed, the mean and the median are good representatives of the typical values of a distribution.
Our analysis of the Freq and Mean_RT variables has shown that this is not the case for non-normal distributions.
Measures of dispersion express how much variation there is between the values of a distribution.
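A small made-up example may help to see why the centre alone is not enough. In the sketch below, the vectors narrow and wide are invented for illustration (they are not taken from the data set), and the average absolute distance from the mean is used only as an informal yardstick for spread.

```r
# Two made-up samples with the same mean and median but a different spread.
narrow <- c(48, 49, 50, 51, 52)
wide   <- c(10, 30, 50, 70, 90)

# The central tendency is identical: both means and both medians are 50.
c(mean(narrow), mean(wide))
c(median(narrow), median(wide))

# The average absolute distance from the mean is not:
# 1.2 for narrow, 24 for wide.
spread_narrow <- mean(abs(narrow - mean(narrow)))
spread_wide   <- mean(abs(wide - mean(wide)))
c(spread_narrow, spread_wide)
```

Reporting only the mean and the median would make these two samples look the same; the difference between them is what measures of dispersion capture.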
They include:
- Range
- Variance
- Standard deviation
- Interquartile range
- Median absolute deviation
A boxplot allows you to display dispersion graphically.

5.1 Range
The range is the difference between the maximum and the minimum value of a distribution.

    max(dataSet$Length) - min(dataSet$Length)
    ## [1] 12

5.2 Variance
- Sum of the squared deviations from the mean, divided by the sample size minus one
- The larger the variance, the larger the differences between the individual values and the mean
- Not on the same scale as the data: the deviations are squared (without squaring, the deviations from the mean would sum to zero)
- If the distribution is non-normal, the variance will lead us to overestimate how far a typical score lies from the mean

    # Same as:
    # sum((dataSet$Length - mean(dataSet$Length))^2) / (length(dataSet$Length) - 1)
    var(dataSet$Length)
    ## [1] 6.259697

5.3 Standard deviation
- The standard deviation is the square root of the variance
- By taking the square root of the variance (which is based on the squared deviations from the mean), we get back to the original units of our data
- Expresses, roughly, the average deviation from the mean
- If the distribution is non-normal, the standard deviation will lead us to overestimate how far a typical score lies from the mean

    sd(dataSet$Length)
    ## [1] 2.501939

5.4 Interquartile range
- The interquartile range is the difference between the third quartile (the value below which 75% of the values fall) and the first quartile (the value below which 25% of the values fall)
- More robust for non-normal distributions than the variance and the standard deviation

    IQR(dataSet$Length)
    ## [1] 4

5.5 Median absolute deviation
- The median of the absolute deviations between the values in the distribution and the median
- By default, R's mad() multiplies this median by the constant 1.4826, which makes the result comparable to the standard deviation for normally distributed data
- Based on the median, so it is not sensitive to outliers

    mad(dataSet$Length)
    ## [1] 2.9652

5.6 Displaying dispersion graphically (1/2)
- A box-and-whisker plot displays the median (the line inside the box), the first and third quartiles (the edges of the box), and whiskers that extend to the most extreme values within 1.5 times the interquartile range from the box
- Values beyond the whiskers are drawn as individual points; these are potential outliers
- If the distribution is skewed, the whisker will be longer on one end of the graph

    library(ggplot2)
    ggplot(dataSet, aes(x=1, y=Length)) +
      geom_boxplot()

5.6 Displaying dispersion graphically (2/2)
[Figure: box-and-whisker plot of Length, produced by the command above]

5.7 How to report it all?
When you describe a distribution, you report:
- Mean
- Median
AND:
- If the data are normally distributed, you report the standard deviation
- If the data are not normally distributed, you report the IQR or the median absolute deviation (mad)
- Ideally you also provide a box-and-whisker plot

5.8 Exercises
Go to the exercises under 4. Measures of dispersion

6. What to do when your data is not distributed normally

6. What to do when your data is not distributed normally
- We have seen that Freq and Mean_RT are not normally distributed.
- This does not mean that we cannot analyse them, but we will have to be careful. Most tests, and statistics such as the standard deviation and the variance, do not behave well when the data is not normally distributed or contains outliers.
- In general, when you encounter a non-normal distribution you have two options:
  - Remove outliers
  - Apply a transformation

6.1 Remove outliers (1/6)
- If the deviation from the normal distribution is caused by a few atypical values, you may exclude these from your analysis
- Let’s consider Mean_RT

    summary(dataSet$Mean_RT)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##   564.2   713.1   784.9   808.3   905.2  1458.8

6.1 Remove outliers (2/6)
- The summary shows that the interquartile range is 192.1 (905.2 - 713.1)
- The difference between the 3rd quartile and the maximum value is 553.6 (1458.8 - 905.2)
- This suggests that there may be a few exceptionally large values in the right tail of the distribution

6.1 Remove outliers (3/6)
- Subjects appear to have had much difficulty recognizing the top-4 words diacritical, dessertspoon, acquisitiveness, and resuspension
- Maybe these were not the best test items?

    Length  Freq  Mean_RT  Word             LogFreq
        11   162     1459  diacritical        5.088
        12    11     1314  dessertspoon       2.398
        15     4     1217  acquisitiveness    1.386
        12     2     1125  resuspension       0.693
        11   185     1089  denominated        5.221
        11    46     1055  euphemistic        3.829

6.1 Remove outliers (4/6)
- It is usually a good idea to select outliers based on z-scores
- z-scores express how far a value is removed from the mean in standard deviation units
- Positive z-scores belong to values that are larger than the mean, negative z-scores to values that are smaller

    # Same as: (dataSet$Mean_RT - mean(dataSet$Mean_RT)) / sd(dataSet$Mean_RT)
    dataSet$zscores <- scale(dataSet$Mean_RT)
    summary(dataSet$zscores)
    ##        V1
    ##  Min.   :-1.5931
    ##  1st Qu.:-0.6210
    ##  Median :-0.1522
    ##  Mean   : 0.0000
    ##  3rd Qu.: 0.6328
    ##  Max.   : 4.2459

6.1 Remove outliers (5/6)
- With this score we remove all values that are more than two standard deviations removed from the mean (absolute value of z > 2) (Urdan, 2010: 18)
- We need the absolute values (we have to ignore their +/- sign), so we wrap the scores with abs
- The command keeps the rows whose absolute z-score is at most 2, which removes all the others

    dataSet <- dataSet[abs(dataSet$zscores) <= 2, ]

6.1 Remove outliers (6/6)
To check if our problem is solved, we can plot our variable.

    ggplot(dataSet, aes(x=Mean_RT)) +
      geom_line(stat="density") +
      geom_vline(aes(xintercept=mean(dataSet$Mean_RT)), color="red")

6.2 Apply a transformation (1/6)
- The variable Freq cannot be brought to a normal distribution by removing a few outliers
- It has the typical power-law distribution of word frequencies (Zipf’s law):
  - A few words occur very often
  - The rest of the words have a very low frequency
- Distribution with a very broad range
- Can only be corrected with a transformation

6.2 Apply a transformation (2/6)
Popular transformations are:
- Logarithm:
  - Expresses to which power we have to raise the mathematical constant e = 2.718281828459 (the base) to obtain the input number
  - log(1000) = 6.907755, because 2.718281828459 ^ 6.907755 = 1000
  - Other bases are:
    - 10 (log10(1000) = 3; 10 ^ 3 = 1000)
    - 2 (log2(1000) = 9.965784; 2 ^ 9.965784 = 1000)
  - Fixes power-law distributions
  - To undo the transformation, raise the base to the power of the logarithm

6.2 Apply a transformation (3/6)
Popular transformations are:
- Logarithm (fixes power-law distributions)
- Square root (fixes positively skewed distributions)
- Square transformation (fixes negatively skewed distributions)
- Reciprocal transformation (1/(vector + 1)), for distributions that look like a j
Best practice is to check which transformation works best for your data.

6.2 Apply a transformation (4/6)
- Let’s apply the logarithm to the Freq variable
- The log of 0 is -Inf, which would render our data useless
- We therefore add 1 to the values of Freq before we apply the transformation (the log of 1 is 0)

6.2 Apply a transformation (5/6)

    dataSet$transformedFreq <- log(1 + dataSet$Freq)
    qqnorm(dataSet$transformedFreq)
    qqline(dataSet$transformedFreq)

6.2 Apply a transformation (6/6)
There’s always a trade-off to be made:
- If you need your distribution to be normal for some test, then you can consider a transformation
- Transforming your data makes it harder to interpret the values. E.g., log(dataSet$Freq) is less intuitively understandable than dataSet$Freq
- We will revisit transformations in later classes

Questions?

References
- Balota, D.A., Yap, M.J., Cortese, M.J., et al. (2007). The English Lexicon Project. Behavior Research Methods, 39(3), 445–459. DOI: 10.3758/BF03193014. Data taken from Levshina (2015).
- Levshina, N. (2015). How to do Linguistics with R: Data exploration and statistical analysis. Amsterdam/Philadelphia, PA: John Benjamins.