


Quick introduction to descriptive statistics and graphs in R Commander

Written by: Robin Beaumont
e-mail: robin@organplayers.co.uk
Last updated: Wednesday, 24 April 2013
Version: 2

Contents

Boxplots
Percentages for each category/factor level
Summaries for an interval/ratio variable divided across categories (factor levels)
Histograms
Density plots
Density plots for subgroups defined by factor levels
Graphical summaries of data - aggregation
Aggregating data

Boxplots

From within R you need to load R Commander by typing the following command:

    library(Rcmdr)

First of all you need some data. For this example I'll use the sample dataset, loading it directly from my website. You can do this by selecting the R Commander menu option:

    Data -> from text, the clipboard or URL

I have given the resultant dataframe the name mydataframe, and indicated that it comes from a URL (i.e. the web) and that the columns are separated by tab characters. Clicking on the OK button brings up the internet URL box, into which you need to type the address of my sample data.

The dataset has 7 variables, of which we are only interested in two here: time (the outcome variable) and dosage, a grouping variable indicating which group the result (time) belongs to.

Percentages for each category/factor level

Using the dataset from the boxplots example, we can take a single variable and obtain the counts and percentages for each category in R Commander.
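For readers who prefer the command line, the same kind of counts and percentages can be computed with base R's table() and prop.table(). This is only a sketch: the small dosage factor below is made up for illustration and is not the sample dataset.

```r
# Hypothetical grouping variable standing in for the dosage column
# (made-up values, not the sample dataset).
dosage <- factor(c("low", "low", "medium", "high",
                   "medium", "low", "high", "low"))

# Number of cases in each factor level
counts <- table(dosage)

# Percentage of cases in each factor level, rounded to one decimal place
percentages <- round(100 * prop.table(counts), 1)

# Show counts and percentages side by side, one row per level
print(cbind(count = counts, percent = percentages))
```

For a column in a dataframe you would pass something like mydataframe$dosage to table() instead of the made-up vector.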
Suppose we want to know the number and percentage of cases in each group, that is, within each category (level) of the dosage variable. The dosage variable is a grouping variable (nominal data), and each value is said to represent a factor level.

Summaries for an interval/ratio variable divided across categories (factor levels)

We can obtain simple descriptive statistics using the menu option shown opposite. We can also find these for subgroups by using the Summarize by groups option.

Histograms

Say we wanted to see the distribution of ages in our dataset. You have three options, although usually you would only show one in a report:

Frequency counts
Percentages
Density histogram

Note the dataframe dollar column name format, i.e. mydataframe$age, used as the description of the x axis.

Density plots

A density plot is a smoothed version of a histogram, and it is very useful. Unfortunately there is no R Commander menu option to produce one, so you need to type the command:

    plot(density(dataframe name $ column name))

So for our dataframe, which we have called mydataframe, and the column called age within it, we type:

    plot(density(mydataframe$age))

Density plots for subgroups defined by factor levels

There are many ways to do this, and the easiest is to use the lattice package, introduced later in the course. For now, considering just the gender variable, which has only 2 levels, we can do the following.

First copy only the male cases into a dataframe called maledata:

    maledata <- mydataframe[mydataframe$gender == "Male",]

Here the condition before the comma selects only the rows where gender is male, and leaving the position after the comma empty selects all the columns in the dataframe, so the comma is important. Note the double == to mean "is equal to".

Now copy only the female cases into a dataframe called femaledata, in exactly the same way:

    femaledata <- mydataframe[mydataframe$gender == "Female",]
Now create our density plot:

    plot(density(maledata$age), ylim = c(0, 0.07), main = "densityplots for males/females[dotted] for age", xlab = "age (years)")

This plots the densities of the male ages, sets the y axis limits to 0 to 0.07 (ylim), sets the main title of the graph (main) and sets the x axis label (xlab).

Now we need to superimpose the female density line. We set the line type to 2 (R's dashed line, which the title above calls dotted) to differentiate it from the default solid line type:

    lines(density(femaledata$age), lty = 2)

Graphical summaries of data - aggregation

Problem: we want to show hourly wage against years working at a health institution, and have the data in the following format.

First obtain either the healthwagedata.sav or the healthwagedata.rda file from the url below and store it on your local machine. The top left screenshot shows how to load the rda file. We see there are many entries for each yrsscale (time worked with institution), while hourwage shows the average hourly wage (top right).

Before we do anything, let's check what the summary values are for each level of employment time, using the menu option Statistics -> Summaries -> Numeric summaries and setting up the dialog box as shown opposite.

Clearly the mean and median hourly rate go up with years of employment, from 18 to 21.63.

Because of the multiple hourly wage values for each level of employment time, a scatter plot of the raw data is not appropriate, but we have two options: produce a series of boxplots or means for each group, or aggregate the data, for example find the mean hourly wage at each level of employment time and then plot these values.

We can easily produce a boxplot of the above findings.

By selecting the identify outliers option: automatically, we have the case numbers marked.
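A grouped boxplot of this kind can also be drawn at the command line with base R's boxplot(), and boxplot.stats() lists the values that would be flagged as outliers. The sketch below assumes a dataframe with columns named hourwage and yrsscale as described above, but the values are simulated and the level labels are invented purely for illustration.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated stand-in for the health wage data (invented values and labels)
wagedata <- data.frame(
  yrsscale = factor(rep(c("0-5", "6-10", "11-15"), each = 30),
                    levels = c("0-5", "6-10", "11-15")),
  hourwage = c(rnorm(30, mean = 18), rnorm(30, mean = 20),
               rnorm(30, mean = 21.5))
)

# Add two artificial extreme values so the plot has outliers to show
wagedata$hourwage[c(1, 31)] <- c(35, 5)

# One box per level of yrsscale
boxplot(hourwage ~ yrsscale, data = wagedata,
        xlab = "years with institution", ylab = "hourly wage")

# Values lying more than 1.5 * IQR beyond the box, listed for each group
outliers <- tapply(wagedata$hourwage, wagedata$yrsscale,
                   function(x) boxplot.stats(x)$out)
print(outliers)
```

Listing the outlying values like this is a command-line counterpart to having the case numbers marked on the plot, although it reports the values themselves rather than the row numbers.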
By selecting the identify outliers option we now have a clearer, but possibly less useful, graph. Asking the question "what do the many outliers suggest?" would require knowledge of the context in which the data was collected. They might be miscoded values, or a particular distinct subset of employees such as consultants, and a definitive answer needs detailed knowledge of the environment from which the data was collected.

Ignoring the outliers, and assuming that the data are normally distributed at each level of years of employment, we can produce a graph of means at each level along with an indication of range:

    Graphs -> Plot of means

Selecting the standard errors option, we can see the estimated accuracy of the mean for each group.

I feel that presenting the data like this possibly does it a disservice, as it now appears very clean, giving no indication of those very low and high paid workers!

Notice that the x categories are in the correct order, but this is not always the case. The rda and sav files contained additional information specifying the factor level order. However, if we had used a plain text file (i.e.
.dat or .txt) we would have needed to reorder the factor levels by using the R Commander menu option:

    Data -> Manage variables in active data set -> Reorder factor levels

The alternative strategy is to produce a new dataframe which consists only of the summary values. To do this we first need to remove all those rows which have empty values for either the hourwage or yrsscale variables:

    Data -> Active data set -> Remove cases with missing data

See opposite. I have called the new dataframe cleandataframe.

Notice that the new dataframe is automatically loaded. The new dataframe has 89 fewer records.

Aggregating data

Aggregating data and creating new datasets from the aggregated values is a common occurrence with large datasets, and this scenario provides you with a good example.

Having removed all the cases with missing data, we can now create a new dataframe with just the aggregated data (i.e. the means) by selecting the menu option and then setting up the dialog box as shown opposite.

Notice that the new dataframe is automatically loaded. The new dataframe has 6 records.

Clicking on the edit data set button, we can edit the new dataframe. When you have finished, make sure you close it by clicking on the X button at the top right-hand side of the window.

The next stage is to produce a scatterplot of the means against year. However, we can only do this when we have at least two interval/ratio variables in the dataframe; otherwise the R Commander scatterplot menu option is grayed out, which it would be if you tried with the current dataframe.
However, this is easily fixed by changing the yrsscale variable from a factor to a numeric variable. Once again click on the edit data set button, this time selecting the top of the yrsscale column, and change the variable to numeric. When you have finished, make sure you close both the variable editor and the data editor windows with the X button.

Now we can produce the scatterplot. Set up the dialog box as shown opposite. The result is shown below, but I feel it is far less informative than the boxplots we created earlier.
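The aggregation route described in this section can also be sketched at the command line. Here na.omit() plays the part of removing cases with missing data, aggregate() builds the dataframe of means, and as.numeric() changes the factor into a numeric variable. The data are invented for illustration, including the coding of yrsscale as 1 to 3, so the numbers will not match the figures in this handout.

```r
# Invented stand-in for the health wage data; NA marks a missing value
wagedata <- data.frame(
  yrsscale = factor(c(1, 1, 1, 2, 2, NA, 3, 3, 3)),
  hourwage = c(17, 18, 19, NA, 20, 21, 21, 22, 23)
)

# Remove every row that has a missing value in any column
cleandataframe <- na.omit(wagedata)

# Mean hourly wage for each level of yrsscale
aggdata <- aggregate(hourwage ~ yrsscale, data = cleandataframe, FUN = mean)

# yrsscale is still a factor at this point. Convert via character so that
# the original codes, not the internal level numbers, become the values.
aggdata$yrsscale <- as.numeric(as.character(aggdata$yrsscale))

print(aggdata)

# Scatterplot of the group means against years
plot(hourwage ~ yrsscale, data = aggdata,
     xlab = "years with institution (coded)", ylab = "mean hourly wage")
```

Going through as.character() matters whenever the factor labels are themselves numbers: calling as.numeric() directly on a factor returns the internal level positions, which only happen to coincide with the labels in simple cases like this one.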