Introduction to R



Handout 11 – Introduction to Creating Graphs with ggplot2

This handout provides an introduction to creating graphics in R with the ggplot2 package. While the R commands introduced in the previous handout allow the user to make basic plots, the advantage of ggplot (grammar of graphics plot) is that it uses “a particular grammar… focused on thinking about, reasoning with, and communicating with graphics.”

Note: This handout is from the Stats2Labs “R Tutorials in Data Science” page and is the result of collaborative work across Grinnell College (Shonda Kuiper), Lawrence University (Adam Loy), and Carleton College (Laura Chihara). A few very minor modifications were made.

The data

In this handout, we will use the AmesHousing data, which provides information on the sales of individual residential properties in Ames, Iowa, from 2006 to 2010 (source). The data set contains 2,930 observations and a large number of explanatory variables involved in assessing home values. A full description of this data set is available here.

Start by importing the AmesHousing.csv file into R. You’ll also have to install and load the ggplot2 package.

The qplot function

The qplot() function is similar to the base plotting functions in R (which were discussed in the previous handout). It can be used to produce quick and easy graphics.
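Before running the qplot() examples, the setup step described in "The data" has to be done. The following is only a minimal sketch of that step. It assumes that AmesHousing.csv sits in the current working directory (the path is an assumption, so adjust it to wherever you saved the file) and that read.csv() is an acceptable way to import it.

```r
# install.packages("ggplot2")   # run once if the package is not yet installed
library(ggplot2)

# Assumed location: AmesHousing.csv in the current working directory.
csv_path <- "AmesHousing.csv"

if (file.exists(csv_path)) {
  # Import the data into a data frame named AmesHousing
  AmesHousing <- read.csv(csv_path)
  print(dim(AmesHousing))          # number of rows and columns
  print(head(names(AmesHousing)))  # first few variable names
} else {
  message("File not found: ", csv_path, " (use setwd() or supply a full path)")
}
```

If the file is found, the first number printed by dim() should agree with the 2,930 observations mentioned above.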
Copy and paste the following code into an R script window, and answer the questions that follow after running the code.

    # Create a histogram of housing prices
    qplot(data=AmesHousing, x=SalePrice, main="Histogram of Housing Prices in Ames, Iowa")

    # Create a scatterplot of above ground living area by sales price
    qplot(data=AmesHousing, x=Gr.Liv.Area, y=SalePrice)

    # AmesHousing$Kitchen.Qual = factor(AmesHousing$Kitchen.Qual, levels=c("Ex","Gd","TA","Fa","Po"))

    # Create a scatterplot with log transformed variables, coloring by a third variable
    qplot(data=AmesHousing, x=log(Gr.Liv.Area), y=log(SalePrice), color=Kitchen.Qual)

    # Create distinct scatterplots for each type of kitchen quality and number of fireplaces
    qplot(data=AmesHousing, x=Gr.Liv.Area, y=SalePrice, facets=Kitchen.Qual~Fireplaces)

    # Create a dotplot of sale prices by kitchen quality
    qplot(data=AmesHousing, x=Kitchen.Qual, y=SalePrice)

    # Create a boxplot of sale prices by kitchen quality
    qplot(data=AmesHousing, x=Kitchen.Qual, y=log(SalePrice), geom="boxplot")

Questions/Tasks:

1. In this dataset, how many houses were sold with four fireplaces?
2. What is the purpose of the facets argument?
3. Look at the data documentation. What are the five different levels for kitchen quality?
4. Do you see any issues with the way the graphs created using kitchen quality are displayed? “Uncomment” the command line above that should fix this issue.
5. Do these graphs indicate that the quality of a kitchen could be related to the sale price?
6. In the RStudio console, type ?qplot to learn more about this function. Modify the above code to create a jittered dotplot (geom="jitter") of sale prices by kitchen quality.

The basic structure of the ggplot function

Every ggplot() call must have at least three components:

- data frame: We will use the AmesHousing data.
- geom: determines the type of geometric shape used to display the data (such as line, bar, point, or area)
- aes: determines how variables in the data are mapped to visual properties (aesthetics) of geoms.
  This can include x position, y position, color, shape, fill, and size.

The simplest code for a graphic made with ggplot() would have one of the following forms:

    ggplot(data, aes(x,y)) + geom_line()

or

    ggplot(data) + geom_line(aes(x,y))

The two lines of code above produce identical results. In the first case, the aes is set as the default for all geoms. Essentially, the same x and y variables are used throughout the entire graphic. As graphics get more and more complex, however, it is often best to create local aes mappings for each geom (as shown in the second line of code).

For example, copy and paste the following into your R script window.

    # Create a histogram of housing prices
    ggplot(data=AmesHousing) + geom_histogram(mapping=aes(SalePrice))

After you submit the commands, a histogram of SalePrice should appear in the Plots window.

Note that in the above code, the terms data= and mapping= are optional (but are used for clarification). The following command will produce identical results:

    ggplot(AmesHousing) + geom_histogram(aes(SalePrice))

Next, copy and paste the following commands into your script window.

    # Create a scatterplot of above ground living area by sales price
    ggplot(data=AmesHousing) + geom_point(mapping=aes(x=Gr.Liv.Area, y=SalePrice))

After running this command, a scatterplot of SalePrice against Gr.Liv.Area should appear in the Plots window.

Questions/Tasks:

7. Modify the code for the histogram above so that the aes is not within the geom.
   The end result, though, should be the same.
8. Create a scatterplot using ggplot with Fireplaces on the x-axis and SalePrice on the y-axis.

Customizing graphics using the ggplot function

In the following code, additional components are layered onto the histogram created above.

    ggplot(data=AmesHousing) +
      geom_histogram(mapping = aes(SalePrice/100000), breaks=seq(0, 7, by = 1), col="red", fill="lightblue") +
      geom_density(mapping = aes(x=SalePrice/100000, y = (..count..))) +
      labs(title="Figure 9: Housing Prices in Ames, Iowa (in $100,000)", x="Sale Price of Individual Homes")

Comments:

- The histogram geom transforms SalePrice (dividing it by 100,000), sets the bin breaks, and changes the outline and fill colors.
- The density geom overlays a density curve on top of the histogram.
- Typically, density curves and histograms have very different scales; here, y = (..count..) rescales the density curve to the count scale used by the histogram. Alternatively, we could specify aes(x=SalePrice/100000, y=(..density..)) in the histogram geom.
- The labs() command adds a title and an x-axis label. A y-axis label can also be added by using y=" ".

The next three sets of commands create scatterplots of the log of the above ground living area by the log of the sale price.

    ggplot(data=AmesHousing, aes(x=log(Gr.Liv.Area), y=log(SalePrice))) +
      geom_point(shape = 3, color = "darkgreen") +
      geom_smooth(method=lm, color="green") +
      labs(title="Figure 10: Housing Prices in Ames, Iowa")

    ggplot(data=AmesHousing) +
      geom_point(aes(x=log(Gr.Liv.Area), y=log(SalePrice), color=Kitchen.Qual), shape=2, size=2) +
      geom_smooth(aes(x=log(Gr.Liv.Area), y=log(SalePrice), color=Kitchen.Qual), method=loess, size=1) +
      labs(title="Figure 11: Housing Prices in Ames, Iowa")

    ggplot(data=AmesHousing) +
      geom_point(mapping = aes(x=log(Gr.Liv.Area), y=log(SalePrice), color=Kitchen.Qual)) +
      geom_smooth(mapping = aes(x=log(Gr.Liv.Area), y=log(SalePrice), color=Kitchen.Qual), method=lm, se=FALSE, fullrange=TRUE) +
      facet_grid(. ~ Fireplaces) +
      labs(title="Figure 12: Housing Prices in Ames, Iowa")

Comments:

- geom_point is used to create a scatterplot. As shown in Figure 10, multiple shapes can be used as points. The Data Visualization Cheat Sheet lists several shape options.
- geom_smooth adds a fitted line through the data. method=lm specifies a linear regression line; method=loess creates a smooth fit curve.
- se=FALSE removes the shaded confidence regions around each line.
- fullrange=TRUE extends each regression line across the full range of the plot, so that all lines have the same length.
- The facet_grid command is used to create multiple plots. In Figure 12, we have created separate scatterplots based upon the number of fireplaces.
- When assigning fixed characteristics (such as color, shape, or size), the commands occur outside the aes, as in Figure 10 (color="green"). When characteristics are dependent on the data, the command should occur within the aes, such as in Figure 11 (color=Kitchen.Qual).

Questions/Tasks:

9. Create a histogram of the above ground living area, Gr.Liv.Area.
10. Create a scatterplot using Year.Built as the explanatory variable and SalePrice as the response variable. Include a regression line, a title, and labels for the x and y axes.
11. Modify the scatterplot in Question 10 so that there is still only one regression line, but the points are colored by the overall condition of the home, Overall.Cond.

The mplot function

To use this function, you must first install and load the mosaic package. This function provides a helpful pull-down menu of graphic options.

Questions/Tasks:

12. In the RStudio Console, type mplot(AmesHousing) and select 2 for a two-variable plot. Select the gear symbol in the top right corner of the graphics window and choose the following items:
    - Graphics System: ggplot2
    - Type of Plot: boxplot
    - x-variable: Kitchen.Qual (representing the quality of the kitchen)
    - y-variable: SalePrice
    After selecting these items, click Show Expression to see the ggplot2 code used to make the boxplot.
    Now modify the code to add an appropriate title to the plot.
13. Explore the mplot function by creating two new graphs that provide information on the SalePrice of homes in Ames, Iowa. Study the ggplot2 code used to make these graphs.

Additional considerations with R graphics

If you enter the following command in the R console, you will notice that each variable in the AmesHousing data set is assigned a type (e.g., character, numeric, integer, complex, or logical).

    > str(AmesHousing)

For example, the variable Fireplaces is considered an integer.

    $ Fireplaces : int 2 0 0 2 1 1 0 0 1 1 ...

The code below tries to color and fill a density graph by an integer value.

    ggplot(data=AmesHousing) + geom_density(aes(SalePrice, color = Fireplaces, fill = Fireplaces))

These commands appear to be ignored, however, because ggplot2 treats an integer as a continuous variable, which does not split the data into groups.

To get the color and fill commands to work properly, we must first create a new variable (Fireplace2). The as.factor command creates a factor (a variable that contains a set of numeric codes with character-valued levels) that can be used to color and fill by the number of fireplaces. The code below creates the desired graphic (note that only houses with fewer than three fireplaces are included).

    # Create a new variable called Fireplace2
    AmesHousing$Fireplace2 = as.factor(AmesHousing$Fireplaces)

    # Create a new data frame including only houses with fewer than 3 fireplaces
    AmesHousing2 = AmesHousing[AmesHousing$Fireplaces < 3,]

    ggplot(data=AmesHousing2) + geom_density(aes(SalePrice, color = Fireplace2, fill = Fireplace2), alpha = 0.2)

Customizing graphs

In addition to using a data frame, geoms, and aes, several additional components can be added to customize each graph, such as: stats, scales, themes, positions, coordinate systems, labels, and legends. We will not discuss all of these components here, but the materials in the references section provide detailed explanations.
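To make a few of the components listed above concrete before turning to the AmesHousing example, here is a minimal sketch on a small made-up data frame. The data frame `toy`, its values, and the chosen colors are assumptions for illustration only and are not taken from the AmesHousing file. Only the column names mirror the handout. The sketch converts an integer to a factor and then adds a position, a scale, labels (including the legend title), and a theme.

```r
library(ggplot2)

# Made-up toy data (NOT the AmesHousing file); column names mirror the handout.
toy <- data.frame(
  SalePrice  = c(105000, 172000, 244000, 189900, 195500, 213500,
                 191500, 236500, 129000, 160000, 141000, 275000),
  Fireplaces = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L)
)

# Convert the integer to a factor so it can define discrete groups and colors
toy$Fireplace2 <- as.factor(toy$Fireplaces)

p <- ggplot(toy, aes(x = Fireplace2, y = SalePrice, color = Fireplace2)) +
  # position: spread overlapping points horizontally only
  geom_point(position = position_jitter(width = 0.1, height = 0)) +
  # scale: choose one color per factor level
  scale_color_manual(values = c("0" = "grey40", "1" = "steelblue", "2" = "firebrick")) +
  # labels: title, axis labels, and legend title
  labs(title = "Toy example", x = "Number of fireplaces",
       y = "Sale price", color = "Fireplaces") +
  # theme: a complete built-in theme
  theme_bw()

print(p)
```

Each `+` adds one component, so individual lines can be removed or swapped to see what that component contributes.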
The code below provides a few examples of how to customize graphs.

    ggplot(AmesHousing2, aes(x = Fireplace2, y = SalePrice, color = Paved.Drive)) +
      geom_boxplot(position = position_dodge(width = 1)) +
      coord_flip() +
      labs(title="Housing Prices in Ames, Iowa") +
      theme(plot.title = element_text(family = "Trebuchet MS", color = "blue", face="bold", size=12, hjust=0))

Comments:

- position is used to address geoms that would take the same space on a graph. In the above boxplot, position_dodge(width = 1) adds a space between each box. For scatterplots, position = position_jitter() puts spaces between overlapping points.
- theme is used to change the style of a graph, but does not change the data or geoms. The above code is used to modify only the title in a boxplot. A better approach for beginners is to choose among themes that were created to customize the overall graph. Common examples are theme_bw(), theme_classic(), theme_grey(), and theme_minimal(). You can also install the ggthemes package for many more options.

Questions/Tasks:

14. In the density plot above, explain what the color, fill, and alpha commands are used for. Hint: try running the code with and without these commands or use the Data Visualization Cheat Sheet.
15. In the boxplot code, what does the coord_flip() command do?
16. Create a new boxplot, similar to the one above, but use theme_bw() instead of the given theme command.
    Explain how the graph changes.
17. Use the tab completion feature in RStudio (type theme and hit the Tab key to see various options) to determine what themes are available in ggplot.

Additional resources

- Two introductory videos on ggplot2 by Roger Peng
- Data Visualization with ggplot2 Cheat Sheet
- A well-documented list of ggplot2 components with descriptions
- Quick-R introduction to graphics
- Formal documentation of the ggplot2 package
- A tutorial on ggplot2 by Hadley Wickham
- Stack Overflow, an online community to share information
- R Graphics Cookbook, a text by Winston Chang
- Sample chapters of Hadley Wickham's text, ggplot2: Elegant Graphics for Data Analysis