STAT 1261/2260: Principles of Data Science



Lecture 5 - Data Visualization (3): ggplot2 (2/3)

Where are we?

Implementing the grammar of graphics using ggplot2

In Lecture 4, we've seen some basics of ggplot2.

ggplot2 is based on the Grammar of Graphics, closely related to Nathan Yau's four elements (Visual Cues, Coordinate System, Scale and Context) of data visualization.

    library(mdsr)
    library(tidyverse)
    library(ggplot2)
    library(tinytex)

Adding context (1)

The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the labs() function.

    ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth(se = FALSE) +
      labs(
        title = "Fuel efficiency generally decreases with engine size",
        subtitle = "Two seaters (sports cars) are an exception because of their light weight",
        caption = "Data from "
      )

Adding context (2)

You can also use labs() to change axis labels and legends. It is usually a good idea to replace short variable names with more detailed descriptions, and to include the units.

    ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(colour = class)) +
      geom_smooth(se = FALSE) +
      labs(
        x = "Engine displacement (L)",
        y = "Highway fuel economy (mpg)",
        colour = "Car type"
      )

Adding context (3)

Context is also provided by guides (more commonly called legends). By mapping a discrete variable to one of the visual cues of shape, color or linetype, ggplot2 by default creates a legend.

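A minimal sketch of the default legend behaviour described above, using the drv variable from mpg. The legend title "Drive train" is an illustrative choice and does not come from the lecture.

```r
library(ggplot2)

# Mapping the discrete variable drv to the shape cue inside aes()
# is enough for ggplot2 to draw a legend; no extra call is needed.
p_shape <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(shape = drv)) +
  labs(shape = "Drive train")  # assumed label text, shown only to title the legend

p_shape
```

The same pattern applies when drv is mapped to color or linetype instead of shape.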
You may choose to omit the legend with show.legend = FALSE.

    ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(colour = class), show.legend = FALSE) +
      geom_smooth(se = FALSE)

The geom_text() and annotate() functions can also be used to provide specific textual annotations on the plot.

Scales (1)

The diamonds dataset, which comes with ggplot2, contains information about roughly 54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond.

What is the relationship between carat and price?
What if we log transform carat and price?

    ggplot(diamonds, aes(carat, price)) +
      geom_bin2d()

    ggplot(diamonds, aes(log10(carat), log10(price))) +
      geom_bin2d()

It is very useful to plot transformations of your variables. However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot.

Scales (2)

Instead of doing the transformation in the aesthetic mapping, we can do it with the scale. This is visually identical, except that the axes are labelled on the original data scale.

    ggplot(diamonds, aes(carat, price)) +
      geom_bin2d() +
      scale_x_log10() +
      scale_y_log10()

Scales (2) (cont.)

Here's an identical graph using the scale_x_continuous() and scale_y_continuous() functions:

    # Note: in ggplot2 3.5.0 and later this argument is named transform;
    # trans is deprecated there.
    ggplot(diamonds, aes(carat, price)) +
      geom_bin2d() +
      scale_x_continuous(trans = "log10") +
      scale_y_continuous(trans = "log10")

Scales (3)

Another scale that is frequently customised is colour.

In the second chunk of code, "Set1" is defined in the RColorBrewer package. It is one of the qualitative palettes.

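To see what a Brewer palette such as "Set1" contains, the RColorBrewer package can be queried directly. This is a minimal sketch, assuming RColorBrewer is installed; the variable names are illustrative.

```r
library(RColorBrewer)

# Metadata for one palette: maximum number of colours, category, colorblind flag
set1_info <- brewer.pal.info["Set1", ]
print(set1_info)

# Hex codes for the first three colours of Set1
set1_colours <- brewer.pal(3, "Set1")
print(set1_colours)

# Names of the palettes that RColorBrewer itself flags as colorblind-friendly
colorblind_safe <- rownames(brewer.pal.info)[brewer.pal.info$colorblind]
print(colorblind_safe)
```

Checking the colorblind column before choosing a palette is worthwhile, since not every Brewer palette carries that flag.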
The RColorBrewer package provides colorblind-safe palettes in a variety of hues (see Figure 2.11 in the MDSR textbook).

    ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(color = drv))

    ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(color = drv)) +
      scale_colour_brewer(palette = "Set1")

How are bar charts created?

Next, consider a basic bar chart, as drawn with geom_bar().

    ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut))

On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds! Where does count come from?

Graphs and New Variables

Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:

- bar charts, histograms, and frequency polygons bin your data and then plot bin counts (geom_bar() and geom_bin2d())
- smoothers fit a model to your data and then plot predictions from the model (geom_smooth())

The figure below describes how this process works with geom_bar().

Statistical transformations

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. You can learn which stat a geom uses by inspecting the default value for the stat argument.

For example, ?geom_bar shows that the default value for stat is "count", which means that geom_bar() uses stat_count().

stat_count() is documented on the same page as geom_bar(), and if you scroll down you can find a section called "Computed variables". That section describes how it computes two new variables: count and prop.

Using stat_count to create a bar chart

You can generally use geoms and stats interchangeably.

For example, you can recreate the previous plot using stat_count() instead of geom_bar():

    ggplot(data = diamonds) +
      stat_count(mapping = aes(x = cut))

This works because every geom has a default stat, and every stat has a default geom.

Display proportions in bar charts

You might want to override the default mapping from transformed variables to aesthetics.

    # ggplot2 3.4.0 and later also accept after_stat(prop) in place of ..prop..
    ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

group = "whatever" is a "dummy" grouping to override the default behavior, which is to group by the x variable (cut in this example) in order to separately count the number of rows in each level of the x variable. To compute the proportion of each level of cut among all diamonds, we do not want to group by cut. Specifying a dummy group, group = 1 (i.e., all rows are in group 1), achieves this.

Display proportions in bar charts (cont.)

Without group = 1, each level of cut is its own group and makes up 100% of that group, so every bar has height 1. The plot will look like the following.

    ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, y = ..prop..))

Calculate the proportion manually

Alternatively, we can calculate y differently.

    ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, y = ..count.. / sum(..count..))) +
      ylab("Proportion")

Override the default stat

When there is no need for any statistical transformation, we can change the stat of geom_bar() from "count" (the default) to "identity", as shown in the example below.

    library(tibble)

    demo <- tribble(
      ~cut,        ~freq,
      "Fair",       1610,
      "Good",       4906,
      "Very Good", 12082,
      "Premium",   13791,
      "Ideal",     21551
    )

    ggplot(data = demo) +
      geom_bar(mapping = aes(x = reorder(cut, freq), y = freq), stat = "identity")

    # See what happens if reorder(cut, freq) is replaced by cut.
    # Type head(diamonds$cut) to compare with the level order of cut in diamonds.
    ggplot(data = demo) +
      geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

Alternative way to create bar charts

You may also use geom_col() to create the same plot:

    ggplot(data = demo) +
      geom_col(mapping = aes(x = reorder(cut, freq), y = freq)) +
      xlab("cut")

Color the bars

You can colour a bar chart using either the color aesthetic or, more usefully, fill:

- color only changes the outline of the bars
- fill changes the actual color of the bars

    ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, fill = cut))

    ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, color = cut), alpha = 0)

Add another variable

It is more useful when the fill aesthetic is mapped to another categorical variable, like clarity.

    library(dplyr)

    g <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))
    g + geom_bar()

The stacking is performed automatically. The position argument specifies the position of the bars. By default, it is position = "stack".

Position adjustments (1)

If you don't want a stacked bar chart, you can set position to be "dodge" or "fill".

    g + geom_bar(position = "fill")
    g + geom_bar(position = "dodge")

- position = "fill" shows relative proportions at each x level by stacking the bars and then standardizing each bar to have the same height.
- position = "dodge" places the bars for each category side by side.

Saving your plots

Many times you would like to save the plots you have created. ggsave() is a convenient function for saving a plot.

It defaults to saving the last plot that you displayed, using the size of the current graphics device. It also guesses the type of graphics device from the extension.

The usage of the function is as follows:

    ggsave(filename, plot = last_plot(), device = NULL, path = NULL, scale = 1, ...)

- filename: File name to create on disk.
- plot: Plot to save, defaults to last plot displayed.
- device: Device to use. Can either be a device function (e.g. png()), or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
- path: Path to save plot to (combined with filename).
- scale: Multiplicative scaling factor.

For more details, type ?ggsave in the R console.

Example: Save a plot

    ggsave("my-plot.pdf")
    ## Saving 6 x 4 in image

- With no plot argument, ggsave() saves only the last plot that you created.
- The file name of the plot is my-plot.pdf.
- The path is the current working directory.
- The device is pdf.
- The scale is 1 by default.

You may also save through the "Plots" menu of RStudio. Check the following link for more details.

Position adjustments (2)

There is one other type of adjustment that can be very useful for scatterplots. Recall the scatterplot of car engine size (displ) vs. fuel efficiency (hwy):

    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy))

Did you notice that the plot displays only 126 points, even though there are 234 observations in the data set?

Position adjustments (2) (cont.)

The values of hwy and displ are rounded, so the points appear on a grid and many points overlap each other. The arrangement makes it hard to see where the mass of the data is.

You can avoid this gridding by setting the position adjustment to "jitter". position = "jitter" adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.

Because this is a useful operation, ggplot2 comes with a shorthand for geom_point(position = "jitter"): geom_jitter().

    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

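The overplotting described above can be checked by counting distinct (displ, hwy) pairs, and the geom_jitter() shorthand can be tried directly. This is a minimal sketch: the width, height and seed values are illustrative assumptions, not values from the lecture.

```r
library(ggplot2)

# Each distinct (displ, hwy) pair is drawn as one visible point,
# so overlapping observations make this count smaller than nrow(mpg).
n_obs <- nrow(mpg)
n_distinct <- nrow(unique(mpg[, c("displ", "hwy")]))
cat("observations:", n_obs, " distinct (displ, hwy) pairs:", n_distinct, "\n")

# Shorthand form: geom_jitter() with explicit amounts of noise (assumed values)
p_jitter <- ggplot(mpg, aes(displ, hwy)) +
  geom_jitter(width = 0.1, height = 0.5)

# Reproducible form: position_jitter() takes a seed, so the same noise
# is drawn every time the plot is rendered
p_repro <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(position = position_jitter(width = 0.1, height = 0.5, seed = 1261))

p_jitter
p_repro
```

Smaller width and height values keep the points closer to their recorded positions, at the cost of more overlap.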