STAT 1261/2260: Principles of Data Science



Lecture 6 - Data Visualization (4): ggplot2 (3/3)

Where are we?

In Lecture 5, we discussed the following topics:
- Adding context to plots
- Details about geom functions, such as
  - statistical transformation
  - position adjustment
  - change of scales
- How to save a plot

Today, we will talk about
- Coordinate systems
- Statistical graphical displays

    library(mdsr)
    library(tidyverse)
    library(ggplot2)
    library(tinytex)

Coordinate systems

In ggplot2, the default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to determine the location of each point.

There are a number of other coordinate systems in ggplot2:
- A Cartesian coordinate system with the x- and y-axes switched
- Map coordinate system
- Polar coordinate system

Coordinate systems: Switching x- and y-axes

coord_flip() switches the x- and y-axes. For example,

    g <- ggplot(data = mpg, mapping = aes(x = class, y = hwy))
    g + geom_boxplot()
    g + geom_boxplot() + coord_flip()

Switching the x- and y-axes is also useful for long labels.

Coordinate systems: The map coordinate system

coord_quickmap() sets the aspect ratio correctly for maps. This is very important when plotting spatial data with ggplot2. For example,

    library(maps)
    nz <- map_data("nz")
    ggplot(nz, aes(long, lat, group = group)) +
      geom_polygon(fill = "white", colour = "black")
    ggplot(nz, aes(long, lat, group = group)) +
      geom_polygon(fill = "white", colour = "black") +
      coord_quickmap()

Coordinate systems: The map coordinate system (cont.)

- coord_map projects a portion of the earth, which is approximately spherical, onto a flat 2D plane using any projection defined by the mapproj package. Map projections do not, in general, preserve straight lines, so this requires considerable computation.
- coord_quickmap is a quick approximation that does preserve straight lines.
  It works best for smaller areas closer to the equator.
- The maps package contains functions and data for drawing geographic maps.
- The map_data() function turns data from the maps package into a data frame suitable for plotting with ggplot2.
- nz contains data about New Zealand.
- For more details, install the maps package and type ?map_data in the R console. Note that map_data is a function in ggplot2.

Coordinate systems: The polar coordinate system

In mathematics, the polar coordinate system is a two-dimensional coordinate system in which each point on a plane is determined by a distance from a reference point and an angle from a reference direction.
- The reference point (analogous to the origin of a Cartesian coordinate system) is called the pole.
- The ray from the pole in the reference direction is the polar axis.
- The distance from the pole is called the radial coordinate, radial distance or simply radius.
- The angle is called the angular coordinate or polar angle.
(Wikipedia)

Example: The polar coordinate system

We can create a pie chart using geom_bar() by changing the coordinate system from Cartesian to polar.

    ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
      geom_bar()
    ggplot(mtcars, aes(x = factor(1), fill = factor(cyl))) +
      geom_bar(width = 1)
    ggplot(mtcars, aes(x = factor(1), fill = factor(cyl))) +
      geom_bar(width = 1) +
      coord_polar(theta = "y")

Statistical Graphical Displays

Over time, statisticians have developed standard data graphics for specific use cases.
- Numeric data
  - Histograms and Density plots
  - QQ-plots
  - Boxplots
- Categorical data
  - Bar charts
  - Pie charts
- Two continuous variables
  - Scatterplots
  - Contour plots

One numeric variable (1): Histogram and Density Plots

The distribution of a numerical variable is commonly summarized graphically using a histogram or a density plot. Let us see the distribution of SAT math scores in the SAT_2010 data frame.

    g <- ggplot(data = SAT_2010, aes(x = math))
    g + geom_histogram()
    ## `stat_bin()` using `bins = 30`.
    ## Pick better value with `binwidth`.
    g + geom_density()

Both the histogram and the density plot convey the same information, but whereas the histogram uses pre-defined bins to create a discrete distribution, a density plot uses a kernel smoother to make a continuous curve.

One numeric variable (1): Histogram and Density Plots (cont.)

Note that what is displayed in the histogram is not just the raw data. Instead, to construct a histogram, geom_histogram():
- Divides the range of values into n equal-width intervals (bins).
- Counts how many values fall in each bin (the stat_bin() statistical transformation).

The default number of bins is bins = 30. You may adjust it using the binwidth = or bins = argument. For details, type ?geom_histogram.

The density plot can be adjusted using the adjust = or bw = argument. For more details, type ?geom_density.

    g + geom_histogram(binwidth = 15)
    g + geom_density(bw = 5)

One numeric variable (2): QQ-Plots

The quantile-quantile plot (QQ-plot) is very useful for comparing an empirical univariate distribution (passed as “sample”) with a theoretical distribution. To visually check whether the math variable follows a normal distribution:

    # stat_qq_line() is ggplot2's reference-line layer (ggplot2 >= 3.0.0)
    ggplot(data = SAT_2010, aes(sample = math)) +
      geom_qq() +
      stat_qq_line()

One categorical variable (1): Bar charts

If the variable is categorical, we can use a bar chart to display its distribution.

We use the data set mosaicData::HELPrct (“Health Evaluation and Linkage to Primary Care”) as an example.
homeless is a categorical variable with two categories: homeless and housed.

    with(HELPrct, table(homeless))
    ## homeless
    ## homeless   housed
    ##      209      244
    g <- ggplot(data = HELPrct, aes(x = homeless, fill = homeless))
    g + geom_bar()

One categorical variable (2): Coxcomb chart

Alternatively, we may create a so-called Coxcomb chart using geom_bar().

    bar <- ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) +
      theme(aspect.ratio = 1) +
      labs(x = NULL, y = NULL)
    bar
    bar + coord_polar()

The Coxcomb chart is just a bar chart in the polar coordinate system. Here the radius represents the count.

Two categorical variables (1): Tile plot

Consider another categorical variable in the HELPrct dataset, substance, which has three categories: alcohol, cocaine, and heroin. How can we display the joint distribution of homeless and substance?

Perhaps you want to expand the coordinate system and create a 3D bar graph, a display that statisticians have widely criticized in recent times. Instead, we can use a flattened version of this, using the visual cue of shade to map the count of each combination of categories of homeless and substance.

    HELPrct %>%
      group_by(substance, homeless) %>%
      summarize(count = n()) %>%
      ggplot(aes(x = homeless, y = substance, fill = count)) +
      geom_tile()

Two categorical variables (2): Stacked bar plots

    with(HELPrct, table(substance, homeless))
    ##          homeless
    ## substance homeless housed
    ##   alcohol      103     74
    ##   cocaine       59     93
    ##   heroin        47     77

The 3-by-2 contingency table shows the joint distribution of the two variables.

For a small data set, the tile plot is not so effective. So, let us expand the bar graph to a stacked bar plot.
We will use color to map the variable substance.

    ggplot(data = HELPrct, aes(x = homeless, fill = substance)) +
      geom_bar()
    ggplot(data = HELPrct, aes(x = homeless, fill = substance)) +
      geom_bar() +
      coord_flip()

Two categorical variables (2) (cont.)

Position adjustments determine how to arrange geoms.

    g <- ggplot(data = HELPrct, aes(x = homeless, fill = substance)) +
      coord_flip()
    g + geom_bar(position = "dodge")
    g + geom_bar(position = "fill")

Notice that by using position = "fill", we are plotting a different quantity: the conditional proportion of each substance within each level of homeless, which estimates a conditional probability. It can be effectively used to answer conditional and comparative questions.
- What is the proportion (or probability) of alcohol use among the homeless?
- Is there a higher chance of using heroin if the person is housed, as opposed to being homeless?

Example: Bar plots

How would you answer the following questions?
- What is the proportion of homeless among all people whose primary substance of abuse is heroin?
- What is the proportion of homeless among all people whose primary substance of abuse is alcohol?
- Are those two proportions noticeably different?

    g <- ggplot(data = HELPrct, aes(x = substance, fill = homeless)) +
      coord_flip()
    g + geom_bar(position = "dodge")
    g + geom_bar(position = "fill")

This method of graphical display enables a direct comparison of proportions. In this case, it is clear that homeless participants were more likely to identify alcohol as their primary substance of abuse.

Discrete Numeric and Ordinal Variables

Ordinal variables are categorical variables with intrinsic order. A typical example of this type of variable is cyl (number of cylinders of a car) in the mpg dataset.

    table(mpg$cyl)
    ##
    ##  4  5  6  8
    ## 81  4 79 70
    summary(as.factor(mpg$cyl))
    ##  4  5  6  8
    ## 81  4 79 70
    # alpha is a fixed setting, so it belongs in geom_bar(), not in aes()
    ggplot(data = mpg, aes(x = cyl)) +
      geom_bar(alpha = 0.5)

The resulting bar plot is now very similar to a histogram, and it now makes sense to discuss the shape and location of the distribution.
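
As a rough illustration of what "location" and "shape" mean for such a variable, the sketch below rebuilds the cyl values from the counts printed by table(mpg$cyl) and summarizes them in base R. The variable names (cyl_median, cyl_mean, cyl_prop) are made up for this example, and rebuilding the vector from the printed counts is only a convenience so that the code runs without loading ggplot2.

```r
# Rebuild the cyl values from the counts shown by table(mpg$cyl):
# 81 fours, 4 fives, 79 sixes and 70 eights (234 cars in total)
cyl <- rep(c(4, 5, 6, 8), times = c(81, 4, 79, 70))

# Location: the values are ordered numbers, so median and mean are meaningful
cyl_median <- median(cyl)
cyl_mean <- mean(cyl)

# Shape: relative frequencies are the bar heights rescaled to sum to one
cyl_prop <- prop.table(table(cyl))

cyl_median
cyl_mean
round(cyl_prop, 3)
```

With these counts the median is 6 cylinders and the mean is about 5.89, and the relative frequencies show two large groups (4 and 6 cylinders) with very few 5-cylinder cars.
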
Many well-known statistical distributions are of this type: the Bernoulli, Binomial, Geometric, and Poisson distributions, and so on.

Comparing two or more numeric distributions (1)

We discussed how a stacked bar plot is used to compare two categorical variables. To compare multiple univariate numeric distributions, a side-by-side boxplot is the best option for its simplicity.

    ggplot(HELPrct, aes(x = homeless, y = pcs)) +
      geom_boxplot() +
      facet_wrap(~ substance)

pcs is the Physical Component Score (measured at baseline; lower scores indicate worse status).

Comparing two or more numeric distributions (2)

We can also use a frequency polygon to compare different distributions. With raw counts, however, it is hard to see the difference in distribution because the overall counts differ so much.

    ggplot(data = diamonds, mapping = aes(x = price, colour = cut)) +
      geom_freqpoly(binwidth = 500)

Comparing two or more numeric distributions (2) (cont.)

To make the comparison easier, instead of displaying count, we’ll display density, which is the count standardized so that the area under each frequency polygon is one.

    # after_stat(density) is the current spelling of ..density.. (ggplot2 >= 3.3.0)
    ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
      geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

Two numeric variables: scatterplot and contour plot

    # after_stat(level) is the current spelling of ..level..
    ggplot(SAT_2010, aes(math, salary)) +
      geom_point() +
      geom_density2d(aes(colour = after_stat(level)))

We can use the scatterplot to show the relationship between two numeric variables.

The density function for a bivariate distribution is graphed as a mountain: geom_density2d() first computes estimates of density heights, then transforms the mountain into sets of contours (points of equal elevation), and uses shades of color to map the contours.
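
The two steps described for geom_density2d() (estimate density heights, then cut the surface into contours) can be sketched by hand. This is only an illustration under stated assumptions: it uses MASS::kde2d() for the density estimate and grDevices::contourLines() for the contours, the built-in faithful data stands in for SAT_2010 so that no extra data package is needed, and the 50 x 50 grid and the names dens, cut_levels and contours are arbitrary choices for this example.

```r
library(MASS)  # provides kde2d(); MASS is assumed to be installed (it ships with standard R)

# Assumption: the built-in faithful data stands in for SAT_2010
x <- faithful$eruptions
y <- faithful$waiting

# Step 1: estimate the density heights (the "mountain") on a 50 x 50 grid
dens <- kde2d(x, y, n = 50)

# Step 2: cut the surface at a few evenly spaced heights to get contour lines
cut_levels <- pretty(range(dens$z), n = 5)
contours <- contourLines(dens$x, dens$y, dens$z, levels = cut_levels)

# Each list element is one curve along which the estimated density is constant
length(contours)
unique(sapply(contours, function(cl) cl$level))
```

Each element of contours holds the x and y coordinates of one curve together with its level, which is the quantity mapped to colour in the plot above.
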