Concepts - Brigham Young University



Concepts

Identify and define the three different types of variables and identify at least one geometric object or geom that would be appropriate for visualizing each type of variable.

Categorical variables are either nominal or ordinal. Categorical variables are often alphabetic (strings or characters). When these variables have a fixed number of possible values, we often call them factors. Factors have levels that can potentially be ordered. There are two different kinds of categorical variables, which together with numerical variables give the three types:

Nominal: character or string values that don’t have an inherent order. We often use geom_bar to summarize this type of variable. (We can also use text processing to look for patterns in these variables.)

Ordinal: character or string values that have an inherent order. Class grades, categories of income levels, seating classes on airlines, first or second year in a graduate program, etc. all have an inherent ordering. We can use geom_bar with ordered levels to visualize this type of data.

Numerical: data represented by numbers. These can be real values (double) or integers. Geometric objects that work with numerical data include geom_point, geom_histogram, geom_freqpoly, geom_density, and geom_violin.

Integers are interesting because we can also consider them as ordered categories or ordinal data. Integers can be treated as either numerical or categorical variable types.

Carefully explain how bin width and bandwidth affect geom_histogram and geom_density.

Bin width and bandwidth both control the amount of smoothing in graphs of the distribution of numerical data. The wider the bin width, the fewer the bins and the smoother the histogram. Each bin spans a larger interval on the x-axis and collects more observations, so there is much less variation among the frequencies. The same is true of bandwidth.
Larger bandwidths pool observations over a larger range of values, which makes the estimate of the density smoother.

On the above boxplot, locate each statistic in Tukey's five number summary. Explain how the whiskers are calculated and drawn.

Tukey’s five number summary is the minimum, the first quartile, the median (second quartile), the third quartile, and the maximum. These are indicated on the graph above. The minimum and the maximum can each be either an outlier or the value at the end of a whisker.

The whiskers are calculated from 1.5 times the interquartile range (the third quartile minus the first quartile, or the length of the rectangle). We measure that distance to the left of the first quartile and then slide back to the right until we encounter an observation. This positions the left whisker. A similar process happens on the right-hand side of the boxplot: we measure 1.5 times the interquartile range to the right of the third quartile and then slide back to the left until we encounter an observation. In the above example, there are data points that lie more than 1.5 times the interquartile range beyond the right side of the rectangle or box. These are potential outliers. The maximum value is the largest observation.

First define a distribution and then use the above Age of Titanic Passengers boxplot to define and illustrate each of the four characteristics of the age distribution.

A distribution has two parts:
  Range of values
  Frequencies or probabilities for each value

The four characteristics of a distribution are:
Location or center: measured as the median and mean in the diagram above.
Scale or spread: measured as the length of the box, or the interquartile range (IQR).
Symmetry: whether the distribution is a mirror image of itself about its center. The above distribution is positively skewed.
We know this because:
  The median is closer to the lower boundary of the rectangle (hinge)
  The mean is greater than the median
  The right whisker is longer than the left whisker
Outliers: There are values on the high (positive) side that lie outside the whiskers.

The following code includes each of the seven different parts of the grammar of graphics and creates the accompanying bar chart. Use the code to define and explain each of the seven parts of the grammar of graphics.

Data: line 1 identifies mpg as the data frame or tibble from which to get data.
Aesthetics: line 2 maps class to the x-axis and the type of drive train to fill.
Geometric objects or geoms: line 3 chooses a bar chart as the geometric object.
Stats: geom_bar includes a counting function that counts the number of observations in each category or level.
Position: also set in geom_bar. This places the bars side by side rather than stacked.
Coordinate system: the x and y axes are flipped in line 4.
Faceting: line 5 creates multiple graphs, one for each year.

Code Interpretation

1. Interpret lines 2 - 5
Line 2: Defines the aesthetics. The x variable is drv. fct_infreq is a function that counts the number of observations in each type of drv and then orders the levels of drv by frequency. The fill = drv specifies that every object in the graph with a fill characteristic gets a color mapped from its level of drive train.
Line 3: Each geom creates a layer on our graph. This line creates a layer for the bar chart.
Line 4: Flips the x and y axes, which gives a horizontal bar chart rather than a vertical one.
Line 5: Changes the color palette for the fill from the ggplot default to the Pastel1 palette from the ColorBrewer project.

2. Interpret lines 2 - 5
Line 2: Specifies the aesthetics for the graph. It anticipates that we are drawing a scatterplot.
Engine displacement maps to the x-axis and highway miles per gallon maps to the y-axis.
Line 3: Creates a layer in which each combination of displacement and highway mileage is plotted as a point.
Line 4: Creates a second layer that is superimposed on the scatterplot. It graphs a smoothed line calculated by fitting a simple linear regression. The lines are all colored blue because we haven’t specified a color in the aesthetics in line 2.
Line 5: Creates three different graphs, one for each value of the drive train. This is the purpose of faceting: to create multiple graphs based on a variable.

3. Interpret lines 2 - 4
Line 2: Specifies the aesthetics for the graph. It anticipates that we are drawing a scatterplot. Engine displacement maps to the x-axis and highway miles per gallon maps to the y-axis. In addition, everything in the graph that has a color attribute is assigned a color from the default palette based on the observation's drive train.
Line 3: Creates a layer in which each combination of displacement and highway mileage is plotted as a point. The color of each point is determined by the color = drv part of line 2.
Line 4: Creates a second layer that is superimposed on the scatterplot. It graphs a smoothed line calculated with the default smoother, loess. This line also overrides the color aesthetic from line 2 by assigning the color blue for this layer only, and it draws the line dashed. These characteristics are not inside an aesthetic, so they don’t vary with the values of a variable.

4. Interpret lines 2 - 4
Line 2: Engine displacement is identified as the variable of interest on the horizontal axis. Even though both layers use only this one variable, we need to specify the y value because we are going to overlay a density plot on a histogram, and by default these geoms use different scales for the y-axis (counts versus densities). The y = ..density.. means that we want a probability density scale for the y-axis rather than a count.
Line 3: Creates a layer for the histogram. We specify the amount of smoothing by declaring the number of bins. The bin width is then approximately the maximum minus the minimum, divided by the number of bins. Each bar in the histogram is filled with the color grey60.
Line 4: On top of the histogram, we create a second layer for the density trace. The bw = .25 sets the amount of smoothing in the density trace. The alpha = 0.5 gives a transparency of 50%.

5. Interpret lines 2 - 5
Line 2: We are investigating the probability distribution of the highway miles per gallon for each different type of drive train. The violin plot and the boxplot both require two arguments. The x value is a categorical variable and the y value is a numerical variable. The x = fct_reorder orders the factor drv by the size of the median for city miles per gallon. Any aspect of the graph that has a fill characteristic gets a fill color mapped from the drv variable.
Line 3: Draws a violin plot in a new layer. The show.legend = FALSE is needed to ensure that a redundant legend for fill isn’t created for drive train. Drive train is already identified by the x declaration.
Line 4: Creates a boxplot for each drive train level. The width = 0.1 specifies the width of the rectangle, and fill = "grey" fills each box with grey. The fill = "grey" takes precedence over the fill = drv in line 2.
Line 5: Exchanges the x and y axes so that we have a horizontal violin plot.
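The smoothing effect of bin width and bandwidth described under Concepts can be seen with a minimal sketch. It assumes the ggplot2 package and its built-in mpg data. The specific binwidth and bw values are arbitrary choices for illustration, not values from the original exercise.

```r
library(ggplot2)

# Narrow bins: many bars and a jagged outline
narrow_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 1)

# Wide bins: few bars and a smoother outline
wide_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 5)

# Small bandwidth: a wiggly density trace
rough_density <- ggplot(mpg, aes(x = hwy)) +
  geom_density(bw = 0.5)

# Large bandwidth: a smooth density trace
smooth_density <- ggplot(mpg, aes(x = hwy)) +
  geom_density(bw = 3)

# layer_data() returns the values computed for a layer,
# which for a histogram is one row per bar
n_narrow <- nrow(layer_data(narrow_hist))
n_wide   <- nrow(layer_data(wide_hist))
```

Printing any of the four objects (for example narrow_hist) draws the plot. Comparing n_narrow with n_wide shows that the wider bin width leaves fewer bars.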
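The whisker rule explained in the boxplot answer can be checked by hand. The sketch below uses base R only. The ages vector is made up for illustration and is not the Titanic data, and the quartiles come from quantile() with its default settings, which is an assumption about how the quartiles are defined.

```r
# Hypothetical sample, invented for this example
ages <- c(2, 18, 21, 22, 24, 25, 27, 28, 30, 31, 33, 36, 40, 45, 71)

q1  <- quantile(ages, 0.25, names = FALSE)  # first quartile
q3  <- quantile(ages, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                              # length of the box

# Measure 1.5 times the IQR outward from each side of the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Slide back toward the box until an observation is found
lower_whisker <- min(ages[ages >= lower_fence])
upper_whisker <- max(ages[ages <= upper_fence])

# Observations beyond the fences are potential outliers
outliers <- ages[ages < lower_fence | ages > upper_fence]

lower_whisker  # 18
upper_whisker  # 45
outliers       # 2 71
```

Here the minimum (2) and the maximum (71) are both outliers, so neither whisker reaches the extreme observations. The multiplier 1.5 matches the default coef argument of geom_boxplot.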

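For the histogram with a density overlay discussed in Code Interpretation item 4, one plausible form of such a plot is sketched below. It is not the original listing: the number of bins (20) and the fill on the density layer are assumptions. It also uses after_stat(density), which ggplot2 3.4.0 and later prefer over the older ..density.. notation.

```r
library(ggplot2)

overlay <- ggplot(mpg, aes(x = displ, y = after_stat(density))) +
  # bins = 20 is an assumed value; the original number of bins is not shown
  geom_histogram(bins = 20, fill = "grey60") +
  # fill = "steelblue" is an assumption so the 50% transparency is visible
  geom_density(bw = 0.25, fill = "steelblue", alpha = 0.5)

# On the density scale the histogram bars have a total area of 1,
# which puts them on the same y-axis scale as the density trace
bars <- layer_data(overlay, 1)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
```

Printing overlay draws both layers on a shared density scale. Without the y mapping, the histogram would be drawn in counts and the density trace would be nearly flat against the x-axis.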
