Concepts



Use the following graph to compare the shapes of the distributions of birth weights for smoking and non-smoking mothers:

Location:
Measured by the mean and the median.
Nonsmokers have larger babies as measured by both the mean and the median.

Scale:
Measured by the IQR (also by the sd, which isn’t shown or known).
Nonsmokers appear to have more variability in birth weights.

Symmetry:
Things to consider (not required in answer):
Relationship of mean and median
Relative length of whiskers
Position of median relative to hinges
Prevalence of outliers
Evidence:
The nonsmoker mean is less than the median, which suggests negative skewness.
The left whiskers seem longer for both smokers and nonsmokers.
The medians are very close to the middle of the boxes.
Smokers seem to be more symmetric than nonsmokers.

Outliers:
Smokers have one baby with an outlier birth weight.

Define categorical variables and explain their relationship with factors.

The possible values for a categorical variable are usually limited and fixed. The number of categories should also be manageable. Each individual observation can be assigned to a particular group or category. The categories themselves can be unordered (nominal) or ordered (ordinal).

Statistical software often uses factors to represent categorical data. The groups can be labeled using codes, numbers, or characters. Factors have levels and labels which can be changed, reordered, and manipulated, especially when creating data visualizations.

Define and explain the three time series components.

Trend/Cycle:
Trend: General upward and downward movements in a time series. A global trend is a general tendency for the time series variable to increase or decrease throughout the series.
Cycles are local tendencies to increase or decrease along the global trend line.
Because cycles don’t have a constant frequency (period), they are combined with the trend.

Season:
Regular time series patterns that occur with a recognizable frequency that is often associated with calendar events. The constant frequency means that the pattern can usually be identified as daily, weekly, monthly, or semi-annual.

Remainder or irregular:
The part of the time series that isn’t allocated to the trend/cycle or seasonal patterns. If the model is additive, we subtract the trend and seasonal components from the time series to get the remainder.

Sketch a boxplot and carefully explain how each part is constructed.

Tukey’s five-number summary is the minimum, first quartile, median (second quartile), third quartile, and maximum. The purpose of the boxplot is to represent these five quantities. They are indicated on the graph above. The minimum and the maximum can each be either an outlier or the end value on a whisker.

The whiskers are located using 1.5 times the interquartile range (the third quartile minus the first quartile, which is the length of the rectangle). We measure that distance to the left of the first quartile and then slide back to the right until we encounter an observation. This positions the left whisker. A similar process happens on the right-hand side of the boxplot: we measure 1.5 times the interquartile range to the right of the third quartile and then slide back to the left until we encounter an observation. In the above example, there are data points that lie more than 1.5 times the interquartile range beyond the right side of the rectangle or box. These are potential outliers. The maximum value is the largest observation.

How are geometric objects and stats related to each other in the context of the grammar of graphics? Illustrate your answer with examples.

Geometric objects create layers in data visualizations. Each geom creates a new layer. If the new layer requires calculations, there is a related statistical procedure, or stat, which accomplishes this.
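A boxplot is one case where a layer needs calculations before anything can be drawn (in ggplot2, geom_boxplot relies on a boxplot stat for this). The sketch below shows that kind of calculation in plain Python rather than R, so that it stays self-contained. It follows the whisker construction described in the boxplot answer above. The function name boxplot_stats, the "inclusive" quartile rule, and the data values are all assumptions made for illustration; other quartile definitions (such as Tukey's hinges) can give slightly different boxes.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Return the quantities a boxplot draws for one group of values."""
    data = sorted(values)

    # Quartiles via the "inclusive" interpolation rule. This choice is an
    # assumption; other quartile definitions can differ slightly.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1

    # Measure 1.5 * IQR outward from each side of the box.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # "Slide back" toward the box until an actual observation is found:
    # the whiskers end at the most extreme observations inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]

    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Observations beyond the fences are potential outliers.
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


# Hypothetical data, not taken from the birth weight example.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 20])
print(stats)
```

With these made-up values the right whisker stops at 9 and the value 20 is flagged as a potential outlier, so the maximum is an outlier while the minimum is the end of the left whisker.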
Examples include:
geom_bar: requires a statistical procedure to count the number of observations in each category.
geom_smooth: requires a statistical method which fits a simple regression or estimates a nonlinear relationship using loess.
geom_col: doesn’t require an associated stat because the counts have been calculated elsewhere.

Code Interpretation

Code Chunk #1: Interpret lines 2 - 5
Line 2: Uses the aesthetic to map Date to the x-axis and Sales divided by 1000 to the y-axis.
Line 3: Creates the first layer, a line plot, using geom_line. The characteristics of the line are linetype, color, and transparency, which are specified outside of an aes declaration because they apply to all the points in the data frame.
Line 4: Creates a second layer which plots individual points at a reduced size from the default. The smaller size is to accommodate the large number of points in the graph.
Line 5: Creates a third layer which superimposes a loess line on top of the line plot and the points. In this case, method = loess by default. The span = .20 sets the width of the smoothing window used by the loess procedure (the fraction of observations used in each local fit). The se = FALSE keeps ggplot from drawing a confidence interval around the loess line. The size = 1 controls the thickness of the line.

Code Chunk #2: Interpret lines 2 – 4
Line 2: Uses the aesthetic declaration to assign or map values to the x and y axes. The variable for the x-axis is the city miles per gallon variable. The y-axis gets its values from a statistical procedure that estimates probability density values, which are given as ..density.. .
Line 3: Creates a layer for a histogram. The x-axis is first divided into intervals 2 miles per gallon wide. Then each observation is sorted into one of these intervals. The binwidth affects the smoothness of the histogram.
Line 4: Creates a second layer wherein an estimated density is superimposed on top of the histogram. The bw = 1, or bandwidth, controls the amount of smoothing of the density.
The area beneath the estimated density curve is filled with lightblue and the opacity of the fill is reduced to ? so that we can see the histogram, which lies behind the density.

Code Chunk #3: Interpret lines 2 – 4
Line 2: Uses the aesthetic to assign race to the x-axis. Race is a categorical variable which must have been assigned an order because fct_rev reverses the order of the levels and associated labels.
Line 3: Creates a graphical layer for a bar chart. It also creates a proportion using stats, as it first counts and then figures out the percentage for each race. This is accomplished with the position = "fill" part of the geom_bar.
Line 4: Switches the orientation from vertical to horizontal. In case you are interested, the new version of ggplot, which was just released, now accomplishes this with orientation = "y". This replaces coord_flip().

Code Chunk #4: Interpret lines 2 – 10
Line 2: Specifies the aesthetics by choosing race as the categorical variable for the violin plot and age as the numerical variable. It also orders the categorical variable race by the median of age. This is the purpose of the fct_reorder function in the aesthetics declaration.
Line 3: Puts a violin in the first layer and fills the inside with a lighter grey equal to grey60.
Line 4: Creates a second layer for the boxplot and fills the inside with a darker grey40. The width of the boxplot is reduced to 0.10 so that it doesn’t obscure the violin outline of the plot.
Lines 5 – 9: Calculates the mean of the numerical variable (age) and plots the result as a point using the hollow diamond shape. The fill of the shape is white and the size of the plotting symbol is 1.5.
Line 10: Flips the coordinates so that the orientation is horizontal. In case you are interested, in the new ggplot this is accomplished by assigning the category to the y-axis and the numerical variable to the x-axis.

Code Chunk #5: Interpret lines 2 - 4
Line 2: Maps the variables to parts of the graphic.
The x-axis receives the variable lwt, which is the pre-pregnancy weight, y is the birth weight, and the geoms that follow depend on the color = smoke aesthetic. This will color the points and the lines in the geoms that follow depending on whether the observation corresponds to a smoker or nonsmoker.
Line 3: Creates a new graphical layer and maps the shape to smoke so that color and shape for the point geom both correspond to the smoker variable. The shape only applies to the point.
Line 4: Creates a second layer and creates a linear regression line with no confidence interval. It inherits the color from the aesthetic statement in line 2.
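Because the regression layer in Code Chunk #5 inherits the color = smoke mapping, the observations are grouped by smoking status and a separate line is fitted for each group. The sketch below illustrates that idea in plain Python rather than R: it splits rows by group and fits an ordinary least squares line to each. The helper names fit_line and fit_by_group and the data rows are hypothetical, made up only for illustration, and are not taken from the birth weight data.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_by_group(rows):
    """Fit one line per group, as a grouped regression layer would."""
    groups = defaultdict(lambda: ([], []))
    for group, x, y in rows:
        groups[group][0].append(x)
        groups[group][1].append(y)
    return {g: fit_line(xs, ys) for g, (xs, ys) in groups.items()}


# Hypothetical (smoke, lwt, bwt) rows, invented for this sketch.
rows = [
    ("nonsmoker", 100, 3000), ("nonsmoker", 120, 3200), ("nonsmoker", 140, 3400),
    ("smoker", 100, 2800), ("smoker", 120, 2900), ("smoker", 140, 3000),
]

fits = fit_by_group(rows)
for group, (slope, intercept) in fits.items():
    print(group, "slope:", slope, "intercept:", intercept)
```

Each group gets its own slope and intercept, which is why the plot shows two differently colored regression lines instead of one line through all the points.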