ChrisBilder.com



Project #1 Answers
STAT 801
Fall 2020

Complete the problems below. Within each part, include your R program output with code inside of it and any additional information needed to explain your answer. Note that you will need to edit your output and code in order to make it look nice after you copy and paste it into your Word document.

1. (22 total points) Prices for diamonds depend on color (D, E, F, G, H, or I), clarity (IF, VVS1, VVS2, VS1, or VS2), and carats (size). The purpose of this problem is for you to use graphical and numerical summary measures to examine how these three items affect diamond prices. The file diamonds.csv contains a sample of diamonds and their prices. This file is available from the graded webpage of the course website.

a) (2 points) Read the data into a data frame called “diamond”. Use the head() function to verify that you read it in correctly.

> diamond <- read.csv(file = "diamond.csv")
> head(diamond) #Shows first 6 observations
  carat color clarity    price
1  0.30     D     VS2 745.9184
2  0.30     E     VS1 865.0820
3  0.30     G    VVS1 865.0820
4  0.30     G     VS1 721.8565
5  0.31     D     VS1 940.1322
6  0.31     E     VS1 890.8626

b) (2 points) To help with future calculations, I would like you to re-order the levels for diamond$clarity to match those given in the initial description of this data. This is done by making clarity a “factor” type of variable within the data frame. Run the two lines of code below to re-order the levels and make sure your levels match what is shown.

> diamond$clarity <- factor(x = diamond$clarity, levels = c("IF", "VVS1", "VVS2", "VS1", "VS2"))
> levels(x = diamond$clarity)
[1] "IF"   "VVS1" "VVS2" "VS1"  "VS2"

c) (3 points) Find the sample mean and standard deviation of prices for each color. Find the same summary measures for clarity.
Interpret the values.

> aggregate(price ~ color, data = diamond, FUN = mean)
  color    price
1     D 4067.533
2     E 3155.505
3     F 2742.140
4     G 2535.770
5     H 2844.468
6     I 2964.782
> aggregate(price ~ color, data = diamond, FUN = sd)
  color    price
1     D 3210.885
2     E 2065.282
3     F 1919.513
4     G 1800.880
5     H 1668.328
6     I 1756.749
> aggregate(price ~ clarity, data = diamond, FUN = mean)
  clarity    price
1      IF 1543.841
2    VVS1 3189.709
3    VVS2 3068.779
4     VS1 2897.187
5     VS2 3356.157
> aggregate(price ~ clarity, data = diamond, FUN = sd)
  clarity    price
1      IF 1727.907
2    VVS1 2104.433
3    VVS2 1898.578
4     VS1 1864.549
5     VS2 1746.826

Color: D has the largest mean price and G has the smallest mean price. D also has the largest variability.

Clarity: IF by far has the smallest mean price. VS2 has the largest mean price. VVS1 has the largest variability.

One could also use the rule of thumb for the number of standard deviations within which all data lie from their mean to further interpret these summary measures.

d) (4 points) Construct box and dot plots for the diamond prices at each level of color. Overlay the dot plot on the box plot. Find the same plots for each level of clarity. Make sure to avoid the double plotting of outliers, and interpret the plots.

> boxplot(formula = price ~ color, data = diamond, main = "Box and dot plot for color", ylab = "Price", xlab = "Color")
> stripchart(x = diamond$price ~ diamond$color, lwd = 2, col = "red", method = "jitter", vertical = TRUE, pch = 1, main = "Dot plot", add = TRUE)
> par(mfrow = c(1,1))
> boxplot(formula = price ~ clarity, data = diamond, main = "Box and dot plot for clarity", ylab = "Price", xlab = "Clarity", pars = list(outpch = NA), col = NA)
> stripchart(x = diamond$price ~ diamond$clarity, lwd = 2, col = "red", method = "jitter", vertical = TRUE, pch = 1, main = "Dot plot", add = TRUE)

For color, the amount of variability tends to decrease from D to I. F and G have the smallest medians. There are no outliers.
For clarity, there is a large number of observations at the very low prices for IF. There are some outliers in the IF, VVS1, and VVS2 plots. VS2 has the largest median price.

e) (3 points) Particular levels of color and clarity are more desirable than others. The ordering for color is D > E > F > G > H > I (e.g., D is more desirable than E). The ordering for clarity is IF > VVS1 > VVS2 > VS1 > VS2. Prices are expected to follow this same pattern (e.g., color = D diamonds should be more costly than color = I diamonds). Does this pricing structure hold here? Use your answers for parts c) and d) to justify your answer.

It does not appear to hold completely. With respect to the sample means, we have the ordering D > E > F > G, but then G < H and H < I for color. Also, we have IF < VVS1 and VS1 < VS2 for clarity with respect to the sample means. We have similar findings with respect to the dot and box plots. For example, the distribution of sampled values for IF is largely shifted toward the lower prices in comparison to the other clarity levels.

f) (4 points) Larger diamonds are generally more desirable than smaller diamonds. Prices are expected to follow this same pattern as well (larger are more costly). One way to determine if this occurs with our data is to construct a scatter plot. Construct a scatter plot with price on the y-axis and carat on the x-axis. Describe any trends that you see in the plot and relate them to the expected pricing structure.

> plot(x = diamond$carat, y = diamond$price, main = "Price vs. carat", xlab = "Carat", ylab = "Price", col = "red")

There is an upward trend: as carat increases, price increases, which matches the expected pricing structure. Notice that the variability in prices also increases as carat increases.

g) (4 points) The code below extends the scatter plot in the previous part to include the color and clarity variables. These variables are included by representing one variable through the plotting point and representing the other variable through separate scatter plots.
Run the code and describe the relationship that carat, color, and clarity have with price. Do the pricing structures mentioned in the previous parts appear to be true once all variables are accounted for? Note that this code uses functions from the ggplot2 package. You will not be asked to use this code on an exam, but you may need to interpret similar types of plots.

# Need to install package first
library(package = ggplot2)
# Open a wider graphics window
x11(width = 10)
# Change plotting theme to better set of colors
theme_set(new = theme_bw())
# Basic part of plot
save.plot1 <- ggplot(data = diamond, mapping = aes(x = carat, y = price, color = color, shape = color))
# Add items to plot and plot it!
save.plot1 + facet_wrap(~clarity, ncol = 5) + ylab("Price") + xlab("Carat") +
  geom_point() + ggtitle("Diamond data plot #1") +
  scale_color_manual(values = c(D = "blue", E = "purple", F = "darkgreen", G = "black", H = "red", I = "lightgreen")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(panel.grid.major = element_line(color = "darkgray", linetype = "dotted"))

x11(width = 10)
save.plot2 <- ggplot(data = diamond, mapping = aes(x = carat, y = price, color = clarity, shape = clarity))
save.plot2 + facet_wrap(~color, ncol = 6) + ylab("Price") + xlab("Carat") +
  geom_point() + ggtitle("Diamond data plot #2") +
  scale_color_manual(values = c(IF = "blue", VVS1 = "purple", VVS2 = "darkgreen", VS1 = "black", VS2 = "red")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(panel.grid.major = element_line(color = "darkgray", linetype = "dotted"))

Accounting for carat helps to show that the expected pricing structures generally hold true. For example, we see a definite D > E > F > H > I for larger carat values in the VVS1 plot above. We also see IF > VVS1 > VVS2 > VS1 > VS2 in the F plot. Note that there are not many large IF diamonds, which helps to explain why their average price is lower than that of lesser quality diamonds.
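The remark about IF diamonds being mostly small can be checked numerically. Below is a minimal sketch, not part of the original answers: carat.by.group() is a hypothetical helper name, and it assumes a data frame with a carat column and a grouping column such as clarity or color, like the diamond data frame read in earlier.

```r
# Hypothetical helper (an illustration, not from the original answer key):
# count, mean carat, and maximum carat within each level of a grouping variable.
carat.by.group <- function(data, group) {
  # Build the formula carat ~ <group> from the supplied column name
  f <- reformulate(termlabels = group, response = "carat")
  n <- aggregate(f, data = data, FUN = length)
  avg <- aggregate(f, data = data, FUN = mean)
  big <- aggregate(f, data = data, FUN = max)
  data.frame(level = n[[group]], n = n$carat,
             mean.carat = avg$carat, max.carat = big$carat)
}

# Usage, assuming the diamond data frame is already in the workspace:
# carat.by.group(data = diamond, group = "clarity")
# carat.by.group(data = diamond, group = "color")
```

If the IF row shows a noticeably smaller mean carat than the other clarity levels, that would be consistent with the explanation given above for its low average price.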
Here’s an alternative way to produce the plots using the lattice package:

> library(package = lattice)
> trellis.par.set(superpose.symbol = list(pch = 1:7)) #Set plotting symbols
> win.graph(width = 10, height = 7, pointsize = 12)
> xyplot(x = price ~ carat | clarity, data = diamond, layout = c(5,1), groups = color, main = "Carat vs. Price", xlab = "Carat", ylab = "Price", auto.key = list(points = TRUE, space = "right"))
> win.graph(width = 10, height = 7, pointsize = 12)
> xyplot(x = price ~ carat | color, data = diamond, layout = c(6,1), groups = clarity, main = "Carat vs. Price", xlab = "Carat", ylab = "Price", auto.key = list(points = TRUE, space = "right"))

2. (24 total points) Below is a contingency table summarizing field goals from the 1995 NFL season (Bilder and Loughin, Chance, 1998). The events in the table correspond to stadium type (dome or outdoors) and field goal result (success or failure).

                     Field goal result
Stadium type    Success    Failure    Total
Dome                335         52      387
Outdoor             927        111     1038
Total              1262        163     1425

While these field goals represent a sample from the population of all field goal attempts, all probabilities should be calculated with respect to the above table only.

a) (3 points) Find the contingency table with the joint and marginal probabilities in each cell.

There are a few different ways to do this in R. First, one could simply type the necessary calculations at the command prompt:

> # P(FG = Success and Stadium = Dome)
> 335/1425
[1] 0.2350877

Alternatively, you could create vectors of data with the c() function and do calculations using them.
You could even put the data into a data frame (via Excel or within R itself) and perform the needed calculations:

> count <- c(335, 927, 52, 111)
> stadium <- c("Dome", "Outdoor", "Dome", "Outdoor")
> field.goal <- c("Success", "Success", "Failure", "Failure")
> set1 <- data.frame(stadium, field.goal, count)
> set1
  stadium field.goal count
1    Dome    Success   335
2 Outdoor    Success   927
3    Dome    Failure    52
4 Outdoor    Failure   111
> n <- sum(set1$count)
> set1$prob <- round(set1$count/n, 4)
> set1
  stadium field.goal count   prob
1    Dome    Success   335 0.2351
2 Outdoor    Success   927 0.6505
3    Dome    Failure    52 0.0365
4 Outdoor    Failure   111 0.0779
> #P(FG = Success)
> sum(set1$prob[1:2])
[1] 0.8856

Lastly, in my categorical data analysis course, we create the table using the following code (you are not responsible for this in our class):

> c.table <- array(data = c(335, 927, 52, 111), dim = c(2,2), dimnames = list(Stadium = c("Dome", "Outdoor"), FieldGoal = c("Success", "Failure")))
> c.table # body of table
         FieldGoal
Stadium   Success Failure
  Dome        335      52
  Outdoor     927     111
> c.table[1,1] # (1,1) element
[1] 335
> c.table[1,] # row 1
Success Failure 
    335      52 
> sum(c.table[1,]) # sum a row
[1] 387
> rowSums(c.table) # row counts
   Dome Outdoor 
    387    1038 
> colSums(c.table) # column counts
Success Failure 
   1262     163 
> # Some of the probabilities
> round(c.table/sum(c.table), 4)
         FieldGoal
Stadium   Success Failure
  Dome     0.2351  0.0365
  Outdoor  0.6505  0.0779

Below is the final table:

                     Field goal result
Stadium type    Success    Failure    Total
Dome             0.2351     0.0365   0.2716
Outdoor          0.6505     0.0779   0.7284
Total            0.8856     0.1144   1

b) (2 points) What is the probability a field goal was successful regardless of stadium type?

P(FG = Success) = 0.2351 + 0.6505 = 0.8856

c) (2 points) What is the probability a field goal was a success and attempted in a dome stadium?

P(FG = Success and Stadium = Dome) = 0.2351

d) (3 points) Given the field goal is attempted in a dome stadium, what is the probability it was a success?

P(FG = Success | Stadium = Dome) = P(FG = Success and Stadium = Dome) / P(Stadium = Dome) = 0.2351 /
0.2716 = 0.8656

e) (3 points) Is field goal result independent of stadium type? Explain.

Note that P(FG = Success) = 0.8856 ≠ 0.8656 = P(FG = Success | Stadium = Dome). Because they are not equal, the events are dependent. However, similar to the Larry Bird in-class example, the probabilities are close. Although they are dependent, there is not much dependence between the events. There are many other ways independence could have been checked here.

Similar to the Larry Bird example in class, one typically would consider these field goals as being a sample from the population of all field goals. This would be especially true if conditions from year to year in the NFL remain the same (no rule changes, abilities of field goal kickers remain constant, …). Questions about whether this is a representative sample would need to be addressed. Assuming it is a representative sample, one may be interested in drawing an inference from the sample to the population of all field goals. A hypothesis test for independence could be conducted using the data. The result is that there is not sufficient evidence to indicate dependence. We will look at ways to perform hypothesis tests for this table later in the course.

f) (5 points) Suppose a field goal kicker has two different offers of where to play next season: Kansas City Chiefs (outdoor stadium) or New Orleans Saints (indoor stadium). Each team plays eight of sixteen games at its own stadium. Assume each team has the same number of away games in outdoor and dome stadiums. Using ONLY the resulting data above, which team would be better for the field goal kicker to play for, or does it matter? Explain.

P(FG = Success | Stadium = Dome) = 0.2351 / 0.2716 = 0.8656
P(FG = Success | Stadium = Outdoor) = 0.6505 / 0.7284 = 0.8931

Given the field goal was attempted in an outdoor stadium, the probability of success is higher. Using this information only, the field goal kicker should choose Kansas City, which has an outdoor stadium.
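Both conditional probabilities can also be obtained in one step from the table of counts. The following is a minimal sketch using the same c.table array shown earlier and base R's prop.table(); the object names cond.prob and p.success are chosen here only for illustration.

```r
# Same table of counts as in the earlier c.table code
c.table <- array(data = c(335, 927, 52, 111), dim = c(2, 2),
                 dimnames = list(Stadium = c("Dome", "Outdoor"),
                                 FieldGoal = c("Success", "Failure")))

# Row proportions: each row sums to 1, giving P(FieldGoal | Stadium)
cond.prob <- prop.table(x = c.table, margin = 1)
round(cond.prob, 4)

# Marginal P(FG = Success), for comparison with the conditional probabilities
p.success <- sum(c.table[, "Success"]) / sum(c.table)
round(p.success, 4)
```

Under independence, both entries in the Success column of cond.prob would equal p.success. Working from the counts avoids the small rounding differences that can arise when dividing already rounded probabilities.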
However, notice again how close these two probabilities are. There is not much dependence between these events. There are many other factors that affect success or failure. For example, some stadiums are more affected by wind than others, and coaches may be more likely to have their field goal kicker attempt longer field goals in dome stadiums. These factors should be considered before an actual decision is made. In my categorical data analysis course, we examine ways to account for these additional factors.

g) (3 points) “Odds” are a rescaling of probabilities. Specifically, the odds of an event A are P(A)/(1 - P(A)) = P(A)/P(Ā). For this problem with conditional probabilities, the odds of a success given dome or outdoor stadium can be found using

Odds of a success given ____
Dome:     Odds(FG = Success | Stadium = Dome) = P(FG = Success | Stadium = Dome) / P(FG = Failure | Stadium = Dome)
Outdoor:  Odds(FG = Success | Stadium = Outdoor) = P(FG = Success | Stadium = Outdoor) / P(FG = Failure | Stadium = Outdoor)

Find the numerical values for the odds of a success given dome or outdoor stadium and provide an interpretation.

Odds of a success given ____
Dome:     6.44
Outdoor:  8.35

The probability of success is 6.44 times as large as the probability of failure for the dome stadiums. The probability of success is 8.35 times as large as the probability of failure for the outdoor stadiums.

h) (3 points) In many scientific studies, the “odds ratio” is used to compare two different odds. The odds ratio in this case is defined as OR = Odds(FG = Success | Stadium = Dome) / Odds(FG = Success | Stadium = Outdoor). What value of the odds ratio corresponds to independence? Relate this to the field goal kicking problem here.

There would be independence when OR = 1. Notice that the odds of success are the same for dome and outdoors when P(FG = Success | Stadium = Dome) = P(FG = Success | Stadium = Outdoor), which means independence (remember that P(B|A) = P(B|Ā) = P(B) under independence).
For this problem, the odds ratio is 0.7714, which means there is dependence for this sample. However, for the same reasons as before, this is close to independence. Often, the odds ratio will be inverted so that it is always greater than 1, which results in a value of 1.2963 here. This can be interpreted as: “The odds of success are 1.2963 times as large for field goals in outdoor stadiums as for field goals in dome stadiums.” See my Section 1.2 notes if you would like more information about odds ratios.
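The odds and the odds ratio can be reproduced directly from the counts, because the row totals cancel when the two conditional probabilities are divided. This is a minimal sketch using the same c.table array shown earlier; the names odds and OR are chosen here only for illustration.

```r
# Same table of counts as in the earlier c.table code
c.table <- array(data = c(335, 927, 52, 111), dim = c(2, 2),
                 dimnames = list(Stadium = c("Dome", "Outdoor"),
                                 FieldGoal = c("Success", "Failure")))

# Odds of success within each stadium type: successes / failures
odds <- c.table[, "Success"] / c.table[, "Failure"]
round(odds, 2)     # Dome 6.44, Outdoor 8.35

# Odds ratio: dome odds relative to outdoor odds
OR <- unname(odds["Dome"] / odds["Outdoor"])
round(OR, 4)       # 0.7714

# Inverted odds ratio: outdoor odds relative to dome odds
round(1 / OR, 4)   # 1.2963
```

An odds ratio of exactly 1 would correspond to independence, so the distance of 0.7714 (or 1.2963) from 1 is what measures the dependence in this sample.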