Data Mining Case Answers
By: Jake Nickola and Hamdee Khader
BUS 443 (001)

Problem 1: Hierarchical Cluster Analysis with the Football Bowl Subdivision (FBS)

Which cluster has the largest average football stadium capacity?
The cluster with the largest average football stadium capacity is Cluster 3, with an average of 87,615.80.

Which school and cluster has the highest football stadium capacity?
The Michigan Wolverines have the highest stadium capacity, at 114,804. Furthermore, Cluster 5 has the highest total stadium capacity, with a sum of 1,921,325.

How would you characterize the universities in this cluster (Cluster 5)?
Cluster 5 contains the greatest number of schools, as shown by its higher count of observations compared with all other clusters. Accordingly, Cluster 5 has the highest total stadium capacity. Cluster 5 also has the third-largest average stadium capacity of all 10 clusters. These factors have important implications for the universities within this cluster. The universities of Cluster 5 are mostly mid-tier across many variables. These universities bring in a decent amount of athletic revenue, are in the middle of the pack in terms of stadium capacity, and have considerable (but not the highest) endowment and enrollment numbers. Cluster 5 doesn’t have any of the perennial football powerhouses like Ohio State and Michigan, or any of the colleges that boast an extremely high endowment like Stanford, but it contains formidable programs, such as NC State, Virginia Tech, and Ole Miss.

What is the smallest cluster (the one with the fewest observations) and what makes it unique?
The smallest cluster is Cluster 10. This cluster is unique because it contains only one observation (university), whereas the other 9 clusters each contain multiple universities. The only university within Cluster 10 is Stanford University.
Stanford commands its own cluster due to its significantly larger endowment in comparison to the other universities within the sample. Its endowment is so high in comparison that its medium-range stadium capacity of 50,000 and medium-range enrollment do not pull it into any of the other clusters.

What number of clusters seems to be the most natural fit based on the distance (after examining the dendrogram)?
Nine clusters seem to be the most natural fit based on the distance.

Why should we rerun the cluster analysis using different variables or a different number of clusters (after counting the number of schools per cluster)?
The cluster analysis should be rerun with a different number of clusters. One cluster has only one school, which would not be appropriate for the competitive environment of a football conference. Another cluster has upwards of 30 schools within it, which is much larger than the other 9 clusters and creates an imbalance within the Football Bowl Subdivision (FBS) landscape.

Which is the better method (clustering with the variables above or clustering with just latitude and longitude as variables)?
Clustering with stadium capacity, latitude, longitude, endowment, and enrollment is much more effective than clustering with just latitude and longitude. One reason is that clustering with only latitude and longitude resulted in a cluster with 98 schools, while many other clusters had only 1 school. The disparity in school counts is too great for this clustering method to be viable. This method also does not effectively recognize the differences between schools.

Problem 2: k-Means Cluster Analysis with the Football Bowl Subdivision (FBS)

What is the smallest cluster?
The cluster with the fewest observations is Cluster 8, with 1 observation.

What is the least dense (most diverse) cluster?
Cluster 10 is the least dense cluster, as it has the largest average distance in the cluster.
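As a minimal sketch of this density measure: assuming the "average distance in the cluster" reported by the software is the mean Euclidean distance from each observation to its cluster centroid (an assumption; the write-up does not define it), it could be computed as below. The two toy clusters and their coordinates are hypothetical and are not FBS data.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def average_distance(points):
    """Mean Euclidean distance from each point to the cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

# Hypothetical, already-normalized observations (not taken from the case data).
tight_cluster = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
loose_cluster = [(0.0, 0.0), (0.0, 6.0), (6.0, 0.0), (6.0, 6.0)]

tight = average_distance(tight_cluster)
loose = average_distance(loose_cluster)
print(f"tight cluster: {tight:.3f}, loose cluster: {loose:.3f}")
```

Under this reading, the cluster with the larger value is the less dense one.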
The least dense cluster is the most diverse because, in a very dense cluster, the observations lie very close to one another. In our case, the least dense cluster contains observations that lie much farther apart, which emphasizes their dissimilarity.

What problems do you see with the plan of defining the school membership of the 10 conferences directly with these 10 clusters?
Defining school membership under these 10 clusters would bring about a variety of issues. First, one cluster, Cluster 8, has only 1 observation, which is unfit for a football conference that inherently requires competition. Second, certain clusters contain far too many observations in comparison to others, such as Cluster 2 with 22 observations compared with Cluster 3 with 5 observations. Under FBS regulations for determining postseason bowl games, how would a committee accurately compare a school that faces 21 similar opponents over the course of a regular season with a school that has only 4 other opponents?

Problem 3: Both Types of Cluster Analysis with the Football Bowl Subdivision (FBS)

What problems do you see with this plan? How could this approach be tweaked to solve the problem?
When clustering by region and then sub clustering by stadium capacity, endowment, and enrollment, a few problems become apparent. For example, the west region was broken up into two sub clusters. West Sub Cluster 1 has 17 schools within it, and West Sub Cluster 2 has 9 schools within it. We recommend that, for the west region, the number of sub clusters be chosen so that the number of schools is more similar across all the sub clusters. The east region shares this same uneven distribution of schools throughout its 3 sub clusters, with East Sub Cluster 1 having 24 schools within it, East Sub Cluster 2 having 29 schools, and East Sub Cluster 3 having only 4 schools.
This disparity will cause an imbalance in the east region conferences and should be addressed by choosing a more appropriate number of sub clusters. The same school count disparities are also present in the south region, which could also benefit from further optimization.

Problem 4: Market Basket Analysis on Cookie Monster, Inc. (Problem 8 in our Textbook)

What information does this analysis provide Cookie Monster regarding the online behavior of individuals? Be sure to address the lift ratios (and the meaning of the lift ratios) in common terms that a business user would immediately understand.
For most of the rules, if a user visits a specific website, the confidence that the user will also visit the consequent website is not that strong (most have a confidence around 50%). This is not the case for rules number 9, 10, and 11. For rule number 9, if a user visits YouTube and Twitter, then with a fair confidence of 78%, the user will also visit Facebook. The confidence levels for rules number 10 and 11 are the strongest. For rule number 10, if a user visits Facebook and Twitter, then with 94.58% confidence the user will also visit YouTube. As for rule number 11, if a user visits Facebook and YouTube, then with 93.93% confidence the user will also visit Twitter.

For rules with weak confidence levels, such as rule 1, the relationship between users who visit Pinterest alongside Amazon is weak. For rule number 1, 1690 users visited Pinterest (as shown by support for A), and 1840 visited Amazon (as shown by support for C). However, only 891 users visited Amazon alongside Pinterest. In contrast, for rules with reasonable and strong confidence levels (rules 9, 10, and 11), the support is strong.
For rule number 9, which has a confidence level of 78.02%, 1051 people visited YouTube and Twitter, while 1673 people visited Facebook. Of these, 820 people visited Facebook alongside YouTube and Twitter. Furthermore, for rule number 10, which has the strongest confidence level of the 14 rules (94.58%), 867 users visited Facebook and Twitter and 1825 users visited YouTube. Of these, 820 people visited YouTube alongside Facebook and Twitter. As for rule 11, which has the second strongest confidence level, 873 people visited Facebook and YouTube and 1854 people visited Twitter. Of these, 820 people visited Twitter alongside Facebook and YouTube.

As for the lift ratio, a lift ratio larger than 1.0 implies that the relationship between the antecedent (A, the websites listed in the antecedent column) and the consequent (C, the websites listed in the consequent column) is stronger than would be expected if the two sets were independent (that is, if there were no relationship between visiting the antecedent websites and visiting the consequent websites). The larger the lift ratio, the stronger the association. Thus, according to the lift ratios, the relationships between the antecedent and the consequent are meaningful for all of the rules, as every lift ratio is over 1. The rules with the highest lift ratios (and thus the strongest relationship between the websites listed in the antecedent (A) and consequent (C)) are rules 9, 10, and 11. Notice that these rules also have the strongest confidence levels. Take rule 10 as an example. Rule 10 has the highest confidence level (94.58%) and also has the highest lift ratio (10.37). The two measures are related: the lift ratio is the confidence divided by the share of all users who visited the consequent, so the confidence level is needed in order to calculate the lift ratio.
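The link between confidence and lift can be sketched as follows, using rule 10's counts from the text. The total number of users is not stated in this write-up; the value of 20,000 below is an assumption, chosen only because it roughly reproduces the reported lift ratio of 10.37. The function and variable names are illustrative as well.

```python
def confidence(support_both, support_antecedent):
    """Share of antecedent visitors who also visited the consequent."""
    return support_both / support_antecedent

def lift_ratio(support_both, support_antecedent, support_consequent, total_users):
    """Confidence divided by the consequent's share of all users."""
    baseline = support_consequent / total_users
    return confidence(support_both, support_antecedent) / baseline

# Rule 10 counts from the text: A = {Facebook, Twitter}, C = {YouTube}.
SUPPORT_A = 867      # users who visited Facebook and Twitter
SUPPORT_C = 1825     # users who visited YouTube
SUPPORT_BOTH = 820   # users who visited all three sites
ASSUMED_TOTAL_USERS = 20_000  # assumption: not given in the write-up

rule10_confidence = confidence(SUPPORT_BOTH, SUPPORT_A)
rule10_lift = lift_ratio(SUPPORT_BOTH, SUPPORT_A, SUPPORT_C, ASSUMED_TOTAL_USERS)
print(f"confidence = {rule10_confidence:.2%}, lift = {rule10_lift:.2f}")
```

With these inputs the confidence comes out at about 94.58%, matching the reported figure, and the lift lands close to 10.37 under the assumed total, so both measures are high for rule 10.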
Rule 10's high confidence level and lift ratio imply that people who visit Facebook and Twitter are very likely to visit YouTube as well.

Problem 5: k-Nearest Neighbors Data Mining for Finding Undecided Voters for Campaign Organizers

For k = 1, why is the overall error rate equal to 0 percent on the training set? Why isn’t the overall error rate equal to 0 percent on the validation set?
The error rate for the training set is calculated by comparing each training set observation to its k most similar observations in the training set. At k = 1, the most similar observation to any training observation is that observation itself, so each observation is classified by its own label and the error rate is 0 percent. The overall error rate on the validation set is not equal to 0 because validation observations are not part of the training set; each one is compared to its k most similar observations within the training set. At k = 1, the single most similar training observation may well have a different classification from the validation observation, which produces errors.

For the cutoff probability value 0.5, what value of k minimizes the overall error rate on the validation data? Explain the difference in the overall error rate on the training, validation, and test data.
The value of k that minimizes the overall error rate on the validation data is k = 3. The error rate is lowest overall on the training set data, since each training observation counts itself among its own nearest neighbors, which lowers the error rate. The overall error rate on the validation data is also somewhat optimistic, because k = 3 was chosen precisely because it gave the lowest validation error rate of all the k values. When k = 3 is applied to the test data, the error rate will tend to be higher, since the test data played no part in finding the best value of k.

Examine the decile-wise lift chart. What is the first decile lift on the test data? Interpret this value.
The first decile lift on the test data is 2.29. The test data set contains 2000 observations, 791 of which are undecided voters.
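The arithmetic behind this interpretation can be sketched as below, using the counts and lift value quoted above. The decile size of 200 assumes ten equal deciles of the 2000 test observations; the variable names are illustrative.

```python
TEST_OBSERVATIONS = 2000
UNDECIDED_IN_TEST = 791
FIRST_DECILE_LIFT = 2.29  # first decile lift as reported in the answer

# One decile is a tenth of the test set.
decile_size = TEST_OBSERVATIONS // 10

# Share of undecided voters in the whole test set.
base_rate = UNDECIDED_IN_TEST / TEST_OBSERVATIONS

# Undecided voters expected in a random sample of one decile.
expected_random = base_rate * decile_size

# Undecided voters expected among the model's top-ranked decile.
expected_model = expected_random * FIRST_DECILE_LIFT

print(f"base rate: {base_rate:.2%}")
print(f"random 200 voters: {expected_random:.1f} undecided")
print(f"top-ranked 200 voters: {expected_model:.2f} undecided")
```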
If we randomly selected 200 voters, we would find that on average 79.1 of them (39.55 percent) would be undecided. If we instead use k-nearest neighbors with k = 3 to identify the 200 voters most likely to be undecided, about 181.14 of them would actually be undecided (calculated from 79.1 * 2.29).

For cutoff probability values of 0.5, 0.4, 0.3, and 0.2, what are the corresponding Class 1 error rates and Class 0 error rates on the validation data?

Cutoff Value    Class 1 Error Rate    Class 0 Error Rate
0.5             25.72%                17.36%
0.4             25.72%                17.36%
0.3             8.68%                 36.80%
0.2             8.68%                 36.80%

Problem 6: Logistic Regression to Predict the Oscars

What is the resulting logistic regression calculation?
The resulting logistic regression calculation to help predict winners of the Best Picture Oscar is:

Log Odds of Winning Best Picture = -8.21 + (0.57 * OscarNominations) + (1.03 * GoldenGlobeWins)

What is the overall error rate on the validation data?
The overall error rate on the validation data is 15%, with 6 errors out of 40 cases.

Use the model to score the new data (2011). Which movie did the model select as the most likely to win the 2011 Best Picture Award?
The model selected “The Artist” as the most likely winner of the 2011 Best Picture Award at the Oscars, with a win probability of 64.58%.

Problem 7: Logistic Regression to Predict the Organic Customers using SAS Visual Statistics

The following variables were removed from the model due to their statistical insignificance:
Recent 12-Month Purchases
Recent 3-Month Purchases

The resulting logistic regression calculation is:

0.442294 + (-0.05648 * Age) + (-0.00009 * Recent 9-Month Purchase) + (0.000129 * Recent 6-Month Purchase) + (1.817174 * Gender F) + (0.85792 * Gender M)

What is the overall R-square for the model?
The overall R-square is 0.1481.

What is the cumulative lift at the 20th percentile? Is this value low?
The cumulative lift at the 20th percentile is 4.0618.
This number is definitely low, as targeting customers with such high predicted probabilities should yield more than about 4 times the baseline rate of successes.

What does the ROC chart suggest?
The line in the ROC chart has a steep initial slope that eventually levels off, which indicates a good, viable model. Because of this, we can have faith in the ability of the model to avoid false positive and false negative classifications.

How many customers were classified as false positives? Should this model be refined more in light of this?
52,669 customers were classified as false positives. The model could be refined more, but doesn’t necessarily need to be, as 52,669 false positives out of 1,574,340 observations amounts to only about 3.35% false positives.

Which variables have a high influence on predicting whether a customer will buy organic?
The variables with the highest influence on predicting whether a customer will buy organic are Gender (Female) with a z-value of 246.19, Gender (Male) with a z-value of 104.43, and Age, which has the highest influence, with a z-value of -347.59.
The further the z-value is from 0, the greater the variable's influence on whether or not the customer will buy organic food.

Problem 8: Logistic Regression to Predict the PVA Donors using SAS Visual Statistics

The following variables were removed from the model due to their statistical insignificance:
File_avg_gift
Home_owner
House_income
Lifetime_min_gift_amt

The resulting logistic regression calculation is:

0.207235 + (age * 0.002947) + (last_gift_amt * -0.00497) + (file_card_gift * 0.026849) + (home_value * 0.000122) + (lifetime_avg_gift_amt * -0.00376) + (lifetime_gift_count * -0.00675) + (lifetime_gift_range * -0.00952) + (lifetime_max_gift_amt * 0.00981) + (lifetime_prom * -0.00397) + (months_since_first_gft * 0.002652) + (months_since_last_gft * -0.0349) + (card_prom_12 * -0.0067) + (frequency_status_97k 1 * -0.59969) + (frequency_status_97k 2 * -0.39952) + (frequency_status_97k 3 * -0.20899) + (gender F * -0.05258) + (gender M * -0.07281) + (in_house N * -0.23136) + (income_group * -0.12208) + (income_group 1 * -0.43192) + (income_group 2 * -0.21641) + (income_group 3 * -0.12173) + (income_group 4 * -0.08201) + (income_group 5 * -0.06164) + (income_group 6 * -0.02756) + (overlay_source B * 0.042168) + (overlay_source M * 0.025855) + (recency_status_96nk A * 0.008953) + (recency_status_96nk E * 0.197879) + (recency_status_96nk F * -0.2279) + (recency_status_96nk L * -0.34611) + (recency_status_96nk N * -0.03871) + (pep_star 0 * -0.2124) + (age_miss * -0.0492)

Comment on the usefulness of the model.
First of all, the R-square for this model is only 0.0339, which isn’t very close to 1. This means that not much of the variation in y-values is explained by the overall regression model. The cumulative lift is only 3.96 at the 20th percentile. The ROC chart is somewhat steep, which means that we can have some trust in the model’s ability to avoid false positive and false negative classifications.
We recommend testing additional variables in order to increase the share of variation that the overall regression model can explain.