Storm.cis.fordham.edu



Data Mining
WEKA Homework (44 points: 10 Questions)

For this homework, you must use WEKA; for your project, however, you can use any data mining toolkit.

Preparation Steps

Download and install the latest version of WEKA on your computer, or use the computers in the CS department’s instructional laboratories. For this assignment you will be working with the “Explorer” interface.

View the WEKA instructions at the following URL: . Locate the WEKA manual that matches your version of WEKA (perhaps 3.6.9) and browse it or refer to it as necessary. You can find the manuals at: .

For this assignment you will use the Adult dataset from the US Census Bureau, located at: . The relevant files are available in the data folder. The “adult.data” file contains labeled examples, one example per line. A separate test set is available in “adult.test”, but you can ignore it and simply partition the “adult.data” records into a training set and a test set. The “adult.names” file contains information on the attributes used in the data (as does the main web page listed above). Note that these files are set up for use by C4.5, not WEKA, so you will need to generate an .arff file before you can use this data.

Download the “adult.data” file and manually convert it to ARFF format by creating an appropriate header, using the feature names from the “.names” file. Going through this conversion process will ensure that you can also convert other data to the required format, which may be necessary for your project. There are some detailed instructions on creating the ARFF file on the WEKA page that I maintain at . If you get completely stuck, you can search for a version of adult.arff on the web or try . But I strongly encourage you to try it by yourself first and spend up to an hour on the conversion process.
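The conversion described above can also be scripted, which is a useful way to check a hand-built header. The following is a minimal sketch, not the required procedure. The helper names (`csv_to_arff`, `quote`, `is_number`) are made up for illustration, and the two sample rows are only illustrative records in the “adult.data” layout. The sketch assumes that an attribute is numeric when every observed value parses as a number, and it collects nominal values from the data itself; a hand-written header should take them from “adult.names” instead.

```python
import csv
import io

# Attribute names in the order used by adult.data (the last one is the class).
ADULT_NAMES = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "class",
]


def is_number(text):
    """Return True if text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def quote(value):
    """Quote a name or value if it contains characters ARFF treats specially."""
    if any(ch in value for ch in " ,{}%'\""):
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return value


def csv_to_arff(csv_text, names, relation="adult"):
    """Build ARFF text from comma-separated records (hypothetical helper)."""
    rows = []
    for raw in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        row = [cell.strip() for cell in raw]
        if len(row) == len(names):  # skip blank or malformed lines
            rows.append(row)

    lines = ["@relation " + quote(relation), ""]
    for i, name in enumerate(names):
        # "?" marks a missing value in both adult.data and ARFF.
        observed = [r[i] for r in rows if r[i] != "?"]
        if observed and all(is_number(v) for v in observed):
            kind = "numeric"
        else:
            kind = "{" + ",".join(quote(v) for v in sorted(set(observed))) + "}"
        lines.append("@attribute " + quote(name) + " " + kind)

    lines += ["", "@data"]
    for r in rows:
        lines.append(",".join(quote(v) for v in r))
    return "\n".join(lines) + "\n"


# Two illustrative rows in the adult.data layout.
sample = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 140359, 7th-8th, 4, Divorced, Machine-op-inspct, "
    "Unmarried, White, Female, 0, 3900, 40, United-States, >50K\n"
)
arff = csv_to_arff(sample, ADULT_NAMES)
print(arff)
```

To convert the real file, read “adult.data” into a string and pass it in place of `sample`, then load the result in the Explorer to confirm that every attribute has the type you expect.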
If you do the conversion simply by copying the file from the web, you still need to write down the steps for the conversion, or you will lose points for Step 1 on the homework.

Required Steps for the Homework Assignment

Note: Handwritten submissions will not be accepted. Please submit an electronic version of your homework on or before the due date. Use the numbering below when answering the questions. In some cases it may make sense to include some of the figures generated by WEKA in your submission. You can copy a screenshot or an individual graphic in WEKA by using Alt+Shift+Left-Click and then save it to a file. If you then edit the file, you can copy the image and paste it into your submission.

1. Briefly describe how you generated the .arff file for the adult data set, including all of the steps required to convert it into the .arff format, with appropriate variable names. (5 points)

2. Read the adult.names file and study the raw data (you can view the raw data most easily via the “Edit …” button while in the preprocess mode of WEKA). Understanding the domain is a key step in data mining. Write down three issues or hypotheses that you would be interested in investigating. As one example, you could switch the class variable to “sex” and try to predict that variable from the rest of the variables. You can also consider other things, such as strategies for preprocessing the data to possibly improve predictive performance. In short, come up with three issues/strategies that you could follow up on. (3 points)

3. After loading the adult dataset via the .arff file, while in the preprocess tab, look at how each feature correlates with the class variable (you can do this all at once by clicking on the “Visualize All” button). List 3 features that you think will be useful for predicting the class variable and, in one or two sentences, justify why you think these features will be useful.
(3 points)

4. Run WEKA’s J48 classifier on the initial data with the test option set to 66%, so that 66% of the data is used for training and the rest is used for testing. Answer the following questions (4 points):
   a. What is the accuracy of the classifier on the test data?
   b. How many leaves are there in the decision tree?
   c. How long did it take to build the model?
   d. Copy and paste the confusion matrix into your homework answer.

5. Repeat the previous step, but this time configure the options/properties for the J48 classifier so that binarySplits is true (i.e., all splits are binary). Then answer the same four questions (a-d) as in the prior step. (2 points)

6. Repeat the prior step, but now also change the “unpruned” option from “False” to “True”, so that pruning is not performed. Then answer the same four questions (a-d) as in the prior steps. (2 points)

7. Summarize the main differences between the 3 runs of J48 (from Questions 4-6) in terms of changes in accuracy, number of leaves in the decision tree, and how long it takes to build the model. (5 points)

8. On the initial adult data set, run the ZeroR classifier, which can be found under the rules classifiers in WEKA. Answer parts a, c, and d from Question 4 for this classifier, and characterize the differences between these results and those for Question 4. What do you think the ZeroR method is doing? Why can’t you visualize this decision tree? (5 points)

9. On the original data set, run the RandomForest method from under the tree classifiers in WEKA. Report the answers to parts a, c, and d from Question 4. Then answer the next two parts (5 points):
   e. How do the accuracy results of this method compare to those of J48 with its default parameters from Question 4?
   f. In a few sentences, describe how the random forest method works (look this up).

10. On the unaltered data set, use J48 to build a classifier to predict “sex” rather than the default class (named “class”, although it represents income level).
You can do this by using the “Classify” tab to select J48, as usual, and then using the drop-down list under the “More options…” button in the left column to manually change the class/target variable from “class” to “(Nom) sex”. As before, use 66% for training data. Answer the questions from parts a-d for Question 4. Then answer the following additional questions (10 points):
   e. Based on the results in the confusion matrix, specify the number of females and males in the test set as counts (whole numbers) and as percentages.
   f. If you had to build a simple classifier that always guesses the most common sex, what would the accuracy of the classifier be?
   g. What do the actual accuracy results say about one’s ability to predict sex from the other variables? Compare the actual classifier results to the value in part f.
   h. Examine the induced decision tree, and the attributes involved, and state why the high accuracy results are almost completely meaningless. What is the problem? Note: this may not be obvious, and it is okay if you do not answer it correctly (it is only worth a few points). The main point of this question is that you need to understand your data.
   i. Fix the problem you identified in the prior part and rerun J48. Look at the resulting decision tree to make sure that there are no obvious problems. Specify how you fixed the problem and what the accuracy is after the problem is fixed. (2 points)
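The arithmetic behind parts e and f of Question 10 can be sketched as follows. This is only an illustration under stated assumptions: the function name `summarize_confusion` is hypothetical, and the matrix values are placeholders, not results from the adult data. The sketch relies on WEKA's confusion-matrix layout, in which each row is an actual class and each column is the class the instance was classified as.

```python
def summarize_confusion(matrix, labels):
    """Summarize a confusion matrix (rows = actual class, columns = predicted)."""
    total = sum(sum(row) for row in matrix)
    # The number of test instances of each actual class is its row sum.
    actual_counts = {label: sum(row) for label, row in zip(labels, matrix)}
    # Correct predictions lie on the diagonal.
    correct = sum(matrix[i][i] for i in range(len(labels)))
    majority_label = max(actual_counts, key=actual_counts.get)
    return {
        "total": total,
        "actual_counts": actual_counts,
        "actual_percent": {
            label: 100.0 * count / total
            for label, count in actual_counts.items()
        },
        "accuracy": correct / total,
        "majority_label": majority_label,
        # Accuracy of always guessing the most common class.
        "majority_baseline": actual_counts[majority_label] / total,
    }


# Placeholder numbers for illustration only; use the matrix WEKA reports.
labels = ["Female", "Male"]
matrix = [
    [50, 30],   # actual Female: 50 classified Female, 30 classified Male
    [20, 100],  # actual Male:   20 classified Female, 100 classified Male
]
summary = summarize_confusion(matrix, labels)
for key, value in summary.items():
    print(key, value)
```

Comparing `accuracy` with `majority_baseline` is the comparison that part g asks for: a classifier is only informative to the extent that it beats the always-guess-the-majority baseline.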