Www.webpages.uidaho.edu



Stat 504 HW#2

This assignment uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition. The data set is at [URL missing].

Description of the data: Each observation in this dataset is a review of a particular business by a particular user. The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) The "cool" column is the number of "cool" votes this particular review received from other Yelp users. There is no limit to how many "cool" votes a review can receive. The "useful" and "funny" columns are similar to the "cool" column.

1. Read yelp.csv into a DataFrame.
2. Create a new DataFrame that contains only the 5-star and 1-star reviews.
3. Using the random seed 1234567, split the new DataFrame into training and testing sets, with the review text as the feature and the star rating as the response variable.
4. Use CountVectorizer to create document-term matrices from X_train and X_test.
   Hint: If you run into a decoding error, instantiate the vectorizer with the argument decode_error='ignore'.
5. Use Logistic Regression, k-Nearest Neighbors, and a Classification Tree to predict the star rating for reviews in the testing set. Calculate the AUC and plot the ROC curve for each of the three models.
   Hint 1: Make sure to pass the predicted probabilities to roc_auc_score, not the predicted classes.
   Hint 2: roc_auc_score will get confused if y_test contains fives and ones, so you will need to create a new object that contains ones and zeros instead.
6. Add the vote types (cool/useful/funny) as additional features in the modeling process and refit the three models to predict the star rating in the testing set. Calculate the AUC and plot the ROC curve for each of the three models. Is there any improvement? Comment on your findings.

For your reference, the solution with Python code for a similar homework using a Naive Bayes model on the same data is available at ................
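The steps in this assignment can be sketched roughly as follows. This is a minimal illustration, not the official solution. It assumes the review text lives in a column named "text" (the assignment does not name it), uses scikit-learn's default 75/25 split and default hyperparameters (apart from a raised max_iter for Logistic Regression), and the helper names evaluate_models and plot_roc are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

VOTE_COLS = ["cool", "useful", "funny"]  # vote columns named in the assignment


def evaluate_models(df, use_votes=False, seed=1234567):
    """Return {model name: (auc, fpr, tpr)} for 5-star vs. 1-star reviews."""
    # Keep only the 5-star and 1-star reviews.
    best_worst = df[df["stars"].isin([1, 5])]

    # Hint 2: recode the response as 1 (five stars) and 0 (one star).
    y = (best_worst["stars"] == 5).astype(int)

    # Split whole rows so the text and vote columns stay aligned.
    train, test, y_train, y_test = train_test_split(
        best_worst, y, random_state=seed
    )

    # Learn the vocabulary from the training text only.
    # "text" is an assumed column name for the review text.
    vect = CountVectorizer(decode_error="ignore")
    X_train = vect.fit_transform(train["text"])
    X_test = vect.transform(test["text"])

    if use_votes:
        # Append the three vote counts as extra columns of the sparse matrix.
        votes_train = csr_matrix(train[VOTE_COLS].to_numpy(dtype=float))
        votes_test = csr_matrix(test[VOTE_COLS].to_numpy(dtype=float))
        X_train = hstack([X_train, votes_train], format="csr")
        X_test = hstack([X_test, votes_test], format="csr")

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "k-Nearest Neighbors": KNeighborsClassifier(),
        "Classification Tree": DecisionTreeClassifier(random_state=seed),
    }

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # Hint 1: use predicted probabilities of the positive class.
        prob = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, prob)
        results[name] = (roc_auc_score(y_test, prob), fpr, tpr)
    return results


def plot_roc(results, title):
    """Draw one ROC curve per model on a shared set of axes."""
    import matplotlib.pyplot as plt  # imported here so scoring works without it

    for name, (auc, fpr, tpr) in results.items():
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.title(title)
    plt.legend()
    plt.show()
```

With the data file in the working directory, a call such as evaluate_models(pd.read_csv("yelp.csv")) would give the text-only results, and passing use_votes=True would give the results with the vote counts added, so the two sets of AUC values can be compared side by side.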