


Classification of Amazon Product Metadata: Support Vector Machine & Kernel Methods
Research Paper
STAT5703 | Amy Li, Mitchell Hughes
STAT4601 | Nolan Hodge
Monday, March 26, 2018

ABSTRACT
Automated classification of metadata into predefined categories is an important means to manage and process the expanding amount of information that Amazon stores. This enormous source of web documentation lacks structure, which prevents the derivation of potentially useful knowledge from its collection. Classification is a key technique for organizing such digital data: in this paper, we research and apply Support Vector Machines (SVM) and kernel methods to Amazon's product metadata. From our model, we predict products that are likely to be co-purchased or viewed based on their textual titles and category classifications. We find that a high degree of accuracy can be obtained with our parsimonious model.

Keywords: Machine learning, Support Vector Machines, Kernel method, Text Classification, Data Mining, Metadata

TABLE OF CONTENTS
I: STATEMENT OF PROBLEM
II: LITERATURE REVIEW
IIIA: ANALYSIS OF SIMILAR APPLICATIONS
IIIB: DISCUSSION OF TECHNICAL & INTUITIVE MODEL FRAMEWORK
IV: APPLICATION OF METHODOLOGY & CODE OVERVIEW
V: CONCLUSION
BIBLIOGRAPHY
APPENDIX I: CODE
APPENDIX II: WORK DISTRIBUTION STATEMENT

STATEMENT OF PROBLEM
In this paper, we are interested in uncovering Amazon consumer purchasing habits – classifying the relationship between products and their co-purchases. We thus seek to research the underlying technical aspects of designing a model in which co-purchased products can be predicted from Support Vector Machines (SVM), specifically in combination with the k-means technique. By deciphering product titles, descriptions, categories and Amazon's unique identifier (ASIN), we categorize common consumer baskets. This posited problem inherently requires modelling the Amazon metadata and analyzing purchasing patterns from alphanumeric inputs.

First and foremost, this research paper serves to form a deeper understanding of the topic by thoroughly digesting scholarly and technical articles. From these expert sources, we formulate a review and discussion of the methodologies involved in constructing SVMs with k-means. The task of classifying data under predefined categories – where either single or multiple labels exist – is meant to assign classes automatically based on content. This technique is valuable for data mining raw material. Classification is defined as algorithmically assigning a document or object to one or more classes based on attributes, behaviour or subject. The broad problem is identified as training on a dataset consisting of records, such that each record has a unique record identifier and corresponding fields. The amount of information available online is increasing rapidly – therefore, research into automatic classifiers has integral meaning for machine learning and information extraction. In addition, the evolution of this technique has vital implications for its present-day applications. The goal is to create a model from the training dataset, based on the class labels' attributes, which can classify new data.
With regards to machine learning approaches, this area of research has been expanding in the published domain and, likewise, SVM algorithms have been rapidly adopted on a plethora of applicable databases. The paper is organized as follows. Section II describes the literature that encompasses SVM; in Section III we discuss both the applications of the Amazon dataset performed by others and our own model's technical and intuitive framework. The penultimate section outlines our application to the Amazon dataset, analyzes the code and evaluates our SVM model's predictions. Finally, we conclude in Section V with future directions and a summary of this research project.

LITERATURE REVIEW
This research paper discusses the varied architectures of and approaches to machine learning classifiers and alternative data mining methods, and we consult the following experts' writings. Our literature review comprises peer-reviewed papers and less formal yet informative sources on SVM topics and relevant datasets. We must stress that though these informal sources are not traditional scholarly articles, the rapidity of machine learning developments means that these data scientists' blogs contain a breadth of cutting-edge knowledge. This section begins with a brief outline of SVM origins, then relays relevant papers that touch on our research topic before reviewing more specific sources that utilize this specific dataset. This latter literature necessitates further technical discussion; thus, we include the informal sources within Section III.

SVM is a fairly new learning method – its original algorithmic roots were introduced to solve statistical problems before the online era. Vapnik and his collaborators (1963) introduced the original maximum-margin algorithm, and the kernel method was later applied to maximum-margin hyperplanes to develop nonlinear classifiers; Vapnik continued developing this technique into the 1990s, making it more holistic. In another pivotal paper, Thorsten Joachims explores the use of SVM on text data and highlights its automation feature, as it eliminates the need for manual parameter tuning. His paper provides an excellent overview of its functions; he notes that SVMs are very universal learners: in their most basic form, SVMs learn linear threshold functions, but through kernel functions they remain pliable enough to act as polynomial classifiers, and this can be extended to neural nets and radial basis function networks as well. Joachims continues to explain that a unique property of SVM is that its learnability can be independent of the dimensionality of the feature space, since it "measures the complexity of hypotheses based on the margin with which they separate the data, not the number of features". This paper proposes SVMs for text categorization; the author espouses that SVM uses overfitting protection and thereby does not depend on the number of features, and can handle large feature spaces.

For a paper that encapsulates recent research, Jindal et al's Techniques for Text Classification: Literature Review and Current Trends (2015) processes existing work in this discipline and evaluates competing methodologies. They begin by defining: "Text classification consists of document representation, feature selection or feature transformation, application of data mining algorithm and finally an evaluation of the applied algorithm." (Jindal et al 2) Their collection draws varied research from digital and analog portals.
Jindal et al found that most authors have studied SVM algorithms and that it was the most popular means of text classification (followed by k-nearest neighbours), though many proposed advanced methods to enhance its applicability. Of their sample of 132 papers, 65% (88 papers) involve these two algorithms; thus, researchers display a clear preference for SVM and KNN machine learning. For instance, Leopold (2002) attests that kernel functions require much pre-processing and feature selection, and argues that weighting methods can reduce dimensionality and have a larger impact on SVM performance. However, Namburu et al (2005) argue that SVM is more suitable for binary classification. Saporta (1990) states that linear association between variables and suitable transformations of the original variables, or proper distance measures, can produce satisfactory solutions. Perhaps optimistically, these authors note that though many papers' conclusions are data-dependent and contain black-box solutions, it remains promising that there is a seismic shift from traditional statistical methods towards modern machine learning.

In terms of this specific dataset, there have been a few projects and academic research efforts conducted to reveal new machine learning practices. Julian McAuley et al (2015) proposed content-based recommender systems to model user preference towards types of foods – this is akin to our approach. He analyzes metadata from a user's previous activity, instead of collaborative means of recommending based off other users' activities. He combines the two methods to address sparsity and cold-start problems (items with no reviews yet). In a similar paper, he leverages the same dataset to build a visually-aware recommender system that is scalable, personalized, temporally evolving and interpretable. McAuley et al address how "long-tailed" corpora, with new items continually introduced as cold starts, pose common problems that require remedies.

III-A. ANALYSIS OF APPLICATIONS
In this twofold section, we begin by referencing machine learning works that have already been conducted on this Amazon dataset. Secondly, we examine the technical components of our SVM model along with an intuitive explanation.

Professor Julian McAuley of the University of California San Diego is the source of our dataset. It ranges from Amazon's online debut in 1995 to 2013. Analyzing the 9.4 million products (roughly 10 GB) proved to be a difficult computational task for any laptop, so we created a subset. Of these products, if we particularize the entries, we find that there are fewer distinct products: many are variations of the same product (i.e. different colours of a mug), and there are many null entries without pricing information.

Max Woolf, an associate data scientist at BuzzFeed, has performed a similar analysis on McAuley's review dataset. This collection contains 142.8 million product reviews, and he concluded that, for certain categories, reviews rated 4 or 5 stars out of 5 are frequently recognized as helpful by other users; likewise, 1-star reviews of Amazon Electronics (thereby signalling harsh disapproval) are also considered helpful. However, reviews of 2-3 stars are not, which Woolf notes as a signal that Amazon could benefit from a binary like/dislike system instead of its present rating schema.

Fig. 1: Woolf rating distribution
Fig. 2: Woolf review summary
Trevor Smith, another data scientist at Metris who is by training an economist, delves into this dataset as well for machine learning applications. Smith transforms the data using a "Bag-of-Words" technique: reviewing all of the text, creating a sparse matrix of the words encountered, and configuring each word as a feature/attribute for the classification algorithm.

Fig. 3: Smith classifier matrix

From the figure, it is evident that the two example sentences do not consist entirely of shared words – only the unique words are filled in as columns of the matrix. Parsing through the reviews, Smith examines whether an entry contains the word in the first position of the matrix: if yes, a 1 is assigned and otherwise, a 0. This process is repeated for every column. The next step is to train the data to fit the SVM classification model. The resulting accuracy is quite high for a simple application exercise.

III-B: DISCUSSION OF TECHNICAL MATERIAL
For a rudimentary example, consider a training dataset of points $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$, where each $y_i$ takes the value 1 or -1, indicating the class to which $\mathbf{x}_i$ belongs. Each $\mathbf{x}_i$ is a $p$-dimensional real vector, and our objective is to maximize the distance between the group of points for which $y_i = 1$ and those for which $y_i = -1$, so that the separating hyperplane and the nearest point of either group are as far apart as possible (this objective is written out below). From there, clusters are defined to be externally heterogeneous and internally homogeneous – meaning members are like one another yet dissimilar to members of other clusters. In other words, SVM algorithms aim to create a strong association structure among variables. Our classifier's aim is to use a set of pre-classified data to classify those observations which have not yet been seen.
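To make the margin objective above precise, the standard hard-margin SVM formulation (a textbook statement consistent with the description above, not a derivation taken from this paper) chooses the hyperplane $\mathbf{w} \cdot \mathbf{x} - b = 0$ by solving

\[
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad\text{subject to}\quad
y_i\left(\mathbf{w}\cdot\mathbf{x}_i - b\right) \ge 1, \qquad i = 1,\ldots,n.
\]

The margin between the two supporting hyperplanes is $2/\lVert\mathbf{w}\rVert$, so minimizing $\lVert\mathbf{w}\rVert$ maximizes the separation; replacing the dot product $\mathbf{x}_i \cdot \mathbf{x}_j$ with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ yields the nonlinear (kernel) classifiers discussed in Section II.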
Thusly, our first objective in classification is to include the maximum number of product entries that are filtered per the defined inclusion criterion. Selecting a relevant training subset is essential – this was undertaken in the following steps.

Firstly, the data is downloaded in JavaScript Object Notation (JSON) format and transformed into comma-separated values (CSV) by parsing through the entries for all key headings. We remove irrelevant entries where there are null values for any of the observed columns. Once this list is created, the entries are flattened into their respective columns. However, the categorical column remains compacted with subcategories that are separated by square brackets, running from an aggregated category title down to increasingly granular subcategories.

Secondly, once preprocessing has completed, we transform the dataset, which contains strings of characters, into a suitable format for the learning algorithm and classification task. Word stems are one means of accomplishing this; word ordering is a negligible concern for the purposes of our project. We then use an attribute-value representation of the text. From there, one could assign each distinct word to a corresponding feature, with the number of times it occurs in the document or dataset as its value. For compactness, words correspond to features only if they are present in the document at least thrice and if they are not "stop-words" (i.e. prepositions, articles, etc.). Bag-of-Words is a commonly utilized representation method, where a document is represented by the collection of words that occur in it at least once.

We evaluate our model using a novel corpus of words from Amazon's product titles, or descriptions and categories. For simplicity, we design our clustering in a binary manner. With two groups, we posit whether we can accurately predict, given a product's title, whether said product will appear as also_bought with another product. For the also_bought array, we loop through the whole CSV file for each unique ASIN identifier and verify whether its value appears in another product's also_bought array. If this is confirmed, we increment the top-level ASIN's counter, and repeat the process. For an individual product, also_bought_count is the total number of times it appears in another product's also_bought array. We construct a model which learns the words that will most likely lead to a co-purchasing scenario given a product's specific attribute: the model deciphers, for "someword", the likelihood that it exists in the dataset and that the product appears within also_bought. A corpus of words selected from the title column of both the training and test sets informs the model's prediction, stripping out words separated by spaces and punctuation. The True/False indicator is the result of the model's predictions against the test set. 1,539 products were never listed in also_bought; most of the results indicate that solo purchases commonly occur among Amazon consumers.

IV: APPLICATION OF METHODOLOGY & CODE OVERVIEW
The initial format for the Amazon metadata consists of newline-separated JSON packets, as displayed below:

{
  "asin": "0000031852",
  "title": "Girls Ballet Tutu Zebra Hot Pink",
  "price": 3.17,
  "imUrl": "",
  "related": {
    "also_bought": ["ASIN_0", "ASIN_n-m", "ASIN_n"],
    "also_viewed": ["ASIN_0", "ASIN_n-m", "ASIN_n"],
    "buy_after_viewing": ["ASIN_0", "ASIN_n-m", "ASIN_n"],
    "bought_together": ["ASIN_0", "ASIN_n-m", "ASIN_n"]
  },
  "salesRank": {"Toys & Games": 211836},
  "brand": "Coxlures",
  "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}

Source: Julian McAuley's Amazon product metadata collection (see Section III-A).

A loose description of the non-intuitive fields that we are concerned with:
asin | unique identifier for a product
also_bought | list of asins | products purchased alongside the top-level asin
also_viewed | list of asins | products viewed alongside the top-level asin
buy_after_viewing | list of asins | products purchased after viewing the asin
bought_together | list of asins | common products purchased alongside the asin

Since we will be using a linear SVM classification model, we require a means to rank the data. This is where the "related" co-purchasing fields will be used – these are the latter four headings. Using Python, the raw JSON metadata is converted into a workable comma-separated values (CSV) file that, for each ASIN, records a count of how many times the product appears in a related field of any other product. Grabbing the values that we require from the shuffled JSON packets, each dictionary entry is written as a CSV line. We add a 10,000-line threshold to begin our testing phase. Now, R-based SVM algorithms can be used to build the classification model. We develop and apply our model on an example that uses the title column and also_bought_count. Here is the distribution of values for also_bought_count in the training set:

also_bought_count      0    1   2   3   4   5   6   7   8   9  10  11  12  13  14  17  19  24
number of products  1539  224  89  50  27  25  15   8   8   3   2   2   1   1   2   2   1   1

The model will classify products as having been also bought, or not: in other words, 1 for also_bought (also_bought_count > 1) and 0 for not also_bought (also_bought_count ≤ 1). The training set will consist of the first 2,000 entries. The test set will consist of the next 500 entries.
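The conversion and counting step described above was performed by the authors in Python. As a hedged illustration only, the following R sketch shows one way the also_bought_count field and binary label could be derived from newline-separated JSON packets using the jsonlite package; the input file name, the columns kept, and the output file name are assumptions, not the authors' script.

# Sketch (not the authors' Python code): derive also_bought_count from the metadata,
# assuming "metadata.json" holds one strict-JSON packet per line.
library(jsonlite)

meta <- stream_in(file("metadata.json"), verbose = FALSE)   # one row per product

# related$also_bought is a list-column: one character vector of ASINs per product
also_bought_lists <- meta$related$also_bought

# Count how many times each ASIN appears in some other product's also_bought array
appearances <- table(unlist(also_bought_lists))
meta$also_bought_count <- as.integer(appearances[meta$asin])
meta$also_bought_count[is.na(meta$also_bought_count)] <- 0   # never co-purchased

# Binary target used later: 1 if the product shows up as also_bought more than once
meta$count_new <- ifelse(meta$also_bought_count > 1, 1, 0)

# Keep the flattened columns and write the working CSV
write.csv(meta[, c("asin", "title", "also_bought_count", "count_new")],
          "bought_with_count_asin_metadata_small.csv", row.names = FALSE)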
CODE ANALYSIS
Using a corpus of words (akin to the "Bag-of-Words" technique), we collect the terms from all rows of the title column. From there, we construct a Document Term Matrix for all terms and drop the most infrequent terms so that no retained term has a sparsity above 0.95:

<<DocumentTermMatrix (documents: 2000, terms: 45)>>
Non-/sparse entries: 3304/86696
Sparsity           : 94%
Maximal term length: 12
Weighting          : term frequency (tf)

Here is a look at the SVM model:

> svm.model.title.also.bought.count
L2 Regularized Support Vector Machine (dual) with Linear Kernel

2000 samples
  45 predictor
   2 classes: '0', '1'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 2000, 2000, 2000, 2000, 2000, 2000, ...
Resampling results across tuning parameters:

  cost  Loss  Accuracy   Kappa
  0.25  L1    0.8870779  0.0031141372
  0.25  L2    0.8884432  0.0003427000
  0.50  L1    0.8868551  0.0040584027
  0.50  L2    0.8881687  0.0005571252
  1.00  L1    0.8864654  0.0053518981
  1.00  L2    0.8882257  0.0024991084

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were cost = 0.25 and Loss = L2.

Here is a look at the first 5 rows and columns in the Document Term Matrix:

      Terms
Docs   memoir rock roll soldier anderson
   1        1    1    1       1        0
   2        0    0    0       0        1
   3        0    0    0       0        0
   4        0    0    0       0        0
   5        0    0    0       0        0

OUTPUT:
FALSE  TRUE
0.208  0.792

Using this model, the predicted values on the test set achieved roughly 79% accuracy.
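The summary above matches the output style of the caret package's "svmLinear3" (LiblineaR) method with bootstrap resampling. The authors' own code, which uses RTextTools, appears in Appendix I; as a minimal sketch only, assuming the working CSV produced earlier, the following shows how a Document Term Matrix built from the title column with tm could be fed to caret to produce a model of this form. Object names and the input file are assumptions.

# Sketch only: tm + caret version of the modelling step (the appendix uses RTextTools).
library(tm)
library(caret)    # method "svmLinear3" additionally requires the LiblineaR package

data.amazon <- read.csv("bought_with_count_asin_metadata_small.csv",
                        stringsAsFactors = FALSE)
train.rows <- 1:2000

# Bag-of-words representation of the product titles
corpus <- VCorpus(VectorSource(data.amazon$title[train.rows]))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.95)   # drop terms absent from more than 95% of titles

x <- as.data.frame(as.matrix(dtm))
colnames(x) <- make.names(colnames(x))
y <- factor(ifelse(data.amazon$also_bought_count[train.rows] > 1, 1, 0))

# L2-regularized linear SVM, tuned over cost and loss with 25 bootstrap replicates
svm.model.title.also.bought.count <- train(
  x, y,
  method    = "svmLinear3",
  trControl = trainControl(method = "boot", number = 25)
)
svm.model.title.also.bought.count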
V. CONCLUSION
In this paper, we researched SVM methodology from widespread sources to develop a deeper understanding of this machine learning technique and applied this knowledge to a chosen dataset. In our own application, we uncover Amazon consumer purchasing habits by classifying the relationship between products and their co-purchases, which are predicted from Support Vector Machines (SVM) and k-means. By deciphering product titles, descriptions, categories and Amazon's unique identifier (ASIN), we wield the Amazon metadata and analyze purchasing patterns from alphanumeric inputs. Our parsimonious text classification model allows us to predict co-product relations with high accuracy.

For future research endeavours, we could flexibly expand our analysis by including an additional column in the initial CSV data-entry process to capture other attributes associated with each ASIN, such as categories or descriptions instead of titles. This would entail changing the product.container initialization within the R code so that testSize covers the last rows. The result would be a prediction of whether, for instance, a certain product would show up pairwise in also_bought with other products. In addition, this analysis could become more granular and thorough if we further investigated the True also_bought_count co-product relationships.

BIBLIOGRAPHY
Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2017.
He, Ruining, and Julian McAuley. "Ups and Downs." Proceedings of the 25th International Conference on World Wide Web - WWW '16, 2016, doi:10.1145/2872427.2883037.
James, Gareth, et al. An Introduction to Statistical Learning: with Applications in R. Springer, 2017.
Jindal, Rajni, et al. "Techniques for Text Classification: Literature Review and Current Trends." Webology, vol. 12, no. 2, Dec. 2015.
Joachims, Thorsten. "Text Categorization with Support Vector Machines: Learning with Many Relevant Features." Universität Dortmund.
Leopold, E., and J. Kindermann. "Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?" Machine Learning, vol. 46, 2002, pp. 423–444.
McAuley, Julian, et al. "Image-Based Recommendations on Styles and Substitutes." Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '15, 2015, doi:10.1145/2766462.2767755.
Namburu, S. M., et al. "Experiments on Supervised Learning Algorithms for Text Categorization." 2005 IEEE Aerospace Conference, 2005, doi:10.1109/aero.2005.1559612.
Saporta, G. Probabilités, Analyse des Données et Statistique. Technip, Paris, 1990.
Vapnik, V. "Pattern Recognition Using Generalized Portrait Method." Automation and Remote Control, 1963, pp. 774–780.

APPENDIX I: CODE

#######################################################
# File and directory setup
#######################################################
dir.root    <- paste("C:", "Users", "trash", "dev", "data-mining", "support-vector-machines", sep="/") # set me
dir.code    <- "Code"
dir.data    <- paste(dir.root, "Data", sep="/")
amazon.csv  <- paste("asin_metadata_no_list_dict", "csv", sep=".")
amazon.data <- paste(dir.data, amazon.csv, sep="/")

# Get data
amazon.counts.csv  <- paste("bought_with_count_categories_asin_metadata_small", "csv", sep=".")
amazon.counts.data <- paste(dir.data, amazon.counts.csv, sep="/")
data.amazon <- read.csv(amazon.counts.data, header = TRUE, fill = TRUE, sep=",")

#######################################################
# Build and evaluate the SVM classifier
#######################################################
library(RTextTools)
library(tm)
library(dplyr)

data.amazon <- read.csv(amazon.counts.data, header = TRUE, fill = TRUE, sep=",")

# Binary target: 1 if the product appears in another product's also_bought list more than once
data.clean <- data.amazon %>% mutate(count_new = if_else(also_bought_count > 1, 1, 0))

# Document-term matrix over the category text
product.matrix <- create_matrix(data.clean$category, language = "English",
                                removeNumbers = TRUE,
                                removePunctuation = TRUE,
                                removeStopwords = FALSE, stemWords = FALSE)

# Container holding the training and test slices
product.container <- create_container(product.matrix,
                                      data.clean$count_new,
                                      trainSize = 1:1000, testSize = 1051:1074,
                                      virgin = FALSE)

# Train the SVM and classify the test slice
product.model  <- train_model(product.container, algorithm = "SVM")
product.result <- classify_model(product.container, product.model)

# Compare predicted and actual labels
x <- as.data.frame(cbind(data.clean$count_new[1051:1074], product.result$SVM_LABEL))
colnames(x) <- c("actual.count", "predicted.count")
x <- x %>% mutate(predicted.count = predicted.count - 1)   # factor codes 1/2 back to labels 0/1
round(prop.table(table(x$actual.count == x$predicted.count)), 3)

APPENDIX II: WORK DISTRIBUTION STATEMENT
This research project was done in a collaborative manner. All team members contributed insights and coordinated tasks.
AMY LI – Writing: I, II, IIIA, IIIB, V; Slide Deck & Presentation
NOLAN HODGE – Coding: IIIB, IV
MITCHELL HUGHES – Editing: Bibliography; Slide Deck & Presentation