


BOĞAZİÇİ UNIVERSITY
DEPARTMENT OF MANAGEMENT INFORMATION SYSTEMS

Event-Based Clustering on Turkish Daily News

Doruk Güneş
Fatih Ok

July 2017

Table of Contents

1 Introduction
1.1 Information Retrieval
1.2 Machine Learning
2 Background
2.1 Bag of Words
2.2 Vector Space Model
2.2.1 TF-IDF Value
2.3 Similarity and Distances
2.3.1 Euclidean Distance
2.3.2 Cosine Similarity
2.4 Clustering
2.4.1 K-Means Clustering
2.4.2 Agglomerative Clustering
2.4.3 DBSCAN
2.4.4 Incremental Clustering
2.5 Evaluation Measures
2.5.1 Internal Clustering Validation Measures
2.5.2 External Clustering Validation Measures
3 Methodology
3.1 Online and Incremental Clustering
3.2 Implementation
3.3 Evaluation
3.4 Discussion
4 Conclusion
5 References

1 Introduction

Since the news industry has inevitably shifted from its traditional way of publishing to online publishing, news content is being generated rapidly. In Turkey, the majority of news articles are published by more than a hundred digital providers, and readers can follow all of these sources not only through social media but also through aggregator apps such as Flipboard or Bundle. However, this innovation also brings problems: much of the time, different articles cover the same story or are directly duplicated from one another.
For this reason, end users are often faced with redundant news streams, and there is a strong need for automated solutions that organize online articles. Aggregators such as Google News and Bing News address this problem by presenting their content in event-based clusters. These systems make use of methods and algorithms that will be discussed in this report. However, since those solutions focus on non-Turkish stories, problems remain to be solved for Turkish news sources. To sum up, the main purpose of this research is to test and evaluate clustering methodologies for Turkish news articles collected from various sources, with the problems above in mind.

1.1 Information Retrieval

Information Retrieval (IR) is the acquisition of resources relevant to an information need from a collection of information resources. In this project, IR is the first process that needs to be completed. We therefore developed a web crawler (a spider) that extracts all the data we need from a regular article. The spider continuously tracks the home pages of five major digital publishers, fetches newly published stories, and saves them in a MongoDB instance. The crawler is built with the Python framework Scrapy, which provides all the necessary modules, such as downloaders, schedulers, parsers, and item pipelines. The only required work is to connect these modules and supply source-specific selectors such as XPath expressions or CSS classes. An architecture overview of a Scrapy-powered web crawler system is shown in Figure 1.

Figure 1 - The architecture overview of a Scrapy-powered crawler.
1.2 Machine Learning

Machine learning is a science that deals with the design and development of algorithms that allow computers to learn from data sources such as sensor data or databases. All machine learning techniques that have been used for document clustering in this project are explained and evaluated in the Background section.

2 Background

2.1 Bag of Words

The bag-of-words model represents a text document as an unordered bag (or list) of its words. In this context, bag-of-words describes the kind of information we can extract from a document: its unique unigram words.

Figure 2 - Bag-of-words representation

2.2 Vector Space Model

The vector space model is an algebraic model that represents text documents as vectors of identifiers. Since k-means efficiently clusters only numerical values, such a technique is needed before clustering can be performed. Given the bag-of-words extracted from a document, a feature vector is created in which each word (term) is a feature and its value is the term weight. The term weight can be a TF-IDF value.

2.2.1 TF-IDF Value

TF-IDF, or term frequency-inverse document frequency, is intended to reflect the importance of a word in a document. TF-IDF not only considers the frequency of a word in a document (TF) but also weighs each term by the inverse of its document frequency (IDF).

2.3 Similarity and Distances

After the documents are transformed into vectors, performing a clustering that is based on distances between points (e.g., k-means) requires selecting a similarity measure. There is no measure that is universally fit for all kinds of clustering problems (Huang, 2008).
In practice, Euclidean distance and cosine similarity are the metrics most commonly considered for document clustering.

2.3.1 Euclidean Distance

Euclidean distance is a metric for the distance between two points in two- or three-dimensional space, and it can also be used in clustering to measure the distances between text documents. The distance between two documents $d_a$ and $d_b$, whose term vectors are $\vec{t_a}$ and $\vec{t_b}$ respectively, can be calculated as

$$D_E(\vec{t_a}, \vec{t_b}) = \sqrt{\sum_{t=1}^{m} (w_{t,a} - w_{t,b})^2}$$

where $w_{t,a} = \mathrm{tfidf}(d_a, t)$.

2.3.2 Cosine Similarity

Cosine similarity is a measure of the similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them. It is one of the most widely used measures for document clustering. If documents are represented as term vectors, the similarity of two documents corresponds to the correlation between their vectors and can be defined as

$$\mathrm{SIM}_C(\vec{t_a}, \vec{t_b}) = \frac{\vec{t_a} \cdot \vec{t_b}}{|\vec{t_a}| \times |\vec{t_b}|}$$

where $\vec{t_a}$ and $\vec{t_b}$ are the vectors of the two documents over the term set $T = \{t_1, \ldots, t_m\}$.

2.4 Clustering

Clustering is the grouping of a set of objects such that the members of one group, called a cluster, share the same or similar attributes, in contrast to the members of other clusters. In machine learning, there are four main clustering families: hierarchical, centroid-based, distribution-based, and density-based clustering. Each of these includes various sub-methods for specific purposes. In this research, we examine four clustering techniques that are commonly used in text clustering work.

2.4.1 K-Means Clustering

K-means is a centroid-based clustering technique. It partitions n elements into k clusters such that each element belongs to the cluster with the nearest mean.
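The two measures defined in section 2.3 can be implemented directly from their formulas; a short numpy-only sketch (the toy vectors are invented):

```python
import numpy as np

def euclidean_distance(w_a, w_b):
    # D_E(t_a, t_b) = sqrt( sum_t (w_{t,a} - w_{t,b})^2 )
    return np.sqrt(np.sum((w_a - w_b) ** 2))

def cosine_similarity(w_a, w_b):
    # SIM_C(t_a, t_b) = (t_a . t_b) / (|t_a| * |t_b|)
    return np.dot(w_a, w_b) / (np.linalg.norm(w_a) * np.linalg.norm(w_b))

a = np.array([1.0, 0.0, 1.0])  # toy TF-IDF vectors
b = np.array([1.0, 1.0, 0.0])
print(euclidean_distance(a, b))  # sqrt(2)
print(cosine_similarity(a, b))   # 0.5
```

Note that cosine similarity grows with similarity while Euclidean distance shrinks, so threshold comparisons point in opposite directions for the two measures.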
The technique takes k as input and produces k clusters as output. K-means is a decent choice for datasets that have a small number of proportionally sized, linearly separable clusters, and it also scales up to large datasets.

For event-based clustering of daily news articles, the main problem with k-means is that the number of clusters is not dynamic. When a new article arrives, it must enter an existing cluster rather than create a new one with a single element. Thus, if the content is not related to a past event, the clusters cannot remain consistent over time.

2.4.2 Agglomerative Clustering

Agglomerative clustering is a strategy for hierarchical clustering. In this approach, each observation starts in its own cluster, and these clusters are progressively merged from bottom to top to build the hierarchy. As with k-means, it takes a parameter for the number of clusters (or, alternatively, a distance threshold), which forces the cluster count to a fixed value rather than letting the number of clusters grow dynamically with new events.

2.4.3 DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is a density-based clustering algorithm. It computes the distances from each object to its neighbors and forms clusters by grouping regions that contain more objects than a predetermined threshold. In contrast to k-means and agglomerative clustering, DBSCAN fits datasets with an unknown number of clusters. However, for event-based clustering of daily news stories it requires high computing power and memory, since the algorithm must recompute the clusters whenever new data arrives.

2.4.4 Incremental Clustering

In incremental clustering, the aim is to reduce storage and processing costs such as time, CPU usage, and budget.
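Before turning to the incremental alternative, the three batch algorithms reviewed above can each be run in a few lines with scikit-learn; a sketch on toy two-dimensional data (the data and parameter values are illustrative, not tuned for news articles):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Two well-separated toy blobs standing in for document vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # needs k
agglo = AgglomerativeClustering(n_clusters=2).fit(X)              # needs k
dbscan = DBSCAN(eps=0.5, min_samples=3).fit(X)                    # no k needed

print(set(kmeans.labels_), set(agglo.labels_), set(dbscan.labels_))
```

The contrast in the constructor arguments mirrors the discussion above: k-means and agglomerative clustering require the cluster count up front, while DBSCAN infers it from density.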
The idea is to process only one element at a time and typically store only a small subset of the data (Ackerman & Dasgupta, n.d.). Online data sources, such as news sites and blogs, constantly publish new documents, which requires clustering to be both effective and efficient. For event-based clustering of daily news stories, incremental document clustering systems can therefore be more beneficial than the other clustering techniques alone.

2.5 Evaluation Measures

The main objective of a clustering algorithm is to achieve high intra-cluster similarity and low inter-cluster similarity. Internal measures are designed to evaluate the quality of clustering algorithms in exactly this manner, by computing cluster similarities; they evaluate the goodness of a clustering structure without reference to external information (Liu, Li, Xiong, Gao, & Wu, 2010). However, good scores on internal measures do not necessarily imply good effectiveness in an application. Below, we look at the silhouette coefficient.

On the other hand, external measures, often referred to as gold standards, compare a clustering against externally supplied class labels. Most of the time this labeling process relies on human labelers and human intuition. Below, we look at the V-measure.

2.5.1 Internal Clustering Validation Measures

Silhouette Coefficient

The silhouette coefficient measures the dissimilarity of a document to its own cluster against its dissimilarity to the closest neighboring cluster. It can be calculated from the mean intra-cluster distance x and the mean nearest-cluster distance y of each document:

$$s = \frac{y - x}{\max(x, y)}$$

The value of the silhouette coefficient can vary between -1 and 1.
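A sketch of computing the coefficient with scikit-learn's silhouette_score, on invented toy data labeled by k-means:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated toy blobs; real inputs would be TF-IDF vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (25, 2)), rng.normal(4, 0.2, (25, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
score = silhouette_score(X, labels)  # mean coefficient over all points
print(score)  # close to 1 for well-separated blobs
```

`silhouette_score` averages the per-document coefficients, so a single number summarizes the whole clustering.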
Values below 0 are undesirable, since they indicate that there is a different cluster more similar to the point than its own cluster. Values near 0 generally mean that clusters overlap, and 1 is the best silhouette coefficient value that can be obtained.

2.5.2 External Clustering Validation Measures

V-Measure

The V-measure compares a manually labeled target clustering against an automatically created clustering to determine how similar the two are. To understand its logic and formula, we need to look at homogeneity and completeness. A clustering result satisfies homogeneity when each cluster contains only data points that are members of a single class. Conversely, to satisfy completeness, the clustering must assign all data points of a single class to a single cluster (Rosenberg & Hirschberg, 2007). The V-measure is the harmonic mean of homogeneity and completeness:

$$v = \frac{2 \times \text{homogeneity} \times \text{completeness}}{\text{homogeneity} + \text{completeness}}$$

3 Methodology

In our project, we collect and save all published stories as they appear and try to find out whether any new article is related to a previous one. As mentioned earlier, clustering daily news articles demands efficiency because of real-time streaming constraints. For this reason, rather than implementing a common clustering algorithm that runs on every arrival of an article, we developed an incremental approach based on the k-means clustering algorithm. The system tracks all records and applies the algorithm to each record without re-running the clustering algorithm over the complete dataset. A basic representation of the architecture can be seen in Figure 3. In this section, we discuss the incremental approach to the k-means clustering algorithm, its application in our event-based clustering system, and the evaluation of the methodology.
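A simplified sketch of such threshold-based incremental assignment follows. It assumes cosine similarity over document vectors and an illustrative threshold value; it is a toy illustration of the idea, not the project's actual implementation.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.5  # illustrative value, not tuned

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def assign(vector, stored, threshold=SIMILARITY_THRESHOLD):
    """Place one incoming document vector in an existing cluster or open
    a new one, without re-clustering earlier documents.

    `stored` is the list of (vector, cluster_id) pairs seen so far."""
    if stored:
        best_vec, best_cluster = max(stored, key=lambda p: cosine(vector, p[0]))
        if cosine(vector, best_vec) >= threshold:
            stored.append((vector, best_cluster))
            return best_cluster
    # First document ever, or nothing similar enough: open a new cluster.
    new_cluster = max((c for _, c in stored), default=-1) + 1
    stored.append((vector, new_cluster))
    return new_cluster

stored = []
print(assign(np.array([1.0, 0.0]), stored))  # first document -> cluster 0
print(assign(np.array([0.9, 0.1]), stored))  # similar -> cluster 0
print(assign(np.array([0.0, 1.0]), stored))  # dissimilar -> cluster 1
```

Each arrival costs one pass over the stored vectors, which is what makes the approach viable for a real-time news stream.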
Figure 3 - Basic representation of the architecture.

3.1 Online and Incremental Clustering

As described by Lönnberg and Love (2013), an Incremental Clustering Algorithm (ICA) clusters incremental data quickly. The algorithm processes one object at a time and either places it in a pre-existing cluster or creates a new cluster containing that single object. If no cluster has been created previously, in other words, if the algorithm is processing the first object, a new cluster is created for it. For each subsequent object, the algorithm decides which cluster to assign it to by finding the most similar existing object and checking that similarity against a predefined threshold value. If the similarity is above the threshold, the new object is assigned to the cluster of the most similar object; if it is below the threshold, a new cluster is created.

Rather than finding the most similar object, another approach is to find any object that meets the threshold value and assign the new object to its cluster. Both approaches will be discussed and evaluated in later sections.

3.2 Implementation

3.3 Evaluation

3.4 Discussion

4 Conclusion

5 References

Ackerman, M., & Dasgupta, S. (n.d.). Incremental Clustering: The Case for Extra Clusters. Retrieved from NIPS.

Huang, A. (2008). Similarity Measures for Text Document Clustering. The University of Waikato, Department of Computer Science. Hamilton: The University of Waikato.

Liu, Y., Li, Z., Xiong, H., Gao, X., & Wu, J. (2010). Understanding of Internal Clustering Validation Measures. International Conference on Data Mining, 911-916.

Lönnberg, M., & Love, Y. (2013). Large Scale News Article Clustering. Chalmers University of Technology, Department of Computer Science and Engineering. Sweden: Chalmers University of Technology.

Rosenberg, A., & Hirschberg, J. (2007). V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure.

Wikipedia. (2017, March 30). tf-idf. Retrieved from Wikipedia.

