
Sensing trending topics in Twitter

Luca Maria Aiello, Georgios Petkos, Carlos Martin, David Corney, Symeon Papadopoulos, Ryan Skraba, Ayse Göker, Yiannis Kompatsiaris, Alejandro Jaimes

Abstract--Online social and news media generate rich and timely information about real-world events of all kinds. However, the huge amount of data available, along with the breadth of the user base, requires a substantial effort of information filtering to successfully drill down to relevant topics and events. Trending topic detection is therefore a fundamental building block for monitoring and summarizing information originating from social sources. A wide variety of methods and variables are involved, and they greatly affect the quality of results. We compare six topic detection methods on three Twitter datasets related to major events, which differ in their time scale and topic churn rate. We observe how the nature of the event considered, the volume of activity over time, the sampling procedure and the pre-processing of the data all greatly affect the quality of detected topics, which also depends on the type of detection method used. We find that standard natural language processing techniques can perform well for social streams on very focused topics, but novel techniques designed to mine the temporal distribution of concepts are needed to handle more heterogeneous streams containing multiple stories evolving in parallel. One of the novel topic detection methods we propose, based on n-gram cooccurrence and df-idf_t topic ranking, consistently achieves the best performance across all these conditions, thus being more reliable than other state-of-the-art techniques.

Index Terms--Topic detection, Text mining, Information filtering, Twitter, Social media, Social sensing

I. INTRODUCTION

Online social media have become pervasive at an unprecedented pace in recent years. As social networking services progressively diffuse into more geographical areas of the world and penetrate increasingly diverse segments of the population, the value of the information that is collectively generated on such online platforms increases dramatically. In fact, interactions and communication in social media often reflect real-world events and dynamics; as the user base of social networks gets wider and more active in producing

Luca Maria Aiello and Alejandro Jaimes are with Yahoo! Research Barcelona ({alucca,ajaimes}@yahoo-).

Georgios Petkos, Symeon Papadopoulos, and Yiannis Kompatsiaris are with the Information Technologies Institute, CERTH, Thessaloniki, Greece ({gpetkos,papadop,ikom}@iti.gr).

Carlos Martin and David Corney are with the School of Informatics, City University London, London EC1V 0HB ({martin.carlos.1,david.corney.1}@city.ac.uk).

Ryan Skraba is from Alcatel-Lucent Bell Labs France, Paris, France (ryan.skraba@alcatel-).

Ayse Go?ker is with Robert Gordon University, School of Computing / IDEAS Research Institute Aberdeen, United Kingdom (a.s.goker@rgu.ac.uk). The majority of this work was conducted while Ayse Go?ker was at City University London.

This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.

Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@.

content about real-world events almost in real time, social media streams become accurate sensors of those events.

The riots during the Arab Spring [1], [2], the dramatic incidents caused by natural disasters [3], as well as the process of opinion formation around major political themes [4], offer examples of events that have been reported almost in real time by social network participants. As a result, social media data mining, originally aimed at understanding and predicting the evolution of the online world [5], [6], [7], [8], [9], is now increasingly leveraged to study the dynamics of real-world events. The ability to monitor such phenomena has direct implications for our capacity to understand and describe real-world events, with applications in the fields of computational journalism [10], [11], urban monitoring [12], and many more. Finally, since social media could be used to manipulate the course of online and offline human dynamics [13], [14], [15] (e.g., by artificially inflating the consensus around politicians for electoral purposes), detecting anomalous activity can help prevent possible misuses of online social platforms.

To monitor and detect all these aspects in real time, we need to extract relevant information from the continuous stream of data originating from such online sources. Determining which topics are being discussed by the crowd is the first step towards a high-level, human-understandable description of the social data stream. The task of Topic Detection and Tracking [16] has been tackled in the past for static document corpora, but in a social media context there are many additional factors to consider, such as the fragmentation and noisiness of user-generated content, the real-time requirement, the burstiness of events and their time resolution.

We explore how much these factors affect topic detection results along two orthogonal dimensions: a) the effect of the nature of the input data, including the pre-processing phase, on the topic detection outcome; and b) the behaviour of the different topic detection algorithms themselves.

The methods we test cover three different classes: probabilistic models (Latent Dirichlet Allocation), classical Topic Detection and Tracking (a common document-pivot approach) and feature-pivot methods. Within this set of methods, we develop four novel approaches, including methods that use the concept of frequent itemset mining. In particular, we show that a method that leverages n-gram cooccurrences (instead of unigrams) and df-idf_t topic ranking is consistently the best performing method among the ones tested. The proposed df-idf_t is a burstiness score that can significantly assist in determining the most rapidly emerging topics. The diversity of the methods presented and the different attributes of the datasets considered (with respect to time-scale and breadth of topical discussions) enable a comparison across


several crucial dimensions inherent in the topic detection task that have not been explored in previous work.

The evaluation of methods focuses on a scenario of sensing real world topics of the kind that would be of interest to the reader of a news portal. We use three large datasets collected from Twitter, for which the sets of ground truth topics have been produced by examining news stories appearing in the mainstream media. The selected datasets cover the domains of politics (the US Super Tuesday primaries of March 2012 and the US Presidential elections of November 2012) and sports (the English FA Cup Final).

In short, our contributions can be summarized as follows:

• We present a comparative study of a wide range of topic detection methods across three large Twitter datasets on a real-world event sensing scenario. The main idea of using different datasets is to compare the performance of the algorithms in different domains, each with its own special features.

• We analyze how factors such as the type of input data (e.g., time span, topic breadth) and pre-processing techniques can affect the quality of topic detection results.

• Among the tested methods, we find that our novel algorithm combining n-grams with df-idf_t ranking performs best, outperforming other state-of-the-art techniques.

The remainder of the paper is structured as follows. Section II provides an overview of state-of-the-art approaches for topic detection. Section III presents the topic detection methods that are examined in this work, together with some pre-processing steps. Experimental results are presented and discussed in Section IV, while Section V concludes the paper.

II. RELATED WORK

Topic Detection and Tracking (TDT) aims at extracting topics from a stream of textual information sources (documents) and at quantifying their "trend" over time [16]. This work focuses on pieces of text (posts) produced within social media platforms.

Methodologically, general-purpose topic detection can produce two types of complementary outputs: either the documents in the collection are clustered, or the most important terms or keywords are selected and then clustered. In the former, referred to as document-pivot, a topic is represented by a cluster of documents, whereas in the latter, referred to as feature-pivot, a cluster of keywords is produced instead.

Both methods have advantages and disadvantages. Document-pivot methods suffer from cluster fragmentation problems and, in a streaming context, they often depend on arbitrary thresholds for the inclusion of a new document to an existing topic. On the other hand, feature-pivot methods are commonly based on the analysis of associations between terms, and often capture misleading term correlations. In general, the two approaches can be considered complementary and, depending on the application, one may be more suitable than the other. In the following subsections, we review several popular approaches that fall in either of the two categories. We also characterize them based on a number of important features, such as incremental computation vs. batch mode or the usage of additional sources of information.

A. Document-pivot methods

Simple document-pivot approaches cluster documents by leveraging some similarity metric between them. The work by Phuvipadawat and Murata [17] follows this direction to provide a method for breaking news detection in Twitter. Tweets retrieved using targeted queries or hashtags are converted into a bag-of-words representation weighted with boosted tf-idf (term frequency-inverse document frequency), emphasizing important entities such as names of countries or public figures. Tweets are then incrementally merged by considering the textual similarity between incoming tweets and existing clusters. Similar approaches based on textual similarity and tf-idf can be found in the literature [18], [19]. Among them, the method discussed by Becker et al. [19] additionally considers the classification of tweets as referring to real-world events or not. The classifier is trained on a variety of features including social aspects (e.g., number of mentions) and other Twitter-specific features. An important drawback of the method is the need for manual annotation of training and test samples.

Dimensions other than text can also be used to improve the quality of clustering. TwitterStand [20] uses a "leader-follower" clustering algorithm that takes into account both textual similarity and temporal proximity. Each cluster center is represented using a centroid tf-idf vector and the average post time. A similarity metric based on both dimensions and on the number of shared hashtags allows incremental merging of new tweets with existing clusters. Sensitivity to noise (which is a known problem for document-pivot methods [21]) and fragmentation of clusters are drawbacks of this approach. Manual selection of trusted information providers and periodic defragmentation runs are needed to mitigate such effects.

The task of First Story Detection (FSD) discussed by Petrovic et al. [22] is closely related to document-pivot TDT. The goal is to detect the first document discussing a topic in a large corpus. A new story is created by a document having low similarity with all previously detected clusters. For fast retrieval of the nearest neighbors of the incoming document, Locality Sensitive Hashing (LSH) is used; however, such a solution is problematic when the nearest neighbors are not very similar to the query document.

B. Feature-pivot methods

Feature-pivot methods are closely related to topic models in natural language processing, namely statistical models that extract sets of terms representative of the topics occurring in a corpus of documents. Most state-of-the-art static topic models are based on Latent Dirichlet Allocation (LDA) [23]. Even though LDA extensions for dynamic data have been proposed [24], alternative approaches trying to capture topics through the detection of keyword burstiness have been studied [25], mainly in the context of news media mining. The idea behind those methods is that breaking news, unlike other discussion topics, tends to reach a fast peak of attention from social media users as soon as it is publicly announced [26], [27]. Accordingly, the common framework that underlies most approaches in this category first identifies


bursty terms and then clusters them together to produce topic definitions.

Even before the diffusion of social media services, detection of bursty events had been studied in generic document sets. The method presented by Fung et al. [21], for instance, detects bursty terms by looking at where the frequency of the term in a given time window is positioned in the overall distribution of the number of documents containing that term. Once the bursty terms are found, they are clustered using a probabilistic model of cooccurrence. The need for a global topic term distribution restricts this approach to a batch mode of computation. Similar pipelines were tested for topic detection in social media (e.g., Twitter), but with additional emphasis on the enrichment of the obtained topics with non-bursty but relevant terms, URLs and locations [28].

Graph-based approaches detect keyword clusters based on their pairwise similarities. The algorithm by Sayyadi et al. [29] builds a term cooccurrence graph, whose nodes are clustered using a community detection algorithm based on betweenness centrality. Additionally, the topic description is enriched with the documents that are most relevant to the identified terms. Graphs of short phrases, rather than of single terms, connected by edges representing lexical inclusion or similarity have also been used [30]. Graph-based approaches have also been used in the context of collaborative tagging systems with the goal of discovering groups of tags pertaining to topics of social interest [31].

Alternative approaches based on signal processing have also been explored. Weng et al. [32] compute df-idf (a variant of tf-idf) for each term in each considered time slot, and apply wavelet analysis on consecutive blocks. The difference between the normalised entropy of consecutive blocks is used to construct the final signal. Bursty, relevant terms are extracted by computing the autocorrelation of the signal and heuristically determining a threshold to detect bursty terms. Also in this case, a graph between selected terms is built based on their cross-correlation and is then clustered to obtain event definitions. The Discrete Fourier Transform is used by He et al. [33]: the signal for each term is classified according to its power and periodicity. Depending on the identified class, the distribution of appearance of a term in time is modeled using one or more Gaussians, and the KL-divergence between the distributions is then used to determine clusters.

When additional information is available about the document producers, more sophisticated approaches are possible. In the method by Cataldi et al. [34], a PageRank-like measure is used to identify important users on the Twitter social network. This centrality score is combined with a measure of term frequency to obtain a "nutrition" measure for each keyword. The trend of nutrition over time identifies bursty keywords. Clustering a correlation graph of bursty keywords delineates the boundaries of topics.

III. TOPIC DETECTION FROM SOCIAL MEDIA

Next, we define all the components of the proposed topic detection pipeline. First (Section III-A), we present the problem statement and define some basic terminology. Then

(Section III-B), we describe the data preprocessing and in the following sections we present six methods that take the preprocessed data as input and output the detected topics.

A. Real-world events sensing: problem definition

We address the task of detecting topics in (near) real time from social media streams. To keep our approach general, we consider that the stream is made of (short) pieces of text generated by social media users (posts, messages, or tweets in the case of Twitter). Posts are formed by a sequence of words, terms or keywords (we use the terms interchangeably), and each one is marked with the timestamp of creation.

We address a user-centered scenario in which the user starts up the detection system by providing a set of seed terms that are used as an initial filter to narrow down the analysis only to the posts containing at least one of them. Additionally, we assume that the time frame of interest (which can be indefinitely long) and a desired update rate are provided (e.g., detect new trending topics every 10 minutes). The expected output of the algorithm is a set of topics, each defined as a list of keywords, delivered at the end of each time slot determined by the update rate.

This setup fits many real-world scenarios well, in which an expert in some domain has to monitor specific topics or events being discussed in social media [3], [35]. For instance, this is the case in computational journalism, where the media inquirer is expected to have enough knowledge of the domain of interest to provide initial keywords for an initial filtering of the data stream. Even though it requires initial human input, this framework remains generic and suitable for any type of topic or event.

B. Data preprocessing

The content of user-generated messages can be unpredictably noisy. To reduce the amount of noise before the detection task is executed, the raw data collected through the seed-term filter is subjected to three preprocessing steps.

• Tokenization. In a raw post, terms can be combined with any sort of punctuation and hyphenation and can contain abbreviations, typos, or conventional word variations. We use the Twokenizer tool [18] to extract bags of cleaner terms from the original messages by removing stopwords and punctuation, compressing redundant character repetitions, and removing mentions, i.e., IDs or names of other users included in the text for messaging purposes.

• Stemming. In information retrieval, stemming is the process of reducing inflected words to their root (or stem), so that related words map to the same stem. This process naturally reduces the number of words associated with each document, thus simplifying the feature space. In our experiments we use an implementation of the Porter stemming algorithm [36].

• Aggregation. Topic detection methods based on word or n-gram cooccurrences, or any other type of statistical inference, suffer in the absence of long documents. This is the case in social media, where user-generated content typically takes the form of short posts. In information retrieval, it is common practice to partially address this problem by concatenating different messages together to produce super-documents of larger size. We build super-documents using two strategies. The first is temporal aggregation, which glues together N messages that are contiguous in time. The second is similarity-based aggregation, which attaches to a message all the near-duplicate messages posted in the same time slot, identified through an efficient document clustering method [22] that is also used by one of the examined topic detection algorithms (see Section III-D).

The effect of such preprocessing steps on the quality of the final topics is difficult to predict, and little investigation of it has been done so far. For instance, aggregating posts into super-documents may on the one hand improve the word cooccurrence statistics, but on the other hand it introduces the risk of putting together terms related to different topics. In Section IV-C we report the impact of the preprocessing on the results. A sketch of these preprocessing steps follows.
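The sketch below makes several simplifying assumptions: a plain regular-expression tokenizer stands in for the Twokenizer tool, NLTK's Porter stemmer implementation is used for stemming, and only the temporal aggregation strategy is shown; the stopword list and the aggregation size N are illustrative values, not the ones used in our experiments.

import re
from nltk.stem import PorterStemmer  # assumes NLTK is available

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "rt"}  # illustrative list
stemmer = PorterStemmer()

def tokenize(post):
    # Crude stand-in for the Twokenizer: lowercase, drop mentions,
    # compress repeated characters, keep words and hashtags, drop stopwords.
    post = re.sub(r"@\w+", " ", post.lower())
    post = re.sub(r"(.)\1{2,}", r"\1\1", post)
    tokens = re.findall(r"[a-z0-9#][a-z0-9_#'-]*", post)
    return [t for t in tokens if t not in STOPWORDS]

def preprocess(post):
    return [stemmer.stem(t) for t in tokenize(post)]

def temporal_aggregation(posts, n=10):
    # Glue together N time-contiguous messages into super-documents.
    # posts: list of (timestamp, text) pairs sorted by timestamp.
    super_docs = []
    for i in range(0, len(posts), n):
        tokens = []
        for _, text in posts[i:i + n]:
            tokens.extend(preprocess(text))
        super_docs.append(tokens)
    return super_docs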

C. Latent Dirichlet Allocation (LDA)

Topic extraction in textual corpora can be addressed through probabilistic topic models. In general, a topic model is a Bayesian model that associates with each document a probability distribution over topics, which are in turn distributions over words. LDA [23] is the best known and most widely used topic model; we therefore use it as a baseline to compare our methods against. According to LDA, every document is considered as a bag of terms, which are the only observed variables in the model. The topic distribution per document and the term distribution per topic are instead hidden and have to be estimated through Bayesian inference. We use the Collapsed Variational Bayesian inference algorithm [37], an LDA variant that is computationally efficient, more accurate than standard variational Bayesian inference for LDA, and has parallel implementations already available in Apache Mahout. LDA requires the expected number of topics k as input, and in our evaluation we explore the quality of the topics for different values of k (see Section IV-C). The estimation of the optimal k, although possible through the use of non-parametric methods [38], falls beyond the scope of this work.
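For illustration, the snippet below runs LDA on the pre-processed super-documents using gensim's online variational Bayes implementation; this is a stand-in for the Collapsed Variational Bayesian inference and the Mahout implementation used in our experiments, and the values of k and of the number of passes are arbitrary examples.

from gensim import corpora, models  # assumes gensim is available

def lda_topics(super_docs, k=20, top_words=10):
    # super_docs: list of token lists, one per super-document.
    # Returns k topics, each as a list of its most probable keywords.
    dictionary = corpora.Dictionary(super_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in super_docs]
    lda = models.LdaModel(bow_corpus, num_topics=k, id2word=dictionary, passes=5)
    return [[word for word, _ in lda.show_topic(topic_id, topn=top_words)]
            for topic_id in range(k)]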

D. Document-pivot topic detection (Doc-p)

The second method that we examine is an instance of a classical Topic Detection and Tracking method that uses a document-pivot approach. The particular variant is based on the work by Petrovic et al. [22], which uses LSH to rapidly retrieve the nearest neighbour of a document and accelerate the clustering task. The principle behind this method is the same as that used for near-duplicate detection in the similarity-based aggregation step of the preprocessing phase. It works as follows:

• Perform online clustering of posts: Compute the cosine similarity of the tf-idf [39] representation of an incoming post to all other posts processed so far. If the similarity to the best matching post is above a threshold θ_{tf-idf}, assign the item to the same cluster as its best match; otherwise create a new cluster with the new post as its only item. The best matching tweet is efficiently retrieved by LSH.

• Filter out clusters with item count smaller than n.

• For each cluster c, compute a score as follows:

score_c = \sum_{i=1}^{|Docs_c|} \sum_{j=1}^{|words_i|} \exp(-p(w_{ij}))

where w_{ij} is the jth term appearing in the ith document of the cluster. The probability of appearance of a single term p(w_{ij}) is estimated from a reference dataset collected from Twitter (see also Section III-E). Thus, less frequent terms contribute more to the score of the cluster.

• Clusters are sorted according to their score and the top clusters are returned.

The merit of using LSH is that it can rapidly provide the nearest neighbours with respect to cosine similarity in a large collection of documents. An alternative would be to use inverted indices on the terms that appear in the tweets and then compute the cosine similarity between the incoming document and the set of documents that have a significant term overlap with it; however, the use of LSH is much more efficient as it can directly provide the nearest neighbours with respect to cosine similarity.

In practice, for short posts such as tweets, we found that the similarity of two items is usually either close to zero or close to one (from around 0.8 to 1.0). This observation makes setting θ_{tf-idf} relatively easy: we set it to 0.5. Due to this phenomenon, items grouped together by this procedure are usually, but not always, near-duplicates (e.g., re-tweets). Therefore, it is clear that topics produced by this method will be fragmented, i.e., the same topic may be represented by different sets of near-duplicate tweets. To begin dealing with this issue, we examine the use of different types of aggregation as described in Section III-B.
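A minimal sketch of the clustering and scoring logic follows; for clarity it performs exhaustive nearest-neighbour search instead of LSH, and the tf-idf vectors, the threshold value and the reference term-probability table are assumed inputs.

import math

def cosine(v1, v2):
    # Cosine similarity between sparse vectors given as {term: weight} dicts.
    num = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    den = math.sqrt(sum(w * w for w in v1.values())) * math.sqrt(sum(w * w for w in v2.values()))
    return num / den if den else 0.0

def doc_pivot(tfidf_vectors, term_prob, theta=0.5, min_size=2):
    # tfidf_vectors: list of {term: tf-idf weight} dicts, one per post, in arrival order.
    # term_prob: {term: probability} estimated from a reference Twitter corpus.
    clusters = []                                   # each cluster is a list of post indices
    for i, vec in enumerate(tfidf_vectors):
        best, best_sim = None, 0.0
        for c_id, members in enumerate(clusters):   # exhaustive search; the method uses LSH here
            sim = max(cosine(vec, tfidf_vectors[j]) for j in members)
            if sim > best_sim:
                best, best_sim = c_id, sim
        if best is not None and best_sim > theta:
            clusters[best].append(i)
        else:
            clusters.append([i])
    # Keep sufficiently large clusters and score them: sum of exp(-p(w)) over all their terms.
    scored = []
    for members in clusters:
        if len(members) >= min_size:
            score = sum(math.exp(-term_prob.get(t, 1e-6))
                        for j in members for t in tfidf_vectors[j])
            scored.append((score, members))
    return sorted(scored, reverse=True)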

E. Graph-based feature-pivot topic detection (GFeat-p)

The next method is the first of a series of feature-pivot methods. Its unique feature is that for the feature clustering step it uses the Structural Clustering Algorithm for Networks (SCAN) [40]. A property of SCAN is that, apart from detecting communities of nodes, it provides a list of hubs, each of which may be connected to a set of communities. In a feature-pivot approach for topic detection, the nodes of the graph correspond to terms and the communities correspond to topics. The detected hubs would then ideally be terms that are related to more than one topic, something that would not be possible to achieve with a common partitional clustering algorithm and that effectively provides an explicit link between topics.

We select the terms to be clustered, out of the set of terms present in the corpus, using the approach in [18]. It uses an independent reference corpus consisting of randomly collected tweets. For each of the terms in the reference corpus, the


likelihood of appearance p(w|corpus) is estimated as follows:

p(w|corpus) = \frac{N_w + \delta}{\sum_u N_u + \delta n}    (1)

where N_w is the number of appearances of term w in the corpus, n is the number of term types appearing in the corpus and \delta is a small constant (typically set to 0.5) that is included to regularize the probability estimate (so that a new term that does not appear in the corpus is not assigned a probability of 0). To determine the most important terms in the new corpus, we compute the ratio of the likelihoods of appearance in the two corpora for each term. That is, we compute:

\frac{p(w|corpus_{new})}{p(w|corpus_{ref})}    (2)

The terms with the highest ratio will be the ones with significantly higher than usual frequency of appearance and it is expected that they are related to the most actively discussed topics in the corpus. Stop words, although already removed during preprocessing in our experiments, would typically have a ratio around 1. Once the high-ranking terms are selected, a term graph is constructed and the SCAN graph-based clustering algorithm is applied to extract groups of terms, each of which is considered to be a distinct topic. More specifically, the algorithm steps are the following:

• Selection: The top K terms are selected using the ratio of likelihoods measure and a node for each of them is created in the graph G.

• Linking: The nodes of G are connected using a term linking strategy. First, a similarity measure for pairs of terms is selected and then all pairwise similarities are computed. Various options for the similarity measure are explored: the number of documents in which the terms cooccur, the number of cooccurrences divided by the larger or smaller document frequency of the two terms, and the Jaccard similarity. Moreover, either a kNN approach (linking each term with its k nearest neighbours) or an ε-based approach (linking all pairs of nodes whose similarity is higher than ε) can be used.

• Clustering: The SCAN algorithm is applied to the graph; a topic is generated for each of the detected communities.

• Cluster enrichment: The connectivity of each of the hubs detected by SCAN to each of the communities is checked and, if it exceeds some threshold, the hub is linked to the community. A hub may be linked to more than one topic.

Clearly, the term linking step is crucial for the success of the method. Unfortunately, there is no straightforward method for determining the best similarity measure or node linking strategy to be used. Additionally, it can be expected that the graph construction parameters will need to vary for datasets with different topic granularities and levels of inter-topic connectivity. For this work, the parameters of graph construction were selected using the ground truth for a single independent timeslot. It should also be noted here that different parameters were required for the three different datasets (see Section IV-B), due to the fact that timeslots of different length were used for the three datasets and therefore there were

large differences in the topic granularities and the inter-topic connectivity.
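A sketch of the term selection and linking steps is given below; the SCAN clustering itself is not reproduced, so the function simply returns the weighted term graph that SCAN would consume, and the smoothing constant, the number of selected terms K and the linking threshold eps are example values.

from collections import Counter
from itertools import combinations

def term_likelihoods(docs, delta=0.5):
    # Additive-smoothed p(w|corpus) as in Eq. (1); docs is a list of token lists.
    counts = Counter(t for doc in docs for t in doc)
    total = sum(counts.values())
    n_types = len(counts)
    probs = {w: (c + delta) / (total + delta * n_types) for w, c in counts.items()}
    unseen = delta / (total + delta * n_types)      # probability assigned to unseen terms
    return probs, unseen

def select_bursty_terms(new_docs, ref_docs, k=100, delta=0.5):
    # Rank terms of the new corpus by the likelihood ratio of Eq. (2).
    p_new, _ = term_likelihoods(new_docs, delta)
    p_ref, p_unseen = term_likelihoods(ref_docs, delta)
    ratios = {w: p / p_ref.get(w, p_unseen) for w, p in p_new.items()}
    return [w for w, _ in sorted(ratios.items(), key=lambda x: -x[1])[:k]]

def build_term_graph(new_docs, terms, eps=0.1):
    # Link pairs of selected terms whose Jaccard cooccurrence similarity exceeds eps.
    doc_sets = {t: {i for i, doc in enumerate(new_docs) if t in doc} for t in terms}
    edges = []
    for a, b in combinations(terms, 2):
        union = len(doc_sets[a] | doc_sets[b])
        inter = len(doc_sets[a] & doc_sets[b])
        if union and inter / union > eps:
            edges.append((a, b, inter / union))
    return edges                                    # nodes = terms; SCAN is applied to this graph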

F. Frequent Pattern Mining (FPM)

A problem with feature-pivot methods like the one described in the previous section is that, in order to group together a set of terms, they only take into account their pairwise similarities, which are based on some function of the number of cooccurrences between the pair of terms. When there are closely interconnected topics that share a relatively large number of terms, this procedure is likely to produce generic or merged topics. One option to deal with this issue is to take into account the simultaneous cooccurrence of more than two terms. This motivation leads naturally to frequent itemset mining, a well-studied technique in transaction mining that determines which items are likely to cooccur in a set of transactions [41], applied here to topic detection.

In a social media context, an item is any term w mentioned in a post (excluding stop words, punctuation tokens, etc.). The transaction is the post, and the transaction set is the set of all posts that occur in a time slot T_j. The number of times that any given set of terms occurs in the time slot is defined as its support, and any itemset that meets a minimum support is called a pattern. The initial challenge is to apply highly scalable Frequent Pattern (FP) detection to each time slot in a large stream of posts and then rank the FPs in order to find the most relevant keyword sets for each time slot. These keyword sets may be considered the topics that best illustrate the underlying social interactions. Below, we describe these two processing steps, FP detection and ranking.

1) FP detection: The FP-Growth algorithm is often used as a comparative baseline for frequent itemset mining, due to its good performance [42]. Our implementation uses a distributed version of the algorithm called Parallel FP-Growth [43] that is optimized for use on a Hadoop cluster. FP detection requires three rounds of Map-Reduce processing:

• Keyword list: For each time slot, the initial step of the FP-Growth algorithm is to create a list of keywords sorted by frequency. A minimum support is used to reduce the number of keywords being investigated.

• Parallel construction of an FP-tree data structure: For each time slot, an FP-tree sorts the patterns according to their cooccurrences and their support.

• Frequent pattern extraction: For each time slot, the parallel FP-tree structures are aggregated and analyzed to produce association rules on the transaction set in the form {w1, w2} → P_i = {w3, w4, ...} with support(P_i).

2) FP ranking: Once a set of frequent patterns has been extracted from the dataset, they are ranked and the top N results are returned as candidate topics. The challenge is to rank patterns such that keywords in the candidate topics are sufficiently related, while retaining enough diversity to cover the different underlying subjects of conversation in the social interactions. A common way to rank patterns is to simply use the support of a given pattern; the more often a set of keywords cooccurs, the more likely we can consider it relevant as a topic. Another measure of pattern relevance is


the lift, defined as the ratio of the itemset support to the expected support if the individual items were distributed independently. A higher lift for a pattern means that the keywords are more likely to be found together than would be expected by chance. The lift is appropriate for evaluating association rules, where one set of items implies the presence of another set.

Ranking by frequency favours short patterns, since a subset of any longer pattern is guaranteed to have the same or higher support. Ranking by lift favours longer patterns with cooccurrences of otherwise rare keywords. Another simple mechanism to promote pattern length is to rank a pattern and then apply a length boost weight for every additional token. Likewise, a minimum pattern length can be enforced by pruning smaller patterns. Removing punctuation tokens and stop words may also be performed by assigning a zero score to non-keywords after the Parallel FP-Growth analysis. The results are equivalent; early pruning is the obvious choice for better performance in a running system, but late pruning permits more flexibility for investigating ranking functions, such as manually adding different "don't care" keywords that do not contribute to the underlying topics. This is particularly relevant when analyzing datasets obtained by monitoring specific seed keywords, which would otherwise overwhelm the detected frequent patterns.
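A small sketch of support and lift computation over one time slot follows; it enumerates candidate itemsets by brute force rather than with Parallel FP-Growth, so it only illustrates the ranking measures, and the minimum support, maximum pattern length and top_n are example values.

from collections import Counter
from itertools import combinations

def mine_patterns(transactions, min_support=5, max_len=3):
    # transactions: list of sets of keywords (one per post in the time slot).
    # Returns {itemset: support} for itemsets meeting the minimum support.
    support = Counter()
    for items in transactions:
        for size in range(1, max_len + 1):
            for itemset in combinations(sorted(items), size):
                support[itemset] += 1
    return {p: s for p, s in support.items() if s >= min_support}

def rank_patterns(patterns, n_transactions, top_n=10):
    # Rank patterns by support and by lift (observed vs. expected support under independence).
    single = {p[0]: s for p, s in patterns.items() if len(p) == 1}
    ranked = []
    for pattern, supp in patterns.items():
        if len(pattern) < 2:
            continue
        expected = n_transactions
        for w in pattern:                       # expected count if items were independent
            expected *= single.get(w, 1) / n_transactions
        lift = supp / expected if expected else 0.0
        ranked.append((pattern, supp, lift))
    by_support = sorted(ranked, key=lambda x: -x[1])[:top_n]
    by_lift = sorted(ranked, key=lambda x: -x[2])[:top_n]
    return by_support, by_lift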

It is important to note that pruning or penalizing a longer pattern (due to late stop word removal or specific keyword weighting) permits the subsets of that pattern to remain viable candidate topics. A subset of any pattern will always have equal or greater support, and there are always more subsets than there are larger patterns. However, when a longer pattern has subsets with exactly the same support (i.e. the subset of keywords only cooccur within the larger pattern), we can safely prune those subsets. Without additional information, we cannot tell which of the keywords in the larger pattern to discard in order to produce a better candidate topic. For instance, in the case of Twitter posts, the presence of long patterns is often due to retweeting popular status updates.

G. Soft Frequent Pattern Mining (SFPM)

The FPM approach of Section III-F provided an elegant solution to the problem of feature-pivot methods that take into account only pairwise cooccurrences between terms in the case of corpora with densely interconnected topics. It can be said that it lies at the other end of the spectrum of methods that rely on the number of cooccurrences between terms: whereas the approach in Section III-E examined only pairwise cooccurrences, FPM examines cooccurrences between any number of terms, typically larger than two. A question that naturally arises is whether it is possible to formulate a method that lies between these two extremes. Such a method would examine cooccurrence patterns between sets of terms with cardinality larger than two, like FPM does, but it would be less strict by not requiring that all terms in these sets cooccur frequently. Instead, in order to ensure topic cohesiveness, it would require that large subsets of the grouped terms, though not necessarily all of them, cooccur frequently, resulting in a "soft" version of FPM. Next, we propose a method for achieving this.

The proposed approach works by maintaining a set of terms S, to which new terms are added in a greedy manner, according to how often they cooccur with the terms in S. In order to quantify the cooccurrence match between a set S and a candidate term t, we maintain a vector D_S for S and a vector D_t for the term t, both of length n, where n is the number of documents in the collection. The ith element of D_S denotes how many of the terms in S cooccur in the ith document, whereas the ith element of D_t is a binary indicator that represents whether the term t occurs in the ith document or not. Thus, the vector D_t of a term t that frequently cooccurs with the terms in set S will have a high cosine similarity to the corresponding vector D_S. Note that some of the elements of D_S may have the value |S|, meaning that all items in S cooccur in the corresponding documents, whereas others may have a smaller value, indicating that only a subset of the terms in S cooccur in the corresponding documents. For a term that is examined for expansion of S, it is clear that there will be some contribution to the similarity score also from the documents in which not all terms cooccur, albeit somewhat smaller than the contribution from documents in which all terms cooccur. This way we achieve a "soft" matching between a term that is considered for expansion and a set S. Finding the best matching term can be done either using exhaustive search or some approximate nearest neighbour scheme such as LSH.

Since we utilize a greedy approach that expands the set S with the best matching term, we need a criterion for terminating the expansion process. The termination criterion needs to deal with the cohesiveness of the generated topics, meaning that if not properly set, the resulting topics may either end up being too generic (with too few keywords) or a mixture of topics (with too many keywords related to possibly irrelevant topics). To deal with this, we use the cosine similarity between S and the next best matching term. If the similarity is above a threshold, we add the term, otherwise the expansion process stops. This threshold is the only parameter of the proposed algorithm and is set to be a function of the cardinality of S. In particular we use a sigmoid function of the form:

\theta(S) = 1 - \frac{1}{1 + \exp\left((|S| - b)/c\right)}    (3)

The parameters b and c can be used to control the size of the term clusters and how soft the cooccurrence constraints will be. Practically, we set the values of b and c so that the addition of terms is easier when the cardinality of S is small (the threshold is low), but harder when the cardinality is larger. A low threshold for small values of |S| is required so that it is possible for terms that are associated with different topics (and therefore occur in more documents than those corresponding to the non-zero elements of D_S) to join the set S. The high threshold for larger values of |S| is required so that S does not grow without limit. Since we require a set of topics, rather than a single topic, the greedy search procedure is applied as many times as the number of considered terms, each time initializing S with a candidate term. This will produce as many topics as the set of terms


Algorithm 1 SFPM algorithm

T: the set of candidate terms
Topics = ∅
for each term t in T do
    S = {t}; D_S = D_t; ContinueExpanding = true
    repeat
        t̂ = GetBestMatchingTerm(D_S, S, T)
        sim = CosineSimilarity(D_S, D_t̂)
        if sim > θ(S) then
            S = S ∪ {t̂}; D_S = D_S + D_t̂
            for each i such that D_S(i) < |S|/2, set D_S(i) = 0
        else
            ContinueExpanding = false
        end if
    until ContinueExpanding = false
    Topics = Topics ∪ {S}
end for
Remove duplicates from Topics

considered, many of which will be duplicates; thus we post-process the results to remove them. To keep the search procedure within reasonable limits, we select the top n terms with the highest likelihood ratio (Eq. 2).

In early experiments with the described algorithm, it was found that, after some time, especially if a very frequently occurring term is added to the set, the vector D_S may include too many non-zero entries filled with small values. This may have the effect that a term is deemed relevant to S because it cooccurs frequently with only a very small number of terms in the set rather than with most of them. In order to deal with this issue, after each expansion step we reset to zero any entries of D_S that have a value smaller than |S|/2. Moreover, a nice feature of the approach is that the most relevant documents for a topic can be retrieved from the entries with the highest document count in the vector D_S. The pseudocode of SFPM is presented in Algorithm 1.
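A compact NumPy sketch of the greedy expansion follows; exhaustive search replaces the approximate nearest-neighbour option, and the sigmoid parameters b and c are example values rather than the ones used in our experiments.

import numpy as np

def theta(size, b=4.0, c=1.0):
    # Cardinality-dependent similarity threshold of Eq. (3).
    return 1.0 - 1.0 / (1.0 + np.exp((size - b) / c))

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def sfpm(term_vectors):
    # term_vectors: {term: binary NumPy array of length n_docs (occurrence per document)}.
    # Returns a list of topics, each a set of terms.
    topics = set()
    for seed, d_seed in term_vectors.items():
        S = {seed}
        d_S = d_seed.astype(float).copy()
        while True:
            best, best_sim = None, -1.0
            for t, d_t in term_vectors.items():     # exhaustive search for best matching term
                if t not in S:
                    sim = cosine(d_S, d_t)
                    if sim > best_sim:
                        best, best_sim = t, sim
            if best is None or best_sim <= theta(len(S)):
                break
            S.add(best)
            d_S += term_vectors[best]
            d_S[d_S < len(S) / 2.0] = 0.0           # reset entries supported by few terms of S
        topics.add(frozenset(S))                    # duplicate topics collapse automatically
    return [set(t) for t in topics]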

H. BNgram

Both the FPM and SFPM approaches attempt to take into account the simultaneous cooccurrences of more than two terms. However, it is also possible to achieve a similar result in a simpler way: use n-grams instead of unigrams. This naturally groups together cooccurring terms and may be considered to offer a first level of term grouping. Using n-grams makes particular sense for Twitter, since a large number of status updates are just copies or retweets of previous messages, so important n-grams tend to be frequent.

Additionally, we introduce a new feature selection method. We take into account the changing frequency of terms over time as a useful source of information to detect emerging topics. The main goal of this approach is to find emerging topics in post streams by comparing the term frequencies of the current time slot with those of preceding time slots. We propose the df-idf_t metric, which introduces time into the classic tf-idf score. We use historical data to penalize topics that began in the past and are still popular in the present, and that therefore do not define new topics.

This approach indexes all keywords from the posts of the collection. The keyword indices, implemented using Lucene, are organized into different time slots. In addition to single keywords, the index also considers bigrams and trigrams. Once the index is created, the df-idf_t score is computed for each n-gram of the current time slot i based on its document frequency in this time slot, penalized by the logarithm of the average of its document frequencies in the previous t time slots (see Equation 4).

df\text{-}idf_t = \frac{df_i + 1}{\log\left(\frac{\sum_{j=1}^{t} df_{i-j}}{t} + 1\right) + 1}    (4)

In addition, a boost factor is used to raise the importance of proper nouns (persons, locations and organizations, in our case), identified with a standard named entity recognizer [44], as they are essential keywords in most discussed stories. The use of this factor is similar to [17], where the authors highlight the importance of such words for grouping results. The value of this factor is based on the best results from previous experiments: boost = 1.5 if the n-gram contains a named entity and boost = 1 otherwise.
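The score of Eq. (4) with the named-entity boost can be sketched as follows; here document frequencies are computed from plain Python sets rather than from a Lucene index, and is_named_entity is a hypothetical stand-in predicate for the external named entity recognizer.

import math

def df_idf_t(ngram, current_slot, previous_slots, is_named_entity=lambda ng: False):
    # current_slot: list of sets of n-grams, one set per post in time slot i.
    # previous_slots: list of such lists, one per each of the preceding t time slots.
    df_i = sum(1 for doc in current_slot if ngram in doc)
    t = max(len(previous_slots), 1)
    avg_past = sum(sum(1 for doc in slot if ngram in doc) for slot in previous_slots) / t
    score = (df_i + 1) / (math.log(avg_past + 1) + 1)   # Eq. (4)
    boost = 1.5 if is_named_entity(ngram) else 1.0      # named-entity boost
    return boost * score

def rank_ngrams(current_slot, previous_slots, is_named_entity=lambda ng: False, top_n=50):
    # Rank all n-grams of the current slot by their boosted df-idf_t score.
    vocabulary = set().union(*current_slot) if current_slot else set()
    scored = {ng: df_idf_t(ng, current_slot, previous_slots, is_named_entity)
              for ng in vocabulary}
    return sorted(scored.items(), key=lambda x: -x[1])[:top_n]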

As a result of this process, a ranking of n-grams is created based on their df -idft scores. A single n-gram is often not very informative, but a group of them often offers interesting details of a story. Therefore, we use a clustering algorithm to group the most representative n-grams into clusters, each representing a single topic. The clustering is based on distances between n-grams or clusters of n-grams. From the set of distances, those not exceeding a distance threshold are assumed to represent the same topic.

We define the similarity between two n-grams as the fraction of posts that contain both of them. We initially assign every n-gram to its own singleton cluster, then follow a standard "group average" hierarchical clustering algorithm [45] to iteratively find and merge the closest pair of clusters. When an n-gram cluster is joined to another, the similarities of the new cluster to the other clusters are computed as the average of the similarities of the combined clusters. The clustering is repeated until the similarity between the nearest un-merged clusters falls below a fixed threshold, producing the final set of topic clusters for the corresponding time slot.

In our experiments, we use a similarity threshold of 0.5, which means that two n-grams must appear in more than 50% of the same tweets in order to belong to the same topic. This assumption is stronger in our case because we only consider the posts of a specific time slot, so it is more likely that n-gram clusters whose similarities exceed the threshold represent the same topic. Preliminary experiments suggest that the value of this threshold is not critical.
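The group-average agglomerative step can be sketched as follows; pairwise similarity is the fraction of posts containing both n-grams, the merge threshold of 0.5 mirrors the value used above, and the averaging of similarities when two clusters merge follows the description given earlier.

def ngram_similarity(a, b, posts):
    # Fraction of posts (each a set of n-grams) that contain both n-grams.
    both = sum(1 for p in posts if a in p and b in p)
    return both / len(posts) if posts else 0.0

def cluster_ngrams(ngrams, posts, threshold=0.5):
    # Group-average agglomerative clustering of n-grams into topic clusters.
    clusters = [[ng] for ng in ngrams]
    sim = {(i, j): ngram_similarity(clusters[i][0], clusters[j][0], posts)
           for i in range(len(clusters)) for j in range(i + 1, len(clusters))}
    active = set(range(len(clusters)))
    while len(active) > 1:
        (i, j), best = max(((pair, s) for pair, s in sim.items()
                            if pair[0] in active and pair[1] in active),
                           key=lambda x: x[1])
        if best < threshold:
            break
        for k in active:                 # group-average update of similarities
            if k not in (i, j):
                a = (min(i, k), max(i, k))
                b = (min(j, k), max(j, k))
                sim[a] = (sim[a] + sim[b]) / 2.0
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        active.discard(j)
    return [clusters[i] for i in active]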

Finally, the clusters are ranked according to the highest df-idf_t score of the n-grams contained in the cluster, as shown in Fig. 1. This ranking criterion is based on the assumption that each cluster score should be associated with the score of the most representative n-gram in the cluster, as the cluster is mainly composed of posts containing it.

Initial experiments (not described here) revealed that, considered independently, the use of df-idf_t, n-grams and named entity boosting each improved topic recall, with the best results obtained when all three are used.


Fig. 1. Index organization: each n-gram keeps references to the tweets in which it is contained. Every cluster is composed of different n-grams, and its score is computed as the maximum df-idf_t value among them.

The main contribution of this approach is the use of the temporal dimension of the data to detect emerging stories. There are other term weighting approaches that consider the temporal dimension, but most of them suffer from several shortcomings. For instance, Shamma et al. [25] present two methods of finding "peaky topics". They find the peak terms of a time slot compared to the rest of the corpus, whereas we compare each slot to the immediately preceding time slots. If some topic is discussed at several different times, their approach could miss it, since the defining words would be highly frequent in the whole corpus. In addition, their approach only uses unigrams (i.e., single words), which often seem too limited to identify stories. Lastly, their use of the whole corpus favours batch-mode processing and is less suitable for real-time analysis.

IV. EXPERIMENTAL RESULTS

To test the performance of the topic detection methods proposed in Section III, we applied them to three Twitter datasets focused on three popular real-world events. We first present the datasets and describe the process of creating the ground truth. Then, we present the performance of the methods, comparing different algorithm implementations.

A. Evaluation methodology

The evaluation framework consists of four steps:

Data collection. We extracted Twitter data from the public streaming API of Twitter for three major events in 2012: the FA Cup final, the climax of the English domestic football season; the "Super Tuesday" (ST) primaries, part of the presidential nomination race of the US Republican Party; and the US Elections that took place in November 2012. Tweets related to those events were collected using a set of filter keywords and hashtags chosen by experts. We partitioned the datasets into time slots, taking into account the average volume of tweets and the nature of the target events: one hour for ST, one minute for the FA Cup and ten minutes for the US Elections collection.

Extraction of ground truth. Clearly, there is a large number of topics hidden in the collections. However, the sheer volume of the datasets implies that an overwhelming amount of effort would be required to manually extract all of them. Instead, we relied on mainstream media reports to identify significant topics and focus on a subset of the actual set of topics. We reviewed the published media accounts of the events and chose a set of stories that were significant, time-specific, and well represented in the news media in order to build a topic ground truth. For each topic, the ground truth consists of a set of keywords and a concise headline describing the story. We assign a ground truth topic to one of the time slots based on the time at which that topic emerged in mainstream news. Some ground truth topic examples are shown in Table I.

Topic detection. We ran the topic detection algorithms on each time slot for which at least one topic is contained in the ground truth. In total, we had 13 one-minute slots with at least one topic for the FA Cup, eight one-hour slots for ST and twenty-six ten-minute slots for the US Elections. We only consider data from those slots as input to the methods.

Comparison of topic detection output with ground truth. The automatically detected topics (i.e., lists of keywords) were compared to the ground truth using three metrics (a sketch of the keyword-matching step follows the list):

• Topic recall: Percentage of ground truth topics successfully detected by a method. A topic was considered successfully detected if the automatically produced set of keywords contained all mandatory keywords for it. To address the problem of spelling variations, we considered that a detected term matched a ground truth keyword when their Levenshtein similarity was at least 0.8.

• Keyword precision: Percentage of correctly detected keywords (as described above) out of the total number of keywords for the topics matched to some ground-truth topic in the time slot under consideration. The total precision of a method is computed by micro-averaging the individual precision scores over all time slots.

• Keyword recall: Percentage of correctly detected keywords over the total number of keywords of the ground truth topics that have been matched to some candidate topic in the time slot under consideration. The total recall is similarly computed by micro-averaging.
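The sketch below implements the keyword matching and score bookkeeping under assumptions: the Levenshtein similarity is one minus the normalised edit distance, each ground-truth topic is represented as a dict with assumed "mandatory" and "all" keyword lists, and the precision/recall accounting is a rough illustration of the definitions above rather than the exact evaluation script.

def levenshtein_similarity(a, b):
    # 1 - edit_distance / max_length, so identical strings score 1.0.
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

def keyword_matched(keyword, detected_terms, threshold=0.8):
    return any(levenshtein_similarity(keyword, t) >= threshold for t in detected_terms)

def evaluate_slot(ground_truth, detected_topics, threshold=0.8):
    # ground_truth: list of dicts {"mandatory": [...], "all": [...]} (assumed representation).
    # detected_topics: list of keyword lists produced by a method for this time slot.
    recalled, matched_kw, gt_kw, det_kw = 0, 0, 0, 0
    for gt in ground_truth:
        match = None
        for topic in detected_topics:
            if all(keyword_matched(k, topic, threshold) for k in gt["mandatory"]):
                match = topic
                break
        if match is not None:
            recalled += 1
            matched_kw += sum(1 for k in gt["all"] if keyword_matched(k, match, threshold))
            gt_kw += len(gt["all"])
            det_kw += len(match)
    topic_recall = recalled / len(ground_truth) if ground_truth else 0.0
    kw_precision = matched_kw / det_kw if det_kw else 0.0
    kw_recall = matched_kw / gt_kw if gt_kw else 0.0
    return topic_recall, kw_precision, kw_recall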

These scores were computed for the top n topics produced by the topic detection algorithms, for a range of values of n. They were automatically computed by an evaluation script but, to ensure the reliability of the results, we conducted several rounds of manual evaluation and confirmed its agreement with the automatically produced results.

Note that we did not include topic precision as an evaluation measure. The reason is that to measure topic precision, we would need to compare the topics that our algorithms detect with the set of every newsworthy topic that took place at that particular time. A missing cat and a national election may both be newsworthy in their own way, and people certainly send tweets about both, but there is no practical way to create a definitive list of all such events. Instead, we only have a subset of the topics that occurred in each time slot, so we
