Death and Change Tracking: Wikipedia Edit Bursts

Miles Lincoln

ABSTRACT

Bursts have been used in prior informetrics studies to predict the emergence of new fields and research trends by tracking the occurrence of common terms throughout a body of knowledge, such as the full text of a journal. If a sudden flurry of activity in a field can be observed, it may be possible to quantify it and estimate how far the new area will take off. Having identified a new variety of burst in the activity on a Wikipedia page following a newsworthy event, this study asks what could be predicted by visualizing these bursts, since revision histories are a granular, instantly updated resource that can be observed and analyzed. Beyond prediction, can analysis of this data be correlated with other aspects of the page's subject? For example, does the popularity of an actor correlate with the size of the edit burst on their Wikipedia page following their death? This study develops a workflow and accompanying scripts for processing Wikipedia data; they can be applied to any sample of pages to visualize bursts in activity, from which one could look for correlations and make predictions. For this experiment, the chosen sample imposed some constraints. Although some correlation was observed between an actor's financial success and the number of edits made to their Wikipedia page in the burst following their death, there were too few such instances and too much discrepancy between individual actors to reach a firm conclusion.

1. INTRODUCTION

As more and more knowledge enters the digital realm and becomes machine-readable, the ability to analyze knowledge quantitatively expands. Every day, we see trending topics on Twitter, popular queries on our favorite search engine and countless other instances of recurring phrases linking together to gain momentum. These bursts, or recurring instances of similar content (like a trending topic) or activity (like thousands of people Liking a Facebook post), are everywhere.

In the past decade, Wikipedia has grown to be a reliable source for information on an endless variety of subjects. From its roots as a niche website with the reputation of dubious or unsubstantiated knowledge, Wikipedia has evolved into a constantly updated information resource [1] with a dedicated group of volunteer curators with rigid standards for ensuring quality knowledge. In an attempt to visualize some of the activity on a Wikipedia page, this study focuses on what it identifies as edit bursts, or spikes represented by a flurry of edits occurring within a short period of time (which, when graphed, display a spike).

Burst, or spike, beginning at x=3

These bursts are typically associated with some occurrence related to the subject of the page, such as a newsworthy event. The news reaches the general public, then one editor updates the page to reflect this news [2], followed by a number of other editors who intend to improve the quality of the first edit, tweak the wording or content in some way, or revert the change altogether. This study predicts that the length and volume of this cycle of editing, re-editing and undoing is closely related to the popularity of the subject of the page. In theory, this is how Wikipedia should work. However, an edit spike may also be caused by something unrelated to the content of the page and instead have to do with the editors themselves, who may be spurred to edit a page for no other reason than to improve its quality, or even for devious reasons such as wikivandalism, in which edits are made to degrade the quality of the knowledge, sometimes surreptitiously [3]. Such acts of vandalism, and the reactionary edits and reversions they provoke, could also create an edit spike. Wikipedia allows anyone to edit almost any page (it has instituted processes to screen edits applied to living people [4]), and it also provides snapshots of every edit ever made to a page, along with who made the edit and when. Given the quantifiable nature of dates, it is the 'when' that this paper is most interested in and capable of visualizing. This particular study examines the activity on the Wikipedia pages of a range of actors around their time of death.

Bursts have previously been used in informetrics to predict trends in research and emerging fields. Guo et al. found that word bursts frequently preceded an emerging area, which, in turn, attracts new authors [5], making these bursts useful indicators of future success. Because the sample in this study is not large enough and does not stretch over a long enough term, it is difficult to use it for prediction or to model the Wikipedia data on the research in Guo et al.'s paper, but there are still similarities worth considering. The Wikipedia data also sets itself apart by being contained: unlike the studies of trend bursts and emerging fields, which span the full text of a body of scientific literature, each wiki page in this study is relevant only to the actor it describes. The spikes are also brief and difficult to extend into a method for prediction. Because each contained data set is relevant only to a single actor, now deceased, the value of a prediction may be low.

It would appear that the dynamic time in an actor's life story has passed, but regardless of what the future holds, we have a series of snapshots immediately surrounding the time of death. Although no further news is expected of a dead celebrity, perhaps previously unrecognized relationships can be discovered in these community reactions to celebrity mortality.

As a sample, a group of male actors was picked who had passed away in the first decade of Wikipedia's existence (2001-2009), one actor death (and thus one expected burst) per calendar year. The study chose actors by browsing the 'Deaths in (year)' pages on Wikipedia and selecting the most recognizable name in acting. The nine actors chosen are listed in a table below, along with their date of death. While macabre, this sampling provided a set of Wikipedia pages with (almost) guaranteed bursts around a predictable date attached to a controlled event (death of actor). Though efforts were made to select the most popular (a subjective quality) actor each year, this was often difficult, and some years have a notable lack of star power. Because actors were picked based on fame, the sample is a diverse group with little in common with regard to age, cause of death, and other factors that would affect the newsworthiness of their passing and the likelihood that a large number of tech-savvy Wikipedia editors would jump into action. By pulling in one actor who should exhibit a spike in edit activity for each year that Wikipedia has existed, we can also observe how the edit spike evolves over the long-term lifespan of Wikipedia.

Actor             Date of death    Age at death
Jack Lemmon       6/27/2001        76
James Coburn      11/18/2002       74
Gregory Peck      6/12/2003        87
Marlon Brando     7/1/2004         80
Ossie Davis       2/4/2005         87
Jack Palance      11/10/2006       87
Robert Goulet     10/30/2007       73
Heath Ledger      1/22/2008        28
Patrick Swayze    9/14/2009        57

Actors used in study

2. METHODS

This study employed a script written in Perl (a programming language well suited to parsing text, such as dates, with regular expressions) to scrape Wikipedia page revision histories for the dates of all edits. Wikipedia shows only 500 revisions at a time, so it was necessary to copy and paste the history 500 entries at a time into a text document that could then be fed to the script. The script produced a sequential list of dates in DD/MM/YYYY format, with one entry for each edit made to a particular page, so that a date appears once for every edit made on that day.
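The extraction step described above can be sketched as follows. This is a minimal illustration in Python rather than the study's Perl script, which is not reproduced here. It assumes that each pasted history line carries a timestamp of the form "21:47, 22 January 2008"; the function name, the regular expression and the sample lines (including the user names) are hypothetical.

```python
import re
from datetime import datetime

# Assumed timestamp shape in a pasted history line, e.g.
# "(cur | prev) 21:47, 22 January 2008 ExampleUser (talk | contribs) ..."
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}, (\d{1,2} [A-Z][a-z]+ \d{4})\b")

def extract_edit_dates(history_text):
    """Return one DD/MM/YYYY string per edit found in the pasted text."""
    dates = []
    for match in TIMESTAMP.finditer(history_text):
        try:
            day = datetime.strptime(match.group(1), "%d %B %Y")
        except ValueError:
            continue  # looked like a timestamp but is not a real date
        dates.append(day.strftime("%d/%m/%Y"))
    return dates

# Hypothetical sample of pasted revision-history lines.
sample = """(cur | prev) 21:47, 22 January 2008 ExampleUser (talk | contribs) . . (41,200 bytes)
(cur | prev) 21:45, 22 January 2008 AnotherUser (talk | contribs) . . (41,150 bytes)
(cur | prev) 09:12, 21 January 2008 ExampleUser (talk | contribs) . . (40,980 bytes)"""

print(extract_edit_dates(sample))
```

Each edit yields one date string, so a day with many edits appears many times in the output, which is what the later counting step relies on.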

Sample output of Perl script

With the data in this new format, it was inserted into a pivot table in Excel, which counted how many times each date occurred in order to show which days had a high frequency of edits, or a burst. These data were then graphed using several different parameters in Excel. This process was repeated for the nine sample Wikipedia pages exhibiting edit bursts, and the results were combined into a single spreadsheet. Unifying the data made it simple to chart different comparisons across any facet.
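The pivot-table step amounts to counting how often each date occurs. A rough Python equivalent is sketched below; it stands in for the Excel pivot table and is not the study's own tooling. The function names and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import datetime

def edits_per_day(dates):
    """Count occurrences of each DD/MM/YYYY date, in chronological order."""
    counts = Counter(dates)
    ordered = sorted(counts, key=lambda d: datetime.strptime(d, "%d/%m/%Y"))
    return [(d, counts[d]) for d in ordered]

def busiest_day(dates):
    """Return the (date, count) pair with the most edits."""
    return max(edits_per_day(dates), key=lambda pair: pair[1])

# Hypothetical script output: one entry per edit.
dates = ["22/01/2008", "22/01/2008", "21/01/2008", "22/01/2008", "23/01/2008"]

print(edits_per_day(dates))
print(busiest_day(dates))
```

Sorting by the parsed date rather than by the string matters because DD/MM/YYYY strings do not sort chronologically.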

3. RESULTS

The first thing to look at was the evolution of a burst over time. How did an edit burst change from 2002 to 2009? This was first done by aligning all of the bursts on the same graph and comparing their activity on the same timeline.
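One way to place bursts from different years on a shared timeline is to re-index each page's daily counts relative to the date of death. The sketch below assumes the day of death sits at x=10, as in the graph caption; the window length of 40 days, the function name and the sample counts are assumptions made for illustration.

```python
from datetime import date, timedelta

def align_on_event(daily_counts, event_day, anchor=10, length=40):
    """Return `length` daily edit counts with `event_day` placed at index `anchor`.

    daily_counts maps datetime.date -> number of edits; missing days count as 0.
    """
    start = event_day - timedelta(days=anchor)
    return [daily_counts.get(start + timedelta(days=i), 0) for i in range(length)]

# Hypothetical daily counts around a death on 22 January 2008.
counts = {
    date(2008, 1, 20): 2,
    date(2008, 1, 22): 300,
    date(2008, 1, 23): 120,
}

aligned = align_on_event(counts, date(2008, 1, 22))
print(aligned[8:12])
```

Days with no edits come out as 0, which cannot be drawn on a logarithmic axis, so a plot on a log scale would need to skip or offset those points.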

Deaths 2002-09 aligned on x=10 on a log scale

Here, we can see the more recent deaths at the upper end, exhibiting the largest edit bursts. Also interesting is the activity that continues after the initial burst for many of the actors studied, and the lull in activity that appears to fall around x=30. Note that Jack Lemmon's death occurred prior to Wikipedia's current editing numeration scheme, and his page had so little activity as to be inconsequential, so he has been left off of some graphs in order to keep clutter to a minimum. To get a different view, growth was also compared by observing the trend in the maximum number of edits per day over the years.

Change in initial (day of death) burst size 2002-09

Both of these approaches produced interesting graphs. As the second graph shows, there is a trend toward an increasing number of edits, but not at a steady rate. The points where the graph does not increase can plausibly be explained by the relative unfamiliarity of the actor's name. When all of the data is combined, it becomes apparent that tracking data over the entire lifespan of Wikipedia shows more variation than desired. For example, it is nearly impossible to compare Jack Lemmon's 2001 death, with a spike of 1 edit, with Heath Ledger's spike of 352 edits almost a decade later. A larger sampling of similar subjects in a shorter time period would be preferable.

Focusing on what we can work with, when looking specifically at the two most recent edit bursts (Ledger and Swayze), if one adjusts them to take into consideration one metric for judging an actor's popularity, the spikes come close to matching, though they are not identical. Using the average box office gross of each actor ($77,639,600 for Ledger, $49,751,800 for Swayze), adjusted for price inflation [6,7], we see the difference in the size of the edit bursts shrink significantly. Ledger's edits are untouched; Swayze's edits have been multiplied by a factor of 1.56 ($49.75M × 1.56 ≈ $77.6M).
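The adjustment above is a single ratio applied to every daily count. The sketch below recomputes the factor from the two average gross figures quoted in the text; the Swayze daily counts are placeholder values, since the source does not list them, and the function names are hypothetical.

```python
# Inflation-adjusted average gross per film, as quoted in the text.
LEDGER_AVG_GROSS = 77_639_600
SWAYZE_AVG_GROSS = 49_751_800

def popularity_factor(reference_gross, other_gross):
    """Factor that scales the other actor's edits up to the reference actor."""
    return reference_gross / other_gross

def adjust_counts(daily_counts, factor):
    """Multiply every daily edit count by the popularity factor."""
    return [count * factor for count in daily_counts]

factor = popularity_factor(LEDGER_AVG_GROSS, SWAYZE_AVG_GROSS)
print(round(factor, 2))

# Placeholder daily edit counts, NOT values from the study.
swayze_counts = [100, 50, 10]
print([round(c, 1) for c in adjust_counts(swayze_counts, factor)])
```

Only the ratio of the two gross figures matters here, so any popularity metric with a meaningful ratio could be substituted for box office gross.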

Brad Pitt's edit frequency in red, trendline is the moving average with a period of 200

Pitt is one of the best-known actors, but his largest edit burst is just barely over 50, only about one-seventh the size of Heath Ledger's.
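The trendline in these graphs is a moving average with a period of 200. A trailing moving average of that kind can be sketched as below. Treating the trendline as a trailing average that starts once a full window is available is an assumption based on the use of Excel in the Methods section, and the function name is hypothetical.

```python
def moving_average(values, period=200):
    """Trailing moving average: one value per full window of `period` points."""
    if period <= 0:
        raise ValueError("period must be positive")
    averages = []
    window_sum = 0.0
    for i, value in enumerate(values):
        window_sum += value
        if i >= period:
            window_sum -= values[i - period]  # drop the point leaving the window
        if i >= period - 1:
            averages.append(window_sum / period)
    return averages

# Small illustration with a short period so the result is easy to check by hand.
print(moving_average([1, 2, 3, 4, 5], period=2))
```

With a period as long as 200, a brief spike is spread thinly across the window, which is why the trendline smooths out individual bursts.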

Moving average with period=200 blue: Pitt, red: Ledger graphed on a log scale

Here is the living actor (Pitt in red) compared to Ledger (blue) on the same graph (log scale). Again we see how this popular actor does not experience turbulent edit spikes in their everyday (celebrity) life. Another of the (undesired) variables inherent in our data is the cause of death/type of death. Our two clearest examples of edit bursts (Ledger and Swayze) have very different circumstances surrounding their deaths. Ledger's death was unexpected, while Swayze's health was known to be in decline.

Comparison of Heath Ledger and Patrick Swayze (adjusted)

A larger sampling with more spikes occurring in the past two years would be helpful in confirming a correlation between the size and characteristics of an edit spike. Also required is a metric for comparing whatever this new sample is. In this study, the sample is constrained to actors who are recently deceased. Perhaps a future study could compare less decisive events, such as the debut of a film.

Finally, having looked at the revision history activity of so many actors who have died, it is worth looking at an actor still working to get a basis for comparison.

Ledger: blue, Swayze: red

Each actor shows a large initial spike near the day of their death, but Ledger also shows a series of spikes following, as more details became known to the public. The public knew most circumstances of Swayze's passing at the beginning, so edit activity is minimal after the initial spike; there are no aftershocks.

4. DISCUSSION

While requiring some manual input from the user, this study establishes a workflow and provides the resources needed for it to be repeated across any number of data sets. In order to move forward, one needs only a sample set and a quality metric that can be applied to the entire data set, so that the playing field is leveled before looking for a correlation.

Because only Ledger and Swayze had noteworthy acting careers recent enough to be featured on a website listing their inflation-adjusted average gross, they were the only two actors whose edit bursts could be correlated with popularity in a measurable way within the scope of this study. Because that sample consists of only two actors, it would not be wise to claim an established correlation between the volume of the Wikipedia response to an actor's death and the actor's popularity as measured by the average gross of their films, but this study establishes the groundwork to easily analyze similar relations.

Furthermore, this study has also circumstantially displayed instances wherein the spike surrounding an actor's death is far greater than any spikes surrounding an actor's life. We have also visualized some of the different types of reaction spikes, which could tell the viewer something about how the news reached the public (all at once, or in waves).

5. FUTURE STUDY

When using a sample of resources about humans created by other humans, there are an infinite number of variables to control. Future research might refine this study to focus on controllable aspects, so that the experiment is not vulnerable to the many differences that make the reaction to Heath Ledger's unexpected death very different from the reaction to an actor's passing of old age, which in turn differs from the reaction that plays out as a celebrity's ailing health declines in the public eye. Future studies may want to seek a sampling with more in common, such as bursts that occur in the same year, so that the user base of Wikipedia at the time and other temporal aspects are not variables.

This study looked at edits only in a quantitative light. The quality of the edits made was outside the scope of this study. Future studies may attempt to classify different types of edits, maybe even automatically, as Wikipedia already categorizes some types of edits, such as "minor edits." This way, it would be possible to generate a more dimensional picture of the edits occurring.

6. REFERENCES

[1] Mackey, Robert. 2009. Wikipedia's Rapid Reaction to Outburst during Obama Speech. The New York Times.

[2] Onion, The. 2008. Area Man Honored To Be The One Who Added Death Date To Heath Ledger's Wikipedia Page.

[3] Wikipedia. 2011. Vandalism on Wikipedia.

[4] Cohen, Noam. 2009. Wikipedia to Limit Changes to Articles on People. The New York Times.

[5] Guo, Hanning, Scott Weingart, Katy Börner. 2011. Mixed-indicators model for identifying emerging research areas. Scientometrics. 89 (1), 421-435.

[6] Box Office Mojo. 2011. Heath Ledger Movie Box Office Results.

[7] Box Office Mojo. 2011. Patrick Swayze Movie Box Office Results.
