Detecting Argumentative Discourse Acts with Linguistic Alignment

Timothy Niven and Hung-Yu Kao

Intelligent Knowledge Management Lab
Department of Computer Science and Information Engineering
National Cheng Kung University, Tainan, Taiwan

tim.niven.public@, hykao@mail.ncku.edu.tw

Abstract

We report the results of preliminary investigations into the relationship between linguistic alignment and dialogical argumentation at the level of discourse acts. We annotated a proof-of-concept dataset with illocutions and transitions at the comment level based on Inference Anchoring Theory. We estimated linguistic alignment across discourse acts and found significant variation. Alignment features calculated at the dyad level are found to be useful for detecting a range of argumentative discourse acts.

A: ...To be able to claim that life expectancy and health are tied to religion you have to rule out hundreds of other factors: diet; lifestyle; racial characteristics; genetic pre-disposition (religion tends to run in families) etc...

B: ...Can I just have ANY religion and have a longer life?

Figure 1: An example dyad from our dataset. Without disambiguating information it is hard to know if B's reply is pure or assertive questioning.

1 Introduction

Argumentation mining remains a difficult problem for machines. Even for humans, understanding the substance of an argument can involve complex pragmatic interpretation (Cohen, 1987). Consider the reply of B in Figure 1. Absent broader conversational context, and perhaps knowledge of the background beliefs of B, it can be difficult to judge whether they are asking "which religions are correlated with increased life expectancy?" (pure questioning) or giving their opinion that "not just any religion is correlated with a longer life" (assertive questioning). Since only the latter is an argumentative discourse unit (ADU) (Stede, 2013), ambiguities like this make it difficult to accurately identify the structure of argumentation.

In this work we investigate using a subtle yet robust signal to resolve such ambiguity: linguistic alignment. Alignment can be calculated in an unsupervised manner and does not require textual understanding. It is therefore well suited to our current technology as an extra pragmatic feature to assist dialogical argumentation mining. Our hypothesis is that, since alignment has been shown to relate to communication strategies (Doyle and Frank, 2016), different alignment effects will be observed over different argumentative discourse acts, providing signal for their detection. For example, Figure 2 shows our estimated posterior densities for alignment scores over pure and assertive questioning. On this basis, if B's comment in Figure 1 is accompanied by a significantly positive alignment score, we would be correct more often than not in classifying it as assertive questioning.

Figure 2: Posterior densities on alignment estimates for pure and assertive questioning in our dataset, indicating that alignment can help to disambiguate discourse acts.

In this preliminary work we aim to address the following questions:

1. Are the majority of argumentative discourse acts associated with significantly different alignment effects?

2. Are alignment features useful for detecting argumentative discourse acts?

2 Background and Related Work

Linguistic alignment is a form of communication accommodation (Giles et al., 1991) whereby speakers adapt their word choice to match their interlocutor (Niederhoffer and Pennebaker, 2002). It can be calculated as an increase in the probability of using a word category having just heard it, relative to a baseline usage rate. An example is given in Figure 3. Note that alignment is calculated over non-content word categories.1 While content words are clearly set by the topic of conversation, the usage rates of particular non-content word categories have been shown to be a robust measure of linguistic style (Pennebaker and King, 2000). Consistent with previous work, we focus on alignment over the Linguistic Inquiry and Word Count (LIWC) categories (Pennebaker et al., 2015), listed in Table 1.

Linguistic alignment is a robust phenomenon found in a variety of settings. It has been used to predict employment outcomes (Srivastava et al., 2018), romantic matches (Ireland et al., 2011), and performance at cooperative tasks (Fusaroli et al., 2012; Kacewicz et al., 2014). People have been found to align to power (Willemyns et al., 1997; Gnisci, 2005; Danescu-Niculescu-Mizil et al., 2011), to people they like (Bilous and Krauss, 1988; Natale, 1975), to in-group members (Shin and Doyle, 2018), and to people more central in social networks (Noble and Fernandez, 2015). The variety of these contexts suggests alignment is ubiquitous and modulated by a complex range of factors.

Some previous work bears on argumentation. Binarized alignment features indicating the presence of words from LIWC categories were found to improve the detection of disagreement in online comments (Rosenthal and McKeown, 2015). We utilize more robust calculation methods that account for baseline usage rates and thereby avoid mistaking similarity for alignment (Doyle et al., 2016). Accommodation of body movements was found to decrease in face-to-face argumentative conflict where interlocutors had fundamentally differing opinions (Paxton and Dale, 2013; Duran and Fusaroli, 2017). In contrast, we are concerned with linguistic forms of alignment.

1Previous work has indicated the primacy of word-based over category-based alignment (Doyle and Frank, 2016). We leave investigation of alignment over words in argumentation to future work.

                       B's reply
                 has pronoun   no pronoun
A's message
  has pronoun         8             2
  no pronoun          5             5

Figure 3: Example of linguistic alignment using a binarized "by-message" calculation technique (Doyle and Frank, 2016). B's baseline usage rate of pronouns is 0.5, coming from the bottom row. The top row shows the probability of B using a pronoun increases to 0.8 after seeing one in A's message.
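To make the Figure 3 calculation concrete, here is a minimal sketch of the binarized by-message computation. The helper names and the naive whitespace tokenization are our own illustration, not Doyle and Frank's (2016) exact implementation.

```python
from typing import Iterable, Tuple

def contains_category(text: str, category_words: set) -> bool:
    # Naive tokenization; a real implementation would use the LIWC matcher.
    return any(w in category_words for w in text.lower().split())

def binarized_alignment(pairs: Iterable[Tuple[str, str]],
                        category_words: set) -> float:
    """Increase in P(reply uses category) when the preceding message
    used it, relative to the baseline rate when it did not."""
    primed = [contains_category(reply, category_words)
              for msg, reply in pairs
              if contains_category(msg, category_words)]
    baseline = [contains_category(reply, category_words)
                for msg, reply in pairs
                if not contains_category(msg, category_words)]
    p_primed = sum(primed) / len(primed)        # 8 / 10 = 0.8 in Figure 3
    p_baseline = sum(baseline) / len(baseline)  # 5 / 10 = 0.5 in Figure 3
    return p_primed - p_baseline
```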

Category      Examples                  Usage
Article       a, the                    0.076
Certainty     always, never             0.016
Conjunction   but, and, though          0.060
Discrepancy   should, would             0.018
Negation      not, never                0.018
Preposition   to, in, by, from          0.137
Pronoun       it, you                   0.108
Quantifier    few, many                 0.025
Tentative     maybe, perhaps            0.030
Insight       think, know, consider     0.027
Causation     because, effect, hence    0.021

Table 1: LIWC dictionary categories we use, examples, and baseline production rates observed in our dataset of 1.5 million comments on news articles.

We focus on the argumentative discourse acts of Inference Anchoring Theory (IAT) (Budzynska and Reed, 2011; Budzynska et al., 2016). IAT is well motivated theoretically, providing a principled way to relate dialogue to argument structure. As noted above, an utterance that has the surface form of a question may have different functions in an argument - asking for a reason, stating a belief, or both. The IAT framework is designed to make these crucial distinctions, and covers a comprehensive range of argumentative discourse acts.

Two previous datasets are similar to ours. The US 2016 Election Reddit corpus (Visser et al., 2019) comes from our target genre and is reliably annotated with IAT conventions. However, the content is restricted to a single topic. Furthermore, political group effects have already been demonstrated to influence alignment (Shin and Doyle, 2018). These considerations limit our ability to generalize using this dataset alone. The Internet Argument Corpus (Abbott et al., 2016), used in prior work on disagreement (Rosenthal and McKeown, 2015), is much larger than our current dataset; however, its annotations do not cover the principled and comprehensive set of discourse acts that we require to support dialogical argumentation mining in general.

Figure 4: Annotating discourse acts across a message-reply pair. The blue text spans are Asserting. The red span is Disagreeing, which always crosses between the comments - in this case attacking the inference in A. If A were the reply we would annotate the purple span as Arguing, as it offers a reason in support of the preceding assertion. In the reply, Arguing is provided by the green span, which is an instance of Assertive Questioning. Note that we only annotate what is in B. This pair is therefore annotated as: {Asserting, Disagreeing, Assertive Questioning, Arguing}.

3 Dataset

In this section we outline our data source and annotation process. So far we have 800 message-reply pairs, annotated by just a single annotator. In future work we will scale up considerably with multiple annotators, and include Mandarin data for cross-linguistic comparison.

3.1 Source

We scraped 1.5M below-the-line comments from an academic news website, The Conversation, covering all articles from its inception in 2011 to the end of 2017. In order to maximize the generalizability of our conclusions we selected comments covering a variety of topics. We also picked as evenly as possible from the continuum of controversiality, as measured by the proportion of deleted comments in each topic. More controversial topics are likely to see higher degrees of polarization, which should affect alignment across groups (Shin and Doyle, 2018). The most controversial topics we included are climate change and immigration. Among the least controversial are agriculture and tax.
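As an illustration of the controversiality measure just described, the following sketch computes the proportion of deleted comments per topic; the (topic, was_deleted) input schema is hypothetical.

```python
from collections import Counter

def controversiality(comments):
    """Proportion of deleted comments per topic, our proxy for controversy.
    `comments` is an iterable of (topic, was_deleted) pairs."""
    total, deleted = Counter(), Counter()
    for topic, was_deleted in comments:
        total[topic] += 1
        deleted[topic] += int(was_deleted)
    return {topic: deleted[topic] / total[topic] for topic in total}
```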

Nevertheless, this data source has its own peculiarities that limit how broadly we can generalize. As the site is well moderated, comments stay on topic and abusive comments are deleted, even if they also contain argumentative content. The messages are generally longer and less noisy than, for example, Twitter data. Moreover, many commenters are from research and academia. Therefore in general we see a high quality of writing, and of argumentation.

3.2 Annotation

The list of illocutions we chose to annotate is taken from Budzynska et al. (2016): Asserting, Ironic Asserting, (Pure/Assertive/Rhetorical) Questioning, (Pure/Assertive/Rhetorical) Challenging, Conceding, Restating, and Non-Argumentative (anything else). The transitions we consider follow IAT conventions. Arguing holds over two units, where a reason is offered as support for some proposition. Disagreeing occurs where an assertion conflicts with another. Agreeing is instantiated by phrases such as "I agree" and "Yeah."

Annotating Rhetorical Questioning/Challenging is the most difficult. As noted by Budzynska et al. (2016), there is no common specification for Rhetorical Questioning. We follow their definition, by which Pure and Assertive Questioning/Challenging ask for the speaker's opinion/evidence, and the Assertive and Rhetorical types communicate the speaker's own opinion. Therefore the Pure varieties do not convey the speaker's opinion, and the Rhetorical types do not expect a reply. Annotating Rhetorical Questioning/Challenging therefore requires a more complicated pragmatic judgment of the speaker's intention.

Our annotation scheme departs from previous work in that we only annotate at the comment and not the text segment level. Multiple annotations often apply to a single comment. An example is given in Figure 4. The text spans of the identified illocutions are highlighted and the transitions are indicated with arrows for clarity, but note that we did not annotate at that level.

Another difference from prior work relates to Concessions. Unlike Budzynska et al. (2016) we do not explicitly annotate the sub-type Popular Concession - where a speaker concedes in order to prepare the ground for disagreement. A potential confound with the annotation scheme described so far is ambiguous cases of Agreeing and Disagreeing in the same comment, which could be expected in a Popular Concession: "Yeah, I agree that X, but [counter-argument]." Because we are annotating at the level of the comment, we are able to distinguish these cases by considering combinations of discourse acts. A Popular Concession is distinguished by the presence of Conceding along with Disagreeing, optionally with Agreeing. A Pure Concession is then distinguished by the presence of Conceding and the absence of Disagreeing. We therefore do not need to rule that only one of Agreeing or Disagreeing can occur in a single comment.
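Because each comment is annotated with a set of acts, this distinction reduces to a simple rule over that set. The sketch below is our own paraphrase of the scheme just described.

```python
def concession_type(acts: set) -> str:
    """Distinguish concession sub-types from a comment's set of discourse acts."""
    if "Conceding" not in acts:
        return "none"
    # Conceding + Disagreeing (optionally with Agreeing) = Popular Concession.
    if "Disagreeing" in acts:
        return "popular"
    # Conceding without Disagreeing = Pure Concession.
    return "pure"

assert concession_type({"Conceding", "Agreeing", "Disagreeing"}) == "popular"
assert concession_type({"Conceding", "Asserting"}) == "pure"
```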

We found that Asserting (627/800), Arguing (463/800), and Disagreeing (402/800) are by far the most common individually, and as a combination (339/800), reflecting the argumentative nature of our dataset. The distribution of comments over discourse acts is Zipfian. The lowest frequency discourse act is Ironic Asserting, which has only 12 annotations in our 800 comments.

4 Methodology

4.1 Alignment over Discourse Acts

To estimate alignment scores across discourse acts we parameterize the message and reply generation process as a hierarchy of normal distributions, following the word-based hierarchical alignment model (WHAM) (Doyle and Frank, 2016). Each message is treated as a bag of words and word category usage is modeled as a binomial draw. WHAM is based on the hierarchical alignment model (HAM) (Doyle et al., 2016), which has been adapted in much subsequent work (Doyle and Frank, 2016; Yurovsky et al., 2016; Doyle et al., 2017). WHAM's principal benefit over HAM is controlling for message length, which was shown to be important for accurate alignment calculation (Doyle and Frank, 2016). Our adaptation is shown in Figure 5. For further details of WHAM we refer the reader to the original work.

A key problem we need to address is our inability to aggregate counts over all messages in a conversation between two speakers (as in Figure 3).

Figure 5: Our adaptation of WHAM (Doyle and Frank, 2016) for estimating alignment over argumentative discourse acts.

Such aggregation is a virtue of the original WHAM model that provides more reliable alignment statistics. We cannot aggregate counts over multiple message-reply pairs since our target is the discourse acts in individual replies. However, we are helped somewhat by the long average comment length in our chosen genre (μ = 82.5 words, σ = 66.5). The lowest baseline category usage rate is approximately 0.8% (μ = 3.6%, σ = 2.2%). Therefore an average-length comment gives us enough opportunity to see much of the effect of alignment on the binomial draw, but is likely to systematically underestimate alignment. In future work we will investigate this phenomenon with simulated data, and continue to search for a solution that makes better use of the statistics.

However, we can make more robust estimates of the baseline rate of word category usage by considering our entire dataset (approximately 1.5 million comments). We have annotations for 261 authors. The most prolific author has 11,327 comments. On average an author has 429 comments (σ = 1,409). For most authors we find multiple replies to comments that do not contain each word category, making these statistics relatively reliable.
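A minimal sketch of computing such per-author baselines from the full corpus follows. The input schema and the add-one smoothing (used to avoid logit(0) for rare categories) are our assumptions rather than the paper's exact procedure.

```python
import math
from collections import defaultdict

def logit(p: float) -> float:
    return math.log(p / (1 - p))

def author_baselines(comments, liwc):
    """Per-author, per-category baseline usage in logit space.
    `comments`: iterable of (author, tokens) pairs; `liwc`: category -> word set."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for author, tokens in comments:
        totals[author] += len(tokens)
        for cat, words in liwc.items():
            counts[author][cat] += sum(1 for t in tokens if t in words)
    return {a: {c: logit((counts[a][c] + 1) / (totals[a] + 2)) for c in liwc}
            for a in totals}
```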

Figure 6: Alignment estimates over IAT discourse acts and combinations of interest. The error bars represent 95% highest posterior density.

Figure 7: ROC AUC performance change from the bag of GloVe vectors baseline due to adding alignment features.

Bayesian posteriors for discourse act alignments are then estimated using Hamiltonian Monte Carlo, implemented with PyStan (Carpenter et al., 2017). We use 1,000 iterations of No-U-Turn Sampling, with 500 warmup iterations, and 3 chains. To address research question (1) we then compare the posterior densities of the last 500 samples from each chain, and look for significant differences in the means.
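A sketch of this sampling setup with the PyStan 2.x interface is below. The Stan program is a deliberately simplified single-category stand-in for the full model of Figure 5, and all variable names and the toy data are ours.

```python
import pystan  # PyStan 2.x interface

model_code = """
data {
  int<lower=1> N;                  // number of replies
  int<lower=0> n_tokens[N];        // reply lengths
  int<lower=0> k[N];               // category token counts per reply
  int<lower=0,upper=1> primed[N];  // message contained the category?
}
parameters {
  real eta;    // baseline log-odds of category use
  real alpha;  // alignment effect in logit space
}
model {
  eta ~ normal(-3, 2);
  alpha ~ normal(0, 1);
  for (i in 1:N)
    k[i] ~ binomial_logit(n_tokens[i], eta + alpha * primed[i]);
}
"""

toy_data = {"N": 4, "n_tokens": [80, 90, 85, 75],
            "k": [10, 3, 12, 4], "primed": [1, 0, 1, 0]}

sm = pystan.StanModel(model_code=model_code)
# NUTS is the default sampler; 500 post-warmup draws remain per chain.
fit = sm.sampling(data=toy_data, iter=1000, warmup=500, chains=3)
alpha_samples = fit.extract(permuted=True)["alpha"]
```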

4.2 Alignment Over Comments

In this preliminary work, we use a simpler method for local alignment at the individual comment-reply level that we found effective. We utilize the author baselines calculated for each LIWC category from the entire dataset. Then, for each message and reply, we calculate the local change in logit space from the baseline to the observed usage rate, based on the binary criterion of whether the original message contained a word from the category. Formally, let the LIWC categories used in the first message be $C_a$. For a LIWC category $c$, given the baseline logit-space probability $\eta(c)$ of the replier, and the observed usage rate $r$ of words from category $c$ in the reply, we calculate the alignment score as

$$
s(c) =
\begin{cases}
\mathrm{logit}(r) - \eta(c) & \text{if } c \in C_a \\
0 & \text{otherwise}
\end{cases}
$$

We clip these values to the range $[-5, 5]$ to avoid infinite values and floor effects, for example where the reply does not contain a word from $c$. This range is large enough to cover the size of alignment effects we observed. Following this calculation method we end up with an 11-dimensional vector of alignments over each LIWC category for each reply.
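A direct implementation of this score might look as follows. The argument names and data structures are ours; clipping handles the infinite logits that arise when the reply contains no words (or only words) from a category.

```python
import numpy as np

def local_alignment(message_tokens, reply_tokens, eta, liwc):
    """Per-reply alignment vector over the 11 LIWC categories (Section 4.2).
    `eta[cat]` is the replier's baseline logit-space probability for `cat`."""
    scores = []
    for cat, words in liwc.items():
        if not any(t in words for t in message_tokens):
            scores.append(0.0)  # category absent from the first message
            continue
        r = sum(1 for t in reply_tokens if t in words) / len(reply_tokens)
        if r <= 0.0:
            s = -np.inf  # floor effect: reply never uses the category
        elif r >= 1.0:
            s = np.inf
        else:
            s = np.log(r / (1 - r)) - eta[cat]
        scores.append(float(np.clip(s, -5.0, 5.0)))
    return np.array(scores)
```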

4.3 Detecting Argumentative Discourse Acts

To investigate our second preliminary research question we perform logistic regression for each annotated comment and each discourse act. Our baseline is a bag of GloVe vectors (Pennington et al., 2014). We use the 25-dimensional vectors trained on 27 billion tokens from a Twitter corpus. We concatenate the 11-dimensional alignment score vector to the bag of GloVe representation and look for an increase in performance. We randomly split the dataset into 600 training data points and 200 for testing. We implement logistic regression with Scikit-learn (Pedregosa et al., 2011) and use the LBFGS solver. We set the maximum number of iterations to 10,000 to allow enough exploration time. Because results vary with the random seed, we take the mean performance of 20 runs over different random seeds as the final result. As we are concerned with detection, and because the labels in each class are very imbalanced, our evaluation metric is ROC AUC.
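Under the stated setup, the experiment can be sketched as follows with Scikit-learn. The feature-matrix names are hypothetical, and re-drawing the 600/200 split on each seed is our assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def detect_act(X_glove, X_align, y, n_runs=20):
    """Mean ROC AUC for one discourse act over repeated random splits.
    `X_glove`: (800, 25) bag-of-GloVe matrix; `X_align`: (800, 11) alignment
    features; `y`: binary labels for the act."""
    X = np.hstack([X_glove, X_align])
    aucs = []
    for seed in range(n_runs):
        idx = np.random.RandomState(seed).permutation(len(y))
        train, test = idx[:600], idx[600:]
        clf = LogisticRegression(solver="lbfgs", max_iter=10_000)
        clf.fit(X[train], y[train])
        aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    return float(np.mean(aucs))
```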
