
Leveraging User Engagement Signals For Entity Labeling in a Virtual Assistant

Deepak Muralidharan∗, Justine Kao∗, Xiao Yang∗, Lin Li, Lavanya Viswanathan, Mubarak Seyed Ibrahim, Kevin Luikens, Stephen Pulman, Ashish Garg, Atish Kothari, Jason Williams

Apple Inc., One Apple Park Way, Cupertino, CA 95014

Abstract

Personal assistant AI systems such as Siri, Cortana, and Alexa have become widely used as a means to accomplish tasks through natural language commands. However, components in these systems generally rely on supervised machine learning algorithms that require large amounts of hand-annotated training data, which is expensive and time-consuming to collect. The ability to incorporate unsupervised, weakly supervised, or distantly supervised data holds significant promise in overcoming this bottleneck. In this paper, we describe a framework that leverages user engagement signals (user behaviors that demonstrate a positive or negative response to content) to automatically create granular entity labels for training data augmentation. Strategies such as multi-task learning and validation using an external knowledge base are employed to incorporate the engagement-annotated data and to boost the model's accuracy on a sequence labeling task. Our results show that learning from data automatically labeled by user engagement signals achieves significant accuracy gains in a production deep learning system, when measured both on the sequence labeling task and on user-facing results produced by the system end-to-end. We believe this is the first use of user engagement signals to help generate training data for a sequence labeling task on a large scale, and the approach can be applied in practical settings to speed up new feature deployment when little human-annotated data is available.

1 Introduction

Accomplishing a voice-controlled task using a virtual assistant agent such as Siri, Cortana, or Alexa usually involves several steps. First, a speech recognition module converts audio signals into text. Next, a natural language understanding (NLU) component extracts the user's intent from the transcribed text. This step usually involves determining what action the agent should perform for the user, along with the entities involved in that action (e.g., the action could be play, and the entity could be a song with title "Shake It Off"). NLU in the context of conversational AI is particularly challenging for several reasons. First, speech recognition errors, as well as heterogeneous and informal styles of language use, often introduce noise to the user input and make understanding difficult. Second, many requests issued to digital assistants are brief and ambiguous, requiring an external knowledge source in order to select the most likely interpretation. For example, the user query "Play play that song train" is difficult to comprehend because the sentence can be interpreted in several ways, especially if we consider the possibility of speech recognition errors and noisy user input. Is "train" the title of a song? Is "play that song" the song title and "train" the artist name?² In order to play the correct song, the NLU component needs to correctly identify the entities and entity types in the request despite potential ambiguity, which is remarkably challenging.

∗ Equal contributions. Alphabetically ordered.
² Ground truth: "Play that song" is the name of a song by the band "Train".

In our work, we treat this parsing task as a sequence labeling problem performed by a bidirectional LSTM (BiLSTM) model (Graves et al., 2005) (Section 4). This type of deep neural network based model requires a large amount of training data to perform at high accuracy. As with many traditional machine learning problems, the granularity of the label space impacts the ease of the learning task as well as the cost of acquiring annotated labels. Coarse-grained labels are easier to obtain, e.g., via human annotation, and facilitate efficient model training. On the other hand, fine-grained labels are often more useful for downstream components in the AI system to consume in order to produce desired outcomes for the user. In the context of music entity labeling, an example of a coarse-grained label is musicEntity, which is a collection of finer-grained music-related entities such as musicArtist, musicAlbum, and musicTitle. In a coarse-grained label space, given the request "Play the the kingdom of rain",³ the entire span "the the kingdom of rain" will be labeled as one single musicEntity. In contrast, in a fine-grained label space, "the the" should be labeled as musicArtist and "kingdom of rain" as musicTitle. It is apparent that fine-grained labels contain more detailed information about the true user intent and are more valuable for the downstream components to take the accurate action. However, correctly identifying the fine-grained entities in a user's request is time-consuming, costly, and often challenging even for human annotators (e.g., requiring annotators to recognize idiosyncratic names of music artists), which leads to insufficient hand-labeled training data for model training.⁴
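To make the two label spaces concrete, the following token-label alignment (a purely illustrative representation; the paper does not specify a data format) contrasts the coarse-grained and fine-grained annotations for this request:

```python
# Illustrative alignment for "Play the the kingdom of rain"
# (hypothetical format, shown only to contrast the two label spaces).
tokens = ["play", "the", "the", "kingdom", "of", "rain"]

# Coarse-grained: the whole entity span collapses into a single label type.
coarse = ["default", "musicEntity", "musicEntity", "musicEntity",
          "musicEntity", "musicEntity"]

# Fine-grained: artist ("The The") and title ("Kingdom of Rain") are separated.
fine = ["default", "musicArtist", "musicArtist", "musicTitle",
        "musicTitle", "musicTitle"]
```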

Our contribution in this paper is a framework that leverages naturally occurring user behaviors to automatically annotate user requests with fine-grained entity labels. We use empirically validated heuristics to select user behaviors that indicate positive or negative engagement with content. These behaviors include tapping on content to engage with it further (positive response), listening to a song for a long duration (positive response), or interrupting content provided by the assistant and manually selecting different content (negative response). These user behaviors, which we refer to as user engagement signals, provide strong indications of a user's true intent. We selectively harvest these signals in a privacy-preserving manner to automatically produce ground truth annotations. Our solution only needs human annotators to provide coarse-grained labels, which are much simpler and faster to obtain with higher fidelity compared to a finer-grained labeling process. These simpler coarse-grained labels are then further refined using user engagement signals, as explained in the following sections. Our framework is particularly valuable in scenarios where the conversational AI system extends to new domains or features, and corresponding training data need to be collected quickly and reliably for bootstrapping. Moreover, as will be illustrated shortly, user engagement signals can help us identify where the digital assistant needs improvement by learning from its own mistakes. Our approach significantly increases the volume and quality of our training data without adding much annotation cost or jeopardizing user privacy or user experience.

In order to incorporate both coarse-grained labels (from human annotators) and fine-grained labels (inferred by our framework), we designed and deployed a multi-task learning framework in our production environment, which treats coarse-grained and fine-grained entity labeling as two tasks. We also incorporated an external knowledge base consisting of entities and their relations to validate the model's predictions and ensure high precision. We show that our data generation framework coupled with these modeling and validation strategies leads to significant accuracy improvements for both the coarse-grained and fine-grained labeling tasks. More importantly, we demonstrate that our framework yields a significantly better user experience in a real-world production system.

2 Related Work

The use of unsupervised or weakly supervised data to improve performance in entity-labeling tasks has a long history. A well-established strategy is to start with some seed examples and then use contextual features and co-training to identify and refine new examples (Collins and Singer, 1999; Gupta and Manning, 2014), building up a corpus that can then be used to train a model. In Gupta and Manning (2015), the authors show that distributed representations can further improve the performance of such systems, and in Nagesh and Surdeanu (2018) this and two related approaches are compared and found to outperform methods that do not use distributed representations.

³ Ground truth: "Kingdom of Rain" is the name of a song by the post-punk band "The The".
⁴ An annotator may not know that "The The" is the name of a band and may provide incorrect fine-grained labels. As a result, it is often preferable for human annotators to annotate in a coarse-grained label space.

In recent work, Yang and Mitchell (2017) describe an LSTM-based architecture that uses external resources like WordNet and a knowledge base of triples (entity1, relation, entity2) to carry out entity labeling in two stages: first identifying chunks and then labeling them. By representing external concepts via embeddings and training an attention mechanism, the system is able to leverage these concepts: the attention mechanism serves partly to weight the appropriate sense of an ambiguous term, correctly distinguishing between (for example) Clinton as a person or as a location depending on the context. Our use of a knowledge base is simpler than this, essentially acting as an existence check to re-rank alternatives produced by the model.

Improving the performance of dialog systems by using information about user engagement and task completion is a standard technique for systems that use reinforcement learning to acquire or improve a dialog policy: for a review, see Young et al. (2013); for some recent developments, see Gasic et al. (2017). However, to our knowledge, our work is the first to use inferences about task completion to derive training data for sequence labeling rather than policy learning.

3 Generating Weakly Supervised Data

In this section, we describe user engagement signals as well as how we use them to generate fine-grained entity annotations. In the rest of the paper, we will use queries expressing a play music intent as the example use case to illustrate our method.⁵ Our proposed methods can be extended straightforwardly to other domains where user engagement signals are available.

3.1 User Engagement Signals

User engagement signals refer to user behaviors that indicate whether the user feels positive or negative about the agent's chosen action, without the agent asking for explicit feedback. In our scenario of the play music intent, a positive signal is defined as the user listening to the song initiated by the agent for more than a threshold amount of time. We determined the threshold to be 30 seconds by asking annotators to grade the success of a request and correlating the grades with how long a song was played (the vast majority of songs played for more than 30 seconds were graded successful). A negative signal is defined as the user aborting the song and switching to a different one, or the user playing a desired song by searching for it manually after the agent claims it could not find the song.⁶
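As a concrete illustration, these heuristics reduce to simple rules over playback events. Below is a minimal Python sketch under assumed field names (seconds_played, user_aborted, and manual_search_followed are our invention, not a real logging schema):

```python
from dataclasses import dataclass
from typing import Optional

# Threshold chosen by correlating annotator success grades with playback time.
POSITIVE_PLAYBACK_SECONDS = 30.0

@dataclass
class PlaybackEvent:
    """Hypothetical record of what happened after the assistant played a song."""
    seconds_played: float
    user_aborted: bool            # user stopped the song and switched content
    manual_search_followed: bool  # user searched for and played a song manually

def engagement_signal(event: PlaybackEvent) -> Optional[str]:
    """Classify an interaction as a positive or negative engagement signal.

    Returns "positive", "negative", or None when the behavior is ambiguous
    and should not be used to auto-label training data.
    """
    if event.seconds_played >= POSITIVE_PLAYBACK_SECONDS:
        return "positive"
    if event.user_aborted and event.manual_search_followed:
        return "negative"
    return None  # ambiguous: neither signal fired, so discard the example
```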

3.2 Engagement-Annotated Fine-Grained Data

We first deploy a model based on human-labeled data in a coarse-grained label space. This model infers a user's intent and passes it to the Action component. For example, given the request "Play play that song train", suppose the model predicts "play that song train" as musicEntity, which is a coarse-grained label. Using our model, we can obtain fine-grained labeled data in the following scenarios.

The first scenario is that the downstream component makes a correct decision and plays the song "Play that song" by the artist Train. If we receive a positive engagement signal from the user (i.e., this song was played for a certain amount of time), we can retrieve detailed metadata of the played song, including the title, album, and artist. In this case, the title is "Play that song" and the artist is Train. We then map this fine-grained information back to the utterance to regenerate fine-grained entity labels for each token. This results in a high-quality training example that is automatically labeled with fine-grained entity types, where "play that song" maps to the musicTitle type and "train" maps to the musicArtist type, in contrast to one single musicEntity coarse label.

The second scenario is that the downstream component makes a wrong decision and returns undesired results, e.g., a song that the user does not want, or it misinterprets the request to be for a movie instead. From our analysis, users will often immediately stop the incorrectly chosen content and manually search for the intended song and then play it, or interrupt the content with a query that paraphrases the original query.

⁵ Note that in our setup, the NLU component contains a module that classifies requests into domains such as music. Although user engagement signals can be used to improve the domain chooser, this work focuses on improving the entity labeling component that follows.
⁶ Although there are cases where a user changes her mind and aborts a correctly selected song, we find that the majority of cases where a user switches to a related song are genuinely unsuccessful cases.


[Figure 1 (diagram): two worked examples of generating engagement-annotated data from the request "Play play that song train", which the deployed model coarse-labels as default default default default musicEntity, leading the assistant to play "Hey, Soul Sister" by Train. In Example 1, the user stops the song immediately, navigates to the music app, and manually searches for the song "Play that song" by Train (negative signal followed by a corrective action). In Example 2, the user instead issues a paraphrased second request, "Play the song called play that song by train"; the assistant plays "Play that song" by Train and the user listens for more than 30 seconds (positive signal). In both cases, fuzzy-matching the played song's metadata (song title "Play that song", artist "Train") back to the original utterance yields the engagement-annotated fine-grained labels: play → default, play that song → musicTitle, train → musicArtist.]

Figure 1: Examples of generating engagement-annotated fine-grained data.

This is a strong indicator that the NLU and downstream components failed to fulfill the user's intent, and that the song manually played by the user (or played by the system following the paraphrase) is actually the desired one. Our model then utilizes the metadata of the ultimately played song to gather the correct fine-grained entity labels.

It is worth noting that the metadata of the song is standardized and contains properly spelled entity names, whereas the original utterance may be noisy and informal. In order to map the finer-level music information back to the original utterance, we employ an edit-distance based fuzzy matching algorithm. The matched tokens are labeled as the identified entities if the fuzzy matching confidence score is above a threshold,⁷ and the remaining tokens are labeled as "default" (i.e., the token does not reference an entity). The fuzzy matching algorithm can tolerate spelling errors, missing or redundant tokens, and ordering problems, which frequently occur in conversational AI systems (e.g., matching "your beautiful" to "you're beautiful", and "this is you came for" to "this is what you came for"). Figure 1 shows two examples of using the fuzzy matching algorithm to annotate an utterance that was originally predicted incorrectly. The error is then corrected by mapping the song title and artist name to the original utterance.
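The paper does not publish its matching algorithm; the sketch below is a minimal stand-in that uses Python's difflib ratio as the edit-distance-based confidence score, together with the 0.8 threshold mentioned in footnote 7. The alignment strategy (scanning token windows of roughly the entity's length) is an assumption for illustration.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.8  # strings scoring above this were rated highly similar

def fuzzy_score(a: str, b: str) -> float:
    """Similarity in [0, 1]; tolerant of spelling errors and small edits."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def label_utterance(tokens, metadata):
    """Map entity strings from song metadata back onto utterance tokens.

    metadata is e.g. {"musicTitle": "Play that song", "musicArtist": "Train"};
    tokens matching no entity above the threshold are labeled "default".
    """
    labels = ["default"] * len(tokens)
    for entity_type, entity_text in metadata.items():
        n = len(entity_text.split())
        best_span, best_score = None, FUZZY_THRESHOLD
        # Try windows one token shorter/longer than the entity to tolerate
        # missing or redundant tokens in the noisy utterance.
        for width in {max(1, n - 1), n, n + 1}:
            for start in range(len(tokens) - width + 1):
                cand = " ".join(tokens[start:start + width])
                score = fuzzy_score(cand, entity_text)
                if score > best_score:
                    best_span, best_score = (start, start + width), score
        if best_span is not None:
            for i in range(*best_span):
                labels[i] = entity_type
    return labels

# label_utterance("play play that song train".split(),
#                 {"musicTitle": "Play that song", "musicArtist": "Train"})
# -> ["default", "musicTitle", "musicTitle", "musicTitle", "musicArtist"]
```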

In summary, we describe two scenarios that provide us with valuable fine-grained entity labels: (1) queries with strong positive user engagement signals, and (2) queries with strong negative user engagement signals followed by the user's corrective action. Both cases are leveraged by our model and framework to retrieve weakly supervised, finer-granularity ground-truth entity labels for the original user utterance. We refer to this fine-grained dataset enriched by user engagement signals as the engagement-annotated data. Since the engagement-annotated data and human-annotated data were labeled from different label spaces (fine-grained vs. coarse-grained, respectively), it is not straightforward to incorporate these two training data sources together.⁸ In the following section, we introduce a multi-task learning approach that leverages both datasets jointly to improve entity recognition for both the coarse-grained and fine-grained labeling tasks.

4 Multi-Task Learning

We design a multi-task learning framework to better utilize engagement-annotated data (with finer-granularity entity labels) and human-annotated data (with coarse-granularity entity labels). Note that the same training example is not initially required to have both coarse-grained and fine-grained labels. As shown in Figure 2, the multi-task learning model utilizes a deep neural network architecture based on bidirectional LSTMs (BiLSTM) (Graves et al., 2005). For every query, we first generate a vector containing a list of customized features representing domain and context information. These features are pre-trained in an embedding layer for dimension reduction, such that each token in the utterance is represented by a word vector. The word embeddings are generated using word2vec (Mikolov et al., 2013) and are trained on data sampled from our production usage.

⁷ We collected human judgments of similarity and found that strings with fuzzy matching confidence scores over 0.8 tend to be rated as highly similar by humans. As a result, we used 0.8 as the fuzzy matching threshold.
⁸ We believe this is a realistic challenge in many scenarios: since fine-grained entity labeling is a more difficult and time-consuming task, it is easier to obtain high-quality human-annotated data with coarse-grained entity labels, whereas weak supervision may provide fine-grained (but potentially noisy) labels.


[Figure 2 (diagram): for the example input "Play One by Metallica", word embeddings feed the bidirectional LSTM; left-context outputs L1..L4 and right-context outputs R1..R4 are concatenated into contextual representations C1..C4, which feed two output layers. The fine-grained head predicts default / musicTitle / default / musicArtist, and the coarse-grained head predicts default / musicEntity / default / musicEntity.]

Figure 2: Main architecture of the multi-task learning network. Word and context feature embeddings are given to a bidirectional LSTM, where L_i represents the word i and its left context and R_i represents the word i and its right context. Concatenating these two vectors yields a representation C_i of the word i in its context, which is fed into two independent output layers: one for the coarse-grained entity typing task and the other for the fine-grained entity typing task.

Both the reduced feature vector and the token word embeddings are passed to the BiLSTM as inputs for training. The outputs from the forward and backward passes of the first BiLSTM layer are concatenated to form the input for the second BiLSTM layer. This is followed by a linear projection layer and two softmax layers: one for predicting coarse-grained entity type labels and another for predicting fine-grained entity type labels. The loss function is defined as

L(w) = -\mathbb{1}\{d \in D_{CG}\} \sum_i y_i \log p_i - \mathbb{1}\{d \in D_{FG}\} \sum_j z_j \log q_j + \lambda \lVert w \rVert_2^2 \qquad (1)

where d denotes the sampled mini-batch; D_{CG} denotes the human-annotated coarse-grained data; D_{FG} denotes the engagement-annotated fine-grained data; w refers to the network weights; \lambda denotes the L2 regularization parameter; y, p refer to the ground truth and predicted class for the coarse-grained entity typing task; and z, q refer to the ground truth and predicted class for the fine-grained entity typing task.
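To make the architecture and Equation (1) concrete, here is a minimal PyTorch sketch; this is our own illustration, not the authors' production code, and all dimensions, label counts, and the regularization weight are invented. The indicator functions in Equation (1) are realized by running only the head that matches the mini-batch's source:

```python
import torch
import torch.nn as nn

class MultiTaskBiLSTM(nn.Module):
    """Shared BiLSTM trunk with coarse- and fine-grained tagging heads (sketch)."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256,
                 n_coarse=8, n_fine=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Two stacked bidirectional LSTM layers; forward/backward outputs of
        # layer 1 are concatenated internally as the input to layer 2.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        self.coarse_head = nn.Linear(2 * hidden_dim, n_coarse)  # D_CG task
        self.fine_head = nn.Linear(2 * hidden_dim, n_fine)      # D_FG task

    def forward(self, token_ids, task):
        h, _ = self.bilstm(self.embed(token_ids))
        h = torch.relu(self.proj(h))
        head = self.coarse_head if task == "coarse" else self.fine_head
        return head(h)  # per-token logits; softmax is folded into the loss

def batch_loss(model, token_ids, labels, task, l2=1e-5):
    """Equation (1) for one mini-batch: only the active task's term is nonzero."""
    logits = model(token_ids, task)
    nll = nn.functional.cross_entropy(logits.flatten(0, 1), labels.flatten())
    # L2 term over all weights, as in Eq. (1); Adam's weight_decay would also work.
    reg = l2 * sum((w ** 2).sum() for w in model.parameters())
    return nll + reg
```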

For every iteration during training, we select a mini-batch (d) from one of the data sources based on a pre-defined sampling weight assigned to each source. If the mini-batch belongs to the human-annotated data (D_{CG}), which follows a coarse-grained entity label space, we perform a forward and backward pass through the input projection layers, the LSTM network, and the coarse-grained entity typing softmax output layer. If the mini-batch belongs to the engagement-annotated data (D_{FG}), we perform a forward and backward pass through the input projection layers, the LSTM network, and the fine-grained entity typing softmax output layer. Note that the lower-level LSTM network is shared between both tasks, and its weights are updated during every iteration. However, the weights of the coarse-grained and fine-grained entity typing output layers are updated only when the mini-batch is sampled from the respective data source. This multi-task framework effectively increases the training data size for the LSTM layers and facilitates better feature representations to improve entity typing accuracy. A minimal sketch of this alternating update scheme follows.
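The training loop below continues the sketch above; the sampling weight, step count, and random dummy data are illustrative. It alternates mini-batches between the two sources, so the shared trunk is updated every step while each head only receives cross-entropy gradients when its own source is sampled:

```python
import random
import torch
# Builds on MultiTaskBiLSTM / batch_loss from the previous sketch; the data
# iterators below yield random tensors purely for illustration.

def dummy_batches(n_labels, batch=16, seq_len=8):
    while True:
        yield (torch.randint(0, 10000, (batch, seq_len)),
               torch.randint(0, n_labels, (batch, seq_len)))

coarse_batches, fine_batches = dummy_batches(8), dummy_batches(32)
model = MultiTaskBiLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
P_COARSE = 0.7  # pre-defined sampling weight for the human-annotated source

for step in range(1000):
    # Sample the mini-batch's source, then run only that task's head.
    if random.random() < P_COARSE:
        task, (tokens, labels) = "coarse", next(coarse_batches)  # d in D_CG
    else:
        task, (tokens, labels) = "fine", next(fine_batches)      # d in D_FG
    optimizer.zero_grad()
    loss = batch_loss(model, tokens, labels, task)
    loss.backward()  # shared trunk updated every step; the inactive head gets
    optimizer.step() # no cross-entropy gradient (only the L2 term touches it)
```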

5 Knowledge Base Validation

We can further improve our fine-grained entity labeling by utilizing an external knowledge base. For example, given the query "Play something by the Beatles", we label "something" as musicTitle partially because it exists as a song by The Beatles in a music knowledge base. If the user had said "Play something by Taylor Swift", since the artist Taylor Swift has no song called "something", the system should interpret the utterance to mean playing any song by the artist Taylor Swift instead. Therefore, an authoritative knowledge base containing relational information about music entities provides an efficient and robust way to validate our model, as sketched below.
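As an illustration, the existence check can be as simple as a set lookup over (artist, title) pairs. The sketch below is a hedged stand-in: the mini knowledge base, parse format, and re-ranking rule are all invented for the example rather than taken from the production system.

```python
# Hypothetical mini knowledge base of (artist, title) pairs.
MUSIC_KB = {
    ("the beatles", "something"),
    ("taylor swift", "shake it off"),
}

def kb_validates(parse):
    """Existence check: is the parse's (artist, title) pair attested in the KB?

    A parse is a mapping like {"musicArtist": "the beatles",
    "musicTitle": "something"}; a parse without a title (i.e. "play any
    song by this artist") only needs the artist to exist."""
    artist = parse.get("musicArtist", "").lower()
    title = parse.get("musicTitle", "").lower()
    if title:
        return (artist, title) in MUSIC_KB
    return any(a == artist for a, _ in MUSIC_KB)

def rerank(parses):
    """Prefer KB-validated parses, keeping the model's order within each group."""
    return sorted(parses, key=lambda p: not kb_validates(p))

# "Play something by Taylor Swift": the musicTitle reading fails the KB check,
# so the artist-only reading is ranked first.
candidates = [
    {"musicArtist": "taylor swift", "musicTitle": "something"},
    {"musicArtist": "taylor swift"},
]
assert rerank(candidates)[0] == {"musicArtist": "taylor swift"}
```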

During inference, after the model predicts the fine-grained entity label distribution for the sequence, we perform a beam search over the prediction lattice and select the top five alternatives based on the
