Leveraging User Engagement Signals For Entity
Labeling in a Virtual Assistant
Deepak Muralidharan*, Justine Kao*, Xiao Yang*, Lin Li,
Lavanya Viswanathan, Mubarak Seyed Ibrahim, Kevin Luikens,
Stephen Pulman, Ashish Garg, Atish Kothari, Jason Williams
Apple Inc., One Apple Park Way, Cupertino, CA 95014
Abstract
Personal assistant AI systems such as Siri, Cortana, and Alexa have become
widely used as a means to accomplish tasks through natural language commands.
However, components in these systems generally rely on supervised machine
learning algorithms that require large amounts of hand-annotated training data,
which is expensive and time-consuming to collect. The ability to incorporate
unsupervised, weakly supervised, or distantly supervised data holds significant
promise in overcoming this bottleneck. In this paper, we describe a framework
that leverages user engagement signals (user behaviors that demonstrate a positive
or negative response to content) to automatically create granular entity labels for
training data augmentation. Strategies such as multi-task learning and validation
using an external knowledge base are employed to incorporate the engagement-annotated data and to boost the model's accuracy on a sequence labeling task. Our
results show that learning from data automatically labeled by user engagement
signals achieves significant accuracy gains in a production deep learning system,
when measured both on the sequence labeling task and on user-facing results
produced by the system end-to-end. We believe this is the first use of user engagement signals to help generate training data for a sequence labeling task on a large
scale, and that it can be applied in practical settings to speed up new feature deployment
when little human-annotated data is available.
1 Introduction
Accomplishing a voice-controlled task using a virtual assistant agent such as Siri, Cortana, or Alexa
usually involves several steps. First, a speech recognition module converts audio signals into text.
Next, a natural language understanding (NLU) component extracts the user's intent from the transcribed text. This step usually involves determining what action the agent should perform for the
user, along with entities involved in that action (e.g., the action could be "play", and the entity could be a song with the title "Shake It Off"). NLU in the context of conversational AI is particularly challenging for several reasons. First, speech recognition errors, as well as heterogeneous and informal
styles of language use, often introduce noise to the user input and make understanding difficult.
Second, many requests issued to digital assistants are brief and ambiguous, requiring an external
knowledge source in order to select the most likely interpretation. For example, the user query "Play play that song train" is difficult to comprehend because the sentence can be interpreted in several ways, especially if we consider the possibility of speech recognition errors and noisy user input. Is "train" the title of a song? Is "play that song" the song title and "train" the artist name?[2] In order to
play the correct song, the NLU component needs to correctly identify the entities and entity types in
the request despite potential ambiguity, which is remarkably challenging.
* Equal contributions. Alphabetically ordered.
[2] Ground truth: "Play That Song" is the name of a song by the band Train.
In our work, we treat this parsing task as a sequence labeling problem performed by a bidirectional
LSTM (BiLSTM) model (Graves et al., 2005) (Section 4). This type of deep neural network based
model requires a large amount of training data to perform at high accuracy. As with many traditional
machine learning problems, the granularity of the label space impacts the ease of the learning task
as well as the cost of acquiring annotated labels. Coarse-grained labels are easier to obtain, e.g.,
via human annotation, and facilitate efficient model training. On the other hand, fine-grained labels
are often more useful for downstream components in the AI system to consume in order to produce
desired outcomes for the user. In the context of music entity labeling, an example of a coarse-grained label is musicEntity, which is a collection of finer-granular music-related entities such as musicArtist, musicAlbum, and musicTitle. In a coarse-grained label space, given the request "Play the the kingdom of rain",[3] the entire span "the the kingdom of rain" will be labeled as one single musicEntity. In contrast, in a fine-grained label space, "the the" should be labeled as musicArtist and "kingdom of rain" as musicTitle. It is apparent that fine-grained labels contain more detailed
information about the true user intent and are more valuable for the downstream components to take
the accurate action. However, correctly identifying the fine-grained entities in a user's request is
time-consuming, costly, and often challenging even for human annotators (e.g., requiring annotators
to recognize idiosyncratic names of music artists), which leads to insufficient hand-labeled training
data for model training.[4]
Our contribution in this paper is to describe a framework that leverages naturally occurring user
behaviors to automatically annotate user requests with fine-grained entity labels. We use empirically validated heuristics to select user behaviors that indicate positive or negative engagement with
content. These behaviors include tapping on content to engage with it further (positive response),
listening to a song for a long duration (positive response), or interrupting content provided by the
assistant and manually selecting different content (negative response). These user behaviors, which
we refer to as user engagement signals, provide strong indications of a user's true intent. We selectively harvested these signals in a privacy-preserving manner to automatically produce ground truth
annotations. Our solution only needs human annotators to provide coarse-grained labels, which are
much simpler and faster to obtain with higher fidelity compared to a finer-grained labeling process.
These simpler coarse-grained labels are then further refined using user engagement signals, as explained in the following sections. Our framework is particularly valuable in scenarios where
the conversational AI system extends to new domains or features, and corresponding training data
need to be collected quickly and reliably for bootstrapping. Moreover, as will be illustrated shortly,
user engagement signals can help us to identify where the digital assistant needs improvement by
learning from its own mistakes. Our approach significantly increases the volume and quality of our
training data without adding much annotation cost or jeopardizing user privacy or user experience.
In order to incorporate both coarse-grained labels (by human annotators) and fine-grained labels
(inferred by our framework), we designed and deployed a multi-task learning framework in our
production environment, which treats coarse-grained and fine-grained entity labeling as two tasks.
We also incorporated an external knowledge base consisting of entities and their relations to validate
the model's predictions and ensure high precision. We show that our data generation framework
coupled with these modeling and validation strategies leads to significant accuracy improvements
for both the coarse-grained and fine-grained labeling tasks. More importantly, we demonstrate that
our framework yields significantly better user experience in a real-world production system.
2 Related Work
The use of unsupervised or weakly supervised data to improve performance in entity-labeling tasks
has a long history. A well-established strategy is to start with some seed examples and then
use contextual features and co-training to identify and refine new examples (Collins and Singer,
1999; Gupta and Manning, 2014), building up a corpus that can then be used to train a model. In
Gupta and Manning (2015), the authors show that distributed representations can further improve
performance of such systems, and in Nagesh and Surdeanu (2018) this and two related approaches
are compared and found to outperform methods that do not use distributed representations.
[3] Ground truth: "Kingdom of Rain" is the name of a song by the post-punk band The The.
[4] An annotator may not know that The The is the name of a band and may provide incorrect fine-grained labels. As a result, it is often preferable for human annotators to annotate in a coarse-grained label space.
In recent work, Yang and Mitchell (2017) describe an LSTM-based architecture that uses external
resources like WordNet and a knowledge base of triples (entity1, relation, entity2) to carry out entity
labeling in two stages: first identifying chunks and second labeling them. By representing external
concepts via embeddings and training an attention mechanism, the system is able to leverage these
concepts: the attention mechanism serves partly to weight the appropriate sense of an ambiguous
term, correctly distinguishing between (for example) Clinton as a person or as a location depending
on the context. Our use of a knowledge base is simpler than this, essentially acting as an existence
check to re-rank alternatives produced by the model.
Improving performance of dialog systems by using information about user engagement and task
completion is a standard technique for systems that use reinforcement learning to acquire or improve
a dialog policy: for a review, see Young et al. (2013), and for some recent developments, Gasic et al. (2017). However, to our knowledge, our work is the first to use inferences about task completion to
derive training data for sequence labeling rather than policy learning.
3 Generating Weakly Supervised Data
In this section, we describe user engagement signals as well as how we use them to generate fine-grained entity annotations. In the rest of the paper, we will use queries expressing a "play music" intent as the example use case to illustrate our method.[5] Our proposed methods can be extended
straightforwardly to other domains where user engagement signals are available.
3.1 User Engagement Signals
User engagement signals refer to user behaviors that indicate whether the user feels positive or
negative about the agent's chosen action, without the agent asking for explicit feedback. In our
scenario of the "play music" intent, a positive signal is defined as the user listening to the song initiated
by the agent for more than a threshold amount of time. We determined the threshold to be 30 seconds
by asking annotators to grade the success of a request and correlating the grades with how long a
song was played (the vast majority of songs played > 30 seconds were graded successful). A
negative signal is defined as the user aborting the song and switching to a different one, or the user
playing a desired song by searching for it manually after the agent claims it could not find the song.[6]
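To make the 30-second heuristic concrete, here is a minimal sketch of how a playback event might be mapped to an engagement label. The event schema and field names are hypothetical illustrations, not the production pipeline's actual representation.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold above which annotator grades showed a playback was almost
# always successful (Section 3.1).
POSITIVE_PLAY_THRESHOLD_SECS = 30.0

@dataclass
class PlaybackEvent:
    """Hypothetical record of what happened after the assistant played a song."""
    play_duration_secs: float       # how long the chosen song played
    user_switched_song: bool        # user aborted and switched to another song
    searched_after_not_found: bool  # user searched manually after a "not found" reply

def engagement_signal(event: PlaybackEvent) -> Optional[str]:
    """Map a playback event to a weak engagement label, or None if ambiguous."""
    if event.play_duration_secs >= POSITIVE_PLAY_THRESHOLD_SECS:
        return "positive"   # long listen: the chosen song was likely correct
    if event.user_switched_song or event.searched_after_not_found:
        return "negative"   # the user corrected the assistant's choice
    return None             # short listen, no corrective action: discard
```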
3.2 Engagement-Annotated Fine-Grained Data
We first deploy a model based on human-labeled data in a coarse-grained label space. This model
infers a user's intent and passes it to the Action component. For example, given the request "Play play that song train", suppose the model predicts "play that song train" as musicEntity, which is
a coarse-grained label. Using our model, we can obtain fine-grained labeled data in the following
scenarios. The first scenario is that the downstream component makes a correct decision and plays the song "Play That Song" by the artist Train. If we receive a positive engagement signal from the user (i.e., this song was played for a certain amount of time), we can retrieve detailed metadata of the played song, including the title, album, and artist. In this case, the title is "Play That Song" and the artist is Train. We then map this fine-grained information back to the utterance to regenerate fine-grained entity labels for each token. This results in a high-quality training example that is automatically labeled with fine-grained entity types, where "play that song" maps to the musicTitle type and "train" maps to the musicArtist type, in contrast to one single musicEntity coarse label.
The second scenario is that the downstream component makes a wrong decision and returns undesired results, e.g., playing a song that the user does not want or misinterpreting the request as one for a movie. From our analysis, users will often immediately stop the incorrectly chosen content and manually search for the intended song and then play it, or interrupt the content with a query that paraphrases the original query.
[5] Note that in our setup, the NLU component contains a module that classifies requests into domains such as music. Although user engagement signals can be used to improve the domain chooser, this work focuses on improving the entity labeling component that follows.
[6] Although there are cases where a user changes her mind and aborts a correctly selected song, we find that the majority of cases where a user switches to a related song are genuinely unsuccessful cases.
[Figure 1 shows two worked examples. In Example 1, the user's first request "Play play that song train" is labeled (default, default, default, default, musicEntity) and the assistant responds "Playing Hey, Soul Sister by Train"; the user stops the song immediately (task not completed), navigates to the music app, and searches for the song "Play That Song" by Train. In Example 2, after the same failed first exchange, the user's second request is "Play the song called play that song by train", the assistant responds "Playing Play That Song by Train", and the user listens to the song for more than 30 seconds (task completed). In both examples, fuzzy-matching the played song's metadata (song title: "Play That Song"; artist: Train) against the original utterance yields the engagement-annotated fine-grained labels: Play/default, play/musicTitle, that/musicTitle, song/musicTitle, train/musicArtist.]

Figure 1: Examples of generating engagement-annotated fine-grained data.
This is a strong indicator that the NLU and downstream components failed to fulfill the user's intent, and the song manually played by the user (or played by the system following the paraphrase) is actually the desired one. Our model then utilizes metadata of the ultimately played song to gather the correct fine-grained entity labels.
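As a rough illustration of this second scenario, the sketch below scans a session's event stream for a stopped assistant response followed shortly by a manual play, and emits (original utterance, played-song metadata) pairs for annotation. The event schema and the 120-second correction window are assumptions made for the example; the paper does not specify these details.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

# Assumed correction window: a manual play this soon after a stopped
# response is treated as the user's correction (illustrative value only).
CORRECTION_WINDOW_SECS = 120.0

@dataclass
class SessionEvent:
    """Hypothetical session log record."""
    timestamp: float
    kind: str                             # "assistant_play" | "user_stop" | "manual_play"
    utterance: Optional[str] = None       # set on assistant_play (original request)
    song_metadata: Optional[dict] = None  # set on manual_play (title, artist, ...)

def mine_corrective_pairs(
    events: List[SessionEvent],
) -> Iterator[Tuple[str, dict]]:
    """Yield (original utterance, manually played song metadata) pairs
    whenever a stopped assistant response is followed, within the window,
    by a manual play in the music app."""
    for i, ev in enumerate(events):
        if ev.kind != "assistant_play":
            continue
        later = [e for e in events[i + 1:]
                 if e.timestamp - ev.timestamp <= CORRECTION_WINDOW_SECS]
        if any(e.kind == "user_stop" for e in later):
            for e in later:
                if e.kind == "manual_play":
                    yield ev.utterance, e.song_metadata
                    break
```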
It is worth noting that the metadata of the song is standardized and contains properly spelled entity names, whereas the original utterance may be noisy and informal. To map the finer-level music information back to the original utterance, we employ an edit-distance-based fuzzy matching algorithm. The matched tokens are labeled as the identified entities if the fuzzy matching confidence score is above a threshold,[7] and the remaining tokens are labeled as "default" (i.e., meaning the token does not reference an entity). The fuzzy matching algorithm can tolerate spelling errors, missing or redundant tokens, and ordering problems, which frequently occur in conversational AI systems (e.g., matching "your beautiful" to "you're beautiful", and "this is you came for" to "this is what you came for"). Figure 1 shows two examples of using the fuzzy matching algorithm to annotate an utterance that was originally predicted incorrectly. The error is then corrected by mapping the song title and artist name to the original utterance.
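A minimal sketch of this label projection step is shown below: it scores every token span of the utterance against a metadata field with a normalized Levenshtein similarity and labels the best-scoring span if it clears the 0.8 threshold from footnote 7. The production algorithm's exact scoring and its handling of overlapping fields are not described in the paper, so the details here are simplifications.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via standard dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def project_labels(tokens, field_value, label, threshold=0.8):
    """Label the token span that best matches a metadata field (e.g. the
    played song's title); every other token stays "default"."""
    labels = ["default"] * len(tokens)
    best_score, best_span = 0.0, None
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            score = similarity(" ".join(tokens[i:j]).lower(), field_value.lower())
            if score > best_score:
                best_score, best_span = score, (i, j)
    if best_span and best_score >= threshold:
        i, j = best_span
        labels[i:j] = [label] * (j - i)
    return labels

# Projecting the played song's metadata onto the original utterance:
tokens = "play play that song train".split()
title_labels = project_labels(tokens, "Play That Song", "musicTitle")
artist_labels = project_labels(tokens, "Train", "musicArtist")
merged = [t if t != "default" else a for t, a in zip(title_labels, artist_labels)]
# merged == ["default", "musicTitle", "musicTitle", "musicTitle", "musicArtist"]
```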
In summary, we describe two scenarios that provide us with valuable fine-grained entity labels:
(1) queries with strong positive user engagement signals, and (2) queries with strong negative user
engagement signals followed by the user's corrective action. Both cases will be leveraged by our
model and framework to retrieve weakly-supervised and finer-granular ground-truth entity labels for
the original user utterance. We refer to this fine-grained dataset enriched by user engagement signals
as the engagement-annotated data. Since the engagement-annotated data and human-annotated data
were labeled from different label spaces (fine-grained vs. coarse-grained, respectively), it is not straightforward to incorporate these two training data sources together.[8] In the following section,
we introduce a multi-task learning approach that leverages both datasets jointly to improve entity
recognition for both the coarse-grained and fine-grained labeling tasks.
4 Multi-task learning
We design a multi-task learning framework to better utilize engagement-annotated data (with finer-granular entity labels) and human-annotated data (with coarse-granular entity labels). Note that the same training example is not initially required to have both coarse-grained and fine-grained labels.
As shown in Figure 2, the multi-task learning model utilizes a deep neural network architecture based on bidirectional LSTMs (BiLSTM) (Graves et al., 2005). For every query, we first generate a vector containing a list of customized features representing domain and context information. These features are pre-trained in an embedding layer for dimension reduction, such that each token in the utterance is represented by a word vector. The word embeddings are generated using word2vec (Mikolov et al., 2013) and are trained on data sampled from our production usage.
[7] We collected human judgments of similarity and found that strings with fuzzy matching confidence scores over 0.8 tend to be rated as highly similar by humans. As a result, we used 0.8 as the fuzzy matching threshold.
[8] We believe this is a realistic challenge in many scenarios: since fine-grained entity labeling is a more difficult and time-consuming task, it is easier to obtain high-quality human-annotated data with coarse-grained entity labels, whereas weak supervision may provide fine-grained (but potentially noisy) labels.
[Figure 2 depicts the example query "Play One by Metallica": word embeddings feed a bidirectional LSTM, and the concatenated hidden states feed two output layers producing the coarse-grained labels (default, MusicEntity, default, MusicEntity) and the fine-grained labels (default, SongTitle, default, MusicArtist).]

Figure 2: Main architecture of the multi-task learning network. Word and context feature embeddings are given to a bidirectional LSTM, where L_i represents word i and its left context and R_i represents word i and its right context. Concatenating these two vectors yields a representation C_i of word i in its context, which is fed into two independent output layers: one for the coarse-grained entity typing task and the other for the fine-grained entity typing task.
Both the reduced feature vector and the token word embeddings are passed to the BiLSTM as inputs for training. The
outputs from the forward and backward pass of the first BiLSTM layer are concatenated to form the
input for the second BiLSTM layer. This is followed by a linear projection layer and two softmax
layers: one for predicting coarse-grained entity type labels and another for predicting fine-grained
entity type labels. The loss function is defined as
\[
L(w) \;=\; -\,\mathbb{1}\{d \in D_{CG}\}\sum_i y_i \log p_i \;-\; \mathbb{1}\{d \in D_{FG}\}\sum_j z_j \log q_j \;+\; \lambda\,\lVert w \rVert_2^2 \tag{1}
\]

where $d$ denotes the sampled mini-batch; $D_{CG}$ denotes the human-annotated coarse-grained data; $D_{FG}$ denotes the engagement-annotated fine-grained data; $w$ refers to the network weights; $\lambda$ denotes the L2 regularization parameter; $y, p$ refer to the ground truth and predicted class for the coarse-grained entity typing task; and $z, q$ refer to the ground truth and predicted class for the fine-grained entity typing task.
For every iteration during training, we select a mini-batch ($d$) from one of the data sources based on a pre-defined sampling weight assigned to each source. If the mini-batch belongs to the human-annotated data ($D_{CG}$), which follows a coarse-grained entity label space, we perform a forward and backward pass through the input projection layers, the LSTM network, and the coarse-grained entity typing softmax output layer. If the mini-batch belongs to the engagement-annotated data ($D_{FG}$), we perform a forward and backward pass through the input projection layers, the LSTM network, and the fine-grained entity typing softmax output layer. Note that the lower-level LSTM network is shared between both tasks, and its weights are updated during every iteration. However, the weights of the coarse-grained entity typing and fine-grained entity typing output layers are updated only when the mini-batch is sampled from the respective data source. This multi-task framework effectively increases the training data size for the shared LSTM layers and facilitates better feature representations, improving entity typing accuracy.
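A minimal PyTorch sketch of this alternating-update scheme, under simplifying assumptions, looks as follows. The customized domain and context features, pre-trained embeddings, and production sampling weights are abstracted away; all dimensions and the 0.5 sampling weight are placeholder values, and optimizer weight decay stands in for the L2 term in Eq. (1).

```python
import random
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    """Shared BiLSTM encoder with separate output layers for the
    coarse-grained and fine-grained entity typing tasks."""
    def __init__(self, vocab_size, n_coarse, n_fine, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.coarse_head = nn.Linear(2 * hidden, n_coarse)
        self.fine_head = nn.Linear(2 * hidden, n_fine)

    def forward(self, token_ids, task):
        states, _ = self.bilstm(self.embed(token_ids))  # (batch, time, 2*hidden)
        head = self.coarse_head if task == "coarse" else self.fine_head
        return head(states)                             # per-token logits

model = MultiTaskTagger(vocab_size=50_000, n_coarse=10, n_fine=40)
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-5)  # L2 term
criterion = nn.CrossEntropyLoss()

def train_step(tokens, labels, task):
    """One iteration: gradients flow through the shared encoder and only
    the sampled task's output layer, mirroring the update rule above."""
    optimizer.zero_grad()
    logits = model(tokens, task)
    loss = criterion(logits.flatten(0, 1), labels.flatten())
    loss.backward()
    optimizer.step()
    return loss.item()

# Each iteration first samples a data source by a pre-defined weight
# (0.5 is an arbitrary placeholder), then updates on that task.
for _ in range(3):  # demonstration iterations with synthetic mini-batches
    task = "coarse" if random.random() < 0.5 else "fine"
    labels = torch.randint(0, 10 if task == "coarse" else 40, (2, 5))
    train_step(torch.randint(0, 50_000, (2, 5)), labels, task)
```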
5 Knowledge Base Validation
We can further improve our fine-grained entity labeling by utilizing an external knowledge base.
For example, given the query "Play something by the Beatles", we label "something" as musicTitle partially because it exists as a song by The Beatles in a music knowledge base. If the user had said "Play something by Taylor Swift", since the artist Taylor Swift has no song called "Something", the system should interpret the utterance to mean play any song by the artist Taylor Swift instead.
Therefore, an authoritative knowledge base containing relational information about music entities
provides an efficient and robust way to validate our model.
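Concretely, the existence check can be as simple as the sketch below: each candidate labeling's (title, artist) pair is looked up in the knowledge base, and validated candidates are ranked ahead of unvalidated ones while otherwise preserving model-score order. The set-based KB here is a stand-in; the production knowledge base's interface is not described in the paper.

```python
# Hypothetical KB interface: a set of known (song_title, artist) relations.
MUSIC_KB = {
    ("something", "the beatles"),
    ("shake it off", "taylor swift"),
}

def extract_pair(tokens, labels):
    """Pull the (title, artist) strings out of a fine-grained labeling."""
    title = " ".join(t for t, l in zip(tokens, labels) if l == "musicTitle")
    artist = " ".join(t for t, l in zip(tokens, labels) if l == "musicArtist")
    return title.lower(), artist.lower()

def rerank(tokens, alternatives):
    """Prefer alternatives whose entity pair exists in the KB; Python's
    stable sort keeps the original (model-score) order as a tie-breaker."""
    def validated(labels):
        title, artist = extract_pair(tokens, labels)
        return (title, artist) in MUSIC_KB if title and artist else False
    return sorted(alternatives, key=lambda labels: not validated(labels))

# "Play something by the Beatles": the alternative labeling "something"
# as musicTitle is validated by the KB, so it is ranked first. For
# "Play something by Taylor Swift" the title reading would fail the check.
tokens = ["play", "something", "by", "the", "beatles"]
alts = [
    ["default", "default", "default", "musicArtist", "musicArtist"],
    ["default", "musicTitle", "default", "musicArtist", "musicArtist"],
]
print(rerank(tokens, alts)[0])
```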
During inference, after the model predicts the fine-grained entity label distribution for the sequence,
we perform a beam search over the prediction lattice and select the top five alternatives based on the