Statistical Source Expansion for Question Answering

Nico Schlaefer

CMU-LTI-11-019

Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213

Thesis Committee:
Eric Nyberg (Chair)
Jamie Callan
Jaime Carbonell
Jennifer Chu-Carroll (IBM T.J. Watson Research Center)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies

Copyright © 2009–2011 Nico Schlaefer

This research was supported by IBM Ph.D. Fellowships in the 2009–10 and 2010–11 academic years, and by IBM Open Collaboration Agreement #W0652159.


Abstract

A source expansion algorithm automatically extends a given text corpus with related information from large, unstructured sources. While the expanded corpus is not intended for human consumption, it can be leveraged in question answering (QA) and other information retrieval or extraction tasks to find more relevant knowledge and to gather additional evidence for evaluating hypotheses. In this thesis, we propose a novel algorithm that expands a collection of seed documents by (1) retrieving related content from the Web or other large external sources, (2) extracting self-contained text nuggets from the related content, (3) estimating the relevance of the text nuggets with regard to the topics of the seed documents using a statistical model, and (4) compiling new pseudo-documents from nuggets that are relevant and complement existing information.
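The four stages above compose into a simple pipeline. The sketch below is purely illustrative: the function names (`retrieve`, `extract_nuggets`, `score`) and the redundancy check are assumptions standing in for the thesis components, not the actual implementation.

```python
# Illustrative sketch of the four-stage source expansion pipeline.
# All names are hypothetical; the pluggable callables stand in for the
# retrieval, extraction, and statistical scoring components.

from dataclasses import dataclass, field

@dataclass
class PseudoDocument:
    """Expanded pseudo-document compiled for one seed topic."""
    topic: str
    nuggets: list = field(default_factory=list)

def expand_seed(topic, retrieve, extract_nuggets, score, threshold=0.5):
    """Expand one seed document into a pseudo-document of relevant nuggets."""
    doc = PseudoDocument(topic=topic)
    seen = set()
    for page in retrieve(topic):                # (1) retrieve related content
        for nugget in extract_nuggets(page):    # (2) extract text nuggets
            relevance = score(topic, nugget)    # (3) estimate relevance
            # (4) keep nuggets that are relevant and not already covered
            if relevance >= threshold and nugget not in seen:
                seen.add(nugget)
                doc.nuggets.append(nugget)
    return doc
```

The naive set-membership check in step (4) is a placeholder for the merging strategy that discards redundant nuggets; any near-duplicate detection could be substituted.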

In an intrinsic evaluation on a dataset comprising 1,500 hand-labeled web pages, the most effective statistical relevance model ranked text nuggets by relevance with 81% MAP, compared to 43% when relying on rankings generated by a web search engine, and 75% when using a multi-document summarization algorithm. These differences are statistically significant and result in noticeable gains in search performance in a task-based evaluation on QA datasets. The statistical models use a comprehensive set of features to predict the topicality and quality of text nuggets based on topic models built from seed content, search engine rankings and surface characteristics of the retrieved text. Linear models that evaluate text nuggets individually are compared to a sequential model that estimates their relevance given the surrounding nuggets. The sequential model leverages features derived from text segmentation algorithms to dynamically predict transitions between relevant and irrelevant passages. It slightly outperforms the best linear model while using fewer parameters and requiring less training time. In addition, we demonstrate that active learning reduces the amount of labeled data required to fit a relevance model by two orders of magnitude with little loss in ranking performance. This facilitates the adaptation of the source expansion algorithm to new knowledge domains and applications.
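The MAP figures above score ranked nugget lists against binary relevance labels. As a point of reference (my own sketch of the standard metric, not the thesis evaluation code), mean average precision can be computed as:

```python
def average_precision(ranked_labels):
    """Average precision for one ranked list of binary relevance labels
    (1 = relevant, 0 = irrelevant), in ranked order."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """MAP: mean of average precision over all ranked lists (one per topic)."""
    return sum(average_precision(l) for l in ranked_lists) / len(ranked_lists)
```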

Applied to the QA task, the proposed method yields consistent and statistically significant performance gains across different datasets, seed corpora and retrieval strategies. We evaluated the impact of source expansion on search performance and end-to-end accuracy using Watson and the OpenEphyra QA system, and datasets comprising over 6,500 questions from the Jeopardy! quiz show and TREC evaluations. By expanding various seed corpora with web search results, we were able to improve the QA accuracy of Watson from 66% to 71% on regular Jeopardy! questions, from 45% to 51% on Final Jeopardy! questions and from 59% to 64% on TREC factoid questions. We also show that the source expansion approach can be adapted to extract relevant content from locally stored sources without requiring a search engine, and that this method yields similar performance gains. When combined with the approach that uses web search results, Watson's accuracy further increases to 72% on regular Jeopardy! data, 54% on Final Jeopardy! and 67% on TREC questions.


Acknowledgements

First of all, I would like to thank my advisor Eric Nyberg for his support and guidance throughout my studies at Carnegie Mellon. From the beginning, Eric placed great confidence in me, allowing me to explore new research directions and develop my own research objectives. I also deeply appreciate his generosity and his readiness to share his experience and to give helpful advice whenever needed. Without his support, this work would not have been possible. I am also very grateful to Jennifer Chu-Carroll, who has been my mentor at IBM Research for over three years. I have been fortunate to work with one of the most experienced researchers in the field of question answering, and was able to learn a lot from her about her area of expertise and about conducting rigorous scientific research. Furthermore, I would like to thank Jamie Callan for his thoughtful comments and honest reflections on my work. Jamie made many helpful suggestions that resulted in additional experiments and analyses and ultimately improved the quality of this work considerably. I am also thankful to Jaime Carbonell for sharing his vast experience and knowledge of machine learning and language technologies with me. Despite his extremely busy schedule, Jaime took the time to study my work carefully and provide valuable feedback and guidance.

Much of my thesis research would not have been possible without the help of the DeepQA group at IBM. In addition to Jennifer's continuous support, I had the pleasure of working closely with James Fan and Wlodek Zadrozny during my summer internships at IBM Research. James and Wlodek contributed many interesting ideas and new insights that led to significant improvements of our method. In addition, I feel indebted to many other IBMers who helped build Watson, which served as a testbed for most of the experiments in this thesis. I am particularly grateful to Eric Brown, Pablo Duboue, Edward Epstein, David Ferrucci, David Gondek, Adam Lally, Michael McCord, J. William Murdock, John Prager, Marshall Schor, and Dafna Sheinwald. Furthermore, I greatly appreciate the help I received from Karen Ingraffea and Matthew Mulholland with various data annotation tasks. Karen and Matt have been instrumental in creating a dataset for statistical relevance modeling, and they helped evaluate Watson's answers to thousands of questions to obtain unbiased estimates of the performance impact of our approach.

I am also extremely grateful to my parents, Stefan and Maria Schläfer, for their constant support and trust. They always encouraged me to make my own choices and follow my own interests, while supporting me in every way they could and helping me obtain the best education possible. Finally, I would like to thank my wife, Sachiko Miyahara, for her encouragement and understanding during these busy years. Sachiko was always ready to provide advice and support, and never complained when I was busy working towards yet another deadline. I am very fortunate to have met her in Pittsburgh.


Contents

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Related Work 7

2.1 Relations to Established Research Areas . . . . . . . . . . . . . . . . 7

2.2 Construction of Local Sources . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Document Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.1 Expansion of Independent Documents . . . . . . . . . . . . . . 11

2.3.2 Link and Citation Analysis . . . . . . . . . . . . . . . . . . . . 14

2.3.3 Comparison to Source Expansion . . . . . . . . . . . . . . . . 15

2.4 Maximal Marginal Relevance . . . . . . . . . . . . . . . . . . . . . . . 16

2.5 Sequential Models for Text Segmentation . . . . . . . . . . . . . . . . 19

3 Fundamentals 21

3.1 Pipeline for Question Answering . . . . . . . . . . . . . . . . . . . . . 21

3.2 Question Answering Tasks . . . . . . . . . . . . . . . . . . . . . . . . 23

3.2.1 Text REtrieval Conference (TREC) . . . . . . . . . . . . . . . 23

3.2.2 Jeopardy! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.4 Question Answering Systems . . . . . . . . . . . . . . . . . . . . . . . 29

3.4.1 Ephyra and OpenEphyra . . . . . . . . . . . . . . . . . . . . . 29

3.4.2 Watson and the DeepQA Architecture . . . . . . . . . . . . . 30

4 Source Expansion Approach 35

4.1 Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.2 Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.3 Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.3.1 Annotation Methodology . . . . . . . . . . . . . . . . . . . . . 38

4.3.2 Relevance Features . . . . . . . . . . . . . . . . . . . . . . . . 41

4.3.3 Relevance Models . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.4 Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


5 Intrinsic Evaluation 51

5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.3 Results and Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.4 Robustness of Relevance Estimation . . . . . . . . . . . . . . . . . . . 63

5.5 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6 Application to Question Answering 73

6.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

6.1.1 Jeopardy! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

6.1.2 TREC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

6.2 Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6.3 Search Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

6.3.1 Experimental Setup using Watson . . . . . . . . . . . . . . . . 79

6.3.2 Watson Results and Analysis . . . . . . . . . . . . . . . . . . 82

6.3.3 Experimental Setup using OpenEphyra . . . . . . . . . . . . . 87

6.3.4 OpenEphyra Results and Analysis . . . . . . . . . . . . . . . . 90

6.3.5 Robustness of Search . . . . . . . . . . . . . . . . . . . . . . . 94

6.4 End-to-End Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 97

6.4.2 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . 99

6.5 Redundancy vs. Coverage . . . . . . . . . . . . . . . . . . . . . . . . 103

7 Unstructured Sources 107

7.1 Extraction-Based Source Expansion . . . . . . . . . . . . . . . . . . . 107

7.1.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

7.1.2 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . 113

7.2 Expansion of Unstructured Sources . . . . . . . . . . . . . . . . . . . 118

8 Extensions for Relevance Estimation 121

8.1 Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

8.1.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 122

8.1.2 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . 125

8.2 Sequential Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

8.2.1 Transition Features . . . . . . . . . . . . . . . . . . . . . . . . 132

8.2.2 Graphical Model . . . . . . . . . . . . . . . . . . . . . . . . . 137

8.2.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 139

8.2.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . 140

9 Conclusions 145

9.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

9.2 Importance of Source Expansion . . . . . . . . . . . . . . . . . . . . . 148

9.3 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

Bibliography 153
