Markus Dickinson & Marwa Ragheb June 9, 2013

Annotation for Learner English Guidelines, v. 0.1

Contents

Front matter
  0.1 Notes for Researchers
  0.2 Acknowledgements

1 Getting Started
  1.1 Quick Intro
  1.2 General Principles
    1.2.1 Assumptions
    1.2.2 Benefit of the doubt
    1.2.3 Semantics
    1.2.4 Evidence
    1.2.5 Underspecification
    1.2.6 Mismatches
    1.2.7 Mismatches to define shortest distance
  1.3 Label Inventories
    1.3.1 Dependency Relations
    1.3.2 POS Tags

2 Initial Annotation Layers
  2.1 Segmentation & Tokenization
    2.1.1 Sentence segmentation
    2.1.2 Tokenization
  2.2 Lemma
    2.2.1 Irregulars
    2.2.2 Misspellings
    2.2.3 Spelling vs. Morphology
    2.2.4 Spacing issues
    2.2.5 Lowercase
    2.2.6 Anonymization
    2.2.7 Acronyms, unknowns, & foreign terms
  2.3 POS
    2.3.1 POS mismatches
    2.3.2 Defining distributional POS
  2.4 Lexical violations
    2.4.1 Lexical violations vs. Lemma changes
    2.4.2 Lexical violations vs. POS mismatches

3 Dependencies
  3.1 Morphosyntactic Dependencies
    3.1.1 Inventory
    3.1.2 The syntactic in morphosyntactic dependencies
  3.2 Where we differ from CHILDES
    3.2.1 Possessives
    3.2.2 NJCT
    3.2.3 Particles (PRT)
    3.2.4 Dependents of adjectives
    3.2.5 Heads for verbal chains
    3.2.6 Secondary Dependencies
    3.2.7 Enumeration
    3.2.8 Coordination
    3.2.9 Ellipsis
  3.3 Overview of annotation
    3.3.1 Verbs & Verbal relations
    3.3.2 Nouns and Noun relations
    3.3.3 Coordination
    3.3.4 Arguments vs. Adjuncts
    3.3.5 Punctuation
  3.4 Subcategorization frames
    3.4.1 Grammaticality
    3.4.2 Determining subcategorization requirements
    3.4.3 Specific cases

4 A Variety of Dependency Constructions
  4.1 Attachment decisions
  4.2 INCROOT
  4.3 Extraposition
  4.4 wh-words
    4.4.1 Displacement
    4.4.2 wh-questions
    4.4.3 Embedded clauses
    4.4.4 Relative clauses
  4.5 Prepositions vs. Complementizers
  4.6 Comparative constructions
    4.6.1 as X as
    4.6.2 Xer than
    4.6.3 Discontinuous comparatives
  4.7 Purpose Clauses (cf. in order to)
  4.8 Appositives
  4.9 Parentheticals
  4.10 Ellipsis
  4.11 Linguistic mentions (quotations)
  4.12 Multi-Word Expressions

5 Learner Innovations
  5.1 Missing elements
    5.1.1 Missing head
    5.1.2 Missing argument
  5.2 Extra elements
    5.2.1 Extra head
    5.2.2 Extra dependent
    5.2.3 Extra word with unclear function
  5.3 Word order

6 Extended Examples of Difficult Cases
  6.1 Example 1: one, complement clause
  6.2 Example 2: lemma
  6.3 Example 3: syntax vs. meaning
  6.4 Example 4: lemma, word order, run-on
  6.5 Example 5: missing verb, double conjunction, unclear phrase
  6.6 Example 6: syntax vs. discourse, multi-ambiguity
  6.7 Example 7: syntax vs. meaning
  6.8 Example 8: complement clause
  6.9 Example 9: Lemma, POS ambiguity
  6.10 Example 10: non-finite subordinate clause, problematic misspellings

A Practical matters
  A.1 Brat annotation tool
    A.1.1 Getting started
    A.1.2 Basics of annotation
    A.1.3 Annotating a word
    A.1.4 Annotating dependencies
  A.2 CoNLL file format

Front matter

0.1 Notes for Researchers

The annotation scheme here is based on thinking that has evolved over several years, as captured in various papers, listed here. It is also part of an ongoing dissertation project at Indiana University. You can find pdf versions of the papers, as well as other information about the project, at .

• Dickinson and Ragheb (2009): Dependency Annotation for Learner Corpora. Proceedings of the Eighth Workshop on Treebanks and Linguistic Theories (TLT8). Milan, Italy.

• Dickinson and Ragheb (2011): Dependency Annotation of Coordination for Learner Language. Proceedings of the International Conference on Dependency Linguistics. Barcelona, Spain.

• Ragheb and Dickinson (2011): Avoiding the Comparative Fallacy in the Annotation of Learner Corpora. Selected Proceedings of the 2010 Second Language Research Forum: Reconsidering SLA Research, Dimensions, and Directions. Cascadilla Proceedings Project: Somerville, MA. pp. 114–124.

• Ragheb and Dickinson (2012): Defining Syntax for Learner Language Annotation. Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), Poster Session. Mumbai, India.

• Ragheb and Dickinson (2013): Inter-annotator Agreement for Dependency Annotation of Learner Language. Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. Atlanta, GA.

The goal is to annotate syntactic and, to some extent, morpho-syntactic information, without necessarily encoding errors. The papers give more justification for this, but you can also read section 1.2 for more on what the annotation does and does not encode.

Information for citing these guidelines is:

• Dickinson and Ragheb (2013): Annotation for Learner English Guidelines, v. 0.1. Technical report, Indiana University, Bloomington, IN. June 9, 2013.

And the BibTeX entry is:

@TechReport{salle:13,
  author = {Markus Dickinson and Marwa Ragheb},
  title = {Annotation for Learner {E}nglish Guidelines, v. 0.1},
  institution = {Indiana University},
  year = {2013},
  address = {Bloomington, IN},
  month = {June},
  note = {June 9, 2013},
}

It is important to note that, while we have made hundreds of decisions, these are not the only decisions one could have made. We hope that these guidelines are useful, not just to understand what the annotation means, but as a starting point for other annotation and analysis efforts. That is, similar to what we stated in Ragheb and Dickinson (2012), one of the most important contributions of these guidelines may be "to outline the questions which need to be addressed for grammatical annotation of learner language."

As long as these guidelines are still in progress, we welcome feedback and discussion. Contact us at: mragheb@indiana.edu or md7@indiana.edu.

Our data will eventually be released, but please bear with our slowness: (a) we are a small annotation effort, and (b) we have had to take significant time determining what to annotate and what the annotation denotes, questions which form the core of a PhD thesis in progress.

0.2 Acknowledgements

As can be gathered from the acknowledgements in the papers above, this work has been influenced by many researchers in different areas.

We would also like to acknowledge our student annotators, who had the challenging task of annotating while decisions were still being made: Eric Benzschawel, Frank Linville, Shannon Manley, Lauren Swanson, Zachary Wampler, and Samantha Zimny.

The process was a little unusual, in that, for the most part, students were receiving course credit for annotating, and part of the credit was based on discussing syntax,

................
................