Language Identification in Bilingual Documents for Linguistic Data Extraction
Terrence Szymanski tdszyman@umich.edu
Penn Linguistics Colloquium 37
March 23, 2013
1 Motivation, Goals
2 Extraction Pipeline
3 Language ID
4 Performance
5 Future Directions
Motivation
John Goldsmith (2007) A New Empiricism
"[T]he goal of the linguist is to provide the most compact overall description of all of the linguistic data that exists at present"
-- John Goldsmith
Steven Abney (2011) Data-Intensive Experimental Linguistics
"[A]ny experimental foray into universal linguistics will be a data-intensive undertaking. It will require substantial samples of many languages--ultimately all human languages--in a consistent form that supports automated processing across languages."
-- Steven Abney
Sources of Machine-Readable Linguistic Data
N.B. Many digital resources aren't machine readable.
Currently available:
NLP corpora
PDFs of linguistics papers, via ODIN (Lewis & Xia, 2010) ? odin.
Currently unavailable:
Undocumented languages
Field notes and unpublished material
Non-digitized material
Unstructured digital material
(e.g. digitized books in online libraries)
Availability of Language Data
Figure: Language Coverage of Current Digital Resources.
Panel "Number of Languages Included" (axis 0 to 350): Treebanks, Tatoeba, Google Translate ('06 to '12), Wikipedia ('01 to '12).
Panel "Fraction of World Languages Included" (axis 0% to 100%): Treebanks (0.5%), Tatoeba (1.3%), Google (0.9%), Wikipedia (4.1%).
Language Texts in Digital Libraries
Types of books with relevant language data:
Grammars
(e.g. A Grammar of the Santhal Language)
Lexicons
(e.g. Trukese-English Dictionary)
Readers and texts (bilingual or monolingual)
(e.g. Kickapoo Tales)
Challenges:
OCR (optical character recognition) is weak.
Some texts are subject to copyright restrictions.
Quality of data is uncertain.
Desired Input and Output
Electronic Document (input):
... holako hechlen, onkodo okaena? they who came yesterday, what has become of them? Hopon em ranade tae, oni joharam lagit'e hechakana whose son you gave medicine to, he has come to thank you Enbetarem ranade, oni do phariaoena, to whom you gave medicine at that time, he has recovered. ...
  -> Processing ->
Parallel Corpus (Bitext) (output):
F-52 E-52
F-53 E-53
F-54 E-54
Figure: The high-level objective of bitext data collection.
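To make the target output concrete, here is a minimal Python sketch of how the bitext in the figure might be represented. The `BitextPair` type and `format_pair` helper are hypothetical names introduced for illustration. The sketch also assumes that the labels F-52 to E-54 correspond, in reading order, to the three foreign/English segment pairs in the example text; the slide does not state this alignment explicitly.

```python
from typing import NamedTuple


class BitextPair(NamedTuple):
    """One aligned unit of the parallel corpus (hypothetical structure)."""
    pair_id: int   # numeric part of the F-/E- labels (assumed alignment)
    foreign: str   # source-language segment
    english: str   # its English translation


# Segments copied from the example text in the figure. Pairing them with
# the labels 52-54 in reading order is an assumption, not stated on the slide.
corpus = [
    BitextPair(52, "holako hechlen, onkodo okaena?",
               "they who came yesterday, what has become of them?"),
    BitextPair(53, "Hopon em ranade tae, oni joharam lagit'e hechakana",
               "whose son you gave medicine to, he has come to thank you"),
    BitextPair(54, "Enbetarem ranade, oni do phariaoena,",
               "to whom you gave medicine at that time, he has recovered."),
]


def format_pair(pair: BitextPair) -> str:
    """Render one pair as two tab-separated lines labelled F-n and E-n."""
    return (f"F-{pair.pair_id}\t{pair.foreign}\n"
            f"E-{pair.pair_id}\t{pair.english}")


print("\n".join(format_pair(pair) for pair in corpus))
```

Keeping the foreign and English sides in one record, keyed by a shared id, is one simple way to keep the alignment explicit for downstream tools.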
Extraction Pipeline
Four major stages of processing:
Document Collection
  -> Mixed-language Document
Language ID
  -> Monolingual sub-spans
Translation ID
  -> Corpus of Bitexts
Downstream Processing
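The four stages can be sketched end to end as a toy Python program. This is an illustration of the data flow only, not the method presented in the talk. Several things here are assumptions: the document is taken to be a list of text lines; language ID is a crude English function-word ratio (the word list `ENGLISH_HINTS` and the 0.3 threshold are made up for this example); and translation ID simply pairs each non-English span with the English span that immediately follows it. All function names are hypothetical.

```python
import re
from itertools import groupby

# Toy English function-word list; a stand-in for a real language-ID model.
ENGLISH_HINTS = {"the", "they", "who", "what", "has", "he", "you", "to",
                 "of", "whose", "whom", "at", "that"}


def collect_document():
    """Stage 1, Document Collection: return a mixed-language document."""
    return [
        "holako hechlen, onkodo okaena?",
        "they who came yesterday, what has become of them?",
        "Hopon em ranade tae, oni joharam lagit'e hechakana",
        "whose son you gave medicine to, he has come to thank you",
        "Enbetarem ranade, oni do phariaoena,",
        "to whom you gave medicine at that time, he has recovered.",
    ]


def label_line(line, threshold=0.3):
    """Label a line 'en' if enough of its tokens are English hint words."""
    tokens = re.findall(r"[a-z']+", line.lower())
    if not tokens:
        return "other"
    hits = sum(token in ENGLISH_HINTS for token in tokens)
    return "en" if hits / len(tokens) >= threshold else "other"


def language_id(lines):
    """Stage 2, Language ID: merge same-label lines into monolingual spans."""
    return [(label, " ".join(group))
            for label, group in groupby(lines, key=label_line)]


def translation_id(spans):
    """Stage 3, Translation ID: pair each 'other' span with the next 'en' span."""
    pairs = []
    for (label_a, text_a), (label_b, text_b) in zip(spans, spans[1:]):
        if label_a == "other" and label_b == "en":
            pairs.append((text_a, text_b))
    return pairs


def downstream(bitexts):
    """Stage 4, Downstream Processing: here, just print the aligned pairs."""
    for foreign, english in bitexts:
        print(f"{foreign}\t{english}")


document = collect_document()
spans = language_id(document)
bitexts = translation_id(spans)
downstream(bitexts)
```

On the example text from the slides, each stage's output is the next stage's input, which mirrors the alternation of processing steps and intermediate products in the list above. A real system would need far more robust language ID than a word-list ratio, particularly on noisy OCR text.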