Processing Raw Text POS Tagging

Marina Sedinkina (slides by Desislava Zhekova)

CIS, LMU marina.sedinkina@campus.lmu.de

January 14, 2020

Language Processing and Python

Outline

1 Dealing with other formats
    HTML
    Binary formats

2 NLP pipeline
    POS Tagging

3 Automatic Tagging
    Default Tagger - Baseline
    Regular Expression Tagger
    Lookup Tagger
    Evaluation
    Optimization

4 References

Dealing with other formats

Content on the Internet, as well as locally stored content, often comes in formats other than plain text (.txt):

RTF - Rich Text Format (.rtf)
HTML - HyperText Markup Language (.html, .htm)
XHTML - Extensible HyperText Markup Language (.xhtml, .xht, .xml, .html, .htm)
XML - Extensible Markup Language (.xml)
RSS - Rich Site Summary (.rss, .xml)
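As a rough illustration of the list above, a file's extension can serve as a first guess at its format. The following is only a minimal sketch: the table CANDIDATE_FORMATS and the helper guess_formats are hypothetical names introduced here, and the mapping covers nothing beyond the extensions listed above. Because several formats share extensions such as .xml and .html, the helper returns all candidates rather than a single answer.

```python
from pathlib import Path

# Hypothetical lookup table built only from the extensions listed above.
CANDIDATE_FORMATS = {
    ".txt": ["plain text"],
    ".rtf": ["RTF"],
    ".html": ["HTML", "XHTML"],
    ".htm": ["HTML", "XHTML"],
    ".xhtml": ["XHTML"],
    ".xht": ["XHTML"],
    ".xml": ["XHTML", "XML", "RSS"],
    ".rss": ["RSS"],
}


def guess_formats(filename):
    """Return the candidate formats for a file name, judged by extension only."""
    suffix = Path(filename).suffix.lower()  # ".HTML" and ".html" are treated alike
    return CANDIDATE_FORMATS.get(suffix, [])  # unknown extension: no candidates


print(guess_formats("news.HTML"))
print(guess_formats("feed.xml"))
print(guess_formats("notes.md"))
```

An extension is only a hint. To decide between the candidates for an ambiguous extension such as .xml, the file's content would have to be inspected.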

Additionally, text is often stored in binary formats, such as:

MS Office formats (.doc, .dot, .docx, .docm, .dotx, .dotm, .xls, .xlt, .xlm, .ppt, .pps, .pptx ... and many others)
PDF - Portable Document Format (.pdf)
OpenOffice formats (.odt, .ott, .oth, .odm ... and others)
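Binary formats cannot simply be read as text, but many of them can be recognised by their first few bytes. The sketch below is an illustration added here, not part of the original slides. SIGNATURES and sniff_format are hypothetical names, the byte signatures are the commonly documented ones for PDF, ZIP-based containers and the older OLE2-based Office files, and the sample bytes are made up for the demonstration.

```python
# Leading "magic" bytes of some binary formats (commonly documented signatures).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP container (e.g. .docx, .pptx, .odt)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 container (e.g. .doc, .xls, .ppt)"),
]


def sniff_format(data):
    """Return a label for the first signature that matches, or None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


# Made-up sample: a PDF-style header followed by non-text bytes.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"
print(sniff_format(sample))

# Decoding such bytes as UTF-8 text fails, which is why these formats
# need a dedicated reader rather than a plain text decode.
try:
    sample.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```

Recognising the container is only the first step. Extracting the text itself requires a format-specific library.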

HTML

http://bbc.com/news/world-middle-east-42412729

    import urllib.request

    url = "http://bbc.com/news/world-middle-east-42412729"
    # Fetch the page and decode the raw bytes into a string
    urlData = urllib.request.urlopen(url)
    html = urlData.read().decode("utf-8")
    print(html)
    # prints
    # '<!DOCTYPE html>\n<html lang="en" id="responsive-news">\n
    # \n \n
    # \n
    # <title>Yemen rebel ballistic missile \'intercepted over Riyadh\' - BBC News</title>\n
    # ...
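The string returned above still contains all the markup. As a minimal sketch of how the visible text could be separated from the tags, the following uses only the standard library's html.parser module. The class TextExtractor, the helper html_to_text and the sample page are hypothetical and introduced here for illustration; this is not necessarily the approach the slides take next.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page, so the example runs without network access.
sample = (
    '<!DOCTYPE html>\n<html lang="en"><head>'
    "<title>Example page</title><style>p {color: red}</style></head>"
    "<body><p>Hello, <b>world</b>!</p></body></html>"
)
print(html_to_text(sample))
```

Joining the chunks with spaces is a simplification: it can leave a space before punctuation that follows an inline tag. That is usually acceptable when the text is tokenized afterwards.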
