Processing Raw Text POS Tagging

Marina Sedinkina (slides by Desislava Zhekova)

CIS, LMU marina.sedinkina@campus.lmu.de

January 14, 2020

Language Processing and Python

Outline

1 Dealing with other formats
    HTML
    Binary formats

2 NLP pipeline
    POS Tagging

3 Automatic Tagging
    Default Tagger - Baseline
    Regular Expression Tagger
    Lookup Tagger
    Evaluation
    Optimization

4 References

Dealing with other formats

Content on the Internet, as well as locally stored content, often comes in formats other than plain text (.txt):

    RTF - Rich Text Format (.rtf)
    HTML - HyperText Markup Language (.html, .htm)
    XHTML - Extensible HyperText Markup Language (.xhtml, .xht, .xml, .html, .htm)
    XML - Extensible Markup Language (.xml)
    RSS - Rich Site Summary (.rss, .xml)

Additionally, text is often stored in binary formats, such as:

    MS Office formats (.doc, .dot, .docx, .docm, .dotx, .dotm, .xls, .xlt, .xlm, .ppt, .pps, .pptx ... and many others)
    PDF - Portable Document Format (.pdf)
    OpenOffice formats (.odt, .ott, .oth, .odm ... and others)

HTML

http://bbc.com/news/world-middle-east-42412729

    import urllib.request  # urlopen lives in the urllib.request submodule

    url = "http://bbc.com/news/world-middle-east-42412729"
    urlData = urllib.request.urlopen(url)   # open the URL and return a response object
    html = urlData.read().decode("utf-8")   # read the raw bytes and decode them to str
    print(html)
    # prints
    # '<!DOCTYPE html>\n<html lang="en" id="responsive-news">\n
    # \n \n
    # \n
    # <title>Yemen rebel ballistic missile \'intercepted over Riyadh\' - BBC News</title>\n
    # ...

HTML is often helpful since it marks up the distinct parts of the document, which makes them easy to find:

    ...
    <title>Yemen rebel ballistic missile 'intercepted over Riyadh' - BBC News</title>

    ...
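Because the title sits between `<title>` tags, it can be found without inspecting the rest of the page. The following is a minimal sketch of that idea using only the standard library's `html.parser` module; the class name `TitleExtractor`, the helper `extract_title`, and the sample string are illustrative assumptions and do not come from the slides.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text that appears between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lower case by HTMLParser.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect all of them.
        if self.in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip()


# Hypothetical sample page, used only for illustration.
sample = (
    "<html><head><title>Example page title</title></head>"
    "<body><p>Body text</p></body></html>"
)
print(extract_title(sample))
```

The same function could be applied to the `html` string downloaded with `urllib.request.urlopen`, assuming the page contains a `<title>` element.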

Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It can navigate, search, and modify the parse tree.

    html_doc = """
    <title>The Dormouse's story</title>

    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
    Elsie,
    Lacie and
    Tillie;
    and they lived at the bottom of a well.
    ...
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')
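The three operations named for Beautiful Soup (navigating, searching, and modifying the parse tree) can be sketched as follows. This is only an illustration under stated assumptions: it requires the third-party `bs4` package to be installed, and `sample_doc` with its example.com links is a hypothetical document, not the one from the slides.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
sample_doc = """
<html><head><title>Sample title</title></head>
<body>
<p class="intro">First paragraph with a <a href="http://example.com/a" id="link1">link</a>.</p>
<p>Second paragraph with <a href="http://example.com/b" id="link2">another link</a>.</p>
</body></html>
"""

soup = BeautifulSoup(sample_doc, "html.parser")

# Navigate: attribute access walks down to the first matching tag.
title_text = soup.title.string

# Search: find_all returns every matching tag in document order.
hrefs = [a.get("href") for a in soup.find_all("a")]

# Plain text: get_text strips the markup and keeps only the text content.
text = soup.get_text()

# Modify: assigning to .string replaces the tag's text content.
soup.title.string = "New title"

print(title_text)
print(hrefs)
print(soup.title)
```

For later processing steps such as tokenization, `get_text()` is usually the relevant call, since it yields the raw text without any tags.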
