Processing Raw Text POS Tagging
Marina Sedinkina (slides by Desislava Zhekova)
CIS, LMU, marina.sedinkina@campus.lmu.de
January 14, 2020
Language Processing and Python
Outline
1 Dealing with other formats
    HTML
    Binary formats
2 NLP pipeline
    POS Tagging
3 Automatic Tagging
    Default Tagger - Baseline
    Regular Expression Tagger
    Lookup Tagger
    Evaluation
    Optimization
4 References
Dealing with other formats
Content on the Internet, as well as locally stored content, is often kept in formats other than plain text (.txt):
- RTF: Rich Text Format (.rtf)
- HTML: HyperText Markup Language (.html, .htm)
- XHTML: Extensible HyperText Markup Language (.xhtml, .xht, .xml, .html, .htm)
- XML: Extensible Markup Language (.xml)
- RSS: Rich Site Summary (.rss, .xml)
Additionally, text is often stored in binary formats, such as:
- MS Office formats (.doc, .dot, .docx, .docm, .dotx, .dotm, .xls, .xlt, .xlm, .ppt, .pps, .pptx, and many others)
- PDF: Portable Document Format (.pdf)
- OpenOffice formats (.odt, .ott, .oth, .odm, and others)
HTML
    import urllib.request  # "import urllib" alone does not load the request submodule

    url = "http://bbc.com/news/world-middle-east-42412729"
    urlData = urllib.request.urlopen(url)   # open the page
    html = urlData.read().decode("utf-8")   # raw bytes -> str
    print(html)
    # prints
    # '<!DOCTYPE html>\n<html lang="en" id="responsive-news">\n
    # \n \n
    # \n
    # <title>Yemen rebel ballistic missile \'intercepted over Riyadh\' - BBC News</title>\n
    # ...
HTML is often helpful since it marks up the distinct parts of the document, which makes them easy to find:
    ...
    <title>Yemen rebel ballistic missile intercepted over Riyadh - BBC News</title>
    ...
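As a minimal sketch of how such markup makes a part of the document easy to find, the following pulls out the text of the title element using only the standard library's html.parser module. The names TitleExtractor and extract_title and the sample string are illustrative assumptions, not part of the original slides.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text found inside <title> ... </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lower case by HTMLParser.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside the title element.
        if self.in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the title text of an HTML string, or '' if there is none."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip()


# Hypothetical sample input, modelled on the page shown above.
sample = (
    "<html><head><title>Yemen rebel ballistic missile intercepted "
    "over Riyadh - BBC News</title></head><body></body></html>"
)
print(extract_title(sample))
```

The same function could be applied to the html string downloaded with urllib.request; a dedicated library such as Beautiful Soup does this with far less code.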
Beautiful Soup
A Python library for pulling data out of HTML and XML files.
It can navigate, search, and modify the parse tree.
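    # The standard example document from the Beautiful Soup documentation
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """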
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_doc, 'html.parser')
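A short sketch of what navigating, searching, and modifying the parse tree can look like. It assumes the third-party bs4 package is installed and uses a shortened, self-contained variant of the example document; the variable names title_text, link_texts, and link_urls are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened variant of the example document (an assumption for illustration).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Navigate: attribute access returns the first tag with that name.
title_text = soup.title.string

# Search: find_all returns every matching tag in document order.
link_texts = [a.get_text() for a in soup.find_all("a")]
link_urls = [a.get("href") for a in soup.find_all("a")]

print(title_text)   # The Dormouse's story
print(link_texts)   # ['Elsie', 'Lacie', 'Tillie']

# Modify: assigning to .string replaces the text content of a tag.
soup.title.string = "A new title"
print(soup.title)   # <title>A new title</title>
```

Attribute access such as soup.title is convenient for unique elements, while find_all is the more general tool when several elements share a tag name.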