Processing Raw Text POS Tagging
Dealing with other formats NLP pipeline
Automatic Tagging References
Marina Sedinkina (slides by Desislava Zhekova)
CIS, LMU marina.sedinkina@campus.lmu.de
January 8, 2019
Language Processing and Python
Outline
1 Dealing with other formats
    HTML
    Binary formats
2 NLP pipeline
    POS Tagging
3 Automatic Tagging
    Default Tagger - Baseline
    Regular Expression Tagger
    Lookup Tagger
    Evaluation
    Optimization
4 References
Dealing with other formats
Often, content on the Internet as well as locally stored content is available in a number of formats other than plain text (.txt):
- RTF: Rich Text Format (.rtf)
- HTML: HyperText Markup Language (.html, .htm)
- XHTML: Extensible HyperText Markup Language (.xhtml, .xht, .xml, .html, .htm)
- XML: Extensible Markup Language (.xml)
- RSS: Rich Site Summary (.rss, .xml)
Additionally, text is often stored in binary formats, such as:
- MS Office formats (.doc, .dot, .docx, .docm, .dotx, .dotm, .xls, .xlt, .xlm, .ppt, .pps, .pptx ... and many others)
- PDF: Portable Document Format (.pdf)
- OpenOffice formats (.odt, .ott, .oth, .odm ... and others)
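Binary formats cannot be decoded reliably as plain text, so each one needs a dedicated reader. As a rough sketch that is not part of the original slides, the first bytes of a file often hint at which kind of container it is: PDF files begin with `%PDF-`, the XML-based Office and OpenDocument formats (.docx, .odt, ...) are ZIP archives, and legacy .doc/.xls/.ppt files carry the OLE2 compound-file header. The names `SIGNATURES`, `sniff_format` and `sniff_file`, as well as the returned labels, are hypothetical and chosen only for illustration.

```python
# Leading byte signatures ("magic bytes") of a few common binary containers.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),                          # e.g. .docx, .pptx, .odt
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),   # e.g. legacy .doc, .xls, .ppt
]


def sniff_format(head: bytes) -> str:
    """Guess the container type from the first bytes of a file (hypothetical helper)."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"  # may well be plain text or markup


def sniff_file(path: str) -> str:
    """Open the file in binary mode and inspect only its first 8 bytes."""
    with open(path, "rb") as f:
        return sniff_format(f.read(8))
```

A call such as `sniff_file("report.pdf")` (a made-up file name) only identifies the container; extracting the text itself still requires a format-specific library.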
HTML
http://bbc.com/news/world-middle-east-42412729
    import urllib.request  # the request submodule must be imported explicitly

    url = "http://bbc.com/news/world-middle-east-42412729"
    # open the URL and read the raw bytes of the response
    urlData = urllib.request.urlopen(url)
    # decode the bytes into a Python string
    html = urlData.read().decode("utf-8")
    print(html)
    # prints
    # '<!DOCTYPE html>\n<html lang="en" id="responsive-news">\n
    # \n\n
    # \n
    # <title>Yemen rebel ballistic missile \'intercepted over Riyadh\' - BBC News</title>\n
    # ...
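The downloaded string still contains all of the markup, so a next step is usually to pull the text out of the tags. The following is a minimal sketch, assuming only the standard-library `html.parser` module; the class `TitleParser`, the function `extract_title`, and the shortened `sample` string are illustrative names and data, not taken from the slides.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self.in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Shortened stand-in for the page source fetched above (illustrative only).
sample = (
    '<!DOCTYPE html>\n<html lang="en"><head>'
    "<title>Yemen rebel ballistic missile 'intercepted over Riyadh' - BBC News</title>"
    "</head><body></body></html>"
)
print(extract_title(sample))
```

The same pattern extends to other elements by tracking different tag names; for real pages, a dedicated HTML library is typically more robust than a hand-written parser.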