Processing Raw Text POS Tagging
Marina Sedinkina - slides by Desislava Zhekova
CIS, LMU marina.sedinkina@campus.lmu.de
January 16, 2018
Language Processing and Python
Outline
1 Accessing Text beyond NLTK
2 Processing Raw Text
3 POS Tagging
Gutenberg Corpus
NLTK includes a good selection of corpora, among them a small selection of texts from the Project Gutenberg electronic text archive. Project Gutenberg contains more than 50,000 free electronic books, hosted on the Project Gutenberg website.
Unfortunately, only 18 books are provided, which you can list as we have seen before:
    >>> import nltk
    >>> nltk.corpus.gutenberg.fileids()
    ["austen-emma.txt", "austen-persuasion.txt", "austen-sense.txt",
     "bible-kjv.txt", "blake-poems.txt", "bryant-stories.txt",
     "burgess-busterbrown.txt", "carroll-alice.txt", "chesterton-ball.txt",
     "chesterton-brown.txt", "chesterton-thursday.txt",
     "edgeworth-parents.txt", "melville-moby_dick.txt",
     "milton-paradise.txt", "shakespeare-caesar.txt",
     "shakespeare-hamlet.txt", "shakespeare-macbeth.txt",
     "whitman-leaves.txt"]
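The file IDs in the listing appear to follow a prefix-title pattern, where the prefix is usually an author's surname. As a minimal sketch, assuming that pattern holds, the IDs can be grouped by prefix with plain Python. The helper name `group_by_prefix` is hypothetical and not part of NLTK, and the list below is a hand-copied subset of the listing so the example runs without the corpus installed.

```python
from collections import defaultdict

# A subset of the file IDs, copied by hand from the listing.
fileids = [
    "austen-emma.txt", "austen-persuasion.txt", "austen-sense.txt",
    "bible-kjv.txt", "melville-moby_dick.txt",
    "shakespeare-caesar.txt", "shakespeare-hamlet.txt",
    "shakespeare-macbeth.txt",
]


def group_by_prefix(ids):
    """Group file IDs by the part before the first hyphen (hypothetical helper)."""
    groups = defaultdict(list)
    for fileid in ids:
        # Drop the ".txt" extension if present.
        stem = fileid[:-len(".txt")] if fileid.endswith(".txt") else fileid
        # Split only at the first hyphen, so the rest of the title stays whole.
        prefix, _, title = stem.partition("-")
        groups[prefix].append(title)
    return dict(groups)


by_prefix = group_by_prefix(fileids)
print(by_prefix["austen"])
print(by_prefix["melville"])
```

With NLTK and its Gutenberg data installed, the same function could be called on the result of `nltk.corpus.gutenberg.fileids()` instead of the hand-copied list.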
Gutenberg eBooks
Accessing the original collection is thus helpful:
    import nltk
    import urllib.request  # "import urllib" alone does not load the request submodule

    url = "http://gutenberg.org/files/2554/2554-0.txt"
    urlData = urllib.request.urlopen(url)
    # readline() returns bytes, so decode them into a string
    firstLine = urlData.readline().decode("utf-8")
    print(firstLine)

    # prints
    # The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky
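The read-and-decode step can also be written as a small function that is testable without a network connection. The following is a sketch under stated assumptions, not the slides' own code: the names `read_first_line` and `fetch_first_line` are hypothetical, and the sample bytes are made up for the demonstration. It decodes with `"utf-8-sig"`, which behaves like `"utf-8"` but also drops a leading byte order mark if the file has one.

```python
import io
import urllib.request


def read_first_line(stream):
    """Return the first line of a binary file-like object as text.

    "utf-8-sig" decodes UTF-8 and removes a leading byte order mark
    when one is present; line-ending characters are stripped.
    """
    return stream.readline().decode("utf-8-sig").rstrip("\r\n")


def fetch_first_line(url):
    """Open the URL (needs network access) and return its first line."""
    with urllib.request.urlopen(url) as response:
        return read_first_line(response)


# Offline demonstration: BytesIO stands in for the HTTP response.
# The bytes are invented for this example and start with a UTF-8 BOM.
fake_response = io.BytesIO(b"\xef\xbb\xbfFirst line of a book\r\nSecond line\r\n")
first = read_first_line(fake_response)
print(first)
```

Calling `fetch_first_line(url)` with the URL from the slide would perform the actual download. Splitting the decoding from the network call keeps the decoding logic easy to check on its own.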