Corpora Preprocessing

spaCy References

NLTK II

Marina Sedinkina (slides by Desislava Zhekova)

CIS, LMU marina.sedinkina@campus.lmu.de

December 17, 2019

Language Processing and Python

Outline

1 Corpora

2 Preprocessing
    Normalization

3 spaCy
    Tokenization with spaCy

4 References

NLP and Corpora

Corpora are large collections of linguistic data designed to achieve a specific goal in NLP: the data should provide the best possible representation for the task.

Such tasks include, for example, word sense disambiguation, sentiment analysis, text categorization, and part-of-speech tagging.

Corpora Structure

Corpora

When the nltk.corpus module is imported, it automatically creates a set of corpus reader instances that can be used to access the corpora in the NLTK data distribution.

The corpus reader classes may be of several subtypes: CategorizedTaggedCorpusReader, BracketParseCorpusReader, WordListCorpusReader, PlaintextCorpusReader, ...

    from nltk.corpus import brown

    print(brown)

    # prints the corpus reader's class name and the location of its data
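The same reader interface can be tried without downloading any NLTK data. The following is a minimal sketch, assuming NLTK is installed; it uses PlaintextCorpusReader, one of the subtypes listed above, on a throwaway directory. The file names and sentences are made up for illustration and are not part of the NLTK data distribution.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Build a tiny throwaway corpus (hypothetical file names and contents).
corpus_dir = tempfile.mkdtemp()
samples = {
    "news.txt": "Stocks fell sharply today.",
    "blog.txt": "I love natural language processing!",
}
for name, text in samples.items():
    with open(os.path.join(corpus_dir, name), "w", encoding="utf-8") as handle:
        handle.write(text)

# Every file under corpus_dir whose name matches the pattern becomes a corpus document.
reader = PlaintextCorpusReader(corpus_dir, r".*\.txt")

file_ids = sorted(reader.fileids())          # names of the documents in the corpus
news_words = list(reader.words("news.txt"))  # tokens of a single document
blog_raw = reader.raw("blog.txt")            # untokenized text of a single document

print(file_ids)
print(news_words)
print(blog_raw)
```

The built-in corpora such as brown are accessed through methods of the same kind (fileids(), words(), raw()), except that their reader instances are created by nltk.corpus rather than by hand.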
