
Scikit-Learn: Intro

- Scikit-learn provides tools for data analysis (machine learning)
- Tools span the machine learning pipeline:
  - Loading
  - Preprocessing
  - Learning
    - Classification
    - Clustering
    - Regression
  - Model selection
- Scikit-learn works on any numeric data stored as NumPy arrays or SciPy sparse matrices
- Types that are convertible to these arrays will also work (e.g., pandas DataFrames)
- Scikit-learn expects all data to be numeric and all values to be present


Scikit-Learn: Loading

- The sklearn.datasets package provides methods for loading and downloading datasets
  - Dataset loaders are used to load small standard datasets (toy datasets)
  - Dataset fetchers are used to download and load large standard datasets (real-world datasets)
  - Miscellaneous tools load datasets in other formats
  - Tools for generating synthetic datasets (i.e., artificial sets)
- In general, the loaders and fetchers return dictionary-like objects that have at least the following two entries:
  1. An array of shape n_samples x n_features of sample data, with key data
     - This is generally a NumPy array, pandas DataFrame, or sometimes a SciPy sparse matrix
  2. An array of length n_samples of target values, with key target
     - This is generally a NumPy array or pandas Series
     - Usually 1D, but can be 2D
- The DESCR attribute describes their structure
- Other attributes may include feature_names and target_names
- Toy datasets
  - Loaders (see the sketch after this list):
    load_boston([return_X_y])            Load and return the Boston house-prices dataset (regression)
    load_iris([return_X_y])              Load and return the iris dataset (classification)
    load_diabetes([return_X_y])          Load and return the diabetes dataset (regression)
    load_digits([n_class, return_X_y])   Load and return the digits dataset (classification)
    load_linnerud([return_X_y])          Load and return the Linnerud dataset (multivariate regression)
    load_wine([return_X_y])              Load and return the wine dataset (classification)
    load_breast_cancer([return_X_y])     Load and return the breast cancer Wisconsin dataset (classification)
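A minimal sketch of what a toy-dataset loader returns, using load_iris as an example (the shape comments assume the standard iris data of 150 samples and 4 features):

    # Sketch: loading a toy dataset and inspecting the returned Bunch object.
    from sklearn.datasets import load_iris

    iris = load_iris()
    print(iris.keys())          # dictionary-style access: 'data', 'target', 'DESCR', ...
    print(iris.data.shape)      # (n_samples, n_features) -> (150, 4)
    print(iris.target.shape)    # (n_samples,) -> (150,)
    print(iris.feature_names)   # names of the four measurement columns
    print(iris.DESCR[:200])     # beginning of the dataset description

    # Alternatively, request only the features matrix and target vector.
    X, y = load_iris(return_X_y=True)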


Scikit-Learn: Loading (2)

- Descriptions of the above datasets can be found in the scikit-learn documentation at https://scikit-learn.org/stable/datasets/index.html#datasets

- Real-world datasets

  - Fetchers (see the sketch after this list):
    fetch_olivetti_faces([data_home, shuffle, ...])     Load the Olivetti faces dataset from AT&T (classification)
    fetch_20newsgroups([data_home, subset, ...])        Load the filenames and data from the 20 newsgroups dataset (classification)
    fetch_20newsgroups_vectorized([subset, ...])        Load the 20 newsgroups dataset and vectorize it into token counts (classification)
    fetch_lfw_people([data_home, funneled, ...])        Load the Labeled Faces in the Wild (LFW) people dataset (classification)
    fetch_lfw_pairs([subset, data_home, ...])           Load the Labeled Faces in the Wild (LFW) pairs dataset (classification)
    fetch_covtype([data_home, ...])                     Load the covertype dataset (classification)
    fetch_rcv1([data_home, subset, ...])                Load the RCV1 multilabel dataset (classification)
    fetch_kddcup99([subset, data_home, shuffle, ...])   Load the KDD Cup 99 dataset (classification)
    fetch_california_housing([data_home, ...])          Load the California housing dataset (regression)
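A sketch of a fetcher call, using fetch_california_housing as an example; unlike the toy loaders, fetchers download the data to data_home on first use, so this needs network access the first time it runs:

    # Sketch: fetching a real-world dataset.
    from sklearn.datasets import fetch_california_housing

    housing = fetch_california_housing()
    print(housing.data.shape)      # (20640, 8) features matrix
    print(housing.target.shape)    # (20640,) median house values (regression target)
    print(housing.feature_names)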

- Generated datasets
  - These tools allow generation of artificial datasets that exhibit various characteristics
  - Generators exist for (see the sketch below):
    - Classification
    - Clustering
    - Regression
    - Manifold learning
    - Decomposition
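A sketch of the generator functions for the first three categories (the sizes and random_state values are arbitrary choices):

    # Sketch: generating synthetic datasets with known structure.
    from sklearn.datasets import make_blobs, make_classification, make_regression

    # A 2-class classification problem with 4 informative features.
    X_clf, y_clf = make_classification(n_samples=200, n_features=10,
                                       n_informative=4, n_classes=2, random_state=0)

    # A regression problem with additive Gaussian noise.
    X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=0.5,
                                   random_state=0)

    # Gaussian blobs, useful for clustering demonstrations.
    X_blobs, labels = make_blobs(n_samples=200, centers=3, random_state=0)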


Scikit-Learn: Loading (3)

- OpenML is a public repository for machine learning data
- The sklearn.datasets package provides the function sklearn.datasets.fetch_openml() for downloading data from this repository (see the sketch below)
- You can access the components of the returned data object using the standard dictionary access methods and attributes (e.g., object.keys())
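A sketch of the calling pattern for fetch_openml; the "titanic" name and version are just one example of an OpenML dataset, the call downloads data and so requires network access, and as_frame assumes a reasonably recent scikit-learn:

    # Sketch: downloading a dataset from the OpenML repository.
    from sklearn.datasets import fetch_openml

    titanic = fetch_openml("titanic", version=1, as_frame=True)
    print(titanic.keys())        # standard dictionary-style access
    print(titanic.data.shape)    # features as a pandas DataFrame
    print(titanic.target[:5])    # target as a pandas Series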


Scikit-Learn: Estimators

- Learning models are implemented as Python objects called estimators
  - All estimators share a common interface
  - Datasets are represented as NumPy arrays, pandas DataFrames, or SciPy sparse matrices
  - Machine learning tasks can be pipelined together
- The usual sequence of steps when using an estimator is (see the sketch below):
  1. Choose an estimator and import its class
  2. Choose hyperparameters when instantiating the class
  3. Convert the dataset into the proper format of features matrix and target vector
  4. Fit the model to the data by calling fit()
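The four steps might look like this sketch, using LinearRegression on the diabetes toy dataset as an arbitrary example:

    # Sketch of the four-step estimator workflow.
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression   # 1. choose an estimator and import its class

    model = LinearRegression(fit_intercept=True)         # 2. choose hyperparameters at instantiation
    X, y = load_diabetes(return_X_y=True)                # 3. features matrix X and target vector y
    model.fit(X, y)                                      # 4. fit the model to the data

    print(model.coef_)   # learned parameters are available after fitting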

- Estimators come in two flavors:
  1. Transformers
     - These transform a dataset
     - They are used for preprocessing a dataset prior to training
  2. Predictors
     - These are used to train a model and then use the trained result for generalization
     - They predict results
- Estimator methods (see the example below):
  - fit(X, y=None)
    - A member of all estimators
    - X is the dataset; y is optional and generally represents target data for supervised learning
    - Fitting the model to the dataset applies the estimator to the dataset
      - In the case of a transformer, it computes the values needed for the later transformation
      - In the case of a predictor, it trains the model on the dataset
  - predict(T)
    - A member of predictors
    - This method generalizes to test data
    - T is the test input
    - Returns an array of predicted values
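A sketch of fit() followed by predict() on held-out data, using a k-nearest-neighbors classifier as the example predictor:

    # Sketch: fit() on training data, predict() on unseen test data.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)       # train the predictor
    y_pred = clf.predict(X_test)    # generalize to the test input
    print(y_pred[:10])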


Scikit-Learn: Estimators (2)

- transform(X)
  - A member of transformers
  - Performs the actual transformation of the dataset
  - Called after calling fit()
- fit_transform(X)
  - Applies transform(X) after calling fit()
  - A convenience, which may be more efficient than making two separate calls
- score(X, y)
  - A member of predictors
  - Returns a default measure of how well the model performs on the given data (e.g., accuracy for classifiers)
- The sketch below shows these methods in use
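A sketch tying these methods together: fit_transform() on a scaler, transform() reusing the fitted parameters, and score() on a classifier (whose default metric is accuracy):

    # Sketch: transformer methods plus predictor score().
    from sklearn.datasets import load_wine
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X, y = load_wine(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)   # fit() then transform() in one call
    X_test_scaled = scaler.transform(X_test)         # reuse the scaling fitted on the training data

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train_scaled, y_train)
    print(clf.score(X_test_scaled, y_test))          # accuracy on the test set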


Scikit-Learn: Preprocessing Data - Transformers

- The preprocessing package is used for this:
  from sklearn import preprocessing
- Method scale(X[, ...]) performs simple scaling (see the sketch at the end of this page)
  - Generates a dataset with mean of 0 and variance of 1
- Class StandardScaler
  - A transformer
  - Constructor: StandardScaler([...])
  - Works the same as method scale()
  - Attributes:
    - scale_ : Per-feature relative scaling of the data
    - mean_ : Per-feature mean of the data
    - var_ : Per-feature variance of the data
    - n_samples_seen_ : Number of samples processed by the estimator for each feature
- NOTE: For most of the classes listed below, there is an equivalent method comparable to scale()
- Class MinMaxScaler
  - A transformer
  - Constructor: MinMaxScaler(feature_range=(0, 1)[, ...])
  - Transforms dataset values into the range specified (default [0, 1])
  - Attributes:
    - min_ : Per-feature adjustment for minimum
    - scale_ : Per-feature relative scaling of the data
    - data_min_ : Per-feature minimum seen in the data
    - data_max_ : Per-feature maximum seen in the data
    - data_range_ : Per-feature range seen in the data
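A sketch of the scale() function and the two scaler classes on a tiny hand-made array, showing the fitted attributes listed above:

    # Sketch: scale(), StandardScaler, and MinMaxScaler.
    import numpy as np
    from sklearn import preprocessing

    X = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

    # Function form: zero mean, unit variance per feature.
    print(preprocessing.scale(X))

    # Class form: same transformation, but the fitted statistics are kept as attributes.
    std = preprocessing.StandardScaler()
    X_std = std.fit_transform(X)
    print(std.mean_, std.var_, std.scale_)

    # MinMaxScaler maps each feature into the requested range (default [0, 1]).
    mm = preprocessing.MinMaxScaler(feature_range=(0, 1))
    X_mm = mm.fit_transform(X)
    print(mm.data_min_, mm.data_max_, mm.data_range_)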


Scikit-Learn: Preprocessing Data - Transformers (2)

- Class RobustScaler
  - A transformer
  - Constructor: RobustScaler([...])
  - Like StandardScaler, but for use when the data has many outliers
  - Scaling is based on the median and quantile ranges
  - Attributes:
    - center_ : Per-feature median value
    - scale_ : Per-feature scaled interquartile range
- Class Normalizer
  - A transformer
  - Constructor: Normalizer([...])
  - Normalizes each sample (row); i.e., when considered as a vector, each row will have unit length
- Class CategoricalEncoder
  - NOTE: This is mentioned in Geron but does not exist
  - Its functionality is supplied by OrdinalEncoder and OneHotEncoder (see the sketch below)
- Class OrdinalEncoder
  - A transformer
  - Constructor: OrdinalEncoder([...])
  - Transforms categorical data to numeric representations, one integer per unique feature value
  - Attributes:
    - categories_ : The categories of each feature determined during fitting
- Class OneHotEncoder
  - A transformer
  - Constructor: OneHotEncoder([...])
  - Encodes categorical features as a one-hot numeric array
  - Categories are based on the unique values in each feature
  - Attributes:
    - categories_ : The categories of each feature determined during fitting
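A sketch of the two categorical encoders on a small hand-made dataset (RobustScaler and Normalizer follow the same fit/transform pattern as the scalers on the previous page):

    # Sketch: OrdinalEncoder vs. OneHotEncoder on categorical features.
    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    X = np.array([["red",   "S"],
                  ["green", "M"],
                  ["blue",  "L"],
                  ["green", "S"]])

    ord_enc = OrdinalEncoder()
    print(ord_enc.fit_transform(X))            # one integer per unique category, per feature
    print(ord_enc.categories_)                 # categories learned during fitting

    oh_enc = OneHotEncoder()
    print(oh_enc.fit_transform(X).toarray())   # default output is sparse; convert for display
    print(oh_enc.categories_)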

