Building datasets - Florida State University

BUILDING DATASETS

An Introduction

BUILDING DATASETS

• Sometimes, it becomes necessary to make your own dataset, for training, testing or verification purposes.

• Alternatively, you might want to build a dataset with "real-world" data from various sources for a project.

• We need well-composed datasets to build and test our ML application before launching it.

• Datasets are usually multi-variate. Each data point has multiple variables/dimensions, which may or may not be labeled.

• The advantage of building a test dataset is the existence of the class label (target) for the data point, which allows us to test the application on something besides the training data.
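As a minimal sketch of why labels matter, the snippet below builds a small labeled dataset and holds part of it back for testing. The use of `make_blobs`, the sample counts, and the 25% split are assumptions chosen for illustration, not values from the slides.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Build a small labeled dataset: X holds the features, y holds the class labels.
# (200 points, 3 classes, 2 features are illustrative choices.)
X, y = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)

# Hold out 25% of the labeled points so the application can be tested
# on data it never saw during training. stratify=y keeps the class mix similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```

Because every held-out point carries a target, predictions on `X_test` can be compared directly against `y_test`.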

TYPES OF DATASETS

Data can be numerical (numbers) or categorical (labels drawn from a set of categories).

• Numerical data could be
  ◦ Discrete - each value comes from a countable set, such as integer counts
  ◦ Continuous - values can fall anywhere within a range
  ◦ Interval - ordered values whose differences are meaningful but that have no true zero, like temperature in °C
  ◦ Ratio - like interval data, but with a true zero, so ratios and scaling transformations are meaningful

• Categorical data could be
  ◦ Nominal - unordered lists, like lists of colors, courses, etc.
  ◦ Ordinal - ordered, usually exhaustive lists, like T-shirt sizes (XS to XL), letter grades (A-F), etc.
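The nominal/ordinal distinction can be sketched in code. The example below uses pandas `Categorical`, which is an assumption of this sketch (the slides do not name a library); the color and size values are made up for illustration.

```python
import pandas as pd

# Nominal: categories have no inherent order.
colors = pd.Categorical(["red", "green", "blue", "green"], ordered=False)

# Ordinal: the order of the categories is declared explicitly.
sizes = pd.Categorical(
    ["M", "XS", "XL", "S"],
    categories=["XS", "S", "M", "L", "XL"],
    ordered=True,
)

# Integer codes follow the declared category order for ordinal data.
print(sizes.codes)

# Ordered categoricals support min/max; unordered ones do not.
print(sizes.min(), sizes.max())
```

Declaring the order up front keeps comparisons such as "smaller than M" meaningful, which a plain list of strings would not.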

CONSTRUCTING DATASETS

There are several options:

• Use a pre-existing dataset from a dataset repository (or buy pre-labeled datasets). You could also get access to an API that would funnel some client data to you.
  ◦ We did this for the iris clustering example

• Make up data points using statistics and pre-existing modules to fit the parameters of the application.
  ◦ We did this for the torch example

• Generate data points using just RNGs (tailored to fit the application). This is harder for categorical data.

• Scrape the internet for data
  ◦ Web scrapers exist, but many websites cannot be scraped.
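Two of the options above can be sketched briefly: loading a pre-existing dataset and generating points with a random number generator. Loading iris through `sklearn.datasets.load_iris` is an assumption about how the earlier example obtained it, and the feature names and distribution parameters below are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris

# Pre-existing dataset: iris ships with scikit-learn.
iris = load_iris()
X_iris, y_iris = iris.data, iris.target  # 150 samples, 4 features, 3 classes

# RNG-generated data: a seeded generator keeps the dataset reproducible.
rng = np.random.default_rng(seed=42)
heights_cm = rng.normal(loc=170.0, scale=10.0, size=100)  # continuous feature
ages = rng.integers(low=18, high=66, size=100)            # discrete feature, 18 to 65

# Categorical data needs an explicit list of categories to draw from,
# which is part of why it is harder to generate with RNGs alone.
shirt_sizes = rng.choice(["XS", "S", "M", "L", "XL"], size=100)
```

Uniform draws like `rng.choice` ignore any correlation between columns, so realistic categorical data usually needs extra modelling beyond this sketch.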

GENERATING DATASETS WITH SKLEARN

• The sklearn library lets us generate datasets for several supervised and unsupervised ML applications, including numerical data points that are
  ◦ Linear (for regression-based applications)
  ◦ Circular (for 2-class classification)
  ◦ Moons (for 2-class classification)
  ◦ Blobs (for clustering and multi-class classification)

• We will examine quick ways to generate all 4 of these.

• There are more, but these shall suffice for an example.
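One way to generate the four shapes listed above is with the generators in `sklearn.datasets`. This is a sketch: the sample counts, noise levels, and other parameter values are assumptions picked for illustration rather than the settings used in the lecture.

```python
from sklearn.datasets import make_blobs, make_circles, make_moons, make_regression

# Linear: one feature with a noisy linear target, for regression.
X_lin, y_lin = make_regression(n_samples=100, n_features=1, noise=10.0, random_state=0)

# Circular: two concentric rings, for 2-class classification.
X_circ, y_circ = make_circles(n_samples=100, noise=0.05, factor=0.5, random_state=0)

# Moons: two interleaving half-circles, for 2-class classification.
X_moon, y_moon = make_moons(n_samples=100, noise=0.1, random_state=0)

# Blobs: Gaussian clusters, for clustering or multi-class classification.
X_blob, y_blob = make_blobs(n_samples=100, centers=3, cluster_std=1.0, random_state=0)

for name, X in [("linear", X_lin), ("circles", X_circ), ("moons", X_moon), ("blobs", X_blob)]:
    print(name, X.shape)
```

Each call returns a feature array and a target array; fixing `random_state` makes the generated points repeatable between runs.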

SOME DATA POINTS GENERATED WITH SKLEARN

GENERATING CATEGORICAL DATA

There are several libraries that can be used to generate "random" categorical data.

• Faker is one such library

• To install:

    pip install faker

• Faker contains ways to randomly generate
  ◦ Names
  ◦ Addresses
  ◦ Dates and times
  ◦ Job titles
  ◦ Geopositional data
  ◦ Textual "notes"
  ◦ Currencies
  ◦ User IDs and hashes
  ◦ XML data, etc.

• Faker can be configured for a variety of languages, locations, etc.
