Instructions: Datasets for DL Course (2019 Fall)

Preparations: Download the Platform Software

Python 3.7.4 (the latest version at the time of writing). Official download website: https://www.python.org/downloads/

TensorFlow: A widely used, open-source library developed by Google for implementing machine learning models. A quick guide to installing TensorFlow is available on the official website: https://www.tensorflow.org/install

Keras API (optional but recommended): A high-level deep learning API written in Python. There are several guides to setting it up with the TensorFlow backend. If you are not using Anaconda, install it with pip (pip install keras); if you are using Anaconda, install it with conda (conda install -c conda-forge keras).

MNIST Dataset

Figure 1: Examples of the images in the MNIST dataset

Intro: The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

Applications: Shallow/Deep Neural Networks; CNN; RNN

Link to Dataset: http://yann.lecun.com/exdb/mnist/

Import from Keras:

    from keras.datasets import mnist

    # Set up the train and test splits
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

Import from TensorFlow:

    import tensorflow as tf
    from tensorflow.examples.tutorials.mnist import input_data

    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
    train_x = mnist.train.images
    train_y = mnist.train.labels
    test_x = mnist.test.images
    test_y = mnist.test.labels
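The Keras loader above returns x_train as 28x28 integer arrays and y_train as integer class labels. For the shallow/deep fully connected networks listed under Applications, the images are typically flattened and rescaled and the labels one-hot encoded. A minimal preprocessing sketch, assuming the Keras import above has been run (the *_flat and *_onehot names are ours, not part of the loader):

    import numpy as np
    from keras.utils import to_categorical

    num_classes = 10

    # Flatten each 28x28 image into a 784-dimensional vector and rescale to [0, 1]
    x_train_flat = x_train.reshape(x_train.shape[0], 784).astype("float32") / 255.0
    x_test_flat = x_test.reshape(x_test.shape[0], 784).astype("float32") / 255.0

    # One-hot encode the integer labels for a 10-way softmax output
    y_train_onehot = to_categorical(y_train, num_classes)
    y_test_onehot = to_categorical(y_test, num_classes)

    print(x_train_flat.shape)    # (60000, 784)
    print(y_train_onehot.shape)  # (60000, 10)

For a CNN, you would instead keep the 2-D layout and add a channel axis, e.g. x_train.reshape(-1, 28, 28, 1).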
MNIST Fashion Dataset

Figure 2: Examples of the images in the Fashion-MNIST dataset

Intro: Fashion-MNIST contains 70,000 grayscale images of fashion products from 10 categories, such as sneakers, trousers, and coats, with 7,000 images per category. This dataset is more challenging than MNIST, and it is therefore considered a drop-in replacement for the original MNIST dataset when benchmarking machine learning algorithms. It shares the same image size (28x28), data format, and structure of training and testing splits.

Link to Dataset: https://github.com/zalandoresearch/fashion-mnist

Applications: Shallow/Deep Neural Networks; CNN; RNN

Import from Keras:

    from keras.datasets import fashion_mnist

    # Set up the train and test splits
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

Import from TensorFlow:

    Train (size: 60,000)       Test (size: 10,000)
    train-images (download)    test-images (download)
    train-labels (download)    test-labels (download)

    Table 1: Download links for the Fashion-MNIST dataset

Make sure you have downloaded the data and saved it in the path data/fashion. Otherwise, the loader will fall back to the original MNIST dataset.

    from tensorflow.examples.tutorials.mnist import input_data

    data = input_data.read_data_sets('data/fashion')

Import from mnist_reader:

    import mnist_reader

    X_train, y_train = mnist_reader.load_mnist('data/fashion', kind='train')
    X_test, y_test = mnist_reader.load_mnist('data/fashion', kind='t10k')

Cat and Non-Cat Dataset

Figure 3: A cat picture from the web

Intro: This "cat and non-cat" dataset is taken from Andrew Ng's Coursera course Neural Networks and Deep Learning. The dataset ("data.h5") contains: a training set of images labeled as cat (y=1) or non-cat (y=0); and a test set of images labeled as cat or non-cat. Each image has shape (num_px, num_px, 3), where 3 is the number of channels (RGB); thus each image is square (height = num_px, width = num_px).

Link to Dataset: train data (Google Drive link); test data (Google Drive link)

Applications: Shallow/Deep Neural Networks; CNN; RNN

Import from the h5py package:

First, make sure you have downloaded both the train and test files, then import the h5py package:

    import numpy as np
    import h5py

    def load_data():
        train_dataset = h5py.File('train_catvnoncat.h5', "r")
        train_set_x_orig = np.array(train_dataset["train_set_x"][:])  # your train set features
        train_set_y_orig = np.array(train_dataset["train_set_y"][:])  # your train set labels

        test_dataset = h5py.File('test_catvnoncat.h5', "r")
        test_set_x_orig = np.array(test_dataset["test_set_x"][:])  # your test set features
        test_set_y_orig = np.array(test_dataset["test_set_y"][:])  # your test set labels

        classes = np.array(test_dataset["list_classes"][:])  # the list of classes
        return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes

    train_x, train_y, test_x, test_y, classes = load_data()
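For the logistic-regression and shallow-network exercises this dataset comes from, the images are usually flattened into column vectors (one column per example) and rescaled, following the vectorized convention used in the course. A minimal sketch, assuming load_data() above has been run; the *_flat names are ours:

    # Flatten each (num_px, num_px, 3) image into a column vector and rescale to [0, 1];
    # after the transpose, each column is one example
    train_x_flat = train_x.reshape(train_x.shape[0], -1).T / 255.0
    test_x_flat = test_x.reshape(test_x.shape[0], -1).T / 255.0

    print(train_x_flat.shape)  # (num_px * num_px * 3, number of training examples)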
CIFAR-10 Dataset

Figure 4: Examples of the images in the CIFAR-10 dataset

Intro: "CIFAR-10 is an established computer-vision dataset used for object recognition. It is a subset of the 80 Million Tiny Images dataset and consists of 60,000 32x32 color images containing one of 10 object classes, with 6,000 images per class. It was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton."

Link to Dataset: https://www.cs.toronto.edu/~kriz/cifar.html

Applications: Shallow/Deep Neural Networks; CNN; RNN

Import from Keras:

    from keras.datasets import cifar10

    (x_img_train, y_label_train), (x_img_test, y_label_test) = cifar10.load_data()

Import from self-defined functions:

    import pickle
    import numpy as np

    # helper from the CIFAR-10 website for unpickling a batch file
    def unpickle(file):
        with open(file, 'rb') as fo:
            dict = pickle.load(fo, encoding='bytes')
        return dict

    # define the function to load a single batch
    def load_CIFAR_batch(filename):
        with open(filename, 'rb') as f:
            datadict = pickle.load(f, encoding='latin1')
            X = datadict['data']
            Y = datadict['labels']
            X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
            Y = np.array(Y)
            return X, Y

    # define the function to load the whole dataset
    def load_CIFAR10():
        xs = []
        ys = []
        for b in range(1, 6):
            location = 'data_batch_' + str(b)
            X, Y = load_CIFAR_batch(location)
            xs.append(X)  # collect all five training batches
            ys.append(Y)
        Xtr = np.concatenate(xs)  # combine the batches; Xtr ends up with shape (50000, 32, 32, 3)
        Ytr = np.concatenate(ys)
        del X, Y
        Xte, Yte = load_CIFAR_batch('test_batch')
        return Xtr, Ytr, Xte, Yte

    # load the dataset
    X_train, y_train, X_test, y_test = load_CIFAR10()

Street View House Numbers (SVHN) Dataset

Figure 5: Examples of the images in the SVHN dataset

Intro: SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirements on data preprocessing and formatting. It is obtained from house numbers in Google Street View images.

Link to Dataset: http://ufldl.stanford.edu/housenumbers/

Applications: Shallow/Deep Neural Networks; CNN; RNN

Import from SciPy:

First, make sure you have downloaded the (Format 2) cropped-digits files: train_32x32.mat, test_32x32.mat, and extra_32x32.mat.

    import numpy as np
    import scipy.io as sio
    import matplotlib.pyplot as plt
    %matplotlib inline

    train_data = sio.loadmat('train_32x32.mat')
    test_data = sio.loadmat('test_32x32.mat')
    extra_data = sio.loadmat('extra_32x32.mat')

    x_train = train_data['X']
    y_train = train_data['y']
    x_test = test_data['X']
    y_test = test_data['y']
    x_extra = extra_data['X']
    y_extra = extra_data['y']

To use the extra set as training data as well:

    # In the .mat files, X has shape (32, 32, 3, N), so the sample axis is the last one
    x_train = np.concatenate([x_train, x_extra], axis=-1)
    y_train = np.concatenate([y_train, y_extra])

Large Movie Review Dataset (IMDB)

Figure 6: IMDB picture from the web

Intro: This is a dataset for binary sentiment classification containing 25,000 highly polar movie reviews for training and 25,000 for testing. In the Keras version, each review is stored as a sequence of integers; these are word IDs that have been pre-assigned to individual words, and the label is an integer (0 for negative, 1 for positive).

Link to Dataset: https://ai.stanford.edu/~amaas/data/sentiment/

Applications: Shallow/Deep Neural Networks; CNN; LSTM (RNN)

    Review                                                          Label
    One of the other reviewers has mentioned that after watching    Positive
    just 1 Oz episode you'll be hooked. They are right, as this
    is exactly what happened with me.

    Table 2: An example of a review and its label in the IMDB dataset

Import from Keras:

    from keras.datasets import imdb

    vocabulary_size = 5000
    (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)

Import from the pandas and os packages:

First, download the dataset from the link above and save it in the same folder as your program. Then import os and pandas:

    import os
    import pandas as pd

    path = "aclImdb/"
    positiveFiles = [x for x in os.listdir(path + "train/pos/") if x.endswith(".txt")]
    negativeFiles = [x for x in os.listdir(path + "train/neg/") if x.endswith(".txt")]
    testFiles = [x for x in os.listdir(path + "test/") if x.endswith(".txt")]

(Note: in the official aclImdb archive the test reviews live in test/pos/ and test/neg/; adjust the paths above if you are using the archive's original layout.)

Next, create the DataFrame:

    positiveReviews, negativeReviews, testReviews = [], [], []
    for pfile in positiveFiles:
        with open(path + "train/pos/" + pfile, encoding="latin1") as f:
            positiveReviews.append(f.read())
    for nfile in negativeFiles:
        with open(path + "train/neg/" + nfile, encoding="latin1") as f:
            negativeReviews.append(f.read())
    for tfile in testFiles:
        with open(path + "test/" + tfile, encoding="latin1") as f:
            testReviews.append(f.read())

    reviews = pd.concat([
        pd.DataFrame({"review": positiveReviews, "label": 1, "file": positiveFiles}),
        pd.DataFrame({"review": negativeReviews, "label": 0, "file": negativeFiles}),
        pd.DataFrame({"review": testReviews, "label": -1, "file": testFiles})
    ], ignore_index=True).sample(frac=1, random_state=1)
    reviews.head()

The output is a shuffled DataFrame with review, label, and file columns (test rows carry the placeholder label -1).

Last, we can perform the train, validation, and test splits:

    reviews = reviews[["review", "label", "file"]].sample(frac=1, random_state=1)
    train = reviews[reviews.label != -1].sample(frac=0.6, random_state=1)
    valid = reviews[reviews.label != -1].drop(train.index)
    test = reviews[reviews.label == -1]
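The pandas route leaves the reviews as raw text, so before training they still need to be turned into fixed-length integer sequences, which the Keras loader does automatically. Below is a minimal sketch using the Keras Tokenizer; the vocabulary_size and max_review_length values are illustrative choices, and the *_seq and *_arr names are ours:

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    vocabulary_size = 5000   # illustrative; mirrors the Keras loader above
    max_review_length = 500  # illustrative cutoff

    # Build the word index on the training reviews only, then map each review to word IDs
    tokenizer = Tokenizer(num_words=vocabulary_size)
    tokenizer.fit_on_texts(train["review"])

    X_train_seq = pad_sequences(tokenizer.texts_to_sequences(train["review"]), maxlen=max_review_length)
    X_valid_seq = pad_sequences(tokenizer.texts_to_sequences(valid["review"]), maxlen=max_review_length)
    y_train_arr = train["label"].values
    y_valid_arr = valid["label"].values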
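If you load IMDB through Keras instead, the reviews are already integer sequences and only need to be padded or truncated to a common length before they can be batched into a network. Again, maxlen here is an illustrative choice:

    from keras.preprocessing.sequence import pad_sequences

    max_review_length = 500  # illustrative cutoff

    # Zero-pad short reviews and truncate long ones so every example has the same length
    X_train_pad = pad_sequences(X_train, maxlen=max_review_length)
    X_test_pad = pad_sequences(X_test, maxlen=max_review_length)
    print(X_train_pad.shape)  # (25000, 500)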
