Convert all categorical variables to numeric in Python
Many machine learning tools accept only numbers as input. That is a problem if you want to use such a tool and your data includes categorical features. The typical fix is to represent each categorical feature with a "one-hot encoding": a value like "BMW" or "Mercedes" becomes a vector of zeros and a single 1. This functionality is available in a number of software libraries. Here we load data with Pandas, then convert the categorical columns with DictVectorizer from scikit-learn.

Pandas is a popular Python library inspired by data frames in R. It makes manipulating tabular numeric and non-numeric data much easier. Downsides: it is not very intuitive and has a somewhat steep learning curve. For any questions you may have, the Google + StackOverflow combo works well as a source of answers.

UPDATE: It turns out that Pandas has a get_dummies() function which does what we're after. The following code will replace the chosen categorical columns with their one-hot representations:

```python
cols_to_transform = [ 'a', 'list', 'of', 'categorical', 'column', 'names' ]
df_with_dummies = pd.get_dummies( df, columns = cols_to_transform )
```

This is the way we recommend now. (end update)

We'll use Pandas to load the data, do some cleaning and send it to scikit-learn's DictVectorizer. OneHotEncoder is another option. The difference is as follows: OneHotEncoder takes as input categorical values encoded as integers, which you can get from LabelEncoder. DictVectorizer expects data as a list of dictionaries, where each dictionary is a data row with column names as keys:

```python
[ { 'foo': 1, 'bar': 'z' },
  { 'foo': 2, 'bar': 'a' },
  { 'foo': 3, 'bar': 'c' } ]
```

After vectorizing and saving as CSV it would look like this:

```
foo,bar=z,bar=a,bar=c
1,1,0,0
2,0,1,0
3,0,0,1
```

Notice the column names, and that DictVectorizer doesn't touch numeric values.

The representation above is redundant: to encode three values you need only two indicator columns, and in general d - 1 columns suffice for d values. This is not a big deal, but apparently some methods will complain about collinearity. The solution is to drop one of the columns (a short sketch using pandas' drop_first option follows at the end of this section). It won't result in information loss, because in the redundant scheme with d columns exactly one of the indicators must be non-zero: if two out of three are zeros, then the third must be 1, and if one of the first two is 1, then the third must be zero.

Pandas

Before

To convert some columns from a data frame to a list of dicts, we call df.to_dict( orient = 'records' ) (thanks to José P. González-Brenes for the tip):

```python
cols_to_retain = [ 'a', 'list', 'of', 'categorical', 'column', 'names' ]
cat_dict = df[ cols_to_retain ].to_dict( orient = 'records' )
```

If you have only a few categorical columns, you can list them as above. In the Analytics Edge competition there are about 100 categorical columns, so in that case it's easier to drop the columns which are not categorical:

```python
cols_to_drop = [ 'UserID', 'YOB', 'votes', 'Happy' ]
cat_dict = df.drop( cols_to_drop, axis = 1 ).to_dict( orient = 'records' )
```

After

Using the vectorizer:

```python
from sklearn.feature_extraction import DictVectorizer as DV

vectorizer = DV( sparse = False )
vec_x_cat_train = vectorizer.fit_transform( x_cat_train )
vec_x_cat_test = vectorizer.transform( x_cat_test )
```

If the data has missing values, they will become NaNs in the resulting Numpy arrays, so it's advisable to fill them in with Pandas first:

```python
cat_data = cat_data_with_missing_values.fillna( 'NA' )
```

This way, the vectorizer will create an additional =NA column for each feature with NAs.
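Returning to the collinearity point above: pandas can drop one indicator per feature directly. A minimal sketch on an assumed toy column (the 'color' column is hypothetical, not from the original post):

```python
import pandas as pd

# toy example: one categorical column with three levels
df = pd.DataFrame({'color': ['red', 'green', 'blue']})

# drop_first=True keeps d - 1 indicator columns for d values,
# which avoids the redundancy/collinearity described above
print(pd.get_dummies(df, columns=['color'], drop_first=True))
#    color_green  color_red
# 0            0          1
# 1            1          0
# 2            0          0
# (the indicator dtype may be uint8 or bool depending on your pandas version)
```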
Handling binary features with missing values

If you have missing values in a binary feature, there's an alternative representation: -1 for negatives, 0 for missing values, and 1 for positives. It worked better in the case of the Analytics Edge competition: an SVM trained on one-hot encoded data with d indicators scored 0.768 in terms of AUC, while the alternative representation yielded 0.778. That simple change would give you 30th place out of 1686 contenders.

Code

We provide a sample script that loads data from CSV and vectorizes selected columns. Copy and paste the parts you find useful. If you'd like to run the script, you'll need to split the data first:

```
split.py train.csv train_v.csv test_v.csv 0.8
```

Make sure to have headers in both files. With split.py, headers from the source file will end up in one of the output files, probably in train. Just copy the header line to the other file using a text editor, or head -n 1 and cat on Unix.
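As a footnote to the binary-features section above, here is a minimal sketch of the -1/0/1 representation, using an assumed (hypothetical) column name:

```python
import pandas as pd

# toy binary feature with a missing value (the 'smoker' column is hypothetical)
df = pd.DataFrame({'smoker': ['yes', 'no', None, 'yes']})

# map negatives to -1 and positives to 1; values not in the mapping
# (including missing ones) become NaN, which we then fill with 0
df['smoker'] = df['smoker'].map({'no': -1, 'yes': 1}).fillna(0).astype(int)
print(df['smoker'].tolist())  # [1, -1, 0, 1]
```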
Forum Q&A: converting categorical variables

Q: I am working on a dataset in which I have to predict the purchase amount of different customers, and I found that all the independent variables are categorical. Is there any other way to handle these categorical variables instead of converting them into numerical variables? If not, can someone suggest how to go about it?

A: All models use only numbers, so if you have text values you have to assign numbers to the strings and use those numbers. If you use the sklearn module in Python, you can use LabelEncoder to convert text values to numbers. When you get predictions, you can use it again to convert the numbers back to text values.

A: The problem you are describing is a regression problem in which categorical data must be converted to numeric format, either by binary encoding (True or False to 1 or 0), ordinal encoding (data with some order, like coldest, cold, hot, to 0, 1, 2), or one-hot encoding (converting the possible values into separate columns). One way to handle categorical variables is to create a column for each category. For example, if you have the three categories vegetarian, non-vegetarian and vegan, you can create three columns and use true or false to mark which category a person belongs to.

A: A few options: 1) If all values are categorical, you can convert them to numeric with one-hot encoding, label encoding, etc., but one-hot encoding can create very high-dimensional data in terms of columns, which is not always advisable. 2) Try to eliminate categorical columns using a chi-squared test. 3) Use PCA to reduce the number of columns. 4) You can use the CatBoost algorithm, which works well with categorical values.

Q: But True or False is a boolean value, right? We have to convert it into integers before it can be passed into the model? And what if we needed to predict something using multiple attributes?

A: Yes, it is necessary to convert categorical features into numerical ones, because machine learning algorithms interpret only numeric values. Categorical features are of two types. For ordinal categorical features, use label encoding, target encoding, probability ratio encoding, and so on. For nominal categorical features, use one-hot encoding. One thing to remember: if a feature has too many categories, use a binary encoder instead of one-hot encoding.
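A minimal sketch of the LabelEncoder round trip suggested in the first answer, on assumed toy labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
labels = ['cat', 'dog', 'cat', 'bird']

# strings to integers (classes are sorted, so bird=0, cat=1, dog=2)
encoded = le.fit_transform(labels)       # array([1, 2, 1, 0])

# and back again, e.g. for turning predictions into readable labels
decoded = le.inverse_transform(encoded)  # ['cat', 'dog', 'cat', 'bird']
```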
How to Encode Categorical Data for Deep Learning in Keras

Last Updated on August 27, 2020

Machine learning and deep learning models, like those in Keras, require all input and output variables to be numeric. This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model. The two most popular techniques are integer encoding and one hot encoding, although a newer technique called learned embedding may provide a useful middle ground between these two methods.

In this tutorial, you will discover how to encode categorical data when developing neural network models in Keras. After completing this tutorial, you will know:

- The challenge of working with categorical data when using machine learning and deep learning models.
- How to integer encode and one hot encode categorical variables for modeling.
- How to learn an embedding distributed representation as part of a neural network for categorical variables.

Tutorial Overview

This tutorial is divided into five parts; they are:

1. The Challenge With Categorical Data
2. Breast Cancer Categorical Dataset
3. How to Ordinal Encode Categorical Data
4. How to One Hot Encode Categorical Data
5. How to Use a Learned Embedding for Categorical Data

The Challenge With Categorical Data

A categorical variable is a variable whose values take on the value of labels. For example, the variable may be "color" and may take on the values "red," "green," and "blue." Sometimes the categorical data may have an ordered relationship between the categories, such as "first," "second," and "third." This type of categorical data is referred to as ordinal, and the additional ordering information can be useful.

Machine learning algorithms and deep learning neural networks require that input and output variables are numbers. This means that categorical data must be encoded to numbers before we can use it to fit and evaluate a model. There are many ways to encode categorical variables for modeling, although the three most common are as follows:

- Integer Encoding: each unique label is mapped to an integer.
- One Hot Encoding: each label is mapped to a binary vector.
- Learned Embedding: a distributed representation of the categories is learned.

We will take a closer look at how to encode categorical data for training a deep learning neural network in Keras using each of these methods.

Breast Cancer Categorical Dataset

As the basis of this tutorial, we will use the so-called "breast cancer" dataset that has been widely studied in machine learning since the 1980s. The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A reasonable classification accuracy score on this dataset is between 68% and 73%. We will aim for this region, but note that the models in this tutorial are not optimized: they are designed to demonstrate encoding schemes. You can download the dataset and save the file as "breast-cancer.csv" in your current working directory.

Looking at the data, we can see that all nine input variables are categorical. Specifically, all variables are quoted strings; some are ordinal and some are not.
```
'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
'40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
'40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'
...
```

We can load this dataset into memory using the Pandas library.

```python
# load the dataset as a pandas DataFrame
data = read_csv(filename, header=None)
# retrieve numpy array
dataset = data.values
```

Once loaded, we can split the columns into input (X) and output (y) for modeling.

```python
# split into input (X) and output (y) variables
X = dataset[:, :-1]
y = dataset[:, -1]
```

Finally, we can force all fields in the input data to be strings, just in case Pandas tried to map some automatically to numbers (it does try). We can also reshape the output variable to be one column (e.g. a 2D shape).

```python
# format all fields as string
X = X.astype(str)
# reshape target to be a 2d array
y = y.reshape((len(y), 1))
```

We can tie all of this together into a helpful function that we can reuse later.

```python
# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:, -1]
    # format all fields as string
    X = X.astype(str)
    # reshape target to be a 2d array
    y = y.reshape((len(y), 1))
    return X, y
```

Once loaded, we can split the data into training and test sets so that we can fit and evaluate a deep learning model. We will use the train_test_split() function from scikit-learn, with 67% of the data for training and 33% for testing.

```python
# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
```

Tying all of these elements together, the complete example of loading, splitting, and summarizing the raw categorical dataset is listed below.
```python
# load and summarize the dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split

# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:, -1]
    # format all fields as string
    X = X.astype(str)
    # reshape target to be a 2d array
    y = y.reshape((len(y), 1))
    return X, y

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize
print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)
```

Running the example reports the size of the input and output elements of the train and test sets. We can see that we have 191 examples for training and 95 for testing.

```
Train (191, 9) (191, 1)
Test (95, 9) (95, 1)
```

Now that we are familiar with the dataset, let's look at how we can encode it for modeling.

How to Ordinal Encode Categorical Data

An ordinal encoding involves mapping each unique label to an integer value. As such, it is sometimes referred to simply as an integer encoding. This type of encoding is really only appropriate if there is a known relationship between the categories. This relationship does exist for some of the variables in the dataset, and ideally this should be harnessed when preparing the data. In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, at least as a point of reference against the other encoding schemes.

We can use the OrdinalEncoder() from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

Note: I will leave it as an exercise for you to update the example below to try specifying the order for those variables that have a natural ordering, and see if it has an impact on model performance.

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets. The function below, named prepare_inputs(), takes the input data for the train and test sets and encodes it using an ordinal encoding.

```python
# prepare input data
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc
```
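To make the mapping concrete, here is a small illustration on assumed toy data (not part of the original tutorial). By default, OrdinalEncoder assigns integers in sorted category order, independently per column:

```python
from sklearn.preprocessing import OrdinalEncoder

X_toy = [['red', 'second'], ['green', 'first'], ['blue', 'third']]
oe = OrdinalEncoder()
print(oe.fit_transform(X_toy))
# [[2. 1.]
#  [1. 0.]
#  [0. 2.]]
print(oe.categories_)
# [array(['blue', 'green', 'red'], dtype=object),
#  array(['first', 'second', 'third'], dtype=object)]
```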
We also need to prepare the target variable. It is a binary classification problem, so we need to map the two class labels to 0 and 1. This is a type of ordinal encoding, and scikit-learn provides the LabelEncoder class specifically designed for this purpose. We could just as easily use the OrdinalEncoder and achieve the same result, although the LabelEncoder is designed for encoding a single variable. The prepare_targets() function below integer encodes the output data for the train and test sets.

```python
# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc
```

We can call these functions to prepare our data.

```python
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
```

We can now define a neural network model. We will use the same general model in all of these examples: a Multilayer Perceptron (MLP) neural network with one hidden layer of 10 nodes, and one node in the output layer for making binary classifications. Without going into too much detail, the code below defines the model, fits it on the training dataset, and then evaluates it on the test dataset.

```python
# define the model
model = Sequential()
model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
model.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)
# evaluate the keras model
_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
print('Accuracy: %.2f' % (accuracy*100))
```

If you are new to developing neural networks in Keras, I recommend this tutorial: Develop Your First Neural Network in Python Step-By-Step.

Tying all of this together, the complete example of preparing the data with an ordinal encoding and fitting and evaluating a neural network on the data is listed below.
```python
# example of ordinal encoding for a neural network
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from keras.models import Sequential
from keras.layers import Dense

# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:, -1]
    # format all fields as string
    X = X.astype(str)
    # reshape target to be a 2d array
    y = y.reshape((len(y), 1))
    return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# define the model
model = Sequential()
model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
model.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)
# evaluate the keras model
_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
print('Accuracy: %.2f' % (accuracy*100))
```
Running the example will fit the model in just a few seconds on any modern hardware (no GPU required). The loss and the accuracy of the model are reported at the end of each training epoch, and finally the accuracy of the model on the test dataset is reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an accuracy of about 70% on the test dataset. Not bad, given that an ordinal relationship only exists for some of the input variables, and for those where it does, it was not honored in the encoding.

```
...
Epoch 95/100 - 0s - loss: 0.5349 - acc: 0.7696
Epoch 96/100 - 0s - loss: 0.5330 - acc: 0.7539
Epoch 97/100 - 0s - loss: 0.5316 - acc: 0.7592
Epoch 98/100 - 0s - loss: 0.5302 - acc: 0.7696
Epoch 99/100 - 0s - loss: 0.5291 - acc: 0.7644
Epoch 100/100 - 0s - loss: 0.5277 - acc: 0.7644
Accuracy: 70.53
```

This provides a good starting point when working with categorical data. A better and more general approach is to use a one hot encoding.

How to One Hot Encode Categorical Data

A one hot encoding is appropriate for categorical data where no relationship exists between categories. It involves representing each categorical variable with a binary vector that has one element for each unique label, marking the class label with a 1 and all other elements 0. For example, if our variable was "color" and the labels were "red," "green," and "blue," we would encode each of these labels as a three-element binary vector as follows:

Red: [1, 0, 0]
Green: [0, 1, 0]
Blue: [0, 0, 1]

Then each label in the dataset would be replaced with a vector (one column becomes three). This is done for all categorical variables, so that our nine input variables or columns become 43 in the case of the breast cancer dataset.

The scikit-learn library provides the OneHotEncoder class to automatically one hot encode one or more variables. The prepare_inputs() function below provides a drop-in replacement for the version in the previous section. Instead of using an OrdinalEncoder, it uses a OneHotEncoder.

```python
# prepare input data
def prepare_inputs(X_train, X_test):
    ohe = OneHotEncoder()
    ohe.fit(X_train)
    X_train_enc = ohe.transform(X_train)
    X_test_enc = ohe.transform(X_test)
    return X_train_enc, X_test_enc
```
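Here is a quick illustration of OneHotEncoder on the "color" example from the text, on assumed toy data (not part of the original tutorial). Note that sparse=False is used only to make the output readable; in recent scikit-learn versions this flag was renamed sparse_output:

```python
from sklearn.preprocessing import OneHotEncoder

colors = [['red'], ['green'], ['blue'], ['green']]
ohe = OneHotEncoder(sparse=False)
print(ohe.fit_transform(colors))
# columns are ordered alphabetically: blue, green, red
# [[0. 0. 1.]    red
#  [0. 1. 0.]    green
#  [1. 0. 0.]    blue
#  [0. 1. 0.]]   green
```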
Tying this together, the complete example of one hot encoding the breast cancer categorical dataset and modeling it with a neural network is listed below.

```python
# example of one hot encoding for a neural network
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from keras.models import Sequential
from keras.layers import Dense

# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:, -1]
    # format all fields as string
    X = X.astype(str)
    # reshape target to be a 2d array
    y = y.reshape((len(y), 1))
    return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
    ohe = OneHotEncoder()
    ohe.fit(X_train)
    X_train_enc = ohe.transform(X_train)
    X_test_enc = ohe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# define the model
model = Sequential()
model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
model.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)
# evaluate the keras model
_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
print('Accuracy: %.2f' % (accuracy*100))
```
The example one hot encodes the input categorical data, and also label encodes the target variable as we did in the previous section. The same neural network model is then fit on the prepared dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model performs reasonably well, achieving an accuracy of about 72%, close to what was seen in the previous section. A fairer comparison would be to run each configuration 10 or 30 times and compare performance using the mean accuracy. Recall that we are more focused on how to encode categorical data in this tutorial than on getting the best score on this specific dataset.

```
...
Epoch 95/100 - 0s - loss: 0.3837 - acc: 0.8272
Epoch 96/100 - 0s - loss: 0.3823 - acc: 0.8325
Epoch 97/100 - 0s - loss: 0.3814 - acc: 0.8325
Epoch 98/100 - 0s - loss: 0.3795 - acc: 0.8325
Epoch 99/100 - 0s - loss: 0.3788 - acc: 0.8325
Epoch 100/100 - 0s - loss: 0.3773 - acc: 0.8325
Accuracy: 72.63
```

Ordinal and one hot encoding are perhaps the two most popular methods. A newer technique, similar to one hot encoding and designed for use with neural networks, is called a learned embedding.

How to Use a Learned Embedding for Categorical Data

A learned embedding, or simply an "embedding," is a distributed representation for categorical data. Each category is mapped to a distinct vector, and the properties of the vector are adapted or learned while training a neural network. The vector space provides a projection of the categories, allowing those categories that are close or related to cluster together naturally. This provides the benefit of an ordinal relationship, by allowing any such relationships to be learned from data, and the benefit of a one hot encoding, by providing a vector representation for each category. Unlike one hot encoding, the input vectors are not sparse (they do not have lots of zeros). The downside is that it requires learning as part of the model and the creation of many more input variables (columns).

The technique was originally developed to provide a distributed representation for words, e.g. allowing similar words to have similar vector representations. As such, the technique is often referred to as a word embedding, and in the case of text data, algorithms have been developed to learn a representation independent of a neural network. For more on this topic, see the post: What Are Word Embeddings for Text?

An additional benefit of using an embedding is that the learned vectors can be fit in a model that has only modest skill, then extracted and used generally as input for the category across a range of different models and applications. That is, they can be learned and reused.

Embeddings can be used in Keras via the Embedding layer. For an example of learning word embeddings for text data in Keras, see the post: How to Use Word Embedding Layers for Deep Learning with Keras.

One embedding layer is required for each categorical variable, and the embedding expects the categories to be ordinal encoded, although no relationship between the categories is assumed. Each embedding also requires the number of dimensions to use for the distributed representation (vector space).
It is common in natural language applications to use 50, 100, or 300 dimensions. For our small example, we will fix the number of dimensions at 10, but this is arbitrary; you should experiment with other values.

First, we can prepare the input data using an ordinal encoding. The model we will develop will have one separate embedding for each input variable; therefore, the model will take nine different input datasets. As such, we split the input variables and ordinal encode (integer encode) each separately using the LabelEncoder, returning lists of separately prepared train and test input datasets. The prepare_inputs() function below implements this, enumerating over each input variable, integer encoding each using the fit-on-train best practice, and returning lists of encoded train and test variables (one-variable datasets) that can be used as input for our model later.

```python
# prepare input data
def prepare_inputs(X_train, X_test):
    X_train_enc, X_test_enc = list(), list()
    # label encode each column
    for i in range(X_train.shape[1]):
        le = LabelEncoder()
        le.fit(X_train[:, i])
        # encode
        train_enc = le.transform(X_train[:, i])
        test_enc = le.transform(X_test[:, i])
        # store
        X_train_enc.append(train_enc)
        X_test_enc.append(test_enc)
    return X_train_enc, X_test_enc
```

Now we can construct the model. We must construct the model differently in this case because we will have nine input layers, with nine embeddings whose outputs (nine different 10-element vectors) need to be concatenated into one long vector before being passed as input to the dense layers. We can achieve this with the functional Keras API. If you are new to the Keras functional API, see the post: How to Use the Keras Functional API for Deep Learning.

First, we can enumerate each variable, construct an input layer, connect it to an embedding layer, and store both layers in lists. We need a reference to all of the input layers when defining the model, and a reference to each embedding layer so we can concatenate them with a merge layer.

```python
# prepare each input head
in_layers = list()
em_layers = list()
for i in range(len(X_train_enc)):
    # calculate the number of unique inputs
    n_labels = len(unique(X_train_enc[i]))
    # define input layer
    in_layer = Input(shape=(1,))
    # define embedding layer
    em_layer = Embedding(n_labels, 10)(in_layer)
    # store layers
    in_layers.append(in_layer)
    em_layers.append(em_layer)
```
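As a shape sanity check (a sketch with assumed numbers, not part of the original tutorial): each head embeds one integer per row into a 10-dimensional vector, so its output is 3D. This is also why the complete example below reshapes the target to a 3D array.

```python
from keras import backend as K
from keras.layers import Input, Embedding

# one input head for a variable with, say, 4 distinct labels
in_layer = Input(shape=(1,))            # one integer-encoded value per row
em_layer = Embedding(4, 10)(in_layer)   # embeds it into a 10-d vector
print(K.int_shape(em_layer))            # (None, 1, 10)
# concatenating nine such heads yields (None, 1, 90), so the network
# output is (None, 1, 1), matching a target reshaped to (n, 1, 1)
```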
We can then merge all of the embedding layers, define the hidden layer and output layer, and then define the model.

```python
# concat all embeddings
merge = concatenate(em_layers)
dense = Dense(10, activation='relu', kernel_initializer='he_normal')(merge)
output = Dense(1, activation='sigmoid')(dense)
model = Model(inputs=in_layers, outputs=output)
```

When using a model with multiple inputs, we need to specify a list that has one dataset for each input, e.g. a list of nine arrays, each with one column, in the case of our dataset. Thankfully, this is the format we returned from our prepare_inputs() function, so fitting and evaluating the model looks like it did in the previous section.

Additionally, we will plot the model by calling the plot_model() function and save the plot to file. This requires that pygraphviz and pydot are installed, which can be a pain on some systems. If you have trouble, just comment out the import statement and the call to plot_model().

```python
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# plot graph
plot_model(model, show_shapes=True, to_file='embeddings.png')
# fit the keras model on the dataset
model.fit(X_train_enc, y_train_enc, epochs=20, batch_size=16, verbose=2)
# evaluate the keras model
_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
print('Accuracy: %.2f' % (accuracy*100))
```

Tying this all together, the complete example of using a separate embedding for each categorical input variable in a multi-input model is listed below.
```python
# example of learned embedding encoding for a neural network
from numpy import unique
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers.merge import concatenate
from keras.utils import plot_model

# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:, -1]
    # format all fields as string
    X = X.astype(str)
    # reshape target to be a 2d array
    y = y.reshape((len(y), 1))
    return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
    X_train_enc, X_test_enc = list(), list()
    # label encode each column
    for i in range(X_train.shape[1]):
        le = LabelEncoder()
        le.fit(X_train[:, i])
        # encode
        train_enc = le.transform(X_train[:, i])
        test_enc = le.transform(X_test[:, i])
        # store
        X_train_enc.append(train_enc)
        X_test_enc.append(test_enc)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# make output 3d
y_train_enc = y_train_enc.reshape((len(y_train_enc), 1, 1))
y_test_enc = y_test_enc.reshape((len(y_test_enc), 1, 1))
# prepare each input head
in_layers = list()
em_layers = list()
for i in range(len(X_train_enc)):
    # calculate the number of unique inputs
    n_labels = len(unique(X_train_enc[i]))
    # define input layer
    in_layer = Input(shape=(1,))
    # define embedding layer
    em_layer = Embedding(n_labels, 10)(in_layer)
    # store layers
    in_layers.append(in_layer)
    em_layers.append(em_layer)
# concat all embeddings
merge = concatenate(em_layers)
dense = Dense(10, activation='relu', kernel_initializer='he_normal')(merge)
output = Dense(1, activation='sigmoid')(dense)
model = Model(inputs=in_layers, outputs=output)
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# plot graph
plot_model(model, show_shapes=True, to_file='embeddings.png')
# fit the keras model on the dataset
model.fit(X_train_enc, y_train_enc, epochs=20, batch_size=16, verbose=2)
# evaluate the keras model
_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
print('Accuracy: %.2f' % (accuracy*100))
```

Running the example prepares the data as described above, fits the model, and reports the performance.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model performs reasonably well, matching what we saw for the one hot encoding in the previous section. As the learned vectors were trained in a skilled model, it is possible to save them and use them as a general representation for these variables in other models that operate on the same data, which is a useful and compelling reason to explore this encoding.

```
...
Epoch 15/20 - 0s - loss: 0.4891 - acc: 0.7696
Epoch 16/20 - 0s - loss: 0.4845 - acc: 0.7749
Epoch 17/20 - 0s - loss: 0.4783 - acc: 0.7749
Epoch 18/20 - 0s - loss: 0.4763 - acc: 0.7906
Epoch 19/20 - 0s - loss: 0.4696 - acc: 0.7906
Epoch 20/20 - 0s - loss: 0.4660 - acc: 0.7958
Accuracy: 72.63
```

To confirm our understanding of the model, a plot is created and saved to the file embeddings.png in the current working directory. The plot shows the nine inputs each mapped to a 10-element vector, meaning that the actual input to the model is a 90-element vector.

[Figure: Plot of the Model Architecture With Separate Inputs and Embeddings for Each Categorical Variable]
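If you want to reuse the learned vectors as suggested above, they can be pulled back out of the fitted model. A hedged sketch (it assumes the model object from the example above is still in scope):

```python
from keras.layers import Embedding

# collect the nine embedding layers from the fitted model
embedding_layers = [layer for layer in model.layers
                    if isinstance(layer, Embedding)]

# each weight matrix has shape (n_labels_i, 10):
# one learned 10-d vector per category of variable i
learned_vectors = [layer.get_weights()[0] for layer in embedding_layers]
```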
Common Questions

This section lists some common questions and answers when encoding categorical data.

Q. What if I have a mixture of numeric and categorical data? Or, what if I have a mixture of categorical and ordinal data?

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model (see the sketch at the end of this section).

Q. What if I have hundreds of categories? Or, what if I concatenate many one hot encoded vectors to create a many-thousand-element input vector?

You can use a one hot encoding up to thousands and tens of thousands of categories. Having large vectors as input sounds intimidating, but the models can generally handle it. You can also try an embedding; it offers the benefit of a smaller vector space (a projection), and the representation can carry more meaning.

Q. What encoding technique is the best?

This is unknowable. Test each technique (and more) on your dataset with your chosen model and discover what works best for your case.

Summary

In this tutorial, you discovered how to encode categorical data when developing neural network models in Keras. Specifically, you learned:

- The challenge of working with categorical data when using machine learning and deep learning models.
- How to integer encode and one hot encode categorical variables for modeling.
- How to learn an embedding distributed representation as part of a neural network for categorical variables.
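Returning to the mixed-data question above: scikit-learn's ColumnTransformer offers one convenient way to encode each column separately and concatenate the results in a single step. A sketch under assumed column positions (the indices here are hypothetical, and X_train/X_test are the arrays from the tutorial):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# hypothetical: columns 0, 3 and 4 are categorical, the rest numeric
categorical_cols = [0, 3, 4]

ct = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), categorical_cols)],
    remainder='passthrough')  # numeric columns pass through unchanged

# fit on the training set only, then apply to train and test,
# following the same best practice as in the tutorial
X_train_enc = ct.fit_transform(X_train)
X_test_enc = ct.transform(X_test)
```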
