
Scikit-learn DecisionTree with categorical data


In this post, I'll walk through scikit-learn's DecisionTreeClassifier, from loading the data to fitting the model and making a prediction.

I'm going to use the vertebrate dataset from the book Introduction to Data Mining by Tan, Steinbach and Kumar.

We need to predict the class label of the last record from the dataset.

As you can see, the data is categorical. We need to vectorize the features so that we can feed them to the classifier. This is done as follows.

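from sklearn.feature_extraction import DictVectorizer

# One dict per row, mapping each column name to its categorical value
X_dict = X_feature.to_dict(orient='records')

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
vect = DictVectorizer(sparse=False)
X_vector = vect.fit_transform(X_dict)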

Converting the pandas DataFrame X_feature gives X_dict, a collection of Python dictionaries with one dictionary per record.

The first record of X_dict looks like this:

[{'Aerial Creature': 'no',
  'Aquatic Creature': 'no',
  'Body Temperature': 'warm-blooded',
  'Gives Birth': 'yes',
  'Has Legs': 'yes',
  'Hibernates': 'no',
  'Skin Cover': 'hair'},

The vectorizer then one-hot encodes these dictionaries into the array X_vector. Its first row is:

[[ 1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  1.,  0.,
         0.,  0.,  1.,  0.,  0.,  0.],

Each column of X_vector corresponds to one feature=value pair, in the following order:

['Aerial Creature=no',
 'Aerial Creature=yes',
 'Aquatic Creature=no',
 'Aquatic Creature=semi',
 'Aquatic Creature=yes',
 'Body Temperature=cold-blooded',
 'Body Temperature=warm-blooded',
 'Gives Birth=no',
 'Gives Birth=yes',
 'Has Legs=no',
 'Has Legs=yes',
 'Hibernates=no',
 'Hibernates=yes',
 'Skin Cover=feathers',
 'Skin Cover=fur',
 'Skin Cover=hair',
 'Skin Cover=none',
 'Skin Cover=quills',
 'Skin Cover=scales']
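To make the feature=value layout concrete, here is a small pure-Python sketch of the same idea. It is only an illustration of one-hot encoding for string-valued features, not scikit-learn's actual implementation; the helper name one_hot_encode and the two sample records are made up for this example.

```python
# Illustrative only: mimics the "feature=value" one-hot layout for
# string-valued features. This is NOT scikit-learn's implementation.

def one_hot_encode(records):
    """Return (feature_names, rows) for a list of dicts with string values."""
    # Collect every distinct "feature=value" pair; sorting gives a stable column order.
    feature_names = sorted(
        {f"{key}={value}" for rec in records for key, value in rec.items()}
    )
    index = {name: i for i, name in enumerate(feature_names)}

    rows = []
    for rec in records:
        row = [0.0] * len(feature_names)
        for key, value in rec.items():
            # Switch on the single column that matches this feature's value.
            row[index[f"{key}={value}"]] = 1.0
        rows.append(row)
    return feature_names, rows


# Two hypothetical records using a subset of the columns from the post.
records = [
    {"Gives Birth": "yes", "Skin Cover": "hair"},
    {"Gives Birth": "no", "Skin Cover": "scales"},
]
feature_names, rows = one_hot_encode(records)
print(feature_names)
print(rows)
```

Each record ends up with exactly one 1.0 per original feature, which is why the first row of X_vector above has seven ones for its seven features.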

The class labels are encoded as integers with LabelEncoder:

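from sklearn.preprocessing import LabelEncoder

# Encode the string labels as integers. The last record is left out
# because its label is the one we want to predict.
le = LabelEncoder()
y_train = le.fit_transform(data['Class Label'][:-1])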
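As a quick illustration of what the encoder produces, here is a minimal sketch on a handful of made-up labels (the list labels below is hypothetical and stands in for data['Class Label'][:-1]). It shows the learned classes, the integer codes, and how to map codes back to strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for data['Class Label'][:-1].
labels = ["mammal", "reptile", "fish", "mammal", "bird"]

le = LabelEncoder()
y_train = le.fit_transform(labels)

print(le.classes_)                        # distinct class names, sorted
print(y_train)                            # one integer code per label
print(le.inverse_transform(y_train[:2]))  # codes mapped back to strings
```

Keeping the fitted le around matters, because the classifier will predict integer codes and inverse_transform is what turns them back into readable class names.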

With X and y vectorized, we can now fit the DecisionTreeClassifier and use it to predict the class label of the last record.
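The fitting and prediction step could look roughly like the sketch below. It uses a few made-up toy records rather than the book's table, and it assumes the vectorizer is fitted on all rows while the last row is held out for prediction. The names records, labels, X_train, X_test and clf are chosen for this example only, and random_state=0 is just there to make the run repeatable.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy records (not the book's table). The last one has no label.
records = [
    {"Body Temperature": "warm-blooded", "Gives Birth": "yes", "Skin Cover": "hair"},
    {"Body Temperature": "cold-blooded", "Gives Birth": "no", "Skin Cover": "scales"},
    {"Body Temperature": "warm-blooded", "Gives Birth": "no", "Skin Cover": "feathers"},
    {"Body Temperature": "warm-blooded", "Gives Birth": "yes", "Skin Cover": "fur"},
    {"Body Temperature": "cold-blooded", "Gives Birth": "no", "Skin Cover": "scales"},
]
labels = ["mammal", "reptile", "bird", "mammal"]  # labels for all but the last record

# One-hot encode every record, then split off the last row for prediction.
vect = DictVectorizer(sparse=False)
X_vector = vect.fit_transform(records)
X_train, X_test = X_vector[:-1], X_vector[-1:]

# Encode the string labels as integers.
le = LabelEncoder()
y_train = le.fit_transform(labels)

# Fit the tree on the labeled rows and predict the held-out row.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

predicted = le.inverse_transform(clf.predict(X_test))
print(predicted)
```

Slicing with X_vector[-1:] rather than X_vector[-1] keeps the test row two-dimensional, which is the shape predict expects.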

Find the IPython notebook here.