How do I one-hot encode an array of strings with Numpy?

Question

I know there are sub-optimal solutions out there, but I'm trying to optimise my code. So far, the shortest way I found is this:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

target = np.array(['dog', 'dog', 'cat', 'cat', 'cat', 'dog', 'dog', 'cat', 'cat'])

oe = OrdinalEncoder()
target = oe.fit_transform(target.reshape(-1, 1)).ravel()
target = np.eye(np.unique(target).shape[0])[np.array(target, dtype=np.int32)]
print(target)

[[0. 1.]
[0. 1.]
[1. 0.]
[1. 0.]
...

This code is ugly and long, and removing any part of it breaks it. I'm looking for a simpler way that doesn't involve calls to more than half a dozen functions from two different libraries.

What is the target produced by oe?

– hpaulj, Commented Nov 3, 2019 at 1:34

Nicolas Gervais · Accepted Answer · 2019-11-07 15:11:17Z

7

Got it. This will work with arrays of any number of unique values.

import numpy as np

target = np.array(['dog', 'dog', 'cat', 'cat', 'cat', 'dog', 'dog', 
    'cat', 'cat', 'hamster', 'hamster'])

def one_hot(array):
    unique, inverse = np.unique(array, return_inverse=True)
    onehot = np.eye(unique.shape[0])[inverse]
    return onehot

print(one_hot(target))

Out[9]:
array([[0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.]])
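
To make the two steps in one_hot easier to follow, here is a minimal sketch that prints the intermediate arrays on a shorter input. The four-element array and the dtype=np.uint8 argument are assumptions made for illustration; the answer itself uses the default float64 from np.eye.

```python
import numpy as np

# Shortened example input (an assumption for illustration).
target = np.array(['dog', 'dog', 'cat', 'hamster'])

# unique holds the sorted distinct labels; inverse holds, for every element
# of target, the position of its label inside unique.
unique, inverse = np.unique(target, return_inverse=True)
print(unique)   # ['cat' 'dog' 'hamster']
print(inverse)  # [1 1 0 2]

# Row i of the identity matrix is the one-hot vector for class i, so indexing
# the identity matrix with inverse selects one row per element of target.
# dtype=np.uint8 is optional; it only makes the result more compact.
onehot = np.eye(unique.shape[0], dtype=np.uint8)[inverse]
print(onehot)
```

The column order follows the sorted labels, so column 0 is 'cat', column 1 is 'dog', and column 2 is 'hamster'.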

edited Nov 7, 2019 at 15:11

answered Nov 3, 2019 at 1:48

Nicolas Gervais


3 Comments

hpaulj Over a year ago

preprocessing.OneHotEncoder also does this, though yours is faster.

user2357112 Over a year ago

Any particular reason you didn't use the return_inverse argument to numpy.unique, instead of numpy.searchsorted? Also, you're calling numpy.array on something that's already an array in np.array(array).

user2357112 Over a year ago

Also, using return_counts=True is unnecessary. You're only using the result for its length, but that length is the same length as words.

Matt Eding · Accepted Answer · 2019-11-03 07:37:32Z

Why not use OneHotEncoder?

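>>> import numpy as np
>>> from sklearn.preprocessing import OneHotEncoder
>>> # target is the nine-element dog/cat array from the question
>>> # scikit-learn 1.2+ names this sparse_output; older releases use sparse=False
>>> ohe = OneHotEncoder(categories='auto', sparse_output=False)
>>> arr = ohe.fit_transform(target[:, np.newaxis])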
>>> arr
array([[0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.]])

It stores nice metadata about the transformation:

>>> ohe.categories_
[array(['cat', 'dog'], dtype='<U3')]

Plus you can easily convert back:

>>> ohe.inverse_transform(arr).ravel()
array(['dog', 'dog', 'cat', 'cat', 'cat', 'dog', 'dog', 'cat', 'cat'],
      dtype='<U3')
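
For comparison, the same round trip can be sketched in plain NumPy. This is not part of the scikit-learn API; the variable names categories and decoded are chosen here for illustration, and the approach assumes each row contains exactly one 1.

```python
import numpy as np

target = np.array(['dog', 'dog', 'cat', 'cat', 'cat', 'dog', 'dog', 'cat', 'cat'])

# categories plays the role of ohe.categories_: the sorted distinct labels.
categories, inverse = np.unique(target, return_inverse=True)
onehot = np.eye(categories.shape[0])[inverse]

# argmax along each row recovers the column index of the single 1,
# which is then used to look the label up again.
decoded = categories[onehot.argmax(axis=1)]
print(decoded)
```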

Pritish kumar · Accepted Answer · 2019-11-03 07:40:20Z

0

You can use Keras and LabelEncoder for this:

import numpy as np
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

# define example
data = np.array(['dog', 'dog', 'cat', 'cat', 'cat', 'dog', 'dog', 'cat', 'cat'])

label_encoder = LabelEncoder()
data = label_encoder.fit_transform(data)
# one hot encode
encoded = to_categorical(data)
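
If the labels are already integer codes, the one-hot step can also be sketched without Keras. The labels array below is written out by hand as an assumption; it matches what LabelEncoder would produce for the example data, since LabelEncoder numbers classes in sorted order (cat is 0, dog is 1).

```python
import numpy as np

# Integer codes for ['dog', 'dog', 'cat', 'cat', 'cat', 'dog', 'dog', 'cat', 'cat'],
# hand-written here for illustration (cat -> 0, dog -> 1).
labels = np.array([1, 1, 0, 0, 0, 1, 1, 0, 0])

# Infer the number of classes from the largest code, then pick one
# identity-matrix row per label.
num_classes = labels.max() + 1
encoded = np.eye(num_classes)[labels]
print(encoded)
```

Inferring num_classes from the maximum code only works when the largest class actually appears in the data; otherwise pass the class count explicitly.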

answered Nov 3, 2019 at 7:40

Pritish kumar


1 Comment

Nicolas Gervais Over a year ago

This is not NumPy
