I'm trying to load my pandas dataframe (df) into a Tensorflow dataset with the following command:

target = df['label']
features = df['encoded_sentence']

dataset = tf.data.Dataset.from_tensor_slices((features.values, target.values))

Here's an excerpt from my pandas dataframe:

+-------+-----------------------+------------------+
| label | sentence              | encoded_sentence |
+-------+-----------------------+------------------+
| 0     | Hello world           | [5, 7]           |
+-------+-----------------------+------------------+
| 1     | my name is john smith | [1, 9, 10, 2, 6] |
+-------+-----------------------+------------------+
| 1     | Hello! My name is     | [5, 3, 9, 10]    |
+-------+-----------------------+------------------+
| 0     | foo baar              | [8, 4]           |
+-------+-----------------------+------------------+

# df.dtypes gives me:
label               int8
sentence            object
encoded_sentence    object

But it keeps giving me a ValueError:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

Can anyone tell me how to use the encoded sentences in my Tensorflow dataset? Help would be greatly appreciated!

2 Answers
You can make your Pandas values into a ragged tensor first and then make the dataset from it:

import tensorflow as tf
import pandas as pd

df = pd.DataFrame({'label': [0, 1, 1, 0],
                   'sentence': ['Hello world', 'my name is john smith',
                                'Hello! My name is', 'foo baar'],
                   'encoded_sentence': [[5, 7], [1, 9, 10, 2, 6],
                                        [5, 3, 9, 10], [8, 4]]})
features = tf.ragged.stack(list(df['encoded_sentence']))
target = tf.convert_to_tensor(df['label'].values)
dataset = tf.data.Dataset.from_tensor_slices((features, target))
for f, t in dataset:
    print(f.numpy(), t.numpy())

Output:

[5 7] 0
[ 1  9 10  2  6] 1
[ 5  3  9 10] 1
[8 4] 0

Note you may want to use padded_batch to get batches of examples from the dataset.

EDIT: Since padded-batching does not seem to work with a dataset made from a ragged tensor at the moment, you can also convert the ragged tensor to a regular one first:

import tensorflow as tf
import pandas as pd

df = pd.DataFrame({'label': [0, 1, 1, 0],
                   'sentence': ['Hello world', 'my name is john smith',
                                'Hello! My name is', 'foo baar'],
                   'encoded_sentence': [[5, 7], [1, 9, 10, 2, 6],
                                        [5, 3, 9, 10], [8, 4]]})
features_ragged = tf.ragged.stack(list(df['encoded_sentence']))
features = features_ragged.to_tensor(default_value=-1)
target = tf.convert_to_tensor(df['label'].values)
dataset = tf.data.Dataset.from_tensor_slices((features, target))
batches = dataset.batch(2)
for f, t in batches:
    print(f.numpy(), t.numpy())

Output:

[[ 5  7 -1 -1 -1]
 [ 1  9 10  2  6]] [0 1]
[[ 5  3  9 10 -1]
 [ 8  4 -1 -1 -1]] [1 0]
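As yet another alternative: the padded-batching limitation mentioned in the comments applies to datasets built from ragged tensors, but `padded_batch` works fine on a dataset whose elements are plain variable-length tensors. A sketch of that route using a generator (assuming a TensorFlow version recent enough to support `output_signature`, i.e. 2.4+); here padding happens per batch rather than to the global maximum length:

```python
import tensorflow as tf
import pandas as pd

df = pd.DataFrame({'label': [0, 1, 1, 0],
                   'encoded_sentence': [[5, 7], [1, 9, 10, 2, 6],
                                        [5, 3, 9, 10], [8, 4]]})

# Yield one (sequence, label) pair at a time so each element keeps
# its own length; padded_batch pads within each batch.
def gen():
    for enc, lab in zip(df['encoded_sentence'], df['label']):
        yield enc, int(lab)

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(tf.TensorSpec(shape=(None,), dtype=tf.int32),
                      tf.TensorSpec(shape=(), dtype=tf.int32)))
batches = dataset.padded_batch(2, padding_values=(-1, 0))
for f, t in batches:
    print(f.numpy(), t.numpy())
```

Because padding is per batch, the second batch here is only 4 wide (its longest sequence) instead of 5.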
3 Comments

Thank you so much for your help! When I try to create a batch it gives me a type error... TypeError: ('Padded batching of components of type ', <class 'tensorflow.python.ops.ragged.ragged_tensor.RaggedTensorSpec'>, ' is not supported.') Can you tell the correct way to create a train and test set?
@StudentAsker I see, I'd say that is a bug, I filed issue #39163.
@StudentAsker I added an alternative simply converting the ragged tensor into a regular one.

You can encode the array into a string, and then various methods to create a tf.data.Dataset will succeed.

Then you can split the string column of the tf dataset into a RaggedTensor and call to_tensor() on it. I'll provide an example below. Here is where I first found this string-encoding workaround: https://keras.io/examples/structured_data/movielens_recommendations_transformers/#encode-input-features

import numpy as np
import tensorflow as tf

# encode the pandas dataframe column as a string:
def encode_list_as_string(int_list: list, separator=","):
    return separator.join(map(str, int_list))

def encode_np_array_as_string_sep_comma(input: np.ndarray, separator=","):
    return separator.join(input.astype(str))

df['col_name'] = df['col_name'].apply(encode_list_as_string)
# or equivalently:
# df['col_name'] = df['col_name'].map(encode_list_as_string)


# decode the string-encoded column back into a padded tensor:
def expand_string_to_tensor(features, col_name, d_type):
    parts = tf.strings.split(features[col_name], ",")
    numbers = tf.strings.to_number(parts, out_type=d_type)
    features[col_name] = numbers.to_tensor()  # pad the ragged rows
    return features

dataset = dataset.map(lambda x: expand_string_to_tensor(x, 'col_name', tf.int32))

The caveat is that the dataset is then a _MapDataset.
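Putting the two halves together on data shaped like the question's, a minimal end-to-end sketch (the `decode` helper and the `-1` padding value are illustrative choices, not part of the original answer; batching before mapping is what makes `tf.strings.split` return a RaggedTensor):

```python
import pandas as pd
import tensorflow as tf

df = pd.DataFrame({'label': [0, 1, 1, 0],
                   'encoded_sentence': [[5, 7], [1, 9, 10, 2, 6],
                                        [5, 3, 9, 10], [8, 4]]})

# Encode each list as a comma-separated string so from_tensor_slices accepts it.
df['encoded_sentence'] = df['encoded_sentence'].map(
    lambda ints: ",".join(map(str, ints)))

dataset = tf.data.Dataset.from_tensor_slices(
    ({'encoded_sentence': df['encoded_sentence'].values}, df['label'].values))

# Batch first: splitting a batch of strings yields a RaggedTensor,
# which to_tensor() then pads to the batch's longest row.
def decode(features, label):
    parts = tf.strings.split(features['encoded_sentence'], ",")
    features['encoded_sentence'] = tf.strings.to_number(
        parts, out_type=tf.int32).to_tensor(default_value=-1)
    return features, label

batches = dataset.batch(2).map(decode)
for f, t in batches:
    print(f['encoded_sentence'].numpy(), t.numpy())
```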
