
I have a pandas.DataFrame df and would like to add a new column col with a single value "hello". I would like this column to be of dtype category with the single category "hello". I can do the following.

df["col"] = "hello"
df["col"] = df["col"].astype("category")
  1. Do I really need to write df["col"] three times in order to achieve this?
  2. After the first line I am worried that the intermediate dataframe df might take up a lot of space before the new column is converted to categorical. (The dataframe is rather large with millions of rows and the value "hello" is actually a much longer string.)

Are there any other straightforward, "short and snappy" ways of achieving this while avoiding the above issues?

An alternative solution is

df["col"] = pd.Categorical(itertools.repeat("hello", len(df)))

but it requires itertools and the use of len(df), and I am not sure what the memory usage is like under the hood.
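As a rough, illustrative way to compare the two steps, one can measure the intermediate object column against the converted categorical column with memory_usage(deep=True); the frame below is a made-up stand-in for the real df:

import pandas as pd

df = pd.DataFrame({"a": range(1_000_000)})

df["col"] = "hello"                                 # intermediate column of dtype object
object_bytes = df["col"].memory_usage(deep=True)    # deep=True counts one string per row

df["col"] = df["col"].astype("category")            # converted to categorical
category_bytes = df["col"].memory_usage(deep=True)  # small integer codes plus a single "hello"

print(object_bytes, category_bytes)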

  • df.assign() with a dict passed to df.astype() would be a scalable way to go. The first creates the new columns, and the second changes their dtypes, all in a stepwise manner in the same line of code. Check my answer for detailed examples.

3 Answers


We can explicitly build the Series of the correct size and type instead of implicitly doing so via __setitem__ then converting:

df['col'] = pd.Series('hello', index=df.index, dtype='category')

Sample Program:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

df['col'] = pd.Series('hello', index=df.index, dtype='category')

print(df)
print(df.dtypes)
print(df['col'].cat.categories)
   a    col
0  1  hello
1  2  hello
2  3  hello

a         int64
col    category
dtype: object

Index(['hello'], dtype='object')

2 Comments

Is pandas clever enough to fill the column with the code (i.e., integer value) of the categorical rather than filling it with the string, followed by an implicit astype step?
With this approach the dtype is a consideration throughout the entire process. There is no astype somewhere because the dtype is known, but there is a temporary array that is factorized to build the cat.codes. The benefit of this approach however, is that the 'hello' array is built once, factorized, and destroyed in favour of the categorical (similar to the way that reading in from a csv directly as categorical works). This also prevents multiple alignment steps with multiple reassignments, or creating unnecessary copies. You'd have to test out memory usage on your actual dataset.
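For what it's worth, a quick way to see what ends up stored is to inspect the codes of the resulting column; this is just an illustrative check using the sample df from the answer above:

s = pd.Series('hello', index=df.index, dtype='category')

print(s.cat.codes.dtype)          # int8: one small integer code per row
print(s.cat.categories)           # Index(['hello'], dtype='object'): the string is stored once
print(s.memory_usage(deep=True))  # roughly one byte per row plus the single 'hello' and the index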

A simple way to do this would be to use df.assign to create your new variable, then change its dtype to category using df.astype along with a dictionary of dtypes for the specific columns.

df = df.assign(col="hello").astype({'col':'category'})

df.dtypes
A         int64
col    category
dtype: object

That way you don't have to create a Series with the same length as the dataframe. You can just broadcast the input string directly, which would be a bit more time and memory efficient.


This approach is quite scalable, as you can see below. You can assign multiple variables as needed, some based on complex functions as well, and then set their dtypes as required.

df = pd.DataFrame({'A':[1,2,3,4]})

df = (df.assign(col1 = 'hello',                    #Define column based on series or broadcasting
                col2 = lambda x:x['A']**2,         #Define column based on existing columns
                col3 = lambda x:x['col2']/x['A'])  #Define column based on previously defined columns
        .astype({'col1':'category',
                 'col2':'float'}))

print(df)
print(df.dtypes)
   A   col1  col2  col3
0  1  hello   1.0   1.0
1  2  hello   4.0   2.0
2  3  hello   9.0   3.0
3  4  hello  16.0   4.0


A          int64
col1    category  #<-changed dtype
col2     float64  #<-changed dtype
col3     float64
dtype: object

1 Comment

Is df.assign(col="hello") not an intermediate dataframe where the column col is of dtype object?
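A quick check of that intermediate dtype (using the small df from the example above) suggests that the column is indeed object-typed until the astype step:

intermediate = df.assign(col="hello")  # before astype({'col': 'category'})
print(intermediate["col"].dtype)       # object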

This solution surely solves the first point; I am not sure about the second:

df['col'] = pd.Categorical(('hello' for i in range(len(df))))

Essentially

  • we first create a generator that yields 'hello' once for each record in df,
  • then we pass it to pd.Categorical to make it a categorical column, as sketched below.
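For reference, a minimal end-to-end version of this approach on a toy frame (the column names here are just for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

df['col'] = pd.Categorical(('hello' for i in range(len(df))))  # generator consumed by pd.Categorical

print(df['col'].dtype)           # category
print(df['col'].cat.categories)  # Index(['hello'], dtype='object')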

1 Comment

Not sure about the second point either in this case. Another flavour of this approach is df["col"] = pd.Categorical(itertools.repeat("hello", len(df))), although it is arguably somewhat long-winded.
