
I have a pandas.DataFrame df and would like to add a new column col with a single value "hello". I would like this column to be of dtype category with the single category "hello". I can do the following.

df["col"] = "hello"
df["col"] = df["col"].astype("category")
  1. Do I really need to write df["col"] three times in order to achieve this?
  2. After the first line I am worried that the intermediate dataframe df might take up a lot of space before the new column is converted to categorical. (The dataframe is rather large with millions of rows and the value "hello" is actually a much longer string.)

Are there any other straightforward, "short and snappy" ways of achieving this while avoiding the above issues?

An alternative solution is

df["col"] = pd.Categorical(itertools.repeat("hello", len(df)))

but it requires itertools and the use of len(df), and I am not sure what the memory usage is like under the hood.
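As a rough, illustrative way to compare the two steps, one can measure the intermediate object column against the converted categorical column with memory_usage(deep=True); the frame below is a made-up stand-in for the real df:

import pandas as pd

df = pd.DataFrame({"a": range(1_000_000)})

df["col"] = "hello"                                 # intermediate column of dtype object
object_bytes = df["col"].memory_usage(deep=True)    # deep=True counts one string per row

df["col"] = df["col"].astype("category")            # converted to categorical
category_bytes = df["col"].memory_usage(deep=True)  # small integer codes plus a single "hello"

print(object_bytes, category_bytes)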

  • df.assign() with a dict passed to df.astype() would be a scalable way to go. The first creates the new columns, and the second changes their dtypes, all in a stepwise manner in the same line of code. Check my answer for detailed examples.

3 Answers


We can explicitly build the Series of the correct size and type instead of implicitly doing so via __setitem__ then converting:

df['col'] = pd.Series('hello', index=df.index, dtype='category')

Sample Program:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

df['col'] = pd.Series('hello', index=df.index, dtype='category')

print(df)
print(df.dtypes)
print(df['col'].cat.categories)
   a    col
0  1  hello
1  2  hello
2  3  hello

a         int64
col    category
dtype: object

Index(['hello'], dtype='object')

2 Comments

Is pandas clever enough to fill the column with the code (i.e., integer value) of the categorical rather than filling it with the string, followed by an implicit astype step?
With this approach the dtype is a consideration throughout the entire process. There is no astype somewhere because the dtype is known, but there is a temporary array that is factorized to build the cat.codes. The benefit of this approach however, is that the 'hello' array is built once, factorized, and destroyed in favour of the categorical (similar to the way that reading in from a csv directly as categorical works). This also prevents multiple alignment steps with multiple reassignments, or creating unnecessary copies. You'd have to test out memory usage on your actual dataset.
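For what it's worth, a quick way to see what ends up stored is to inspect the codes of the resulting column; this is just an illustrative check using the sample df from the answer above:

s = pd.Series('hello', index=df.index, dtype='category')

print(s.cat.codes.dtype)          # int8: one small integer code per row
print(s.cat.categories)           # Index(['hello'], dtype='object'): the string is stored once
print(s.memory_usage(deep=True))  # roughly one byte per row plus the single 'hello' and the index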

A simple way to do this would be to use df.assign to create your new variable, then change its dtype to category using df.astype along with a dictionary of dtypes for the specific columns.

df = df.assign(col="hello").astype({'col':'category'})

df.dtypes
A         int64
col    category
dtype: object

That way you don't have to create a Series with the same length as the dataframe. You can just broadcast the input string directly, which would be a bit more time and memory efficient.


This approach is quite scalable, as you can see below. You can assign multiple variables as needed, some based on complex functions as well, and then set their dtypes as required.

df = pd.DataFrame({'A':[1,2,3,4]})

df = (df.assign(col1 = 'hello',                    #Define column based on series or broadcasting
                col2 = lambda x:x['A']**2,         #Define column based on existing columns
                col3 = lambda x:x['col2']/x['A'])  #Define column based on previously defined columns
        .astype({'col1':'category',
                 'col2':'float'}))

print(df)
print(df.dtypes)
   A   col1  col2  col3
0  1  hello   1.0   1.0
1  2  hello   4.0   2.0
2  3  hello   9.0   3.0
3  4  hello  16.0   4.0


A          int64
col1    category  #<-changed dtype
col2     float64  #<-changed dtype
col3     float64
dtype: object

1 Comment

Is df.assign(col="hello") not an intermediate dataframe where the column col is of dtype object?
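A quick check of that intermediate dtype (using the small df from the example above) suggests that the column is indeed object-typed until the astype step:

intermediate = df.assign(col="hello")  # before astype({'col': 'category'})
print(intermediate["col"].dtype)       # object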

This solution surely solves the first point; I am not sure about the second:

df['col'] = pd.Categorical(('hello' for i in range(len(df))))

Essentially

  • we first create a generator that yields 'hello' once for each record in df,
  • then we pass it to pd.Categorical to make it a categorical column, as sketched below.
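For reference, a minimal end-to-end version of this approach on a toy frame (the column names here are just for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

df['col'] = pd.Categorical(('hello' for i in range(len(df))))  # generator consumed by pd.Categorical

print(df['col'].dtype)           # category
print(df['col'].cat.categories)  # Index(['hello'], dtype='object')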

1 Comment

Not sure about the second point either in this case. Another flavour of this approach is df["col"] = pd.Categorical(itertools.repeat("hello", len(df))), although it is arguably somewhat long-winded.
