Combine two columns of text in pandas dataframe

Question

I have a dataframe that looks like

Year  quarter
2000       q2
2001       q3

How do I add a new column by combining these columns to get the following dataframe?

Year  quarter  period
2000       q2  2000q2
2001       q3  2001q3

Searchers: here's a similar question with more answers

– ᴍᴇʜᴏᴠ, Commented Oct 18, 2022 at 19:40

End genocide - save Gaza · Accepted Answer · 2021-11-11 11:20:27Z

1286

If both columns are strings, you can concatenate them directly:

df["period"] = df["Year"] + df["quarter"]

If one (or both) of the columns is not string typed, convert it (them) first:

df["period"] = df["Year"].astype(str) + df["quarter"]

Beware of NaNs when doing this!
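To make the NaN warning concrete, here is a minimal sketch. The sample frame and the choice of an empty string as the fill value are assumptions for illustration, not part of the original answer. A missing value on either side of + makes that row's result missing, so one option is to fill it before concatenating.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing quarter.
df = pd.DataFrame({"Year": ["2000", "2001"], "quarter": ["q2", np.nan]})

# Plain concatenation: the row with the missing quarter becomes NaN.
naive = df["Year"] + df["quarter"]

# Filling missing values first keeps the year in the result.
safe = df["Year"] + df["quarter"].fillna("")

print(naive)
print(safe)
```

Whether an empty string is the right replacement depends on what a missing quarter should mean in your data.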

If you need to join multiple string columns, you can use agg:

df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)

Where "-" is the separator.
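As a sketch of the agg approach when one column is not a string, assume Year holds integers (the sample data below is made up for illustration). Casting the selected columns with astype(str) first lets '-'.join work on every row.

```python
import pandas as pd

# Hypothetical data: Year is an integer column, quarter is a string column.
df = pd.DataFrame({"Year": [2000, 2001], "quarter": ["q2", "q3"]})

# str.join only accepts strings, so cast the selected columns first.
df["period"] = df[["Year", "quarter"]].astype(str).agg("-".join, axis=1)

print(df)
```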

edited Nov 11, 2021 at 11:20

End genocide - save Gaza

answered Oct 15, 2013 at 10:09

silvado

19 Comments

Heisenberg Over a year ago

Is it possible to add multiple columns together without typing out all the columns? Let's say add(dataframe.iloc[:, 0:10]) for example?

silvado Over a year ago

@Heisenberg That should be possible with the Python builtin sum.

c1c1c1 Over a year ago

@silvado could you please make an example for adding multiple columns? Thank you

Ozgur Ozturk Over a year ago

Be careful: you need to apply map(str) to all columns that are not strings in the first place. If quarter were a number, you would do dataframe["period"] = dataframe["Year"].map(str) + dataframe["quarter"].map(str); map just applies string conversion to all entries.

user2270655 Over a year ago

This solution can create problems if you have NaN values, so be careful.

Sunderam Dubey · Accepted Answer · 2022-05-30 17:37:07Z

Small data sets (< 150 rows)

[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

or slightly slower but more compact:

df.Year.str.cat(df.quarter)

Larger data sets (> 150 rows)

df['Year'].astype(str) + df['quarter']

UPDATE: Timing graph Pandas 0.23.4

Let's test it on 200K rows DF:

In [250]: df
Out[250]:
   Year quarter
0  2014      q1
1  2015      q2

In [251]: df = pd.concat([df] * 10**5)

In [252]: df.shape
Out[252]: (200000, 2)

UPDATE: new timings using Pandas 0.19.0

Timing without CPU/GPU optimization (sorted from fastest to slowest):

In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop

In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop

In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop

In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop

In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop

In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop

Timing using CPU/GPU optimization:

In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop

In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop

In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop

In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop

Answer contribution by @anton-vbr

@AntonProtopopov, i guess it's a mixture of two timings - one used CPU/GPU optimization, another one didn't. I've updated my answer and put both timing sets there...
This use of .sum() fails if all columns look like they could be integers (i.e. are string forms of integers). Instead, it seems pandas converts them back to numeric before summing!
@MaxU How did you go about the CPU/GPU optimization? Is that just a more powerful computer or is it something you did with code?

kepy97 · Accepted Answer · 2018-01-08 16:21:34Z

329

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)

Yields this dataframe

   Year quarter  period
0  2014      q1  2014q1
1  2015      q2  2015q2

This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).
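The generalization to a column slice can be sketched as follows. The third column, month, is a hypothetical addition for illustration, and every selected column is assumed to hold strings already.

```python
import pandas as pd

# Hypothetical frame with three string columns.
df = pd.DataFrame(
    {"Year": ["2014", "2015"], "quarter": ["q1", "q2"], "month": ["01", "04"]}
)

# Join the first three columns by position; ''.join needs string values.
df["period"] = df.iloc[:, 0:3].apply("".join, axis=1)

print(df)
```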

More information about the apply() method is available in the pandas documentation.

edited Jan 8, 2018 at 16:21

kepy97

answered Sep 11, 2015 at 17:36

Russ

11 Comments

DSM Over a year ago

lambda x: ''.join(x) is just ''.join, no?

DSM Over a year ago

@OzgurOzturk: the point is that the lambda part of the lambda x: ''.join(x) construction doesn't do anything; it's like using lambda x: sum(x) instead of just sum.

Max Ghenis Over a year ago

Confirmed same result when using ''.join, i.e.: df['period'] = df[['Year', 'quarter']].apply(''.join, axis=1).

John Strood Over a year ago

@Archie join takes only str instances in an iterable. Use a map to convert them all into str and then use join.

Manjul Over a year ago

'-'.join(x.map(str))

G. Sliepen · Accepted Answer · 2018-09-23 08:49:07Z

The method cat() of the .str accessor works really well for this:

>>> import pandas as pd
>>> df = pd.DataFrame([["2014", "q1"], 
...                    ["2015", "q3"]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014      q1
1  2015      q3
>>> df['Period'] = df.Year.str.cat(df.Quarter)
>>> print(df)
   Year Quarter  Period
0  2014      q1  2014q1
1  2015      q3  2015q3

cat() even allows you to add a separator so, for example, suppose you only have integers for year and period, you can do this:

>>> import pandas as pd
>>> df = pd.DataFrame([[2014, 1],
...                    [2015, 3]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014       1
1  2015       3
>>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
>>> print(df)
   Year Quarter  Period
0  2014       1  2014q1
1  2015       3  2015q3

Joining multiple columns is just a matter of passing either a list of series or a dataframe containing all but the first column as a parameter to str.cat() invoked on the first column (Series):

>>> df = pd.DataFrame(
...     [['USA', 'Nevada', 'Las Vegas'],
...      ['Brazil', 'Pernambuco', 'Recife']],
...     columns=['Country', 'State', 'City'],
... )
>>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
>>> print(df)
  Country       State       City                   AllTogether
0     USA      Nevada  Las Vegas      USA - Nevada - Las Vegas
1  Brazil  Pernambuco     Recife  Brazil - Pernambuco - Recife

Do note that if your pandas dataframe/series has null values, you need to include the parameter na_rep to replace the NaN values with a string, otherwise the combined column will default to NaN.
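A minimal sketch of the na_rep behaviour described above. The sample data and the "?" placeholder are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing quarter.
df = pd.DataFrame({"Year": ["2014", "2015"], "Quarter": ["q1", np.nan]})

# Without na_rep, a missing value on either side makes the result missing.
without_rep = df["Year"].str.cat(df["Quarter"])

# With na_rep, the missing value is replaced by the given string before joining.
with_rep = df["Year"].str.cat(df["Quarter"], na_rep="?")

print(without_rep)
print(with_rep)
```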

This seems way better (maybe more efficient, too) than lambda or map; also it just reads most cleanly.
Which version of pandas are you using? I get ValueError: Did you mean to supply a sep keyword? in pandas-0.23.4. Thanks!
@QinqingLiu, I retested these with pandas-0.23.4 and they seem to work. The sep parameter is only necessary if you intend to separate the parts of the concatenated string. If you get an error, please show us your failing example.
@LeoRochael Can I use a newline instead of '-' with the sep keyword?
@arun-menon: I don't see why not. In the last example above you could do .str.cat(df[['State', 'City']], sep ='\n'), for example. I haven't tested it yet, though.

Bill Gale · Accepted Answer · 2016-03-16 16:57:26Z

Use of a lambda function, this time with string.format().

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
print(df)
# Access row values by column label; positional access such as x[0] is deprecated in recent pandas.
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}{}'.format(x['Year'], x['Quarter']), axis=1)
print(df)

   Year Quarter
0  2014      q1
1  2015      q2
   Year Quarter YearQuarter
0  2014      q1      2014q1
1  2015      q2      2015q2

This allows you to work with non-strings and reformat values as needed.

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
print(df.dtypes)
print(df)

# Access row values by column label; positional access such as x[0] is deprecated in recent pandas.
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}q{}'.format(x['Year'], x['Quarter']), axis=1)
print(df)

Year       object
Quarter     int64
dtype: object
   Year  Quarter
0  2014        1
1  2015        2
   Year  Quarter YearQuarter
0  2014        1      2014q1
1  2015        2      2015q2

This solution worked great for my needs since I had to do some formatting. df_game['formatted_game_time'] = df_game[['wday', 'month', 'day', 'year', 'time']].apply(lambda x: '{}, {}/{}/{} @ {}'.format(x[0], x[1], x[2], x[3], x[4]), axis=1)
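To illustrate the "reformat values as needed" point, here is a small sketch that uses a format spec to zero-pad the quarter. The output pattern and the integer sample data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: both columns are integers.
df = pd.DataFrame({"Year": [2014, 2015], "Quarter": [1, 2]})

# Access values by column label; the :02d spec pads the quarter to two digits.
df["YearQuarter"] = df.apply(
    lambda row: "{}-Q{:02d}".format(row["Year"], int(row["Quarter"])), axis=1
)

print(df)
```

Row-wise apply is flexible but slow on large frames, as the timings elsewhere on this page suggest.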

geher · Accepted Answer · 2019-07-30 10:38:10Z

25

generalising to multiple columns, why not:

columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)

answered Jul 30, 2019 at 10:38

geher

3 Comments

Odisseo Over a year ago

Looks cool but what if I want to add a delimiter between the strings, like '-'?

Dd H Over a year ago

@Odisseo maybe create a delimiter column?

Eamonn Kenny Over a year ago

This is the correct solution. The highly voted solutions don't work because they assume that every column contains strings, which is not generally true.

buhtz · Accepted Answer · 2021-06-01 12:12:15Z

19

You can use lambda:

combine_lambda = lambda x: '{}{}'.format(x.Year, x.quarter)

And then use it with creating the new column:

df['period'] = df.apply(combine_lambda, axis = 1)

edited Jun 1, 2021 at 12:12

buhtz

answered Feb 28, 2021 at 16:25

Pobaranchuk

cs95 · Accepted Answer · 2019-01-24 09:38:44Z

15

Let us suppose your dataframe is df with columns Year and Quarter.

import pandas as pd
df = pd.DataFrame({'Quarter':'q1 q2 q3 q4'.split(), 'Year':'2000'})

Suppose we want to see the dataframe:

df
>>>  Quarter    Year
   0    q1      2000
   1    q2      2000
   2    q3      2000
   3    q4      2000

Finally, concatenate the Year and the Quarter as follows.

df['Period'] = df['Year'] + ' ' + df['Quarter']

You can now print df to see the resulting dataframe.

df
>>>  Quarter    Year    Period
    0   q1      2000    2000 q1
    1   q2      2000    2000 q2
    2   q3      2000    2000 q3
    3   q4      2000    2000 q4

If you do not want the space between the year and quarter, simply remove it:

df['Period'] = df['Year'] + df['Quarter']

edited Jan 24, 2019 at 9:38

cs95

answered Jul 22, 2018 at 5:20

Samuel Nde

5 Comments

Stuber Over a year ago

Specified as strings df['Period'] = df['Year'].map(str) + df['Quarter'].map(str)

Karl Baker Over a year ago

I'm getting TypeError: Series cannot perform the operation + when I run either df2['filename'] = df2['job_number'] + '.' + df2['task_number'] or df2['filename'] = df2['job_number'].map(str) + '.' + df2['task_number'].map(str).

Karl Baker Over a year ago

However, df2['filename'] = df2['job_number'].astype(str) + '.' + df2['task_number'].astype(str) did work.

Samuel Nde Over a year ago

@KarlBaker, I think you did not have strings in your input. But I am glad you figured that out. If you look at the example dataframe that I created above, you will see that all the columns are strings.

AMC Over a year ago

What exactly is the point of this solution, since it's identical to the top answer?

Anton Protopopov · Accepted Answer · 2015-11-25 10:25:15Z

14

Although @silvado's answer is good, it will be faster if you change df.map(str) to df.astype(str):

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

In [131]: %timeit df["Year"].map(str)
10000 loops, best of 3: 132 us per loop

In [132]: %timeit df["Year"].astype(str)
10000 loops, best of 3: 82.2 us per loop
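Outside IPython, a comparison along these lines can be sketched with the standard timeit module. The frame size and repeat count are arbitrary assumptions, and the timings will differ by machine and pandas version, so no figures are claimed here.

```python
import timeit

import pandas as pd

# Hypothetical integer column, repeated to give the timer something to measure.
df = pd.DataFrame({"Year": [2014, 2015] * 1000})

# Both conversions yield the same strings for integer input.
via_map = df["Year"].map(str)
via_astype = df["Year"].astype(str)

# Measure locally rather than relying on published timings.
t_map = timeit.timeit(lambda: df["Year"].map(str), number=20)
t_astype = timeit.timeit(lambda: df["Year"].astype(str), number=20)
print(f"map(str): {t_map:.4f}s  astype(str): {t_astype:.4f}s")
```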

answered Nov 25, 2015 at 10:25

Anton Protopopov

Pedro M Duarte · Accepted Answer · 2017-04-03 17:05:10Z

Here is an implementation that I find very versatile:

In [1]: import pandas as pd 

In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
   ...:                    [1, 'fox', 'jumps', 'over'], 
   ...:                    [2, 'the', 'lazy', 'dog']],
   ...:                   columns=['c0', 'c1', 'c2', 'c3'])

In [3]: def str_join(df, sep, *cols):
   ...:     from functools import reduce
   ...:     return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep), 
   ...:                   [df[col] for col in cols])
   ...: 

In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')

In [5]: df
Out[5]: 
   c0   c1     c2     c3                cat
0   0  the  quick  brown  0-the-quick-brown
1   1  fox  jumps   over   1-fox-jumps-over
2   2  the   lazy    dog     2-the-lazy-dog

FYI: This method works great with Python 3, but gives me trouble in Python 2.

Colin Wang · Accepted Answer · 2018-01-09 02:13:45Z

A more efficient approach is:

def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)

and here is a time test:

import numpy as np
import pandas as pd

from time import time


def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)


def concat_df_str2(df):
    """ run time: 5.2758s """
    return df.astype(str).sum(axis=1)


def concat_df_str3(df):
    """ run time: 5.0076s """
    df = df.astype(str)
    return df[0] + df[1] + df[2] + df[3] + df[4] + \
           df[5] + df[6] + df[7] + df[8] + df[9]


def concat_df_str4(df):
    """ run time: 7.8624s """
    return df.astype(str).apply(lambda x: ''.join(x), axis=1)


def main():
    df = pd.DataFrame(np.zeros(1000000).reshape(100000, 10))
    df = df.astype(int)

    time1 = time()
    df_en = concat_df_str4(df)
    print('run time: %.4fs' % (time() - time1))
    print(df_en.head(10))


if __name__ == '__main__':
    main()

Finally, when sum (concat_df_str2) is used, the result is not a simple concatenation; it is converted to an integer.

+1 Neat solution, this also allows us to specify the columns: e.g. df.values[:, 0:3] or df.values[:, [0,2]].
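The column selection mentioned in the comment above can be sketched like this. The column names and data are hypothetical, and map(str, row) is used here instead of row.astype(str) so that mixed dtypes are handled element by element.

```python
import pandas as pd

# Hypothetical frame with mixed dtypes.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": ["u", "v"]})

# Pick columns 0 and 2 by position, then join each row after casting to str.
values = df.to_numpy()[:, [0, 2]]
joined = pd.Series(["".join(map(str, row)) for row in values], index=df.index)

print(joined)
```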

Anton vBR · Accepted Answer · 2019-02-20 17:07:44Z

Using zip could be even quicker:

df["period"] = [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

Graph:

import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

myfuncs = {
"df['Year'].astype(str) + df['quarter']":
    lambda: df['Year'].astype(str) + df['quarter'],
"df['Year'].map(str) + df['quarter']":
    lambda: df['Year'].map(str) + df['quarter'],
"df.Year.str.cat(df.quarter)":
    lambda: df.Year.str.cat(df.quarter),
"df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)":
    lambda: df.loc[:, ['Year','quarter']].astype(str).sum(axis=1),
"df[['Year','quarter']].astype(str).sum(axis=1)":
    lambda: df[['Year','quarter']].astype(str).sum(axis=1),
    "df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)":
    lambda: df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1),
    "[''.join(i) for i in zip(dataframe['Year'].map(str),dataframe['quarter'])]":
    lambda: [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
}

d = defaultdict(dict)
step = 10
cont = True
while cont:
    lendf = len(df); print(lendf)
    for k,v in myfuncs.items():
        iters = 1
        t = 0
        while t < 0.2:
            ts = timeit.repeat(v, number=iters, repeat=3)
            t = min(ts)
            iters *= 10
        d[k][lendf] = t/iters
        if t > 2: cont = False
    df = pd.concat([df]*step)

pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()

Markus Dutschke · Accepted Answer · 2019-03-15 16:37:59Z

7

This solution uses an intermediate step that compresses two columns of the DataFrame into a single column containing a list of the values. It works not only for strings but for all kinds of column dtypes, provided the values are cast to strings before joining.

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['list']=df[['Year','quarter']].values.tolist()
df['period']=df['list'].apply(''.join)
print(df)

Result:

   Year quarter        list  period
0  2014      q1  [2014, q1]  2014q1
1  2015      q2  [2015, q2]  2015q2

answered Mar 15, 2019 at 16:37

Markus Dutschke

4 Comments

Lohith Arcot Over a year ago

looks like other dtypes won't work. I got a TypeError: sequence item 1: expected str instance, float found

Markus Dutschke Over a year ago

apply first a cast to string. The join operation works only for strings

Good Fit Over a year ago

This solution won't work to combine two columns with different dtype, see my answer for the correct solution for such case.

Bill Over a year ago

Instead of .apply(''.join) why not use .str.join('')?

Good Fit · Accepted Answer · 2019-05-16 13:19:03Z

Here is my summary of the above solutions to concatenate / combine two columns with int and str value into a new column, using a separator between the values of columns. Three solutions work for this purpose.

# Be cautious about the separator: a string literal that is not closed properly (for example, one ending in an unescaped backslash) raises "SyntaxError: EOL while scanning string literal".

separator = "&&" 

# Series.str.cat() does not work directly when one column holds int values: the .str accessor raises an AttributeError on non-string data, so cast with astype(str) first.

df["period"] = df["Year"].map(str) + separator + df["quarter"]
df["period"] = df[['Year','quarter']].apply(lambda x : '{} && {}'.format(x['Year'], x['quarter']), axis=1)
df["period"] = df.apply(lambda x: f'{x["Year"]} && {x["quarter"]}', axis=1)

At least your first solution does not work (any more?). I use: df["period"] = (df["Year"].astype(str) + separator + df["quarter"].astype(str)).astype('category')

leo · Accepted Answer · 2020-08-18 04:13:36Z

5

my take....

listofcols = ['col1','col2','col3']
df['combined_cols'] = ''

for column in listofcols:
    df['combined_cols'] = df['combined_cols'] + ' ' + df[column]

answered Aug 18, 2020 at 4:13

leo

1 Comment

annedroiid Over a year ago

You should add an explanation to this code snippet. Adding only code answers encourages people to use code they don't understand and doesn't help them learn.

Ax_ · Accepted Answer · 2023-01-07 14:12:30Z

3

When combining string columns with the addition operator +, if any value in a row is NaN then the result for that row is NaN, so use fillna():

df["join"] = "some" + df["col"].fillna(df["val_if_nan"])

answered Jan 7, 2023 at 14:12

Ax_

Ted Petrou · Accepted Answer · 2017-10-25 03:21:25Z

2

As many have mentioned previously, you must convert each column to string and then use the plus operator to combine two string columns. You can get a large performance improvement by using NumPy.

%timeit df['Year'].values.astype(str) + df.quarter
71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['Year'].astype(str) + df['quarter']
565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

answered Oct 25, 2017 at 3:21

Ted Petrou

2 Comments

Karl Baker Over a year ago

I'd like to use the numpyified version but I'm getting an error: Input: df2['filename'] = df2['job_number'].values.astype(str) + '.' + df2['task_number'].values.astype(str) --> Output: TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U21') dtype('<U21') dtype('<U21'). Both job_number and task_number are ints.

AbdulRehmanLiaqat Over a year ago

That's because you are combining two numpy arrays. It works if you combine an numpy array with pandas Series. as df['Year'].values.astype(str) + df.quarter

Sergey · Accepted Answer · 2018-12-01 10:55:10Z

2

One can use the assign method of DataFrame:

df= (pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']}).
  assign(period=lambda x: x.Year+x.quarter ))

answered Dec 1, 2018 at 10:55

Sergey

Marc Torrellas Socastro · Accepted Answer · 2021-12-04 12:43:11Z

1

Similar to @geher's answer but with any separator you like:

SEP = " "
INPUT_COLUMNS_WITH_SEP = ",sep,".join(INPUT_COLUMNS).split(",")

df.assign(sep=SEP)[INPUT_COLUMNS_WITH_SEP].sum(axis=1)

answered Dec 4, 2021 at 12:43

Marc Torrellas Socastro

cottontail · Accepted Answer · 2023-12-06 03:43:30Z

DataFrame.eval()

For a little terse code, we can use .eval(). We can concatenate two (or more) string dtype columns horizontally using the + operator as follows.

df = pd.DataFrame({'A': ['x', 'y', 'z'], 'B': ['1', '2', '3']}, dtype='string')
df['C'] = df.eval("A + B")

You can even include the new column assignment inside the evaluated expression (which also opens up the possibility to do it in-place).

df = df.eval('C = A + B')
df.eval('C = A + B', inplace=True)

eval doesn't allow a similarly terse way to add delimiters; however, we can call str.cat() (which has the sep= kwarg) inside the numerical expression.

df = df.eval("D = A.str.cat([B, C], '_')")

which produces the following output (where columns A, B and C are concatenated horizontally):

   A  B   C       D
0  x  1  x1  x_1_x1
1  y  2  y2  y_2_y2
2  z  3  z3  z_3_z3

When to use vectorized concat vs explicit loop

There are three major string concatenation methods given on this page:

vectorized +: df['A'] + df['B']
string formatting in a loop (N.B. converting the columns into lists makes the loop faster):
```
[f"{x}{y}" for x, y in zip(df['A'].tolist(), df['B'].tolist())]
```
vectorized str.cat(): df['A'].str.cat(df['B'])

As the following figure shows, vectorized concatenation (via +) is fastest if the strings being concatenated are short such as in the OP. However, if the strings are long (e.g. each cell contains a tweet or a book excerpt), then an explicit Python loop (use f-string in a list comprehension) is the fastest.
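Before comparing speed, it can help to confirm that the three methods agree. The sketch below uses a small made-up frame of string columns and checks only the results, not the timings.

```python
import pandas as pd

# Hypothetical frame with two string columns.
df = pd.DataFrame({"A": ["x", "y", "z"], "B": ["1", "2", "3"]})

# Vectorized + on the two columns.
plus = df["A"] + df["B"]

# Explicit Python loop over lists, using an f-string.
loop = [f"{x}{y}" for x, y in zip(df["A"].tolist(), df["B"].tolist())]

# Vectorized str.cat().
cat = df["A"].str.cat(df["B"])

print(plus.tolist(), loop, cat.tolist())
```

Since the outputs match, the choice between them comes down to speed and readability for your string lengths.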

Code to reproduce the above figure:

import matplotlib.pyplot as plt
import pandas as pd
import perfplot

fig, axs = plt.subplots(1, 2, figsize=(15,5))
for ax, s, title in zip(axs, ('a'*1000, 'a'), ("long", "short")):
    plt.sca(ax)
    perfplot.plot(
        kernels=[lambda df: df.assign(C=df['A'] + '_' + df['B']),
                 lambda df: df.assign(C=df['A'].str.cat(df['B'], '_')),
                 lambda df: df.assign(C=[f"{x}_{y}" for x,y in zip(df['A'].tolist(), df['B'].tolist())])],
        n_range=[2**k for k in range(18)],
        setup=lambda n: pd.DataFrame({'A': [s]*n, 'B': [s]*n}),
        labels=["df['A'] + df['B']", "str.cat", "list comp"],
        xlabel="DataFrame length",
        title=f"When the strings are {title}", 
        equality_check=pd.DataFrame.equals)
fig.tight_layout()
fig.savefig("string_concat_perf.png")

BMW · Accepted Answer · 2017-07-21 20:26:34Z

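from functools import reduce

import numpy as np


def madd(x):
    """Performs element-wise string concatenation with multiple input arrays.

    Args:
        x: iterable of np.array.

    Returns: np.array.
    """
    # Cast every array to a NumPy string dtype so the result does not depend
    # on the input dtype; building a new list avoids mutating the caller's.
    arrays = [np.asarray(arr).astype(str) for arr in x]
    # np.char.add is the public name for np.core.defchararray.add.
    return reduce(np.char.add, arrays)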

For example:

data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
df['period'] = madd([df[col].values for col in ['Year', 'quarter']])

df

    Year    quarter period
0   2000    q1  2000q1
1   2000    q2  2000q2
2   2000    q3  2000q3
3   2000    q4  2000q4

Keiku · Accepted Answer · 2018-02-12 07:54:17Z

0

Use .combine_first.

df['Period'] = df['Year'].combine_first(df['Quarter'])

edited Feb 12, 2018 at 7:54

Keiku

answered Feb 10, 2018 at 4:01

Abul

1 Comment

Steve G Over a year ago

This is not correct. .combine_first will result in either the value from 'Year' being stored in 'Period', or, if it is Null, the value from 'Quarter'. It will not concatenate the two strings and store them in 'Period'.
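The difference described in this comment can be seen in a short sketch. The sample data, with one missing Year, is an assumption chosen to make the fallback visible.

```python
import pandas as pd

# Hypothetical data with a missing Year in the second row.
df = pd.DataFrame({"Year": ["2014", None], "Quarter": ["q1", "q2"]})

# combine_first keeps Year where it is present and falls back to Quarter.
filled = df["Year"].combine_first(df["Quarter"])

# Concatenation, by contrast, joins both values in each row.
joined = df["Year"].fillna("") + df["Quarter"]

print(filled.tolist())
print(joined.tolist())
```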

user28755543 · Accepted Answer · 2024-12-12 17:09:27Z

0

Using .agg with columns that are not string

Combining a few answers in this thread, I found this worked quite well when encountering columns that are not strings whilst avoiding slow lambda functions and allowing delimiters.

df['Period'] = df[['Year', 'Quarter']].astype(str).agg('-'.join, axis=1)

This can also be used to combine a large number of columns

cols = ['Year', 'Quarter', 'Week', 'Day', 'Hour']
df['Period'] = df[cols].astype(str).agg('-'.join, axis=1)

edited Dec 12, 2024 at 17:09

answered Dec 12, 2024 at 16:55

user28755543
