
This is a little convoluted, so I'll just show my data.

I constructed the following DataFrame:

      Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2
867    1019.11   1027.64  1022.68   XYZ   YYY
873    1018.04   1027.58  1022.81   XYZ   ZZZ

I want to select values from columns based on the Select1 and Select2 strings by matching on part of a column name. In the first row, this would be

1019.11 and 1027.64 (columns Mid_XYZ and Mid_YYY), because Select1 holds the string XYZ and Select2 holds the string YYY,

whereas in the second row it would be

1018.04 and 1022.81 (columns Mid_XYZ and Mid_ZZZ).

Later, I plan to store the sum of those values in a new column, so the DataFrame will look like this:

      Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2 Sum
867    1019.11   1027.64  1022.68   XYZ   YYY   2046.75
873    1018.04   1027.58  1022.81   XYZ   ZZZ   2040.85

I could change the column names so that exact matching works, but there should be some solution with regex? I know about df.filter(regex='XYZ'), but how can I do it row-wise?
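For reference, a minimal reconstruction of this frame (the index values 867 and 873 are taken from the table above):

import pandas as pd

df = pd.DataFrame({'Mid_XYZ': [1019.11, 1018.04],
                   'Mid_YYY': [1027.64, 1027.58],
                   'Mid_ZZZ': [1022.68, 1022.81],
                   'Select1': ['XYZ', 'XYZ'],
                   'Select2': ['YYY', 'ZZZ']},
                  index=[867, 873])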

4 Answers


Use the following vectorized solution:

import numpy as np

# strip the 'Mid_' prefix so the column names can be compared with Select1/Select2
clean = df.columns.str.replace('^Mid_', '', regex=True)

# for each row, find the positional index of the column named by Select1/Select2
s1 = np.argmax(clean.values == df['Select1'].values[:, None], axis=1)
s2 = np.argmax(clean.values == df['Select2'].values[:, None], axis=1)

# pick the matched values by (row, column) position and sum them
df['Sum'] = df.values[np.arange(len(s1)), s1] + df.values[np.arange(len(s2)), s2]

print(df)

Output

     Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2      Sum
867  1019.11  1027.64  1022.68     XYZ     YYY  2046.75
873  1018.04  1027.58  1022.81     XYZ     ZZZ  2040.85
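As a side note on how the matching step works (a sketch using the frame from the question, with df and clean as above): comparing the cleaned column names against each row's Select value gives a boolean matrix, and argmax returns the position of the first True per row.

print(clean.values == df['Select2'].values[:, None])
# [[False  True False False False]
#  [False False  True False False]]
print(np.argmax(clean.values == df['Select2'].values[:, None], axis=1))
# [1 2]  -> Mid_YYY for row 867, Mid_ZZZ for row 873

One caveat: argmax returns 0 when a row has no True at all, so this assumes every Select value actually corresponds to a Mid_ column.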


If you have:

import pandas as pd

df = pd.DataFrame.from_dict({'Mid_XYZ': [1019.11, 1018.04],
                             'Mid_YYY': [1027.64, 1027.58],
                             'Mid_ZZZ': [1022.68, 1022.81],
                             'Select1': ['XYZ', 'XYZ'],
                             'Select2': ['YYY', 'ZZZ']})

You can do:

df['Sum']=df.apply(lambda row:
                   row['Mid_'+row['Select1']]+\
                   row['Mid_'+row['Select2']],
                   axis=1)

df will be:

   Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2      Sum
0  1019.11  1027.64  1022.68     XYZ     YYY  2046.75
1  1018.04  1027.58  1022.81     XYZ     ZZZ  2040.85

If you don't like lambda functions, you can achieve the same result by defining a named function:

def sumfunc(row):
    return row['Mid_'+row['Select1']]+row['Mid_'+row['Select2']]

Then:

df['Sum']=df.apply(sumfunc,axis=1)
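As a middle ground between apply and the fully vectorized answer (a sketch that is not from the original answers, reusing the df built above), a plain list comprehension does the same row-wise lookup without apply's per-row overhead:

# look up each row's two matching columns by label and sum them
df['Sum'] = [df.at[i, 'Mid_' + s1] + df.at[i, 'Mid_' + s2]
             for i, s1, s2 in zip(df.index, df['Select1'], df['Select2'])]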

2 Comments

This answer is good, but it isn't vectorized... voted up anyway :)
Thanks! I also voted up the vectorized answer. When the DataFrames aren't incredibly large, I prefer readability over vectorization, and I find this solution more readable. Of course, this is just an opinion :)

In addition to @Dani Mesejo's answer, here is a slightly faster and more straightforward implementation using NumPy's built-in where.

My implementation is vec2:

import numpy as np

def vec1(df):
    # @Dani Mesejo's version: argmax over the boolean match matrix
    clean = df.columns.str.replace('^Mid_', '', regex=True)
    s1 = np.argmax(clean.values == df['Select1'].values[:, None], axis=1)
    s2 = np.argmax(clean.values == df['Select2'].values[:, None], axis=1)
    # pick the matched values by (row, column) position and sum them
    df['Sum'] = df.values[np.arange(len(s1)), s1] + df.values[np.arange(len(s2)), s2]
    return df

def vec2(df):
    # np.where already returns (row, column) index pairs, so no arange is needed
    clean = df.columns.str.replace('^Mid_', '', regex=True)
    idx1 = np.where(clean.values == df['Select1'].values[:, None])
    idx2 = np.where(clean.values == df['Select2'].values[:, None])
    df['Sum'] = df.values[idx1] + df.values[idx2]
    return df

Here is the timing comparison:

My implementation:

%timeit vec2(df) : 388 µs ± 3.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@Dani Mesejo's:

%timeit vec1(df) : 405 µs ± 6.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
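One caveat not stated in the original answer: the np.where approach only lines up correctly when every Select value matches exactly one column, because the returned index arrays must contain one entry per row; vec1's argmax has the related quirk of silently picking column 0 when a row has no match. A quick usage sketch on the question's frame (df built as in the question; .copy() keeps the input unchanged):

print(vec2(df.copy()))
#      Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2      Sum
# 867  1019.11  1027.64  1022.68     XYZ     YYY  2046.75
# 873  1018.04  1027.58  1022.81     XYZ     ZZZ  2040.85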



Another solution using melt and concat:

cols = ['Select1', 'Select2']
df1 = df.melt(id_vars=cols, ignore_index=False)
# keep only the rows whose melted 'variable' matches Select1/Select2, then sum per original index
df['Sum'] = (pd.concat([df1[('Mid_' + df1[col]) == df1['variable']]
                        for col in cols])
             .groupby(level=0)['value'].sum())  # sort=False can also be passed to groupby for a small extra speed boost
df
Out[1]: 
     Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2      Sum
867  1019.11  1027.64  1022.68     XYZ     YYY  2046.75
873  1018.04  1027.58  1022.81     XYZ     ZZZ  2040.85
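To see why grouping by level=0 works, it helps to look at the intermediate melted frame, which repeats the original index once per Mid_ column (a sketch of roughly what df1 contains for the question's data):

print(df1)
#     Select1 Select2 variable    value
# 867     XYZ     YYY  Mid_XYZ  1019.11
# 873     XYZ     ZZZ  Mid_XYZ  1018.04
# 867     XYZ     YYY  Mid_YYY  1027.64
# 873     XYZ     ZZZ  Mid_YYY  1027.58
# 867     XYZ     YYY  Mid_ZZZ  1022.68
# 873     XYZ     ZZZ  Mid_ZZZ  1022.81

The list comprehension keeps only the rows whose variable equals 'Mid_' + Select1 or 'Mid_' + Select2, so each original index label is left with exactly two rows, and groupby(level=0)['value'].sum() adds them back up per label.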

