
This is a little convoluted, so I'll just show my data.

I constructed the following DataFrame:

      Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2
867    1019.11   1027.64  1022.68   XYZ   YYY
873    1018.04   1027.58  1022.81   XYZ   ZZZ

I want to select values from columns based on the Select1 and Select2 strings by matching on part of a column name. In the first row, this would be

1019.11 and 1027.64 (columns Mid_XYZ and Mid_YYY), because Select1 holds the string XYZ and Select2 holds the string YYY,

whereas in the second row it would be

1018.04 and 1022.81 (columns Mid_XYZ and Mid_ZZZ).

Later, I plan to store the sum of those values in a new column, so the DataFrame will look like this:

      Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2 Sum
867    1019.11   1027.64  1022.68   XYZ   YYY   2046.75
873    1018.04   1027.58  1022.81   XYZ   ZZZ   2040.85

I could change the column names so that exact matching works, but there should be some solution with regex? I know about df.filter(regex='XYZ'), but how can I do it row-wise?
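For reference, a minimal reconstruction of this frame (the index values 867 and 873 are taken from the table above):

import pandas as pd

df = pd.DataFrame({'Mid_XYZ': [1019.11, 1018.04],
                   'Mid_YYY': [1027.64, 1027.58],
                   'Mid_ZZZ': [1022.68, 1022.81],
                   'Select1': ['XYZ', 'XYZ'],
                   'Select2': ['YYY', 'ZZZ']},
                  index=[867, 873])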

4 Answers


Use the following vectorized solution:

import numpy as np

# strip the 'Mid_' prefix so the column names can be compared with Select1/Select2
clean = df.columns.str.replace('^Mid_', '', regex=True)

# for each row, find the positional index of the column named by Select1/Select2
s1 = np.argmax(clean.values == df['Select1'].values[:, None], axis=1)
s2 = np.argmax(clean.values == df['Select2'].values[:, None], axis=1)

# pick the matched values by (row, column) position and sum them
df['Sum'] = df.values[np.arange(len(s1)), s1] + df.values[np.arange(len(s2)), s2]

print(df)

Output

     Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2      Sum
867  1019.11  1027.64  1022.68     XYZ     YYY  2046.75
873  1018.04  1027.58  1022.81     XYZ     ZZZ  2040.85
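As a side note on how the matching step works (a sketch using the frame from the question, with df and clean as above): comparing the cleaned column names against each row's Select value gives a boolean matrix, and argmax returns the position of the first True per row.

print(clean.values == df['Select2'].values[:, None])
# [[False  True False False False]
#  [False False  True False False]]
print(np.argmax(clean.values == df['Select2'].values[:, None], axis=1))
# [1 2]  -> Mid_YYY for row 867, Mid_ZZZ for row 873

One caveat: argmax returns 0 when a row has no True at all, so this assumes every Select value actually corresponds to a Mid_ column.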


If you have:

import pandas as pd

df = pd.DataFrame.from_dict({'Mid_XYZ': [1019.11, 1018.04],
                             'Mid_YYY': [1027.64, 1027.58],
                             'Mid_ZZZ': [1022.68, 1022.81],
                             'Select1': ['XYZ', 'XYZ'],
                             'Select2': ['YYY', 'ZZZ']})

You can do:

df['Sum']=df.apply(lambda row:
                   row['Mid_'+row['Select1']]+\
                   row['Mid_'+row['Select2']],
                   axis=1)

df will be:

   Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2      Sum
0  1019.11  1027.64  1022.68     XYZ     YYY  2046.75
1  1018.04  1027.58  1022.81     XYZ     ZZZ  2040.85

If you don't like lambda functions, you can achieve the same result by defining a named function:

def sumfunc(row):
    return row['Mid_'+row['Select1']]+row['Mid_'+row['Select2']]

Then:

df['Sum']=df.apply(sumfunc,axis=1)
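As a middle ground between apply and the fully vectorized answer (a sketch that is not from the original answers, reusing the df built above), a plain list comprehension does the same row-wise lookup without apply's per-row overhead:

# look up each row's two matching columns by label and sum them
df['Sum'] = [df.at[i, 'Mid_' + s1] + df.at[i, 'Mid_' + s2]
             for i, s1, s2 in zip(df.index, df['Select1'], df['Select2'])]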

2 Comments

This answer is good, but it isn't vectorized... voted up anyway :)
Thanks! I also voted up the vectorized answer. When the DataFrames aren't incredibly large, I prefer readability over vectorization, and I find this solution more readable. Of course, this is just an opinion :)

In addition to @Dani Mesejo's answer, here is a slightly faster and more straightforward implementation using NumPy's built-in where.

My implementation is vec2:

import numpy as np

def vec1(df):
    # @Dani Mesejo's version: argmax over the boolean match matrix
    clean = df.columns.str.replace('^Mid_', '', regex=True)
    s1 = np.argmax(clean.values == df['Select1'].values[:, None], axis=1)
    s2 = np.argmax(clean.values == df['Select2'].values[:, None], axis=1)
    # pick the matched values by (row, column) position and sum them
    df['Sum'] = df.values[np.arange(len(s1)), s1] + df.values[np.arange(len(s2)), s2]
    return df

def vec2(df):
    # np.where already returns (row, column) index pairs, so no arange is needed
    clean = df.columns.str.replace('^Mid_', '', regex=True)
    idx1 = np.where(clean.values == df['Select1'].values[:, None])
    idx2 = np.where(clean.values == df['Select2'].values[:, None])
    df['Sum'] = df.values[idx1] + df.values[idx2]
    return df

Here is the timing comparison:

My implementation:

%timeit vec2(df) : 388 µs ± 3.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@Dani Mesejo's:

%timeit vec1(df) : 405 µs ± 6.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
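One caveat not stated in the original answer: the np.where approach only lines up correctly when every Select value matches exactly one column, because the returned index arrays must contain one entry per row; vec1's argmax has the related quirk of silently picking column 0 when a row has no match. A quick usage sketch on the question's frame (df built as in the question; .copy() keeps the input unchanged):

print(vec2(df.copy()))
#      Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2      Sum
# 867  1019.11  1027.64  1022.68     XYZ     YYY  2046.75
# 873  1018.04  1027.58  1022.81     XYZ     ZZZ  2040.85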



Another solution using melt and concat:

cols = ['Select1', 'Select2']
df1 = df.melt(id_vars=cols, ignore_index=False)
# keep only the rows whose melted 'variable' matches Select1/Select2, then sum per original index
df['Sum'] = (pd.concat([df1[('Mid_' + df1[col]) == df1['variable']]
                        for col in cols])
             .groupby(level=0)['value'].sum())  # sort=False can also be passed to groupby for a small extra speed boost
df
Out[1]: 
     Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2      Sum
867  1019.11  1027.64  1022.68     XYZ     YYY  2046.75
873  1018.04  1027.58  1022.81     XYZ     ZZZ  2040.85
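To see why grouping by level=0 works, it helps to look at the intermediate melted frame, which repeats the original index once per Mid_ column (a sketch of roughly what df1 contains for the question's data):

print(df1)
#     Select1 Select2 variable    value
# 867     XYZ     YYY  Mid_XYZ  1019.11
# 873     XYZ     ZZZ  Mid_XYZ  1018.04
# 867     XYZ     YYY  Mid_YYY  1027.64
# 873     XYZ     ZZZ  Mid_YYY  1027.58
# 867     XYZ     YYY  Mid_ZZZ  1022.68
# 873     XYZ     ZZZ  Mid_ZZZ  1022.81

The list comprehension keeps only the rows whose variable equals 'Mid_' + Select1 or 'Mid_' + Select2, so each original index label is left with exactly two rows, and groupby(level=0)['value'].sum() adds them back up per label.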

