
Given the following data frame:

import pandas as pd
DF = pd.DataFrame({'COL1': ['A', 'A','B'], 
                   'COL2' : [1,2,1],
                   'COL3' : ['X','Y','X']})

DF

  COL1  COL2   COL3
0   A    1      X
1   A    2      Y
2   B    1      X

I would like to add a row for COL1 = 'B' so that every COL1 value (A and B) has a row for each COL3 value (X and Y), with COL2 set to 0 in the generated row.

The desired result is as follows:

  COL1  COL2   COL3
0   A    1      X
1   A    2      Y
2   B    1      X
3   B    0      Y

This is just a simplified example; I need a calculation that can handle many such cases, rather than inserting each row of interest manually.

Thanks in advance!

UPDATE:

For a generalized scenario where there are many different combinations of values under 'COL1' and 'COL3', this works but is probably not nearly as efficient as it can be:

#Get unique set of COL3
COL3SET = set(DF['COL3'])
#Get unique set of COL1
COL1SET = set(DF['COL1'])
#Get all possible combinations of unique sets
import itertools
COMB=[]
for combination in itertools.product(COL1SET, COL3SET):
    COMB.append(combination)
#Create dataframe from new set:
UNQ = pd.DataFrame({'COMB':COMB})

#Split tuples into columns
new_col_list = ['COL1unq','COL3unq']
for n,col in enumerate(new_col_list):
    UNQ[col] = UNQ['COMB'].apply(lambda COMB: COMB[n])
UNQ = UNQ.drop('COMB',axis=1)

#Merge original data frame with unique set data frame
DF = pd.merge(DF,UNQ,left_on=['COL1','COL3'],right_on=['COL1unq','COL3unq'],how='outer')

#Fill in empty values of COL1 and COL3 where they did not have records
DF['COL1'] = DF['COL1unq']
DF['COL3'] = DF['COL3unq']

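#Replace NaNs in COL2 with zeros (assign back instead of a chained inplace fillna,
#and restore the integer dtype that the outer merge turned into float)
DF['COL2'] = DF['COL2'].fillna(0).astype(int)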

#Get rid of COL1unq and COL3unq
DF.drop(['COL1unq','COL3unq'],axis=1, inplace=True)
DF
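
A more compact way to express the same idea might be to build the full set of (COL1, COL3) pairs as a MultiIndex and reindex onto it. This is only a sketch: it assumes each (COL1, COL3) pair appears at most once in the data (reindexing fails on duplicate index entries), and the names `full_index` and `filled` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'A', 'B'],
                   'COL2': [1, 2, 1],
                   'COL3': ['X', 'Y', 'X']})

# Every (COL1, COL3) pair that should exist in the result.
full_index = pd.MultiIndex.from_product(
    [df['COL1'].unique(), df['COL3'].unique()],
    names=['COL1', 'COL3'])

# Reindexing adds the missing pairs; fill_value=0 sets COL2 for them.
# Assumption: each (COL1, COL3) pair occurs at most once in df.
filled = (df.set_index(['COL1', 'COL3'])
            .reindex(full_index, fill_value=0)
            .reset_index()[['COL1', 'COL2', 'COL3']])
print(filled)
```

Passing `fill_value=0` directly to `reindex` avoids the intermediate NaNs, so COL2 keeps its integer dtype.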
  • So how did you get the Y in the bottom row of COL3? Also, does COL1 only contain A and B values, or are there others? Could you extend your example to include an additional case, given that you have many such instances? Commented Dec 20, 2015 at 4:32
  • I'm hoping the solution will detect that there is no existing COL3='Y' for COL1='B' and therefore add the row while setting COL2 to 0 for the new row. The code should get the set of unique values of COL3, check to see if all exist for all unique values of COL1, and if not, add the row. It doesn't get more complex than this, I was only trying to get an answer that I can apply to many rows instead of just manually inserting that specific row. Commented Dec 20, 2015 at 4:35

1 Answer


Something like this?

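col1_b_vals = set(DF.loc[DF.COL1 == 'B', 'COL3'])
col1_not_b_col3_vals = set(DF.loc[DF.COL1 != 'B', 'COL3'])
# COL3 values that occur for other COL1 values but not for 'B'
missing_vals = col1_not_b_col3_vals.difference(col1_b_vals)
# One template row per missing COL3 value; .copy() avoids SettingWithCopyWarning
missing_rows = DF.loc[(DF.COL1 != 'B') & (DF.COL3.isin(missing_vals)), :].drop_duplicates('COL3').copy()
missing_rows['COL1'] = 'B'
missing_rows['COL2'] = 0
>>> pd.concat([DF, missing_rows], ignore_index=True)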
  COL1  COL2 COL3
0    A     1    X
1    A     2    Y
2    B     1    X
3    B     0    Y
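
To cover every COL1 value rather than only 'B', the same set-difference idea could be applied per group. A rough sketch, assuming COL2 should be 0 in each generated row; the names `all_col3`, `new_rows` and `result` are just for illustration:

```python
import pandas as pd

DF = pd.DataFrame({'COL1': ['A', 'A', 'B'],
                   'COL2': [1, 2, 1],
                   'COL3': ['X', 'Y', 'X']})

# All COL3 values that each COL1 group is expected to have.
all_col3 = set(DF['COL3'])

new_rows = []
for col1_val, group in DF.groupby('COL1'):
    # COL3 values missing for this particular COL1 value.
    for col3_val in sorted(all_col3 - set(group['COL3'])):
        new_rows.append({'COL1': col1_val, 'COL2': 0, 'COL3': col3_val})

# Only concatenate when something is actually missing.
if new_rows:
    result = pd.concat([DF, pd.DataFrame(new_rows)], ignore_index=True)
else:
    result = DF.copy()
print(result)
```

For large frames the Python-level loop may be slow, so this is meant to show the logic rather than to be the fastest option.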

3 Comments

That works for the sample data, but can it be adjusted to work for any value of COL1 that lacks any value of the unique set of COL3 values?
I think the revised answer does what you want. I used sets to get the difference.
I marked that as the correct answer because it works for the data I provided. I am also going to add what I worked out for the generalized case. Thanks for your help!
