I am trying to merge a number of CSV files into one. They all have a few columns in common:

CU_NUMBER   CYCLE_DATE  JOIN_NUMBER CU_NAME PhysicalAddressLine1    PhysicalAddressCity PhysicalAddressStateCode

To the right of these columns, each CSV file has various columns of interest. Some files have columns of interest that the others lack, and I still want to merge those. Also, some files may not contain the same CU_NUMBER, CU_NAME, PhysicalAddressLine1, PhysicalAddressCity, PhysicalAddressStateCode entries.

Here is an example of what I want to do. Say I have a dataframe

enter image description here

and another data frame

enter image description here

After merging I want to have something like this:

enter image description here

The tricky part is that the columns of interest vary across the CSV files, and I want to know if there is a good way to merge all of them in this manner without manually specifying each column I want. I have a total of 20 CSV files that I want to merge into one this way.

What I have so far:

I have tried something like this:

df_concat1 = pd.concat([ df13[['CU_NUMBER','CYCLE_DATE',
                                      'JOIN_NUMBER',
                                      'PhysicalAddressLine1','PhysicalAddressCity', 
                               'PhysicalAddressStateCode','(CECL) Allowance for Credit Losses on Loans and Leases']] 
                      ], axis = 0)
new_df1 = df12.merge(df_concat1, how='left', on=['CU_NUMBER','CYCLE_DATE', 'JOIN_NUMBER',
                                                'CU_NAME', 'PhysicalAddressLine1',
                                                'PhysicalAddressCity', 'PhysicalAddressStateCode'])

But I get this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-13-c2b139ce1777> in <module>
      6 new_df1 = df12.merge(df_concat1, how='left', on=['CU_NUMBER','CYCLE_DATE', 'JOIN_NUMBER',
      7                                                 'CU_NAME', 'PhysicalAddressLine1',
----> 8                                                 'PhysicalAddressCity', 'PhysicalAddressStateCode'])

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
   7295             copy=copy,
   7296             indicator=indicator,
-> 7297             validate=validate,
   7298         )
   7299 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     84         copy=copy,
     85         indicator=indicator,
---> 86         validate=validate,
     87     )
     88     return op.get_result()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    625             self.right_join_keys,
    626             self.join_names,
--> 627         ) = self._get_merge_keys()
    628 
    629         # validate the merge keys dtypes. We may need to coerce

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in _get_merge_keys(self)
    981                     if not is_rkey(rk):
    982                         if rk is not None:
--> 983                             right_keys.append(right._get_label_or_level_values(rk))
    984                         else:
    985                             # work-around for merge_asof(right_index=True)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_label_or_level_values(self, key, axis)
   1690             values = self.axes[axis].get_level_values(key)._values
   1691         else:
-> 1692             raise KeyError(key)
   1693 
   1694         # Check for duplicates

KeyError: 'CU_NAME'

I am not sure why I get this error. What I want is to merge all the columns of interest into one file. If a column of interest is unique to one file, it should simply become a new column. If a column appears in more than one file, I want to just append new rows, if that makes sense.

  • First, please tell us by what logic you want to merge the columns of interest. Which ones do you want in the new DF? Are there duplicated columns (apart from the first 7), and how do you want to handle them? What have you tried so far? Can you use DataFrame.merge to solve your problem? Commented May 15, 2020 at 17:56
  • @Joooeey Sure, let me clarify. Commented May 15, 2020 at 17:59
  • @Joooeey I tried to add some clarification; I am not sure if I articulated it well enough, though. Commented May 15, 2020 at 18:05
  • Total left-field shot in the dark: Given it’s a KeyError, are you certain one of the CU_NAME fields in your source data doesn’t have a stray space in the column name? Commented May 15, 2020 at 18:19

1 Answer

You get this error because your df_concat1 doesn't contain a column or index level named 'CU_NAME': the column list used to select from df13 leaves it out. When merging, every name passed via on= must exist in both dataframes.
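As a minimal sketch of how to confirm this before merging, the snippet below lists which key columns a frame lacks. The toy frames and the shortened key list are assumptions for illustration only, not the asker's data. It also strips whitespace from column names, which rules out the stray-space possibility raised in the comments.

```python
import pandas as pd

# Shortened, hypothetical key list; the real one would hold all common columns.
key_columns = ["CU_NUMBER", "CU_NAME"]

# Toy stand-in for df_concat1: the selection left out CU_NAME.
right = pd.DataFrame({"CU_NUMBER": [1, 2], "value": [10.0, 20.0]})

# Any name passed via on= that is absent from a frame raises KeyError in merge.
missing = [col for col in key_columns if col not in right.columns]
print("missing key columns:", missing)

# A header with stray whitespace fails the same way; stripping names rules it out.
padded = pd.DataFrame({"CU_NUMBER": [1], " CU_NAME": ["A"]})
padded.columns = padded.columns.str.strip()
print(list(padded.columns))
```

Running the same check on every frame before the merge shows which file is missing a key.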

So, yes, DataFrame.merge is your friend; pd.concat is not useful here. If you're certain that the common columns exist in every dataframe, you can merge in a loop:

common_columns = [...]
df_m, *df_others = my_dataframes
for df in df_others:
    # using 'outer' makes sure we keep all rows from all files 
    df_m = df_m.merge(df, how='outer', on=common_columns)

# do work with df_m
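For a version of the loop that runs end to end, here is a small sketch on three toy frames. The frames, the column names loans, shares and members, and the two-column key list are made up for illustration; with the real files, common_columns would hold the shared columns from the question.

```python
import pandas as pd

# Hypothetical, shortened key list for the toy data.
common_columns = ["CU_NUMBER", "CYCLE_DATE"]

df_a = pd.DataFrame({"CU_NUMBER": [1, 2],
                     "CYCLE_DATE": ["2019-12-31", "2019-12-31"],
                     "loans": [100, 200]})
df_b = pd.DataFrame({"CU_NUMBER": [2, 3],
                     "CYCLE_DATE": ["2019-12-31", "2019-12-31"],
                     "shares": [20, 30]})
df_c = pd.DataFrame({"CU_NUMBER": [1, 3],
                     "CYCLE_DATE": ["2019-12-31", "2019-12-31"],
                     "members": [5, 7]})

my_dataframes = [df_a, df_b, df_c]

# Start from the first frame and fold the others in one at a time.
df_m, *df_others = my_dataframes
for df in df_others:
    # 'outer' keeps every key combination seen in any frame;
    # cells with no matching data are filled with NaN.
    df_m = df_m.merge(df, how="outer", on=common_columns)

print(df_m)
```

Each credit union ends up on one row, and every non-key column from every frame is carried along.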

3 Comments

That is strange, because I am looking at .head() of both dataframes and they both contain CU_NAME.
Using your approach, I can see that we can merge on the common columns, which is nice. But how would we also merge the rest of the columns, the ones that are not common but that I still want joined?
@Snorrlaxxx 1) In your code above (What I have so far), df_concat1 is the result of a call to concat() and doesn't seem to contain that column. 2) on= just defines the merge or join index. The resulting df_m also contains all non-common columns from all merged dataframes.
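One case the loop does not cover is a column of interest that appears in more than one file: merge then keeps both copies and adds _x and _y suffixes. The sketch below shows that on toy data, followed by one possible alternative that is not part of the answer above: stack the frames with pd.concat and collapse rows that share the key columns, taking the first non-null value per column. The frames and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical, shortened key list for the toy data.
common_columns = ["CU_NUMBER", "CYCLE_DATE"]

df_a = pd.DataFrame({"CU_NUMBER": [1, 2],
                     "CYCLE_DATE": ["2019-12-31", "2019-12-31"],
                     "loans": [100, 200]})
# 'loans' appears in both frames; 'shares' only in the second.
df_b = pd.DataFrame({"CU_NUMBER": [2, 3],
                     "CYCLE_DATE": ["2019-12-31", "2019-12-31"],
                     "loans": [200, 300],
                     "shares": [20, 30]})

# merge keeps both copies of the shared non-key column, with suffixes.
merged = df_a.merge(df_b, how="outer", on=common_columns)
print(list(merged.columns))

# Alternative sketch: append rows, then keep one row per key combination.
# GroupBy.first() returns the first non-null value in each column.
stacked = pd.concat([df_a, df_b], ignore_index=True)
combined = stacked.groupby(common_columns, as_index=False).first()
print(combined)
```

This alternative assumes that when two files report the same column for the same key, the values agree or the first one is acceptable; if they can conflict, the suffixed columns from merge make the disagreement visible instead.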
