1

I have a table like so after performing some data scraping on a pdf:

index colA           colB                     colC colD colE colF colG
-------------------------------------------------------------------
 1    ABCD  veryLongTextThatShouldNotCutOff   12   x    x    x    x
 2    ABCD  veryLongText                      14   x    x    x    x
 4    NaN   ThatShouldNotCutOff               NaN  NaN  NaN  NaN  NaN
 5    ABCD  veryLongTextThatShouldNotCutOff   12   x    x    x    x
 6    ABCD  veryLongTextThatShouldNotCutOff   x    19   x    x    x
 7    ABCD  veryLongText                      21   x    x    x    x
 9    NaN   ThatShouldNotCutOff               NaN  NaN  NaN  NaN  NaN
 10   ABCD  veryLongTextThatShouldNotCutOff   13   x    x    x    x
...

I need to clean the data up so that information in rows with index 2 and 4 are merged together like so, and with the indexes in the right order again.

index colA           colB                     colC colD colE colF colG
-------------------------------------------------------------------
 1    ABCD  veryLongTextThatShouldNotCutOff   12   x    x    x    x
 2    ABCD  veryLongTextThatShouldNotCutOff   14   x    x    x    x
 3    ABCD  veryLongTextThatShouldNotCutOff   12   x    x    x    x
 4    ABCD  veryLongTextThatShouldNotCutOff   x    19   x    x    x
 5    ABCD  veryLongTextThatShouldNotCutOff   21   x    x    x    x
 6    ABCD  veryLongTextThatShouldNotCutOff   13   x    x    x    x
...

Essentially, some rows with colA as NaN needs to be "concatenated" with the previous rows.

How do I do it using pandas?

1 Answer 1

3

You could use the non NaN values in colA to set up a group and merge the colB. Then drop the NaN rows:

group = df['colA'].notna().cumsum()
df['colB'] = df.groupby(group)['colB'].transform('sum')

df = df.dropna(subset=['colA'])

output:

   index  colA                             colB  colC colD colE colF colG
0      1  ABCD  veryLongTextThatShouldNotCutOff  12.0    x    x    x    x
1      2  ABCD  veryLongTextThatShouldNotCutOff  14.0    x    x    x    x
3      5  ABCD  veryLongTextThatShouldNotCutOff  12.0    x    x    x    x

Sign up to request clarification or add additional context in comments.

5 Comments

colB now has Nan, but he needed concatination instead of that nan.
thanks @GedasMiksenas that was a mistake of mine, I needed transform
is there a way to do without groupby? I don't think it is the right approach. I've updated my qn for more clarity
@mozway sorry, check again. I feel that a sliding window approach is more appropriate in my case, but I'm not sure how to do that in pandas
@yameroyo I don't see a fundamental difference (except having a few more rows)

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.