0

Hi I have a datframe that look like this

        Date     Col1         Col2         Col3       col4 
0     1-4-2020    a        ed kv sc        sf ds       n
1     1-4-2020    b        hed sf ed       df hed      c
2     1-4-2020    c        df xz sf        sf ds       n
3     2-4-2020    a        ed df hed       sf ds       n
4     2-4-2020    b        gv bb sc        s ds        c
5     2-4-2020    c        ed sf sc        hed ds      n

What I want to do is to filter the rows with multiple conditions. If:

  • Tow rows have the same value in Col1. For example a == a
  • And they have the same values in Col4. n==n
  • If both conditions are true search Col2 and Col3 for the key strings "hed" and "sf". they can be in Col2 or Col3(or both)

So the result will look like this:

       Date     Col1         Col2         Col3       col4 
1     1-4-2020    b        hed sf ed      df hed       c
3     2-4-2020    a        ed df hed      sf djs       n
5     2-4-2020    c        ed sf ssc      hed djs      n

I hope it is clear Update

In case Col4

        Date     Col1         Col2         Col3       col4 
0     1-4-2020    a        ed kv sc        sf ds       n
1     1-4-2020    b        hed sf ed       df hed      c
2     1-4-2020    c        df xz sf        sf ds       n
3     1-4-2020    d        df xdg sf       sf hed      y 
4     2-4-2020    a        ed df hed       sf ds       n
5     2-4-2020    b        gv bb sc        s ds        c
6     2-4-2020    c        ed sf sc        hed ds      n
7     2-4-2020    d        df xgfgf       gf gfd       null

In case that the value in Col4 appears only once in the dataframe like in row 3 (hint no next row to compare with) the function should return the row.

        Date     Col1         Col2         Col3       col4 
1     1-4-2020    b        hed sf ed       df hed      c
3     1-4-2020    d        df xdg sf       sf hed      y 
4     2-4-2020    a        ed df hed       sf ds       n
6     2-4-2020    c        ed sf sc        hed ds      n

1 Answer 1

1

First get counts of rows by same Col1 and Col4 by GroupBy.transform with GroupBy.size and join column Col2 with Col3 toegether:

count = df.groupby(['Col1','col4'])['col4'].transform('size')
s = (df['Col2'] + ' ' + df['Col3'])

You can verify output:

print (df.assign(count=count, both=s))
       Date Col1       Col2    Col3 col4  count              both
0  1-4-2020    a   ed kv sc   sf ds    n      2    ed kv sc sf ds
1  1-4-2020    b  hed sf ed  df hed    c      2  hed sf ed df hed
2  1-4-2020    c   df xz sf   sf ds    n      2    df xz sf sf ds
3  2-4-2020    a  ed df hed   sf ds    n      2   ed df hed sf ds
4  2-4-2020    b   gv bb sc    s ds    c      2     gv bb sc s ds
5  2-4-2020    c   ed gf sc  hed ds    n      2   ed gf sc hed ds

Last filter by boolean indexing if counts are 2 by Series.eq, then Series.str.contains for test substrings and chained mask together by & for bitwise AND:

df = df[count.eq(2) & s.str.contains('hed') & s.str.contains('sf')]
print (df)
       Date Col1       Col2    Col3 col4
1  1-4-2020    b  hed sf ed  df hed    c
3  2-4-2020    a  ed df hed   sf ds    n
Sign up to request clarification or add additional context in comments.

4 Comments

This answer would be perfect, but the issue with it is some times all three conditions are only applied to one row (there is no other row to be compared with it). Is there a way to fined the first occurrence of the string.
@Sarah - sorry, not understand, is possible change data for see what you mean?
@Sarah - Tested and for me working df = df[count.isin([1,2]) & s.str.contains('hed') & s.str.contains('sf')], can you check it?
It gave a better result yet still missing rows. I believe I shall investigate the data more. Thanks

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.