
I have the following code:

import pandas as pd
import random


a = [random.randint(0, 1) for i in range(30)]
b = [random.randint(0, 1) for i in range(30)]

print(a)
print(b)

df = pd.DataFrame([a, b])
df = df.T

columns = ['column1', 'column2']
df.columns = columns
print(df)

that creates a dataframe stored in variable 'df'. It consists of 2 columns (column1 and column2) filled with random 0s and 1s.

This is the output I got when I ran the program (if you run it you won't get exactly the same result because of the random number generation from randint).

    column1  column2
0         0        1
1         1        0
2         0        1
3         1        1
4         0        1
5         1        1
6         0        1
7         1        1
8         1        0
9         0        1
10        0        0
11        1        1
12        1        1
13        0        1
14        0        0
15        0        1
16        1        1
17        1        1
18        0        1
19        1        0
20        0        0
21        1        0
22        0        1
23        1        0
24        1        1
25        0        0
26        1        1
27        1        0
28        0        1
29        1        0

I would like to create a filter on column2, showing only the clusters of data when there are three or more 1s in a row. The output would be something like this:

    column1  column2
2         0        1
3         1        1
4         0        1
5         1        1
6         0        1
7         1        1

11        1        1
12        1        1
13        0        1

15        0        1
16        1        1
17        1        1
18        0        1

I have left a space between the clusters for visual clarity, but the real output would not have the empty spaces in the dataframe.

I would like to do it in the following way.

filter1 = (some boolean condition) &/| (maybe some other stuff)
final_df = df[filter1]

Thank you

1 Answer

We can use GroupBy.transform.

n = 3
blocks = df['column2'].ne(df['column2'].shift()).cumsum()
m1 = (df.groupby(blocks)['column2']
        .transform('size').ge(n))
m2 = df['column2'].eq(1)
df_filtered = df.loc[m1 & m2]
# Alternative without df['column2'].eq(1): multiply the mask by the 0/1
# values and cast back to bool so .loc treats it as a boolean mask
#df_filtered = df.loc[m1.mul(df['column2']).astype(bool)]
print(df_filtered)

Output

    column1  column2
2         0        1
3         1        1
4         0        1
5         1        1
6         0        1
7         1        1

11        1        1
12        1        1
13        0        1

15        0        1
16        1        1
17        1        1
18        0        1

If column2 really contains only 1s and 0s in your original DataFrame, then we can use transform('sum') instead of transform('size').
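With 0/1 data the sum of each run equals the run length for runs of 1s and is zero for runs of 0s, so the extra eq(1) mask becomes redundant. A minimal sketch of this variant, with hard-coded data in place of the random input above so the result is reproducible:

```python
import pandas as pd

# Hard-coded 0/1 data instead of the random input above
df = pd.DataFrame({'column1': [0, 1, 0, 1, 0, 1, 1, 0],
                   'column2': [1, 1, 1, 1, 0, 1, 1, 0]})

n = 3
# New block id every time the value in column2 changes
blocks = df['column2'].ne(df['column2'].shift()).cumsum()
# For runs of 1s the sum equals the run length; runs of 0s sum to 0,
# so no extra eq(1) mask is needed
m = df.groupby(blocks)['column2'].transform('sum').ge(n)
print(df.loc[m])  # keeps only the first run of four 1s
```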


blocks gets a new value every time the value in column2 changes:

print(blocks)
0      1
1      2
2      3
3      3
4      3
5      3
6      3
7      3
8      4
9      5
10     6
11     7
12     7
13     7
14     8
15     9
16     9
17     9
18     9
19    10
20    10
21    10
22    11
23    12
24    13
25    14
26    15
27    16
28    17
29    18
Name: column2, dtype: int64

Alternative

I use this pattern often in my projects, and in my experience Series.map + Series.value_counts is generally a little faster. The performance difference between the two methods is never great, so choose whichever you prefer, but I usually use this last one and I think it is worth mentioning.

%%timeit
m1 = blocks.map(blocks.value_counts().ge(n))
1.41 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


%%timeit
m1 = (df.groupby(blocks)['column2']
        .transform('size').ge(n))
2.12 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
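Putting the faster variant together, the full filter reads as below. This is a self-contained sketch with hard-coded data in place of the random input, so the printed rows are reproducible:

```python
import pandas as pd

df = pd.DataFrame({'column1': [0, 1, 0, 1, 0, 1, 1, 0],
                   'column2': [1, 1, 1, 1, 0, 1, 1, 0]})

n = 3
blocks = df['column2'].ne(df['column2'].shift()).cumsum()
# value_counts gives the size of each block; map looks each row's block
# size up, and ge(n) yields the same boolean mask as transform('size')
m1 = blocks.map(blocks.value_counts().ge(n))
m2 = df['column2'].eq(1)
print(df.loc[m1 & m2])  # rows belonging to runs of three or more 1s
```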

1 Comment

Excellent answer and explanation!
