I am attempting to recode column values in pandas using a combination of 'where' and 'sample'. The desired result is to select 200 random rows from the rows labeled "Low_Valence" and 200 random rows from the rows labeled "High_Valence" in the "valence_median_split" column. However, this does not seem to be working.

Here is the df:

df.head()

Out[34]: 
              ID Category  Num Vert_Horizon Description  Fem_Valence_Mean  \
0  Animals_001_h  Animals    1            h  Dead Stork              2.40   
1  Animals_002_v  Animals    2            v        Lion              6.31   
2  Animals_003_h  Animals    3            h       Snake              5.14   
3  Animals_004_v  Animals    4            v        Wolf              4.55   
4  Animals_005_h  Animals    5            h         Bat              5.29   

   Fem_Valence_SD  Fem_Av_Ap_Mean  Fem_Av/Ap_SD  Arousal_Mean  \
0            1.30            3.03          1.47          6.72   
1            2.19            5.96          2.24          6.69   
2            1.19            5.14          1.75          5.34   
3            1.87            4.82          2.27          6.84   
4            1.56            4.61          1.81          5.50   

          Luminance  Contrast  JPEG_size80   LABL   LABA  \
0          ...              126.05     68.45       263028  51.75  -0.39   
1          ...              123.41     32.34       250208  52.39  10.63   
2          ...              135.28     59.92       190887  55.45   0.25   
3          ...              122.15     75.10       282350  49.84   3.82   
4          ...              131.81     59.77       329325  54.26  -0.34   

    LABB  Entropy  Classification  temp_selection  valence_median_split  
0  16.93     7.86                            High           Low_Valence  
1  30.30     6.71                             NaN          High_Valence  
2   4.41     7.83                            High           Low_Valence  
3   1.36     7.69                            High           Low_Valence  
4  -0.95     7.82                            High           Low_Valence  

[5 rows x 35 columns]

Here is what I tried:

df['temp_selection'] = ''
df['temp_selection'] = np.where(df['valence_median_split'] == 'Low_Valence', df['valence_median_split'].sample(n=200).reindex(df.index), 'Low')
df['temp_selection'] = np.where(df['valence_median_split'] == 'High_Valence', df['valence_median_split'].sample(n=200).reindex(df.index), 'High')
df.temp_selection.unique()

However, the results indicate that this did not work:

array(['High', nan, 'High_Valence'], dtype=object)

I am wondering if there is an error with combining these functions.

Here is a reproducible example:

d = {'col1': [1, 2, 3, 4, 3, 3, 2, 2], 'col2': [1, 2, 3, 4, 3, 3, 2, 2]}
df = pd.DataFrame(data=d)
df['valence_median_split'] = ''
#Get median of valence
valence_median = df['col1'].median()
df['valence_median_split'] = np.where(df['col2'] < valence_median, 'Low_Valence', 'High_Valence')
df['temp_selection'] = ''
df['temp_selection'] = np.where(df['valence_median_split'] == 'Low_Valence', df['valence_median_split'].sample(n=2).reindex(df.index), 'Low')
df['temp_selection'] = np.where(df['valence_median_split'] == 'High_Valence', df['valence_median_split'].sample(n=2).reindex(df.index), 'High')
df
   col1  col2 valence_median_split temp_selection
0     1     1          Low_Valence           High
1     2     2          Low_Valence           High
2     3     3         High_Valence   High_Valence
3     4     4         High_Valence            NaN
4     3     3         High_Valence            NaN
5     3     3         High_Valence   High_Valence
6     2     2          Low_Valence           High
7     2     2          Low_Valence           High

As can be seen in the df above, there is a 'High_Valence' classification within 'temp_selection' that should not be there, and no 'Low' classifications.

  • Is it possible to create a smaller dataset and expected output, with the code and an explanation? It will help you and us understand the question better. Thanks. Commented Mar 30, 2019 at 15:19
  • @anky_91 Thanks for the suggestion. I added a reproducible example. Commented Mar 30, 2019 at 21:21

1 Answer

The idea is to sample from the filtered rows only, keep the index of each sample, and use numpy.select instead of two separate np.where calls:

low = df.loc[df['valence_median_split'] == 'Low_Valence', 
                'valence_median_split'].sample(n=2).index
high = df.loc[df['valence_median_split'] == 'High_Valence',
                 'valence_median_split'].sample(n=2).index
df['temp_selection'] = np.select([df.index.isin(low), df.index.isin(high)],
                                 ['Low', 'High'], default=np.nan)

Or:

df['temp_selection'] = np.where(df.index.isin(low), 'Low', 
                       np.where(df.index.isin(high), 'High', np.nan))

print (df)
   col1  col2 valence_median_split temp_selection
0     1     1          Low_Valence            nan
1     2     2          Low_Valence            Low
2     3     3         High_Valence            nan
3     4     4         High_Valence            nan
4     3     3         High_Valence           High
5     3     3         High_Valence           High
6     2     2          Low_Valence            nan
7     2     2          Low_Valence            Low

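Or assign by label with loc. If temp_selection does not exist yet, the unselected rows are left as real NaN rather than the string 'nan':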

df.loc[low, 'temp_selection'] = 'Low'
df.loc[high, 'temp_selection'] = 'High'
print (df)
   col1  col2 valence_median_split temp_selection
0     1     1          Low_Valence            NaN
1     2     2          Low_Valence            Low
2     3     3         High_Valence            NaN
3     4     4         High_Valence            NaN
4     3     3         High_Valence           High
5     3     3         High_Valence           High
6     2     2          Low_Valence            NaN
7     2     2          Low_Valence            Low
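
The same per-label sampling can probably be written in one step with groupby. This is only a sketch, assuming pandas 1.1 or newer (where GroupBy.sample is available); the seed passed as random_state and the short_names mapping are my own illustrative choices, not part of the question.

```python
import numpy as np
import pandas as pd

# Rebuild the reproducible example from the question
d = {'col1': [1, 2, 3, 4, 3, 3, 2, 2], 'col2': [1, 2, 3, 4, 3, 3, 2, 2]}
df = pd.DataFrame(data=d)
valence_median = df['col1'].median()
df['valence_median_split'] = np.where(df['col2'] < valence_median,
                                      'Low_Valence', 'High_Valence')

# Draw 2 rows from each label; random_state is an arbitrary seed (assumption)
picked = df.groupby('valence_median_split').sample(n=2, random_state=0)

# Map the labels to short names (hypothetical mapping for illustration).
# Column assignment aligns on the index, so rows that were not drawn get NaN.
short_names = {'Low_Valence': 'Low', 'High_Valence': 'High'}
df['temp_selection'] = picked['valence_median_split'].map(short_names)
print(df)
```

For the real data, n=2 would become n=200; sample raises an error if a group has fewer rows than n.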

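Another idea is to use numpy.random.choice. Pass replace=False, because the default samples with replacement and can pick the same row twice:

low = np.random.choice(df.index[df['valence_median_split'] == 'Low_Valence'], size=2, replace=False)
high = np.random.choice(df.index[df['valence_median_split'] == 'High_Valence'], size=2, replace=False)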

df.loc[low, 'temp_selection'] = 'Low'
df.loc[high, 'temp_selection'] = 'High'
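
As for why the attempt in the question misbehaves, my reading is that three things combine: sample is called on the whole column rather than on the filtered rows, reindex fills every undrawn row with NaN, and np.where writes 'Low' on the rows where the condition is False. The second assignment then overwrites the first one completely, which would explain why no 'Low' survives. A small sketch on the example data (the random_state seed and the variable names are mine):

```python
import numpy as np
import pandas as pd

# Rebuild the reproducible example from the question
d = {'col1': [1, 2, 3, 4, 3, 3, 2, 2], 'col2': [1, 2, 3, 4, 3, 3, 2, 2]}
df = pd.DataFrame(data=d)
valence_median = df['col1'].median()
df['valence_median_split'] = np.where(df['col2'] < valence_median,
                                      'Low_Valence', 'High_Valence')
is_low = df['valence_median_split'] == 'Low_Valence'

# sample() draws from the whole column, so the 2 rows can carry either label
sampled = df['valence_median_split'].sample(n=2, random_state=0)

# reindex() restores the full index and fills the undrawn rows with NaN
aligned = sampled.reindex(df.index)
print(aligned.isna().sum())  # 8 rows minus 2 drawn rows = 6

# np.where keeps `aligned` on the Low_Valence rows and writes 'Low' on all
# other rows, i.e. on the High_Valence rows, the reverse of the intent
first_try = np.where(is_low, aligned, 'Low')
print(first_try)
```

Sampling from the filtered rows first, as in the answer above, avoids all three problems.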