Numpy array remove duplicates with if statement

Question

I have a NumPy string array of shape (112, 7). The first few elements of each row are text and the rest are numbers, as follows:

List[0] = array(['ID32', 'TRED', 'PLUS', '434','0.34', '11.9', '4.8'], dtype='<U14')
List[1] = array(['ID32', 'TRED', 'PLUS', '994','0.84', '44.3', '1.11'], dtype='<U14')
List[2] = array(['ID32', 'PROP', 'MINUS', '234','0.56', '44.3', '1.11'], dtype='<U14')

....

What I would like to achieve is an if statement that checks the first three elements of each row and, if they are identical between two rows, calculates the ratio of the fourth to the fifth element and removes the row with the smaller ratio from the list.

For instance, List[0] and List[1] have the same first three elements, so I compare the ratios (434/0.34 ≈ 1276.5, 994/0.84 ≈ 1183.3). The ratio for List[1] is smaller, so List[1] should be removed from the list.

Here is my failed attempt

for i, val in enumerate(List):
    if val[i][0] == val[i][1]
        print(val[3].astype(np.float)/val[4].astype(np.float))

I appreciate any help.

@HenryYik Sure, can you please give an example? I do not have much experience with pandas.
– Sara Krauss, Commented Aug 14, 2020 at 11:58

Henry Yik · Accepted Answer · 2020-08-14 12:11:19Z


If you are open to using pandas:

import pandas as pd


# setup
l = [['ID32', 'TRED', 'PLUS', '434', '0.34', '11.9', '4.8'],
     ['ID32', 'TRED', 'PLUS', '994', '0.84', '44.3', '1.11'],
     ['ID32', 'PROP', 'MINUS', '234', '0.56', '44.3', '1.11']]

df = pd.DataFrame(l)

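print(df.assign(ratio=df[3].astype(float) / df[4].astype(float))
        .sort_values([0, 1, 2, "ratio"], ascending=False)  # largest ratio first within each key
        .drop_duplicates([0, 1, 2], keep="first")          # keep only that row per key
        .sort_index()                                      # restore the original row order
        .drop(columns="ratio")                             # positional axis (.drop("ratio", 1)) fails on pandas 2.x
        .to_numpy())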

Result:

[['ID32' 'TRED' 'PLUS' '434' '0.34' '11.9' '4.8']
 ['ID32' 'PROP' 'MINUS' '234' '0.56' '44.3' '1.11']]
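
As an alternative sketch of the same idea (not part of the chain above), the row to keep can be picked per group with `groupby` and `idxmax` instead of sorting and dropping duplicates. The names `ratio`, `best` and `result` are only illustrative, and the data is the three sample rows from the question:

```python
import pandas as pd

rows = [['ID32', 'TRED', 'PLUS', '434', '0.34', '11.9', '4.8'],
        ['ID32', 'TRED', 'PLUS', '994', '0.84', '44.3', '1.11'],
        ['ID32', 'PROP', 'MINUS', '234', '0.56', '44.3', '1.11']]
df = pd.DataFrame(rows)

# Ratio of column 3 to column 4; the columns hold strings, so convert first.
ratio = df[3].astype(float) / df[4].astype(float)

# For each (col 0, col 1, col 2) key, get the index label of the largest ratio.
best = ratio.groupby([df[0], df[1], df[2]]).idxmax()

# Select those rows, sorted by label so the original row order is retained.
result = df.loc[best.sort_values()].to_numpy()
print(result)
```

On ties, `idxmax` returns the first matching row, which is the same row `keep="first"` would retain.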

edited Aug 14, 2020 at 12:11

answered Aug 14, 2020 at 12:04


2 Comments

John Zwinck Over a year ago

This is clever but it will change the order of all rows by sort_values...not sure if that matters.

Henry Yik Over a year ago

let me throw in a sort_index after drop_duplicates to retain original order :-)

John Zwinck · Accepted Answer · 2020-08-14 12:06:54Z

First make a mask to track which rows to keep, and convert the numeric columns:

import numpy as np

keep = np.ones(len(arr), dtype=bool)  # [True, True, True] for the three sample rows
numer = arr[:, 3].astype(float)
denom = arr[:, 4].astype(float)

Then a loop to edit the mask of which rows we want to keep:

for ii in range(1, len(arr)):
    if np.all(arr[ii-1,:3] == arr[ii,:3]):  # same first three strings as the previous row
        if numer[ii-1] / denom[ii-1] < numer[ii] / denom[ii]:
            keep[ii-1] = False  # previous row has the smaller ratio
        else:
            keep[ii] = False

Now you have keep as array([ True, False, True]), which you can easily use to get the final result:

arr[keep]

Giving you:

array([['ID32', 'TRED', 'PLUS', '434', '0.34', '11.9', '4.8'],
       ['ID32', 'PROP', 'MINUS', '234', '0.56', '44.3', '1.11']],
      dtype='<U14')
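
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the steps above. It assumes `arr` holds the three sample rows from the question, and it folds `numer` and `denom` into a single `ratio` array, which is a name of my own choosing:

```python
import numpy as np

arr = np.array([
    ['ID32', 'TRED', 'PLUS', '434', '0.34', '11.9', '4.8'],
    ['ID32', 'TRED', 'PLUS', '994', '0.84', '44.3', '1.11'],
    ['ID32', 'PROP', 'MINUS', '234', '0.56', '44.3', '1.11'],
])

keep = np.ones(len(arr), dtype=bool)
# Columns 3 and 4 are strings, so convert them before dividing.
ratio = arr[:, 3].astype(float) / arr[:, 4].astype(float)

for ii in range(1, len(arr)):
    # Compare the three key columns with those of the previous row.
    if np.all(arr[ii - 1, :3] == arr[ii, :3]):
        if ratio[ii - 1] < ratio[ii]:
            keep[ii - 1] = False  # previous row has the smaller ratio
        else:
            keep[ii] = False      # current row has the smaller (or equal) ratio

result = arr[keep]
print(result)
```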

If the number of matching strings is small compared with the total number of rows, this might be faster:

# compare the three key columns of each row with those of the previous row
matches = 1 + np.where((arr[1:, :3] == arr[:-1, :3]).all(1))[0] # [1]
for ii in matches: # now we already know the strings match
    if numer[ii-1] / denom[ii-1] < numer[ii] / denom[ii]:
        keep[ii-1] = False
    else:
        keep[ii] = False

That way the code is still quite readable, but the loop iterates only over the matches rather than over every row.
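
One caveat: both loops compare a row only with its immediate predecessor, so they rely on rows with matching strings being adjacent, and in a run of three or more matching rows a row can end up compared against one that was already discarded. If that might apply to your data, a possible sketch is to track the best row per key in a dictionary. The function name `keep_largest_ratio` is hypothetical, and the sample rows are reordered here on purpose so that the matching rows are not adjacent:

```python
import numpy as np

def keep_largest_ratio(arr):
    """Keep one row per key (columns 0-2): the one with the largest col3/col4 ratio."""
    ratio = arr[:, 3].astype(float) / arr[:, 4].astype(float)
    best = {}  # key tuple -> index of the best row seen so far
    for ii, key in enumerate(map(tuple, arr[:, :3])):
        if key not in best or ratio[ii] > ratio[best[key]]:
            best[key] = ii
    # Sort the surviving indices so the original row order is retained.
    return arr[sorted(best.values())]

# Sample rows from the question, with the matching rows separated.
arr = np.array([
    ['ID32', 'TRED', 'PLUS', '434', '0.34', '11.9', '4.8'],
    ['ID32', 'PROP', 'MINUS', '234', '0.56', '44.3', '1.11'],
    ['ID32', 'TRED', 'PLUS', '994', '0.84', '44.3', '1.11'],
])
result = keep_largest_ratio(arr)
print(result)
```

This still loops in Python over every row, so it trades some speed for not depending on row order.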
