
I have a NumPy string array of shape (112, 7). The first few elements of each row are letters and the rest are numbers, as follows:

List[0] = array(['ID32', 'TRED', 'PLUS', '434','0.34', '11.9', '4.8'], dtype='<U14')
List[1] = array(['ID32', 'TRED', 'PLUS', '994','0.84', '44.3', '1.11'], dtype='<U14')
List[2] = array(['ID32', 'PROP', 'MINUS', '234','0.56', '44.3', '1.11'], dtype='<U14')

....

What I would like to achieve is an if statement that checks the first three elements of each row. If they are identical between two rows, calculate the ratio of the fourth to the fifth element for each row and remove the row with the smaller ratio from the list.

For instance, List[0] and List[1] have the same first three elements, so comparing the ratios (434/0.34 ≈ 1276.5, 994/0.84 ≈ 1183.3) shows that List[1] has the smaller ratio and should be removed from the list.

Here is my failed attempt

for i, val in enumerate(List):
    if val[i][0] == val[i][1]
        print(val[3].astype(np.float)/val[4].astype(np.float))

I appreciate any help.

  • If you are open to using pandas it can be easily done. Commented Aug 14, 2020 at 11:55
  • @HenryYik Sure, can you please give an example? I do not have much experience with pandas? Commented Aug 14, 2020 at 11:58

2 Answers


If you are open to using pandas:

import pandas as pd


# setup
l = [['ID32', 'TRED', 'PLUS', '434', '0.34', '11.9', '4.8'],
     ['ID32', 'TRED', 'PLUS', '994', '0.84', '44.3', '1.11'],
     ['ID32', 'PROP', 'MINUS', '234', '0.56', '44.3', '1.11']]

df = pd.DataFrame(l)

print (df.assign(ratio=df[3].astype(float)/df[4].astype(float))
         .sort_values([0,1,2,"ratio"], ascending=False)
         .drop_duplicates([0,1,2], keep="first")
         .sort_index()
         .drop(columns="ratio")
         .to_numpy())

Result:

[['ID32' 'TRED' 'PLUS' '434' '0.34' '11.9' '4.8']
 ['ID32' 'PROP' 'MINUS' '234' '0.56' '44.3' '1.11']]
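A variation that avoids sorting altogether is to group on the first three columns and pick the row with the largest ratio in each group. This is only a sketch of an alternative, not part of the original answer; the names `rows`, `best`, and `result` are illustrative, and it assumes ties should be resolved in favour of the first row encountered.

```python
import pandas as pd

rows = [['ID32', 'TRED', 'PLUS', '434', '0.34', '11.9', '4.8'],
        ['ID32', 'TRED', 'PLUS', '994', '0.84', '44.3', '1.11'],
        ['ID32', 'PROP', 'MINUS', '234', '0.56', '44.3', '1.11']]

df = pd.DataFrame(rows)

# Ratio of the fourth to the fifth column, converted from strings to floats.
ratio = df[3].astype(float) / df[4].astype(float)

# For each (col0, col1, col2) group, idxmax gives the index label of the
# row with the largest ratio.
best = ratio.groupby([df[0], df[1], df[2]]).idxmax()

# Sorting the selected labels restores the original row order.
result = df.loc[sorted(best)].to_numpy()
print(result)
```

Because the rows are selected by index label, no helper column has to be added and dropped again.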

2 Comments

This is clever but it will change the order of all rows by sort_values...not sure if that matters.
let me throw in a sort_index after drop_duplicates to retain original order :-)

First make a mask to track which rows to keep, and convert the numeric columns:

import numpy as np

# arr is the (n, 7) string array from the question
keep = np.ones(len(arr), bool) # [True, True, True]
numer = arr[:,3].astype(float)
denom = arr[:,4].astype(float)

Then a loop to edit the mask of which rows we want to keep:

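for ii in range(1, len(arr)):
    if np.all(arr[ii-1,:3] == arr[ii,:3]):  # first three columns match
        if numer[ii-1] / denom[ii-1] < numer[ii] / denom[ii]:
            keep[ii-1] = False  # previous row has the smaller ratio
        else:
            keep[ii] = False    # current row has the smaller (or equal) ratio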

Now you have keep as array([ True, False, True]), which you can easily use to get the final result:

arr[keep]

Giving you:

array([['ID32', 'TRED', 'PLUS', '434', '0.34', '11.9', '4.8'],
       ['ID32', 'PROP', 'MINUS', '234', '0.56', '44.3', '1.11']],
      dtype='<U14')
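For reference, the steps above can be put together into one self-contained script. This is a minimal sketch: the function name `drop_smaller_adjacent` is made up for illustration, and the three sample rows from the question stand in for the full (112, 7) array.

```python
import numpy as np


def drop_smaller_adjacent(arr):
    """Drop the lower-ratio row from each pair of adjacent rows whose
    first three columns are identical."""
    keep = np.ones(len(arr), dtype=bool)
    ratio = arr[:, 3].astype(float) / arr[:, 4].astype(float)
    for ii in range(1, len(arr)):
        if np.all(arr[ii - 1, :3] == arr[ii, :3]):
            if ratio[ii - 1] < ratio[ii]:
                keep[ii - 1] = False
            else:
                keep[ii] = False
    return arr[keep]


arr = np.array([
    ['ID32', 'TRED', 'PLUS', '434', '0.34', '11.9', '4.8'],
    ['ID32', 'TRED', 'PLUS', '994', '0.84', '44.3', '1.11'],
    ['ID32', 'PROP', 'MINUS', '234', '0.56', '44.3', '1.11'],
])

result = drop_smaller_adjacent(arr)
print(result)
```

Note that only neighbouring rows are compared, so rows with the same first three elements that are not next to each other are left alone.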

If the number of matching strings is small compared with the total number of rows, this might be faster:

# indices of rows whose first three columns equal those of the previous row
matches = 1 + np.where((arr[1:, :3] == arr[:-1, :3]).all(1))[0] # [1]
for ii in matches: # now we already know the strings match
    if numer[ii-1] / denom[ii-1] < numer[ii] / denom[ii]:
        keep[ii-1] = False
    else:
        keep[ii] = False

That way the code is still quite readable, but the loop only iterates over the matches rather than over every row.
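Both loops compare each row only with its immediate predecessor, so matching rows that are not adjacent are never compared, and a run of three or more matching rows may leave more than one survivor. If that matters for your data, one possible sketch is to label each row with a group id and keep the best row per group. The tab separator used to build the keys is an assumption (it must not occur in the data), and the names `keys`, `group`, and `best` are illustrative.

```python
import numpy as np

# Same sample rows as in the question, reordered so the matching rows
# are no longer adjacent.
arr = np.array([
    ['ID32', 'TRED', 'PLUS', '434', '0.34', '11.9', '4.8'],
    ['ID32', 'PROP', 'MINUS', '234', '0.56', '44.3', '1.11'],
    ['ID32', 'TRED', 'PLUS', '994', '0.84', '44.3', '1.11'],
])

ratio = arr[:, 3].astype(float) / arr[:, 4].astype(float)

# Join the first three columns into one key per row, then map each
# distinct key to an integer group id.
keys = np.array(['\t'.join(row) for row in arr[:, :3]])
_, group = np.unique(keys, return_inverse=True)

# Remember, for each group, the index of the row with the largest ratio.
# The strict comparison means the first row wins on ties.
best = {}
for idx, g in enumerate(group):
    if g not in best or ratio[idx] > ratio[best[g]]:
        best[g] = idx

keep = np.zeros(len(arr), dtype=bool)
keep[list(best.values())] = True

result = arr[keep]  # boolean indexing preserves the original row order
print(result)
```

This still uses a single Python loop over the rows, but it no longer depends on how the rows are ordered.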
