sort rows of array to match order of another array using an identifier column

Question

I have two arrays like this:

A = [[111, ...],          B = [[222, ...],
     [222, ...],               [111, ...],
     [333, ...],               [333, ...],
     [555, ...]]               [444, ...],
                               [555, ...]]

Where the first column contains identifiers and the remaining columns some data, where the number of columns of B is much larger than the number of columns of A. The identifiers are unique. The number of rows in A can be less than in B, so that in some cases empty spacer rows would be necessary.
I am looking for an efficient way to match the rows of matrix A to matrix B, so that that the result would look like that:

A = [[222, ...],
     [111, ...],
     [333, ...],
     [nan, nan], #could be any unused value
     [555, ...]]

I could just sort both matrices or write a for loop, but both approaches seem clumsy... Are there better implementations?

What do you want to do with the remaining columns? Should the remaining columns of B just be appended to those of A? — Daniel Porteous
– Daniel Porteous, Commented Jun 16, 2016 at 10:31
Can you assume that A and B have equal number of rows, and an identical set of identifiers? And that the identifiers are unique? — donkopotamus
– donkopotamus, Commented Jun 16, 2016 at 10:31
I think you forgot to add more details. Do you mean to match identical list (Example: both the length of the list and data in the list are same) found in both the 2-D matrices A and B? Given that no. of columns in B is much greater than A, I wonder how one would match exactly... — Swadhikar
– Swadhikar, Commented Jun 16, 2016 at 10:32
Also, is A always going to be sorted, or was that just how it appears in the question by chance! — donkopotamus
– donkopotamus, Commented Jun 16, 2016 at 10:38

Community · Accepted Answer · 2017-05-23 11:58:50Z

Here's a vectorized approach using np.searchsorted -

# Store the sorted indices of A
sidx = A[:,0].argsort()

# Find the indices of col-0 of B in col-0 of sorted A
l_idx = np.searchsorted(A[:,0],B[:,0],sorter = sidx)

# Create a mask corresponding to all those indices that indicates which indices
# corresponding to B's col-0 match up with A's col-0
valid_mask = l_idx != np.searchsorted(A[:,0],B[:,0],sorter = sidx,side='right')

# Initialize output array with NaNs. 
# Use l_idx to set rows from A into output array. Use valid_mask to select 
# indices from l_idx and output rows that are to be set.
out = np.full((B.shape[0],A.shape[1]),np.nan)
out[valid_mask] = A[sidx[l_idx[valid_mask]]]

Please note that valid_mask could also be created using np.in1d : np.in1d(B[:,0],A[:,0]) for a more intuitive answer. But, we are using np.searchsorted as that's better in terms of performance as also disscused in greater detail in this other solution.

Sample run -

In [184]: A
Out[184]: 
array([[45, 11, 86],
       [18, 74, 59],
       [30, 68, 13],
       [55, 47, 78]])

In [185]: B
Out[185]: 
array([[45, 11, 88],
       [55, 83, 46],
       [95, 87, 77],
       [30,  9, 37],
       [14, 97, 98],
       [18, 48, 53]])

In [186]: out
Out[186]: 
array([[ 45.,  11.,  86.],
       [ 55.,  47.,  78.],
       [ nan,  nan,  nan],
       [ 30.,  68.,  13.],
       [ nan,  nan,  nan],
       [ 18.,  74.,  59.]])

Its all numpy so it feels like it should be faster than my suggestion ... but it doesn't seem be (at least on this small array) ... %timeit vectorised(A, B) => 9.6 µs vs %timeit match_order(A, B) => 6.99 µs. Is this due to the two searchsorted's?
That's better ... the vectorised is faster once we make the arrays somewhat larger (e.g. 1000 rows)
@donkopotamus It's because of the setup work we are doing to process it. That's generally the case with vectorized approaches as they expect decent sized arrays.
@donkopotamus Also, if A[:,0] is already sorted, we can shave a considerable runtime there by avoiding creating and using sidx.
@Divakar this is great, I would not have been able to come up with the 'np.searchsorted' approach! thanks a lot

Vadim Pushtaev · Accepted Answer · 2016-06-16 10:32:02Z

0

The simple approach is to build a dict from A and then use it to map identifiers found in B to the new array.

Constructing dict:

>>> A = [[1,"a"], [2,"b"], [3,"c"]]
>>> A_dict = {x[0]: x for x in A}
>>> A_dict
{1: [1, 'a'], 2: [2, 'b'], 3: [3, 'c']}

Mapping:

>>> B = [[3,"..."], [2,"..."], [1,"..."]]
>>> result = (A_dict[x[0]] for x in B)
>>> list(result)
[[3, 'c'], [2, 'b'], [1, 'a']]

answered Jun 16, 2016 at 10:32

Vadim Pushtaev

2,34818 silver badges34 bronze badges

Comments

donkopotamus · Accepted Answer · 2016-06-16 10:58:37Z

Its not clear if you wish to concatenate the values in B onto A. Lets assume not ... then the simplest way is probably to just build a dictionary of identifier to row and then reorder A:

def match_order(A, B):
    # identifier -> row
    by_id = {A[i, 0]: A[i] for i in range(len(A))}

    # make up a fill row and rearrange according to B
    fill_row = [-1] * A.shape[1]
    return numpy.array([by_id.get(k, fill_row) for k in B[:, 0]])

As an example, if we have:

A = numpy.array([[111, 1], [222, 2], [333, 3], [555, 5]])
B = numpy.array([[222, 2], [111, 1], [333, 3], [444, 4], [555, 5]])

Then

>>> match_order(A, B)
array([[222,   2],
       [111,   1],
       [333,   3],
       [ -1,  -1],
       [555,   5]])

If you wish to concatenate B, then you can do so simply as:

>>> numpy.hstack( (match_order(A, B), B[:, 1:]) )
array([[222,   2,   2],
       [111,   1,   1],
       [333,   3,   3],
       [ -1,  -1,   4],
       [555,   5,   5]])

Swadhikar · Accepted Answer · 2016-06-16 11:06:36Z

>>> A = [[3,'d', 'e', 'f'], [1,'a','b','c'], [2,'n','n','n']]
>>> B = [[1,'a','b','c'], [3,'d','e','f']]
>>> A_dict = {x[0]:x[1:] for x in A}
>>> A_dict
    {1: ['a', 'b', 'c'], 2: ['n', 'n', 'n'], 3: ['d', 'e', 'f']}
>>> B_dict = {x[0]:x[1:] for x in B}
>>> B_dict
    {1: ['a', 'b', 'c'], 3: ['d', 'e', 'f']} 
>>> result=[[x] + A_dict[x] for x in A_dict if x in B_dict and A_dict[x]==B_dict[x]]
>>> result
    [[1, 'a', 'b', 'c'], [3, 'd', 'e', 'f']]

Here A[0], B[1] and A[1],B[0] are identical. Converting into a dict and dealing the problem makes life easier here.

Step 1: Create dict objects for each 2D list.

Step 2: Iterate each key in A_dict and check: a. If Key exists in B_dict, b. If yes, see if both keys have same value

Step 3: Append the key and value to form a 2-D list.

Cheers!

Collectives™ on Stack Overflow

sort rows of array to match order of another array using an identifier column

4 Answers 4

7 Comments

Comments

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

7 Comments

Comments

Comments

Comments

Linked

Related