Sorting array of objects by row using custom dtype

Question

I am attempting to sort a some arrays lexicographically by rows. The integer case works perfectly:

>>> arr = np.random.choice(10, size=(5, 3))
>>> arr
array([[1, 0, 2],
       [8, 0, 8],
       [1, 8, 4],
       [1, 3, 9],
       [6, 1, 8]])
>>> np.ndarray(arr.shape[0], dtype=[('', arr.dtype, arr.shape[1])], buffer=arr).sort()
>>> arr
array([[1, 0, 2],
       [1, 3, 9],
       [1, 8, 4],
       [6, 1, 8],
       [8, 0, 8]])

I can also do the sorting with

np.ndarray(arr.shape[0], dtype=[('', arr.dtype)] * arr.shape[1], buffer=arr).sort()

In both cases, the results are the same. However, that is not the case for object arrays:

>>> selection = np.array(list(string.ascii_lowercase), dtype=object)
>>> arr = np.random.choice(selection, size=(5, 3))
>>> arr
array([['t', 'p', 'g'],
       ['n', 's', 'd'],
       ['g', 'g', 'n'],
       ['g', 'h', 'o'],
       ['f', 'j', 'x']], dtype=object)
>>> np.ndarray(arr.shape[0], dtype=[('', arr.dtype, arr.shape[1])], buffer=arr).sort()
>>> arr
array([['t', 'p', 'g'],
       ['n', 's', 'd'],
       ['g', 'h', 'o'],
       ['g', 'g', 'n'],
       ['f', 'j', 'x']], dtype=object)
>>> np.ndarray(arr.shape[0], dtype=[('', arr.dtype)] * arr.shape[1], buffer=arr).sort()
>>> arr
array([['f', 'j', 'x'],
       ['g', 'g', 'n'],
       ['g', 'h', 'o'],
       ['n', 's', 'd'],
       ['t', 'p', 'g']], dtype=object)

Clearly only the case with dtype=[('', arr.dtype)] * arr.shape[1] is working properly. Why is that? What is different about dtype=[('', arr.dtype, arr.shape[1])]? The sort is clearly doing something, but the order appears to be nonsensical at first glance. Is it using pointers as the sort keys?

For what it's worth, np.searchsorted appears to be doing the same sort of comparison as np.sort, as expected.

I think the first case wraps your elements of structured array around an object array (e.g. [(['f', 'r', 'h'],)]) while second one creates a structured array from elements directly (e.g. [('f', 'r', 'h')]). I would guess the first case sorts by array and second by elements. — Ehsan
– Ehsan, Commented Oct 5, 2020 at 20:58
@Ehsan. That's basically what the code shows. I'm a bit curious as to why. sort specifically mentions that the structure fields are sorted in lexicographical order, but it's a bit unclear how that applies in this case. If we sort arrays by pointer value, then why not sort scalars the same way. If we sort scalars by object comparison, then why not sort arrays the same way? — Mad Physicist
– Mad Physicist, Commented Oct 5, 2020 at 21:08

Mike MacNeil · Accepted Answer · 2020-10-15 16:13:52Z

0

This actually works fine

In [16]: selection = np.array(list(string.ascii_lowercase))

In [17]: arr = np.random.choice(selection, size=(5, 3))

In [18]: arr
Out[18]:
array([['x', 'l', 'i'],
       ['k', 'h', 'b'],
       ['y', 'h', 'w'],
       ['i', 'u', 't'],
       ['v', 'u', 'k']], dtype='<U1')

In [19]: np.ndarray(arr.shape[0], dtype=[('', arr.dtype, arr.shape[1])], buffer=arr).sort()

In [20]: arr
Out[20]:
array([['i', 'u', 't'],
       ['k', 'h', 'b'],
       ['v', 'u', 'k'],
       ['x', 'l', 'i'],
       ['y', 'h', 'w']], dtype='<U1')

The issue is with using dtype object for selection.

In [21]: selection = np.array(list(string.ascii_lowercase), dtype = object)

In [22]: arr = np.random.choice(selection, size=(5, 3))

In [23]: arr
Out[23]:
array([['b', 'h', 'e'],
       ['o', 'z', 'c'],
       ['g', 'v', 'z'],
       ['r', 'n', 'k'],
       ['a', 'h', 't']], dtype=object)

In [24]: np.ndarray(arr.shape[0], dtype=[('', arr.dtype, arr.shape[1])], buffer=arr).sort()

In [25]: arr
Out[25]:
array([['o', 'z', 'c'],
       ['b', 'h', 'e'],
       ['r', 'n', 'k'],
       ['a', 'h', 't'],
       ['g', 'v', 'z']], dtype=object)

Note dtype = 'O' means numpy type of python object see here for more, which I don't think provides a comparison operator.

The two types that you've provided, normally, should still work.

answered Oct 15, 2020 at 16:13

Mike MacNeil

273 bronze badges

Sign up to request clarification or add additional context in comments.

11 Comments

Mad Physicist Over a year ago

In my case, str is a standin for arbitrary objects that support rich comparison. I noted that the sort works perfectly for integers, and not surprisingly other types supported by numpy. The issue is that depending on the arrangement of the dtype, objects are compared differently by the sorter. I'd like to as a minimum understand what is going on, and if possible, get a sense for how to work around that. This answer just confirms what I've already discovered and discussed.

Mike MacNeil Over a year ago

It is the opposite of what you found. The case dtype=[('', arr.dtype, arr.shape[1])], which was not working for you, works in the first case. The reason is this is of type np.dtype('f_0','U1', (3,)) which translates to "fixed sized arrays of characters of dimension 3", which can be sorted lexicographically. In my second case, the dtype is python object (same as your code above) or dtype('f_0','O', (3,)). Which translates to "fixed sized arrays of size 3 of object". But objects can have arbitrary memory layouts.

Mad Physicist Over a year ago

Using a non-object dtype does not interest me. I've established that it works correctly in the question, although not specifically for fixed width character arrays. Setting the dtype to have shape[1] independent fields enables rich comparison of the objects pointed to in an object array. Setting the dtype to be a single field with an array of shape[1] elements appears to compare the pointers, although I'm not even sure about that. The question is restricted only to that. Instead of strings, imagine some arbitrary custom type with no numpy analog.

Mad Physicist Over a year ago

I think you're missing the point. Strings very much do have full rich comparison defined, which is exactly why I picked them for an MCVE instead of making a custom type. Don't let the fact that there's a numpy Unicode type distract you.

Mike MacNeil Over a year ago

Yea I understand. But your example, the case where you are able to compare objects pointed to when setting the dtype to have shape[1] independent fields, you can do rich comparisons. The question (as I understood it) was when you set the dtype to be a single field with an array of shape[1] elements, you no longer can, and it breaks. Its not clear it compares pointers. The example I gave shows that if the type is set correctly (rather than declaring it the generic keyword for object), that you still can do these rich comparisons you want.

|

Mad Physicist · Accepted Answer · 2021-01-07 16:55:37Z

The fact that sorting works for integers happens to be a coincidence this can be verified by looking at the result of floating point operations:

>>> arr = np.array([[0.5, 1.0, 10.2],
                    [0.4, 2.0, 11.0],
                    [1.0, 2.0, 4.0]])
>>> np.sort(np.ndarray(arr.shape[0], dtype=[('', arr.dtype, arr.shape[1])], buffer=arr))
array([([ 0.5,  1. , 10.2],),
       ([ 1. ,  2. ,  4. ],),
       ([ 0.4,  2. , 11. ],)], dtype=[('f0', '<f8', (3,))])
>>> np.sort(np.ndarray(arr.shape[0], dtype=[('', arr.dtype)] * arr.shape[1], buffer=arr))
array([(0.4, 2., 11. ),
       (0.5, 1., 10.2),
       (1. , 2.,  4. )],
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8')])

Another hint comes from looking at the bits of the numbers 0.5, 0.4 and 1.0:

0.5 = 0x3FE0000000000000
0.4 = 0x3FD999999999999A
1.0 = 0x3FF6666666666666

On a little-endian machine, we have that 0x00 < 0x66 < 0x9A (the last byte shown above comes first).

The exact answer can be verified by looking at the sorting functions in the source code. For example, in quicksort.c.src, we see that all types that are not explicitly numerical (including structure fields that are not scalars), are handled by the npy_quicksort generic function. It uses function cmp as a comparator and macros GENERIC_SWAP and GENERIC_COPY to swap and copy, respectively.

The function cmp is defined as PyArray_DESCR(arr)->f->compare. The macros are defined as element-wise operations in npysort_common.h.

So the final result is that for any non-scalar type, including packed array structure fields, the comparison is done byte-by-byte. For objects, this will of course be the numeric values of the pointers. For floats, this will be the IEEE-754 representation. The fact that positive integers appear to work correctly is caused by the fact that my platform uses little-endian encoding. Negative integers stored in twos complement form would likely not yield correct results.

Cedric · Accepted Answer · 2020-10-16 16:49:32Z

This might not be a perfect answert but I hope I can help you:

1.) Why isn't it working properly: Because dtype=[('', arr.dtype)] * arr.shape[1] != dtype=[('', arr.dtype, arr.shape[1])]

2.) What's the difference between those two? Well, while the first one adds the length to the list the second one multiplies the list. This means the output of the first one is something like: [('', dtype('O'), 3)] whereas the second is [('', dtype('O')), ('', dtype('O')), ('', dtype('O'))]

3.) The sort is clearly doing something wrong - no the input was just in the wrong format

4.) Is it using pointers as the sort keys? Do you mean if it formats the data by the data-keys? Then no it sorts them according to the data itself.

EDIT: Ok to make it more clear:

First of all I think you misunderstood @Mike MacNeil's anwer. To make it more plastic here some examples:

Let's consider a class Foo:

class Foo:
    def __init__(self, id):
        self._id = id
    
    def get_id(self):
        return self._id

    def __le__(self, ob):
        return self < ob or self == ob

    def __lt__(self, ob):
        return self.get_id() < ob.get_id()

    def __ge__(self, ob):
        return not self < ob

    def __gt__(self, ob):
        return not self <= ob

    def __eq__(self, ob):
        return self.get_id() == ob.get_id()

    def __str__(self):
        return f'Foo({self.get_id()})'

    def __repr__(self):
        rep = super().__repr__()
        return f'{str(self)} {rep[rep.index("at"):rep.index(">")]}'

We see the comparisons have been implemented just like in string. I also implemented the __repr__() and __str__() methods give me a second and you will understand why:

Let's create a numpy array in the first step:

>>> arr4 = np.array([[Foo(1), Foo(2), Foo(3)],
        [Foo(4), Foo(5), Foo(6)],
        [Foo(7), Foo(8), Foo(9)],
        [Foo(10), Foo(11), Foo(12)]])

If we print it, it will look something like this:

>>> arr4
array([[Foo(1) at 0x000002411F753F08, Foo(2) at 0x000002411F73FF48, Foo(3) at 0x000002411F74EE48],
       [Foo(4) at 0x000002411F74EE88, Foo(5) at 0x000002411F74EE08, Foo(6) at 0x000002411F756148],
       [Foo(7) at 0x000002411F7561C8, Foo(8) at 0x000002411F756208, Foo(9) at 0x000002411F756248],
       [Foo(10) at 0x000002411F756288, Foo(11) at 0x000002411F7562C8,
        Foo(12) at 0x000002411F756308]], dtype=object)

If we now print the ndarray...

>>> np.ndarray(arr4.shape[0], dtype=[('', arr4.dtype, arr4.shape[1])], buffer=arr4)
array([([Foo(1) at 0x000002411F753F08, Foo(2) at 0x000002411F73FF48, Foo(3) at 0x000002411F74EE48],),
       ([Foo(4) at 0x000002411F74EE88, Foo(5) at 0x000002411F74EE08, Foo(6) at 0x000002411F756148],),
       ([Foo(7) at 0x000002411F7561C8, Foo(8) at 0x000002411F756208, Foo(9) at 0x000002411F756248],),
       ([Foo(10) at 0x000002411F756288, Foo(11) at 0x000002411F7562C8, Foo(12) at 0x000002411F756308],)], dtype=[('f0', 'O', (3,))])

...we see it basically has the same shape as

>>> np.ndarray(arr.shape[0], dtype=[('', arr.dtype, arr.shape[1])], buffer=arr)
array([['t', 'p', 'g'],
       ['n', 's', 'd'],
       ['g', 'h', 'o'],
       ['g', 'g', 'n'],
       ['f', 'j', 'x']], dtype=[('f0', 'O', (3,))])

After sorting the Foo-Array with np.ndarray(arr4.shape[0], dtype=[('', arr4.dtype, arr4.shape[1])], buffer=arr4).sort() we see that the output of arr4 looks something like:

>>> arr4
array([[Foo(1) at 0x000002411F753F08, Foo(2) at 0x000002411F73FF48, Foo(3) at 0x000002411F74EE48],
       [Foo(10) at 0x000002411F756288, Foo(11) at 0x000002411F7562C8, Foo(12) at 0x000002411F756308],
       [Foo(4) at 0x000002411F74EE88, Foo(5) at 0x000002411F74EE08, Foo(6) at 0x000002411F756148],
       [Foo(7) at 0x000002411F7561C8, Foo(8) at 0x000002411F756208, Foo(9) at 0x000002411F756248]], dtype=object)

although

>>> Foo(10) > Foo(4)
True

(Still np.ndarray(arr4.shape[0], dtype=[('', arr4.dtype)] * arr4.shape[1], buffer=arr4).sort() can print out the expected result sorted with the id-key using the defined comparison-functions.)

The comparison-rule for the dtype=object is not as you would expect just using the standard comparison functions but is comparing the objects representations (→ in this case this would mean that e.g. repr(Foo(10)) < repr(Foo(2)) would be True although we would actually expect Foo(10) to be greater than Foo(2)).

But by telling numpy the exact dimensions/shape, numpy uses the standard comparisons which will lead to the expected result because it now knows that all the elements of a row are from exactly the same type and not just some random objects clenched together into one array. This is why also your example didn't work with string however would work with str as str will be supported natively by numpy (<U1).

Tyler · Accepted Answer · 2020-10-14 16:59:44Z

-2

It appears that your first method, dtype=[('', arr.dtype, arr.shape[1])], buffer=arr).sort(), is trying to sort the dtype=object but it doesn't have enough information to sort it. When you use the second method dtype=[('', arr.dtype)] * arr.shape[1], buffer=arr).sort(), it unpacks the object, allowing the sort method to "see" what it's supposed to sort on. When you're using these methods on scalars, the sort method can see they're scalars and not an object.

This is all conjecture on my part, but it makes sense to me. If anyone can correct me, please do!

answered Oct 14, 2020 at 16:59

Tyler

1429 bronze badges

3 Comments

Tyler Over a year ago

I don't mind downvotes...if the answer isn't useful, then it isn't. Please let me know why it's not useful though so I can increase my own understanding. Thanks! :)

Mad Physicist Over a year ago

Your answer is by your own admission pure conjecture that I already covered in the question. It's guesswork that does not contribute to my understanding of the subject. I too would like to learn, but posting whatever comes to mind does not help accomplish that.

Tyler Over a year ago

Awesome, thanks @MadPhysicist, and my apologies for not helping with my answer.

Collectives™ on Stack Overflow

Sorting array of objects by row using custom dtype

4 Answers 4

11 Comments

Comments

2 Comments

3 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

11 Comments

Comments

2 Comments

3 Comments

Linked

Related