8

I've got a numpy array containing labels. I'd like to get calculate a number for each label based on its size and bounding box. How can I write this more efficiently so that it's realistic to use on large arrays (~15000 labels)?

A = array([[ 1, 1, 0, 3, 3],
           [ 1, 1, 0, 0, 0],
           [ 1, 0, 0, 2, 2],
           [ 1, 0, 2, 2, 2]] )

B = zeros( 4 )

for label in range(1, 4):
    # get the bounding box of the label
    label_points = argwhere( A == label )
    (y0, x0), (y1, x1) = label_points.min(0), label_points.max(0) + 1

    # assume I've computed the size of each label in a numpy array size_A
    B[ label ] = myfunc(y0, x0, y1, x1, size_A[label])
3
  • How big is A in the real use case? Commented Nov 23, 2011 at 16:39
  • Did you do some profiling to see which of your statements is the one slowing you down? Maybe it's the function myfunc which could probably be parallized by saving y0, x0, y1, x1 in separate arrays getting out of the loop and only calling the function once. Otherwise, if speed really counts, you may want to look into whether it's worth doing some C code. I found cython to be really comfortable when working with numpy arrays. Commented Nov 23, 2011 at 16:46
  • 2
    I think the killer is the argwhere call for every label. Commented Nov 23, 2011 at 16:50

5 Answers 5

7

I wasn't really able to implement this efficiently using some NumPy vectorised functions, so maybe a clever Python implementation will be faster.

def first_row(a, labels):
    d = {}
    d_setdefault = d.setdefault
    len_ = len
    num_labels = len_(labels)
    for i, row in enumerate(a):
        for label in row:
            d_setdefault(label, i)
        if len_(d) == num_labels:
            break
    return d

This function returns a dictionary mapping each label to the index of the first row it appears in. Applying the function to A, A.T, A[::-1] and A.T[::-1] also gives you the first column as well as the last row and column.

If you would rather like a list instead of a dictionary, you can turn the dictionary into a list using map(d.get, labels). Alternatively, you can use a NumPy array instead of a dictionary right from the start, but you will lose the ability to leave the loop early as soon as all labels were found.

I'd be interested whether (and how much) this actually speeds up your code, but I'm confident that it is faster than your original solution.

Sign up to request clarification or add additional context in comments.

2 Comments

It's not the prettiest way to go, but it definitely works. My original way ran so long that I never even let it finish (gave up after 20 minutes). I just ran your method and got it in 6m30s.
@ajwood: Thanks for the feedback. I know it's not pretty, but the best easy solution I could think of. If you want to do it even faster, I'd suggest to implement it in Cython.
5

Another method:

use bincount() to get labels count in every row and column, and save the information in rows and cols array.

For each label you only need to search the range in rows and columns. It's faster than sort, on my pc, it can do the calculation in a few seconds.

def label_range2(A):
    maxlabel = np.max(A)+1
    h, w = A.shape
    rows = np.zeros((h, maxlabel), np.bool)
    for row in xrange(h):
        rows[row,:] = np.bincount(A[row,:], minlength=maxlabel) > 0

    cols = np.zeros((w, maxlabel), np.bool)
    for col in xrange(w):
        cols[col,:] =np.bincount(A[:,col], minlength=maxlabel) > 0

    for label in xrange(1, maxlabel):
        row = rows[:, label]
        col = cols[:, label]
        y = np.where(row)[0]
        x = np.where(col)[0]
        x0 = np.min(x)
        x1 = np.max(x)+1
        y0 = np.min(y)
        y1 = np.max(y)+1        
        yield label, x0,y0,x1,y1

1 Comment

This looks very promising, and I'll give it a try asap.
5

Algorithm:

  1. change the array to one dimension
  2. get the sort index by argsort()
  3. get the sorted version of on dimension array as sorted_A
  4. use where() and diff() to find label change position in sorted_A
  5. use the change position and the sort index to get the original position of the label in one dimension.
  6. calculate two dimension location from the on dimension position.

for large array such as (7000, 9000), is can finished the calculation in 30s.

here is the code:

import numpy as np

A = np.array([[ 1, 1, 0, 3, 3],
           [ 1, 1, 0, 0, 0],
           [ 1, 0, 0, 2, 2],
           [ 1, 0, 2, 2, 2]] )

def label_range(A):
    from itertools import izip_longest
    h, w = A.shape
    tmp = A.reshape(-1)

    index = np.argsort(tmp)
    sorted_A = tmp[index]
    pos = np.where(np.diff(sorted_A))[0]+1
    for p1,p2 in izip_longest(pos,pos[1:]):
        label_index = index[p1:p2]
        y = label_index // w
        x = label_index % w

        x0 = np.min(x)
        x1 = np.max(x)+1
        y0 = np.min(y)
        y1 = np.max(y)+1
        label = tmp[label_index[0]]

        yield label,x0,y0,x1,y1

for label,x0,y0,x1,y1 in label_range(A):
    print "%d:(%d,%d)-(%d,%d)" % (label, x0,y0,x1,y1)

#B = np.random.randint(0, 100, (7000, 9000))
#list(label_range(B))

1 Comment

I accidentally downvoted you post because I thought the algorithm was wrong. I had to do a dummy edit to unlock the vote -- upvoted instead. :)
1

The performace bottleneck seems indeed to be the call to argmax. It can be avoided by changing the loop as follows (only computing y0, y1, but easy to generalize to x0, x1):

for label in range(1, 4):
    comp = (A == label)
    yminind = comp.argmax(0)
    ymin = comp.max(0)
    ymaxind = comp.shape[0] - comp[::-1].argmax(0)
    y0 = yminind[ymin].min()
    y1 = ymaxind[ymin].max()

I'm not sure about the reason for the performance difference, but one reason might be that all operations like ==, argmax, and max can preallocate their output array directly from the shape of the input array, which is not possible for argwhere.

Comments

1

Using PyPy you can just run the loop and not worry about the vectorization. It should be fast.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.