
I have a 2D NumPy array whose rows are (X, Y, Z, A) values, where (X, Y, Z) are Cartesian coordinates in 3D space and A is some value at that location. As an example:

__X__|__Y__|__Z__|__A__
  13 |  7  |  21 | 1.5
  9  |  2  |  7  | 0.5
  15 |  3  |  9  | 1.1
  13 |  7  |  21 | 0.9
  13 |  7  |  21 | 1.7
  15 |  3  |  9  | 1.1

Is there an efficient way to find all the unique combinations of (X, Y) and sum their A values? For example, the total for (13, 7) would be 1.5 + 0.9 + 1.7 = 4.1.

2 Answers


A scipy.sparse matrix takes exactly this kind of information, though only for 2D:

sparse.coo_matrix((data, (row, col)))

where row and col are index arrays like your X, Y, and Z. It sums duplicate entries.

The first step to doing that is a lexical sort of the indices. That puts points with matching coordinates next to each other.

The actual grouping and summing is done, I believe, in compiled code. Part of the difficulty in doing that fast in numpy terms is that each group has a variable number of elements: some rows are unique, others may repeat 3 or more times.

Python's itertools has a groupby tool. Pandas also has grouping functions. I can also imagine using a defaultdict to group and sum values, as sketched below.
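As an aside, here is a minimal sketch of the defaultdict idea, using the question's sample data (plain Python; this is an illustration, not part of the original answer):

    from collections import defaultdict

    rows = [(13, 7, 21, 1.5), (9, 2, 7, 0.5), (15, 3, 9, 1.1),
            (13, 7, 21, 0.9), (13, 7, 21, 1.7), (15, 3, 9, 1.1)]

    totals = defaultdict(float)
    for x, y, z, a in rows:
        totals[(x, y)] += a      # group on (X, Y), accumulate A

    # totals -> {(13, 7): 4.1, (9, 2): 0.5, (15, 3): 2.2}
    # The pandas equivalent would be along the lines of
    # pd.DataFrame(rows, columns=list('XYZA')).groupby(['X', 'Y'])['A'].sum()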

The ufunc method reduceat might also work, though it's easier to use in 1D than in 2D or 3D.

If you are ignoring the Z, the sparse coo_matrix approach may be easiest.

In [1]: import numpy as np; from scipy import sparse
In [2]: X=np.array([13,9,15,13,13,15])
In [3]: Y=np.array([7,2,3,7,7,3])
In [4]: A=np.array([1.5,0.5,1.1,0.9,1.7,1.1])
In [5]: M=sparse.coo_matrix((A,(X,Y)))
In [15]: M.sum_duplicates()
In [16]: M.data
Out[16]: array([ 0.5,  2.2,  4.1])
In [17]: M.row
Out[17]: array([ 9, 15, 13])
In [18]: M.col
Out[18]: array([2, 3, 7])
In [19]: M
Out[19]: 
<16x8 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in COOrdinate format>
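To get the result back as (X, Y, total) triplets, the matrix's attributes can be stacked (a small addition, not in the original session):

    # Each stored element pairs its (row, col) coordinates with the summed value
    out = np.column_stack([M.row, M.col, M.data])
    # out -> [[ 9.,  2.,  0.5],
    #         [15.,  3.,  2.2],
    #         [13.,  7.,  4.1]]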

Here's what I had in mind with lexsort

In [32]: Z=np.array([21,7,9,21,21,9])
In [33]: xyz=np.stack([X,Y,Z],1)
In [34]: idx=np.lexsort([X,Y,Z])
In [35]: idx
Out[35]: array([1, 2, 5, 0, 3, 4], dtype=int32)
In [36]: xyz[idx,:]
Out[36]: 
array([[ 9,  2,  7],
       [15,  3,  9],
       [15,  3,  9],
       [13,  7, 21],
       [13,  7, 21],
       [13,  7, 21]])
In [37]: A[idx]
Out[37]: array([ 0.5,  1.1,  1.1,  1.5,  0.9,  1.7])

When sorted like this it becomes more evident that the Z coordinate is 'redundant', at least for this purpose.

Using reduceat to sum groups:

In [40]: np.add.reduceat(A[idx],[0,1,3])  
Out[40]: array([ 0.5,  2.2,  4.1])

(for now I just eyeballed the [0,1,3] list)
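That boundary list can also be derived from the sorted keys rather than eyeballed; a sketch, assuming X, Y, A, and idx are defined as above (this step isn't in the original session):

    # Find where the sorted (X, Y) pairs change; each change starts a new group
    xy_sorted = np.stack([X, Y], 1)[idx]
    change = np.any(np.diff(xy_sorted, axis=0), axis=1)        # True where the key changes
    starts = np.concatenate(([0], np.nonzero(change)[0] + 1))  # -> array([0, 1, 3])
    np.add.reduceat(A[idx], starts)                            # -> array([ 0.5,  2.2,  4.1])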


1 Comment

These are great, thanks! I think lexsort was what I was planning to do the "hard way" -- I wasn't aware there was a function for it already, hence my question. I like the way the data is retained in lexsort, so I'll go with that.

Approach #1

Get each row as a view, thus converting each row into a single scalar, then use np.unique to tag each row with an ID from 0 to n-1 (with n as the number of unique rows), and finally use np.bincount to perform the summing of the last column grouped by those IDs.

Here's the implementation -

import numpy as np

def get_row_view(a):
    # View each row as a single np.void scalar so whole rows compare at once
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    a = np.ascontiguousarray(a)
    return a.reshape(a.shape[0], -1).view(void_dt).ravel()

def groupby_cols_view(x):
    a = x[:,:2].astype(int)   # grouping keys: the (X, Y) columns
    a1D = get_row_view(a)     # one scalar per row
    _, indx, IDs = np.unique(a1D, return_index=1, return_inverse=1)
    # bincount with weights sums the last column within each ID group
    return np.c_[x[indx,:2], np.bincount(IDs, x[:,-1])]

Approach #2

Same as approach #1, but instead of working with the view, we generate an equivalent linear index for each row, again reducing each row to a scalar. The rest of the workflow is the same as in the first approach.

The implementation -

def groupby_cols_linearindex(x):
    a = x[:,:2].astype(int)
    # Collapse each (X, Y) pair to a unique linear index; the scale factor
    # must cover the full range of the first column
    a1D = a[:,0] + a[:,1]*(a[:,0].max() - a[:,0].min() + 1)
    _, indx, IDs = np.unique(a1D, return_index=1, return_inverse=1)
    return np.c_[x[indx,:2], np.bincount(IDs, x[:,-1])]

Sample runs

In [80]: data
Out[80]: 
array([[ 2.        ,  5.        ,  1.        ,  0.40756048],
       [ 3.        ,  4.        ,  6.        ,  0.78945661],
       [ 1.        ,  3.        ,  0.        ,  0.03943097],
       [ 2.        ,  5.        ,  7.        ,  0.43663582],
       [ 4.        ,  5.        ,  0.        ,  0.14919507],
       [ 1.        ,  3.        ,  3.        ,  0.03680583],
       [ 1.        ,  4.        ,  8.        ,  0.36504428],
       [ 3.        ,  4.        ,  2.        ,  0.8598825 ]])

In [81]: groupby_cols_view(data)
Out[81]: 
array([[ 1.        ,  3.        ,  0.0762368 ],
       [ 1.        ,  4.        ,  0.36504428],
       [ 2.        ,  5.        ,  0.8441963 ],
       [ 3.        ,  4.        ,  1.64933911],
       [ 4.        ,  5.        ,  0.14919507]])

In [82]: groupby_cols_linearindex(data)
Out[82]: 
array([[ 1.        ,  3.        ,  0.0762368 ],
       [ 1.        ,  4.        ,  0.36504428],
       [ 3.        ,  4.        ,  1.64933911],
       [ 2.        ,  5.        ,  0.8441963 ],
       [ 4.        ,  5.        ,  0.14919507]])

1 Comment

I actually wound up using your answer -- the addition of the "Sample runs" is really helpful, to double check the function.
