Merge multidimensional NumPy arrays based on first row

Question

I have to work with sensor data (from ROS, specifically, though that should not be relevant). I have several 2-D NumPy arrays, each with one row storing the timestamps and the remaining rows storing the corresponding sensor data. The problem is that these arrays do not have the same dimensions, because the sensors have different sampling times. I need to merge all of these arrays into a single big one. How can I do so based on the timestamps, replacing the missing values with, say, 0 or NaN?

Example of my situation:

import numpy as np

time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)

a=np.array((time1,data1))
print(a)

time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)

b=np.array((time2,data2))

print(b)

Which prints:

[[  1   2   3   4   5   6   7   8   9]
 [ 51   9 117 174 164  60  95 197  30]]

[[  1   3   5   7   9]
 [ 35 188 114 153  36]]

What I am looking for is

[[  1   2   3   4   5   6   7   8   9]
 [ 51   9 117 174 164  60  95 197  30]
 [ 35   0 188   0 114   0 153   0  36]]

Is there an efficient way to achieve this? This is only an example; I am actually working with thousands of samples. Thanks!

Community · Accepted Answer · 2020-06-20 09:12:55Z

For simple case of one b-matrix

Assuming the first row of a stores all possible timestamps, and the first rows of both a and b are sorted, we can use np.searchsorted -

idx = np.searchsorted(a[0],b[0])
out_dtype = np.result_type(a.dtype, b.dtype)
b0 = np.zeros(a.shape[1],dtype=out_dtype)
b0[idx] = b[1]
out = np.vstack((a,b0))
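The question also mentions NaN as a possible filler. A minimal sketch of that variant follows, assuming the same preconditions as above (sorted first rows, every timestamp of b present in a). The sample values are illustrative stand-ins for the random data, and the names b_row and out_nan are not from the original answer. Since NaN only exists for floating-point types, the result is promoted to float.

```python
import numpy as np

# Deterministic stand-ins for the question's arrays (values are illustrative).
a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9],
              [51, 9, 117, 174, 164, 60, 95, 197, 30]])
b = np.array([[1, 3, 5, 7, 9],
              [35, 188, 114, 153, 36]])

# Column of `a` at which each timestamp of `b` belongs.
idx = np.searchsorted(a[0], b[0])

# NaN needs a floating-point dtype, so include float64 in the promotion.
out_dtype = np.result_type(a.dtype, b.dtype, np.float64)

# Start from an all-NaN row and fill in only the columns that `b` covers.
b_row = np.full(a.shape[1], np.nan, dtype=out_dtype)
b_row[idx] = b[1]

out_nan = np.vstack((a.astype(out_dtype), b_row))
print(out_nan)
```

Downstream code can then tell "no sample" apart from a genuine reading of 0, for example with np.isnan or np.nanmean.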

For several b-matrices

Approach #1

To extend to multiple b-matrices, we can follow a similar method with np.searchsorted within a loop, like so -

def merge_arrays(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices
    
    lens = np.array([len(i) for i in B])
    L = (lens-1).sum() + len(a)
    out_dtype = np.result_type(a.dtype, *[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    for b_i in B:
        idx = np.searchsorted(a[0],b_i[0])
        out[s:s+len(b_i)-1,idx] = b_i[1:]
        s += len(b_i)-1
    return out

Sample run -

In [175]: a
Out[175]: 
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]])

In [176]: b0
Out[176]: 
array([[16, 22, 34, 56, 67, 91],
       [20, 80, 69, 79, 47, 64],
       [82, 88, 49, 29, 19, 19]])

In [177]: b1
Out[177]: 
array([[ 4, 16, 34, 99],
       [28, 34,  0,  0],
       [36, 53,  5, 38],
       [17, 79,  4, 42]])

In [178]: merge_arrays(a, [b0,b1])
Out[178]: 
array([[ 4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 0,  0, 20, 80, 69, 79, 47,  0, 64,  0],
       [ 0,  0, 82, 88, 49, 29, 19,  0, 19,  0],
       [28,  0, 34,  0,  0,  0,  0,  0,  0,  0],
       [36,  0, 53,  0,  5,  0,  0,  0,  0, 38],
       [17,  0, 79,  0,  4,  0,  0,  0,  0, 42]])
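The approach above relies on np.searchsorted, which returns an insertion position rather than an exact match. If a b-matrix contains a timestamp that is missing from a, its value is silently written into a neighbouring column. The following sketch of a precondition check may help catch that early; check_timestamps is a hypothetical helper, not part of the answer.

```python
import numpy as np

def check_timestamps(a, b):
    """Hypothetical helper: raise if b cannot be placed into a via np.searchsorted."""
    ta, tb = a[0], b[0]
    # searchsorted needs sorted input; strictly increasing also rules out duplicates.
    if np.any(np.diff(ta) <= 0) or np.any(np.diff(tb) <= 0):
        raise ValueError("timestamps must be strictly increasing")
    # Every timestamp of b must have an exact match in a.
    missing = tb[~np.isin(tb, ta)]
    if missing.size:
        raise ValueError(f"timestamps missing from a: {missing.tolist()}")

a = np.array([[4, 11, 16, 22, 34, 56, 67, 87, 91, 99],
              [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b_ok = np.array([[16, 22, 34], [20, 80, 69]])
b_bad = np.array([[16, 23, 34], [20, 80, 69]])   # 23 does not occur in a

check_timestamps(a, b_ok)        # passes silently
try:
    check_timestamps(a, b_bad)
except ValueError as err:
    print(err)
```

Without such a check, the timestamp 23 in b_bad would map to the same column as 34, and one of the two values would overwrite the other.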

Approach #2

If looping with np.searchsorted seems to be the bottleneck, we can vectorize that part -

def merge_arrays_v2(a, B):
    # a : Array with first row holding all possible timestamps
    # B : list or tuple of all b-matrices
    
    lens = np.array([len(i) for i in B])
    L = (lens-1).sum() + len(a)
    out_dtype = np.result_type(a.dtype, *[i.dtype for i in B])
    out = np.zeros((L, a.shape[1]), dtype=out_dtype)
    out[:len(a)] = a
    s = len(a)
    
    r0 = [i[0] for i in B]
    r0s = np.concatenate((r0))
    idxs = np.searchsorted(a[0],r0s)
    
    cols = np.array([i.shape[1] for i in B])
    sp = np.r_[0,cols.cumsum()]
    start,stop = sp[:-1],sp[1:]
    for (b_i,s0,s1) in zip(B,start,stop):
        idx = idxs[s0:s1]
        out[s:s+len(b_i)-1,idx] = b_i[1:]
        s += len(b_i)-1
    return out

This is really useful, thanks! However, how is it possible to do this operation on multiple matrices and without knowing the largest one (i.e., the one with most samples)?
@Bianca Would it be based on the first rows again? Also, would all of those be sorted?
Yes, based on the first row. And yes, they are sorted. Imagine several matrices 'b' as in the example, all with different sampling times.
Is the order in which they are "merged" relevant? @bianca I mean, taking into account that they all merge with the largest.
Nice solution @divakar. I think I overcomplicated mine a little by assuming we don't know which array is A. As if having differently shaped arrays didn't complicate it enough :)

yatu · Accepted Answer · 2019-05-20 17:46:02Z

Here's an approach using np.searchsorted:

time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)

a=np.array((time1,data1))
# array([[  1,   2,   3,   4,   5,   6,   7,   8,   9],
#        [118, 105,  86,  94,  69,  17, 142,  46,  54]])

time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b=np.array((time2,data2))
# array([[ 1,  3,  5,  7,  9],
#        [70, 15,  4, 97, 57]])

out = np.vstack([a, np.zeros(a.shape[1])])
out[out.shape[0]-1, np.searchsorted(a[0], b[0])] = b[1]

array([[  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.],
       [118., 105.,  86.,  94.,  69.,  17., 142.,  46.,  54.],
       [ 70.,   0.,  15.,   0.,   4.,   0.,  97.,   0.,  57.]])

Update - Merging many matrices

Here's an almost fully vectorised approach for a scenario with multiple b matrices. This approach does not require a priori knowledge of which is the largest list:

def merge_timestamps(*x):
    # infer which is the list with maximum length
    # as well as individual lengths
    concat = np.concatenate(*x, axis=1)[0]
    lens = np.r_[np.flatnonzero(np.diff(concat) < 0), len(concat)]
    max_len_list = np.r_[lens[0], np.diff(lens)].argmax()
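    # define the output matrix: the largest array (timestamps and data)
    # followed by one row of zeros per remaining array
    A = x[0][max_len_list]
    out = np.vstack([A, np.zeros((len(*x)-1, len(A[0])))])
    others = np.flatnonzero(~np.isin(np.arange(len(*x)), max_len_list))
    # Update the output matrix with the values of the smaller
    # arrays according to their index. This is of course assuming
    # all values are contained in the largest, and that the timestamps
    # are consecutive integers (column = timestamp - first timestamp)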
    for ix, i in enumerate(others):
        out[-(ix+1), x[0][i][0]-A[0].min()] = x[0][i][1]
    return out

Let's check with the following example:

time1=np.arange(1,10)
data1=np.random.randint(200, size=time1.shape)

a=np.array((time1,data1))

# array([[  1,   2,   3,   4,   5,   6,   7,   8,   9],
#        [107,  13, 123, 119, 137, 135,  65, 157,  83]])

time2=np.arange(1,10,2)
data2=np.random.randint(200, size=time2.shape)
b = np.array((time2,data2))
# array([[  1,   3,   5,   7,   9],
#        [ 81,  49,  83,  32, 179]])

time3=np.arange(1,4,2)
data3=np.random.randint(200, size=time3.shape)
c=np.array((time3,data3))
# array([[  1,   3],
#        [185, 117]])

merge_timestamps([a,b,c])

array([[  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.],
       [107.,  13., 123., 119., 137., 135.,  65., 157.,  83.],
       [185.,   0., 117.,   0.,   0.,   0.,   0.,   0.,   0.],
       [ 81.,   0.,  49.,   0.,  83.,   0.,  32.,   0., 179.]])

As mentioned, this approach does not require a priori knowledge of which is the largest list, i.e. it also works with:

merge_timestamps([b, c, a])

array([[  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.],
       [107.,  13., 123., 119., 137., 135.,  65., 157.,  83.],
       [185.,   0., 117.,   0.,   0.,   0.,   0.,   0.,   0.],
       [ 81.,   0.,  49.,   0.,  83.,   0.,  32.,   0., 179.]])
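The function above assumes that one array contains every timestamp. If that cannot be guaranteed, one possible alternative is to build the time axis as the union of all first rows and then place each array with np.searchsorted. The sketch below illustrates that idea. merge_on_union and its fill parameter are hypothetical names, the output is forced to float so that NaN can be used as a filler, and rows appear in input order rather than largest first.

```python
import numpy as np

def merge_on_union(arrays, fill=np.nan):
    """Hypothetical helper: merge arrays on the union of their first rows."""
    # Sorted, de-duplicated time axis covering every input array.
    times = np.unique(np.concatenate([arr[0] for arr in arrays]))
    # One timestamp row plus every data row of every input.
    n_rows = 1 + sum(len(arr) - 1 for arr in arrays)
    # Float output so NaN is representable (very large integer timestamps
    # could lose precision here; that is an accepted limitation of this sketch).
    out = np.full((n_rows, times.size), fill, dtype=float)
    out[0] = times
    row = 1
    for arr in arrays:
        # Each timestamp of arr is in `times`, so this is an exact lookup.
        cols = np.searchsorted(times, arr[0])
        n = len(arr) - 1
        out[row:row + n, cols] = arr[1:]
        row += n
    return out

# Neither array contains all timestamps: a lacks 3, b lacks 2 and 4.
a = np.array([[1, 2, 4, 5], [10, 20, 40, 50]])
b = np.array([[1, 3, 5], [7, 8, 9]])
merged = merge_on_union([a, b], fill=0)
print(merged)
```

Because the union is sorted and contains every input timestamp, this variant does not need consecutive integer timestamps either.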

asonagra · Accepted Answer · 2019-05-20 17:36:07Z

This is applicable only if the sensor captures data at a fixed interval. First, create a dataframe with a fixed interval (15 minutes in this case), then merge that dataframe with the sensor's data.

Code to generate a dataframe with a 15-minute interval (copied)

import numpy as np
import pandas as pd

# reference timestamps, one every 15 minutes
l = (pd.DataFrame(columns=['NULL'],
                  index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
                                      freq='15min'))
       .between_time('07:00','21:00')
       .index.strftime('%Y-%m-%dT%H:%M:%SZ')
       .tolist()
)
l = pd.DataFrame(l)

Assume the data below comes from the sensor

# sensor timestamps, one every 30 minutes
m = (pd.DataFrame(columns=['NULL'],
                  index=pd.date_range('2016-09-02T17:30:00Z', '2016-09-02T21:00:00Z',
                                      freq='30min'))
       .between_time('07:00','21:00')
       .index.strftime('%Y-%m-%dT%H:%M:%SZ')
       .tolist()
)
m = pd.DataFrame(m)
m['SensorData'] = np.arange(8)

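Merge the two dataframes above

# left join keeps every reference timestamp; unmatched rows get NaN
df = l.merge(m, left_on=0, right_on=0, how='left')
# replace the missing sensor values with 0
df['SensorData'] = df['SensorData'].fillna(0)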

Output

                       0  SensorData
0   2016-09-02T17:30:00Z         0.0
1   2016-09-02T17:45:00Z         0.0
2   2016-09-02T18:00:00Z         1.0
3   2016-09-02T18:15:00Z         0.0
4   2016-09-02T18:30:00Z         2.0
5   2016-09-02T18:45:00Z         0.0
6   2016-09-02T19:00:00Z         3.0
7   2016-09-02T19:15:00Z         0.0
8   2016-09-02T19:30:00Z         4.0
9   2016-09-02T19:45:00Z         0.0
10  2016-09-02T20:00:00Z         5.0
11  2016-09-02T20:15:00Z         0.0
12  2016-09-02T20:30:00Z         6.0
13  2016-09-02T20:45:00Z         0.0
14  2016-09-02T21:00:00Z         7.0
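The same idea can be applied directly to the arrays from the question. A minimal sketch follows, assuming pandas is available. The sample values stand in for the random data, and the names sensor1 and sensor2 are illustrative. Aligning on the index does not need a fixed sampling interval, because pd.concat along axis=1 joins on the union of the timestamps.

```python
import numpy as np
import pandas as pd

# Deterministic stand-ins for the question's arrays (values are illustrative).
a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9],
              [51, 9, 117, 174, 164, 60, 95, 197, 30]])
b = np.array([[1, 3, 5, 7, 9],
              [35, 188, 114, 153, 36]])

# One Series per sensor, indexed by timestamp.
s1 = pd.Series(a[1], index=a[0], name="sensor1")
s2 = pd.Series(b[1], index=b[0], name="sensor2")

# Outer join on the index; timestamps a sensor did not sample become NaN.
df = pd.concat([s1, s2], axis=1).sort_index()

# Replace the gaps with 0 (skip this step to keep NaN instead).
filled = df.fillna(0)

# Back to the row-oriented layout used in the question.
merged = np.vstack([filled.index.to_numpy(), filled.to_numpy().T])
print(merged)
```

Further sensors can be added as extra Series in the list passed to pd.concat.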
