37

The functionality I am looking for looks something like this:

data = np.array([[1, 2, 3, 4],
                 [2, 3, 1],
                 [5, 5, 5, 5],
                 [1, 1]])

result = fix(data)
print(result)

[[ 1.  2.  3.  4.]
 [ 2.  3.  1.  0.]
 [ 5.  5.  5.  5.]
 [ 1.  1.  0.  0.]]

The arrays I'm working with are very large, so I would appreciate the most efficient solution.

Edit: The data is read in from disk as a Python list of lists.

8
  • Simply add the data type to the array function call, np.array(..., dtype=np.float64), or use loadtxt, savetxt from numpy. Commented Aug 16, 2015 at 17:33
  • @zeroth I have tried that and got ValueError: setting an array element with a sequence. Could you explain more? Commented Aug 16, 2015 at 17:36
  • Is it likely to be a sparse matrix with most entries as zero? Can it fit in memory as a dense matrix? Commented Aug 16, 2015 at 17:54
  • @musically_ut No, it isn't sparse. Often there are only 1-3 elements missing at the ends. Commented Aug 16, 2015 at 18:10
  • This is relevant: stackoverflow.com/questions/27890052/… Commented Aug 16, 2015 at 19:05

5 Answers

28

This could be one approach -

def numpy_fillna(data):
    # Get lengths of each row of data
    lens = np.array([len(i) for i in data])

    # Mask of valid places in each row
    mask = np.arange(lens.max()) < lens[:,None]

    # Setup output array and put elements from data into masked positions
    out = np.zeros(mask.shape, dtype=data.dtype)
    out[mask] = np.concatenate(data)
    return out

Sample input, output -

In [222]: # Input object dtype array (ragged rows need an explicit dtype=object)
     ...: data = np.array([[1, 2, 3, 4],
     ...:                  [2, 3, 1],
     ...:                  [5, 5, 5, 5, 8, 9, 5],
     ...:                  [1, 1]], dtype=object)

In [223]: numpy_fillna(data)
Out[223]: 
array([[1, 2, 3, 4, 0, 0, 0],
       [2, 3, 1, 0, 0, 0, 0],
       [5, 5, 5, 5, 8, 9, 5],
       [1, 1, 0, 0, 0, 0, 0]], dtype=object)
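
Since the question states that the data arrives as a plain Python list of lists, the same masking idea can be sketched for that input directly. This is only an illustration: the name pad_ragged and its dtype parameter are hypothetical, and it assumes at least one row and that every row is a flat sequence of numbers.

```python
import numpy as np

def pad_ragged(rows, dtype=float):
    # Hypothetical helper: same masking idea, applied to a list of lists.
    lens = np.array([len(r) for r in rows])
    # mask[i, j] is True where row i actually has a j-th element
    mask = np.arange(lens.max()) < lens[:, None]
    out = np.zeros(mask.shape, dtype=dtype)
    # Boolean indexing fills the True positions in row-major order,
    # which matches the order of the concatenated rows.
    out[mask] = np.concatenate([np.asarray(r, dtype=dtype) for r in rows])
    return out

result = pad_ragged([[1, 2, 3, 4], [2, 3, 1], [5, 5, 5, 5], [1, 1]])
print(result)
```

Passing a numeric dtype up front avoids the object-dtype result shown above, which is usually what you want for large arrays.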

3 Comments

The accepted answer is almost correct. I assume it was an oversight, but the line mask = np.arange(lens.size) < lens[:,None] should actually be mask = np.arange(max(lens)) < lens[:,None]. The accepted answer happens to work for the tested input because lens.size == max(lens). If they differ, it no longer works.
I think lens.size should be lens.max(). In your answer these are equal, which makes a square matrix. But try with a ragged row longer than the number of rows and you will get an error.
that mask is brilliant
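
The difference between lens.size and lens.max() that these comments discuss can be checked on the answer's own sample rows. The sketch below only builds the two masks and compares how many slots each one offers; the variable names are mine.

```python
import numpy as np

rows = [[1, 2, 3, 4], [2, 3, 1], [5, 5, 5, 5, 8, 9, 5], [1, 1]]
lens = np.array([len(r) for r in rows])

# Width taken from the number of rows
mask_by_size = np.arange(lens.size) < lens[:, None]
# Width taken from the longest row
mask_by_max = np.arange(lens.max()) < lens[:, None]

print(mask_by_size.shape, int(mask_by_size.sum()))
print(mask_by_max.shape, int(mask_by_max.sum()))
print(int(lens.sum()))  # number of values that must be placed
```

Only the lens.max() mask has as many True slots as there are values, so assigning the concatenated rows through the lens.size mask would fail whenever a row is longer than the number of rows.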
15

You could use pandas instead of numpy:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[1, 2, 3, 4],
   ...:                    [2, 3, 1],
   ...:                    [5, 5, 5, 5],
   ...:                    [1, 1]], dtype=float)


In [3]: df.fillna(0.0).values
Out[3]: 
array([[ 1.,  2.,  3.,  4.],
       [ 2.,  3.,  1.,  0.],
       [ 5.,  5.,  5.,  5.],
       [ 1.,  1.,  0.,  0.]])

1 Comment

Doesn't seem to work for deeper nesting levels, though :(
11

Use np.pad() on each row:

In [62]: arr
Out[62]: 
[array([0]),
 array([83, 74]),
 array([87, 61, 23]),
 array([71,  3, 81, 77]),
 array([20, 44, 20, 53, 60]),
 array([54, 36, 74, 35, 49, 54]),
 array([11, 36,  0, 98, 29, 87, 21]),
 array([ 1, 22, 62, 51, 45, 40, 36, 86]),
 array([ 7, 22, 83, 58, 43, 59, 45, 81, 92]),
 array([68, 78, 70, 67, 77, 64, 58, 88, 13, 56])]

In [63]: max_len = np.max([len(a) for a in arr])

In [64]: np.asarray([np.pad(a, (0, max_len - len(a)), 'constant', constant_values=0) for a in arr])
Out[64]: 
array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [83, 74,  0,  0,  0,  0,  0,  0,  0,  0],
       [87, 61, 23,  0,  0,  0,  0,  0,  0,  0],
       [71,  3, 81, 77,  0,  0,  0,  0,  0,  0],
       [20, 44, 20, 53, 60,  0,  0,  0,  0,  0],
       [54, 36, 74, 35, 49, 54,  0,  0,  0,  0],
       [11, 36,  0, 98, 29, 87, 21,  0,  0,  0],
       [ 1, 22, 62, 51, 45, 40, 36, 86,  0,  0],
       [ 7, 22, 83, 58, 43, 59, 45, 81, 92,  0],
       [68, 78, 70, 67, 77, 64, 58, 88, 13, 56]])

4

It would be nicer to do this in a vectorized way, but I'm still a noob, so this is all I could think of for now!

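import numpy as np

# Ragged rows need dtype=object on current NumPy
a = np.array([[1, 2, 3, 4],
              [2, 3, 1],
              [5, 5, 5, 5, 5],
              [1, 1]], dtype=object)

# Plain Python loop; numba's nopython mode does not support object-dtype arrays
def f(a):
    # Length of the longest row
    l = len(max(a, key=len))
    # One output row of length l per input row
    a0 = np.empty(a.shape + (l,))
    for n, i in enumerate(a.flat):
        # Zero-pad each row on the right up to length l
        a0[n] = np.pad(i, (0, l - len(i)), mode='constant')
    return a0

print(f(a))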

0
import numpy as np

data = np.array([[1, 2, 3, 4],
                 [2, 3, 1],
                 [5, 5, 5, 5],
                 [1, 1]], dtype=object)  # ragged rows need dtype=object
max_len = max(len(row) for row in data)
np.array([np.pad(row,
                 (0, max_len - len(row)),
                 'constant',
                 constant_values=0) for row in data])

The length of each row is computed and the maximum of these lengths is stored in max_len. Each row is then padded with zeros on the right until it reaches that maximum length.
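
The steps described here can also be wrapped in a small reusable function that takes a list of lists. This is a sketch under assumptions: pad_rows and its fill parameter are hypothetical names, and the rows are assumed to be flat, non-empty in number, and numeric.

```python
import numpy as np

def pad_rows(rows, fill=0):
    # Hypothetical helper: right-pad every row to the longest row's length.
    max_len = max(len(r) for r in rows)
    return np.array([
        np.pad(np.asarray(r), (0, max_len - len(r)),
               'constant', constant_values=fill)
        for r in rows
    ])

padded = pad_rows([[1, 2, 3, 4], [2, 3, 1], [5, 5, 5, 5], [1, 1]])
print(padded)
```

Because np.pad is called once per row, this stays a Python-level loop, so for very large inputs it may be slower than a single masked assignment.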
