
I have 10,000 rows in my CSV file. I want to remove the empty brackets [] inside cells and drop the rows that contain only empty brackets [[]], as depicted in the following picture:

enter image description here

For instance, the first cell in the first column:

[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]

needs to be transformed into:

[['1', 2364, 2382, 1552, 1585],['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]

and a row with only empty brackets:

[[]]    [[]]

needs to be removed from the file. As a result we get:

enter image description here

I tried:

df1 = df.Column_1.str.strip([]).str.split(',', expand=True)

My data are strings:

print(type(df.loc[0,'Column_1']))
<class 'str'>

print(type(df.loc[0,'Column_2']))
<class 'str'>

EDIT1 After executing the following code:

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])

df1 = df1[df1.applymap(len).ne(0).all(axis=1)]

df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)

df1 = df1.dropna()

it solves the problem. However, I have an issue with the comma ',' when it appears as a character rather than as a delimiter in the resulting line. I want to create a new CSV file with the following columns:

columns =['char', 'left', 'right', 'top', 'down']

which corresponds for instance to:

'1' 2364 2382 1552 1585

to get a CSV file as follows:

   char  left  top  right  bottom
0   'm'    38  104   2456    2492
1   'i'    40  102   2442     222
2   '.'   203  213    191     198
3   '3'   235  262    131    3333
4   'A'   275  347    147     239
5   'M'   363  465    145    3334
6   'A'    73   91    373     394
7   'D'    93  112    373      39
8   'D'   454  473    663     685
9   'O'   474  495    664      33
10  'A'   108  129    727     751
11  'V'   129  150    727     444

so the whole code to get this is:

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])

df1 = df1[df1.applymap(len).ne(0).all(axis=1)]

df1 = df.replace(['\[\],','\[\[\]\]', ''],['','', np.nan], regex=True)

df1 = df1.dropna()

cols = ['char','left','right','top','bottom']

df1 = df.positionlrtb.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1.char = df1.char.replace(['\[','\]'], ['',''], regex=True)
df1['left']=df1['left'].replace(['\[','\]'], ['',''], regex=True)
df1['top']=df1['top'].replace(['\[','\]'], ['',''], regex=True)
df1['right']=df1['right'].replace(['\[','\]'], ['',''], regex=True)
df1['bottom']=df1['bottom'].replace(['\[','\]'], ['',''], regex=True)

However, after doing that the ',' character is missing from my file, which shifts the values in the new CSV file. Instead of getting:

',' 1491    1494    172 181 

I get no comma ',' and the values are shifted, as the following two lines show:

 '    '     1491    1494    172
181  'r'    1508    1517    159

it should be:

',' 1491 1494 172 181
'r' 1508 1517 159 ... and so on

EDIT2

I'm trying to add two more columns, called line_number and all_chars_in_same_row:

  1. line_number corresponds to the line from which, for example,

    'm' 38 104 2456 2492

is extracted, let's say line 2.

  2. all_chars_in_same_row corresponds to all (space-separated) characters that are in the same row. For instance, given

    character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]

I get '1' '8' '4' '1' '7' and so on.

More formally, all_chars_in_same_row lists all the characters of the row identified by line_number:

    char  left  top  right  bottom  line_number  all_chars_in_same_row
0   'm'    38  104   2456    2492   from line 2  'm' '2' '5' 'g'
1   'i'    40  102   2442     222   from line 4
2   '.'   203  213    191     198   from line 6
3   '3'   235  262    131    3333  
4   'A'   275  347    147     239
5   'M'   363  465    145    3334
6   'A'    73   91    373     394
7   'D'    93  112    373      39
8   'D'   454  473    663     685
9   'O'   474  495    664      33
10  'A'   108  129    727     751
11  'V'   129  150    727     444

The code related to that is:

    import pandas as pd

    df_data=pd.read_csv('see2.csv', header=None, usecols=[1], names=['character_position'])
    df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
    
    x=len(df_data.columns) #get total number of columns 
    #get all characters from every 5th column, concatenate and create new column in df_data
    df_data[x] = df_data[df_data.columns[::5]].apply(lambda x: ','.join(x.dropna()), axis=1)
    # get index of each row. This is the line number for your record
    df_data[x+1]=df_data.index.get_level_values(0) 
     # now set line number and character columns as Index of data frame
    df_data.set_index([x+1,x],inplace=True,drop=True)
    
    df_data.columns = [df_data.columns % 5, df_data.columns // 5]
    
    df_data = df_data.stack()
    df_data['FromLine'] = df_data.index.get_level_values(0) #assign line number to a column
    df_data['all_chars_in_same_row'] = df_data.index.get_level_values(1) #assign character values to a column
    cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
    df_data.columns=cols
    df_data.reset_index(inplace=True) #remove mutiindexing
    print(df_data[cols])

and the output:

         char  left   top right bottom  from line all_chars_in_same_row
    0     '.'   203   213   191    198          0  ['.', '3', 'C']
    1     '3'  1758  1775   370    391          0  ['.', '3', 'C']
    2     'C'   296   305  1492   1516          0  ['.', '3', 'C']
    3     'A'   275   347   147    239          1  ['A', 'M', 'D']
    4     'M'  2166  2184   370    391          1  ['A', 'M', 'D']
    5     'D'   339   362  1815   1840          1  ['A', 'M', 'D']
    6     'A'    73    91   373    394          2  ['A', 'D', 'A']
    7     'D'  1395  1415   427    454          2  ['A', 'D', 'A']
    8     'A'  1440  1455  2047   2073          2  ['A', 'D', 'A']
    9     'D'   454   473   663    685          3  ['D', 'O', '0']
    10    'O'  1533  1545   487    541          3  ['D', 'O', '0']
    11    '0'   339   360  2137   2163          3  ['D', 'O', '0']
    12    'A'   108   129   727    751          4  ['A', 'V', 'I']
    13    'V'  1659  1677   490    514          4  ['A', 'V', 'I']
    14    'I'   339   360  1860   1885          4  ['A', 'V', 'I']
    15    'N'    34    51   949    970          5  ['N', '/', '2']
    16    '/'  1890  1904   486    505          5  ['N', '/', '2']
    17    '2'  1266  1283  1951   1977          5  ['N', '/', '2']
    18    'S'  1368  1401    43     85          6  ['S', 'A', '8']
    19    'A'  1344  1361   583    607          6  ['S', 'A', '8']
    20    '8'  2207  2217  1492   1515          6  ['S', 'A', '8']
    21    'S'  1437  1457   112    138          7  ['S', 'o', 'O']
    22    'o'  1548  1580   979   1015          7  ['S', 'o', 'O']
    23    'O'  1331  1349   370    391          7  ['S', 'o', 'O']
    24    'h'  1686  1703   315    339          8  ['h', 't', 't']
    25    't'   169   190  1291   1312          8  ['h', 't', 't']
    26    't'   169   190  1291   1312          8  ['h', 't', 't']
    27    'N'  1331  1349   370    391          9  ['N', 'C', 'C']
    28    'C'   296   305  1492   1516          9  ['N', 'C', 'C']
    29    'C'   296   305  1492   1516          9  ['N', 'C', 'C']

However, I get strange results (the order of letters, numbers, columns, headers...). I can't share them because the file is too long: I tried, but it exceeds the maximum number of characters.

In particular, this line of code

df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)

returns None values:

  0      1      2      3      4     5      6      7      8      9     ...   \
0  'm'     38    104   2456   2492   'i'     40    102   2442   2448  ...    
1  '.'    203    213    191    198   '3'    235    262    131    198  ...    
2  'A'    275    347    147    239   'M'    363    465    145    239  ...    
3  'A'     73     91    373    394   'D'     93    112    373    396  ...    
4  'D'    454    473    663    685   'O'    474    495    664    687  ...    
5  'A'    108    129    727    751   'V'    129    150    727    753  ...    
6  'N'     34     51    949    970   '/'     52     61    948    970  ...    
7  'S'   1368   1401     43     85   'A'   1406   1446     43     85  ...    
8  'S'   1437   1457    112    138   'o'   1458   1476    118    138  ...    
9  'h'   1686   1703    315    339   't'   1706   1715    316    339  ...    
   1821  1822  1823  1824  1825  1826  1827  1828  1829  1830  
0  None  None  None  None  None  None  None  None  None  None  
1  None  None  None  None  None  None  None  None  None  None  
2  None  None  None  None  None  None  None  None  None  None  
3  None  None  None  None  None  None  None  None  None  None  
4  None  None  None  None  None  None  None  None  None  None  
5  None  None  None  None  None  None  None  None  None  None  
6  None  None  None  None  None  None  None  None  None  None  

EDIT3 However, when I add page_number along with character_position:

df1 = pd.DataFrame({
        "from_line": np.repeat(df.index.values, df.character_position.str.len()),
        "b": list(chain.from_iterable(df.character_position)),
        "page_number" : np.repeat(df.index.values,df['page_number'])
})

I got the following error:

 File "/usr/local/lib/python3.5/dist-packages/numpy/core/fromnumeric.py", line 47, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'
  • Are these row values lists of lists or are they strings? Commented Apr 10, 2017 at 13:03
  • It's a dataframe of object Commented Apr 10, 2017 at 13:20
  • I understand that you have a dataframe but what are the data types of the columns? Run df.dtypes and edit your question with the output. Commented Apr 10, 2017 at 13:22
  • It's a string: <class 'str'> Commented Apr 10, 2017 at 13:28

4 Answers


For lists, you can first use applymap with a list comprehension to remove the [] entries, and then remove rows with boolean indexing, where the mask checks that every value in the row has a non-zero length (that is, is not an empty list).

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])

df1 = df1[df1.applymap(len).ne(0).all(axis=1)]

If you need to remove a row when any value is [[]]:

df1 = df1[~(df1.applymap(len).eq(0)).any(axis=1)]

If values are strings:

df1 = df.replace([r'\[\],', r'\[\[\]\]', ''], ['', '', np.nan], regex=True)

and then dropna:

df1 = df1.dropna(how='all')

Or:

df1 = df1.dropna()
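
For string cells, another option is to parse each cell first and then filter. What follows is only a minimal sketch, not part of the original answer: the two-row frame and the helper name parse_and_clean are made up for illustration, and it keeps a row as long as at least one cell still has content (the same idea as dropna(how='all')).

```python
import ast

import pandas as pd

# Hypothetical two-column frame whose cells are strings, as in the question.
df = pd.DataFrame({
    "Column_1": [
        "[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640]]",
        "[[]]",
    ],
    "Column_2": ["[['8', 2369, 2382, 1644, 1668]]", "[[]]"],
})


def parse_and_clean(cell):
    """Parse a cell string into a list of lists and drop the empty inner lists."""
    return [inner for inner in ast.literal_eval(cell) if inner]


# Apply the cleaner to every cell, column by column.
cleaned = df.apply(lambda col: col.map(parse_and_clean))

# Keep only the rows where at least one cell still has content.
has_content = cleaned.apply(lambda col: col.map(len)).gt(0).any(axis=1)
cleaned = cleaned[has_content]

print(cleaned)
```

Switching .any(axis=1) to .all(axis=1) would instead drop a row as soon as one of its cells is empty.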

EDIT1:

import ast
from itertools import chain

import numpy as np
import pandas as pd

df = pd.read_csv('see2.csv', index_col=0)

df.positionlrtb = df.positionlrtb.apply(ast.literal_eval)

df.positionlrtb = df.positionlrtb.apply(lambda x: [y for y in x if len(y) > 0])
print (df.head())
      page_number                                       positionlrtb  \
0  1841729699_001  [[m, 38, 104, 2456, 2492, i, 40, 102, 2442, 24...   
1  1841729699_001   [[., 203, 213, 191, 198, 3, 235, 262, 131, 198]]   
2  1841729699_001  [[A, 275, 347, 147, 239, M, 363, 465, 145, 239...   
3  1841729699_001  [[A, 73, 91, 373, 394, D, 93, 112, 373, 396, R...   
4  1841729699_001  [[D, 454, 473, 663, 685, O, 474, 495, 664, 687...   

                    LineIndex  
0      [[mi, il, mu, il, il]]  
1                      [[.3]]  
2                   [[amsun]]  
3  [[adresse, de, livraison]]  
4                [[document]]

cols = ['char','left','top','right','bottom']

df1 = pd.DataFrame({
        "a": np.repeat(df.page_number.values, df.positionlrtb.str.len()),
        "b": list(chain.from_iterable(df.positionlrtb))})

df1 = pd.DataFrame(df1.b.values.tolist())    
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)  
cols = ['char','left','top','right','bottom']
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)   

print (df1)
     char  left   top  right  bottom
0       m    38   104   2456    2492
1       i    40   102   2442    2448
2       i    40   100   2402    2410
3       l    40   102   2372    2382
4       m    40   102   2312    2358
5       u    40   102   2292    2310
6       i    40   104   2210    2260
7       l    40   104   2180    2208
8       i    40   104   2140    2166

EDIT2:

#skip first row
df = pd.read_csv('see2.csv', usecols=[2], names=['character_position'], skiprows=1)
print (df.head())
                                  character_position
0  [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1  [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2  [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3  [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4  [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...
#convert to list, remove empty lists
df.character_position = df.character_position.apply(ast.literal_eval)
df.character_position = df.character_position.apply(lambda x: [y for y in x if len(y) > 0])

#new df - http://stackoverflow.com/a/42788093/2901002
df1 = pd.DataFrame({
        "from line": np.repeat(df.index.values, df.character_position.str.len()),
        "b": list(chain.from_iterable(df.character_position))})

#keep only the strings via a list comprehension; convert to tuple so the values can be used in an index
df1['all_chars_in_same_row'] = df1['b'].apply(lambda x: tuple([y for y in x if isinstance(y, str)]))
df1 = df1.set_index(['from line','all_chars_in_same_row'])
#new df from column b
df1 = pd.DataFrame(df1.b.values.tolist(), index=df1.index)   
#Multiindex in columns
df1.columns = [df1.columns % 5, df1.columns // 5]
#reshape
df1 = df1.stack().reset_index(level=2, drop=True)  
cols = ['char','left','top','right','bottom']
df1.columns = cols
#convert last columns to int
df1[cols[1:]] = df1[cols[1:]].astype(int)
df1 = df1.reset_index()
#convert tuples to list
df1['all_chars_in_same_row'] = df1['all_chars_in_same_row'].apply(list)
print (df1.head(15))
    from line           all_chars_in_same_row char  left  top  right  bottom
0           0  [m, i, i, l, m, u, i, l, i, l]    m    38  104   2456    2492
1           0  [m, i, i, l, m, u, i, l, i, l]    i    40  102   2442    2448
2           0  [m, i, i, l, m, u, i, l, i, l]    i    40  100   2402    2410
3           0  [m, i, i, l, m, u, i, l, i, l]    l    40  102   2372    2382
4           0  [m, i, i, l, m, u, i, l, i, l]    m    40  102   2312    2358
5           0  [m, i, i, l, m, u, i, l, i, l]    u    40  102   2292    2310
6           0  [m, i, i, l, m, u, i, l, i, l]    i    40  104   2210    2260
7           0  [m, i, i, l, m, u, i, l, i, l]    l    40  104   2180    2208
8           0  [m, i, i, l, m, u, i, l, i, l]    i    40  104   2140    2166
9           0  [m, i, i, l, m, u, i, l, i, l]    l    40  104   2124    2134
10          1                          [., 3]    .   203  213    191     198
11          1                          [., 3]    3   235  262    131     198
12          2                 [A, M, S, U, N]    A   275  347    147     239
13          2                 [A, M, S, U, N]    M   363  465    145     239
14          2                 [A, M, S, U, N]    S   485  549    145     243

20 Comments

What does print (type(df.loc[0,'Column_1'])) return?
Btw, to convert str to lists you can use import ast and then df = df.applymap(ast.literal_eval).
I am going home now; please check my solution, I hope it works well. If there is a problem, I will try to help you tomorrow.
I have an idea. Can you create a small data sample with 3 rows, with only 5 or 10 values in each list, together with the desired output? Your original data are large, so they are really hard to process, and it is also very hard to verify whether a solution works. You can share it again and I hope to give you a solution tomorrow. Thanks.
Because str.len() gets the length of each list in a row and np.repeat duplicates values accordingly: if the length is 3, then page_number '5' is repeated 3 times ('5', '5', '5').
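
The last comment can be illustrated with a small sketch. The frame is hypothetical; the point is that the second argument of np.repeat is an integer count per row (here taken from str.len()), while the values being repeated are the page labels themselves. Passing the labels as counts is likely what triggers the casting error shown in EDIT3.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one page label per row and a list of records per row.
df = pd.DataFrame({
    "page_number": ["5", "6"],
    "character_position": [[["a", 1], ["b", 2], ["c", 3]], [["d", 4]]],
})

# Integer number of inner lists in each row.
counts = df["character_position"].str.len()

# Repeat each page label once per inner list so that it lines up
# with the flattened records.
pages = np.repeat(df["page_number"].values, counts)

print(list(pages))
```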

You could use a list comprehension for this:

arr = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]

new_arr = [x for x in arr if x]

Or perhaps you prefer list + filter:

new_arr = list(filter(lambda x: x, arr))

lambda x: x works in this case because that particular lambda tests whether a given x in arr is "truthy." More specifically, it filters out elements in arr that are "falsy," like an empty list, []. It's almost like saying, "Keep everything in arr that 'exists'," so to speak.
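
As a quick illustration of that truthiness rule (a sketch using a shortened version of the list above): bool() shows which elements count as truthy, and passing None to filter is a built-in shorthand for the identity lambda.

```python
arr = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640]]

# An empty list is falsy; a non-empty list is truthy.
flags = [bool(x) for x in arr]

# With None as the function, filter keeps only the truthy items.
kept = list(filter(None, arr))

print(flags)
print(kept)
```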

Comments

new_list = []
for x in old_list:
    if len(x) > 0:
        new_list.append(x)

1 Comment

This is a simple way to take the empty lists out of a list of lists. I just saw your call to df; if you are using pandas, put this in a function and look at applymap.

You could do this:

lst = [['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]]
new_lst = [i for i in lst if len(i) > 0]
