2

I'm using Anaconda shell for data analitycs Uploding to pandas bunch of excel files (25 files) On This files https://www.dropbox.com/s/16ea1cw6k63i16p/Newdata.zip?dl=0 I get error. Cant find the reason why and how to fix it.

import pandas as pd
import numpy as np
import os

os.chdir(r"C:\Users\Twentyouts\Desktop\Newdata" )
path = os.getcwd()

files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']

for f in files_xlsx:
    print(f)
    loading = pd.read_excel(f, heading = 0)
    df = df.append(loading)
2016-06-20–2016-06-26.xlsx
2016-06-27–2016-07-03.xlsx
2016-07-04–2016-07-10.xlsx
2016-07-11–2016-07-17.xlsx
2016-08-01–2016-08-07.xlsx
2016-08-15–2016-08-21.xlsx

KeyError                                  Traceback (most recent call last)
<ipython-input-23-5737d4d13b9f> in <module>()
      1 df = pd.DataFrame()
----> 2 pd.read_excel('2016-08-15–2016-08-21.xlsx')

C:\Users\Twentyouts\Anaconda3\lib\site-packages\pandas\io\excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, true_values, false_values, engine, squeeze, **kwds)
    189 
    190     if not isinstance(io, ExcelFile):
--> 191         io = ExcelFile(io, engine=engine)
    192 
    193     return io._parse_excel(

C:\Users\Twentyouts\Anaconda3\lib\site-packages\pandas\io\excel.py in __init__(self, io, **kwds)
    247             self.book = xlrd.open_workbook(file_contents=data)
    248         elif isinstance(io, compat.string_types):
--> 249             self.book = xlrd.open_workbook(io)
    250         else:
    251             raise ValueError('Must explicitly set engine if not passing in'

C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    420                 formatting_info=formatting_info,
    421                 on_demand=on_demand,
--> 422                 ragged_rows=ragged_rows,
    423                 )
    424             return bk

C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    831         x12sheet = X12Sheet(sheet, logfile, verbosity)
    832         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 833         x12sheet.process_stream(zflo, heading)
    834         del zflo
    835 

C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in own_process_stream(self, stream, heading)
    546         for event, elem in ET.iterparse(stream):
    547             if elem.tag == row_tag:
--> 548                 self_do_row(elem)
    549                 elem.clear() # destroy all child elements (cells)
    550             elif elem.tag == U_SSML12 + "dimension":

C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in do_row(self, row_elem)
    743                     else:
    744                         bad_child_tag(child_tag)
--> 745                 value = error_code_from_text[tvalue]
    746                 self.sheet.put_cell(rowx, colx, XL_CELL_ERROR, value, xf_index)
    747             elif cell_type == "inlineStr":

KeyError: None
9

3 Answers 3

2

Indeed as @MaxU points out, Excel files are malformed but interestingly does resolve when properly saved as an .xlsx file. Possibly, the invalid files were attempted to be upgraded from previous .xls version by simply changing the extension to .xlsx. These two file formats are not simple text files that can change extension without hazard but very different binary formats.

Consider running a COM interface using wn32com module to properly save the malformed files to actual OpenXML workbooks using Excel's Workbook.SaveAs method. Note: this solution is only compliant in Python for Windows with installed MS Excel.

import pandas as pd
import glob
import win32com.client as win32

xlsxfiles = glob.glob("C:\\Path\\To\\Workbooks\\*.xlsx")

def save_xlsx(srcfile):
    try:
        newfile = srcfile.replace('.xlsx', '_new.xlsx')
        print('Malformed file saved as {}'.format(newfile))
        xlApp = win32.gencache.EnsureDispatch('Excel.Application')
        wb = xlApp.Workbooks.Open(srcfile)
        wb.SaveAs(newfile, 51)                 

    except Exception as e:
        print(e)            
    finally:
        wb.Close(True); wb = None
        xlApp.Quit; xlApp = None    
    return newfile

def xl_read():    
    dfs = []
    for f in xlsxfiles:        
        try:
            df = pd.read_excel(f)
        except Exception as e:            
            df = pd.read_excel(save_xlsx(f))

        print('File: {}, Shape: {}'.format(f, df.shape))
        dfs.append(df)            
    return pd.concat(dfs)

print('Final dataframe shape: {}'.format(xl_read().shape))  

Output (final dataframe of 330,257 rows and 30 columns)

File: C:\Path\To\Workbooks\2016-06-20–2016-06-26.xlsx, Shape: (5912, 27)
File: C:\Path\To\Workbooks\2016-06-27–2016-07-03.xlsx, Shape: (5362, 27)
File: C:\Path\To\Workbooks\2016-07-04–2016-07-10.xlsx, Shape: (5387, 27)
File: C:\Path\To\Workbooks\2016-07-11–2016-07-17.xlsx, Shape: (5331, 28)
File: C:\Path\To\Workbooks\2016-08-01–2016-08-07.xlsx, Shape: (4965, 28)
Malformed file saved as C:\Path\To\Workbooks\2016-08-15–2016-08-21_new.xlsx
File: C:\Path\To\Workbooks\2016-08-15–2016-08-21.xlsx, Shape: (5315, 27)
File: C:\Path\To\Workbooks\2016-08-22–2016-08-28.xlsx, Shape: (5179, 27)
File: C:\Path\To\Workbooks\2016-08-29–2016-09-04.xlsx, Shape: (5855, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-05–2016-09-11_new.xlsx
File: C:\Path\To\Workbooks\2016-09-05–2016-09-11.xlsx, Shape: (5838, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-12–2016-09-18_new.xlsx
File: C:\Path\To\Workbooks\2016-09-12–2016-09-18.xlsx, Shape: (5729, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-19–2016-09-25_new.xlsx
File: C:\Path\To\Workbooks\2016-09-19–2016-09-25.xlsx, Shape: (6401, 27)
File: C:\Path\To\Workbooks\2016-09-26–2016-10-02.xlsx, Shape: (7018, 27)
File: C:\Path\To\Workbooks\2016-09.xlsx, Shape: (23874, 27)
File: C:\Path\To\Workbooks\2016-10-03–2016-10-09.xlsx, Shape: (6587, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-12_new.xlsx
File: C:\Path\To\Workbooks\2016-10-10–2016-10-12.xlsx, Shape: (2883, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-13_new.xlsx
File: C:\Path\To\Workbooks\2016-10-10–2016-10-13.xlsx, Shape: (4174, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-20_new.xlsx
File: C:\Path\To\Workbooks\2016-10-17–2016-10-20.xlsx, Shape: (4560, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-23_new.xlsx
File: C:\Path\To\Workbooks\2016-10-17–2016-10-23.xlsx, Shape: (7111, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-27_new.xlsx
File: C:\Path\To\Workbooks\2016-10-24–2016-10-27.xlsx, Shape: (4921, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-30_new.xlsx
File: C:\Path\To\Workbooks\2016-10-24–2016-10-30.xlsx, Shape: (8005, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-31–2016-11-06_new.xlsx
File: C:\Path\To\Workbooks\2016-10-31–2016-11-06.xlsx, Shape: (7029, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10_new.xlsx
File: C:\Path\To\Workbooks\2016-10.xlsx, Shape: (28098, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-07–2016-11-13_new.xlsx
File: C:\Path\To\Workbooks\2016-11-07–2016-11-13.xlsx, Shape: (7076, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-14–2016-11-20_new.xlsx
File: C:\Path\To\Workbooks\2016-11-14–2016-11-20.xlsx, Shape: (7758, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-21_new.xlsx
File: C:\Path\To\Workbooks\2016-11-21.xlsx, Shape: (1689, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-21–2016-11-23_new.xlsx
File: C:\Path\To\Workbooks\2016-11-21–2016-11-23.xlsx, Shape: (4711, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-28–2016-12-04_new.xlsx
File: C:\Path\To\Workbooks\2016-11-28–2016-12-04.xlsx, Shape: (9286, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11_new.xlsx
File: C:\Path\To\Workbooks\2016-11.xlsx, Shape: (30505, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-05–2016-12-11_new.xlsx
File: C:\Path\To\Workbooks\2016-12-05–2016-12-11.xlsx, Shape: (8802, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-12–2016-12-18_new.xlsx
File: C:\Path\To\Workbooks\2016-12-12–2016-12-18.xlsx, Shape: (8333, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-16–2016-12-22_new.xlsx
File: C:\Path\To\Workbooks\2016-12-16–2016-12-22.xlsx, Shape: (8592, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-26–2016-12-31_new.xlsx
File: C:\Path\To\Workbooks\2016-12-26–2016-12-31.xlsx, Shape: (5362, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-01–2017-01-08_new.xlsx
File: C:\Path\To\Workbooks\2017-01-01–2017-01-08.xlsx, Shape: (4322, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-09–2017-01-15_new.xlsx
File: C:\Path\To\Workbooks\2017-01-09–2017-01-15.xlsx, Shape: (7608, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-23–2017-01-29_new.xlsx
File: C:\Path\To\Workbooks\2017-01-23–2017-01-29.xlsx, Shape: (8903, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-30–2017-02-05_new.xlsx
File: C:\Path\To\Workbooks\2017-01-30–2017-02-05.xlsx, Shape: (9173, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-12_new.xlsx
File: C:\Path\To\Workbooks\2017-02-13–2017-02-12.xlsx, Shape: (9144, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-19_new.xlsx
File: C:\Path\To\Workbooks\2017-02-13–2017-02-19.xlsx, Shape: (9911, 27)
File: C:\Path\To\Workbooks\test.xlsx, Shape: (5315, 27)
Malformed file saved as C:\Path\To\Workbooks\Выгрузка 12-15.12_new.xlsx
File: C:\Path\To\Workbooks\Выгрузка 12-15.12.xlsx, Shape: (4818, 27)
Malformed file saved as C:\Path\To\Workbooks\Выгрузка 21-27_new.xlsx
File: C:\Path\To\Workbooks\Выгрузка 21-27.xlsx, Shape: (8876, 27)
File: C:\Path\To\Workbooks\Выгрузка 26-29.12.xlsx, Shape: (4539, 27)
Final dataframe shape: (330257, 30)

Consider even a database engine approach using Windows' ACE Engine via pyodbc to query corresponding workbooks with pandas read_sql since each share same sheet name, TDSheet.

#...same as above
import pyodbc

def sql_read():    
    dfs = [] 
    for f in xlsxfiles:                
        try:
            conn = pyodbc.connect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'+\
                      'DBQ=C:\\Path\\To\\Workbooks\\{};'.format(f), autocommit=True)
            df =  pd.read_sql('SELECT * FROM [TDSheet$];', conn)

        except Exception as e:
            conn.close()
            conn = pyodbc.connect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'+\
                      'DBQ=C:\\Path\\To\\Workbooks\\{};'.format(save_xlsx(f)), autocommit=True)
            df = pd.read_excel('SELECT * FROM [TDSheet$];', conn)
            conn.close()

        print('File: {}, Shape: {}'.format(f, df.shape))
        dfs.append(df)
Sign up to request clarification or add additional context in comments.

2 Comments

Worked out!! Thank you.
I tried @Parfait code and it returns some error if you're not running it correctly. Must run it on empty notebook and don't change any except the directory
1

It looks like some of your Excel files are malformed:

import os
import glob
import pandas as pd

excel_files_mask = r'D:\temp\.data\42468475\*.xlsx'

files = glob.glob(excel_files_mask)

def merge_excel_files(files, **kwargs):
    #return pd.concat([pd.read_excel(f, **kwargs) for f in files],
    #                 ignore_index=True)
    dfs = []
    for f in files:
        #print('processing: [{}]'.format(f))
        try:
            df = pd.read_excel(f, **kwargs)
            dfs.append(df)
            print('parsed: [{}], shape: {}'.format(f, df.shape))
        except KeyError:
            print("ERROR: file [{}] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...".format(f))
    return pd.concat(dfs, ignore_index=True)

df = merge_excel_files(files, header=None, skiprows=1)
print(df.shape)

Yields:

parsed: [D:\temp\.data\42468475\2016-06-20–2016-06-26.xlsx], shape: (5912, 27)
parsed: [D:\temp\.data\42468475\2016-06-27–2016-07-03.xlsx], shape: (5362, 27)
parsed: [D:\temp\.data\42468475\2016-07-04–2016-07-10.xlsx], shape: (5387, 27)
parsed: [D:\temp\.data\42468475\2016-07-11–2016-07-17.xlsx], shape: (5331, 28)
parsed: [D:\temp\.data\42468475\2016-08-01–2016-08-07.xlsx], shape: (4965, 28)
ERROR: file [D:\temp\.data\42468475\2016-08-15–2016-08-21.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
parsed: [D:\temp\.data\42468475\2016-08-22–2016-08-28.xlsx], shape: (5179, 27)
parsed: [D:\temp\.data\42468475\2016-08-29–2016-09-04.xlsx], shape: (5855, 27)
ERROR: file [D:\temp\.data\42468475\2016-09-05–2016-09-11.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-09-12–2016-09-18.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-09-19–2016-09-25.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
parsed: [D:\temp\.data\42468475\2016-09-26–2016-10-02.xlsx], shape: (7018, 27)
parsed: [D:\temp\.data\42468475\2016-09.xlsx], shape: (23874, 27)
parsed: [D:\temp\.data\42468475\2016-10-03–2016-10-09.xlsx], shape: (6587, 27)
ERROR: file [D:\temp\.data\42468475\2016-10-10–2016-10-12.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10-10–2016-10-13.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10-17–2016-10-20.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10-17–2016-10-23.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10-24–2016-10-27.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10-24–2016-10-30.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10-31–2016-11-06.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-11-07–2016-11-13.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-11-14–2016-11-20.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-11-21.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-11-21–2016-11-23.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-11-28–2016-12-04.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-11.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-12-05–2016-12-11.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-12-12–2016-12-18.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-12-16–2016-12-22.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-12-26–2016-12-31.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2017-01-01–2017-01-08.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2017-01-09–2017-01-15.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2017-01-23–2017-01-29.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2017-01-30–2017-02-05.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2017-02-13–2017-02-12.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2017-02-13–2017-02-19.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
parsed: [D:\temp\.data\42468475\test.xlsx], shape: (5315, 27)
ERROR: file [D:\temp\.data\42468475\Выгрузка 12-15.12.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\Выгрузка 21-27.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
parsed: [D:\temp\.data\42468475\Выгрузка 26-29.12.xlsx], shape: (4539, 27)
(85324, 28)

Comments

0

I had the same error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tassos/.local/lib/python3.6/site-packages/pandas/io/excel/_base.py", line 867, in __init__
    self._reader = self._engines[engine](self._io)
  File "/home/tassos/.local/lib/python3.6/site-packages/pandas/io/excel/_xlrd.py", line 22, in __init__
    super().__init__(filepath_or_buffer)
  File "/home/tassos/.local/lib/python3.6/site-packages/pandas/io/excel/_base.py", line 353, in __init__
    self.book = self.load_workbook(filepath_or_buffer)
  File "/home/tassos/.local/lib/python3.6/site-packages/pandas/io/excel/_xlrd.py", line 37, in load_workbook
    return open_workbook(filepath_or_buffer)
  File "/home/tassos/.local/lib/python3.6/site-packages/xlrd/__init__.py", line 138, in open_workbook
    ragged_rows=ragged_rows,
  File "/home/tassos/.local/lib/python3.6/site-packages/xlrd/xlsx.py", line 812, in open_workbook_2007_xml
    x12book.process_stream(zflo, 'Workbook')
  File "/home/tassos/.local/lib/python3.6/site-packages/xlrd/xlsx.py", line 271, in process_stream
    meth(self, elem)
  File "/home/tassos/.local/lib/python3.6/site-packages/xlrd/xlsx.py", line 380, in do_sheet
    reltype = self.relid2reltype[rid]
KeyError: ''

Find Out that the problem was that I had concurrently opened the file with my spreadsheet program (Libre-Spreadsheets), so as long as the file is opened from another program

pd.read_excel("file.xlsx","Sheet1") 

won't run. This is another case that an Excel file is considered as malformed! Hope it helps!

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.