Read XLS file with Pandas & xlrd returns error; xlrd opens file on its own

Question

I am writing some automated scripts to process Excel files in Python, some are in XLS format. Here's a code snippet of my attempting to do so with Pandas:

df = pd.read_excel(contents, engine='xlrd', skiprows=5, names=['some', 'column', 'headers'])

contents is the file contents pulled from an AWS S3 bucket. When this line runs I get [ERROR] ValueError: File is not a recognized excel file.

In troubleshooting this, I have tried to access the spreadsheet using xlrd directly:

book = xlrd.open_workbook(file_contents=contents)
print("Number of worksheets is {}".format(book.nsheets))
print("Worksheet names: {}".format(book.sheet_names()))

This works without errors so xlrd seems to recognize it as an Excel file, just not when asked to do so by Pandas.

Anyone know why Pandas won't read the file with xlrd as the engine? Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?

You say some are in XLS format. Is it possible you're getting this error when encountering a XLSX one? The reason I mention this is because in newer versions of pandas, engine=xlrd only works with the old-style Excel files. I'm not sure the reason for this change since xlrd itself can handle both. Reference: pandas.pydata.org/docs/reference/api/pandas.read_excel.html — mechanical_meat
– mechanical_meat, Commented Aug 28, 2021 at 19:23
Since xlrd reads the file on it's own, I'm pretty certain it's not an XLSX file. I've also run a file command on some of these XLS files and Linux does not recognize them as XLSX files either. — Christian
– Christian, Commented Aug 28, 2021 at 19:27
What I'm trying to say is xlrd reads both formats. You can use xlrd.inspect_format(path) to figure out which you're dealing with. Reference: xlrd.readthedocs.io/en/latest/api.html#xlrd.inspect_format — mechanical_meat
– mechanical_meat, Commented Aug 28, 2021 at 20:03

MDR · Accepted Answer · 2021-08-28 22:19:53Z

Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?

pd.read_excel can take a book...

import xlrd

book = xlrd.open_workbook(filename='./file_check/file.xls')

df = pd.read_excel(book, skiprows=5)

print(df)

   some   column headers
0     1     some     foo
1     2  strings     bar
2     3     here     yes
3     4      too      no

I'll include the code below that may help if you want to check/handle Excel file types. Maybe you can adapt it for your needs.

The code loops through a local folder and shows the file and extension but then uses python-magic to drill into it. It also has a column showing guessing from mimetypes but that isn't as good. Do zoom into the image of the frame and see that some .xls are not what the extension says. Also a .txt is actually an Excel file.

import pandas as pd
import glob
import mimetypes
import os
# https://pypi.org/project/python-magic/
import magic

path = r'./file_check' # use your path
all_files = glob.glob(path + "/*.*")

data = []

for file in all_files:
    name, extension = os.path.splitext(file)
    data.append([file, extension, magic.from_file(file, mime=True), mimetypes.guess_type(file)[0]])

df = pd.DataFrame(data, columns=['Path', 'Extension', 'magic.from_file(file, mime=True)', 'mimetypes.guess_type'])

# del df['magic.from_file(file, mime=True)']

df

From there you could filter files based on their type:

xlsx_file_format = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'

xls_file_format = 'application/vnd.ms-excel'

for file in all_files:
    if magic.from_file(file, mime=True) == xlsx_file_format:
        print('xlsx')
        #  DO SOMETHING SPECIAL WITH XLSX FILES
    elif magic.from_file(file, mime=True) == xls_file_format:
        print('xls')
        #  DO SOMETHING SPECIAL WITH XLS FILES
    else:
        continue

dfs = []

for file in all_files:
    if (magic.from_file(file, mime=True) == xlsx_file_format) or \
    (magic.from_file(file, mime=True) == xls_file_format):
        # who cares, it all works with this for the demo...
        df = pd.read_excel(file, skiprows=5, names=['some', 'column', 'headers'])
        dfs.append(df)
    
print('\nHow many frames did we get from seven files? ', len(dfs))

Output:

xlsx
xls
xls
xlsx

How many frames did we get from seven files?  4

I suppose if I had RTFM closer on Pandas and read_excel I would have seen xlrd.book as one of the options to pass data into Pandas. Thank you for your help!

Collectives™ on Stack Overflow

Read XLS file with Pandas & xlrd returns error; xlrd opens file on its own

1 Answer 1

1 Comment

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Related