62

I've got about 1000 filenames read by os.listdir(); some of them are encoded in UTF-8 and some in CP1252.

I want to decode all of them to Unicode for further processing in my script. Is there a way to detect the source encoding so that each name is decoded correctly?

Example:

import os

for item in os.listdir(rootPath):
    # Convert to Unicode (Python 2: byte string -> unicode)
    if isinstance(item, str):
        item = item.decode('cp1252')  # or item = item.decode('utf-8')
    print item

6 Answers

77

Use the chardet library. It is super easy:

import chardet

the_encoding = chardet.detect('your string')['encoding']

and that's it!

In Python 3 you need to pass bytes or a bytearray, so:

import chardet
the_encoding = chardet.detect(b'your string')['encoding']

8 Comments

Seems to me it doesn't work. I created a string variable and encoded it as UTF-8; chardet returned the TIS-620 encoding.
I found that cchardet appears to be the current name for this or a similar library...; chardet was not findable.
A bit confused here. It seems like it isn't possible to pass a str as the argument. Only b'your string' works for me, or directly passing a bytes variable.
The problem with this answer for me is that some cp1252/latin1 characters can be interpreted as technically valid utf8, which leads to Ãª type characters where it should have been ê. chardet seems to try utf8 first, which results in this. There may be a way to tell it which order to use, but lucemia's answer worked better for me.
In Python 3: TypeError: Expected object of type bytes or bytearray, got: <class 'str'>
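The mis-decoded ê mentioned in the comments can be reproduced with the standard library alone. The sketch below is only an illustration (the helper names are made up for this example): it shows that UTF-8 bytes decode "successfully" as cp1252 but give the wrong text, while cp1252 bytes for the same character are rejected by the UTF-8 codec.

```python
def decode_utf8_bytes_as_cp1252(text):
    """Encode text as UTF-8, then (wrongly) decode those bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252")


def is_valid_utf8(raw):
    """Return True if raw decodes as UTF-8 without an error."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# UTF-8 bytes read as cp1252: no exception, but the result is mojibake.
mojibake = decode_utf8_bytes_as_cp1252("ê")

# The cp1252 byte for the same character is not valid UTF-8 on its own.
cp1252_is_utf8 = is_valid_utf8("ê".encode("cp1252"))

print(repr(mojibake), cp1252_is_utf8)
```

This is why a successful decode on its own does not prove the guessed encoding was the right one.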
38

If your filenames are all in either cp1252 or utf-8, then there is an easy way.

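import logging
import os

def force_decode(string, codecs=('utf8', 'cp1252')):
    # Try utf8 first: cp1252 accepts almost any bytes, so it has to be the fallback.
    for i in codecs:
        try:
            return string.decode(i)
        except UnicodeDecodeError:
            pass

    logging.warning("cannot decode filename %r", string)
    return None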

for item in os.listdir(rootPath):
    #Convert to Unicode
    if isinstance(item, str):
        item = force_decode(item)
    print item

Otherwise, there is a charset detection library:

Python - detect charset and convert to utf-8

https://pypi.python.org/pypi/chardet
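The loop above is written for Python 2. As a rough Python 3 sketch of the same try-each-codec idea, assuming the names are requested as raw bytes by passing a bytes path to os.listdir(), something like the following might work. The function name decoded_listing is hypothetical, not part of the original answer.

```python
import os
from typing import Iterator, Optional, Sequence, Tuple


def force_decode(raw: bytes,
                 codecs: Sequence[str] = ("utf-8", "cp1252")) -> Optional[str]:
    """Return raw decoded with the first codec that accepts it, else None."""
    for name in codecs:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    return None


def decoded_listing(root: str) -> Iterator[Tuple[bytes, Optional[str]]]:
    """Yield (raw_name, decoded_name) for every entry in root (hypothetical helper)."""
    # A bytes path makes os.listdir() return the undecoded bytes names.
    for raw_name in os.listdir(os.fsencode(root)):
        yield raw_name, force_decode(raw_name)
```

UTF-8 is tried first on purpose: it rejects most byte sequences that were not written as UTF-8, whereas cp1252 accepts nearly any input and would therefore mask UTF-8 names if it came first.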

16

You can also use the json package to detect the encoding, although it only distinguishes between UTF-8, UTF-16 and UTF-32:

import json

json.detect_encoding(b"Hello")

2

I tried with both json and chardet, and I got these results:

import json
import chardet

data = b'\xa9 2023'
json.detect_encoding(data)  # 'utf-8'
data.decode('utf-8')  # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte

chardet.detect(data)  # {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
data.decode("ISO-8859-1")  # '© 2023'
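The 'utf-8' result from json above is expected rather than a bug. As far as I can tell from the CPython source, json.detect_encoding() (which does not appear in the documented json API) only looks for a byte order mark or a pattern of NUL bytes, and falls back to 'utf-8' for everything else without validating the data. A small sketch of that behaviour, using made-up variable names:

```python
import json

# A byte order mark is recognised directly.
with_bom = json.detect_encoding("hi".encode("utf-16"))

# Without a BOM, the position of NUL bytes gives UTF-16/UTF-32 away.
without_bom = json.detect_encoding('"hi"'.encode("utf-16-le"))

# Anything else is reported as 'utf-8', even bytes that are not valid UTF-8.
fallback = json.detect_encoding(b"\xa9 2023")

print(with_bom, without_bom, fallback)
```

So it is useful for telling the Unicode transformation formats apart, but it cannot identify legacy single-byte encodings such as cp1252 or ISO-8859-1.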

1

charset_normalizer is a drop-in replacement for chardet.

It works better on natural language and has a permissive MIT licence: https://github.com/Ousret/charset_normalizer/

from charset_normalizer import detect
encoding = detect(byte_string)['encoding']

PS: This is not strictly related to the original question, but this page comes up in Google a lot.

0

The encoding detected by chardet can usually be used to decode a bytearray without raising an exception, but the output string may not be correct.

The try ... except ... approach works well when the candidate encodings are known, but it does not cover every scenario.

We can use try ... except ... first and then fall back to chardet as plan B:

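    from typing import List, Optional, Union

    import chardet

    def decode(byte_array: Union[bytes, bytearray],
               preferred_encodings: Optional[List[str]] = None) -> str:
        if preferred_encodings is None:
            preferred_encodings = [
                'utf8',       # Works for most cases
                'cp1252'      # Other encodings may appear in your project
            ]

        for encoding in preferred_encodings:
            # Try preferred encodings first
            try:
                return byte_array.decode(encoding)
            except UnicodeDecodeError:
                pass

        # Plan B: none of the preferred encodings worked, use the detected one
        encoding = chardet.detect(byte_array)['encoding']
        if encoding is None:
            # chardet could not make a guess at all
            raise ValueError("could not detect an encoding")
        return byte_array.decode(encoding)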
