UnicodeDecodeError: 'utf8' in Python 2.7

Question

I have a large file with many lines. Most of the lines are UTF-8, but it looks like a few of them are not. When I try to read the lines with code like this:

    import codecs

    in_file = codecs.open(source, "r", "utf-8")
    for line in in_file:
        pass  # SOME OPERATIONS

I get the following error:

    for line in in_file:
  File "C:\Python27\lib\codecs.py", line 681, in next
    return self.reader.next()
  File "C:\Python27\lib\codecs.py", line 612, in next
    line = self.readline()
  File "C:\Python27\lib\codecs.py", line 527, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python27\lib\codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 0: invalid continuation byte

What I would like to do is skip the lines that are not UTF-8 without breaking the code, then move on to the next line in the file and do my operations. How can I do this with try and except?

You can tell codecs.open() to handle errors by replacing the characters it cannot decode with placeholders, or by ignoring them altogether, but you need to make sure you actually have the right codec here.
– Martijn Pieters, Commented Jan 24, 2015 at 8:37
I know the file is mostly UTF-8 because I am looking at the content and I can see it is in a non-English language.
– TJ1, Commented Jan 24, 2015 at 14:49
That doesn't say anything. gb2312 is a non-English codec. Latin-1 can be used for loads of non-English texts. Etc. etc. etc. Your input is almost certainly not UTF-8.
– Martijn Pieters, Commented Jan 24, 2015 at 14:50
I know the file is UTF-8, as I recognize the language as well and I know that it is UTF-8.
– TJ1, Commented Jan 24, 2015 at 14:53
The language of the text says nothing about the codec used. But without seeing the actual data, this discussion is moot.
– Martijn Pieters, Commented Jan 24, 2015 at 14:57

Ulrich Eckhardt · Accepted Answer · 2015-01-24 08:19:51Z

Open the file without any codec. Then, read the file line-by-line and try to decode each line from UTF-8. If that raises an exception, skip the line.
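
A minimal sketch of this first approach, assuming the file can safely be split on newline bytes (true for UTF-8, where the byte 0x0A never occurs inside a multi-byte sequence). The helper name `iter_utf8_lines` is made up for illustration; `io.open` is used so the same code runs on Python 2.7 and Python 3.

```python
import io


def iter_utf8_lines(path):
    """Yield only the lines of `path` that decode cleanly as UTF-8.

    `iter_utf8_lines` is a hypothetical helper name, not part of any library.
    """
    # Binary mode: no codec is applied while reading, so iterating
    # over the file cannot raise UnicodeDecodeError by itself.
    with io.open(path, "rb") as in_file:
        for raw_line in in_file:
            try:
                line = raw_line.decode("utf-8")
            except UnicodeDecodeError:
                # Not valid UTF-8: skip this line and move on to the next.
                continue
            yield line
```

Usage would then look like `for line in iter_utf8_lines(source): ...`, with the operations going in the loop body. Because the file is read in binary mode, line endings are not translated, so lines from a Windows file keep their trailing `\r\n`.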

A completely different approach would be to tell the codec to replace or ignore faulty characters. This doesn't skip the lines but you don't seem to care too much about the contained data anyway, so it might be an alternative.
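
The second approach might look like the sketch below. It relies on the `errors` argument of `codecs.open()`: `"replace"` substitutes U+FFFD for each undecodable byte, and `"ignore"` drops those bytes silently. The helper name `read_lines_lenient` is an assumption for illustration only.

```python
import codecs


def read_lines_lenient(path, errors="replace"):
    """Return all lines of `path`, decoded as UTF-8 with an error handler.

    `read_lines_lenient` is a hypothetical helper name. Pass
    errors="replace" to mark bad bytes with U+FFFD, or
    errors="ignore" to drop them.
    """
    with codecs.open(path, "r", "utf-8", errors=errors) as in_file:
        # No line is skipped here; lines with bad bytes come back altered.
        return [line for line in in_file]
```

Note that neither handler tells you which lines were damaged. If the file is not really UTF-8, as the comments on the question suggest, this will quietly produce wrong text rather than an error.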

TJ1 Over a year ago

Thanks for the smart answers. Both methods will work for me.
