I have a large file with many lines. Most of the lines are UTF-8, but it looks like a few of them are not. When I try to read the lines with code like this:

    import codecs

    in_file = codecs.open(source, "r", "utf-8")
    for line in in_file:
        pass  # SOME OPERATIONS

I get the following error:

    for line in in_file:
  File "C:\Python27\lib\codecs.py", line 681, in next
    return self.reader.next()
  File "C:\Python27\lib\codecs.py", line 612, in next
    line = self.readline()
  File "C:\Python27\lib\codecs.py", line 527, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python27\lib\codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 0: invalid continuation byte

What I would like to do is skip the lines that are not UTF-8 without breaking the code, then move on to the next line in the file and carry out my operations. How can I do this with try and except?

  • You can tell codecs.open() to handle errors by replacing the characters it cannot decode with placeholders, or by ignoring them altogether, but you need to make sure you actually have the right codec here. Commented Jan 24, 2015 at 8:37
  • I know the file is mostly UTF-8 because I am looking at the content and I can see it is a non-English language. Commented Jan 24, 2015 at 14:49
  • That doesn't say anything. gb2312 is a non-English codec. Latin-1 can be used for loads of non-English texts. Etc. etc. etc. Your input is almost certainly not UTF-8. Commented Jan 24, 2015 at 14:50
  • I know the file is UTF-8 because I recognize the language, and I know that it is UTF-8. Commented Jan 24, 2015 at 14:53
  • The language that is encoded says nothing about the codec used. But without seeing the actual data, this discussion is moot. Commented Jan 24, 2015 at 14:57

1 Answer

Open the file without any codec. Then, read the file line-by-line and try to decode each line from UTF-8. If that raises an exception, skip the line.
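A minimal sketch of this first approach, assuming the file uses `\n` line endings (safe for UTF-8, since the byte 0x0A never occurs inside a multi-byte sequence). The function name `iter_utf8_lines` is hypothetical, chosen only for illustration:

```python
import io


def iter_utf8_lines(path):
    """Yield each line of `path` decoded as UTF-8, skipping undecodable lines.

    `iter_utf8_lines` is a hypothetical helper name, not part of any library.
    """
    # Binary mode: lines come back as raw bytes, so nothing is decoded yet.
    with io.open(path, "rb") as in_file:
        for raw_line in in_file:
            try:
                line = raw_line.decode("utf-8")
            except UnicodeDecodeError:
                # This line is not valid UTF-8: do nothing and move on.
                continue
            yield line
```

Usage would look like `for line in iter_utf8_lines(source):` followed by your operations. Because it is a generator, the file is still read one line at a time, which matters for a large file.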

A completely different approach would be to tell the codec to replace or ignore the bytes it cannot decode. This doesn't skip the lines, but you don't seem to care too much about the contained data anyway, so it might be an alternative.
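A sketch of this second approach, using `io.open` with its `errors` argument (`codecs.open` accepts an `errors` argument as well). The function name `iter_lines_lenient` is hypothetical. With `errors="replace"` each undecodable byte becomes U+FFFD, and with `errors="ignore"` it is dropped silently:

```python
import io


def iter_lines_lenient(path, errors="replace"):
    """Yield every line of `path`, decoding as UTF-8 without ever raising.

    `iter_lines_lenient` is a hypothetical helper name. Pass errors="replace"
    to substitute U+FFFD for bad bytes, or errors="ignore" to drop them.
    """
    with io.open(path, "r", encoding="utf-8", errors=errors) as in_file:
        for line in in_file:
            yield line
```

Note that no line is skipped here: a line containing bad bytes still reaches your operations, only with those bytes replaced or removed.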

1 Comment

Thanks for the smart answers. Both methods will work for me.
