Continuing for loop after exception in Python

Question

I've looked at similar questions, but none of the answers worked or applied to my problem. I'm writing a program that reads a text file containing many search queries to be run on YouTube. The program iterates through the text file line by line, but some lines contain special UTF-8 characters that cannot be decoded. At a certain point the program stops with:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>

Since I cannot check every line of my input, I want to catch the error, print the line being processed, and continue from that point. Because the error is raised not inside the body of my for loop but by the for statement itself, I don't know how to write a try...except statement for it. This is the code:

import urllib.request
import re
from unidecode import unidecode
with open('out.txt', 'r') as infh,\
        open("links.txt", "w") as outfh:
    for line in infh:
        try:
            clean = unidecode(line)
            search_keyword = clean
            html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
            video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
            outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
            #print("https://www.youtube.com/watch?v=" + video_ids[0])
        except:
            print("Error encounted with Line: " + line)

This is the full error message, to see that the for loop itself is causing the problem.

Traceback (most recent call last):
  File "ytbysearchtolinks.py", line 6, in <module>
    for line in infh:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>

If you need an example of input I'm working with: https://pastebin.com/LEkwdU06

Your try-except block looks good. I ran your code (on Linux, without unidecode) with your input and it worked for me.
– Paul P, Commented Mar 3, 2021 at 9:10
Did you do it with my paste? It's about Unicode, so that's pretty essential.
– Loewe8, Commented Mar 3, 2021 at 9:13
Yes, I downloaded the file and ran the exact same code that you posted here, with the only difference that I didn't import and use unidecode. I get a list of YT links and a few errors like Error encounted with Line: Baianá+Bakermat, but it continues on.
– Paul P, Commented Mar 3, 2021 at 9:30
I implemented the Unidecode package because I want to account for those, as an 'á' is decoded into an a. But other characters aren't. My question is why this error is not caught by the try...except.
– Loewe8, Commented Mar 3, 2021 at 9:40
I understand. Please see the code in my answer below; it worked for me.
– Paul P, Commented Mar 3, 2021 at 9:55

Paul P · Accepted Answer · 2021-03-03 10:11:18Z

The try-except-block looks correct and should allow you to catch all occurring exceptions.

Using unidecode probably won't help you, because non-ASCII characters must be percent-encoded in URLs.

One solution is to use urllib's quote() function. As per documentation:

Replace special characters in string using the %xx escape.
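
As a small illustration of what quote() does to a query, here is a minimal sketch. The query string is only an example and is not taken from your file; quote_plus() is a sibling function from the same module that encodes spaces as "+" instead of "%20".

```python
from urllib.parse import quote, quote_plus

# Example query containing a non-ASCII character and a space
query = "Baianá Bakermat"

# quote() percent-encodes the UTF-8 bytes of each unsafe character
encoded = quote(query)
# quote_plus() does the same but represents spaces as "+"
encoded_plus = quote_plus(query)

# A trailing newline (as found on lines read from a file) is encoded too,
# so it is usually worth stripping the line first
encoded_newline = quote("abc\n")

print(encoded)
print(encoded_plus)
print(encoded_newline)
```

Since each line read from a file still ends with a newline, calling line.strip() before quoting avoids sending a stray %0A in the URL.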

This is what works for me with the input you've provided:

import urllib.request
from urllib.parse import quote
import re

with open('out.txt', 'r', encoding='utf-8') as infh, \
        open("links.txt", "w") as outfh:
    for line in infh:
        # Percent-encode the query so it is safe to embed in the URL
        search_keyword = quote(line)
        html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
        # Video IDs are the 11 characters following "watch?v="
        video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
        outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
        print("https://www.youtube.com/watch?v=" + video_ids[0])

EDIT: After thinking about it, I believe you are running into the following problem:

You are running the code on Windows. When open() is called without an explicit encoding, Python uses the platform's default encoding, which is cp1252 on your system (the traceback shows cp1252.py), while the file that you shared is in UTF-8 encoding:

$  file out.txt
out.txt: UTF-8 Unicode text, with CRLF line terminators

This would explain the exception you are getting and why it's not being caught by your try-except block: the decoding happens when the for statement reads the next line from the file, which is outside the body of the try.
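
To illustrate the mismatch, here is a minimal sketch that decodes the same bytes with both codecs. The helper try_decode is a hypothetical name used only for this example, and the character 'Á' is just one case whose UTF-8 encoding happens to contain the byte 0x81 from your error message.

```python
# 'Á' (U+00C1) is encoded in UTF-8 as the two bytes 0xC3 0x81
raw = "Á".encode("utf-8")

def try_decode(data, encoding):
    """Return the decoded text, or a short failure message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason

# Decoding with the encoding the bytes were written in works
utf8_result = try_decode(raw, "utf-8")
# cp1252 has no character assigned to byte 0x81, so decoding fails
cp1252_result = try_decode(raw, "cp1252")

print(utf8_result)
print(cp1252_result)
```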

Make sure that you are using encoding='utf-8' when opening the file.
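
If you still want the loop to survive lines that cannot be decoded, one possible approach (a sketch under assumptions, not the only way) is to read the file in binary mode and decode each line inside the try block, so the error is raised where it can be caught. The function name read_lines_tolerantly and the demo file contents are made up for this example.

```python
import os
import tempfile

def read_lines_tolerantly(path, encoding="utf-8"):
    """Yield (line_number, text) for decodable lines; report and skip the rest."""
    # Binary mode: iterating yields raw bytes, so the for statement never decodes
    with open(path, "rb") as infh:
        for number, raw in enumerate(infh, start=1):
            try:
                text = raw.decode(encoding)
            except UnicodeDecodeError as exc:
                print(f"Skipping line {number}: {exc}")
                continue
            yield number, text.rstrip("\r\n")

# Small demonstration with a throwaway file containing one undecodable line
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("Baianá Bakermat\n".encode("utf-8"))
    tmp.write(b"broken \x81 line\n")
    tmp.write(b"plain ascii\n")
    demo_path = tmp.name

good_lines = list(read_lines_tolerantly(demo_path))
os.remove(demo_path)
print(good_lines)
```

Alternatively, open(path, encoding='utf-8', errors='replace') keeps text mode and substitutes a replacement character for undecodable bytes instead of raising, at the cost of not knowing which lines were affected.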

This doesn't solve my exception, but using the quote function is much smarter than my approach to the encoding problem. Thanks!

lunk17 · Accepted Answer · 2021-03-03 09:09:32Z

I ran your code and didn't have any problems. Have you created a virtual environment with virtualenv and installed all the packages you use?

3 Comments

Loewe8 Over a year ago

I'm on Windows. All packages are installed. The program works until an exception is raised. So the try...except is not working, as the decode error is not ignored.

lunk17 Over a year ago

OK, I tested on Linux, sorry. Maybe the problem comes from the encoding of the text in your out.txt. I don't know, but maybe you can try saving this file entirely as UTF-8.

Loewe8 Over a year ago

As I said, my problem isn't the error itself. I want the exception handling for it to work.
