I have seen similar questions, but none of the answers worked or applied to my problem. I'm writing a program that takes a text file containing many search queries to be run on YouTube and iterates through it line by line. Some lines contain non-ASCII characters that cannot be decoded, so at a certain point the program stops with:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>

As I cannot check every line of my input, I want the program to catch the error, print the line it was working on, and continue from that point. Because the error is raised not inside the body of my for loop but by the for statement itself, I don't know how to write a try...except statement for it. This is the code:

import urllib.request
import re
from unidecode import unidecode
with open('out.txt', 'r') as infh,\
        open("links.txt", "w") as outfh:
    for line in infh:
        try:
            clean = unidecode(line)
            search_keyword = clean
            html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
            video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
            outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
            #print("https://www.youtube.com/watch?v=" + video_ids[0])
        except:
            print("Error encounted with Line: " + line)

This is the full error message, which shows that the for loop itself is causing the problem:

Traceback (most recent call last):
  File "ytbysearchtolinks.py", line 6, in <module>
    for line in infh:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>

If you need an example of input I'm working with: https://pastebin.com/LEkwdU06

  • Your try-except-block looks good. I ran your code (on Linux without unidecode) with your input and it worked for me. Commented Mar 3, 2021 at 9:10
  • Did you run it with my paste? It's about Unicode, so that's pretty essential. Commented Mar 3, 2021 at 9:13
  • Yes, I downloaded the file and ran the exact same code that you posted here, with the only difference that I didn't import and use unidecode. I get a list of YT links and a few errors like Error encounted with Line: Baianá+Bakermat, but it continues on. Commented Mar 3, 2021 at 9:30
  • I implemented the Unidecode package because I want to account for those, as an 'á' is decoded into an a. But other characters aren't. My problem is why this error is not caught with the try...except. Commented Mar 3, 2021 at 9:40
  • I understand. Please see the code in my answer below; it worked for me. Commented Mar 3, 2021 at 9:55

2 Answers

1

The try-except-block looks correct and should allow you to catch all occurring exceptions.

The usage of unidecode probably won't help you because non-ASCII characters must be encoded in a specific way in URLs, see, e.g., here.

One solution is to use urllib's quote() function. As per documentation:

Replace special characters in string using the %xx escape.
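As a small illustration of what that escaping does, here is a minimal sketch using a made-up query string (the value of query is an assumption for demonstration, not taken from the input file). quote() percent-encodes the UTF-8 bytes of each non-ASCII character, and quote_plus() additionally turns spaces into +, which is the usual form for query strings:

```python
from urllib.parse import quote, quote_plus

# Hypothetical search query containing a non-ASCII character and a space
query = "Baianá Bakermat"

encoded = quote(query)            # spaces become %20
encoded_plus = quote_plus(query)  # spaces become +

print(encoded)       # Baian%C3%A1%20Bakermat
print(encoded_plus)  # Baian%C3%A1+Bakermat

# Either form can be appended to the search URL
url = "https://www.youtube.com/results?search_query=" + encoded_plus
print(url)
```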

This is what works for me with the input you've provided:

import urllib.request
from urllib.parse import quote
import re

with open('out.txt', 'r', encoding='utf-8') as infh, \
        open("links.txt", "w") as outfh:
    for line in infh:
        # Strip the trailing newline so it is not percent-encoded into the query
        search_keyword = quote(line.strip())
        html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
        # Video IDs are the 11 characters following "watch?v=" in the result page
        video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
        outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
        print("https://www.youtube.com/watch?v=" + video_ids[0])

EDIT: After thinking about it, I believe you are running into the following problem:

You are running the code on Windows, where Python opens text files with the locale's preferred encoding unless you specify one. On your system that is cp1252 (the traceback goes through cp1252.py), while the file that you shared is encoded in UTF-8:

$  file out.txt
out.txt: UTF-8 Unicode text, with CRLF line terminators

This would explain the exception you are getting and why it's not being caught by your try-except-block: the decoding happens when the for statement reads the next chunk of the file, which is outside the try block.
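To see the mismatch in isolation, here is a minimal sketch. The character "Á" is just an assumed example of something that could appear in such a file; its UTF-8 encoding contains the byte 0x81, which cp1252 does not define:

```python
import locale

# What open() falls back to when no encoding is given (platform dependent)
print(locale.getpreferredencoding(False))

# "Á" (U+00C1) is encoded in UTF-8 as the two bytes 0xC3 0x81
data = "Á".encode("utf-8")
print(data)  # b'\xc3\x81'

# cp1252 has no character assigned to byte 0x81, so decoding fails
try:
    data.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc
print(cp1252_error)

# Decoding with the encoding the bytes were written in works
print(data.decode("utf-8") == "Á")  # True
```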

Make sure that you are using encoding='utf-8' when opening the file.
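If you still want to report and skip lines that cannot be decoded, one possible approach (a sketch, not the only way) is to open the file in binary mode and decode each line yourself, so that the decoding happens inside the try block. The function name read_lines_tolerantly and the demo file contents below are made up for illustration:

```python
import os
import tempfile


def read_lines_tolerantly(path, encoding="utf-8"):
    """Yield (line_number, text) pairs, reporting and skipping undecodable lines."""
    # Binary mode: iterating over the file yields raw bytes and never decodes
    with open(path, "rb") as infh:
        for number, raw in enumerate(infh, start=1):
            try:
                text = raw.decode(encoding)
            except UnicodeDecodeError as exc:
                print(f"Skipping line {number}: {exc}")
                continue
            yield number, text.strip()


# Demonstration with a throwaway file (hypothetical contents)
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("Baianá Bakermat\n".encode("utf-8"))
    tmp.write(b"broken \xff line\n")  # 0xff is never valid in UTF-8
    tmp.write(b"plain ascii\n")
    demo_path = tmp.name

good_lines = list(read_lines_tolerantly(demo_path))
os.remove(demo_path)

# The second line is reported and skipped; the other two come through
print(good_lines)
```

In the original program, the body of the loop (building the URL, fetching the page, writing the link) would then run on each yielded text value.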


3 Comments

This doesn't solve my exception, but using the quote function is much smarter than my approach to the encoding problem. Thanks!
Are you using encoding='utf-8' when opening the file?
Added an explanation as to why this is the case.
0

I ran your code and didn't run into any problems. Have you created a virtual environment with virtualenv and installed all the packages you use?

3 Comments

I'm on Windows. All packages are installed. The program works until the exception is raised. So the try...except is not working, as the decode error is not caught.
OK, sorry, I tested on Linux. Maybe the problem comes from the encoding of your out.txt file. I don't know, but you could try converting the whole file to UTF-8.
As I said, my problem isn't the error itself. I want the exception handling for it to work.
