utils.web.getEncoding() always returning 'None' in Web plugin #1362
Backport fixes for the Web plugin [1][2][3]. [1] ProgVal/Limnoria#1371 [2] ProgVal/Limnoria#1362 [3] ProgVal/Limnoria#1359 Submitted by: DanDare (GitHub: Rodrigo-NH, via IRC) git-svn-id: svn+ssh://svn.freebsd.org/ports/head@513446 35697150-7ecd-e111-bb59-0022644237b5


Hi. While trying to find out why NBSP (non-breaking space) decodes incorrectly when a page's charset is iso8859-1, I discovered that in the Web plugin (currently line 155, "text = text.decode(utils.web.getEncoding(text) or 'utf8', 'replace')"), utils.web.getEncoding(text) always returns 'None'.
I tried a couple of different pages with the same result: getEncoding() is unable to return the actual encoding.
Example of the problem: the title of the page 'https://www.freebsd.org/doc/handbook/usb-device-mode-terminals.html' contains an nbsp, correctly encoded according to iso8859-1. If I set the decoding to iso8859-1 explicitly in the code, the Web plugin returns the title correctly.
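To show the symptom outside the bot, here is a small standalone sketch. The title bytes are made up for illustration, and `detected` is only a stand-in for the 'None' I get back from utils.web.getEncoding(text); nothing here calls Limnoria itself.

```python
# Hypothetical title bytes containing a non-breaking space (0xA0),
# encoded the way an iso8859-1 page would serve them.
raw_title = "USB\u00a0Device Mode".encode("iso8859-1")

# Stand-in for the value utils.web.getEncoding(text) returns in my tests.
detected = None

# Same expression shape as the Web plugin line quoted above:
# with no detected encoding, the decode falls back to utf8.
fallback_decoded = raw_title.decode(detected or "utf8", "replace")

# Decoding with the page's real charset, set explicitly.
explicit_decoded = raw_title.decode("iso8859-1", "replace")

print(repr(fallback_decoded))
print(repr(explicit_decoded))
```

A lone 0xA0 byte is not valid UTF-8, so the fallback decode swaps it for the U+FFFD replacement character, while the explicit iso8859-1 decode keeps the non-breaking space.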
I haven't looked at getEncoding() yet to try to find the issue (assuming it really is a getEncoding() issue).
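For comparison while debugging, here is a rough sketch of how a declared charset could be sniffed from the raw page bytes. This is not Limnoria's implementation: `guess_declared_charset`, `decode_page`, and the sample page bytes are hypothetical names and data I made up, and the sketch only looks for a `charset=...` declaration near the start of the document.

```python
import re

# Matches charset=iso-8859-1, charset="utf-8", charset='utf-8', etc.
CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE)


def guess_declared_charset(raw, window=2048):
    """Return the charset declared in the first `window` bytes, or None."""
    match = CHARSET_RE.search(raw[:window])
    return match.group(1).decode("ascii") if match else None


def decode_page(raw, fallback="utf8"):
    """Decode with the declared charset, falling back when none is usable."""
    encoding = guess_declared_charset(raw) or fallback
    try:
        return raw.decode(encoding, "replace")
    except LookupError:
        # The page declared a charset name Python does not know.
        return raw.decode(fallback, "replace")


# Hypothetical page bytes with an iso-8859-1 declaration and an nbsp (0xA0).
sample_page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=iso-8859-1">'
    b"<title>USB\xa0Device Mode</title></head></html>"
)

print(guess_declared_charset(sample_page))
print(repr(decode_page(sample_page)))
```

If a helper like this finds a charset on the pages where getEncoding() returns 'None', that would suggest the problem is in the detection step and not in the pages themselves.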
The current (running) version of this Limnoria is installed on 2019-01-24T22-12-03, running on Python 3.6.8 (default, Jan 3 2019, 01:10:23) [GCC 4.2.1 Compatible FreeBSD Clang 6.0.0 (tags/RELEASE_600/final 326565)]. The newest versions available online are 2019.02.22 (in master), 2019.02.22 (in testing).
Thanks!