When I download page content with the Node.js request module and the content is encoded in ISO-8859-2, I cannot convert it to UTF-8.

I am using node-iconv for the conversion.

Code:

request('https://www.jakpsatweb.cz', function(err, resp, body){
    const title = regexToRetrieveTitle(body);
    const iconv = new Iconv('ISO-8859-2', 'UTF-8');
    const buffer = iconv.convert(title);
    console.log(buffer);
    console.log(buffer.toString('UTF8'));
})

Console:

<Buffer 52 65 6b 6c 61 6d 61 3a 20 6a 61 6b 20 66 75 6e 67 75 6a 65 20 77 65 62 6f 76 c4 8f c5 bc cb 9d 20 72 65 6b 6c 61 6d 61>
Reklama: jak funguje webovďż˝ reklama

Expected result:

Reklama: jak funguje webová reklama

Does anyone know where the problem is?

EDIT:

For example, I download this page. I identified the encoding as ISO-8859-2 from the meta tags (Chrome detects it as well), and I need to convert the page content and save it to a database. My database is UTF-8, so I need to re-encode the content.

  • Please provide the expected input and output strings (not just a buffer) Commented Oct 19, 2016 at 12:31
  • It is there. As you can see, there are two console.log() calls: the first line of the output is the buffer and the second is the string. The expected result is the string alone. Commented Oct 19, 2016 at 12:50
  • What is the value of title? Commented Oct 19, 2016 at 13:28
  • The title is the parsed content of <title>content</title>. Question updated. Commented Oct 19, 2016 at 13:35

2 Answers

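The problem is in the Node.js request module. Its encoding option defaults to UTF-8, so the response body is decoded into a UTF-8 string before the callback sees it, and the ISO-8859-2 bytes are already damaged by then. I had to set encoding to null so that body is returned as a raw Buffer, and now everything works fine.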

request({ uri: 'https://www.jakpsatweb.cz', encoding: null}, function(err, resp, body){
    .....
})
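
With `encoding: null`, `body` arrives as a Buffer of the original ISO-8859-2 bytes, which can then be decoded correctly. The following is a minimal sketch of that decoding step, not the code from this answer. It assumes Node's built-in `TextDecoder` in place of node-iconv (the `iso-8859-2` label requires a Node.js build with full ICU data), and the helper name `decodeLatin2` and the hard-coded `rawBody` bytes are hypothetical stand-ins for the real response body.

```javascript
// Hypothetical helper: decode raw ISO-8859-2 bytes into a JavaScript string.
// TextDecoder is a Node.js global; 'iso-8859-2' needs full ICU support.
function decodeLatin2(bytes) {
  return new TextDecoder('iso-8859-2').decode(bytes);
}

// Stand-in for the Buffer that request returns with encoding: null.
// These are the bytes of "webová" in ISO-8859-2 (á is the single byte 0xE1).
const rawBody = Buffer.from([0x77, 0x65, 0x62, 0x6f, 0x76, 0xe1]);

// Decode first, then run any string processing (such as the title regex).
const title = decodeLatin2(rawBody);
console.log(title);

// Encoding the string as UTF-8 gives the bytes to store in a UTF-8 database.
const utf8Bytes = Buffer.from(title, 'utf8');
console.log(utf8Bytes.toString('hex'));
```

The important part is the order of operations: the bytes are decoded from ISO-8859-2 exactly once, while they are still raw, and only afterwards treated as text.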

1 Comment

In my case, I just replaced request with axios.

The conversion from ISO-8859-2 to UTF-8 worked fine. It was the input (the title variable) that had the wrong contents: the title contained the bytes EF BF BD. This means that the title was already UTF-8 encoded, but with U+FFFD (REPLACEMENT CHARACTER) in the place where you would expect the letter á (LATIN SMALL LETTER A WITH ACUTE). Reading those three bytes as ISO-8859-2 gives the characters ď, ż and ˝, which is exactly the c4 8f c5 bc cb 9d sequence in your output buffer.
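
The loss described above can be reproduced with Node's built-in Buffer alone. This is an illustrative sketch rather than the asker's code: the byte array is a hypothetical stand-in for the part of the downloaded page that contains "webová" in ISO-8859-2.

```javascript
// "webová" as ISO-8859-2 bytes; á is the single byte 0xE1.
const latin2Bytes = Buffer.from([0x77, 0x65, 0x62, 0x6f, 0x76, 0xe1]);

// Decoding these bytes as UTF-8 mimics what happens when the body is
// turned into a string with the default encoding. The byte 0xE1 is not
// valid UTF-8 on its own, so it is replaced by U+FFFD.
const decodedAsUtf8 = latin2Bytes.toString('utf8');

// Re-encoding shows the replacement character's UTF-8 bytes: EF BF BD.
const reencoded = Buffer.from(decodedAsUtf8, 'utf8');
console.log(reencoded.toString('hex')); // prints "7765626f76efbfbd"
```

Once the original byte 0xE1 has been replaced by U+FFFD, no later conversion can bring the á back, which is why the string has to be decoded from the raw bytes.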

Now, the original web page https://www.jakpsatweb.cz/reklama/index.html is correctly encoded in ISO-8859-2 and also has the required charset declaration in the <head> section.

Therefore, the problem must be either in the software that downloads the web page (Node.js) or in the regexToRetrieveTitle function.
