3

I have this code:

request({ url: 'http://www.myurl.com/' }, function(error, response, html) {
  if (!error && response.statusCode == 200) {
    console.log($('title', html).text());
  }
});

But the websites that I'm crawling can use different charsets (UTF-8, ISO-8859-1, etc.). How can I detect the charset and always decode the HTML to the right encoding (UTF-8)?

Thanks, and sorry for my English ;)

1
  • Well, I know that I can use the encoding option for the request, but the problem is that I don't yet know the charset of the page (which I only learn from the header or the meta tag). Commented Apr 23, 2011 at 14:33

2 Answers

2

The website can declare the character encoding in the Content-Type header or in the Content-Type meta tag inside the returned HTML, e.g.:

<meta http-equiv="Content-Type" content="text/html; charset=latin1"/>

You can use the charset module to automatically check both of these for you. Not all websites or servers will specify an encoding though, so you'll want to fall back to detecting the charset from the data itself. The jschardet module can help you with that.
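To illustrate the meta tag half of that check, here is a minimal sketch of what such a lookup might do. The helper name `charsetFromMeta` and the 4096-character scan window are assumptions made for illustration; this is not the charset module's actual implementation.

```javascript
// Hypothetical helper: find a charset declaration in a <meta> tag near the top of the HTML.
function charsetFromMeta(html) {
  // Only scan the start of the document; the declaration is expected in the head.
  const head = html.slice(0, 4096);
  // Covers <meta charset="x"> and <meta http-equiv="Content-Type" content="...; charset=x">
  const match = /<meta[^>]*?charset\s*=\s*["']?\s*([\w.:-]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

console.log(charsetFromMeta('<meta http-equiv="Content-Type" content="text/html; charset=latin1"/>'));
```

A regex is only a rough approximation of HTML parsing, so treat the result as a hint and keep the detection fallback.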

Once you've worked out the charset you can use the iconv module to do the actual conversion. Here's a full example:

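var request = require('request');
var charset = require('charset');
var jschardet = require('jschardet');
var Iconv = require('iconv').Iconv;
request({url: 'http://www.myurl.com/', encoding: 'binary'}, function(error, response, html) {
    if (error || response.statusCode != 200) return;
    // Header and meta tag first, then detection from the bytes, then assume UTF-8
    var enc = charset(response.headers, html) || jschardet.detect(html).encoding || 'utf-8';
    var raw = Buffer.from(html, 'binary');
    if (!/^utf-?8$/i.test(enc)) {
        raw = new Iconv(enc, 'UTF-8//TRANSLIT//IGNORE').convert(raw);
    }
    html = raw.toString('utf-8');
    console.log($('title', html).text());
});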

0

First up, you could send an Accept-Charset header to ask servers for a charset you can handle. It is only a hint, though, and servers are free to ignore it, so you still need to check what comes back.

Once you get a response, you can check the Content-Type header for the charset entry and do appropriate processing.
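As a rough sketch of that header check, the function below pulls the charset parameter out of a Content-Type value. The name `charsetFromContentType` is hypothetical, and the regex only handles the common `type/subtype; charset=...` shape.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type header value.
function charsetFromContentType(contentType) {
  if (!contentType) return null;
  // The value may be quoted, and other parameters may follow after a semicolon.
  const match = /;\s*charset\s*=\s*"?([^\s;"]+)/i.exec(contentType);
  return match ? match[1].toLowerCase() : null;
}

console.log(charsetFromContentType('text/html; charset=ISO-8859-1'));
```

With the request callback from the question, this would be called as `charsetFromContentType(response.headers['content-type'])`; a `null` result means the header gave no answer and another method is needed.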

Another hack (which I've used in the past, albeit in Python) when the content encoding is unknown is to try decoding with each candidate encoding in turn and stick with the first one that doesn't throw an exception.
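A Node.js version of this trial-and-error idea might look like the sketch below. It assumes a Node build with full ICU data (so that `TextDecoder` knows labels such as `windows-1252`), and the helper name `firstDecodable` is made up for illustration. Single-byte encodings accept any byte sequence and never throw, so stricter encodings such as UTF-8 should come first in the candidate list.

```javascript
// Hypothetical helper: return the first candidate encoding that decodes the bytes without error.
function firstDecodable(bytes, candidates) {
  for (const label of candidates) {
    try {
      // fatal: true makes decode() throw on invalid byte sequences instead of substituting U+FFFD
      const text = new TextDecoder(label, { fatal: true }).decode(bytes);
      return { encoding: label, text };
    } catch (err) {
      // Invalid data for this encoding, or an unsupported label: try the next candidate
    }
  }
  return null;
}

// The byte 0xE9 on its own is not valid UTF-8, so the second candidate is used here.
const result = firstDecodable(new Uint8Array([0x63, 0x61, 0x66, 0xe9]), ['utf-8', 'windows-1252']);
console.log(result);
```

A successful decode only shows that the bytes are valid in that encoding, not that the text is what the author meant, so this works best as a last resort after the header and meta tag checks.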

1 Comment

You can also try the module announced on this page: groups.google.com/group/nodejs/browse_thread/thread/… and here's the direct link: github.com/franzenzenhofer/whatlang
