I am trying to get text from a website. When you change the language, the page URL normally contains "/en", but the page that has the information I want does not.

http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92

HTML tags (the text contains the description of the photo):
<div id="redx_gallery_pic_title"> text text </div>

The problem is that the website is in German and I want the text in English, but my script gets only the German version.

Any ideas how I can do this?

Java code:
...
URL oracle = new URL(x);
StringBuilder theText = new StringBuilder();
// try-with-resources closes the reader even if reading fails
try (BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        theText.append(inputLine).append('\n');
    }
}
String html = theText.toString();

// StringUtils here is presumably the Apache Commons Lang class
String[] name = StringUtils.substringsBetween(html, "redx_gallery_pic_title\">", "</div>");
  • What programming language are you using? What language APIs are you using to parse HTML? Show the code you have so far to get the HTML contents. Commented Aug 3, 2011 at 16:06
  • I posted an answer, but in the future, you should really mention the language and tag the question as such. There are a gazillion ways to parse HTML from a site and you didn't say anything about yours. Commented Aug 3, 2011 at 19:25

1 Answer

That site is internationalized with German as default. You need to tell the server what language you're accepting by specifying the desired ISO 639-1 language code in the Accept-Language request header.

URLConnection connection = new URL(url).openConnection();
connection.setRequestProperty("Accept-Language", "en");
InputStream input = connection.getInputStream();
// ...

Unrelated to the concrete problem, may I suggest you have a look at Jsoup as an HTML parser? Its jQuery-like CSS selector syntax is much more convenient, and the resulting code is much less bloated than your attempt so far:

String url = "http://www.wippro.at/module/gallery/index.php?limitstart=0&picno=0&gallery_key=92";
Document document = Jsoup.connect(url).header("Accept-Language", "en").get();
String title = document.select("#redx_gallery_pic_title").text();
System.out.println(title); // Beech, glazing V3

That's all.

5 Comments

But what if I want to get the text in Romanian? If I put "ro" instead of "en", I don't get the special characters.
That's because you're relying on the platform default encoding to read the response body. You need to use the other constructor of InputStreamReader which takes the charset as second argument and specify it with "UTF-8". Jsoup takes this fully transparently into account by the way :)
You're right that Jsoup is easier, but I still don't know how to set the charset (for the Jsoup code).
You don't need to. As said, it takes this fully transparently into account. It's smart enough to figure that out based on the HTTP response header.
Then the problem is where you display or save the characters. Are you displaying them in an IDE like Eclipse using System.out.println()? If so, set the Eclipse console encoding via Window > Preferences > General > Workspace and then set Text file encoding to UTF-8. Otherwise it'll use the platform default one. For more hints, see balusc.blogspot.com/2009/05/…
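
The two fixes discussed above (the Accept-Language header and an explicit charset on InputStreamReader) can be combined with plain java.net roughly as follows. This is only a sketch: it assumes the server sends UTF-8, which should be checked against the Content-Type response header. The class name LocalizedTitleFetcher and its helper methods are hypothetical, and a regex stands in for StringUtils.substringsBetween so that the example needs no third-party library.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
class LocalizedTitleFetcher {

    // Captures the text between the title div's opening tag and the next </div>.
    private static final Pattern TITLE =
            Pattern.compile("redx_gallery_pic_title\">(.*?)</div>", Pattern.DOTALL);

    // Decodes the stream with an explicit charset instead of the platform default.
    static String readAll(InputStream input, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(input, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    // Returns the trimmed content of every title div found in the HTML.
    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher matcher = TITLE.matcher(html);
        while (matcher.find()) {
            titles.add(matcher.group(1).trim());
        }
        return titles;
    }

    // Requests the page in the given language; UTF-8 is an assumption about the server.
    static String fetch(String url, String languageCode) throws IOException {
        URLConnection connection = URI.create(url).toURL().openConnection();
        connection.setRequestProperty("Accept-Language", languageCode);
        return readAll(connection.getInputStream(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String html;
        if (args.length == 2) {
            // Usage: java LocalizedTitleFetcher.java <url> <language code, e.g. ro>
            html = fetch(args[0], args[1]);
        } else {
            // Offline demo with made-up sample markup containing a non-ASCII character.
            byte[] sample = "<div id=\"redx_gallery_pic_title\"> Sticl\u0103 </div>"
                    .getBytes(StandardCharsets.UTF_8);
            html = readAll(new ByteArrayInputStream(sample), StandardCharsets.UTF_8);
        }
        System.out.println(extractTitles(html));
    }
}
```

If the printed characters still look wrong, the decoding is probably fine and the console encoding is the likelier culprit, as noted in the last comment.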
