
I'm using Boilerpipe to extract text from a URL with this code:

URL url = new URL("http://www.example.com/some-location/index.html");
String text = ArticleExtractor.INSTANCE.getText(url);

The String text contains just the text of the HTML page, but I need to extract the whole HTML code of the page as well.

Is there anyone who used this library and knows how to extract the HTML code?

You can check the demo page for more info on the library.

3 Answers


For something as simple as this, you don't really need an external library:

// Requires: java.io.BufferedReader, java.io.InputStreamReader,
// java.net.URL, java.nio.charset.StandardCharsets
URL url = new URL("http://www.google.com");
StringBuilder sb = new StringBuilder();
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = br.readLine()) != null) {
        sb.append(line).append('\n'); // readLine() strips line terminators
    }
}
String htmlContent = sb.toString();

1 Comment

sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast to java.lang.String

Just use the KeepEverythingExtractor instead of the ArticleExtractor.
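For reference, a minimal sketch of that suggestion, assuming Boilerpipe is on the classpath and the same getText(URL) call the question already uses; as the comment below notes, the result is still extracted text rather than the page's markup:

URL url = new URL("http://www.example.com/some-location/index.html");
// KeepEverythingExtractor (de.l3s.boilerpipe.extractors) keeps all text blocks,
// not just the main article, but the output is plain text, not HTML
String fullText = KeepEverythingExtractor.INSTANCE.getText(url);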

But this is using the wrong tool for the job. What you want is just to download the HTML content of a URL (right?), not to extract content. So why use a content extractor?

1 Comment

KeepEverythingExtractor does not return the HTML code; it returns the full text of the HTML page (links, ...)

With Java 7 and a Scanner trick, you can do the following:

public static String toHtmlString(URL url) throws IOException {
    Objects.requireNonNull(url, "The url cannot be null.");
    try (InputStream is = url.openStream(); Scanner sc = new Scanner(is, "UTF-8")) {
        // "\\A" matches the start of input, so the whole stream becomes one token
        sc.useDelimiter("\\A");
        if (sc.hasNext()) {
            return sc.next();
        } else {
            return null; // or empty
        }
    }
}
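A possible call site, reusing the URL from the question (inside a method that declares or handles IOException):

URL url = new URL("http://www.example.com/some-location/index.html");
String html = toHtmlString(url); // raw markup, or null if the stream was empty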

