Extracting anchor tag from html using Java

Question

I have several anchor tags in a text,

Input: <a href="http://stackoverflow.com" >Take me to StackOverflow</a>

Output: http://stackoverflow.com

How can I find all such input strings and convert each one to its output string in Java, without using a third-party API?

Bart Kiers · Accepted Answer · 2011-07-11 10:00:16Z

There are classes in the core API that you can use to get all href attributes from anchor tags (if present!):

import java.io.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class HtmlParseDemo {
   public static void main(String [] args) throws Exception {

       String html =
           "<a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> " +
           "<!--                                                               " +
           "<a href=\"http://ignoreme.com\" >...</a>                           " +
           "-->                                                                " +
           "<a href=\"http://www.google.com\" >Take me to Google</a>           " +
           "<a>NOOOoooo!</a>                                                   ";

       Reader reader = new StringReader(html);
       HTMLEditorKit.Parser parser = new ParserDelegator();
       final List<String> links = new ArrayList<String>();

       parser.parse(reader, new HTMLEditorKit.ParserCallback(){
           public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
               if(t == HTML.Tag.A) {
                   Object link = a.getAttribute(HTML.Attribute.HREF);
                   if(link != null) {
                       links.add(String.valueOf(link));
                   }
               }
           }
       }, true);

       reader.close();
       System.out.println(links);
   }
}

which will print:

[http://stackoverflow.com, http://www.google.com]
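If the links are needed in more than one place, the same callback can be wrapped in a small helper that accepts any Reader (a file, a network stream, or a string). This is only a sketch: the class name LinkExtractor and the method name extractLinks are made up for illustration, and wrapping the IOException in an unchecked exception is a convenience choice rather than anything the parser requires.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is not part of any library.
public class LinkExtractor {

    /** Collects the href value of every <a> start tag read from the given reader. */
    public static List<String> extractLinks(Reader reader) {
        final List<String> links = new ArrayList<>();
        try {
            new ParserDelegator().parse(reader, new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                    if (t == HTML.Tag.A) {
                        Object href = a.getAttribute(HTML.Attribute.HREF);
                        if (href != null) {
                            links.add(String.valueOf(href));
                        }
                    }
                }
            }, true); // true = ignore any charset declaration inside the document
        } catch (IOException e) {
            // Assumption: callers prefer an unchecked exception here.
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>"
                + "<a>NOOOoooo!</a>";
        // The caller owns the reader, so it is closed here rather than in the helper.
        try (Reader reader = new StringReader(html)) {
            System.out.println(extractLinks(reader));
        }
    }
}
```

Taking a Reader instead of a String means a large page does not have to be held in memory as one string before parsing starts.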

Wow. I didn't know HTMLEditorKit existed. Which is the best HTML parser if I were to use one?
The best parser is the one that passes all your unit tests :). This is a better option than using a regex hack.

Op De Cirkel · Accepted Answer · 2011-07-11 11:29:51Z


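import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        String test = "qazwsx<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>fdgfdhgfd"
                + "<a href=\"http://stackoverflow2.com\">Take me to StackOverflow2</a>dcgdf";

        // Group 1 captures the href value, including its surrounding quotes.
        String regex = "<a href=(\"[^\"]*\")[^<]*</a>";

        Pattern p = Pattern.compile(regex);

        Matcher m = p.matcher(test);
        // Replace each whole anchor element with its captured (quoted) href.
        System.out.println(m.replaceAll("$1"));
    }
}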

NOTE: All of Andrzej Doyle's points are valid. If your input contains more than simple <a href="X">Y</a> anchors, and you are sure that it is parsable HTML, then you are better off with an HTML parser.

To summarize:

The regex I posted doesn't work if there is an <a> inside a comment. (You can treat that as a special case.)
It doesn't work if there are other attributes in the <a> tag. (Again, you can treat that as a special case.)
There are many other cases where the regex won't work, and you cannot cover all of them with a regex, since HTML is not a regular language.

However, if your requirement is always to replace <a href="X">Y</a> with "X" without considering the context, then the code I've posted will work.
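The limitations listed above can be seen with a small variant of the regex. This is only a sketch: the class and method names are made up for illustration, the quotes are moved outside the capture group so that the bare URL is returned, and the test inputs are borrowed from elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
public class RegexHrefDemo {
    // Same idea as the pattern in the answer, but with the quotes outside
    // the capture group, so group(1) is the URL without quotes.
    private static final Pattern ANCHOR =
            Pattern.compile("<a href=\"([^\"]*)\"[^<]*</a>");

    /** Returns every href the pattern finds, in document order. */
    public static List<String> hrefs(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ANCHOR.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Simple anchor: prints [http://stackoverflow.com]
        System.out.println(hrefs("x<a href=\"http://stackoverflow.com\">SO</a>y"));

        // Another attribute before href: the link is missed, prints []
        System.out.println(hrefs("<a class=\"stripey\" href=\"http://stackoverflow.com\">SO</a>"));

        // Anchor inside an HTML comment: still reported, prints [http://ignoreme.com]
        System.out.println(hrefs("<!-- <a href=\"http://ignoreme.com\">...</a> -->"));
    }
}
```

Each special case can be patched into the pattern one at a time, but every patch makes the expression harder to read, which is the trade-off against an HTML parser.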

edited Jul 11, 2011 at 11:29

answered Jul 11, 2011 at 9:00

Op De Cirkel


11 Comments

Andrzej Doyle Over a year ago

-1: HTML is not a regular language. (Need I say more?)

Op De Cirkel Over a year ago

>>> (Need I say more?) Yes, give me a test case where the code fails to achieve what the question is asking.

Andrzej Doyle Over a year ago

Many, many inputs. <a class="stripey" href="http://stackoverflow.com">Take me...</a> will give a false negative.  will give a false positive. In both cases, using an HTML parser would extract the href attribute correctly (including not finding the element at all in the second case).

Op De Cirkel Over a year ago

I am sorry, but the question reads: "I have several anchor tags in a text". Where did you see HTML, so that you would use an HTML parser?

Andrzej Doyle Over a year ago

@Op De Cirkel - I've removed my -1 as this was overly harsh. But unless the OP can guarantee that this really is an arbitrary plaintext file which coincidentally looks like HTML (but isn't), I wouldn't recommend using regex. The HTML file might be simple now but there are so many legal ways it could be rewritten that would cause the code to break, that it's less headache to just parse it as HTML from the start. (Commenting out is the "killer use case" here - we're all used to doing it to have something temporarily ignored, so code that doesn't handle that is confusion waiting to happen.)


Jigar Joshi (edited by Bart Kiers) · Accepted Answer · 2011-07-11 09:39:01Z


You can use JSoup

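import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<p>An <a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first(); // first <a> element, or null if there is none

String linkHref = link.attr("href"); // "http://stackoverflow.com"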

Also See

Example

edited Jul 11, 2011 at 9:39

Bart Kiers


answered Jul 11, 2011 at 8:38

Jigar Joshi


3 Comments

Ebbu Abraham Over a year ago

Any way to do this in Java itself without using a 3rd party API ??

Andrzej Doyle Over a year ago

@Ebbu - Of course, you can always write an HTML parser yourself if you fancy. But if you want to extract data from HTML, you need an HTML parser (see my comment to Op's answer), so if you don't like reinventing wheels, in practice you should just pull in a third-party library. And you shouldn't be worried about that; the library support is one of Java's biggest advantages.

Ebbu Abraham Over a year ago

Thanks Andrzej, but right now this is my only requirement and I don't want to use a third-party API just for this. Otherwise I completely agree with what you said. This is my first experience with regex and I am having some difficulty in solving this.

Kristen Gillard · Accepted Answer · 2012-08-27 11:09:51Z

The above example works perfectly. If you want to parse an HTML document rather than concatenated strings, write something like this to complement the code above.

HtmlParser.java below is the existing HtmlParseDemo.java from above, modified to get its HTML from HtmlPage.java, which follows it. HtmlPage.java reads the URL to fetch from an HtmlPage.properties file.

The main.url property in the HtmlPage.properties file is: main.url=http://www.whatever.com/

That way you can just parse the URL that you're after. :-) Happy coding :-D

import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlParser
{
    public static void main(String[] args) throws Exception
    {
        String html = HtmlPage.getPage();

        Reader reader = new StringReader(html);
        HTMLEditorKit.Parser parser = new ParserDelegator();
        final List<String> links = new ArrayList<String>();

        parser.parse(reader, new HTMLEditorKit.ParserCallback()
        {
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
            {
                if (t == HTML.Tag.A)
                {
                    Object link = a.getAttribute(HTML.Attribute.HREF);
                    if (link != null)
                    {
                        links.add(String.valueOf(link));
                    }
                }
            }
        }, true);

        reader.close();

        // create the header
        System.out.println("<html>\n<head>\n   <title>Link City</title>\n</head>\n<body>");

        // spit out the links and create href
        for (String l : links)
        {
            System.out.print("   <a href=\"" + l + "\">" + l + "</a>\n");
        }

        // create footer
        System.out.println("</body>\n</html>");
    }
}

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ResourceBundle;

public class HtmlPage
{
    public static String getPage()
    {
        StringWriter sw = new StringWriter();
        // Loads HtmlPage.properties from the classpath.
        ResourceBundle bundle = ResourceBundle.getBundle(HtmlPage.class.getName());

        try
        {
            URL url = new URL(bundle.getString("main.url"));

            // A plain GET sends no request body, so setDoOutput(true) is not needed.
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");

            InputStream content = connection.getInputStream();

            // try-with-resources closes the reader and the underlying stream.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(content)))
            {
                String line;

                while ((line = in.readLine()) != null)
                {
                    sw.append(line).append("\n");
                }
            }

        } catch (Exception e)
        {
            e.printStackTrace();
        }

        return sw.toString();
    }
}

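For example, this will output links from http://ebay.com.au/ if viewed in a browser. This is a subset, as there are a lot of links:

<html>
<head>
   <title>Link City</title>
</head>
<body>
   <a href="#mainContent">#mainContent</a>
   <a href="http://realestate.ebay.com.au/">http://realestate.ebay.com.au/</a>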

pap · Accepted Answer · 2011-07-11 09:16:57Z

The most robust way (as has been suggested already) is to use regular expressions (java.util.regex), if you are required to build this without using third-party libraries.

The alternative is to parse the HTML as XML, either using a SAX parser to capture and handle each instance of an "a" element, or as a DOM Document that you then search using XPath (see http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/package-summary.html). This is problematic though, since it requires the HTML page to be fully XML-compliant in its markup. That is a very dangerous assumption, and not an approach I would recommend, since most "real" HTML pages are not XML-compliant.
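A minimal sketch of the DOM plus XPath variant, using only the JDK's javax.xml classes. The class and method names are made up for illustration, and the input is assumed to be well-formed markup with no XML namespace declared. Anything that is not well-formed (an unclosed <br>, an unquoted attribute) is rejected by the parser, which this sketch reports as an IllegalArgumentException.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; the names are illustrative only.
public class XPathHrefDemo {

    /** Parses well-formed markup as XML and returns the href of every <a> element that has one. */
    public static List<String> hrefs(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Select the href attribute nodes of all <a> elements anywhere in the tree.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList attrs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

            List<String> found = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                found.add(attrs.item(i).getNodeValue());
            }
            return found;
        } catch (SAXException e) {
            // Raised when the markup is not well-formed XML.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        } catch (ParserConfigurationException | IOException | XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<p><a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>"
                + "<!-- <a href=\"http://ignoreme.com\">...</a> -->"
                + "<a>NOOOoooo!</a></p>";
        // Commented-out anchors are not elements, so the XPath query skips them.
        System.out.println(hrefs(xhtml));
    }
}
```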

Still, I would recommend also looking at existing frameworks out there built for this purpose (like JSoup, also mentioned above). No need to reinvent the wheel.
