
I have a string containing HTML source code. I want to extract only the links from that string and put them into an ArrayList. In other words, I want the text that appears between the quotes in <a href="THE LINK I WANT">. I want to do this without using any external libraries. How can I do it with a simple algorithm using only the String class and loops? Thank you!

  • Why would you not want to use an HTML parsing library for this? Doing this properly without a library will be reinventing a hugely complicated wheel. Commented Mar 6, 2012 at 10:46
  • Because it is an assignment and my instructor wants me to do this with a simple algorithm. Is it simple? Commented Mar 6, 2012 at 10:49
  • It is not that complicated. You can search through the HTML for <a and then skip characters until you encounter href (or >, in which case there is no href and you have to start looking for the next <a). From there, you store the characters between the opening " and the next ", and that will be "THE_LINK_YOU_WANT". Commented Mar 6, 2012 at 10:59
  • @aphex: No, it isn't simple. HTML parsing isn't trivial. Any "simple" solution will break with non-trivial input such as <a title='href="' class='"'>. Commented Mar 6, 2012 at 11:06
  • @RoToRa actually it was simple. I found the answer. Even so, thanks for your effort Commented Mar 6, 2012 at 16:12
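The scan described in the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the asker's eventual solution: the class name `LinkScanner` and method `extractLinks` are made up for this example, tag and attribute names are assumed to be lowercase, there are no spaces around `=`, and only quoted href values are collected. Quoted values of other attributes are skipped, so the `<a title='href="' class='"'>` input mentioned above yields no link.

```java
import java.util.ArrayList;
import java.util.List;

class LinkScanner {

    // Returns the quoted href value of every <a ...> tag, in document order.
    // Assumptions: lowercase "<a" and "href", no spaces around '='.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        int i = 0;
        while ((i = html.indexOf("<a", i)) != -1) {
            i += 2;
            // "<a" must be followed by whitespace; otherwise it is <abbr>, <article>, ...
            if (i >= html.length() || !Character.isWhitespace(html.charAt(i))) {
                continue;
            }
            // Walk through the tag until it closes with '>'.
            while (i < html.length() && html.charAt(i) != '>') {
                char c = html.charAt(i);
                if (c == '"' || c == '\'') {
                    // Quoted value of some other attribute: jump past its closing quote.
                    int close = html.indexOf(c, i + 1);
                    i = (close == -1) ? html.length() : close + 1;
                } else if (html.startsWith("href=", i)
                        && Character.isWhitespace(html.charAt(i - 1))) {
                    int q = i + 5; // position of the opening quote, if there is one
                    if (q < html.length() && (html.charAt(q) == '"' || html.charAt(q) == '\'')) {
                        int close = html.indexOf(html.charAt(q), q + 1);
                        if (close != -1) {
                            links.add(html.substring(q + 1, close));
                        }
                        i = (close == -1) ? html.length() : close + 1;
                    } else {
                        i = q; // unquoted value: not collected in this sketch
                    }
                } else {
                    i++;
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> baz <a class=\"x\" href=\"link2\">qux</a>";
        System.out.println(extractLinks(html)); // prints [link1, link2]
    }
}
```

Even with the quote handling, this does not cover comments, scripts, uppercase tags or character entities, which is the point the other commenters are making.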

2 Answers

The Java Regex API is not the proper tool for this job. Use the efficient, secure and well-tested high-level tools mentioned in the other answers.

If your question is more about the Regex API than about a real-world problem (for learning purposes, for example), you can do it with the following code:

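// Requires java.util.regex.Pattern and java.util.regex.Matcher
String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
// Only matches tags written exactly as <a href='...'> (single quotes, no other attributes)
Pattern p = Pattern.compile("<a href='(.*?)'>");
Matcher m = p.matcher(html);
while (m.find()) {
    System.out.println(m.group(0)); // the whole matched tag
    System.out.println(m.group(1)); // only the link inside the quotes
}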

And the output is:

<a href='link1'>
link1
<a href='link2'>
link2

Please note that the lazy (reluctant) quantifier *? must be used so that each match stops at the first closing quote and covers a single tag. Group 0 is the entire match; group 1 is the first capturing group (the first pair of parentheses).
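As a sketch of how the same idea could fill a list instead of printing, here is a slightly more tolerant variant. The class name `RegexLinks` and the pattern are assumptions made for this illustration, not part of the code above: the pattern accepts either quote style, allows other attributes before href, and ignores case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLinks {

    // "<a", whitespace, any other attributes, then href = "value" or 'value'.
    // Group 1 is the opening quote, \1 requires the same closing quote,
    // and group 2 is the link itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> baz <a class=\"x\" href=\"link2\">qux</a> foo";
        System.out.println(extractLinks(html)); // prints [link1, link2]
    }
}
```

This is still pattern matching rather than parsing: for the `<a title='href="' class='"'>` input from the question's comments it reports the bogus link `' class='`.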


A note to consider:

Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.

Use an HTML Parser instead. See also What are the pros and cons of the leading Java HTML parsers?
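If external libraries are ruled out, one option worth knowing about is the HTML parser that ships with the JDK in the javax.swing.text.html packages. The following is a minimal sketch, not a recommendation over a dedicated parser: the class name `ParserLinks` is made up for this example, and the Swing parser targets an older HTML dialect, so results on modern pages may vary.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ParserLinks {

    // Collects the href attribute of every <a> start tag reported by the parser.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // The last argument tells the parser to ignore charset declarations in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> baz <a href=\"link2\">qux</a> foo";
        System.out.println(extractLinks(html));
    }
}
```

The callback approach leaves tokenizing, quoting and attribute handling to the parser, which is the part that hand-written string scanning tends to get wrong.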

4 Comments

As I stated in my question, I don't want to use any external libraries. I found the answer. Even so, thanks for your answer.
Your method, as you stated in your answer, is just a workaround, not a proper method! You can at least use regex to solve your problem (and it's not an external library).
Actually, it doesn't need to be proper, because I just want a simple algorithm. I've solved it, though :D
It's your call! But if you showed my answer to your instructor, he would definitely be surprised and happy ;)

I've found the answer!!!!!

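// Assumes two fields of the enclosing class that the post does not show:
// String url (the HTML source) and ArrayList<String> links.
public ArrayList<String> getLinks() {

    String link = "";

    for (int i = 0; i < url.length() - 6; i++) {
        // Match the whole "href=" prefix; checking only 'h' and 'r' also hits <hr> and words like "three"
        if (url.startsWith("href=", i)) {
            for (int k = i; k < url.length(); k++) {
                if (url.charAt(k) == '>') {
                    // Skip href=" (6 characters) and drop the closing quote before '>'.
                    // This only works when href is the last attribute in the tag.
                    if (k - 1 >= i + 6) {
                        link = url.substring(i + 6, k - 1);
                        links.add(link);
                    }
                    break;
                }
            }
        }
    }
    return links;
}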
