4

I'm looking for a regular expression that extracts the text between HTML tags of different types.

For example:

<span>Span 1</span> - Output: Span 1

<div onclick="callMe()">Span 2</div> - Output: Span 2

<a href="#">HyperText</a> - Output: HyperText

I found this pattern, <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>, from here, but it is not working.

11
  • 1
    Please state exactly how it is not working. Commented Mar 28, 2013 at 15:19
  • 3
    I would like to refer you to the legendary top answer of this question: stackoverflow.com/questions/1732348/… Commented Mar 28, 2013 at 15:20
  • Your best bet is to use an HTML parser, something like jsoup.org. Commented Mar 28, 2013 at 15:21
  • 1
    @Sriram The exact answer is: Don't use regular expressions to parse HTML, in case that wasn't obvious enough. Commented Mar 28, 2013 at 15:28
  • 1
    Use "<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>". Commented Mar 28, 2013 at 15:50

4 Answers 4

10

Your comment shows that you have neglected to escape the backslashes in your regex string.

If you also want to match lowercase letters, add a-z to the character classes, use Pattern.CASE_INSENSITIVE, or add (?i) to the beginning of the regex:

"<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>"

If the tag contents may contain newlines, then use Pattern.DOTALL or add (?s) to the beginning of the regex to turn on dotall/singleline mode.
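As a minimal sketch of the points above, the following applies the pattern with both flags set instead of widening the character classes. The class name TagTextExtractor and the method extract are illustrative names, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagTextExtractor {
    // Backslashes are doubled because this is a Java string literal.
    // CASE_INSENSITIVE covers lowercase tag names; DOTALL lets '.' match newlines.
    private static final Pattern TAG = Pattern.compile(
        "<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: returns the text of every top-level tag pair found.
    public static List<String> extract(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            texts.add(m.group(2)); // group 1 is the tag name, group 2 the content
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<span>Span 1</span>\n"
            + "<div onclick=\"callMe()\">Span 2</div>\n"
            + "<a href=\"#\">HyperText</a>";
        System.out.println(extract(html)); // prints [Span 1, Span 2, HyperText]
    }
}
```

This only handles flat, well-formed tag pairs like the three examples in the question; nested or malformed markup is outside what the pattern can match reliably.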


3 Comments

Thanks for this. Yes, I forgot to escape the backslashes in the expression. I'm also looking for a way to make the expression recurse through nested HTML tags and ultimately return the text inside them. Ex: <span><strong>test</strong></span> I hope I'm clear this time.
@Sriram. To get the inner tags you would have to use the above regex in a loop, but I think it would be better to ask a new question for that.
I am unable to retrieve the content between the tags below: <h1><h1>Ajay has no watch</h1></h1><par>So wait for a while to get time </par> Please suggest a solution.
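The "regex in a loop" idea from the comments could look roughly like the sketch below, which re-applies the pattern to each captured group until no tag pair remains. The names NestedTagText, collect, and innerText are hypothetical, and the sketch ignores text that sits beside child tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTagText {
    private static final Pattern TAG = Pattern.compile(
        "<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>", Pattern.DOTALL);

    // Hypothetical helper: descends into each match until the content
    // contains no further tag pair, then records that content as text.
    static void collect(String html, List<String> out) {
        Matcher m = TAG.matcher(html);
        boolean found = false;
        while (m.find()) {
            found = true;
            collect(m.group(2), out);
        }
        if (!found && !html.isEmpty()) {
            out.add(html);
        }
    }

    public static List<String> innerText(String html) {
        List<String> out = new ArrayList<>();
        collect(html, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(innerText("<span><strong>test</strong></span>")); // prints [test]
    }
}
```

Same-named nested tags such as <h1><h1>Ajay has no watch</h1></h1> still defeat this approach: the lazy match pairs the outer <h1> with the first </h1>, so the inner opening tag is left in the captured text. That limitation is one reason the comments on the question recommend an HTML parser.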
1

This should suit your needs:

<([a-zA-Z]+).*?>(.*?)</\\1>

The first group contains the tag name; the second contains the text in between.

1 Comment

If the input contains multiple tags, this regular expression does not work.
1
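import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagTextDemo
{
    public static void main(String[] args)
    {
        // [^>]* keeps the attribute match inside the opening tag, and the lazy
        // (.*?) stops at the first closing tag that matches group 1.
        Matcher matcher = Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>(.*?)</\\1>")
            .matcher("<a href=\"#\">HyperText</a>");

        while (matcher.find())
        {
            // Group 2 is the text between the opening and closing tags.
            String matched = matcher.group(2);

            // start() and end() are the offsets of the whole match, tags included.
            System.out.println(matched + " found at "
                + "\n"
                + "start at :- " + matcher.start()
                + "\n"
                + "end at :- " + matcher.end()
                + "\n");
        }
    }
}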


-1

A very specific way:

(<span>|<a href="#">|<div onclick="callMe\(\)">)(.*)(</span>|</a>|</div>)

but yeah, this will only work for those 3 examples. You'll need to use an HTML parser.

1 Comment

The input could contain any HTML tag; I can't say in advance which ones.
