
I'm working on a regular expression to extract the tag name and attributes from an HTML element, but I'm having trouble matching the attributes: only the last attribute is stored in the matches array.

Here is the code:

<?php
    $subject = '<font face="arial" size="1" color="red">hello world!</font>';
    $find = '/<(?P<tag>\w+)\s+((?P<attr>\w+)=(?P<value>[^\s"\'>]+|"[^"]*"|\'[^\']*\')\s*)*\/?>/si';

    preg_match_all( $find, $subject, $matches );
?>

Can someone help me out?

Many thanks

  • Drop that and use XPath instead. Commented Jul 12, 2010 at 15:55
  • You can't reliably parse HTML with regular expressions. See the awesome rant on this subject here: stackoverflow.com/questions/1732348/… Commented Jul 12, 2010 at 15:57
  • But what if I want to convert HTML to XHTML? I read that XPath is XHTML compatible. Commented Jul 12, 2010 at 16:11

1 Answer

Some important points:

  • You shouldn't use regex to parse HTML. PHP has many excellent HTML parsing libraries.
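  • A capturing group that repeats within a single match keeps only its last capture, which is why only the last attribute ends up in the matches array.
    • One notable exception is .NET's regex engine, which keeps every capture of a repeated group.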
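
As a rough illustration of the first point, here is a minimal sketch that reads the tag name and all attributes with PHP's bundled DOM extension instead of a regex. It assumes the DOM extension is enabled and that the element of interest is the first `font` element; the variable names `$doc`, `$element`, and `$attributes` are only illustrative.

```php
<?php
// Sketch: collect the tag name and every attribute using DOMDocument.
$subject = '<font face="arial" size="1" color="red">hello world!</font>';

// Collect libxml warnings internally instead of emitting them for
// markup the parser considers non-standard.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($subject);   // the fragment gets wrapped in <html><body>
libxml_clear_errors();

// Assumption: we want the first <font> element in the fragment.
$element = $doc->getElementsByTagName('font')->item(0);

$attributes = [];
foreach ($element->attributes as $attr) {
    // Each $attr is a DOMAttr with a name and a value.
    $attributes[$attr->name] = $attr->value;
}

echo $element->tagName, "\n";
print_r($attributes);
```

Because the parser does the tokenizing, quoting styles and attribute order need no special handling here.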


1 Comment

This is the better read: regular-expressions.info/captureall.html - Capturing a repeated group vs repeating a capturing group.
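
If a regex is still wanted for simple, well-formed input, one way to apply the "capture a repeated group" idea in PHP is a two-pass sketch: an outer group captures the whole attribute string, then `preg_match_all` walks it one attribute at a time. The patterns and the names `$tagPattern`, `$attrPattern`, and `attrs` are illustrative assumptions, not a general HTML parser.

```php
<?php
$subject = '<font face="arial" size="1" color="red">hello world!</font>';

// Pass 1: the repeated part sits inside a non-capturing group, and the
// named group "attrs" wraps the whole repetition, so nothing is overwritten.
$tagPattern = '/<(?P<tag>\w+)(?P<attrs>(?:\s+\w+=(?:"[^"]*"|\'[^\']*\'|[^\s"\'>]+))*)\s*\/?>/i';

// Pass 2: match a single name=value pair; preg_match_all finds each one.
$attrPattern = '/(?P<attr>\w+)=(?P<value>"[^"]*"|\'[^\']*\'|[^\s"\'>]+)/';

$attributes = [];
if (preg_match($tagPattern, $subject, $tagMatch)) {
    preg_match_all($attrPattern, $tagMatch['attrs'], $attrMatches, PREG_SET_ORDER);
    foreach ($attrMatches as $m) {
        // Strip the surrounding quotes, if any, from the value.
        $attributes[$m['attr']] = trim($m['value'], '"\'');
    }
}

echo $tagMatch['tag'], "\n";
print_r($attributes);
```

This only handles the first opening tag in `$subject`; looping over several tags would need `preg_match_all` in the first pass as well.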
