How to Parse an HTML STYLE Attribute with Regex?

Question

What is the proper Regex construction (.NET flavor) to extract the attribute/value pairs from an HTML style string, while ignoring HTML entities?

margin-top:0pt;margin:0;color:#000000;margin-left:0;font-size:26pt;margin-bottom:3pt;line-height:1.15;page-break-after:avoid;font-family:&quot;Arial&quot;;orphans:2;widows:2;text-align:left;margin-right:0

Splitting on ; and then on : would be simplest, but because HTML entities contain semicolons, this breaks on some strings. For example, entities can appear in the font-family value:

font-family:&quot;Arial&quot;;
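
To make the failure concrete, here is a minimal sketch of the naive split. It is written in Python rather than .NET purely to keep it short; the shortened sample string is an assumption based on the style string above.

```python
# Naive approach: split declarations on ';' (name/value on ':' would follow).
style = "font-family:&quot;Arial&quot;;orphans:2"

pieces = style.split(";")
print(pieces)
# ['font-family:&quot', 'Arial&quot', '', 'orphans:2']
# The ';' that terminates each &quot; entity is treated as a declaration
# separator, so the font-family value is cut apart.
```

Any approach that treats every semicolon as a separator has this problem, which is why the entity has to be recognized as a unit.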

The style string is isolated (no style="), and single-line.

Ultimately I'll be regex-grouping them in this arrangement:

match:( 
    group:( style-attribute-name ) 
    group:( style-attribute-value ) 
    )

I'll then iterate through the groups to create a dictionary (with duplicate keys getting replaced).

My current Regex looks like this:

\s*(?<attr>[^:\s]*)\s*:\s*(?<val>[^;]*)[;]\s*

It produces mismatches when it hits the HTML entities.
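
The mismatch can be reproduced with a short sketch. Python's `re` module is used here as a stand-in for the .NET engine (an assumption made for brevity); the only change to the pattern is the named-group syntax, `(?P<name>...)` instead of `(?<name>...)`. The sample string is a shortened version of the one above.

```python
import re

# The pattern from the question, with Python's (?P<name>...) group syntax.
CURRENT = re.compile(r"\s*(?P<attr>[^:\s]*)\s*:\s*(?P<val>[^;]*)[;]\s*")

style = "font-family:&quot;Arial&quot;;orphans:2;widows:2"
pairs = [(m.group("attr"), m.group("val")) for m in CURRENT.finditer(style)]
print(pairs)
# [('font-family', '&quot'), ('Arial&quot;;orphans', '2')]
# 1. The value stops at the semicolon that ends the first entity.
# 2. The rest of the entity leaks into the next attribute name.
# 3. 'widows:2' is skipped because the pattern requires a trailing ';'.
```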

@ThomasMoors He does not want to parse HTML here... Just a list of attributes. Don't link this comment every single time "HTML" and "regex" are in the same sentence.
– Gawil, Commented Aug 9, 2017 at 13:05
As far as I know, all HTML entities begin with & and end with ;, am I wrong? We could use that.
– Gawil, Commented Aug 9, 2017 at 13:11
Thanks @Gawil - correct, however the style string delimiter is also ;. I'm pretty familiar with doing basic regex, but I'm not sure how to define a sort of sub-pattern that ignores entities and handles them as style-value content.
– Memetican, Commented Aug 9, 2017 at 13:51

Gawil · Accepted Answer · 2017-08-09 13:34:16Z

1

I updated your regex, using balancing groups to skip ; when it is preceded by &.

Here is the regex:
(?<attr>[^:\s]*)\s*:\s*(?<val>(?:[^;&]*(?<html>&)?[^;&]*(?(html);(?<-html>)))+)(?:;|$)


Note: I have mostly replaced [^;]* with (?:[^;&]*(?<html>&)?[^;&]*(?(html);(?<-html>)))+ in the group val from your regex.
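
For comparison, the same goal can be reached without balancing groups by matching a whole entity as one alternative inside the value. The sketch below is not the pattern from this answer; it is an alternative technique, written in Python (with `(?P<name>...)` groups) to keep it short. `ENTITY`, `DECLARATION`, and `parse_style` are illustrative names, and the entity forms accepted (named, decimal, hex) are an assumption about the input.

```python
import re

# An entity is '&', then a name or a numeric reference, then ';'.
ENTITY = r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);"

# A value is a run of whole entities or single non-';' characters.
DECLARATION = re.compile(
    r"\s*(?P<attr>[^:;\s]+)\s*:\s*(?P<val>(?:" + ENTITY + r"|[^;])*)"
)

def parse_style(style: str) -> dict:
    """Return name -> value; a later duplicate name replaces an earlier one."""
    result = {}
    for m in DECLARATION.finditer(style):
        result[m.group("attr")] = m.group("val").strip()
    return result

styles = parse_style("margin:0;font-family:&quot;Arial&quot;;margin:1pt;orphans:2")
print(styles)
# {'margin': '1pt', 'font-family': '&quot;Arial&quot;', 'orphans': '2'}
```

Because the entity alternative is tried before the single-character one, the semicolon that closes `&quot;` is consumed as part of the value and never seen as a separator. The same idea should translate to .NET by switching to `(?<attr>...)` group syntax, though that translation is not tested here.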

edited Aug 9, 2017 at 13:34

answered Aug 9, 2017 at 13:22


2 Comments

Memetican Over a year ago

Exactly what I needed. I don't think I have seen a balancing group construction before. Thanks for the tip!

Gawil Over a year ago

@Memetican A pleasure! Yeah, balancing groups are a feature specific to the .NET flavour. They can be seen as a stack. Read this if you're interested, it's very helpful: regular-expressions.info/balancing.html

Dominic Mazur · Answer · 2017-08-09 12:48:33Z

0

http://www.regextester.com https://www.mikesdotnetting.com/article/46/c-regular-expressions-cheat-sheet

These helped me when I was screwing around with regex in school, not near my computer rn so I can't easily write it for ya :/

Hope it helped!

answered Aug 9, 2017 at 12:48


1 Comment

Memetican Over a year ago

Thanks Dom, good stuff. I've built a good regex tester that lets me beat up the .NET variant quite well. The part I don't know is how to get it to identify HTML entities, and simply absorb them into the VAL group without getting killed by the semicolon.
