
What is the proper Regex construction (.NET flavor) to extract the attribute/value pairs from an HTML style string, while ignoring HTML entities?

margin-top:0pt;margin:0;color:#000000;margin-left:0;font-size:26pt;margin-bottom:3pt;line-height:1.15;page-break-after:avoid;font-family:&quot;Arial&quot;;orphans:2;widows:2;text-align:left;margin-right:0

Splitting on ; and then on : would be simplest, but because HTML entities contain semicolons, this breaks on some strings. For example, entities can appear in the font-family style attribute:

font-family:&quot;Arial&quot;;

The style string is isolated (no style="), and single-line.

Ultimately I'll be regex-grouping them in this arrangement:

match:( 
    group:( style-attribute-name ) 
    group:( style-attribute-value ) 
    )

I'll then iterate through the groups to create a dictionary (with duplicate keys getting replaced).

My current regex looks like this:

\s*(?<attr>[^:\s]*)\s*:\s*(?<val>[^;]*)[;]\s*

It produces mismatches when it hits the HTML entities.
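A minimal sketch that reproduces the mismatch. The `Repro` class name and the shortened sample string are only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class Repro
{
    public static void Main()
    {
        // The pattern from the question, unchanged.
        var regex = new Regex(@"\s*(?<attr>[^:\s]*)\s*:\s*(?<val>[^;]*)[;]\s*");

        // Shortened sample: one value containing entities, then a plain pair.
        var style = "font-family:&quot;Arial&quot;;orphans:2;";

        foreach (Match m in regex.Matches(style))
        {
            Console.WriteLine("[{0}] => [{1}]",
                m.Groups["attr"].Value, m.Groups["val"].Value);
        }
        // Prints:
        // [font-family] => [&quot]
        // [Arial&quot;orphans] => [2]
    }
}
```

The value is cut at the semicolon that closes the first entity, and the rest of the value leaks into the next attribute name.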


  • @ThomasMoors He does not want to parse HTML here... Just a list of attributes. Don't link this comment every single time "HTML" and "regex" are in the same sentence. Commented Aug 9, 2017 at 13:05
  • As far as I know, all HTML entities begin with & and end with ;, am I wrong? We could use that. Commented Aug 9, 2017 at 13:11
  • Thanks @Gawil - correct, however the style string delimiter is also ;. I'm pretty familiar with doing basic regex, but I'm not sure how to define a sort of sub-pattern that ignores entities and handles them as style-value content. Commented Aug 9, 2017 at 13:51

2 Answers


I updated your regex, using a balancing group so that a ; is kept as part of the value when it closes an entity opened by &.

Here is the regex:
(?<attr>[^:\s]*)\s*:\s*(?<val>(?:[^;&]*(?<html>&)?[^;&]*(?(html);(?<-html>)))+)(?:;|$)


Note: the main change is that [^;]* in the val group of your regex is replaced by (?:[^;&]*(?<html>&)?[^;&]*(?(html);(?<-html>)))+.
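To show how the pattern could feed the dictionary the question asks for, here is a minimal C# sketch. `StyleParser` and `Parse` are hypothetical names, not part of any library, and skipping empty attribute names and trimming values are assumptions on my part rather than requirements from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StyleParser
{
    // Pattern copied from this answer. The "html" group acts as a one-item
    // stack: pushed when & is seen, popped when the closing ; is consumed.
    private static readonly Regex PairRegex = new Regex(
        @"(?<attr>[^:\s]*)\s*:\s*(?<val>(?:[^;&]*(?<html>&)?[^;&]*(?(html);(?<-html>)))+)(?:;|$)");

    // Hypothetical helper: builds a name/value dictionary from a style string.
    public static Dictionary<string, string> Parse(string style)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in PairRegex.Matches(style))
        {
            string name = m.Groups["attr"].Value;
            if (name.Length == 0) continue; // assumption: ignore empty names

            // Indexer assignment lets a later duplicate replace the earlier value.
            result[name] = m.Groups["val"].Value.Trim();
        }
        return result;
    }

    public static void Main()
    {
        var styles = Parse("margin:0;font-family:&quot;Arial&quot;;orphans:2;margin:1pt");
        Console.WriteLine(styles["font-family"]); // &quot;Arial&quot;
        Console.WriteLine(styles["margin"]);      // 1pt
    }
}
```

One caveat with this pattern: any & is treated as the start of an entity, so a bare & inside a value would absorb the next ; into that value.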


2 Comments

Exactly what I needed. I don't think I have seen a balancing group construction before. Thanks for the tip!
@Memetican A pleasure! Yes, balancing groups are a feature specific to the .NET flavour. They can be thought of as a stack. Read this if you're interested, it's very helpful: regular-expressions.info/balancing.html

http://www.regextester.com https://www.mikesdotnetting.com/article/46/c-regular-expressions-cheat-sheet

These helped me when I was screwing around with regex in school, not near my computer rn so I can't easily write it for ya :/

Hope it helped!

1 Comment

Thanks Dom, good stuff. I've built a good regex tester that lets me beat up the .NET variant quite well. The part I don't know is how to get it to identify HTML entities, and simply absorb them into the VAL group without getting killed by the semicolon.
