Regex expression C# for HTML

Question

I have the following regex:

^(<span style=.*?font-weight:bold.*?>.*?</span>)

It matches the following code:

<span style="font-family:Arial; font-size:10pt"> r.</span></p><p style="margin:0pt"><span style="font-family:Arial; font-size:10pt; font-weight:bold">&#xa0;</span>

But I would like to match only this part (the last span, whose style contains font-weight:bold):

<span style="font-family:Arial; font-size:10pt; font-weight:bold">&#xa0;</span>

You can't parse XHTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML...
– dav_i, Commented Jul 30, 2013 at 13:54
Do not try to parse HTML with regular expressions. Go get the Html Agility Pack.
– Jim Mischel, Commented Jul 30, 2013 at 13:55
Guys! Kamil didn't ask whether parsing HTML using Regex is a good idea. He asked a nice and specific question about how to have his regex match a different part of the provided string. The fact that his string happens to look like HTML is completely irrelevant for this question. No need for the HTML-Regex-kneejerk-reflex...
– Mels, Commented Jul 30, 2013 at 13:57
@Mels - No, Kamil is about to shoot himself in the foot and various other body parts. We cannot, through inaction, allow a human being to come to harm.
– Corak, Commented Jul 30, 2013 at 14:07
@Mels The fact that his string happens to look like HTML is completely relevant as it shines light on the classic XY problem happening here. The OP is asking how to make his "solution" work, when he's clearly using the wrong tools for the job. When he comes back an hour later with another question about matching something else, it'll only add to the pollution on SO.
– Dan Lugg, Commented Jul 30, 2013 at 14:11

Sergey Berezovskiy (edited by carla) · Accepted Answer · 2017-11-24 16:51:36Z

7

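Use HTML Agility Pack to parse HTML:

using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);

// GetAttributeValue returns "" for spans without a style attribute,
// where s.Attributes["style"].Value would throw.
// Note: SelectNodes returns null when no node matches the XPath.
var boldSpans = from s in doc.DocumentNode.SelectNodes("//span")
                let style = s.GetAttributeValue("style", "")
                where style.Contains("font-weight:bold")
                select s;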

Or, even better, use XPath to select all matching nodes in one line:

doc.DocumentNode.SelectNodes("//span[contains(@style, 'font-weight:bold')]")

edited Nov 24, 2017 at 16:51 by carla

answered Jul 30, 2013 at 13:59 by Sergey Berezovskiy


3 Comments

dav_i Over a year ago

I actually prefer the first - it's easier to read in my opinion.

Sergey Berezovskiy Over a year ago

@dav_i that's why I left both options :)

Kamil Over a year ago

Thanks!! I have HTML generated by an external library, so I assumed that the structure (way of creation) of the HTML would be constant. Anyway, HTML Agility Pack is the better option :)

Robert Fricke · Accepted Answer · 2013-07-30 14:09:41Z

Don't use ^ since the line doesn't start with the span you want to match.

<span style=["'][^'"]*font-weight:bold[^'"]*['"]>[^<]*</span>

Or as an escaped C# string:

"<span style=[\"'][^'\"]*font-weight:bold[^'\"]*['\"]>[^<]*</span>"

This matches strings starting with <span style= followed by a single or double quote (' or "). Then [^'"]* allows any characters except a closing quote.

Next comes the literal string font-weight:bold, followed again by any number of characters except a quote, leading up to the real closing quote and the end of the opening tag: [^'"]*['"]>.

The span may contain any number of characters except the start of a tag (<), and the match then has to end with the closing </span> tag: [^<]*</span>.

(Note that you may or may not want to allow more attributes before and after the style attribute. In that case you need to alter the regex.)
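As a quick check, the pattern above can be run against the sample input from the question. This is only a minimal sketch: the class name BoldSpanFinder and the method FindBoldSpans are made up for illustration, and the pattern is written as a verbatim string (where "" stands for one quote) rather than the escaped form shown earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BoldSpanFinder
{
    // Pattern from the answer, as a C# verbatim string ("" is an escaped quote).
    private const string Pattern =
        @"<span style=[""'][^'""]*font-weight:bold[^'""]*['""]>[^<]*</span>";

    // Hypothetical helper: returns the text of every span the pattern matches.
    public static List<string> FindBoldSpans(string html)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        // Sample input taken from the question.
        string html =
            "<span style=\"font-family:Arial; font-size:10pt\"> r.</span></p>" +
            "<p style=\"margin:0pt\">" +
            "<span style=\"font-family:Arial; font-size:10pt; font-weight:bold\">&#xa0;</span>";

        // Prints only the second span, the one whose style contains font-weight:bold.
        foreach (string span in FindBoldSpans(html))
        {
            Console.WriteLine(span);
        }
    }
}
```

Because [^'"]* cannot cross the closing quote of the first span's style attribute, the attempt that starts at the first span fails and the only match is the bold span.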

Daniel van Dommele · Accepted Answer · 2013-07-30 13:54:56Z

0

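Remove the ^, because it anchors the match to the beginning of the input (or of a line, with RegexOptions.Multiline), so the match always starts at the first span.

Removing it is not enough on its own, though. Since .*? can match any characters at all, the match still starts at the first <span style= and runs on until it finds font-weight:bold inside the second span. To get only the span you're after, also restrict what the wildcards may match, for example [^>]*? inside the tag and [^<]*? for the content.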
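To see what removing ^ alone does on the question's sample input, here is a minimal sketch (the class name UnanchoredCheck and the method Run are illustrative, not from the original post). It runs the question's pattern without the anchor and reports how many matches there are and whether the first one covers the whole input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnanchoredCheck
{
    // The question's pattern with the leading ^ removed.
    public const string Unanchored = "(<span style=.*?font-weight:bold.*?>.*?</span>)";

    // Hypothetical helper: all matches of the unanchored pattern.
    public static MatchCollection Run(string html)
    {
        return Regex.Matches(html, Unanchored);
    }

    public static void Main()
    {
        // Sample input taken from the question.
        string html =
            "<span style=\"font-family:Arial; font-size:10pt\"> r.</span></p>" +
            "<p style=\"margin:0pt\">" +
            "<span style=\"font-family:Arial; font-size:10pt; font-weight:bold\">&#xa0;</span>";

        MatchCollection matches = Run(html);
        Console.WriteLine(matches.Count);            // prints 1
        Console.WriteLine(matches[0].Value == html); // prints True
    }
}
```

The engine finds the leftmost match first, and the lazy .*? simply keeps expanding past the first span's closing tag until font-weight:bold appears, so the single match spans both spans.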

Furthermore, tools such as RegexBuddy are good for testing regexes.

answered Jul 30, 2013 at 13:54

Daniel van Dommele

