RegEx matching HTML tags and extracting text

Question

I have a string of text like this:

<customtag>hey</customtag>

I want to use a RegEx to modify the text between the "customtag" tags so that it might look like this:

<customtag>hey, this is changed!</customtag>

I know that I can use a MatchEvaluator to modify the text, but I'm unsure of the proper RegEx syntax to use. Any help would be much appreciated.

the best answer to this question to date.

– Scott Chamberlain, Commented Feb 18, 2012 at 0:37

Tjofras · Accepted Answer · 2008-11-18 20:10:19Z

15

I wouldn't use regex either for this, but if you must this expression should work: <customtag>(.+?)</customtag>
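A minimal sketch of how that expression could drive the replacement, written in JavaScript (the language of the one snippet elsewhere on this page). The replacement callback plays the same role as a .NET MatchEvaluator passed to Regex.Replace. The helper name appendToCustomTag and the sample suffix are made up for illustration, and the sketch assumes the element never contains nested tags, a literal "<", or line breaks ("." does not match newlines by default).

```javascript
// Hypothetical helper: appends a suffix to the text inside each <customtag> element.
// Assumes the content has no nested tags and no line breaks.
function appendToCustomTag(input, suffix) {
  const pattern = /<customtag>(.+?)<\/customtag>/g;
  // The callback receives the whole match and the first capture group,
  // and returns the text that replaces the match.
  return input.replace(pattern, (match, inner) => `<customtag>${inner}${suffix}</customtag>`);
}

const changed = appendToCustomTag("<customtag>hey</customtag>", ", this is changed!");
console.log(changed); // <customtag>hey, this is changed!</customtag>
```

Rebuilding the tags inside the callback keeps the pattern simple; the capture group holds only the text between them.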

4 Comments

Jon Tackabury Over a year ago

Thanks - this worked perfectly. Normally I wouldn't use RegEx to parse HTML like this, but this HTML is from an internal system and is properly formed.

Tom Leys Over a year ago

As a warning to others: it wouldn't work on the properly formed syntax |<customtag><customtag>Some text</customtag>|</customtag> - the area between the pipe symbols is matched, so the second <customtag> would be deleted leaving badly formed XML.

Tjofras Over a year ago

Yeah, and this is why you should not try to parse XML with regex. You could limit what can go in between the tags and allow only letters, numbers, and spaces, and it would work a little better. But then it's restricted to a specific domain, so something like this: <customtag>([a-zA-Z0-9 ]+)</customtag>

Bill Karwin Over a year ago

Or just <customtag>([^<]+)</customtag>. But yes HTML is not a regular language, so in the more general case you can't use regular expressions to match it. It's the same problem as using a regexp to match balanced parentheses.
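The nesting problem raised in these comments can be reproduced with a short sketch. It is JavaScript, the replacement text "new" is arbitrary, and the variable names are made up for illustration. It compares the lazy pattern from the answer with the negated character class suggested above on the nested input from the warning.

```javascript
const nested = "<customtag><customtag>Some text</customtag></customtag>";

// Lazy pattern from the answer: the match runs from the outer opening tag
// to the first closing tag, so the inner opening tag is swallowed.
const lazy = /<customtag>(.+?)<\/customtag>/;
const lazyResult = nested.replace(lazy, "<customtag>new</customtag>");
console.log(lazyResult); // <customtag>new</customtag></customtag>

// Negated character class: the content may not contain "<",
// so only the innermost element can match.
const strict = /<customtag>([^<]+)<\/customtag>/;
const strictResult = nested.replace(strict, "<customtag>new</customtag>");
console.log(strictResult); // <customtag><customtag>new</customtag></customtag>
```

The second pattern keeps the markup well formed here, but only because it refuses to match the outer element at all.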

Bill Karwin · Answer · 2009-10-13 17:42:18Z

I'd chew my own leg off before using a regular expression to parse and alter HTML.

Use XSL or DOM.

Two comments have asked me to clarify. The regular expression substitution works in the specific case in the OP's question, but in general regular expressions are not a good solution. Regular expressions match regular languages, i.e. sets of strings that a finite state machine can accept. HTML can contain tags nested to any arbitrary depth, and tracking that depth takes unbounded memory, so it's not a regular language.
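To make the nesting argument concrete, here is a rough JavaScript sketch of the bookkeeping a finite state machine cannot do: counting open tags to find the closing tag that balances a given opening tag. The function name findBalancedEnd is hypothetical, and this is only an illustration of the counter, not a substitute for a real parser (it ignores attributes, comments, and every other tag).

```javascript
// Hypothetical helper: returns the index just past the closing tag that balances
// the opening <customtag> at position `start`, or -1 if there is none.
function findBalancedEnd(input, start) {
  const open = "<customtag>";
  const close = "</customtag>";
  if (!input.startsWith(open, start)) return -1;
  let depth = 0;
  let i = start;
  while (i < input.length) {
    if (input.startsWith(open, i)) {
      depth += 1; // unbounded counter: the state a finite automaton lacks
      i += open.length;
    } else if (input.startsWith(close, i)) {
      depth -= 1;
      i += close.length;
      if (depth === 0) return i; // balanced with the tag at `start`
    } else {
      i += 1;
    }
  }
  return -1; // ran out of input before the tags balanced
}

const doc = "<customtag><customtag>Some text</customtag></customtag>";
const end = findBalancedEnd(doc, 0);
console.log(doc.slice(0, end) === doc); // true
```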

What does this have to do with the question? Using a regular expression for the OP's question as it is written works, but what if the content between the <customtag> tags contains other tags? What if a literal < character occurs in the text? It has been 11 months since Jon Tackabury asked the question, and I'd guess that in that time, the complexity of his problem may have increased.

Regular expressions are great tools and I do use them all the time. But using them in lieu of a real parser for input that needs one is going to work in only very simple cases. It's practically inevitable that these cases grow beyond what regular expressions can handle. When that happens, you'll be tempted to write a more complex regular expression, but these quickly become very laborious to develop and debug. Be ready to scrap the regular expression solution when the parsing requirements expand.

XSL and DOM are two standard technologies designed to work with XML or XHTML markup. Both technologies know how to parse structured markup files, keep track of nested tags, and allow you to transform tags, attributes, or content.

Here are a couple of articles on how to use XSL with C#:

Here are a couple of articles on how to use DOM with C#:

Here's a .NET library that assists DOM and XSL operations on HTML:

http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack

3 Comments

Well, I use them occasionally, in controlled environments, with machine-generated code that is known to be consistent, for a quick job...

Then why don't you show us how to do it with XSL or DOM in C#? It's easy to make sweeping statements. Let's see the actual code. Regexes are not suitable for parsing general HTML, but they're perfectly suitable for doing specific things with specific HTML code.

I admit you made me laugh, but let's have an explanation, or a link to a good explanation of why you'd rather chew your leg off. I guess it's really obvious why to some programmers, but maybe not to the novice?

Jan Goyvaerts · Answer · 2008-11-19 07:29:10Z

1

If there won't be any other tags between the two tags, this regex is a little safer, and more efficient:

<customtag>[^<>]*</customtag>
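A small JavaScript sketch of how this pattern behaves, assuming a capture group is added around the character class so the inner text can be reused in the replacement (the pattern as posted has no group). The shout function and the uppercase transformation are hypothetical placeholders for whatever change is needed.

```javascript
// The posted pattern with a capture group added around the content.
const safer = /<customtag>([^<>]*)<\/customtag>/g;

// Hypothetical transformation: uppercase the text between the tags.
function shout(input) {
  return input.replace(safer, (match, inner) => `<customtag>${inner.toUpperCase()}</customtag>`);
}

console.log(shout("<customtag>hey</customtag>"));             // <customtag>HEY</customtag>
console.log(shout("<customtag></customtag>"));                // <customtag></customtag>
console.log(shout("<customtag>a <b>bold</b> b</customtag>")); // unchanged: the inner tag blocks the match
```

Because of the "*" quantifier the pattern also accepts an empty element, and because neither "<" nor ">" is allowed in the content, an element that contains any other tag is left untouched rather than mangled.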

Jake Drew · Answer · 2012-02-18 00:15:54Z

0

Most people use HTML Agility Pack for HTML text parsing. However, I find it more heavyweight and complicated than my own needs call for. Instead, I create a web browser control in memory, load the page, and copy the text from it (see the examples linked below).

You can find 3 simple examples here:

http://jakemdrew.wordpress.com/2012/02/03/getting-only-the-text-displayed-on-a-webpage-using-c/

sajoshi · Answer · 2010-05-20 08:31 (edited by Timothy S. Van Haren, 2015-04-06 16:47:44Z)

0

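// Remove every HTML tag (Content is assumed to hold the HTML as a string)
var re = new RegExp("<[^>]*>", "g");
var x2 = Content.replace(re, "");
// Remove non-breaking spaces; this matches the decoded U+00A0 character, not the literal text "&nbsp;"
var x3 = x2.replace(/\u00a0/g, '');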
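For experimenting with this approach, the two replacements can be wrapped in a function. The sketch below is an assumption-laden variant, not the answer's exact code: the name stripTags is hypothetical, it also handles the literal "&nbsp;" entity, and it substitutes a space rather than an empty string so that adjacent words stay separated.

```javascript
// Hypothetical wrapper around the tag-stripping idea above.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, "")   // drop anything that looks like a tag
    .replace(/&nbsp;/g, " ")   // literal entity, as found in raw markup
    .replace(/\u00a0/g, " ");  // decoded non-breaking space character
}

console.log(stripTags("<p>Hello&nbsp;<b>world</b></p>")); // Hello world

// A ">" inside an attribute value ends the "tag" early and leaves debris behind.
console.log(stripTags('<a title="a>b">x</a>')); // b">x
```

The second call shows the limit the rest of this thread warns about: the pattern has no notion of quoted attribute values, so it is only safe on markup known not to contain them.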

answered May 20, 2010 at 8:31 by sajoshi; edited Apr 6, 2015 at 16:47 by Timothy S. Van Haren
