Regex to find content in html tags

Question

I need to parse an HTML file and extract the NeedThis* strings using C#/.NET. Sample markup:

<tr class="class">
    <td style="width: 120px">
        <a href="NeedThis1">NeedThis2</a>
    </td>
    <td style="width: 120px">
        <a href="NeedThis3">
            NeedThis4</a>
    </td>
    <td style="width: 30%">
        NeedThis5
    </td>
    <td>
        NeedThis6
    </td>
    <td style="width: 120px">
        NeedThis7
    </td>
</tr>

I know an HTML parser would be better here, but all I need is to extract these strings; this is just for a temporary helper tool.

Can anyone help me with this?

Thanks!

I would like to cite the first answer to this question: stackoverflow.com/questions/1732348/…
– Soravux, Commented Oct 3, 2010 at 4:12
I've already seen that; I just don't want to use IndexOf... As I said, this is for a temporary helper tool, not a final product. I need to extract these strings from about 50k files, which are stored on my local HDD, and insert them into a database. Once that's done, ctrl + delete the tool =)
– Hans W, Commented Oct 3, 2010 at 4:14
@Soravux: We think alike -- I was about to do the same, then I saw your comment :-)
– Cameron, Commented Oct 3, 2010 at 4:14
@Hans W I am not sure how to use RegEx in C#, but the expression is pretty simple: (NeedThis\\*). You can build and test your own RegEx expressions here: gskinner.com/RegExr
– ubiquibacon, Commented Oct 3, 2010 at 4:15
Thanks! But NeedThis can be any arbitrary string; maybe I should have explained that better.
– Hans W, Commented Oct 3, 2010 at 4:16

Vinay B R · Accepted Answer · 2010-10-03 04:19:56Z

2

If you are sure that your HTML is valid, you could use LINQ to XML; otherwise you are better off using a parser such as HTML Agility Pack.
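
A minimal sketch of the LINQ to XML route, assuming each row is well-formed XML like the sample in the question. The names RowExtractor and ExtractValues are made up for illustration, and the td/a layout is taken from that sample only; real files may need a different traversal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RowExtractor
{
    // Hypothetical helper: returns href values and trimmed cell text in document order.
    public static List<string> ExtractValues(string rowMarkup)
    {
        // XElement.Parse throws XmlException if the markup is not well-formed XML.
        XElement row = XElement.Parse(rowMarkup);
        var values = new List<string>();

        foreach (XElement cell in row.Elements("td"))
        {
            XElement link = cell.Element("a");
            if (link != null)
            {
                // Cell contains a link: take the href attribute, then the link text.
                XAttribute href = link.Attribute("href");
                if (href != null)
                {
                    values.Add(href.Value);
                }
                values.Add(link.Value.Trim());
            }
            else
            {
                // Plain cell: take its text content without the surrounding whitespace.
                values.Add(cell.Value.Trim());
            }
        }

        return values;
    }
}
```

Calling RowExtractor.ExtractValues on the text of one tr element should give the href and text values in the order they appear. XML element names are case-sensitive, so this sketch assumes lowercase tags, and it will throw on anything a browser would tolerate but an XML parser would not (unclosed tags, unquoted attributes, entities such as &nbsp;).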

answered Oct 3, 2010 at 4:19

Vinay B R

hobbs · Accepted Answer · 2010-10-03 04:20:38Z

2

It doesn't matter whether you're doing this for a one-off or for a "finished project". Your task isn't text extraction and it's not something that a regex can do effectively. The data you're looking for depends on the structure of the HTML. Your task is parsing HTML. When your task is parsing HTML, use an HTML parser. It's not difficult. In fact it's a lot easier than writing the pile of regexes you would need otherwise.

answered Oct 3, 2010 at 4:20

hobbs

Community · Accepted Answer · 2017-05-23 12:04:10Z

0

You seem to have answered your own question: you should use a parser. But if you don't, you can use the regex NeedThis.*

Of course, if you want any context with those strings, you should just use a parser.

edited May 23, 2017 at 12:04

Community Bot

answered Oct 3, 2010 at 4:12

JoshD

2 Comments

JoshD Over a year ago

In that case, USE A PARSER

jball Over a year ago

@Hans W Glad to see you proving that programmers are still as naturally resistant to good ideas as ever.

ubiquibacon · Accepted Answer · 2010-10-03 04:45:31Z

Hans, as you can see from the other answers, using a RegEx is probably not the best way to do what you want, but since I need to practice my RegEx anyway, I went ahead and made one in case you want to experiment. This will only catch NeedThis2, but it should give you an idea of how to build your own RegEx when it is an appropriate solution.

<a href="NeedThis1">NeedThis2</a>

RegEx to catch NeedThis2:

(?:<a[^>]+?>)(\S*)(?:<[^<]+?a>)

Pretty nasty huh? :)
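
For anyone who wants to experiment with this from C#, here is a rough sketch using System.Text.RegularExpressions. The pattern is my own assumption, not the expression above: it pulls both the href value and the link text, and it only handles anchors with a double-quoted href and no nested tags. AnchorScraper and FindAnchors are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorScraper
{
    // Assumed shape: <a ... href="..." ...> text </a>, with plain text inside the link.
    // \s also matches newlines, so link text on the next line is still trimmed.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\s[^>]*?href=""(?<href>[^""]*)""[^>]*>\s*(?<text>[^<]*?)\s*</a>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns one (href, text) pair per matching anchor.
    public static List<(string Href, string Text)> FindAnchors(string html)
    {
        var results = new List<(string Href, string Text)>();
        foreach (Match m in AnchorPattern.Matches(html))
        {
            results.Add((m.Groups["href"].Value, m.Groups["text"].Value));
        }
        return results;
    }
}
```

This covers only the link cells from the question's sample; the bare td text would need a second pattern, which is where the regex approach starts to get fragile compared with a parser.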
