0

I need to parse an HTML file and extract the NeedThis* strings with C#/.NET. Sample markup:

<tr class="class">
    <td style="width: 120px">
        <a href="NeedThis1">NeedThis2</a>
    </td>
    <td style="width: 120px">
        <a href="NeedThis3">
            NeedThis4</a>
    </td>
    <td style="width: 30%">
        NeedThis5
    </td>
    <td>
        NeedThis6
    </td>
    <td style="width: 120px">
        NeedThis7
    </td>
</tr>

I know an HTML parser would be better here, but all I need is to extract these strings; this is just for a temporary helper tool...

Can anyone help me with this?

Thanks!

6
  • 4
    I would like to cite the first answer to this question: stackoverflow.com/questions/1732348/… Commented Oct 3, 2010 at 4:12
  • 1
    I have already seen that; I just don't want to use IndexOf... As I said, this is for a temp helper tool, not a final product. I need to extract these strings from about 50k files, which are stored on my local HDD, and insert them into a database. Once done, ctrl + delete the tool =) Commented Oct 3, 2010 at 4:14
  • 1
    @Soravux: We think alike -- I was about to do the same, then I saw your comment :-) Commented Oct 3, 2010 at 4:14
  • @Hans W I am not sure how to use RegEx in C#, but the expression is pretty simple: (NeedThis\\*). You can build your own RegEx expressions and test them here: gskinner.com/RegExr Commented Oct 3, 2010 at 4:15
  • Thanks! But NeedThis can be any arbitrary string; maybe I should have explained it better. Commented Oct 3, 2010 at 4:16

4 Answers

2

If you are sure that your HTML is well-formed XML, you could use LINQ to XML; otherwise you are better off using a parser like HTML Agility Pack.
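
For the LINQ to XML route, a minimal sketch might look like the following. It assumes each row is well-formed XML, as in the sample from the question; the `RowExtractor` class and `Extract` method are made-up names for illustration, and real-world HTML with unclosed tags or entities like `&nbsp;` would make `XElement.Parse` throw.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper, not from the question: pulls the values out of one <tr>.
public static class RowExtractor
{
    // Returns, in document order, each link's href followed by its text,
    // or the trimmed cell text when a cell contains no link.
    public static List<string> Extract(string rowMarkup)
    {
        // Throws System.Xml.XmlException if the markup is not well-formed XML.
        XElement row = XElement.Parse(rowMarkup);
        var values = new List<string>();

        foreach (XElement cell in row.Elements("td"))
        {
            XElement link = cell.Element("a");
            if (link != null)
            {
                // The explicit cast yields null when the attribute is missing.
                values.Add((string)link.Attribute("href"));
                values.Add(link.Value.Trim());
            }
            else
            {
                values.Add(cell.Value.Trim());
            }
        }

        return values;
    }
}
```

Calling `RowExtractor.Extract(...)` on the `<tr>` from the question should give NeedThis1 through NeedThis7 in order. For 50k files you would probably want to wrap the call in a try/catch and log the files that fail to parse, since those are the ones that need a real HTML parser.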

2

It doesn't matter whether you're doing this for a one-off or for a "finished project". Your task isn't text extraction and it's not something that a regex can do effectively. The data you're looking for depends on the structure of the HTML. Your task is parsing HTML. When your task is parsing HTML, use an HTML parser. It's not difficult. In fact it's a lot easier than writing the pile of regexes you would need otherwise.

0

You seem to have answered your own question. You should use a parser. But if you don't, you can use the RE NeedThis.*

Of course, if you want any context with those strings, you should just use a parser.

2 Comments

In that case, USE A PARSER
@Hans W Glad to see you proving that programmers are still as naturally resistant to good ideas as ever.

0

Hans, as you can see from the other answers, using a RegEx is probably not the best way to do what you want to do. But since I need to practice my RegEx anyway, I went ahead and made one in case you want to experiment. This will only catch NeedThis2, but it should give you an idea of how you would make your own RegEx when it is an appropriate solution.

<a href="NeedThis1">NeedThis2</a>

RegEx to catch NeedThis2:

(?:<a[^>]+?>)(\S*)(?:<[^<]+?a>)

Pretty nasty huh? :)
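
If you want to try this from C#, here is a rough sketch using `System.Text.RegularExpressions`. The pattern is my own variation, not the one above: it uses named groups to grab both the href and the link text, and allows whitespace before the text so the NeedThis3/NeedThis4 case works too. `AnchorRegex` and `ExtractLinks` are made-up names, and it assumes the `<a>` tags carry only a double-quoted `href` attribute, as in the sample. It does not touch the plain `<td>` cells (NeedThis5 to NeedThis7).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for experimenting; assumes <a href="..."> with no other attributes.
public static class AnchorRegex
{
    // Group "href" is the attribute value, group "text" is the link text.
    // Inside the verbatim string, "" is a literal double quote, so [^""]* stops
    // at the closing quote and [^<]*? stops before the closing tag.
    // The \s* around the text absorbs line breaks and indentation.
    private static readonly Regex Anchor = new Regex(
        @"<a\s+href=""(?<href>[^""]*)""\s*>\s*(?<text>[^<]*?)\s*</a>",
        RegexOptions.IgnoreCase);

    // Returns one (href, text) pair per matching link, in document order.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in Anchor.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups["href"].Value,
                m.Groups["text"].Value));
        }
        return links;
    }
}
```

As soon as the markup varies (extra attributes, single quotes, nested tags inside the link), the pattern needs another patch, which is the point the other answers are making.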
