Extracting value of html element with C#

Question

In WordPress-generated pages, there is the following meta tag:

<meta name="generator" content="WordPress 3.4.2" />

I'm looking for a way to easily extract "3.4.2" (in the above example).

Would using XmlDocument or Regular Expression be faster?

I found JSoup, but that's overkill for what I'm trying to do.

EDIT

Just to clarify - I don't want to include any external libraries.
Also, this is running in a class library, so using PowerShell isn't an option either.

Rawling · Accepted Answer · 2012-10-19 23:44:08Z

3

As you're not trying to match paired tags or anything, a regular expression should be fine. Just search for content="WordPress (\d\.\d\.\d) or similar. (If it's really consistent, you could search for the whole meta tag.)

Trying to parse an HTML page as an XmlDocument might not work out; not all valid (or browser-supported) HTML is valid XML.
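
To illustrate the point about XmlDocument, here is a minimal sketch (the class and method names are hypothetical, not from the answer). It only reports whether a piece of markup loads as XML. A meta tag written without the self-closing slash, which browsers accept, is rejected.

```csharp
using System;
using System.Xml;

public static class XmlParseCheck
{
    // Returns true when the markup loads as XML, false when XmlDocument rejects it.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for markup that is not well-formed XML, e.g. an unclosed <meta>.
            return false;
        }
    }
}
```

Under these assumptions, a head element containing the self-closed meta tag from the question loads, while the same markup with a plain unclosed meta tag does not.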

2 Comments

Alex Over a year ago

Yeah, this seems to work... is there a way of selecting just the value of (\d\.\d\.\d)?

Rawling Over a year ago

Round brackets in a regular expression create a capture group. If you get the Match instance returned by yourRegex.Match, the value of that capture group should be in match.Groups[1].Value.
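
A minimal sketch of the capture-group approach described above. The names GeneratorVersion and Extract are hypothetical, and the pattern uses \d+(?:\.\d+)* instead of \d\.\d\.\d as an assumption, so that multi-digit or two-part version numbers would also match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GeneratorVersion
{
    // Group 1 captures the dotted version number that follows "WordPress ".
    private static readonly Regex Pattern =
        new Regex(@"content=""WordPress (\d+(?:\.\d+)*)", RegexOptions.IgnoreCase);

    // Returns the version string, or null when the pattern does not match.
    public static string Extract(string html)
    {
        Match match = Pattern.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Calling GeneratorVersion.Extract on the meta tag from the question should return "3.4.2", since Groups[0] is the whole match and Groups[1] is the first pair of round brackets.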

Prashanth Thurairatnam · Accepted Answer · 2012-10-20 08:56:05Z

Use the HTML Agility Pack to parse the HTML.

EDIT (code to copy)

using System;
using HtmlAgilityPack;

namespace HTMLAgilityExample
{
    class Program
    {
        static void Main(string[] args)
        {
            string contentValue = null;

            HtmlDocument document = new HtmlDocument();
            document.Load("C:/test.html");

            // SelectNodes returns null (not an empty collection) when nothing matches.
            HtmlNodeCollection metaNodes = document.DocumentNode.SelectNodes("//meta[@content]");
            if (metaNodes != null)
            {
                foreach (HtmlNode meta in metaNodes)
                {
                    HtmlAttribute attribute = meta.Attributes["content"];
                    if (attribute.Value.Contains("WordPress"))
                    {
                        // Strip the product name, leaving only the version, e.g. "3.4.2".
                        contentValue = attribute.Value.Replace("WordPress", "").Trim();
                    }
                }
            }

            Console.WriteLine(contentValue ?? "No WordPress generator tag found.");
        }
    }
}

Please post your answers as text so that people can easily copy/paste and test.
@alexjamesbrown If you don't want to use an external library, that's fine (I don't know what limitations you have on adding one). In that case, Regex is the way to go. However, you need to be very precise about what you search for. For example, if content="WordPress 3.4.2" appears in the body text, the regex will still match it.
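
One way to reduce the false positives mentioned above is to anchor the pattern to the generator meta tag itself. This is only a sketch: the names StrictGeneratorVersion and Extract are hypothetical, and it assumes the name attribute comes directly before content, as in the tag from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StrictGeneratorVersion
{
    // Requires the opening <meta and name="generator" before the content attribute,
    // so the bare text content="WordPress 3.4.2" elsewhere in the page is ignored.
    private static readonly Regex Pattern = new Regex(
        @"<meta\s+name=""generator""\s+content=""WordPress\s+(?<version>\d+(?:\.\d+)*)""",
        RegexOptions.IgnoreCase);

    // Returns the version string, or null when no generator tag is found.
    public static string Extract(string html)
    {
        Match match = Pattern.Match(html);
        return match.Success ? match.Groups["version"].Value : null;
    }
}
```

The trade-off is that a stricter pattern misses tags whose attributes are ordered or quoted differently, so it suits pages whose markup is known to be consistent.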

Darryl · Accepted Answer · 2012-10-20 15:55:15Z

Since you have to parse the version out of the attribute value anyway, and since it sounds like you're not planning any extensive HTML parsing beyond this task, I'd suggest a regular expression.

This should give you a start. The expression can be simplified a bit; maybe it is unnecessary to specify that the attribute value is within a meta tag. Or it can be tightened up a bit; maybe it would be better to specify the "content" attribute. Either way, this worked in my quick testing.

Note that for better readability, I like to leave whitespace within the regular expression and include the IgnorePatternWhitespace option.

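// Requires: using System.Text.RegularExpressions;
var html = ""; // Populate the html string here

// IgnorePatternWhitespace lets the pattern contain spaces for readability.
var options = RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace;
var regx = new Regex( @"<meta\s+? .*? WordPress\s*? (?<version> [\d\.]+) [^\d\.] .*? />", options );

var match = regx.Match( html );

if ( match.Success ) {
    // The named group "version" holds the dotted version number, e.g. "3.4.2".
    var version = match.Groups["version"].Value;
}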

David · Accepted Answer · 2012-10-19 23:59:07Z

0

You could use PowerShell:

PS> [xml]$xml = '<meta name="generator" content="WordPress 3.4.2" />'
PS> ($xml.meta.content) -match "[\d\.]+"
True
PS> $matches[0]
3.4.2

2 Comments

Alex Over a year ago

is there a way to do this just using regex?

David Over a year ago

@alexjamesbrown Yes - you could just assign the string to a variable, then call match on the variable.
