1

In WordPress-generated pages, there is the following meta tag:

<meta name="generator" content="WordPress 3.4.2" />

I'm looking for an easy way to extract "3.4.2" (in the above example).

Would using XmlDocument or a regular expression be faster?

I found JSoup, but that's overkill for what I'm trying to do.

EDIT

Just to clarify - I don't want to include any external libraries.
Also, this is running in a class library, so using PowerShell isn't an option either.

4 Answers

3

As you're not trying to match paired tags or anything, a regular expression should be fine. Just search for content="WordPress (\d\.\d\.\d) or similar. (If it's really consistent, you could search for the whole meta tag.)
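A minimal C# sketch of that search, using only the framework's own Regex class. The class and method names are made up for illustration. The pattern is also widened slightly from the one above, on the assumption that versions may have multi-digit or fewer components (for example 3.10 or 3.5).

```csharp
using System;
using System.Text.RegularExpressions;

public static class GeneratorVersion
{
    // Assumed pattern: one or more dot-separated numbers after "WordPress ".
    private static readonly Regex Pattern =
        new Regex(@"content=""WordPress (\d+(?:\.\d+)*)", RegexOptions.IgnoreCase);

    // Returns the version text, or null when the page has no matching attribute.
    public static string Extract(string html)
    {
        Match match = Pattern.Match(html);

        // Groups[0] is the whole match; Groups[1] is the first pair of round brackets.
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Calling GeneratorVersion.Extract on the meta tag from the question returns "3.4.2".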

Trying to parse an HTML page as an XmlDocument might not work out; not all valid (or browser-supported) HTML is valid XML.
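To illustrate the point, here is a small sketch (the helper name is hypothetical) that reports whether XmlDocument accepts a piece of markup. A self-closed meta tag loads fine, but a page with an unclosed meta or br tag, which browsers accept, is rejected with an XmlException.

```csharp
using System;
using System.Xml;

public static class XmlParseCheck
{
    // Returns true when the markup loads as XML, false when XmlDocument rejects it.
    public static bool IsWellFormedXml(string markup)
    {
        var document = new XmlDocument();
        try
        {
            document.LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for unclosed tags, unquoted attributes, and similar HTML-isms.
            return false;
        }
    }
}
```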

2 Comments

Yeah, this seems to work... is there a way of selecting just the value of (\d\.\d\.\d)?
Round brackets in a regular expression create a capture group. If you get the Match instance returned by yourRegex.Match, the value of that capture group should be in match.Groups[1].Value.
1

Use the HTML Agility Pack to parse the HTML.

EDIT (code to copy)

using System;
using HtmlAgilityPack;

namespace HTMLAgilityExample
{
    class Program
    {
        static void Main(string[] args)
        {
            string contentValue = null;

            HtmlDocument document = new HtmlDocument();
            document.Load("C:/test.html");

            // SelectNodes returns null (not an empty collection) when nothing matches.
            HtmlNodeCollection metaNodes = document.DocumentNode.SelectNodes("//meta[@content]");
            if (metaNodes != null)
            {
                foreach (HtmlNode meta in metaNodes)
                {
                    HtmlAttribute attribute = meta.Attributes["content"];
                    if (attribute.Value.Contains("WordPress"))
                    {
                        // Strip the product name, leaving only the version number.
                        contentValue = attribute.Value.Replace("WordPress", "").Trim();
                        break;
                    }
                }
            }

            Console.WriteLine(contentValue ?? "No WordPress generator tag found.");
        }
    }
}

2 Comments

Please post your answers as text so that people can easily copy/paste and test.
@alexjamesbrown If you don't want to use an external library, that's fine (I don't know what limitations you have on adding one). In that case, Regex is the way to go. However, you need to be very precise about what you search for. For example, if content="WordPress 3.4.2" appears in the body text, the regex will still match it.
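One way to reduce that risk without an external library is sketched below. It assumes the tag is written with name before content, as in the question, and only searches the markup before the closing head tag. The class and method names are hypothetical, and this narrows false matches rather than ruling them out.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GeneratorTag
{
    // Matches the whole generator meta tag up to the end of the content value.
    // Assumes the attribute order name, then content.
    private static readonly Regex Pattern = new Regex(
        @"<meta\s+name=[""']generator[""']\s+content=[""']WordPress\s+(?<version>\d+(?:\.\d+)*)[""']",
        RegexOptions.IgnoreCase);

    public static string ExtractVersion(string html)
    {
        // Only look at the markup before </head>, if that tag is present.
        int headEnd = html.IndexOf("</head>", StringComparison.OrdinalIgnoreCase);
        string head = headEnd >= 0 ? html.Substring(0, headEnd) : html;

        Match match = Pattern.Match(head);
        return match.Success ? match.Groups["version"].Value : null;
    }
}
```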
1

I guess that since you have to parse the version out of the attribute value anyway, and since it sounds like you're not looking to do any extensive HTML parsing beyond this task, I'd suggest a regular expression.

This should give you a start. The expression can be simplified a bit; maybe it is unnecessary to specify that the attribute value is within a meta tag. Or it can be tightened up a bit; maybe it would be better to specify the "content" attribute. Either way, this worked in my quick testing.

Note that for better readability, I like to leave whitespace within the regular expression and include the IgnorePatternWhitespace option.

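// Requires: using System.Text.RegularExpressions;
var html = ""; // Populate the html string here

// IgnorePatternWhitespace lets the pattern contain spaces for readability.
var options = RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace;
var regx = new Regex( "<meta\\s+? .*? WordPress\\s*? (?<version> [\\d\\.]+) [^\\d\\.] .*? />", options );

var match = regx.Match( html );

if ( match.Success ) {
    // The named group holds just the version, e.g. "3.4.2".
    var version = match.Groups["version"].Value;
}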

0

You could use PowerShell:

PS> [xml]$xml = '<meta name="generator" content="WordPress 3.4.2" />'
PS> ($xml.meta.content) -match "[\d\.]+"
True
PS> $matches[0]
3.4.2

2 Comments

Is there a way to do this just using regex?
@alexjamesbrown Yes - you could just assign the string to a variable, then call match on the variable.
