The stripping extensions

Question

In order to make my life easier with parsing some VBA code, I have written a few extension methods to extend the string type, and put them in this StringExtensions class:

[ComVisible(false)]
public static class StringExtensions
{
    public static readonly char StringDelimiter = '"';
    public static readonly char CommentMarker = '\'';

    /// <summary>
    /// Strips any trailing comment from specified line of code.
    /// </summary>
    /// <param name="line"></param>
    /// <returns>Returns a new string, without the trailing comment.</returns>
    public static string StripTrailingComment(this string line)
    {
        int index;
        if (line.HasComment(out index))
        {
            return line.Substring(0, index).TrimEnd();
        }

        return line;
    }

    /// <summary>
    /// Returns a value indicating whether line of code is/contains a comment.
    /// </summary>
    /// <param name="line"></param>
    /// <param name="index">Returns the start index of the comment string, including the comment marker.</param>
    /// <returns></returns>
    public static bool HasComment(this string line, out int index)
    {
        index = -1;
        var instruction = line.StripStringLiterals();

        for (int cursor = 0; cursor < instruction.Length; cursor++)
        {
            if (instruction[cursor] == CommentMarker)
            {
                index = cursor;
                return true;
            }
        }

        return false;
    }

    /// <summary>
    /// Strips all string literals from a line of code or instruction.
    /// Replaces string literals with whitespace characters, to maintain original length.
    /// </summary>
    /// <param name="line"></param>
    /// <returns>Returns a new string, stripped of all string literals and string delimiters.</returns>
    public static string StripStringLiterals(this string line)
    {
        var builder = new StringBuilder(line.Length);
        var isInsideString = false;
        for (int cursor = 0; cursor < line.Length; cursor++)
        {
            if (line[cursor] == StringDelimiter)
            {
                if (isInsideString)
                {
                    isInsideString = cursor + 1 == line.Length || line[cursor + 1] == StringDelimiter || cursor > 0 && (line[cursor - 1] == StringDelimiter);
                }
                else
                {
                    isInsideString = true;
                }
            }

            if (!isInsideString && line[cursor] != StringDelimiter)
            {
                builder.Append(line[cursor]);
            }
            else
            {
                builder.Append(' ');
            }
        }

        return builder.ToString();
    }
}

This works as intended, at least per simple unit tests I've written for it:

[TestMethod]
public void StripsStringLiteral()
{
    var value = "\"Hello, World!\"";
    var instruction = "Debug.Print " + value;

    var result = instruction.StripStringLiterals();

    var replacement = new string(' ', value.Length);
    Assert.AreEqual("Debug.Print " + replacement, result);
}

[TestMethod]
public void StripsAllStringLiterals()
{
    var value = "\"Hello, World!\"";
    var instruction = "Debug.Print " + value + " & " + value;

    var result = instruction.StripStringLiterals();

    var replacement = new string(' ', value.Length);
    Assert.AreEqual("Debug.Print " + replacement + " & " + replacement, result);
}

[TestMethod]
public void IsComment()
{
    var instruction = "'Debug.Print mwahaha this is just a comment.";

    int index;
    var result = instruction.HasComment(out index);

    Assert.IsTrue(result);
    Assert.AreEqual(0, index);
}

[TestMethod]
public void HasComment()
{
    var comment = "'but this is one.";
    var instruction = "Debug.Print \"'this isn't a comment\" " + comment;

    int index;
    var result = instruction.HasComment(out index);

    Assert.IsTrue(result);
    Assert.AreEqual(comment, instruction.Substring(index));
}

There has to be a better way to implement this code - and perhaps even to test it.

Anything weird in sight?

svick · Accepted Answer · 2014-11-12 14:13:51Z

Others mentioned regular expressions; I think using them can make StripStringLiterals much simpler and easier to understand:

public static string StripStringLiterals(this string line)
{
    return Regex.Replace(line, "\"[^\"]*\"", match => new string(' ', match.Length));
}
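To check how this pattern treats VBA's doubled-quote escape, here is a minimal, self-contained sketch. The class name RegexStrippingSketch is hypothetical and the method is written as a plain static method rather than an extension, purely for illustration. The pattern itself is the one shown above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class, used only to make the example self-contained.
public static class RegexStrippingSketch
{
    // Same pattern as above: an opening quote, any run of non-quote characters, a closing quote.
    private static readonly Regex StringLiteral = new Regex("\"[^\"]*\"");

    public static string StripStringLiterals(string line)
    {
        // Each match is replaced by the same number of spaces,
        // so every index in the result still lines up with the original line.
        return StringLiteral.Replace(line, match => new string(' ', match.Length));
    }
}
```

For a line such as Debug.Print "He said ""hi""" 'note, the escaped literal is matched as several adjacent quoted pieces, each replaced by spaces, so the result has the same length as the input and the comment marker keeps its position. A line with an odd number of quotes leaves the unmatched quote in place, which may or may not matter for the caller.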

If you wanted, you could replace the whole StripTrailingComment with a single regex, but I think this reaches the point where regexes become unreadable:

public static string StripTrailingComment(this string line)
{
    return Regex.Replace(line, "^([^\"]*(\"[^\"]*\"[^\"]*?)*) *'.*$", "$1");
}

edited Nov 12, 2014 at 14:13

answered Nov 12, 2014 at 12:17

svick

svick · Accepted Answer · 2014-11-12 14:13:15Z

In HasComment, I find it odd that you have to iterate over the characters. Isn't there something like indexOf that returns the index or -1 otherwise? So that the loop could be replaced with this:

index = instruction.IndexOf(CommentMarker);
return index > -1;

A test for StripTrailingComment would be good too:

[TestMethod]
public void StripTrailingComment()
{
    var comment = "       'but this is one.";
    var instruction = "Debug.Print \"'this isn't a comment\"";
    var instructionWithComment = instruction + comment;

    var result = instructionWithComment.StripTrailingComment();

    Assert.AreEqual(instruction, result);
}

I'm guessing the part you dislike the most is the implementation of StripStringLiterals. I'd look for a clever regex for this, if that's even possible.

Short of that, it seems that these two if statements can be simplified by joining them with else if:

if (line[cursor] == StringDelimiter)
{
    // ...
}

if (!isInsideString && line[cursor] != StringDelimiter)
{
    // ...
}

to get:

if (line[cursor] == StringDelimiter)
{
    // ...
}
else if (!isInsideString)
{
    // ...
}

And this line is too long to read:

isInsideString = cursor + 1 == line.Length || line[cursor + 1] == StringDelimiter || cursor > 0 && (line[cursor - 1] == StringDelimiter);

I find it odd that you have to iterate over the characters. - indeed, I don't have to anymore; that loop is a relic from before I extracted StripStringLiterals from HasComment!
– Mathieu Guindon, Commented Nov 12, 2014 at 11:37

Heslacher · Accepted Answer · 2014-11-12 13:18:58Z

HasComment()

This method is poorly named. A method called HasComment() implies that it only reports whether a string contains a comment.

It should follow the Try pattern instead: TryGetCommentIndex() would better reflect what the method does.

Why not use the built-in methods? The method can be simplified to

public static bool TryGetCommentIndex(this string line, out int index)
{
    index = line.StripStringLiterals().IndexOf(CommentMarker);

    return (index != -1);
}

StripStringLiterals()

The assignment to isInsideString can be simplified and broken across multiple lines, like

isInsideString = !isInsideString
                        || cursor + 1 == line.Length
                        || line[cursor + 1] == StringDelimiter
                        || cursor > 0 && (line[cursor - 1] == StringDelimiter);

Together with an else if and a continue, this method can be refactored to

public static string StripStringLiterals(this string line)
{
    var builder = new StringBuilder(line.Length);
    var isInsideString = false;
    for (int cursor = 0; cursor < line.Length; cursor++)
    {
        if (line[cursor] == StringDelimiter)
        {
            isInsideString = !isInsideString
                                    || cursor + 1 == line.Length
                                    || line[cursor + 1] == StringDelimiter
                                    || cursor > 0 && (line[cursor - 1] == StringDelimiter);
        }
        else if (!isInsideString)
        {
            builder.Append(line[cursor]);
            continue;
        }

        builder.Append(' ');
    }

    return builder.ToString();
}

But wait, we can do better. Let us check what values can be passed to the TryGetCommentIndex() method.

Possibilities

1. ' is missing -> no comment
2. ' with trailing " -> comment
3. ' without trailing " -> comment
4. ' with a leading " -> maybe a comment after the trailing "

Let us express these possibilities in code

line.IndexOf(CommentMarker) == -1
-1 < line.IndexOf(CommentMarker) < line.IndexOf(StringDelimiter)
(line.IndexOf(CommentMarker) == 0) || (-1 < line.IndexOf(CommentMarker) && line.IndexOf(StringDelimiter) == -1)
-1 < line.IndexOf(StringDelimiter) < line.IndexOf(CommentMarker)

This leads to

public static bool TryGetCommentIndex(this string line, out int index)
{
    index = line.IndexOf(CommentMarker);

    if (index == -1) { return false; } // possibility 1
    if (index == 0) { return true; }   // possibility 3.1

    int quoteIndex = line.IndexOf(StringDelimiter);

    if (quoteIndex > -1 && index > quoteIndex)
    {
        // possibility 4
    }

    // possibility 2 + 3.2
    return true;
}

So we only need to implement possibility 4.

What we know is that after a leading " there needs to be a trailing ", because the line is either code like Debug.Print "Something I don't know" or code followed by a comment.

So we need to get the index of the next " which will be > -1 because otherwise it would be code which wouldn't compile ;-)

quoteIndex = line.IndexOf(StringDelimiter, quoteIndex + 1);

This leads to two possibilities

quoteIndex > index
quoteIndex < index

Both possibilities can be a comment or not. We add 1 to quoteIndex, take the input string from there, and call the method recursively,

which leads to

quoteIndex = line.IndexOf(StringDelimiter, quoteIndex + 1) + 1;

Boolean success = line.Substring(quoteIndex).TryGetCommentIndex(out index);

If the recursive call returns true, we need to adjust the index to reflect the real position.

index = success ? index + quoteIndex : -1;

And altogether

public static bool TryGetCommentIndex(this string line, out int index)
{
    index = line.IndexOf(CommentMarker);

    if (index == -1) { return false; }
    if (index == 0) { return true; }

    int quoteIndex = line.IndexOf(StringDelimiter);

    if (quoteIndex > -1 && index > quoteIndex)
    {
        quoteIndex = line.IndexOf(StringDelimiter, quoteIndex + 1) + 1;

        Boolean success = line.Substring(quoteIndex).TryGetCommentIndex(out index);

        index = success ? index + quoteIndex : -1;

        return success;
    }

    return true;
}

Now we can skip the StripStringLiterals() method.
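For comparison, the same lookup can also be done in a single pass that tracks whether the cursor is inside a string literal. The following is only a sketch under stated assumptions: the class and method names (CommentScannerSketch, TryFindCommentStart) are hypothetical, it is a plain static method rather than an extension, and it only knows about the ' marker (not Rem comments).

```csharp
using System;

// Hypothetical helper, not part of the reviewed code.
public static class CommentScannerSketch
{
    private const char StringDelimiter = '"';
    private const char CommentMarker = '\'';

    public static bool TryFindCommentStart(string line, out int index)
    {
        bool insideString = false;

        for (int cursor = 0; cursor < line.Length; cursor++)
        {
            char current = line[cursor];

            if (current == StringDelimiter)
            {
                // Every quote toggles the state. A doubled quote inside a literal
                // toggles twice in a row, so the state is unchanged afterwards.
                insideString = !insideString;
            }
            else if (current == CommentMarker && !insideString)
            {
                // First marker found outside any string literal starts the comment.
                index = cursor;
                return true;
            }
        }

        index = -1;
        return false;
    }
}
```

Because each quote simply flips the flag, a line made only of a literal with escaped quotes, such as """ ' Not a comment """, reports no comment, while Debug.Print "'this isn't a comment" 'but this is one. reports the index of the final marker. It also needs no recursion and no Substring calls.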

CommentMarker

This name is not specific enough. Maybe name it CommentMarkerVB.

I didn't mention, but this whole project is a COM-visible add-in specifically for the VBA IDE - CommentMarkerVB sounds like overkill/redundant.
– Mathieu Guindon, Commented Nov 12, 2014 at 11:40
Now we can skip the StripStringLiterals() method ...except I need it elsewhere ;)
– Mathieu Guindon, Commented Nov 12, 2014 at 12:36
I'm not sure this logic is completely sound. VB uses double quotes to escape double quotes, so wouldn't ""'Not a comment""" return a false positive with this method? I'm not sure.
– RubberDuck, Commented Nov 12, 2014 at 12:54
Sorry, I missed a quote. I meant this: """ ' Not a comment """
– RubberDuck, Commented Nov 12, 2014 at 13:03
