I have rewritten my tokenizer according to most of the suggestions from the previous question here.
API
It now reads all chars as long as they match the pattern. I use three types of attributes to achieve this.
- `Regex` - reads by regular expressions; this one requires a single group that is the value of the token; it can match more, but only the value of `Groups[1]` is used as the result
- `Const` - reads a constant pattern where the entire length must match
- `QText` - reads quoted text or falls back to regex. I chose not to use regex for quoted strings because this is pretty damn tricky.
They return a tuple where:
- `Success` - indicates whether a pattern was matched
- `Token` - the actual value of the token
- `Length` - the total length of the match; I use this to advance the index to the next token
These are the three attributes:
public delegate (bool Success, string Token, int Length) MatchDelegate(string value, int offset);
public abstract class MatcherAttribute : Attribute
{
public abstract (bool Success, string Token, int Length) Match(string value, int offset);
}
public class RegexAttribute : MatcherAttribute
{
private readonly Regex _regex;
public RegexAttribute([RegexPattern] string pattern)
{
_regex = new Regex(pattern);
}
public override (bool Success, string Token, int Length) Match(string value, int offset)
{
var match = _regex.Match(value, offset);
// Make sure the match was at the offset.
return (match.Success && match.Index == offset, match.Groups[1].Value, match.Length);
}
}
public class ConstAttribute : MatcherAttribute
{
private readonly string _pattern;
public ConstAttribute(string pattern) => _pattern = pattern;
public override (bool Success, string Token, int Length) Match(string value, int offset)
{
var matchCount = _pattern.TakeWhile((t, i) => value[offset + i].Equals(t)).Count();
// All characters have to be matched.
return (matchCount == _pattern.Length, _pattern, matchCount);
}
}
// "foo \"bar\" baz"
// ^ starts here ^ ends here
public class QTextAttribute : RegexAttribute
{
public static readonly IImmutableSet<char> Escapables = new[] { '\\', '"' }.ToImmutableHashSet();
public QTextAttribute([RegexPattern] string pattern) : base(pattern) { }
public override (bool Success, string Token, int Length) Match(string value, int offset)
{
return
value[offset] == '"'
? MatchQuoted(value, offset)
: base.Match(value, offset);
}
private (bool Success, string Token, int Length) MatchQuoted(string value, int offset)
{
var token = new StringBuilder();
var escapeSequence = false;
var quote = false;
for (var i = offset; i < value.Length; i++)
{
var c = value[i];
switch (c)
{
case '"' when !escapeSequence:
switch (i == offset)
{
// Entering quoted text.
case true:
quote = !quote;
continue; // Don't eat quotes.
// End of quoted text.
case false:
return (true, token.ToString(), i - offset + 1);
}
break; // Makes the compiler happy.
case '\\' when !escapeSequence:
escapeSequence = true;
break;
default:
switch (escapeSequence)
{
case true:
switch (Escapables.Contains(c))
{
case true:
// Remove escape char.
token.Length--;
break;
}
escapeSequence = false;
break;
}
break;
}
token.Append(c);
}
return (false, token.ToString(), 0);
}
}
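To make the behavior concrete, calling the matchers directly should give results like these (illustration only; normally the attributes are applied to enum fields, as shown further down):
// Illustration only - the attributes are normally applied to enum fields.
var scheme = new RegexAttribute(@"([a-z]+):");
var s = scheme.Match("scheme://host", 0);
// s == (Success: true, Token: "scheme", Length: 7) - the ':' counts toward Length but isn't part of the token.
var quoted = new QTextAttribute(@"([a-z0-9]*)");
var q = quoted.Match("\"foo \\\"bar\\\" baz\"", 0);
// q == (Success: true, Token: foo "bar" baz, Length: 17) - quotes count toward Length, escapes are unwrapped.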
The tokenizer is now an instantiable class with an interface. It can be used as-is or derived to create a specific tokenizer. When created, it turns the state transitions into a dictionary; this is what the `StateTransitionMapper` is for. The tokenizer picks the first non-empty token. I guess I should probably use the longest one instead - this is what various sources suggest - so I might change this later. What do you think? Would that be better?
It starts with the default state, which is by convention 0 because `TToken` is constrained to `Enum` and its default value is 0. I named this dummy state simply `Start`.
public static class StateTransitionMapper
{
public static IImmutableDictionary<TToken, IImmutableList<State<TToken>>> CreateTransitionMap<TToken>(IImmutableList<State<TToken>> states) where TToken : Enum
{
return states.Aggregate(ImmutableDictionary<TToken, IImmutableList<State<TToken>>>.Empty, (mappings, state) =>
{
var nextStates =
from n in state.Next
join s in states on n equals s.Token
select s;
return mappings.Add(state.Token, nextStates.ToImmutableList());
});
}
}
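For example, a handful of the `UriToken` states defined further down would produce a map like this (a minimal sketch, just to show what the tokenizer looks up):
// Sketch: the map tells the tokenizer which states may follow the one it just matched.
var states = new[]
{
    new State<UriToken>(default, UriToken.Scheme),
    new State<UriToken>(UriToken.Scheme, UriToken.AuthorityPrefix, UriToken.Path),
    new State<UriToken>(UriToken.AuthorityPrefix, UriToken.Path),
    new State<UriToken>(UriToken.Path),
}.ToImmutableList();
var map = StateTransitionMapper.CreateTransitionMap(states);
// map[default]                  -> [Scheme]
// map[UriToken.Scheme]          -> [AuthorityPrefix, Path]
// map[UriToken.AuthorityPrefix] -> [Path]
// map[UriToken.Path]            -> []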
public interface ITokenizer<TToken> where TToken : Enum
{
IEnumerable<Token<TToken>> Tokenize(string value);
}
public class Tokenizer<TToken> : ITokenizer<TToken> where TToken : Enum
{
private readonly IImmutableDictionary<TToken, IImmutableList<State<TToken>>> _transitions;
public Tokenizer(IImmutableList<State<TToken>> states)
{
_transitions = StateTransitionMapper.CreateTransitionMap(states);
}
public IEnumerable<Token<TToken>> Tokenize(string value)
{
var current = _transitions[default];
for (var i = 0; i < value.Length;)
{
var matches =
from state in current
let token = state.Consume(value, i)
// Consider only non-empty tokens.
where token.Length > 0
select (state, token);
if (matches.FirstOrDefault() is var match && match.token is null)
{
throw new ArgumentException($"Invalid character '{value[i]}' at {i}.");
}
else
{
if (match.state.IsToken)
{
yield return match.token;
}
i += match.token.Length;
current = _transitions[match.state.Token];
}
}
}
}
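To clarify what I mean by picking the longest match: the selection inside `Tokenize` would change to roughly this (just a sketch, I haven't switched to it yet):
// Sketch only - order the candidates by length and take the longest instead of the first.
var match =
    matches
        .OrderByDescending(m => m.token.Length)
        .FirstOrDefault();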
The tokenizer is supported by the `State` and `Token` classes, where `State` now reads all matching chars and caches the `MatchDelegate` it gets from the `MatcherAttribute`. The `IsToken` property is used to ignore matches that aren't real or usable tokens. I use this with the `CommandLineTokenizer`.
public class State<TToken> where TToken : Enum
{
private readonly MatchDelegate _match;
public State(TToken token, params TToken[] next)
{
Token = token;
Next = next;
_match =
typeof(TToken)
.GetField(token.ToString())
.GetCustomAttribute<MatcherAttribute>() is MatcherAttribute matcher
? (MatchDelegate)(matcher.Match)
: (MatchDelegate)((value, offset) => (false, string.Empty, 0));
}
public bool IsToken { get; set; } = true;
public TToken Token { get; }
public IEnumerable<TToken> Next { get; }
public Token<TToken> Consume(string value, int offset)
{
return new Token<TToken>(_match(value, offset))
{
Type = Token,
Index = offset
};
}
public override string ToString() => $"{Token} --> [{string.Join(", ", Next)}]";
}
public class Token<TToken> where TToken : Enum
{
public Token((bool Success, string Token, int Length) match)
{
Length = match.Success ? match.Length : 0;
Text = match.Success ? match.Token : string.Empty;
}
public int Index { get; set; }
public int Length { get; set; }
public string Text { get; set; }
public TToken Type { get; set; }
public override string ToString() => $"{Index}: {Text} ({Type})";
}
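For illustration, consuming a single state directly should work like this (using the `UriToken.Scheme` state from the `UriStringTokenizer` below; the tokenizer normally drives this via the transition map):
// Illustration only - the tokenizer calls Consume through the transition map.
var scheme = new State<UriToken>(UriToken.Scheme, UriToken.AuthorityPrefix, UriToken.Path);
var token = scheme.Consume("scheme://user@host", 0);
// token.Text == "scheme", token.Length == 7, token.Index == 0, token.Type == UriToken.Scheme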
Examples and tests
I tested it with two tokenizers. They are very simple because they just derive from `Tokenizer` and define their own state transitions and tokens.
One is for a `UriString`:
using static UriToken;
public class UriStringParserTest
{
private static readonly ITokenizer<UriToken> Tokenizer = new UriStringTokenizer();
[Theory]
[InlineData(
"scheme://user@host:123/pa/th?key-1=val-1&key-2=val-2#f",
"scheme //user host 123/pa/th key-1 val-1 key-2 val-2 f")]
[InlineData(
"scheme://user@host:123/pa/th?key-1=val-1&key-2=val-2",
"scheme //user host 123/pa/th key-1 val-1 key-2 val-2")]
[InlineData(
"scheme://user@host:123/pa/th?key-1=val-1",
"scheme //user host 123/pa/th key-1 val-1")]
[InlineData(
"scheme://user@host:123/pa/th",
"scheme //user host 123/pa/th")]
[InlineData(
"scheme:///pa/th",
"scheme ///pa/th"
)]
public void Can_tokenize_URIs(string uri, string expected)
{
var tokens = Tokenizer.Tokenize(uri).ToList();
var actual = string.Join("", tokens.Select(t => t.Text));
Assert.Equal(expected.Replace(" ", string.Empty), actual);
}
[Fact]
public void Throws_when_invalid_character()
{
// Using single letters for faster debugging.
var uri = "s://:u@h:1/p?k=v&k=v#f";
// ^ - invalid character
var ex = Assert.Throws<ArgumentException>(() => Tokenizer.Tokenize(uri).ToList());
Assert.Equal("Invalid character ':' at 4.", ex.Message);
}
}
public class UriStringTokenizer : Tokenizer<UriToken>
{
/*
scheme:[//[userinfo@]host[:port]]path[?key=value&key=value][#fragment]
[ ----- authority ----- ] [ ----- query ------ ]
scheme: ------------------------ '/'path ------------------------- --------- UriString
\ / \ /\ /
// --------- host ----- / ?key ------ &key ------ / #fragment
\ / \ / \ / \ /
userinfo@ :port =value =value
*/
private static readonly State<UriToken>[] States =
{
new State<UriToken>(default, Scheme),
new State<UriToken>(Scheme, AuthorityPrefix, Path),
new State<UriToken>(AuthorityPrefix, UserInfo, Host, Path),
new State<UriToken>(UserInfo, Host),
new State<UriToken>(Host, Port, Path),
new State<UriToken>(Port, Path),
new State<UriToken>(Path, Key, Fragment),
new State<UriToken>(Key, UriToken.Value, Fragment),
new State<UriToken>(UriToken.Value, Key, Fragment),
new State<UriToken>(Fragment, Fragment),
};
public UriStringTokenizer() : base(States.ToImmutableList()) { }
}
public enum UriToken
{
Start = 0,
[Regex(@"([a-z0-9\+\.\-]+):")]
Scheme,
[Const("//")]
AuthorityPrefix,
[Regex(@"([a-z0-9_][a-z0-9\.\-_:]+)@")]
UserInfo,
[Regex(@"([a-z0-9\.\-_]+)")]
Host,
[Regex(@":([0-9]*)")]
Port,
[Regex(@"(\/?[a-z_][a-z0-9\/:\.\-\%_@]+)")]
Path,
[Regex(@"[\?\&\;]([a-z0-9\-]*)")]
Key,
[Regex(@"=([a-z0-9\-]*)")]
Value,
[Regex(@"#([a-z]*)")]
Fragment,
}
and the other for a CommandLine:
using static CommandLineToken;
public class CommandLineTokenizerTest
{
private static readonly ITokenizer<CommandLineToken> Tokenizer = new CommandLineTokenizer();
[Theory]
[InlineData(
"command -argument value -argument",
"command argument value argument")]
[InlineData(
"command -argument value value",
"command argument value value")]
[InlineData(
"command -argument:value,value",
"command argument value value")]
[InlineData(
"command -argument=value",
"command argument value")]
[InlineData(
@"command -argument=""foo--bar"",value -argument value",
@"command argument foo--bar value argument value")]
[InlineData(
@"command -argument=""foo--\""bar"",value -argument value",
@"command argument foo-- ""bar value argument value")]
public void Can_tokenize_command_lines(string uri, string expected)
{
var tokens = Tokenizer.Tokenize(uri).ToList();
var actual = string.Join("", tokens.Select(t => t.Text));
Assert.Equal(expected.Replace(" ", string.Empty), actual);
}
}
public enum CommandLineToken
{
Start = 0,
[Regex(@"\s*(\?|[a-z0-9][a-z0-9\-_]*)")]
Command,
[Regex(@"\s*[\-\.\/]([a-z0-9][a-z\-_]*)")]
Argument,
[Regex(@"[\=\:\,\s]")]
ValueBegin,
[QText(@"([a-z0-9\.\;\-]*)")]
Value,
}
public class CommandLineTokenizer : Tokenizer<CommandLineToken>
{
/*
command [-argument][=value][,value]
command --------------------------- CommandLine
\ /
-argument ------ ------ /
\ / \ /
=value ,value
*/
private static readonly State<CommandLineToken>[] States =
{
new State<CommandLineToken>(default, Command),
new State<CommandLineToken>(Command, Argument),
new State<CommandLineToken>(Argument, Argument, ValueBegin),
new State<CommandLineToken>(ValueBegin, Value) { IsToken = false },
new State<CommandLineToken>(Value, Argument, ValueBegin),
};
public CommandLineTokenizer() : base(States.ToImmutableList()) { }
}
Questions
- Would you say this is an improvement?
- Maybe something is still too unconventional? I guess this is probably still not a true state machine because of the loop inside the tokenizer. Am I right?
- Did I miss any important suggestion, or misinterpret one?