A lexer in C++ for analysing regex-like text

Question

I'm creating a parser, and I have just finished my lexer. I wanted to ask if there is anything I should change, add, or reconsider in my code! (I don't think the grammar matters much, since this is only the lexer.)

lexical_analyzer.h

#pragma once

#include <iostream>
#include <string>
#include <vector>

// kinds of lexemes (i.e. tokens)
namespace la_enum
{
    enum token
    {
        STRING              // anything that isn't something below
        , AND               // +
        , CHAR_REPEATED     // *
        , LEFT_PARENTHESIS  // (
        , RIGHT_PARENTHESIS // )
        , ANY_CHAR          // .
        , COUNTER           // {N}
        , IGNORE_CASE       // \I
        , SINGLE_CAPTURE    // \O{N}
    };
}

class lexical_analyzer
{
public:
    lexical_analyzer(std::string patternInput) :
        pattern(patternInput)
    {
        addLexemes();
    }

private:
    void addLexemes(); // adds lexemes and tokens from the pattern to the vectors

    bool isSingleSymbol(char); // checks if it is an operator with ONE symbol

    void addString(); // is used to store strings (operands)

    void addOperator(la_enum::token, std::string&, int, int); // is used to store operators

    std::string pattern; // input pattern
    std::string characterBuffer; // buffer for operators with more than one symbol

    // 'lexemes' and 'tokens' have synched indexes
    std::vector<std::string> lexemes; // stores lexemes
    std::vector<la_enum::token> tokens; // stores tokens
};

lexical_analyzer.cpp

#include "lexical_analyzer.h"

// adds lexemes and tokens from the pattern to the vectors
void lexical_analyzer::addLexemes()
{
    for (int i = 0; i != pattern.size(); i++)
    {
        // adds lexems and tokens from the char buffer (strings)
        if (isSingleSymbol(pattern[i]))
            addString();

        switch (pattern[i])
        {
        case ('+') :
            addOperator(la_enum::AND, pattern, i, 1);
            break;
        case ('*') :
            addOperator(la_enum::CHAR_REPEATED, pattern, i, 1);
            break;
        case ('.') :
            addOperator(la_enum::ANY_CHAR, pattern, i, 1);
            break;
        case ('(') :
            addOperator(la_enum::LEFT_PARENTHESIS, pattern, i, 1);
            break;
        case (')') :
            addOperator(la_enum::RIGHT_PARENTHESIS, pattern, i, 1);
            break;
        default :
            if (pattern[i] == '{')
            {
                // checks if it's the right syntax '{N}'...
                if (isdigit(pattern[i + 1]) && pattern[i + 2] == '}')
                {
                    addString();
                    addOperator(la_enum::COUNTER, pattern, i, 3);
                    i += 2;
                }
                else
                {   // ...otherwise it counts as a string and is added to the buffer
                    characterBuffer.push_back(pattern[i]);
                }
            }
            else if (pattern[i] == '\\')
            {
                // checks if it's the right syntax '\I'...
                if (pattern[i + 1] == 'I')
                {
                    addString();
                    addOperator(la_enum::IGNORE_CASE, pattern, i, 2);
                    i++;
                }
                // checks if it's the right syntax '\O{N}'...
                else if (pattern[i + 1] == 'O' && pattern[i + 2] == '{' && isdigit(pattern[i + 3]) && pattern[i + 4] == '}')
                {
                    addString();
                    addOperator(la_enum::SINGLE_CAPTURE, pattern, i, 5);
                    i += 4;
                }
                else
                {   // ...otherwise it counts as a string and is added to the buffer
                    characterBuffer.push_back(pattern[i]);
                }
            }
            else // If the symbol isn't one of those above
                // it counts as a string (operand) and is added to the buffer
            {
                characterBuffer.push_back(pattern[i]);
            }
        }
    }
    // check one last time if the buffer has content which is then added to the vectors
    addString();
    // prints tokens and lexemes
    for (int i = 0; i != lexemes.size(); i++)
    {
        std::cout << "Token: \"" << tokens[i] << "\" Lexeme: \"" << lexemes[i] << "\"" << std::endl;
    }
}

// checks if it is an operator with ONE symbol
bool lexical_analyzer::isSingleSymbol(char c)
{
    if (c == '+' || c == '*' || c == '(' || c == ')' || c == '.')
        return true;
    else
        return false;
}

// checks if the buffer has content
// which is then added as lexeme and token
void lexical_analyzer::addString()
{
    if (!characterBuffer.empty())
    {
        lexemes.push_back(characterBuffer);
        tokens.push_back(la_enum::STRING);
    }
}

// adds an operator as a lexeme in the 'lexeme' vector, and token in 'tokens' vector
void lexical_analyzer::addOperator(la_enum::token tok, std::string& str, int pos, int sz)
{
    lexemes.push_back(std::string(str, pos, sz));
    tokens.push_back(tok);
    characterBuffer.clear();
}

main.cpp

#include <iostream>

#include "lexical_analyzer.h"

int main()
{
    //std::string in;
    //std::getline(std::cin, in);
    //lexical_analyzer(std::move(in));

    lexical_analyzer("Hell. (MY)\I n..e (is+was) Melwin.\O{0}");

    return 0;
}

output:

Token: "0" Lexeme: "Hell"
Token: "5" Lexeme: "."
Token: "0" Lexeme: " "
Token: "3" Lexeme: "("
Token: "0" Lexeme: "MY"
Token: "4" Lexeme: ")"
Token: "0" Lexeme: "I n"
Token: "5" Lexeme: "."
Token: "5" Lexeme: "."
Token: "0" Lexeme: "e "
Token: "3" Lexeme: "("
Token: "0" Lexeme: "is"
Token: "1" Lexeme: "+"
Token: "0" Lexeme: "was"
Token: "4" Lexeme: ")"
Token: "0" Lexeme: " Melwin"
Token: "5" Lexeme: "."
Token: "0" Lexeme: "O"
Token: "6" Lexeme: "{0}"

You could use a scoped enum (en.cppreference.com/w/cpp/language/enum) rather than a namespaced one. And maybe some form of map to get from char to enum, rather than an enum defined in one place and then a big switch statement. The map might be more sophisticated, with lambdas/callbacks or similar, for a design which could scale to something more complex?
– Oliver Schönrock, Commented Jan 23, 2020 at 14:14
@OliverSchonrock I had been thinking about using a map in this project, and you might be on to something there. I will see what I can find. But you're saying I could remove the switch statement altogether and use a map instead?
– Hampus Lundberg, Commented Jan 23, 2020 at 14:34
I have not spent much time on the code, but yes, it looks like all the simple (non-default) cases could be handled in one or two lines with a map. Worth exploring. I agree with @ratchet freak that if the number of tokens stays at this level you don't need a map. It depends on the future of this code. For 200 tokens a map would probably be more maintainable.
– Oliver Schönrock, Commented Jan 23, 2020 at 14:52
Lexer tools are already prevalent. Have you thought about using "Lex"?
– Loki Astari, Commented Jan 27, 2020 at 23:52
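
The scoped-enum and lookup-table idea from the comments above might look roughly like the sketch below. It is only an illustration: the `Token` enumerator names and the helper `lookupSingleCharToken` are hypothetical and do not appear in the posted code, and only the single-character operators are covered.

```cpp
#include <unordered_map>

// Scoped enum: the enumerators do not leak into the enclosing scope,
// so the wrapper namespace from the question is no longer needed.
enum class Token
{
    String,
    And,
    CharRepeated,
    LeftParenthesis,
    RightParenthesis,
    AnyChar
};

// Hypothetical helper (not in the original code): looks up the token for a
// single-character operator. Returns false and leaves 'out' untouched when
// 'c' is not one of the operators in the table.
bool lookupSingleCharToken(char c, Token& out)
{
    static const std::unordered_map<char, Token> table = {
        {'+', Token::And},
        {'*', Token::CharRepeated},
        {'(', Token::LeftParenthesis},
        {')', Token::RightParenthesis},
        {'.', Token::AnyChar},
    };

    const auto it = table.find(c);
    if (it == table.end())
        return false;

    out = it->second;
    return true;
}
```

With a table like this, the five simple `case` labels collapse into one lookup, and the multi-character operators (`{N}`, `\I`, `\O{N}`) would still need their own handling.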

ratchet freak · Accepted Answer · 2020-01-23 14:41:53Z

You should clear characterBuffer at the end of addString instead of addOperator.

There is no need for characterBuffer to be a member field. Instead make it a local in addLexemes and pass it (by ref) when needed.

There is a very consistent pattern: every time you call addOperator, you call addString right before. Therefore you can put the addString call inside addOperator.

// first adds the string in 'precedingStringBuffer' as string if not empty
// then adds an operator as a lexeme in the 'lexeme' vector, and token in 'tokens' vector
void lexical_analyzer::addOperator(std::string& precedingStringBuffer, la_enum::token tok, const std::string& str, int pos, int sz)
{
    addString(precedingStringBuffer);
    lexemes.push_back(std::string(str, pos, sz));
    tokens.push_back(tok);
}
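
Putting the three suggestions together (clear the buffer in addString, keep the buffer local, call addString from addOperator), a self-contained sketch could look like this. It uses free functions and a hypothetical `TokenList` struct in place of the class, and a cut-down `Token` enum with only one operator, so it shows the shape of the change rather than a drop-in replacement.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Token { String, And };

// Hypothetical stand-in for the two index-synchronised vectors of the class.
struct TokenList
{
    std::vector<std::string> lexemes;
    std::vector<Token> tokens;
};

// Stores the buffered text as a String token (if there is any) and clears
// the buffer, so callers never have to remember to clear it themselves.
void addString(TokenList& out, std::string& buffer)
{
    if (buffer.empty())
        return;
    out.lexemes.push_back(buffer);
    out.tokens.push_back(Token::String);
    buffer.clear();
}

// Flushes any pending string first, then stores the operator.
void addOperator(TokenList& out, std::string& buffer, Token tok,
                 const std::string& pattern, std::size_t pos, std::size_t len)
{
    addString(out, buffer);
    out.lexemes.push_back(pattern.substr(pos, len));
    out.tokens.push_back(tok);
}

// Simplified lexing loop: only '+' is treated as an operator here.
TokenList lex(const std::string& pattern)
{
    TokenList out;
    std::string buffer; // local instead of a member field

    for (std::size_t i = 0; i != pattern.size(); ++i)
    {
        if (pattern[i] == '+')
            addOperator(out, buffer, Token::And, pattern, i, 1);
        else
            buffer.push_back(pattern[i]);
    }
    addString(out, buffer); // flush whatever is left at the end
    return out;
}
```

In this sketch `lex("is+was")` yields the lexemes `is`, `+`, `was`, and the separate isSingleSymbol check is no longer needed.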

If that is the number of tokens you will be using, then the switch is fine. You are unlikely to get anything better than what the compiler can generate for the switch.

For the single-character tokens you could use the ASCII value of the character as the value of the token, and values from 257 upwards for the multi-character tokens.
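
A minimal sketch of that numbering, assuming the same set of tokens as in the question. The enumerator names, the two helpers, and the choice to place `String` in the 257+ range alongside the multi-character operators are illustrative assumptions.

```cpp
// Single-character operators reuse the character's own code as their value;
// everything else starts at 257, above any single-byte character value.
enum class Token : int
{
    And              = '+',
    CharRepeated     = '*',
    LeftParenthesis  = '(',
    RightParenthesis = ')',
    AnyChar          = '.',

    String = 257,   // assumption: operands are numbered with the multi-char tokens
    Counter,        // {N}
    IgnoreCase,     // \I
    SingleCapture   // \O{N}
};

// Hypothetical helper: true for the characters that are operators on their own.
bool isSingleCharOperator(char c)
{
    switch (c)
    {
    case '+': case '*': case '(': case ')': case '.':
        return true;
    default:
        return false;
    }
}

// Hypothetical helper: valid only when isSingleCharOperator(c) is true.
// No table or switch is needed to find the token, just a cast.
Token tokenFromChar(char c)
{
    return static_cast<Token>(c);
}
```

One side effect is that the numeric value of a single-character token can be turned back into its lexeme with a plain cast to `char`.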

COUNTER and SINGLE_CAPTURE only allow a single digit between the braces; this can be troublesome if you ever need a value greater than 9 in there.
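
If more than one digit is ever wanted, the `{N}` check could be moved into a small helper that scans digits until the closing brace. The function name `counterLength` is hypothetical. The sketch also checks the index against the pattern length before every read, which the fixed `pattern[i + 2]` style lookups do not do.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: if a "{digits}" group starts at 'pos', returns the
// length of the whole group including both braces; otherwise returns 0.
std::size_t counterLength(const std::string& pattern, std::size_t pos)
{
    if (pos >= pattern.size() || pattern[pos] != '{')
        return 0;

    // Skip over all consecutive digits after the opening brace.
    std::size_t i = pos + 1;
    while (i < pattern.size() &&
           std::isdigit(static_cast<unsigned char>(pattern[i])))
    {
        ++i;
    }

    const bool hasDigits = i > pos + 1;
    const bool isClosed  = i < pattern.size() && pattern[i] == '}';
    return (hasDigits && isClosed) ? i - pos + 1 : 0;
}
```

A caller would add a COUNTER lexeme of the returned length when the result is non-zero, and otherwise push the `{` into the string buffer as before.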

Thank you, I will definitely add addString() in addOperator(), which lets me remove isSingleSymbol() altogether. Clearing characterBuffer in addString makes sense! Is changing characterBuffer to a local variable better? (I will look into it.) As for the last remark, I was planning on having a limit of 9 anyway, so anything else just becomes a string ({97} becomes a string, for example). And I can change the tokens to ASCII values as well, which could make things easier later! Thanks!
– Hampus Lundberg, Commented Jan 23, 2020 at 15:19
