
I acknowledge there are superficially similar questions asked here before, but all of those I've seen are simpler than what I'm trying to achieve. Bash-only solutions are preferred.

I have a variable containing a string that looks like a comparison of some kind, and I'd like to split it into an array. The following are some examples, including how I'd like them to be split:

var='name="value"'                # arr=([0]=name [1]='=' [2]=value)
var="name != '!value='"           # arr=([0]=name [1]='!=' [2]='!value=')
var='"na=me" = value'             # arr=([0]=na=me [1]='=' [2]=value)
var='name >= value'               # arr=([0]=name [1]='>=' [2]=value)
var='name'                        # arr=([0]=name)
var='name = "escaped \"quotes\""' # arr=([0]=name [1]='=' [2]=escaped\ \"quotes\")
var="name = \"nested 'quotes'\""  # arr=([0]=name [1]='=' [2]=nested\ \'quotes\')
var="name = 'nested \"quotes\"'"  # arr=([0]=name [1]='=' [2]=nested\ \"quotes\")

You get the picture. Either side (or neither) may be quoted, with either single or double-quotes. There might be escaped or otherwise nested quotes. The operator between them can be any of a predefined set, but they may also be included within the quoted strings. There may or may not be spaces. There may be no operator at all.

I have to parse a lot of lines, and therefore I'd prefer not to fork a new process each time, which is why Bash-only solutions are preferred. This is an addition to an existing Bash script that does not need to be portable to other shells, and it's running on Bash 5.2, so I do have access to modern Bash features that may be helpful.

IFS=\" read -a arr <<<"$var" is nice in that it understands how to handle escaped quotes, and if I only had to deal with either single or double quotes and not both, I could make this work. As it stands, I'm just hoping I don't have to write a whole tokenizer algorithm in shell script, and that there's some combination of features I haven't considered which can parse this reliably.
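To illustrate the behaviour described above, here is a minimal sketch of what the read approach does and where it stops working. The wrapper name split_dq is made up for this example; it is not part of the existing script.

```shell
#!/bin/bash
# Hypothetical wrapper around the read trick: split $1 on unescaped double
# quotes only. Without -r, read treats \" as a literal quote, not a delimiter.
split_dq() {
    IFS=\" read -a arr <<<"$1"
}

split_dq 'name = "escaped \"quotes\""'
declare -p arr    # two fields: 'name = ' and 'escaped "quotes"'

split_dq "name != '!value='"
declare -p arr    # a single field: single quotes are not delimiters here
```

The operator still has to be separated from the first field afterwards, and single-quoted values are not split at all, which is why this alone does not cover the examples.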

  • The way to avoid forking a new process for each line would be to pass all the input lines to a single run of a tool that parses the quoted strings and outputs the same data in some other, more suitable format. E.g. turn "foo" ="bar" into foo#bar, where the # is some character that will not appear in the actual data, e.g. a control character. (Say, tab; or \x1F, the ASCII unit separator; or just something random like \x01.) I would expect Bash in particular to be rather slow at single-char fiddling like this. Commented Feb 1 at 22:16
  • That does seem to be a sensible suggestion if attempting to handle all this in the same bash script proves too unwieldy or slow. Commented Feb 1 at 22:23
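The single-process idea from the comments could look roughly like the sketch below. Everything here is an assumption made for illustration: to_fields is a hypothetical stand-in for "the tool", implemented with one sed run, and it only understands the simplest shape (both sides double-quoted, no escapes). It also assumes the unit separator never occurs in the data.

```shell
#!/bin/bash
sep=$'\x1f'    # ASCII unit separator, assumed never to occur in the data

# Hypothetical stand-in for "the tool": one sed process converts every line.
# It only handles "lhs" op "rhs" with double quotes and no escapes.
to_fields() {
    sed -E 's/^"([^"]*)" *([=!<>]+) *"([^"]*)"$/\1'"$sep"'\2'"$sep"'\3/'
}

rows=()
# One fork for the whole input; the loop itself only uses builtins.
while IFS=$sep read -r lhs op rhs; do
    rows+=("$lhs|$op|$rhs")
done < <(printf '%s\n' '"foo" ="bar"' '"a b" != "c d"' | to_fields)

printf '%s\n' "${rows[@]}"
```

The point of the design is that the per-line work in Bash shrinks to a plain read on a separator, while the quoting rules live in whatever tool replaces to_fields.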

2 Answers


You need to write a parser: read the string character by character and, based on the current character, either extend the current word or start a new one. Keep a flag to indicate whether the parser is inside a quoted string.

Something like this:

#!/bin/bash
set -eu

validate() {
    size=$1
    shift

    if ((size != $#)) ; then
        echo "Not OK # Wrong size: $size $#"
        return
    fi

    ok=1
    for ((j=1; j <= size; ++j)) ; do
        [[ ${!j} = "${arr[j-1]}" ]] || ok=0   # quote the right side: compare literally, not as a pattern
    done
    if ((ok)) ; then
        echo $i OK
    else
        echo $i Not OK
    fi
}


i=0
for var in 'name="value"'                \
           "name != '!value='"           \
           '"na=me" = value'             \
           'name >= value'               \
           'name'                        \
           'name = "escaped \"quotes\""' \
           "name = \"nested 'quotes'\""  \
           "name = 'nested \"quotes\"'"  \
; do
    # Phase 1: read the left operand, which may be quoted.
    arr=()
    left=""
    quoted=""
    while ! (( ${#arr[@]} )) && [[ $var ]] ; do
        char=${var:0:1}
        var=${var:1}
        if [[ $char = [\'\"] ]] ; then
            if [[ -z $left ]] ; then
                quoted=$char
            elif [[ $quoted = $char ]] ; then
                quoted=${quoted:0:-1}
                arr=("$left")
            else
                echo 'Unexpected quote' >&2
                exit 1
            fi
        elif [[ $char = [\ =!\>] && -z $quoted ]] ; then
            arr=("$left")
            if [[ $char != ' ' ]] ; then
                var=$char$var
            fi
        else
            left+=$char
        fi
    done
    arr=("$left")

    # Phase 2: collect the operator characters, skipping surrounding spaces.
    op=""
    arr[1]=""
    while [[ $var && ! ${arr[1]} ]] ; do
        char=${var:0:1}
        var=${var:1}
        if [[ $char = [=\<\>\!] ]] ; then
            op+=$char
        elif [[ $char = ' ' ]] ; then
            if [[ $op ]] ; then
                arr[1]=$op
            else
                :
            fi
        else
            arr[1]=$op
            var=$char$var
        fi
    done
    [[ -z ${arr[1]} ]] && unset 'arr[1]'   # quoted so the subscript is never glob-expanded

    # Phase 3: read the right operand, decoding escaped quotes inside a quoted string.
    if [[ $var ]] ; then
        quoted=""
        right=""
        while [[ $var ]] ; do
            char=${var:0:1}
            var=${var:1}
            if [[ $quoted ]] ; then
                if [[ $char = ${quoted: -1} ]] ; then
                    quoted=${quoted:0:-1}
                elif [[ $char = \\ ]] ; then
                    nextchar=${var:0:1}
                    if [[ $nextchar = ${quoted: -1} ]] ; then
                        right+=$nextchar
                        var=${var:1}
                    fi
                else
                    right+=$char
                fi
            elif [[ $char = [\"\'] ]] ; then
                quoted+=$char
            else
                right+=$char
            fi
        done
        arr+=("$right")
    fi

    case $i in
        (0) exp=(name = value) ;;
        (1) exp=(name '!=' '!value=') ;;
        (2) exp=(na=me = value) ;;
        (3) exp=(name '>=' value) ;;
        (4) exp=(name) ;;
        (5) exp=(name = 'escaped "quotes"') ;;
        (6) exp=(name = "nested 'quotes'") ;;
        (7) exp=(name = 'nested "quotes"') ;;
        (*) exit 1 ;;
    esac

    validate ${#arr[@]} "${exp[@]}"

    ((++i))
done

It correctly parses all the examples you gave, but it is far from finished (for example, it doesn't check for unclosed quotes).
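One way to add a check for unclosed quotes is a separate pass before parsing. The following is only a sketch: quotes_balanced is a hypothetical helper, not part of the script above, and it assumes that a backslash inside quotes escapes the next character.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if every quote opened in $1 is closed again.
# Assumption: inside quotes, a backslash escapes the character that follows it.
quotes_balanced() {
    local s=$1 quoted="" char i
    for ((i = 0; i < ${#s}; ++i)); do
        char=${s:i:1}
        if [[ -z $quoted ]]; then
            # Outside quotes: a quote character opens a quoted string.
            if [[ $char == [\"\'] ]]; then
                quoted=$char
            fi
        elif [[ $char == \\ ]]; then
            i=$((i + 1))            # skip the escaped character
        elif [[ $char == "$quoted" ]]; then
            quoted=""               # matching quote closes the string
        fi
    done
    [[ -z $quoted ]]                # non-zero status: a quote was left open
}

quotes_balanced 'name = "value"' && echo balanced
quotes_balanced 'name = "value'  || echo 'unclosed quote' >&2
```

Calling it at the top of the loop body, for instance as quotes_balanced "$var" || continue, would skip malformed lines instead of producing a half-parsed array.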

  • I appreciate the effort you've gone through to implement this for me! Unfortunately this is, in fact, exactly the complex, hand-written scripting I was desperately hoping to avoid with some clever trick exploiting some fancy bash features. I'll give this time to see if anyone else manages to pull that particular rabbit out of a hat, but if not, I'll accept this as the best answer. Commented Feb 1 at 22:27
  • @choroba sounds like a task for Perl! Commented Feb 27 at 17:17
  • @jubilatious1: With things like Marpa::R2 or Parse::RecDescent, it would be much easier. Commented Feb 27 at 17:19

As @choroba pointed out, you probably can't avoid writing a lexer to split your input strings. Fortunately, "scanning" them token by token with an ERE is enough. I'd say that using a language with "non-capturing" and "named" groups would be the best choice, but if you're stuck with Bash then here's how you can do it:

edit: moved some improvements previously mentioned in the "for the reader to fix" section to here.

#!/bin/bash

vn='[[:alnum:]_]+'                    # a varname token
sq="'[^']*'"                          # a single-quoted string token
dq='"(\\.|[^"\\])*"'                  # a double-quoted string token
op="[^[:space:][:alnum:]_\"']+"       # an operator token

for var in ...; do

arr=()
while [[ $var =~ ^[[:space:]]*($vn|$sq|$dq|$op) ]]
do
    var=${var:${#BASH_REMATCH[0]}}    # remove the matched part from $var
    tok=${BASH_REMATCH[1]}            # get the matched token
    case ${tok:0:1} in
    ( \" ) tok=${tok//\\\"/\"} ;&     # decode the double-quoted strings
    ( \' ) tok=${tok:1:-1}     ;;     # unquote the quoted strings
    esac
    arr+=("$tok")
done

[[ $var =~ ^[[:space:]]*$ ]] || exit  # exit on parsing error

declare -p arr

done

note: requires bash 4.3+

output:

declare -a arr=([0]="name" [1]="=" [2]="value")
declare -a arr=([0]="name" [1]="!=" [2]="!value=")
declare -a arr=([0]="na=me" [1]="=" [2]="value")
declare -a arr=([0]="name" [1]=">=" [2]="value")
declare -a arr=([0]="name")
declare -a arr=([0]="name" [1]="=" [2]="escaped \"quotes\"")
declare -a arr=([0]="name" [1]="=" [2]="nested 'quotes'")
declare -a arr=([0]="name" [1]="=" [2]="nested \"quotes\"")

For the reader to fix:

  • I've made assumptions as to what a "varname" and an "operator" are. Basically, a "varname" is composed of alphanumeric/underscore characters, and an "operator" is any run of non-space characters that is neither a "varname" nor a quoted string.

  • While the regex consumes any backslash escape sequence present in a double-quoted string, only \" is interpreted; you may need to implement the decoding of other escape sequences as well.
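Both points can be addressed with small changes. The sketch below is one possible variant, not a definitive version: the function name tokenize and the operator list are assumptions made for illustration (substitute your own set), and only the \" and \\ escapes are decoded.

```shell
#!/bin/bash
vn='[[:alnum:]_]+'              # a varname token
sq="'[^']*'"                    # a single-quoted string token
dq='"(\\.|[^"\\])*"'            # a double-quoted string token
op='==|!=|<=|>=|=|<|>'          # hypothetical fixed operator set, longest first

# Hypothetical wrapper: fills the global array arr with the tokens of $1 and
# returns non-zero if some input could not be tokenized.
tokenize() {
    local s=$1 tok bs='\' dquote='"'
    local re="^[[:space:]]*($vn|$sq|$dq|$op)"
    arr=()
    while [[ $s =~ $re ]]; do
        s=${s:${#BASH_REMATCH[0]}}              # drop the matched part
        tok=${BASH_REMATCH[1]}                  # the token without leading spaces
        case ${tok:0:1} in
            \")
                tok=${tok:1:-1}                         # strip the quotes
                tok=${tok//"$bs$dquote"/"$dquote"}      # decode \" first
                tok=${tok//"$bs$bs"/"$bs"}              # then decode \\
                ;;
            \')
                tok=${tok:1:-1}                         # strip the quotes
                ;;
        esac
        arr+=("$tok")
    done
    [[ $s =~ ^[[:space:]]*$ ]]                  # leftover text means a parse error
}

tokenize 'name >= "a \"b\" \\ c"' && declare -p arr
```

Decoding \" before \\ is safe here because the regex only accepts a double quote inside the string when it is the second character of an escape pair. Anything outside the operator list, such as ~, now stops the scan and makes tokenize return a non-zero status.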

  • I would prefer a lot of things over Bash, but... mistakes were made, and now we're doing this in Bash. I appreciate the relative simplicity of this solution. I may be able to tweak the regex to meet my needs more precisely, but I'll definitely give this approach a try with my test data and see how it fares. Commented Feb 2 at 3:18
  • I'm surprised by how well regex tokenizing with BASH_REMATCH works for this. I've had to make a small tweak to handle backslash-escaped backslashes, and I've modified the regex to replace your definition of an operator with one that more specifically matches my set, but this solves my problem neatly. Thank you! Commented Feb 2 at 3:44
