Trying to extract a substring and version number from a filename using bash

Question

I'm currently trying to extract a substring and version number from a filename using bash.

There are two formats the filenames will be in:

example-substring-1.1.0.tgz
example-substring-1.1.0-branch-name.tgz

For the first scenario I was able to extract the version number using sed like so:

echo example-substring-1.1.0.tgz | sed "s/.*-\(.*\)\.[a-zA-Z0-9]\{3\}$/\1/"

However this won't work for the second scenario.

Eventually I would like to create a script that will store the first substring and version in an associative array like below.

example_array["example-substring"]="1.1.0"
example_array["example-substring"]="1.1.0-branch-name"

This is proving tricky, however, as I can't seem to find an approach that works for both scenarios. And where the version includes the branch name, I can't know beforehand how many words the branch name will consist of.

I think variable expansion may be the way to go but wasn't able to get it to output what I want.

Instead of (.*), use ([0-9.]*) to match numbers. Then you don't need to worry about what's after it. — Barmar, Commented Nov 10, 2023 at 17:06
BTW, you can use sed -r to use extended regexp without having to escape it so much. — Barmar, Commented Nov 10, 2023 at 17:07
could you have both formats, with the same prefix, occur at the same time? if the answer is 'yes' then the proposed associative array assignments will lead to a single array entry (ie, the 2nd assignment will overwrite the 1st assignment), in which case you'll need to decide how you wish to store both formats — markp-fuso, Commented Nov 10, 2023 at 17:15
could a file have multiple file extensions, eg, instead of *.tgz could you have *.tar.gz? — markp-fuso, Commented Nov 10, 2023 at 17:41
There's no reason a branch name couldn't contain strings with -<digits> in the branch-name part, e.g. example-substring-1.1.0-branch-1.2.3.tgz so you should include at least one of those in your sample input/output as that'd be an easy match to get wrong in a potential solution. There are probably other rainy day cases you should come up with too. — Ed Morton, Commented Nov 11, 2023 at 12:34
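
The overwrite that markp-fuso describes is easy to confirm. This is a minimal sketch, assuming bash 4 or later (for declare -A) and reusing the array name and the two assignments from the question:

```shell
#!/usr/bin/env bash
declare -A example_array

# Both assignments use the same key, so the second replaces the first.
example_array["example-substring"]="1.1.0"
example_array["example-substring"]="1.1.0-branch-name"

echo "entries: ${#example_array[@]}"
echo "value:   ${example_array[example-substring]}"
```

Only one entry survives, holding the value assigned last. If both formats can occur for the same prefix, the key or the stored value has to carry more information; which of the two is a design decision the question leaves open.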

Ed Morton · Accepted Answer · 2023-11-11 12:33:10Z

To be able to really test this we need sample input that contains more problematic cases, e.g. a string like -1.2.3 which looks like a version number appearing in the branch name:

$ cat file
example-substring-foo-1.1.0.tgz
example-substring-bar-1.1.0-branch-name.tgz
example-substring-rainy-1.1.0-branch-1.2.3.tgz

Normally I would do the pattern matching part in sed or awk, e.g. using any awk:

$ awk 'match($0,/-([0-9].*)\.[^.]+$/) {
    printf "\"%s\" \"%s\"\n", substr($0,1,RSTART-1), substr($0,RSTART+1)
}' file
"example-substring-foo" "1.1.0.tgz"
"example-substring-bar" "1.1.0-branch-name.tgz"
"example-substring-rainy" "1.1.0-branch-1.2.3.tgz"

rather than a shell loop but since you want to populate a shell array with the result anyway:

$ cat tst.sh
#!/usr/bin/env bash

declare -A example_array

while IFS= read -r ver; do
    if [[ $ver =~ -([0-9].*)\.[^.]+$ ]]; then
        example_array["${ver::-${#BASH_REMATCH[0]}}"]="${BASH_REMATCH[1]}"
    fi
done < "$@"

for idx in "${!example_array[@]}"; do
    printf 'example_array["%s"]="%s"\n' "$idx" "${example_array[$idx]}"
done

$ ./tst.sh file
example_array["example-substring-rainy"]="1.1.0-branch-1.2.3"
example_array["example-substring-bar"]="1.1.0-branch-name"
example_array["example-substring-foo"]="1.1.0"
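
If the filenames come from somewhere other than a file, the same regexp can be wrapped in a function. The following is only a sketch: split_name is a hypothetical helper name, README.md is an invented non-matching input, and the tab-separated output format is an assumption chosen so that the caller can read both parts back safely.

```shell
#!/usr/bin/env bash

# split_name is a hypothetical helper (not from the answer above).
# On a match it prints "<name><TAB><version>"; otherwise it returns 1.
split_name() {
    local file=$1
    local re='-([0-9].*)\.[^.]+$'    # same pattern as in tst.sh

    [[ $file =~ $re ]] || return 1
    # Strip the whole match (dash, version, extension) to leave the name.
    printf '%s\t%s\n' "${file%"${BASH_REMATCH[0]}"}" "${BASH_REMATCH[1]}"
}

declare -A example_array
for f in example-substring-foo-1.1.0.tgz \
         example-substring-bar-1.1.0-branch-name.tgz \
         example-substring-rainy-1.1.0-branch-1.2.3.tgz \
         README.md; do
    if IFS=$'\t' read -r name version < <(split_name "$f"); then
        example_array["$name"]=$version
    else
        echo "skipping $f: no version found" >&2
    fi
done

declare -p example_array
```

Names without a -<digit> part, such as README.md here, are reported on stderr and skipped rather than stored with an empty version.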

zdim · Accepted Answer · 2023-11-12 07:47:18Z

With Perl

echo "example-substring-1.1.0-branch-name.tgz" |
    perl -wne'print join " ", /(.+)\-([0-9]+\.[0-9]+\.[0-9]+.*)\.tgz/'

Prints two words

example-substring 1.1.0-branch-name

This output is what the Perl command returns to the shell script from which it would presumably be invoked, and the script can then build whatever structures it needs.† I also tested it without the branch name, and with a few other variations of the input string.

Since the example-substring can contain digits as well (why not?), and so can the branch name (why not?), the regex pattern has no restrictions and both the leading and (possible) trailing parts are matched simply by .+ and .*.

But then we need something more specific for the version number and I've used an assumption that it always consists of three numbers separated by dots. I've also assumed the fixed rest of the string, the file extension .tgz. These can be relaxed somewhat if needed.

† One can directly read a list (key value key value ...) into an associative array

#!/bin/bash

eval declare -A ver=( $( 
    echo "example-substring-1.1.0-branch-name.tgz" | 
    perl -wnE'say join " ", /(.+)\-([0-9]+\.[0-9]+\.[0-9]+.*)\.tgz/' ))

echo ${ver["example-substring"]}

Or it may be more suitable to assign to variables first

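str="example-substring-1.1.0-branch-name.tgz"

declare -A ver    # without this, ver[...] would be an indexed array

# The trailing backslash keeps the perl arguments on one command line.
# The pattern uses .* after the version so a missing branch name also matches.
read -r str val <<< $( 
    perl -wE'say join " ", $ARGV[0] =~ /(.+)\-([0-9]+\.[0-9]+\.[0-9]+.*)\.tgz/' \
        -- "$str" )

ver[$str]=$val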

or even just using positional parameters

set -- $(
    perl -wE'say join " ", $ARGV[0] =~ /(.+)\-([0-9]+\.[0-9]+\.[0-9]+.*)\.tgz/' \
        -- "$str" )

ver[$1]=$2

There are of course other ways to pass arguments to a Perl script or a command-line program ("one-liner"), and other ways to take its output in bash.
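
One of those other ways, sketched here as an illustration rather than as part of the answer, is to read the "name version" lines in a while read loop fed by process substitution, which avoids eval. The function emit_pairs is a hypothetical stand-in for the Perl command, the second pair is invented sample data, and the sketch assumes that neither field contains whitespace.

```shell
#!/usr/bin/env bash
declare -A ver

# Hypothetical stand-in for the perl one-liner: one "name version" pair per line.
emit_pairs() {
    printf '%s\n' \
        'example-substring 1.1.0-branch-name' \
        'other-substring 2.0.3'
}

# Process substitution keeps the loop in the current shell, so ver survives it.
while read -r key val; do
    [[ -n $key && -n $val ]] || continue    # ignore empty or incomplete lines
    ver["$key"]=$val
done < <(emit_pairs)

declare -p ver
```

Piping into the loop instead (emit_pairs | while ...) would run it in a subshell by default, and the array would be empty afterwards.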

Let me know if this Perl code needs commentary.

Carson · Accepted Answer · 2023-11-10 17:27:43Z

If you're willing to use grep instead of sed, then lookaheads and lookbehinds will allow you to define patterns to extract what you care about.

Consider the pattern .+(?=-\d+\.\d+\.\d+). It matches anything that is followed by -<numbers>.<numbers>.<numbers>. The (?=...) construct is a positive lookahead: an expression that must match the characters that follow, but is excluded from the final match. When used with your examples:

$ echo example-substring-1.1.0.tgz | grep -Po '.+(?=-\d+\.\d+\.\d+)'
example-substring
$ echo example-substring-1.1.0-branch-name.tgz | grep -Po '.+(?=-\d+\.\d+\.\d+)'
example-substring

(The -P flag enables Perl-compatible regular expressions, and the -o flag prints only the matched part of each line.)

Also consider the pattern (?<=-)\d+\.\d+\.\d+.*(?=\.tgz$). It uses a lookbehind to assert that a - comes immediately before the match, and a lookahead to assert that the match is followed by .tgz at the end of the line. When used with your examples:

$ echo 'example-substring-1.1.0.tgz' | grep -Po '(?<=-)\d+\.\d+\.\d+.*(?=\.tgz$)'
1.1.0
$ echo 'example-substring-1.1.0-branch-name.tgz' | grep -Po '(?<=-)\d+\.\d+\.\d+.*(?=\.tgz$)'
1.1.0-branch-name
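
To fill the associative array from the question, the two patterns can be combined in a loop. This is a sketch under assumptions: it needs a grep that supports -P (GNU grep built with PCRE support), the variable names name_re and version_re are made up for illustration, and the second filename uses a different prefix so the two keys do not collide.

```shell
#!/usr/bin/env bash
declare -A example_array

name_re='.+(?=-\d+\.\d+\.\d+)'
version_re='(?<=-)\d+\.\d+\.\d+.*(?=\.tgz$)'

for f in example-substring-1.1.0.tgz example-substring2-1.1.0-branch-name.tgz; do
    # One grep call per substring, since each call extracts one pattern's match.
    name=$(grep -Po "$name_re" <<< "$f") || continue
    version=$(grep -Po "$version_re" <<< "$f") || continue
    example_array["$name"]=$version
done

declare -p example_array
```

Each filename costs two grep processes here, which is fine for a handful of files but adds up on long lists.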

I was just going to start writing a grep -Po answer. It is exactly what should be used to extract complex substrings, IMO. Might have landed on '(?<=-)[\d.]+-.*(?=\.tgz$)' but you've done the hard work of getting the lookbehind/lookahead working. – stevesliva, Commented Nov 10, 2023 at 17:31
@stevesliva regarding "It is exactly what should be used to extract complex substrings" - if you use grep for a task like this where you need to output multiple substrings then you need to call it multiple times, and if you needed to use the non-portable GNU grep -P for PCREs then you may as well just use perl as that's arguably more likely to exist on any given system than GNU grep and then you have PCREs and don't need to call the command multiple times. So far I personally haven't actually come across a use for grep -P as you can do whatever you need with sed, awk, bash or perl. – Ed Morton, Commented Nov 13, 2023 at 13:00
@EdMorton perl -lne 's/regex/print $&/e' vs grep -Po 'regex'. I know it's possible but it's more inscrutable. (Or, perl -lne 'print $1 if /regex/' is prob more awkish and less sedish) – stevesliva, Commented Nov 13, 2023 at 14:19
@stevesliva but grep -Po regex isn't adequate as the OP needs to produce 2 matching strings so they need perl -lne 's/(regex1).*(regex2)/print $1\n$2/' or whatever the perl syntax is to print 2 capture groups. As shown in the above answer you'd need to call grep -Po twice on the same string to get 2 capture groups output, which is less than ideal and not necessary with other tools. – Ed Morton, Commented Nov 13, 2023 at 14:22

potong · Accepted Answer · 2023-11-11 15:35:18Z

This might work for you (GNU sed):

sed -E 's/^([^-]+-)+([0-9.]+).*\..*/\2/' file

This matches filenames made of one or more words separated by -'s, followed by digits separated by .'s, and ending in an extension preceded by a ., and returns the digits separated by .'s.
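
Applied to the two filename formats from the question, the command can be checked with a short loop. This is only a sketch, assuming a sed that accepts -E and a bash shell for the here-string:

```shell
#!/usr/bin/env bash
for f in example-substring-1.1.0.tgz example-substring-1.1.0-branch-name.tgz; do
    # \2 holds only the run of digits and dots, not what follows it.
    version=$(sed -E 's/^([^-]+-)+([0-9.]+).*\..*/\2/' <<< "$f")
    printf '%s -> %s\n' "$f" "$version"
done
```

Both lines report 1.1.0, so the branch name is dropped. If 1.1.0-branch-name is wanted as the version, as in the question's second array assignment, the pattern would need to capture more than the digits.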

pjh · Accepted Answer · 2023-11-11 19:55:11Z

It may be possible to do what you need with just Bash's built-in pattern matching. This Shellcheck-clean code demonstrates the idea:

#! /bin/bash -p

shopt -s extglob

files=( example-substring-1.1.0.tgz example-substring2-1.1.0-branch-name.tgz )

declare -A example_array

for f in "${files[@]}"; do
    base=${f%.*}    # remove suffix
    substring=${base%%-+([0-9]).*}
    example_array["$substring"]=${base#"$substring-"}
done

declare -p example_array

This outputs:

declare -A example_array=([example-substring2]="1.1.0-branch-name" [example-substring]="1.1.0" )

shopt -s extglob enables "extended globbing" (including patterns like +([0-9])). See the extglob section in glob - Greg's Wiki.
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of ${f%.*}, ${base%%-+([0-9]).*}, and ${base#"$substring-"}.
In general, declare -p var prints the value of a variable in an unambiguous way. It avoids loops, and their pitfalls, when printing the values of both kinds of arrays.
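
To see what each expansion contributes, the same three steps can be run on filenames where the branch name itself contains something that looks like a version. This is a sketch only: the filenames are borrowed from the sample input in another answer on this page, and the version variable and the printf line are additions for illustration.

```shell
#!/usr/bin/env bash
shopt -s extglob    # needed for +([0-9]) in the pattern below

declare -A example_array

for f in example-substring-foo-1.1.0.tgz \
         example-substring-bar-1.1.0-branch-name.tgz \
         example-substring-rainy-1.1.0-branch-1.2.3.tgz; do
    base=${f%.*}                        # shortest ".<anything>" suffix removed
    substring=${base%%-+([0-9]).*}      # longest "-<digits>.<anything>" suffix removed
    version=${base#"$substring-"}       # what follows "<substring>-"
    printf '%-48s -> %s | %s\n' "$f" "$substring" "$version"
    example_array["$substring"]=$version
done
```

Because %% removes the longest matching suffix, the split happens at the first -<digits>. in the name. That keeps 1.1.0-branch-1.2.3 together, but it also means a prefix that itself contains -<digits>. (for example a hypothetical tool-2.x-1.0.0.tgz) would be cut at that earlier point.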
