Please see my Linux bash script below. I can't get my target output with my current code: it keeps reading the whole of column 4 at once.

input_file.txt:

REV NUM |SVN PATH         | FILE NAME     |DOWNLOAD OPTIONS
1336    |svn/Repo/PROD    | test2.txt     |PROGRAM APPLICATION_SHORT_NAME="SQLGL" |
1334    |svn/Repo/PROD    | test.txt      |REQUEST_GROUP REQUEST_GROUP_NAME="Program Request Group" APPLICATION_SHORT_NAME="SQLGL" |

my code:

# /bin/bash
REV_NUM=($(awk -F "|" 'NR>1 {print $1}' input_file.txt))
COMPONENT=($(awk -F "|" 'NR>1 {print $3}' input_file.txt))
DL_OPS="$(awk -F "|" 'NR>1 {print $4}' input_file.txt)"

#LOOP
REV_NUM_COUNT=${#REV_NUM[*]}


for (( x=0 ; x<$REV_NUM_COUNT ; x++ ))
do
         echo "${COMPONENT[x]}  ${DL_OPS[x]}"
done

actual output:

Exporting Component from SVN . . .
test2.txt  PROGRAM APPLICATION_SHORT_NAME="SQLGL"
REQUEST_GROUP REQUEST_GROUP_NAME="Program Request Group" APPLICATION_SHORT_NAME="SQLGL"
test.txt

target output:

Exporting Component from SVN . . .  
test2.txt PROGRAM APPLICATION_SHORT_NAME="SQLGL"
test.txt REQUEST_GROUP REQUEST_GROUP_NAME="Program Request Group" APPLICATION_SHORT_NAME="SQLGL"

Thank you so much

7
  • Your script uses the array variable REV_NUM but it has not been defined. Commented Aug 4 at 4:02
  • 3
    DL_OPS is used as an array, but it’s not an array. Commented Aug 4 at 4:04
  • @HaukeLaging - I declared variable REV_NUM as REV_NUM=($(awk -F "|" 'NR>1 {print $1}' input_file.txt)) And I got the output as mentioned above Commented Aug 4 at 5:25
  • @Kusalananda - I tried declaring DL_OPS as DL_OPS=($(awk -F\| 'NR>1 {print $4}' input_file.txt)), and the output is below: Exporting Component from SVN . . . 1336 PROGRAM 1334 APPLICATION_SHORT_NAME="SQLGL" The characters after the space are not read. Commented Aug 4 at 5:28
  • Can any of your quoted strings contain |, e.g. could you have an input line like 1336 |svn/Repo/PROD | test2.txt |whatever="foo|bar" |? File names can contain | (and newlines!) so could you have a file name like this|that.txt so you get an input line like 1336 |svn/Repo/PROD | this|that.txt |whatever="foo|bar" |? Commented Aug 4 at 16:57

6 Answers

8

Read the data as pipe-delimited input instead:

while IFS='|' read -r revnum svnpath filename opts junk
do
        printf '%s: %s\n' "$filename" "$opts"
done < <(tail -n +2 file)

This would obviously retain all the extra spaces flanking most values, and would only work if the input fields do not contain embedded pipes.

Assuming all pipes are field delimiters, we can trim off the spaces by passing the read data from tail to sed -e 's/ *| */|/g' before the bash loop sees it.
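A minimal sketch of that tail, sed and read pipeline, for illustration only. It assumes the sample data from the question, which it recreates in a temporary file so that it can run on its own; the `input` and `output` variable names are not from the question.

```shell
#!/bin/bash
# Recreate the question's sample input in a temporary file (assumption:
# in real use you would point this at your own input_file.txt instead).
input=$(mktemp)
cat > "$input" <<'EOF'
REV NUM |SVN PATH         | FILE NAME     |DOWNLOAD OPTIONS
1336    |svn/Repo/PROD    | test2.txt     |PROGRAM APPLICATION_SHORT_NAME="SQLGL" |
1334    |svn/Repo/PROD    | test.txt      |REQUEST_GROUP REQUEST_GROUP_NAME="Program Request Group" APPLICATION_SHORT_NAME="SQLGL" |
EOF

# tail drops the header line, sed removes the spaces around every pipe,
# and read splits what is left on the pipes.
output=$(
  tail -n +2 "$input" |
    sed -e 's/ *| */|/g' |
    while IFS='|' read -r revnum svnpath filename opts junk
    do
      printf '%s: %s\n' "$filename" "$opts"
    done
)
printf '%s\n' "$output"
rm -f -- "$input"
```

Capturing the loop's output in a variable is only there to make the result easy to inspect; in a real script the loop body would run the command that needs those values.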

In any case, we don't need to read all the data before starting to process it: doing so slows the code down and costs a lot of memory and I/O if the data file is huge. Note that the code in the question reads all input data three times before starting the loop. (On the other hand, if the data was huge, we would not use a shell script to parse it, but a program written in Python, Perl, or a similar scripting language.)

Output of the code above, with sed inserted to clean up the spaces:

test2.txt: PROGRAM APPLICATION_SHORT_NAME="SQLGL"
test.txt: REQUEST_GROUP REQUEST_GROUP_NAME="Program Request Group" APPLICATION_SHORT_NAME="SQLGL"
6

If you do need a shell loop to process that, you could use read's IFS-splitting to split on |s instead of using awk:

#! /bin/bash -
shopt -s extglob # for +(...) ksh-style glob operator
trim() {
  typeset -n _var
  for _var do
    _var=${_var##+([[:space:]])}
    _var=${_var%%+([[:space:]])}
  done
}

{
  IFS= read -ru3 header_discarded
  while IFS='|' read -ru3 rev svn_path file download_options rest_if_any_discarded; do
    trim rev svn_path file download_options
    # do what you need with those variables
    typeset -p rev svn_path file download_options
  done
} 3< input_file.txt

Here, on your sample, that gives:

declare -- rev="1336"
declare -- svn_path="svn/Repo/PROD"
declare -- file="test2.txt"
declare -- download_options="PROGRAM APPLICATION_SHORT_NAME=\"SQLGL\""
declare -- rev="1334"
declare -- svn_path="svn/Repo/PROD"
declare -- file="test.txt"
declare -- download_options="REQUEST_GROUP REQUEST_GROUP_NAME=\"Program Request Group\" APPLICATION_SHORT_NAME=\"SQLGL\""

That input could also be seen as simple CSV with | as field separator, so you could preprocess it with something like:

mlr --csvlite --fs '|' --ho --ragged clean-whitespace then \
  cut -of 'REV NUM,FILE NAME,DOWNLOAD OPTIONS'

That would take care of extracting the fields you want, however they're positioned in the input, and of trimming the whitespace (you would then pipe the result to IFS='|' read -r rev file download_options).


As to why you're only getting the first word of each column, in:

REV_NUM=($(awk -F "|" 'NR>1 {print $1}' input_file.txt))

That unquoted $(...) used in list context (here the assignment to an array variable) is invoking the split+glob operator. The glob part you don't want so should be disabled (with set -o noglob), and the splitting part is done based on the list of characters in the $IFS variable.

By default, that contains space, tab and newline, but here you want to split on newline only.

While you could do IFS=$'\n', that would still not work if there were empty lines in the awk output as those would be discarded.
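A small sketch of that difference, using a made-up three-line sample whose middle line is empty (the `sample`, `split_words` and `all_lines` names are hypothetical). It compares newline-only splitting of an unquoted expansion with bash's readarray builtin:

```shell
#!/bin/bash
# Hypothetical sample: three lines, the middle one empty.
sample=$'first line\n\nthird line'

set -o noglob            # disable the glob half of split+glob
old_ifs=$IFS
IFS=$'\n'                # split on newline only
split_words=($sample)    # unquoted on purpose: the empty line is dropped
IFS=$old_ifs
set +o noglob

readarray -t all_lines <<< "$sample"   # keeps the empty line as an element

echo "split+glob: ${#split_words[@]} elements"
echo "readarray:  ${#all_lines[@]} elements"
```

Newline is an IFS whitespace character, so a run of consecutive newlines counts as a single separator. That is why the split yields two elements where readarray yields three.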

To store all the lines in an array, you'd use bash's readarray builtin:

# -t strips the trailing newline from each stored line
readarray -t -s1 rev < <(
  awk -F'[[:space:]]*[|][[:space:]]*' '{print $1}' input_file.txt
)

Or:

shopt -s lastpipe
# -t strips the trailing newline from each stored line
awk -F'[[:space:]]*[|][[:space:]]*' '{print $1}' input_file.txt |
  readarray -t -s1 rev

(-s1 skips the first line, same as using NR>1 in awk. We include whitespace¹ around the | in the field separator, though that would still not trim leading whitespace from the first field or trailing whitespace from the last field, if any.)
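As a sketch of how the loop from the question might look with readarray, here is one possible arrangement; it is not the only way to structure it. The assumptions are: the sample data is recreated in a temporary file, the array names `component` and `dl_ops` are lowercase stand-ins for the question's variables, NR>1 is used in awk instead of -s1, and the separator only matches plain spaces around the pipe (enough for the sample, and it sidesteps the character-class caveat in the footnote).

```shell
#!/bin/bash
# Recreate the question's sample input in a temporary file (assumption:
# in real use this would be your own input_file.txt).
input=$(mktemp)
cat > "$input" <<'EOF'
REV NUM |SVN PATH         | FILE NAME     |DOWNLOAD OPTIONS
1336    |svn/Repo/PROD    | test2.txt     |PROGRAM APPLICATION_SHORT_NAME="SQLGL" |
1334    |svn/Repo/PROD    | test.txt      |REQUEST_GROUP REQUEST_GROUP_NAME="Program Request Group" APPLICATION_SHORT_NAME="SQLGL" |
EOF

fs=' *[|] *'   # a pipe with any plain spaces around it

# One array element per input line; -t strips each trailing newline.
readarray -t component < <(awk -F"$fs" 'NR>1 {print $3}' "$input")
readarray -t dl_ops    < <(awk -F"$fs" 'NR>1 {print $4}' "$input")

echo "Exporting Component from SVN . . ."
for (( x = 0; x < ${#component[@]}; x++ )); do
  echo "${component[x]} ${dl_ops[x]}"
done
rm -f -- "$input"
```

Because every array element holds a whole line, the spaces inside the download options no longer break the values apart.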


¹ beware mawk doesn't support POSIX character classes, so on systems that still use that awk implementation, you'd need to replace [[:space:]] with an explicit list of whitespace characters to trim such as [ \t\r\v\f]; mawk also doesn't support multibyte characters, so you can't include non-ASCII whitespace characters in locales using UTF-8 either.

4

It may be overkill for the specific example you provided but I often find a good way to approach selecting input text to then pass to a shell command is to use awk to extract the text and then shell to call the external command, e.g.:

sep='|'
while IFS="$sep" read -r rev opts; do
    echo "$rev -> $opts"
done < <(
    awk -F"[[:space:]]*[$sep][[:space:]]*" -v OFS="$sep" 'NR>1{print $1, $4}' input_file.txt
)

Output:

1336 -> PROGRAM APPLICATION_SHORT_NAME="SQLGL"
1334 -> REQUEST_GROUP REQUEST_GROUP_NAME="Program Request Group" APPLICATION_SHORT_NAME="SQLGL"

That lets awk do what it does best, i.e. manipulate text, and shell do what it does best, i.e. sequence calls to tools.

The above assumes no |s or newlines within any of the fields.

3

DL_OPS is a single value, not an array. When you use array indexing with a non-array variable, index 0 is the value, other indexes are empty.
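A short sketch of that behaviour, using a made-up two-line string (the `dl_ops` name and its contents are hypothetical stand-ins for what the question's awk call returns):

```shell
#!/bin/bash
# A plain string holding two lines, like DL_OPS in the question.
dl_ops=$'PROGRAM A="x"\nREQUEST_GROUP B="y"'

echo "index 0: ${dl_ops[0]}"      # the whole two-line string
echo "index 1: '${dl_ops[1]}'"    # empty
echo "elements: ${#dl_ops[@]}"    # a scalar counts as one element
```

This matches the actual output in the question: the first loop iteration prints both option lines after test2.txt, and the second prints test.txt with nothing after it.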

If you just wrap () around the call to awk, it will split the output at the space characters, so each word will be a separate array element. You need to split it just at newlines, which you can do by setting IFS.

You should also disable filename wildcard expansion, in case the file contains wildcards that happen to match filenames.

set -o noglob
IFS=$'\n' DL_OPS=($(awk -F "|" 'NR>1 {print $4}' input_file.txt))
4
  • 1
    As mentioned in my answer, it doesn't split at any whitespace, it splits on characters of $IFS. By default, in bash, $IFS contains space, tab and newline which are some (of many) whitespace characters. And $(...) in list context in bash is subject to globbing which should also be disabled here with set -o noglob. Commented Aug 4 at 18:59
  • The fact that it splits on other whitespace characters is irrelevant here. The file contains spaces, and they're the problem. Setting IFS to just newline solves that. Commented Aug 4 at 19:19
  • But I've reworded to just say that it will also split on the spaces. Commented Aug 4 at 19:21
  • this worked for me. appreciate your help, thank you so much! Commented Aug 5 at 2:00
3

Maybe you can simplify the entire script:

awk -F\| ' BEGIN { print "Exporting Component from SVN . . .  "} NR>1 {print $3, $4}' input_file.txt 
5
  • Hi, By using the script you mentioned, I got the target output. But I really need this is for loop since I will be using the input files to get the values for a command. Thank you so much for your input Commented Aug 4 at 5:44
  • @user765641, please provide the command you want to exec. Based on the above script there is no problem to run command with parameters of result columns Commented Aug 4 at 5:54
  • This is the entire script: REV_NUM=($(awk -F "|" 'NR>1 {print $1}' input_file.txt)) REV_PATH=($(awk -F "|" 'NR>1 {print $2}' input_file.txt)) COMPONENT=($(awk -F "|" 'NR>1 {print $3}' input_file.txt)) DL_OPS=($(awk -F\| 'NR>1 {print $4}' input_file.txt)) REV_NUM_COUNT=${#REV_NUM[*]} for (( x=0 ; x<$REV_NUM_COUNT ; x++ )) do echo "${REV_NUM[x]} ${COMPONENT[x]} ${DL_OPS[x]}" done - Output: Exporting Component from SVN . . . 1336 test2.txt PROGRAM 1334 test.txt APPLICATION_SHORT_NAME="SQLGL" - $4 did not include the chars after the space Commented Aug 4 at 6:03
  • 1
    @user765641, please add the code in question and format it. Commented Aug 4 at 7:11
  • 1
    @user765641 You can invoke commands directly from within an Awk in several ways: system () to run a command; pipe data to a specified command; pipe a stream of commands to a shell. You can even send them off to a command that will run your multiple commands in parallel. Commented Aug 4 at 9:41
3
#!/bin/bash
INFILE=input_file.txt

# Read each data line whole; the || test also catches a last line with no trailing newline.
while IFS='' read -r LINE || [ -n "${LINE}" ]; do
    awk -F\| '{print $3, $4}' <<< "${LINE}"
done < <(tail -n +2 "$INFILE")

Output:

 test2.txt      PROGRAM APPLICATION_SHORT_NAME="SQLGL" 
 test.txt       REQUEST_GROUP REQUEST_GROUP_NAME="Program Request Group" APPLICATION_SHORT_NAME="SQLGL" 

If you want to keep the variables:

#!/bin/bash
INFILE=input_file.txt

while IFS='' read -r LINE || [ -n "${LINE}" ]; do
    # Quote the here-strings so the line reaches awk unchanged.
    COMPONENT="$(awk -F\| '{print $3}' <<< "${LINE}")"
    DL_OPS="$(awk -F\| '{print $4}' <<< "${LINE}")"
    echo "$COMPONENT $DL_OPS"
done < <(tail -n +2 "$INFILE")
