Revisions to Awk prints only first word on the field column

added 444 characters in body

edited Aug 4 at 18:56

#! /bin/bash -
shopt -s extglob # for +(...) ksh-style glob operator
trim() {
  typeset -n _var
  for _var do
    _var=${_var##+([[:space:]])}
    _var=${_var%%+([[:space:]])}
  done
} 

{
  IFS= read -ru3 header_discarded
  while IFS='|' read -ru3 rev svn_path file download_options rest_if_any_discarded; do
    trim rev svn_path file download_options
    # do what you need with those variables
    typeset -p rev svn_path file download_options
  done
} 3< input_file.txt

declare -- rev="1336"
declare -- svn_path="svn/Repo/PROD"
declare -- file="test2.txt"
declare -- download_options="PROGRAM APPLICATION_SHORT_NAME=\"SQLGL\""
declare -- rev="1334"
declare -- svn_path="svn/Repo/PROD"
declare -- file="test.txt"
declare -- download_options="REQUEST_GROUP REQUEST_GROUP_NAME=\"Program Request Group\" APPLICATION_SHORT_NAME=\"SQLGL\""

That input could also be seen as simple CSV with | as field separator, so you could preprocess it with something like:

mlr --csvlite --fs '|' --ho --ragged clean-whitespace then \
  cut -of 'REV NUM,FILE NAME,DOWNLOAD OPTIONS'

That would take care of extracting the fields you want, wherever they are positioned in the input, and of trimming the whitespace (you would then pipe the output to IFS='|' read -r rev file download_options).

added 865 characters in body

edited Aug 4 at 7:26

Stéphane Chazelas

shopt -s lastpipe
awk -F'[[:space:]]*[|][[:space:]]*' '{print $1}' |
  readarray -s1 rev

(-s1 skips the first line (same as using NR>1 in awk); we include whitespace¹ around the | in the field separator, though that would still not trim the leading whitespace of the first field or the trailing whitespace of the last field, if any).

¹ Beware that mawk doesn't support POSIX character classes, so on systems that still use that awk implementation, you'd need to replace [[:space:]] with an explicit list of whitespace characters to trim, such as [ \t\r\v\f]; mawk also doesn't support multibyte characters, so you can't include non-ASCII whitespace characters in locales using UTF-8 either.
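As a rough illustration of what that field separator does and does not trim, here is a small sketch. The sample line is made up, and the separator uses an explicit [ \t] list (in the spirit of the footnote) rather than [[:space:]] so that it does not depend on the awk implementation.

```bash
#!/bin/bash
# Hypothetical input line, padded with blanks at both ends.
line='  1336 | svn/Repo/PROD | test2.txt  '

# Explicit-list variant of the separator: optional blanks, a |, optional blanks.
fs='[ \t]*[|][ \t]*'

# Wrap each field in <...> to make any remaining whitespace visible.
first=$(awk -F"$fs" '{print "<" $1 ">"}' <<< "$line")
second=$(awk -F"$fs" '{print "<" $2 ">"}' <<< "$line")
last=$(awk -F"$fs" '{print "<" $NF ">"}' <<< "$line")

printf '%s\n' "$first" "$second" "$last"
```

The middle field comes out clean, while the first field keeps its leading blanks and the last field keeps its trailing ones, because no | sits next to them for the separator to absorb.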

added 865 characters in body

edited Aug 4 at 7:18

Stéphane Chazelas

If you do need a shell loop to process that, you could use read's IFS-splitting to split on |s instead of using awk:

#! /bin/bash -
shopt -s extglob # for +(...) ksh-style glob operator
trim() {
  typeset -n _var
  for _var do
    _var=${_var##+([[:space:]])}
    _var=${_var%%+([[:space:]])}
  done
}
{
  IFS= read -ru3 header_discarded
  while IFS='|' read -ru3 rev svn_path file download_options rest_if_any_ignored; do
    trim rev svn_path file download_options
    # do what you need with those variables
    typeset -p rev svn_path file download_options
  done
} 3< input_file.txt

Here, on your sample, that gives:

declare -- rev="1336"
declare -- svn_path="svn/Repo/PROD"
declare -- file="test2.txt"
declare -- download_options="PROGRAM APPLICATION_SHORT_NAME=\"SQLGL\""
declare -- rev="1334"
declare -- svn_path="svn/Repo/PROD"
declare -- file="test.txt"
declare -- download_options="REQUEST_GROUP REQUEST_GROUP_NAME=\"Program Request Group\" APPLICATION_SHORT_NAME=\"SQLGL\""
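To see the two steps in isolation, here is a minimal sketch on a single made-up line (the sample text and the `rest` variable name are illustrative, not taken from the question). It assumes bash 4.3 or newer for the nameref used by trim. Since | is not a whitespace character, read leaves the blanks around each field in place, which is why the separate trim step is needed; the last variable receives whatever remains of the line, delimiters included.

```bash
#!/bin/bash
shopt -s extglob # needed for the +(...) pattern used by trim

# Same trim function as above: each argument is a variable *name*;
# the nameref _var aliases it, so the assignments modify the caller's variable.
trim() {
  typeset -n _var
  for _var do
    _var=${_var##+([[:space:]])}
    _var=${_var%%+([[:space:]])}
  done
}

# Hypothetical input line with more fields than variables.
line=' 1336 | svn/Repo/PROD | test2.txt | PROGRAM | extra | more'

IFS='|' read -r rev svn_path file download_options rest <<< "$line"
printf 'before trim: <%s>\n' "$rev" "$file" "$rest"

trim rev svn_path file download_options
printf 'after trim:  <%s>\n' "$rev" "$file"
```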

As to why you're only getting the first word of each column, in:

REV_NUM=($(awk -F "|" 'NR>1 {print $1}' input_file.txt))

That unquoted $(...), used in list context (here the assignment to an array variable), invokes the split+glob operator. You don't want the glob part, so it should be disabled (with set -o noglob); the splitting part is done on the characters listed in the $IFS variable.

By default, that contains space, tab and newline, but here you want to split on newline only.
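The effect can be reproduced without awk. In this sketch the `output` function is a hypothetical stand-in for the awk command, printing two lines that contain spaces; with the default $IFS, the unquoted $(...) produces one array element per word rather than one per line.

```bash
#!/bin/bash
# Hypothetical stand-in for the awk output: two lines, each containing spaces.
output() {
  printf '%s\n' \
    'PROGRAM APPLICATION_SHORT_NAME="SQLGL"' \
    'REQUEST_GROUP REQUEST_GROUP_NAME="Program Request Group"'
}

# Make the default explicit: space, tab and newline.
IFS=$' \t\n'

# Unquoted command substitution in list context: split on every blank.
words=($(output))

echo "${#words[@]} elements"
printf '<%s>\n' "${words[@]}"
```

Two lines of output end up as six elements, since every space is a split point just like the newline.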

While you could do IFS=$'\n', that would still not work if there were empty lines in the awk output as those would be discarded.
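A small sketch of that limitation, using a hypothetical `output` function in place of awk. Newline is an IFS whitespace character, so consecutive newlines count as a single separator and the empty line disappears; readarray is shown next to it for comparison (the -t option, which strips the trailing newline from each element, is an addition made for this example).

```bash
#!/bin/bash
# Hypothetical output with an empty line in the middle.
output() {
  printf '%s\n' 'first line' '' 'third line'
}

# Split+glob restricted to newlines, with globbing disabled.
set -o noglob
IFS=$'\n'
split=($(output))
set +o noglob
IFS=$' \t\n' # restore the default

# readarray stores one element per line, empty or not.
readarray -t lines < <(output)

echo "split+glob: ${#split[@]} elements"
echo "readarray:  ${#lines[@]} elements"
```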

To store all the lines in an array, you'd use bash's readarray builtin:

readarray -s1 rev < <(
  awk -F'[[:space:]]*[|][[:space:]]*' '{print $1}'
)

Or:

shopt -s lastpipe
awk -F'[[:space:]]*[|][[:space:]]*' '{print $1}' |
  readarray -s1 rev

(-s1 skips the first line (same as using NR>1 in awk); we include whitespace around the | in the field separator, though that would still not trim the leading whitespace of the first field or the trailing whitespace of the last field, if any).
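Here is a minimal sketch of the pipeline form, with a hypothetical `input` function standing in for awk. It assumes it runs as a script: lastpipe only takes effect when job control is off, which is the case in non-interactive shells. The -t option is an addition for this example; without it, each array element keeps its trailing newline.

```bash
#!/bin/bash
# Hypothetical three-line input: a header followed by two values.
input() {
  printf '%s\n' 'REV NUM' '1336' '1334'
}

# With lastpipe, the last component of a pipeline runs in the current
# shell, so the array filled by readarray is still set afterwards.
shopt -s lastpipe
input | readarray -s1 -t rev # -s1 drops the header line

typeset -p rev
```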

added 37 characters in body

edited Aug 4 at 6:55

Stéphane Chazelas

answered Aug 4 at 6:49

Stéphane Chazelas
