Zero/Nul separator breaks column command

Question

Problem

I want to parse some data structured as lines (\n separated) with fields separated by the NUL character \0.

Many Linux commands handle this separator through options such as -print0 for find or -0 for xargs, or by defining the separator as \0 for gawk.

I can't figure out how to make column interpret NUL as a separator.

Example

If you generate the following set of data (2 lines with 3 columns, separated by \0):

echo -e "line1\nline2" | awk 'BEGIN {OFS="\0"} {print $1"columnA",$1"columnB",$1"columnC"}'

You get the following expected output (the \0 separators are not displayed, but they separate each field):

line1columnAline1columnBline1columnC
line2columnAline2columnBline2columnC
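
One way to confirm that the invisible separators are really present, as a minimal sketch assuming tr and wc are available, is to delete every byte except NUL and count what remains. The sample record below is built with printf rather than awk so that the sketch does not depend on how a given awk handles "\0"; the function and variable names are illustrative only.

```shell
# One record with three fields separated by NUL bytes (sample data only).
record() { printf 'line1columnA\0line1columnB\0line1columnC\n'; }

# Keep only the NUL bytes and count them; tr -d ' ' strips the padding
# that some wc implementations add around the number.
nul_count=$(record | tr -cd '\0' | wc -c | tr -d ' ')
echo "NUL bytes in record: $nul_count"
```

Three fields give two separators. Piping the same data through od -c would show each separator as \0.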

But when I use column to display these columns, the output for some reason shows only the first column, despite my passing \0 as the separator:

echo -e "line1\nline2" | awk 'BEGIN {FS="\0"; OFS="\0"} {print $1"columnA",$1"columnB",$1"columnC"}' | column -s '\0'

line1columnA    line2columnA

Actually, even without providing the delimiter, column seems to break on the NUL character:

echo -e "line1\nline2" | awk 'BEGIN {FS="\0"; OFS="\0"} {print $1"columnA",$1"columnB",$1"columnC"}' | column

line1columnA    line2columnA

Question

Is there a way to use \0 as a field/column separator in column?
Optional bonus question: Why does column behave like this? (I would expect the \0 to be ignored entirely if it is not handled, and the whole line to be printed as a single field.)
Optional bonus question 2: Some data in these columns will be file paths, and I wanted to use \0 as a best practice. Do you have a better practice to recommend for storing "random strings" in a file without having to escape any field separator character they may contain?

Maybe do what CSV does: surround fields that contain the column separator with double quotes, and allow double quotes inside quoted fields to be escaped as either "" or \". In fact, why not use CSV? It's easy to work with, and most languages have decent libraries for parsing and outputting well-formed CSV.
– cas, Commented Apr 11, 2021 at 3:35
Leave FS="\0" as is, and change OFS to " " (white space). Then change "column -s '\0'" to "column -t".
– Cinaed Simson, Commented Apr 11, 2021 at 4:44
@cas, it's a good idea. Do you have any recommendation for a CSV tool on Linux? It's for a shell script, so ideally a standard tool that is already installed would be preferred over a better but obscure third-party library.
– Pierre-Jean, Commented Apr 11, 2021 at 7:16
@CinaedSimson: in this example it should work, but if one of your columns contains a space, you lose the benefit of the NUL byte separator and the end result breaks.
– Pierre-Jean, Commented Apr 11, 2021 at 7:19

LL3 · Accepted Answer · 2021-04-11 17:53:12Z

Is there a way to use \0 as a field/column separator in column?

No, because both implementations of column that I am aware of, the historical BSD one and the one in the util-linux package, use the standard C library's string manipulation functions to parse input lines, and those functions assume that strings are NUL-terminated. In other words, a NUL byte always marks the end of a string.

Optional bonus question: Why does column behave like this? (I would expect the \0 to be ignored entirely if it is not handled, and the whole line to be printed as a single field.)

On top of what I explained above, note that the -s option expects literal characters. It does not parse escape syntax such as \0 (nor \n, for that matter). This means that you told column to treat both \ and 0 as valid separators for its input.

You can provide escape sequences through the $'' string syntax if you are using one of the many shells that support it (e.g. it is available in bash but not in dash). So for instance column -s $'\n' would be valid (to specify a <newline> as column separator) if run by one of those shells.
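
The difference between a quoted literal and a real control character can be checked directly. This is a small sketch; the variable names are illustrative only, and the column call is guarded because column may not be installed everywhere.

```shell
literal='\0'                 # two characters: a backslash and a zero
portable_tab=$(printf '\t')  # one real tab, built in a way any POSIX shell accepts
ansi_tab=$'\t'               # one real tab, but only in shells that support $''

echo "length of literal: ${#literal}"
echo "length of portable_tab: ${#portable_tab}"

# Use the real tab as the separator for a small two-column table.
if command -v column >/dev/null 2>&1; then
    printf 'name\tvalue with spaces\nx\t1\n' | column -t -s "$portable_tab"
fi
```

The printf form is a fallback for shells such as dash, where $'\t' is not expanded.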

As a side note, it's not clear to me what you'd expect from column. Even if it did support NUL as a separator, it would just turn each line of that input into a whole column on output. Perhaps you also wanted to use -t, so as to columnize the individual fields of each line?

Optional bonus question 2: Some data in these columns will be file paths, and I wanted to use \0 as a best practice. Do you have a better practice to recommend for storing "random strings" in a file without having to escape any field separator character they may contain?

The only one I know of is prefixing each field with its length, expressed as text or binary as you see fit. But then you certainly could not pipe them into column.

Also, if your concern is file paths, then you should consider not using \n as a "structure" separator either, because it is a perfectly valid character in filenames.
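
The effect is easy to reproduce. The following sketch, which assumes a find that supports -print0 (GNU and BSD find do) and works in a throwaway directory from mktemp, counts the same two files once by newline and once by NUL. The directory and file names are made up for the demonstration.

```shell
demo_dir=$(mktemp -d)

# Two files; the second has a newline embedded in its name.
touch "$demo_dir/plain" "$demo_dir/$(printf 'colu\nmnB')"

# Count entries by newline, then by NUL terminator.
by_newline=$(find "$demo_dir" -type f | wc -l | tr -d ' ')
by_nul=$(find "$demo_dir" -type f -print0 | tr -cd '\0' | wc -c | tr -d ' ')

echo "newline-delimited count: $by_newline"
echo "NUL-delimited count: $by_nul"

rm -rf "$demo_dir"
```

Counting by newline reports three entries for two files, because the embedded newline splits one name in two.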

Just as a proof of concept, based on your example but using NUL as the structure/record separator and length-prefixed fields (I also fiddled a bit with your example strings to include multibyte characters):

echo -e 'line1\nline2 ò' | LC_ALL=C awk '
    BEGIN {
        ORS="\0"
        # here we just move arguments away from ARGV
        # so that awk reads input from stdin
        for (i in ARGV) {
            c[i]=ARGV[i]
            delete ARGV[i]
        }
    }
    {
        # first field is the line read
        printf "%4.4d%s", length, $0
        # then a field for each argument
        for(i=1; i<length(c); i++)
            printf "%4.4d%s", length(c[i]), c[i]
        printf "%s", ORS
    }
' "€ column A" $'colu\nmnB' "column C"

Use arguments to awk to pass as many arbitrary column strings as you wish.

Then, a hypothetical counterpart script in awk (it actually has to be gawk or mawk to handle RS="\0"):

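LC_ALL=C awk '
    BEGIN { RS="\0" }
    {
        nf=0
        # consume the record one length-prefixed field at a time
        while (length) {
            # the first 4 characters hold the field length
            field_length = substr($0, 1, 4)
            printf "field %d: \"%s\""ORS, ++nf, substr($0, 5, field_length)
            # drop the prefix and the field just printed
            $0 = substr($0, 5+field_length)
        }
        printf "%s", ORS
    }
'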

Note that it is important to specify the same locale for both scripts so that they agree on the character size. Specifying LC_ALL=C for both is fine.
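
To see why the locale matters, consider the length awk reports for a multibyte character. This is a minimal sketch that writes ò as its two UTF-8 bytes in octal, so the result does not depend on the encoding of the script itself; the variable name is illustrative only.

```shell
# "ò" encoded in UTF-8 is the two bytes 0xC3 0xB2 (octal 303 262).
c_length=$(printf '\303\262\n' | LC_ALL=C awk '{ print length($0) }')
echo "length under LC_ALL=C: $c_length"
```

Under LC_ALL=C awk counts bytes, so the length is 2. A multibyte-aware awk such as gawk running in a UTF-8 locale would count the same input as one character, and the 4-digit prefixes written by one script would no longer match what the other reads.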

Pourko · Answer · 2021-04-11 11:46:42Z

-2

Your columns didn't even reach your awk command. Everything past the first zero was lost even before the echo command. You can't store a binary zero in a variable.

var=$'zzz\x00zzz'
echo "${#var}"
3
var=$'zzz\xFFzzz'
echo "${#var}"
7

You could use tr to change all the zeros to any other delimiter of your choice, before you even begin doing what you plan on doing.
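
A sketch of that approach, assuming the ASCII unit separator (octal 037) never occurs in the data. The sample records and variable names are illustrative only, and the column call is guarded in case column is not installed.

```shell
# The replacement delimiter: ASCII unit separator, assumed absent from the data.
us=$(printf '\037')

# Two sample records with NUL-separated fields, converted inside the pipe
# so that no NUL byte ever has to be stored in a shell variable.
converted=$(printf 'line1 A\0line1 B\nline2 A\0line2 B\n' | tr '\0' '\037')

# Columnize on the replacement delimiter; spaces inside fields are kept.
if command -v column >/dev/null 2>&1; then
    printf '%s\n' "$converted" | column -t -s "$us"
fi
```

Because the conversion happens in the pipe, the fields may contain spaces or tabs without being split.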

Or you could change your shell to zsh.

edited Apr 11, 2021 at 11:46

answered Apr 10, 2021 at 23:02

zsh can store null bytes in its variables, so one solution is to change the shell
– phuclv, Commented Apr 11, 2021 at 0:34
Hmm... I don't see where shell variables are involved here (just pipes); isn't what you're demonstrating here the limited escape handling of echo? In bash for example printf 'zzz\x00zzz' | hexdump -C appears to preserve the null byte
– steeldriver, Commented Apr 11, 2021 at 0:48
@Pourko OK point conceded - but the OP is not using $'string' to insert the null character, they're using awk. Are you suggesting that FS="\0" (for example, awk 'BEGIN{OFS="\0"; print "foo","bar"}' | hexdump -C) is not outputting null-separated fields?
– steeldriver, Commented Apr 11, 2021 at 1:14
The only escape sequence OP uses in echo is \n which does work manyhow, though not everyhow, especially with -e which OP also uses. In GNU awk a variable containing \0 works, but not FreeBSD awk (at least not my rather old one).
– dave_thompson_085, Commented Apr 11, 2021 at 3:02
@Pierre-Jean "displays the first column correctly"... When passing things around, the first zero signified an end to the string, and everything after it got lost. Just pick another delimiter, anything but zero, and save yourself the headaches.
– Pourko, Commented Apr 11, 2021 at 7:17
