Binary data type #309

@01mf02

I am currently thinking about how to integrate binary data (a vector of bytes) into jaq.

I researched a bit how this is done in fq.
If you have any suggestions or comments about what I write below, please comment!
A minimal prototype is implemented in #305.

Are bytes the same as strings?

In fq, bytes can be transparently used like strings for many operations, such as:

$ fq -n '([] | tobytes) == ""'
true

This suggests that empty bytes and empty strings are the same.
However, they do not behave the same way:

$ fq -n '""[]'
error: cannot iterate over: string ("")
$ fq -n '([] | tobytes)[]'
$ # no output for the previous command

This shows that bytes have different behaviour than strings, so treating them the same way is potentially confusing.
Users also cannot easily see with built-in jq methods whether some value is binary or string:

$ fq -n '([] | tobytes, "") | type'
"string"
"string"
$ fq -n '([] | tobytes, "") | _exttype'
"binary"
"string"

For that reason, I am leaning towards treating bytes and strings as separate kinds of values.

Proposed semantics

Order

I propose that for any bytes b and string s, s < b, and b < []. That is, bytes sit between strings and arrays. The reasoning behind bytes being larger than strings is that strings are also internally encoded as Vec<u8>, but with additional constraints. In jq, values are typically ordered from more constrained types to less constrained types. Bytes are less constrained than strings, but more constrained than arrays (which are Vec<Val>, which is more general than Vec<u8>), so they fit in between.
In fq, on the other hand, bytes and strings are compared to each other, probably by comparing their underlying data.
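
A minimal sketch of the proposed order in Rust, assuming a hypothetical, simplified Val enum (jaq's actual Val type has more variants and a different layout). It relies on the fact that a derived Ord compares enum variants by declaration order before comparing their contents, so placing Bytes between Str and Arr yields s < b < [] for all strings s and bytes b:

```rust
// Hypothetical, simplified value type; not jaq's actual `Val`.
// Variants are declared in the proposed sort order.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Val {
    Null,
    Bool(bool),
    Num(i64),
    Str(String),
    Bytes(Vec<u8>),
    Arr(Vec<Val>),
}

fn main() {
    let mut vals = vec![
        Val::Arr(vec![]),
        Val::Bytes(vec![0]),
        Val::Str("zzz".to_string()),
        Val::Num(1),
        Val::Bool(true),
        Val::Null,
    ];
    // The derived `Ord` sorts by variant first, then by content.
    vals.sort();
    println!("{:?}", vals);
}
```

Within the Bytes variant, the derived order falls back to the lexicographic order of Vec<u8>.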

Type

I propose that for any bytes b, (b | type) == "bytes". In fq, _exttype returns "binary" instead of "bytes", but I think that "bytes" is more fitting, because the corresponding function is tobytes (in fq), and the similarly named functions tonumber and toboolean all correspond to the outputs returned by type for these values.
However, YAML calls bytes "binary" too, so I'm not sure about this yet.

Binary data in YAML is called !!binary, and in XML schema, it is called base64Binary, so there seems to be more consensus to make (b | type) == "binary".

Arithmetic operations

The addition of two bytes should be their concatenation, like for arrays and strings. Adding strings to bytes, however, should probably fail, instead of implicitly converting the bytes to a string or the string to bytes. This is to protect the user from hard-to-spot conversions and to make it clearer in user code what is happening. jq also does not do implicit conversions between different types when adding them; for example, adding an array to a string always fails, even if the array could be converted to a string via implode.
It's also not clear in which direction the conversion would implicitly work: For example, I assumed that adding a string to bytes would return bytes, but it's actually a string:

$ fq -n '([] | tobytes) + "" | _exttype'
"string"

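The proposed addition rule could be sketched as follows in Rust. The Val enum and the add function are hypothetical names for illustration, not jaq's actual API. Operands of the same kind are concatenated, and any mix of strings and bytes is rejected instead of being converted implicitly:

```rust
// Hypothetical, simplified value type; not jaq's actual `Val`.
#[derive(Debug, Clone, PartialEq)]
enum Val {
    Str(String),
    Bytes(Vec<u8>),
}

// Add two values: same-kind operands concatenate, mixed operands fail.
fn add(l: Val, r: Val) -> Result<Val, String> {
    match (l, r) {
        (Val::Str(mut a), Val::Str(b)) => {
            a.push_str(&b);
            Ok(Val::Str(a))
        }
        (Val::Bytes(mut a), Val::Bytes(b)) => {
            a.extend(b);
            Ok(Val::Bytes(a))
        }
        // No implicit conversion in either direction.
        (l, r) => Err(format!("cannot add {:?} and {:?}", l, r)),
    }
}

fn main() {
    println!("{:?}", add(Val::Bytes(vec![3, 7]), Val::Bytes(vec![42])));
    println!("{:?}", add(Val::Bytes(vec![3]), Val::Str("a".to_string())));
}
```
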
Indexing, slicing, iteration

Let's see how fq does it:

$ fq -n '([3, 7, 42] | tobytes)[0]'
3
$ fq -n '[([3, 7, 42] | tobytes)[:-1]]'
[
  "\u0003\u0007"
]
$ fq -n '([3, 7, 42] | tobytes)[]'
$ # no output for previous command

I agree with fq's interpretation of indexing and slicing, which is like for arrays. However, I find it strange that iteration yields nothing. In jq, every type that can be indexed (array, object) can also be iterated over. So I propose that b | .[] should return the elements of b, i.e. its bytes. As an invariant, it should hold true that for any binary input, ([.[]] | tobytes) == .
On second thought, it's probably quite confusing to have .[] return the individual bytes, in particular because these would show up when using the recursive descent operator (..). For that reason, .[] should probably yield an error, like it does for strings.
To get the individual elements of a byte array, it is probably easiest to do it like in fq, with explode:

$ fq -cn '([3, 7, 42] | tobytes) | explode'
[3,7,42]

I'm still not sure about allowing .[i] for binary data: fq allows it, but on the other hand, it does not support keys:

$ fq -cn '([3, 7, 42] | tobytes) | keys'
error: keys cannot be applied to: string

In my mind, jaq should allow either both of them (.[i] and keys), or neither.
Disallowing .[i] would have the advantage of making binary data operations more consistent with string operations.
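
If .[i] and slicing were allowed, array-like semantics on bytes could look like the following Rust sketch. The function names resolve, index, slice_to and explode are hypothetical, and the handling of negative and out-of-range positions is an assumption modelled on how jq treats arrays:

```rust
// Resolve a possibly negative index against `len`, as jq does for arrays.
fn resolve(i: isize, len: usize) -> Option<usize> {
    if i >= 0 {
        Some(i as usize).filter(|&i| i < len)
    } else {
        len.checked_sub(i.unsigned_abs())
    }
}

// `.[i]`: the byte at position `i`, or `None` if `i` is out of range.
fn index(b: &[u8], i: isize) -> Option<u8> {
    resolve(i, b.len()).map(|i| b[i])
}

// `.[:upto]`: a prefix of the bytes; out-of-range bounds are clamped.
fn slice_to(b: &[u8], upto: isize) -> &[u8] {
    let end = if upto >= 0 {
        (upto as usize).min(b.len())
    } else {
        b.len().saturating_sub(upto.unsigned_abs())
    };
    &b[..end]
}

// `explode`: the individual bytes as numbers.
fn explode(b: &[u8]) -> Vec<i64> {
    b.iter().map(|&x| i64::from(x)).collect()
}

fn main() {
    let b = [3u8, 7, 42];
    println!("{:?}", index(&b, 0));
    println!("{:?}", slice_to(&b, -1));
    println!("{:?}", explode(&b));
}
```

In this sketch, slicing returns bytes again, whereas indexing and explode return numbers, mirroring the fq outputs shown above.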

Display

fq prints bytes like an ASCII string:

$ fq -n '([3, 7, 42, 255] | tobytes) | tostring'
"\u0003\u0007*\ufffd"

For JSON compatibility, that seems to be a good choice. For other formats like YAML that have a dedicated binary data type, binary data should be encoded as such, instead of as a string like for JSON.
For JSON output, to distinguish binary data from strings, we could print the binary data with a different color. That's not perfect, but better than nothing.
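
The fq output above is consistent with a lossy UTF-8 decoding followed by JSON string escaping. A minimal Rust sketch of that idea, assuming a hypothetical display_json function (escaping the replacement character as \ufffd is chosen here only to reproduce the fq output shown above):

```rust
// Render bytes as a JSON string by lossily decoding them as UTF-8.
// Invalid sequences become U+FFFD, so this conversion is not reversible.
fn display_json(bytes: &[u8]) -> String {
    let text = String::from_utf8_lossy(bytes);
    let mut out = String::from("\"");
    for c in text.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            // Escape control characters and the replacement character.
            c if (c as u32) < 0x20 || c == '\u{fffd}' => {
                out.push_str(&format!("\\u{:04x}", c as u32))
            }
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

fn main() {
    println!("{}", display_json(&[3, 7, 42, 255]));
}
```

Note that different binary values can print identically under this scheme (for example, the single bytes 254 and 255), which is one reason why a visual distinction such as color, or a dedicated binary type in the output format, is useful.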

fq provides the functions to_utf8 to encode a string to UTF-8 binary data (string -> binary) and from_utf8 to decode UTF-8 binary data to a string (binary -> string).

Data representation

At first, I wondered whether we could simply reuse strings to store binary data. There are several reasons against that. First, it would significantly increase the amount of memory used to represent binary data. Furthermore, it would make printing binary data in formats with dedicated support for it (YAML) hard or impossible. Finally, having a dedicated binary type would make it possible to represent memory-mapped binary data directly as a jaq value, thus enabling processing of binary data that does not fit into RAM. This could work without enhancing the Val type with a lifetime, by using bytes::Bytes, for example.

Functions

Like in fq, tobytes should convert IO lists to bytes. An IO list is either an array of IO lists, a string, or a number in the range 0 to 255.
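
The recursive definition of IO lists maps directly onto a recursive conversion. A minimal Rust sketch, assuming a hypothetical simplified Val enum and a hypothetical to_bytes helper; encoding strings as their UTF-8 bytes is also an assumption here:

```rust
// Hypothetical, simplified value type; not jaq's actual `Val`.
#[derive(Debug, Clone, PartialEq)]
enum Val {
    Num(i64),
    Str(String),
    Arr(Vec<Val>),
}

// Flatten an IO list into `out`, failing on numbers outside 0 to 255.
fn to_bytes(v: &Val, out: &mut Vec<u8>) -> Result<(), String> {
    match v {
        Val::Num(n) => {
            if (0..=255).contains(n) {
                out.push(*n as u8);
            } else {
                return Err(format!("number {} is outside the range 0 to 255", n));
            }
        }
        // Assumption: strings contribute their UTF-8 encoding.
        Val::Str(s) => out.extend_from_slice(s.as_bytes()),
        Val::Arr(a) => {
            for x in a {
                to_bytes(x, out)?;
            }
        }
    }
    Ok(())
}

fn main() {
    let io_list = Val::Arr(vec![
        Val::Num(3),
        Val::Str("*".to_string()),
        Val::Arr(vec![Val::Num(255)]),
    ]);
    let mut out = Vec::new();
    println!("{:?} {:?}", to_bytes(&io_list, &mut out), out);
}
```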

We should add/modify the following filters:

def isstring: "" <= . and . < ([] | tobytes);
def isbinary: ([] | tobytes) <= . and . < [];

def binaries:    select(isbinary);

def iterables: select(. >= ([] | tobytes));
def scalars:   select(. <  ([] | tobytes));

By the way, the binaries filter is actually a good argument for naming the type of binary data "binary" instead of "bytes": otherwise, the filter would need to be named bytes instead of binaries, which might be a bit confusing, also because bytes is plural and all other types of values have singular names.
