And then use things like the following (the latin1-style handling it relies on is described further down):

$ logger $'St\xe9phane'
$ journalctl --since today -o json | perl -MJSON -lne '
   BEGIN{$j = JSON->new}
   # incremental parsing: avoid assuming one and only one JSON object per line
   $j->incr_parse($_);
   while ($obj = $j->incr_parse) {
     $msg = $obj->{MESSAGE};
     # handle the array-of-byte-values representation of non-UTF-8 messages
     $msg = join "", map(chr, @$msg) if ref $msg eq "ARRAY";
     print $msg
   }' |
   sed -n '/phane/l'
St\351phane$
A possible (not fully satisfactory) approach, if one doesn't need to consider any of the strings in the JSON as text, is to pre-process the input to the JSON-processing tool (jq, mlr...) with iconv -f latin1 -t utf-8 and post-process its output with iconv -f utf-8 -t latin1, that is, convert all bytes >= 0x80 to the character with the corresponding Unicode code point, or in other words, treat the input as if it were encoded in latin1.
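
As a rough sketch of that iconv sandwich with jq (the select((.assoc | tostring) == "3") filter is just one illustrative way to pick the entry for fd 3; it assumes the same .lsfd[].name layout as the perl example below and should show the same \200\377 name):

$ exec 3> $'\x80\xff'
$ lsfd -Jp "$$" |
   iconv -f latin1 -t utf-8 |
   jq -r '.lsfd[] | select((.assoc | tostring) == "3").name' |
   iconv -f utf-8 -t latin1 |
   sed -n l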

The same kind of query can also be done with perl and its JSON module:

$ exec 3> $'\x80\xff'
$ lsfd -Jp "$$" | perl -MJSON -l -0777 -ne '
   # -0777: slurp the whole JSON document at once before decoding
   $_ = JSON->new->decode($_);
   # keep the entry for file descriptor 3 and print its name
   print $_->{name} for grep {$_->{assoc} == 3} @{$_->{lsfd}}' |
   sed -n l
/home/chazelas/tmp/\200\377$

Using latin1 also makes it relatively easy to deal with journalctl's representation of non-UTF-8 messages as arrays of byte values like [1, 2, 3], where we just need to convert those byte values to the character with the corresponding Unicode code point (and when encoded as latin1, you get the right byte back), as done in the journalctl example at the top.
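
A sketch of the same thing with jq and its implode builtin, wrapped in the iconv sandwich described above (with the same logger test entry as at the top, this should print the same St\351phane line):

$ journalctl --since today -o json |
   iconv -f latin1 -t utf-8 |
   jq -r '.MESSAGE | if type == "array" then implode else . end' |
   iconv -f utf-8 -t latin1 |
   sed -n '/phane/l'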
