[Json] Re: [Technical Errata Reported] RFC8259 (8622)

Max Zerzouri <maxdamantus@gmail.com> Sat, 01 November 2025 12:32 UTC

Date: Sun, 02 Nov 2025 01:32:26 +1300
From: Max Zerzouri <maxdamantus@gmail.com>
To: Rob Sayre <sayrer@gmail.com>, Nico Williams <nico@cryptonector.com>
Message-ID: <aQX92uOcXKg1TvGD@jove>
References: <20251031142335.2DB55C000BCA@rfcpa.rfc-editor.org> <CAChr6SzGmPeSaVrnmcnj6jrWR=yyspb9A3XrpyJztRSvDELAxQ@mail.gmail.com> <aQTeUod9rBp/5zNY@ubby> <E2247A5B-1C07-49BE-8DD8-568A3F2D374F@bzfx.net> <aQUhz7D5+IEw5Dkc@ubby> <a8f18866-c478-4d43-923d-a9f1c2c32c8a@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <a8f18866-c478-4d43-923d-a9f1c2c32c8a@gmail.com>
Message-ID-Hash: 4HWK4WATWXVZ6GALEK3UXM4GRMPKGWO3
CC: Nico Williams <nico@cryptonector.com>, Austin Wright <aaa@bzfx.net>, RFC Errata System <rfc-editor@rfc-editor.org>, tbray@textuality.com, andy@hxr.us, orie@or13.io, linuxwolf+ietf@outer-planes.net, fotyek.robert63@gmail.com, json@ietf.org
Precedence: list
Subject: [Json] Re: [Technical Errata Reported] RFC8259 (8622)
Archived-At: <https://mailarchive.ietf.org/arch/msg/json/OzoamKmASmiu3VUwv55DG5zcyXs>

On Fri, Oct 31, 2025 at 02:47:20PM -0700, Rob Sayre wrote:
> 
> On 10/31/25 1:53 PM, Nico Williams wrote:
> > 
> > Sure, but RFC 8259 is incredibly wishy-washy in other areas.  And we
> > seem to have accepted a decade ago that the place for JSON interop
> > guidelines is profiles like I-JSON.
> > 
> > There's a fellow who argues that JSON allows properly-escaped binary
> > strings.  And... I tried real hard to show that's wrong, but, no, it
> > turns out that RFC 8259 lets you drive a truck full of invalid UTF-8
> > sequences through JSON strings just fine[0].
> > 
> 
> Hi,
> 
> There is an erratum already filed about this one:
> 
> https://www.rfc-editor.org/errata/eid7603
> 
> It is correct. It is also acknowledged in RFC 9839:
> 
> https://www.rfc-editor.org/rfc/rfc9839.html
> 
> I happened to write the first version of the text that makes this
> distinction.
> 
> https://www.rfc-editor.org/rfc/rfc9839.html#name-using-subsets
> 
> "Note that escaping techniques such as those in the JSON example in Section
> 3 cannot be used to circumvent this sort of restriction...".

Hi,

Hopefully I'm not going against some topic hijacking etiquette (first time
participating on this mailing list), but since I'm the fellow mentioned above,
I'll clarify that my point was that the JSON RFC doesn't mandate disallowing
ill-formed Unicode text in the input stream. That is, implementations can
handle ill-formed Unicode that is *not* escaped.

For example, if the JSON text itself is stored in a Unicode 16-bit string (eg,
JavaScript or Java), it's possible for lone surrogates such as <D800> to appear
within a JSON string literal. I think most people would expect such a lone
surrogate in the input stream to be treated as equivalent to the escaped form
\uD800. This is how it's always worked in JavaScript at least, from Doug
Crockford's original implementation to the standard `JSON.parse`
implementation.
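
The equivalence described above can be sketched with Python's `json` module.
This is only an analogy, since Python 3 strings are sequences of code points
rather than 16-bit code units, and the variable names are illustrative:

```python
import json

# Illustrative inputs: the same JSON string literal, once with a raw lone
# surrogate in the JSON text and once with the escaped form \ud800.
raw_text = '"\ud800"'
escaped_text = '"\\ud800"'

raw_value = json.loads(raw_text)
escaped_value = json.loads(escaped_text)

# Both parse to a one-element string holding the lone surrogate.
same = raw_value == escaped_value
print(same, len(raw_value), hex(ord(raw_value)))
```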

My overall argument in the discussion was that the most useful behaviour for a
JSON implementation using Unicode 8-bit strings would be to pass ill-formed
UTF-8 code units through, just as implementations using Unicode 16-bit strings
naturally pass ill-formed UTF-16 code units through [0].

It's a fairly long thread, but in case anyone's interested, this is probably
the comment where I most directly addressed the wording in the RFC:
  https://github.com/01mf02/jaq/issues/309#issuecomment-3314246576

[0] Actually, `JSON.stringify` in JavaScript escapes them nowadays rather than
passing them through:
  https://github.com/tc39/ecma262/issues/944

If JSON had a \xXX notation for 8-bit code units just as it has a \uXXXX
notation for 16-bit code units, I would suggest using the former notation
rather than passing ill-formed data through.

If I could go back 30 years I would suggest everyone use 8-bit strings rather
than 16-bit, in which case we would probably only have \xXX notation for code
units (and maybe a \u{X+} notation for code points), and I suspect everyone
would happily handle ill-formed 8-bit data, because most of the problems with
ill-formed data ultimately come from Unicode conversion, which no one would
bother with in a UTF-8-only world. My suggestions are based on what is
practical and useful in today's messy world.

I wasn't aware of Erratum 7603, but I don't think it changes much. If anything
it seems more deviant from sensible behaviour, since as alluded to in the
relevant thread, there are odd cases where "sequence of code points" is
different to "sequence of code units":
  https://mailarchive.ietf.org/arch/msg/json/vAmxG1z1mR52Wx_dJuGOjXEI50Q/

Strictly speaking, it doesn't align with JavaScript or any standard Unicode
strings (since Unicode strings are specifically sequences of code units [1],
not code points), but it does align with Python 3 strings. Python's JSON
implementation similarly allows passing through of ill-formed code points:

>>> import json
>>> json.loads('"\U00002200 \u2200 \\u2200"') # U+2200 ' ' U+2200 ' ' <2200>
'∀ ∀ ∀'
>>> json.loads('"\U0001F4A9 \uD83D\uDCA9 \\uD83D\\uDCA9"') # U+1F4A9 ' ' U+D83D U+DCA9 ' ' <D83D DCA9>
'💩 \ud83d\udca9 💩'

Note that in the second example, the last pair is treated by the `json` library
as a sequence of UTF-16 code units (equivalent to the code point U+1F4A9), but
the previous pair is passed through as a distinct sequence of code points that
has no valid Unicode representation. This is because Python 3 strings are
indeed sequences of arbitrary code points.
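
To make the difference between the two pairs visible, the following sketch
lists the code points of the parsed string and checks whether it can be
encoded as UTF-8. It uses only the standard `json` module; the variable names
are illustrative:

```python
import json

s = json.loads('"\U0001F4A9 \uD83D\uDCA9 \\uD83D\\uDCA9"')

# The unescaped pair stays as two surrogate code points, while the escaped
# pair is combined into the single code point U+1F4A9.
code_points = [hex(ord(c)) for c in s]
print(code_points)

# A string containing surrogate code points has no well-formed UTF-8 form,
# so a strict encode is expected to fail.
try:
    s.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```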

Arguably the change from "Unicode characters" ("Unicode code units" while
squinting) to "Unicode code points" might exclude passing through ill-formed
8-bit data since there are no "UTF-8 surrogate code points", but since we're
apparently using Python 3 string semantics now maybe an implementation could
pretend to use Python's "surrogateescape" (PEP 383) mechanism, which involves
automatically translating ill-formed UTF-8 code units to ill-formed code points
(ie, code points corresponding with UTF-16 surrogates).
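
A minimal sketch of that idea in Python, assuming the "surrogateescape" error
handler is applied on both sides of the JSON round trip. The input bytes are
a made-up example, and this is not a claim about how any particular JSON
implementation behaves:

```python
import json

# Made-up input: bytes that are not well-formed UTF-8.
raw = b"caf\xe9 \xff"

# surrogateescape maps each undecodable byte 0xNN to the code point U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# With the default ensure_ascii=True, json.dumps writes those code points
# as \uDCxx escapes, so the JSON text itself is plain ASCII.
encoded = json.dumps(text)

# Parsing and re-encoding with the same handler restores the original bytes.
decoded = json.loads(encoded)
restored = decoded.encode("utf-8", errors="surrogateescape")
print(encoded, restored == raw)
```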

Looking briefly through the Erratum 7603 thread, the only in-favour reasoning I
agree with is that it makes it align with ECMA-404, though ECMA-404 is also
vague/misleading. My preference would be for both standards to just specify
strings as sequences of code units, since that aligns with Unicode [1] and it's
basically how the actual implementations work. It's up to implementations to
choose which code units are used (UTF-8, UTF-16 or UTF-32) based on their
native string model. Since Python 3 uses a non-standard model for Unicode
strings it can pretend that its "code units" are code points.

[1] https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G32765

For extra clarity it might be good if they specified that the \uXXXX notation
is always to be interpreted as a UTF-16 code unit. This means that
implementations that don't use Unicode 16-bit strings need to incorporate a
conversion from UTF-16 when parsing JSON. This is already the case in all
serious implementations, including Python's as demonstrated above.
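
The conversion step can be sketched as follows. `utf16_units_to_str` is a
hypothetical helper, not part of any library; it treats each \uXXXX value as
a UTF-16 code unit, combines well-formed surrogate pairs into one code point,
and passes lone surrogates through unchanged (one possible policy, matching
what Python's `json` module does above):

```python
import json

def utf16_units_to_str(units):
    """Convert a list of 16-bit code unit values to a Python string."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed pair: combine into a supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
        else:
            # BMP code unit or lone surrogate: pass through as-is.
            out.append(chr(u))
            i += 1
    return "".join(out)

pair = utf16_units_to_str([0xD83D, 0xDCA9])
lone = utf16_units_to_str([0xD800])
print(pair == json.loads('"\\uD83D\\uDCA9"'), lone == json.loads('"\\uD800"'))
```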

Sorry for the long tangent from the actual thread topic (non-unique object
keys). Seemed like I was getting called out, and it's always hard to talk about
this issue concisely.

Thanks,
Max
