[Json] Re: [Technical Errata Reported] RFC8259 (8622)

Max Zerzouri <maxdamantus@gmail.com> Sat, 01 November 2025 12:32 UTC

Date: Sun, 02 Nov 2025 01:32:26 +1300
From: Max Zerzouri <maxdamantus@gmail.com>
To: Rob Sayre <sayrer@gmail.com>, Nico Williams <nico@cryptonector.com>
Message-ID: <aQX92uOcXKg1TvGD@jove>
References: <20251031142335.2DB55C000BCA@rfcpa.rfc-editor.org> <CAChr6SzGmPeSaVrnmcnj6jrWR=yyspb9A3XrpyJztRSvDELAxQ@mail.gmail.com> <aQTeUod9rBp/5zNY@ubby> <E2247A5B-1C07-49BE-8DD8-568A3F2D374F@bzfx.net> <aQUhz7D5+IEw5Dkc@ubby> <a8f18866-c478-4d43-923d-a9f1c2c32c8a@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <a8f18866-c478-4d43-923d-a9f1c2c32c8a@gmail.com>
CC: Nico Williams <nico@cryptonector.com>, Austin Wright <aaa@bzfx.net>, RFC Errata System <rfc-editor@rfc-editor.org>, tbray@textuality.com, andy@hxr.us, orie@or13.io, linuxwolf+ietf@outer-planes.net, fotyek.robert63@gmail.com, json@ietf.org
Subject: [Json] Re: [Technical Errata Reported] RFC8259 (8622)
List-Id: "JavaScript Object Notation (JSON) WG mailing list" <json.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/json/OzoamKmASmiu3VUwv55DG5zcyXs>
List-Archive: <https://mailarchive.ietf.org/arch/browse/json>

On Fri, Oct 31, 2025 at 02:47:20PM -0700, Rob Sayre wrote:
> 
> On 10/31/25 1:53 PM, Nico Williams wrote:
> > 
> > Sure, but RFC 8259 is incredibly wishy-washy in other areas.  And we
> > seem to have accepted a decade ago that the place for JSON interop
> > guidelines is profiles like I-JSON.
> > 
> > There's a fellow who argues that JSON allows properly-escaped binary
> > strings.  And... I tried real hard to show that's wrong, but, no, it
> > turns out that RFC 8259 lets you drive a truck full of invalid UTF-8
> > sequences through JSON strings just fine[0].
> > 
> 
> Hi,
> 
> There is an erratum already filed about this one:
> 
> https://www.rfc-editor.org/errata/eid7603
> 
> It is correct. It is also acknowledged in RFC 9839:
> 
> https://www.rfc-editor.org/rfc/rfc9839.html
> 
> I happened to write the first version of the text that makes this
> distinction.
> 
> https://www.rfc-editor.org/rfc/rfc9839.html#name-using-subsets
> 
> "Note that escaping techniques such as those in the JSON example in Section
> 3 cannot be used to circumvent this sort of restriction...".

Hi,

Hopefully I'm not going against some topic-hijacking etiquette (first time
participating on this mailing list), but since I'm the fellow mentioned above,
I'll clarify that my point was that the JSON RFC doesn't mandate disallowing
ill-formed Unicode text in the input stream. That is, implementations can
handle ill-formed Unicode that is *not* escaped.

For example, if the JSON text itself is stored in a Unicode 16-bit string (e.g.,
in JavaScript or Java), it's possible for lone surrogates such as <D800> to appear
within a JSON string literal. I think most people would expect such a lone
surrogate in the input stream to be treated as equivalent to the escaped form
\uD800. This is how it's always worked in JavaScript at least, from Doug
Crockford's original implementation to the standard `JSON.parse`
implementation.
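
The same equivalence can be checked with Python's `json` module. This is only a
rough sketch: Python strings are sequences of code points rather than 16-bit
code units, so it stands in for the JavaScript behaviour described above and
does not reproduce it. The variable names are illustrative.

```python
import json

# Lone surrogate placed directly (unescaped) inside the JSON text.
raw = json.loads('"\ud800"')
# The same value written with JSON's \uXXXX escape.
escaped = json.loads('"\\ud800"')

print(raw == escaped)            # True
print(len(raw), hex(ord(raw)))   # 1 0xd800
```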

My overall argument in the discussion was that the most useful behaviour for a
JSON implementation using Unicode 8-bit strings would be to pass ill-formed
UTF-8 code units through, just as implementations using Unicode 16-bit strings
naturally pass ill-formed UTF-16 code units through [0].

It's a fairly long thread, but in case anyone's interested, this is probably
the comment where I most directly addressed the wording in the RFC:
  https://github.com/01mf02/jaq/issues/309#issuecomment-3314246576

[0] Actually, `JSON.stringify` in JavaScript escapes them nowadays rather than
passing them through:
  https://github.com/tc39/ecma262/issues/944

If JSON had a \xXX notation for 8-bit code units just as it has a \uXXXX
notation for 16-bit code units, I would suggest using the former notation
rather than passing ill-formed data through.

If I could go back 30 years I would suggest everyone use 8-bit strings rather
than 16-bit, in which case we would probably only have \xXX notation for code
units (and maybe a \u{X+} notation for code points), and I suspect everyone
would happily handle ill-formed 8-bit data, because most of the problems with
ill-formed data ultimately come from Unicode conversion, which no one would
bother with in a UTF-8-only world. My suggestions are based on what is
practical and useful in today's messy world.

I wasn't aware of Erratum 7603, but I don't think it changes much. If anything
it seems more deviant from sensible behaviour, since as alluded to in the
relevant thread, there are odd cases where "sequence of code points" is
different to "sequence of code units":
  https://mailarchive.ietf.org/arch/msg/json/vAmxG1z1mR52Wx_dJuGOjXEI50Q/

Strictly speaking, it doesn't align with JavaScript or any standard Unicode
strings (since Unicode strings are specifically sequences of code units [1],
not code points), but it does align with Python 3 strings. Python's JSON
implementation similarly allows passing through of ill-formed code points:

>>> import json
>>> json.loads('"\U00002200 \u2200 \\u2200"') # U+2200 ' ' U+2200 ' ' <2200>
'∀ ∀ ∀'
>>> json.loads('"\U0001F4A9 \uD83D\uDCA9 \\uD83D\\uDCA9"') # U+1F4A9 ' ' U+D83D U+DCA9 ' ' <D83D DCA9>
'💩 \ud83d\udca9 💩'

Note that in the second example, the last pair is treated by the `json` library
as a sequence of UTF-16 code units (equivalent to the code point U+1F4A9), but
the previous pair is passed through as a distinct sequence of code points that
has no valid Unicode representation. This is because Python 3 strings are
indeed sequences of arbitrary code points.
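
A small sketch of why the passed-through pair is distinct, assuming CPython 3.4
or later (for `surrogatepass` support in the UTF-16 codec). The variable names
are illustrative.

```python
lone_pair = '\ud83d\udca9'   # two surrogate code points, as passed through above
astral = '\U0001F4A9'        # the single code point U+1F4A9

print(len(lone_pair), len(astral))   # 2 1
print(lone_pair == astral)           # False

# The pair has no UTF-8 representation under the strict codec.
try:
    lone_pair.encode('utf-8')
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                     # False

# Forced out as UTF-16 code units, the pair becomes indistinguishable
# from the encoding of U+1F4A9, so the distinction exists only at the
# code point level.
utf16 = lone_pair.encode('utf-16-le', 'surrogatepass')
print(utf16.decode('utf-16-le') == astral)   # True
```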

Arguably the change from "Unicode characters" ("Unicode code units" while
squinting) to "Unicode code points" might exclude passing through ill-formed
8-bit data, since there are no "UTF-8 surrogate code points". But since we're
apparently using Python 3 string semantics now, maybe an implementation could
pretend to use Python's "surrogateescape" (PEP 383) mechanism, which
automatically translates ill-formed UTF-8 code units to ill-formed code points
(i.e., code points corresponding to UTF-16 surrogates).
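
As a rough illustration of that idea (a sketch under the assumption that the
caller decodes and encodes with `surrogateescape` around the `json` calls; the
`json` module does not do this by itself, and the sample bytes are made up):

```python
import json

# JSON text as bytes, containing 0xE9, which is not valid UTF-8 here.
ill_formed = b'"caf\xe9"'

# surrogateescape maps each undecodable byte 0xNN to the code point U+DCNN.
text = ill_formed.decode('utf-8', 'surrogateescape')
value = json.loads(text)
print(hex(ord(value[-1])))   # 0xdce9

# Serialising without ASCII escaping and encoding the same way restores
# the original bytes, so the ill-formed code unit is passed through.
out = json.dumps(value, ensure_ascii=False).encode('utf-8', 'surrogateescape')
print(out == ill_formed)     # True
```

With the default `ensure_ascii=True`, `json.dumps` would instead emit the
escaped form `\udce9`, which no longer round-trips to the original byte.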

Looking briefly through the Erratum 7603 thread, the only in-favour reasoning I
agree with is that it makes it align with ECMA-404, though ECMA-404 is also
vague/misleading. My preference would be for both standards to just specify
strings as sequences of code units, since that aligns with Unicode [1] and it's
basically how the actual implementations work. It's up to implementations to
choose which code units are used (UTF-8, UTF-16 or UTF-32) based on their
native string model. Since Python 3 uses a non-standard model for Unicode
strings it can pretend that its "code units" are code points.

[1] https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G32765

For extra clarity it might be good if they specified that the \uXXXX notation
is always to be interpreted as a UTF-16 code unit. This means that
implementations that don't use Unicode 16-bit strings need to incorporate a
conversion from UTF-16 when parsing JSON. This is already the case in all
serious implementations, including Python's as demonstrated above.
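
The conversion step could look roughly like the following. This is a
hypothetical helper, not code from any particular parser: it takes the numeric
values of consecutive \uXXXX escapes, combines well-formed surrogate pairs, and
passes lone surrogates through unchanged.

```python
def combine_utf16_units(units):
    """Convert a list of UTF-16 code units into a list of code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Standard UTF-16 surrogate pair arithmetic.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code unit or lone surrogate: passed through as-is.
            out.append(u)
            i += 1
    return out

print([hex(cp) for cp in combine_utf16_units([0xD83D, 0xDCA9])])  # ['0x1f4a9']
print([hex(cp) for cp in combine_utf16_units([0x2200, 0xD800])])  # ['0x2200', '0xd800']
```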

Sorry for the long tangent from the actual thread topic (non-unique object
keys). Seemed like I was getting called out, and it's always hard to talk about
this issue concisely.

Thanks,
Max