[World-visible; doc created by hsivonen. Current level of consensus: just hsivonen’s proposal]
(Previously: https://docs.google.com/document/d/1KEkY1Du7x0Qv_yNCFOXASQLONqqCgHmD9PXSpX1ovrk/edit )
Proposed scope for ICU4X collator MVP
In scope
- Data and capability to decompose any character
- Rationale: Collation builds on the NFD definition
- Data and capability to query a character for its combining class
- Rationale: Required for NFD ordering
- Tweaking the ICU4C data builder to output data without the canonical closure.
- Rationale: Avoiding a new builder in MVP scope; avoiding departure from spec definitions in MVP scope; optimizing for binary size.
- Ability to read the data structures generated by the ICU4C data builder.
- Support for CLDR-style prefix rules in addition to DUCET contractions and expansions.
- Rationale: Needed for CLDR support.
- CLDR root collation using the ICU4C builder with canonical closure omitted
- Rationale: Avoid reinventing the representation and avoid creating a builder. But also avoid the size and complexity of the canonical closure.
- One level of tailoring using rules from CLDR.
- Rationale: ICU4C does one level and ECMA-402 assumes CLDR.
- The options from ECMA-402
- usage: sort, search
- numeric: true, false
- caseFirst: upper, lower, false
- sensitivity: base, accent, case, variant
- ignorePunctuation: true, false
- Backward second-level for fr-CA
- Rationale: Supported by Firefox and Chrome
- Collator object that provides compare functions that take two slices in relevant UTFs and return std::cmp::Ordering.
- (Should there be a public version that takes iterators over char? That’s probably the internal implementation anyway.)
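To make the last item concrete, here is a minimal sketch of what the API shape could look like: slice-based entry points for UTF-8 and UTF-16 that both delegate to one comparison over iterators of char. Every name here (`Collator`, `CollatorOptions`, `compare_utf8`, `compare_utf16`, `compare_chars`) is a hypothetical placeholder and not a settled ICU4X API. The comparison body is a stand-in that orders by code point; it performs no collation and ignores the options.

```rust
use std::char::decode_utf16;
use std::cmp::Ordering;

/// Hypothetical option types mirroring the ECMA-402 option values listed above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Usage { Sort, Search }

/// `Off` stands for the ECMA-402 value "false".
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CaseFirst { Upper, Lower, Off }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Sensitivity { Base, Accent, Case, Variant }

#[derive(Clone, Copy, Debug)]
pub struct CollatorOptions {
    pub usage: Usage,
    pub numeric: bool,
    pub case_first: CaseFirst,
    pub sensitivity: Sensitivity,
    pub ignore_punctuation: bool,
}

/// Hypothetical collator object; a real one would also hold collation data.
pub struct Collator {
    options: CollatorOptions,
}

/// Decodes UTF-16, mapping unpaired surrogates to U+FFFD.
fn utf16_chars(units: &[u16]) -> impl Iterator<Item = char> + '_ {
    decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
}

impl Collator {
    pub fn new(options: CollatorOptions) -> Self {
        Collator { options }
    }

    pub fn options(&self) -> CollatorOptions {
        self.options
    }

    /// Compares two UTF-8 slices.
    pub fn compare_utf8(&self, left: &str, right: &str) -> Ordering {
        self.compare_chars(left.chars(), right.chars())
    }

    /// Compares two UTF-16 slices.
    pub fn compare_utf16(&self, left: &[u16], right: &[u16]) -> Ordering {
        self.compare_chars(utf16_chars(left), utf16_chars(right))
    }

    /// Shared implementation over iterators of char.
    /// PLACEHOLDER: code point order only. This is NOT collation and does
    /// not consult `self.options`; a real implementation would decompose,
    /// map to collation elements, and compare level by level.
    pub fn compare_chars<L, R>(&self, left: L, right: R) -> Ordering
    where
        L: Iterator<Item = char>,
        R: Iterator<Item = char>,
    {
        left.cmp(right)
    }
}
```

With this shape, whether `compare_chars` is public is a one-keyword decision that does not affect the slice-based entry points. Note that the placeholder body puts "Z" before "a", which is exactly the kind of result real collation data is needed to avoid.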
Explicitly out of scope for MVP
- Search
- Rationale: Search is a separate feature that can be scoped out into its own issue and developed separately on top of the MVP.
- Firefox doesn’t use ICU4C for ctrl-F, and the result seems faster than what Chrome does (while being technically wrong but subjectively OK).
- Firefox does not do locale-aware search. Instead, Firefox has “Match Case” and “Match Diacritics” checkboxes. Not matching case uses Unicode case folding. Not matching diacritics uses one fast lookup table to map precomposed accented characters to their bases and another fast lookup table to decide whether a character is a combining diacritic. No concept of collation element is involved: the algorithm operates on UTF-16 with the substitutions and omissions mentioned.
- Firefox is technically wrong but subjectively OK by assuming that the normalization form of the search needle and its occurrences in the haystack are the same.
- This could be fixed without switching to collator-based search. There doesn’t seem to be particular pressure to fix this, though.
- Subjectively, as a person running software under the en-US locale but also searching in a language that treats ä and ö as base letters that are distinct from a and o, I find the Firefox approach, which lets me check “Match Diacritics”, superior to Chrome’s approach of not providing that option and instead trying to guess intent from locale.
- I read through the CLDR “search” definitions, and they seem to fall into these categories:
- Performance improvements compared to sorting due to the order not mattering for search, which are moot in the Firefox approach described above.
- A Hebrew punctuation special case.
- Special cases for Hangul.
- L and L-dot distinction for Catalan.
- ae, oe, and ue matching ä, ö, and ü in German and aa matching å in Norwegian and Danish.
- The very suspect notion that v and w should be primary-equal for search for Finnish and Swedish.
- Anecdote: I had lived in Finland as a native speaker of Finnish for nearly four decades before learning, by reading CLDR sources, about the notion that v and w could be considered primary-equal in Finnish collation. (I had never questioned the notion that Finnish sorts alphabetically, in the sense that every letter recited when reciting the alphabet sorts as a primary-different base in the recited order, and that Swedish sorts the same way with the same recited alphabet.) As a user, I certainly didn’t expect the macOS Finder behavior of v finding w in search; the possibility hadn’t even occurred to me before I read the CLDR sources.
- The Hangul and Hebrew special-cases could be always enabled and don’t need to be locale-sensitive. Unclear how important it would be to have them on top of what Firefox does.
- It seems better to hack the ae, oe, ue, aa rules for German, Norwegian and Danish and perhaps Catalan L-dot on top of Firefox’s approach, if those tweaks are important for users, than to switch Firefox to collator-based search.
- The search usage parameter is part of ECMA-402, though.
- Collation keys
- Rationale: The feature is for a database use case, is not exposed by ECMA-402, and adds extra complexity.
- Firefox uses collation keys in two places:
- In XSLT, where, based on the advice given in ICU4C docs, this is almost certainly a bad idea.
- In IndexedDB, but the usage is in a behind-flag Firefox-only experimental feature that was motivated by Firefox OS. (Notably, SQLite doesn’t use collation keys.)
- New collation element design
- Rationale: There’s no reason to believe that I’d have sufficient expertise and insight to design a better collation element representation than what ICU4C has. Also, a different design would grow the scope of the project by having to create a new data builder instead of using ICU4C to build the data tables.
- Canonical closure optimization
- Rationale: Staying closer to the conceptual definitions from the specs reduces complexity and MVP project scope. Also, it allows for differentiating ICU4X on binary size, which is in line with the notion of ICU4X making sense for small devices. The binary size of ICU4C has been an issue for Firefox for Android in particular.
- Rationale: See the previous point. (Planning to use the ICU4C builder with the modification to omit the canonical closure.)
- Rationale: Extra work and API surface compared to the CLDR root collation that addresses the same problem space and is needed for ECMA-402 coverage.
- Rationale: Outside ECMA-402 scope and would bring a run-time data builder into scope.
- Also, the feature seems like overkill for the use case cited as the first example: combining French and Arabic collation for North Africa. Since French uses the root collation, the Arabic collation is already correct for French, except that it orders the Arabic script before the Latin script. If there’s a lot of demand for collation with the Arabic tailorings but without the script reordering, it would make sense to provide that in CLDR or otherwise as something that gets baked in at compile time.
- More than one level of tailoring
- Rationale: ICU4C offers only root + one level.
- Backward second-level except for fr-CA
- Rationale: Only used for fr-CA. Explicit usage prohibited by ECMA-402. No point in providing explicit API surface for it. The API surface for turning it off for fr-CA is requesting fr instead.
- Control to turn off normalization
- Rationale: Prohibited by ECMA-402. Turning normalization off doesn’t make sense when the data doesn’t have the canonical closure property.
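For reference, the Firefox-style find-in-page matching described under the search rationale above can be sketched roughly as follows. This is a toy illustration and not Firefox’s code: the function names (`find`, `prepare`, `strip_precomposed`, `is_combining_diacritic`, `fold_case`) are hypothetical, and the three lookups are tiny stand-ins (a few Latin letters, the U+0300..U+036F block, ASCII and Latin-1 case folding) for the real tables.

```rust
/// Toy stand-in for the lookup table from precomposed letters to their bases.
fn strip_precomposed(unit: u16) -> u16 {
    match unit {
        0x00E1 | 0x00E4 | 0x00E5 => 0x0061, // á ä å -> a
        0x00F3 | 0x00F6 => 0x006F,          // ó ö -> o
        0x00E9 => 0x0065,                   // é -> e
        0x00C4 | 0x00C5 => 0x0041,          // Ä Å -> A
        0x00D6 => 0x004F,                   // Ö -> O
        _ => unit,
    }
}

/// Toy stand-in for the second table: is this unit a combining diacritic?
/// Assumption: only the Combining Diacritical Marks block is covered.
fn is_combining_diacritic(unit: u16) -> bool {
    (0x0300..=0x036F).contains(&unit)
}

/// Toy case fold covering ASCII and Latin-1 uppercase letters only.
fn fold_case(unit: u16) -> u16 {
    match unit {
        0x0041..=0x005A | 0x00C0..=0x00D6 | 0x00D8..=0x00DE => unit + 0x20,
        _ => unit,
    }
}

/// Applies the substitutions and omissions, remembering each surviving
/// unit's index in the original text.
fn prepare(text: &[u16], match_case: bool, match_diacritics: bool) -> Vec<(usize, u16)> {
    let mut out = Vec::with_capacity(text.len());
    for (index, &unit) in text.iter().enumerate() {
        let mut u = unit;
        if !match_diacritics {
            if is_combining_diacritic(u) {
                continue; // ignore combining marks entirely
            }
            u = strip_precomposed(u);
        }
        if !match_case {
            u = fold_case(u);
        }
        out.push((index, u));
    }
    out
}

/// Returns the UTF-16 index in `haystack` where the first match starts.
pub fn find(
    haystack: &[u16],
    needle: &[u16],
    match_case: bool,
    match_diacritics: bool,
) -> Option<usize> {
    let hay = prepare(haystack, match_case, match_diacritics);
    let pat: Vec<u16> = prepare(needle, match_case, match_diacritics)
        .into_iter()
        .map(|(_, u)| u)
        .collect();
    if pat.is_empty() {
        return None; // `windows(0)` would panic
    }
    hay.windows(pat.len())
        .find(|w| w.iter().map(|&(_, u)| u).eq(pat.iter().copied()))
        .map(|w| w[0].0)
}
```

With both options unchecked, the needle “tyo” finds “TYÖ”. With “Match Diacritics” checked, a precomposed needle “työ” does not find a decomposed “tyo” followed by U+0308, which is the normalization-form assumption called technically wrong but subjectively OK above.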