Deprecate mozc::Util::IsJisX0208() #1353

@yukawa

Description

The goal here is to effectively inline mozc::Util::IsJisX0208 into AddSymbolToDictionary in src/rewriter/gen_symbol_rewriter_dictionary_main.cc as that's the only remaining usage of mozc::Util::IsJisX0208.

Background

Previously we had a utility method mozc::Util::GetCharacterSet, which classified the given character into a character set.

mozc/src/base/util.h

Lines 434 to 447 in 9a44dac

// Basically, if charset >= JIX0212, the char is platform dependent char.
enum CharacterSet {
  ASCII, // ASCII (simply ucs4 <= 0x007F)
  JISX0201, // defined at least in 0201 (can be in 0208/0212/0213/CP9232)
  JISX0208, // defined at least in 0208 (can be in 0212/0213/CP932)
  JISX0212, // defined at least in 0212 (can be in 0213/CP932)
  JISX0213, // defined at least in 0213 (can be in CP932)
  CP932, // defined only in CP932, not in JISX02*
  UNICODE_ONLY, // defined only in UNICODE, not in JISX* nor CP932
  CHARACTER_SET_SIZE,
};
// Returns CharacterSet.
static CharacterSet GetCharacterSet(char32 ucs4);

Later, mozc::Util::GetCharacterSet was replaced with mozc::Util::IsJisX0208 (d381608), because the other character sets were no longer used at that time.

The only remaining usage of mozc::Util::IsJisX0208 is in AddSymbolToDictionary in src/rewriter/gen_symbol_rewriter_dictionary_main.cc:

void AddSymbolToDictionary(const absl::string_view pos,
                           const absl::string_view value,
                           const absl::Span<const std::string> keys,
                           const absl::string_view description,
                           const absl::string_view additional_description,
                           const SortingKeyMap& sorting_keys,
                           rewriter::DictionaryGenerator& dictionary) {
  // use first char of value as sorting key.
  const absl::string_view first_value = Util::Utf8SubString(value, 0, 1);
  const auto it = sorting_keys.find(first_value);
  uint16_t sorting_key = 0;
  if (it == sorting_keys.end()) {
    DLOG(WARNING) << first_value << " is not defined in sorting map.";
    // If the character is platform-dependent, put the character at the last.
    if (!Util::IsJisX0208(value)) {
      sorting_key = USHRT_MAX;
    }
  } else {
    sorting_key = it->second;
  }

Since gen_symbol_rewriter_dictionary_main.cc is a build-time utility, the dedicated code generation we currently perform with src/base/gen_character_set.py is overkill. Let's simplify the code as follows:

  • Move the logic into src/rewriter/gen_symbol_rewriter_dictionary_main.cc
  • Remove mozc::Util::IsJisX0208()
  • Delete the following files:
    • src/base/gen_character_set.py
    • src/data/unicode/JIS0201.TXT
    • src/data/unicode/JIS0208.TXT
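
One possible shape for the inlined logic is sketched below. This is only a rough illustration, not the actual Mozc change: the names CodePointRange, kSampleRanges, IsInSampleTable, AllInSampleTable, and FallbackSortingKey are hypothetical, the input is assumed to be already decoded into code points, and the three ranges are a tiny illustrative subset rather than the full ASCII / JIS X 0201 / JIS X 0208 coverage that the real check would need.

```cpp
#include <algorithm>
#include <array>
#include <climits>
#include <cstdint>
#include <string_view>

// Hypothetical inclusive range of Unicode code points.
struct CodePointRange {
  char32_t first;
  char32_t last;
};

// Illustrative subset only. A real replacement for Util::IsJisX0208 would
// have to list every code point covered by ASCII, JIS X 0201, and JIS X 0208.
constexpr std::array<CodePointRange, 3> kSampleRanges = {{
    {0x0020, 0x007E},  // printable ASCII
    {0x3041, 0x3093},  // hiragana, U+3041 to U+3093
    {0x30A1, 0x30F6},  // katakana, U+30A1 to U+30F6
}};

// Returns true if the code point falls inside one of the sample ranges.
bool IsInSampleTable(char32_t c) {
  for (const CodePointRange& range : kSampleRanges) {
    if (range.first <= c && c <= range.last) {
      return true;
    }
  }
  return false;
}

// Returns true only if every code point of the text is in the table.
bool AllInSampleTable(std::u32string_view text) {
  return std::all_of(text.begin(), text.end(), IsInSampleTable);
}

// Mirrors the fallback branch of AddSymbolToDictionary: values containing a
// character outside the table are pushed to the end of the sort order.
std::uint16_t FallbackSortingKey(std::u32string_view value) {
  return AllInSampleTable(value) ? 0 : USHRT_MAX;
}
```

Keeping the table as a constant next to its only caller would avoid both the generator script and the checked-in mapping files, at the cost of maintaining the ranges by hand.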

Steps to reproduce

  1. bazelisk build //data_manager/oss:mozc_dataset_for_oss@symbol --config oss_windows -c opt

Expected behavior

  • The following files remain unchanged:
    • bazel-bin/data_manager/oss/symbol_token.data
    • bazel-bin/data_manager/oss/symbol_string.data
  • The following files no longer exist:
    • src/base/gen_character_set.py
    • src/data/unicode/JIS0201.TXT
    • src/data/unicode/JIS0208.TXT
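
The "remain unchanged" expectation can be checked with a byte-for-byte comparison of the generated files before and after the change. The helper below is a minimal sketch of such a check; FilesAreIdentical is a hypothetical name, and it assumes that a copy of each output built at the baseline commit has been saved somewhere for comparison.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Returns true when both files can be opened and contain exactly the same
// bytes. A file that cannot be opened is treated as a mismatch.
bool FilesAreIdentical(const std::string& path_a, const std::string& path_b) {
  std::ifstream a(path_a, std::ios::binary);
  std::ifstream b(path_b, std::ios::binary);
  if (!a || !b) {
    return false;
  }
  using Iterator = std::istreambuf_iterator<char>;
  // The four-iterator overload also detects a length mismatch.
  return std::equal(Iterator(a), Iterator(), Iterator(b), Iterator());
}
```

For example, calling FilesAreIdentical with a saved baseline copy and bazel-bin/data_manager/oss/symbol_token.data, and again for symbol_string.data, should return true in both cases if the refactoring preserved the output.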

Version or commit-id

c7160d4

Environment

  • OS: All
