Fun With Templates
This accepts any type as the character type. On a previous project of mine, that led to my routines interpreting a double* as a string of double-precision values terminated by 0.0. (This was in an extremely contrived test of whether it sufficed to restrict the overload to types for which std::char_traits<CharT> was defined.)
This led me to define the C++20 concept:
#include <concepts>
template <typename CharT> concept char_type =
std::same_as<CharT, char> ||
std::same_as<CharT, signed char> ||
std::same_as<CharT, unsigned char> ||
std::same_as<CharT, wchar_t> ||
std::same_as<CharT, char8_t> ||
std::same_as<CharT, char16_t> ||
std::same_as<CharT, char32_t>;
This would let you write
template<char_type CharType = char>
The C++11 version would have used std::enable_if.
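As a rough sketch of what that C++11 version might look like: the trait name is_char_type and the function terminated_length are hypothetical names made up for illustration, and char8_t is left out because it did not exist before C++20.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical trait (not from the original code): true only for the
// character types that exist in C++11. char8_t is omitted on purpose.
template <typename T>
struct is_char_type : std::integral_constant<bool,
    std::is_same<T, char>::value ||
    std::is_same<T, signed char>::value ||
    std::is_same<T, unsigned char>::value ||
    std::is_same<T, wchar_t>::value ||
    std::is_same<T, char16_t>::value ||
    std::is_same<T, char32_t>::value> {};

// Hypothetical example function: counts elements before the terminating zero.
// The defaulted second template parameter removes this overload from
// consideration whenever CharT is not one of the types listed above.
template <typename CharT = char,
          typename = typename std::enable_if<is_char_type<CharT>::value>::type>
std::size_t terminated_length(const CharT* s) {
    std::size_t n = 0;
    while (s[n] != CharT()) {
        ++n;
    }
    return n;
}

static_assert(is_char_type<char32_t>::value, "char32_t is accepted");
static_assert(!is_char_type<double>::value, "double is rejected");
```

With this in place, terminated_length("abc") compiles, while passing a double* fails to find a viable overload instead of silently scanning for 0.0.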
Unicode Support
No mainstream operating system uses a fixed-width default encoding any more. (Windows, the last holdout, made UTF-8 the default in Windows 11.) Since you’re accepting Unicode input, real-world input with multi-byte characters, surrogate pairs and combining characters will break the program.
Most OSes set the default locale to a multi-byte character set. To deal with this, you need to either split the string into graphemes, convert each one into a normalized form, and keep the count in a hash map, or convert to strings of char32_t, the only portably fixed-width encoding. The latter only works if you can ignore combining characters.
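To illustrate the second option, here is a minimal sketch of decoding UTF-8 into a string of char32_t using only the standard library. The function name utf8_to_utf32 is hypothetical, the sketch assumes the input is meant to be UTF-8, and a real program would more likely lean on a library for this.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Throws std::invalid_argument on malformed input.
std::u32string utf8_to_utf32(const std::string& in) {
    // Smallest code point that legitimately needs 0, 1, 2 or 3
    // continuation bytes; anything lower is an overlong encoding.
    static const char32_t min_value[] = {0x0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F;
            extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F;
            extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07;
            extra = 3;
        } else {
            throw std::invalid_argument("invalid UTF-8 lead byte");
        }
        if (in.size() - i <= extra) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, UTF-16 surrogates and out-of-range values.
        if (cp < min_value[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("invalid code point in UTF-8 input");
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

After this conversion, every element is one code point, so a character outside the Basic Multilingual Plane occupies a single char32_t rather than a surrogate pair. Combining characters are still separate elements, which is why this route only works when they can be ignored.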
To adhere to the Unicode standard, you also want to ensure that all canonically-equivalent representations are processed the same way.
The algorithm you would want to follow here is:
1. Decompose the string. (Probably with canonical decomposition, but you might prefer compatibility decomposition.)
2. Ignore all characters from character classes you should ignore (such as accents, punctuation and spaces, but probably not numbers).
3. For each remaining character, convert to the same case (uppercase or lowercase, pick one).
4. Since you discarded all combining characters in step 2, you can convert your character to a single UCS-4 codepoint. Do so.
5. Increment the count for the canonicalized base character in a hash map.
There is no function that does step 1 in the C++ standard library (that I know of). You would use a third-party library such as ICU for this. If you skip that step, there are functions that will do steps 2–4 on wchar_t characters. These will work correctly on Linux, but fail for UTF-16 surrogate pairs on Windows. Or you could use the same library you used for step 1. Step 5 might use a std::unordered_map< char32_t, size_t >.
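Steps 2 to 5 could be sketched roughly as follows. This assumes the input has already been decomposed (step 1) by a library such as ICU and converted to char32_t. The helpers is_ignored and fold_case are hypothetical stand-ins that only cover ASCII and the Combining Diacritical Marks block (U+0300 to U+036F); a real implementation would use proper Unicode property lookups and case folding.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Stand-in for a real Unicode property lookup (assumption: only the
// Combining Diacritical Marks block and ASCII non-alphanumerics are ignored).
bool is_ignored(char32_t c) {
    if (c >= 0x0300 && c <= 0x036F) {
        return true;  // combining accent left over from decomposition
    }
    if (c < 0x80) {
        const bool digit = c >= U'0' && c <= U'9';
        const bool upper = c >= U'A' && c <= U'Z';
        const bool lower = c >= U'a' && c <= U'z';
        return !(digit || upper || lower);  // ASCII punctuation, spaces, controls
    }
    return false;
}

// Stand-in for real case folding (assumption: ASCII letters only).
char32_t fold_case(char32_t c) {
    if (c >= U'A' && c <= U'Z') {
        return static_cast<char32_t>(c + (U'a' - U'A'));
    }
    return c;
}

// Counts each canonicalized base character in an already-decomposed string.
std::unordered_map<char32_t, std::size_t>
count_base_characters(const std::u32string& decomposed) {
    std::unordered_map<char32_t, std::size_t> counts;
    for (char32_t c : decomposed) {
        if (is_ignored(c)) {
            continue;              // step 2: drop ignorable characters
        }
        ++counts[fold_case(c)];    // steps 3-5: fold case, then count
    }
    return counts;
}
```

For example, the decomposed form of "Été!" (E, U+0301, t, e, U+0301, !) would be counted as two of 'e' and one of 't'.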
A Palindrome in What Language?
This at least should suffice for Latin, Greek and Cyrillic scripts; I do not know how a native speaker of Bengali or Korean would define a “palindrome” in their languages, but the rule that you can discard all combining characters is unlikely to work for Hangul characters.
In Japanese, “palindromes” seem to be defined by syllables regardless of which characters are used to write them. So, 夫婦 (fufu, married couple) and 田植え歌 (Taueuta, rice-planting song) are considered palindromes. You might be able to rescue the algorithm by converting all Japanese kana to hiragana or katakana; I’m not sure.
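The kana conversion itself is simple for the main blocks, because katakana U+30A1 through U+30F6 sit exactly 0x60 above their hiragana counterparts. A minimal sketch, with the hypothetical function name katakana_to_hiragana, might look like the following. It does nothing for kanji such as 夫婦, which would need a reading dictionary, and it ignores half-width katakana.

```cpp
#include <string>

// Hypothetical helper: maps full-width katakana (U+30A1..U+30F6) to the
// corresponding hiragana (U+3041..U+3096). Everything else, including
// kanji and the prolonged sound mark, is left unchanged.
std::u32string katakana_to_hiragana(std::u32string text) {
    for (char32_t& c : text) {
        if (c >= 0x30A1 && c <= 0x30F6) {
            c -= 0x60;  // fixed offset between the two Unicode blocks
        }
    }
    return text;
}
```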
Apparently, at least one Chinese definition of palindrome does not work phonetically, but by applying the standard definition to hanzi characters, and the same characters would not be pronounced the same way across China anyway. The algorithm above would work on Chinese.
This gives us a counterexample to the possibility of a multilingual palindrome checker. Because Unicode uses the same codepoints for Japanese kanji and Chinese hanzi, it is not possible for a palindrome checker to work for both languages simultaneously. You can pick at most one language to check in at a time.