Checking whether a string is a permutation of a palindrome in C++20 - follow-up

Question

This post is the follow-up of Checking whether a string is a permutation of a palindrome in C++20.

So, what's new? Well, nothing else except that the procedure is now generic and accepts all string_views of char, char8_t, wchar_t, char16_t and char32_t:

string_utils.h:

#ifndef COM_GITHUB_CODERODDE_STRING_UTILS_HPP
#define COM_GITHUB_CODERODDE_STRING_UTILS_HPP

#include <cstddef> // std::size_t
#include <string_view>
#include <unordered_map>

namespace com::github::coderodde::string_utils {

    template<typename CharType = char>
    bool is_permutation_palindrome(const std::basic_string_view<CharType>& text)
    {
        std::unordered_map<CharType, std::size_t> counter_map;

        for (const auto ch : text)
            ++counter_map[ch];

        std::size_t number_of_odd_chars = 0;

        for (const auto& [ch, cnt] : counter_map)
            if (cnt % 2 == 1 && ++number_of_odd_chars > 1)
                return false;

        return true;
    }

} // namespace com::github::coderodde::string_utils

#endif // !COM_GITHUB_CODERODDE_STRING_UTILS_HPP

main.cpp:

#include "string_utils.h"
#include <iostream>

namespace su = com::github::coderodde::string_utils;

int main() {
    std::cout << std::boolalpha;
    std::cout << su::is_permutation_palindrome(std::wstring_view(L"")) << "\n";
    std::cout << su::is_permutation_palindrome(std::u8string_view(u8"yash")) << "\n";
    std::cout << su::is_permutation_palindrome(std::u16string_view(u"vcciv")) << "\n";
    std::cout << su::is_permutation_palindrome(std::u32string_view(U"abnnab")) << "\n";
}

Critique request

I would like to know how to improve my solution code-wise. I don't care much about performance.

Incomputable · Accepted Answer · 2021-12-23 14:35:56Z

There isn't much to talk about, the code is good.

Drop in ergonomics

The template now disallows usage of std::string directly, as the CharType is not automatically deduced. Some other calls where the string view could be converted to are now prohibited as well (C strings, for example). It might make sense to provide some overloads that take C strings and std strings to facilitate easier use.

Missing traits as type parameter for the string_view

I honestly do not know if anybody uses them, but I'll leave it here for completeness.

Mixing two responsibilities

I believe that histogram calculation deserves separation from this function. Perhaps having one easy to use function that composes from them is great, but not having the building blocks separately might cause problems.

I haven't worked with ranges, but what if it would be possible to actually create another view from the input that is histogram of the original, then just apply the algorithm? This might solve the problem that @Toby is mentioning, perhaps the ignoring functions could be written as part of the pipeline.

This might make it viable to accept any range type that has needed properties (foreshadowing concepts).

Rather than more overloads for more special cases, it might make more sense to just take any input range. Why restrict it to string views? Why even restrict it to characters? (And if you really want to restrict it to characters, fine, just require that the range value type is a character type.) — indi
– indi, Commented Dec 22, 2021 at 21:10
@indi, I guess I was not clear about it in my last paragraph. I edited my answer to clearly point it out (when I said view I was thinking about ranges that do not hold ownership, those called views, IIRC). — Incomputable
– Incomputable, Commented Dec 23, 2021 at 14:36

Toby Speight · Accepted Answer · 2021-12-22 16:36:42Z

5

Typically in English we consider a string to be a palindrome if the letters in the string are in the same sequence in both directions:

Spaces and punctuation are ignored.
Capitalisation is unimportant.
Accents are ignored (often omitted in English anyway).

(Example: "Madam, I'm Adam" is a well-known English palindrome.)

The first of these criteria would be relatively easy to implement.
The second could be more difficult, as capitalisation is locale-dependendent.
The third is harder, unless we make use of a library such as ICU to decompose characters and retain only the base letters.

Low-level review:

The great vowel shortage is over, and we can finally name variables such as count in full. ;)
String views are intended to be small enough that they should be passed by value.
Side-effects on the right-hand side of && can be harder to comprehend.

I'd probably go with something like
```
if ((number_of_odd_chars += cnt % 2) > 1)
```

answered Dec 22, 2021 at 16:36

Toby Speight

88.4k14 gold badges104 silver badges327 bronze badges

2

\$\begingroup\$ At least, I hope cnt was short for count! \$\endgroup\$

Toby Speight
– Toby Speight

2021-12-22 16:37:11 +00:00
Commented Dec 22, 2021 at 16:37
1

\$\begingroup\$ Well, oops..... \$\endgroup\$

coderodde
– coderodde

2021-12-22 16:42:03 +00:00
Commented Dec 22, 2021 at 16:42
5

\$\begingroup\$ @TobySpeight Yes, can’t would be unduly pessimistic. \$\endgroup\$

Davislor
– Davislor

2021-12-23 07:34:05 +00:00
Commented Dec 23, 2021 at 7:34

Add a comment |

Peter Cordes · Accepted Answer · 2021-12-23 12:50:00Z

You don't care much about performance, but there are some things worth mentioning, like a factor of 10 performance for medium to long strings of char.

You're still using size_t as the counter type. Answers on your previous question pointed out that you only need odd/even, and unsigned wrap-around preserves that. (Really just the low bit of the count is all that matters, you don't care what happens to carry-out from the low bit, whether it propagates off the end of some type or not). uint_fast8_t, unsigned char, uint8_t, or even bool would work just fine.

One justification for keeping wider counters would be if you factor out the histogram part so it can be used for other functions, with palindrome check using it as a helper.

bool would simplify the check loop: the % 2 would already be done by having increments wrap back to zero. (Or you could do that with integer types by doing ^= 1 instead of ++. xor is add-without-carry.) You could even use std::count_if to count non-zero or odd elements, or with guaranteed 0/1 elements even just sum the counts. But that loses the benefit of your early-out upon seeing a second odd count.

You're also still using unordered_map which costs you a factor of 10 in performance for char data, for strings of length 1024, compared to a plain array (or std::array).

You do want unordered_map for large input element types, so if you wanted the performance gain for narrow types you could use a template to choose array<bool, CharType_max + 1> vs. unordered_map depending on std::numeric_limits<CharType>::digits <= 12 or something. (That would give us at most a 4kiB array of bool, assuming bool is 1 byte; but that's a bad assumption if CharType is somehow actually 12. So maybe just 9 or something.) Using C++11 std::conditional, this looks like

#include <type_traits>
#include <cstdint>
  
  using UCharType = std::make_unsigned<CharType>;
  std::conditional< 
        std::numeric_limits<UCharType>::digits <= 9,  
           std::array<std::uint_fast8_t, std::numeric_limits<UCharType>::max + 1>,
           std::unordered_map<std::uint_fast8_t>
     >::type  counters;

(You could use that inside a function that returns auto, returning a histogram container, so you can still split factor out the histogramming separate from the palindrome checking.)

To index an array safely the indices need to be non-negative, so you'd need to use std::make_unsigned<CharType> for that, too. On 2's complement systems (like everything these days), conversion between signed and unsigned of the same width is a no-op, it's free for the compiler to implement, so we can always do the conversion.

Or if you use a std::bitset, you could get away with a 2^16 bit = 2^13 byte = 8 kiB std::bitset as a local var (on the stack). But for short input strings the overhead of zeroing and counting that much space would probably be worse than making a small unordered_map.

For short strings, a small std::bitset would actually be worth using vs. a short array, but the string length is a run-time variable, so you'd need two versions of the function.

Davislor · Accepted Answer · 2021-12-24 06:48:26Z

Fun With Templates

This accepts any type as the character type, which on a previous project of mine led to my routines interpreting double* as a string of double-precision values terminated by 0.0. (In an extremely contrived test of whether it sufficed to restrict the overload to types for which std::char_traite<CharT> was defined.)

This led me to define the C++20 concept:

#include <concepts>


template <typename CharT> concept char_type = 
  std::same_as<CharT, char>          ||
  std::same_as<CharT, signed char>   ||
  std::same_as<CharT, unsigned char> ||
  std::same_as<CharT, wchar_t>       ||
  std::same_as<CharT, char8_t>       ||
  std::same_as<CharT, char16_t>      ||
  std::same_as<CharT, char32_t>;

This would let you write

 template<char_type CharType = char>

The C++11 version would have used std::enable_if.

Unicode Support

No mainstream operating system uses a fixed-width default encoding any more. (Windows, the last holdout, made UTF-8 the default in Windows 11.) Since you’re accepting Unicode input, real-world input with multi-byte characters, surrogate pairs and combining characters will break the program.

Most OSes set the default locale to a multi-byte character set. To deal with these, you either need to split the string into graphemes, convert each one into a normalized form, and keep the count in a hash map, or else convert to strings of char32_t, the only portably fixed-width encoding. The latter only works if you can ignore combining characters.

To adhere to the Unicode standard, you also want to ensure that all canonically-equivalent representations are processed the same way.

The algorithm you would want to follow here is:

Decompose the string. (Probably with canonical decomposition, but you might prefer compatibility decomposition.)
Ignore all characters from character classes you should ignore (such as accents, punctuation and spaces, but probably not numbers)
For each remaining character, convert to the same case (uppercase or lowercase, pick one)
Since you discarded all combining characters in step 2, you can convert your character to a single UCS-4 codepoint. Do so.
Increment the count for the canonicalized base character in a hash map

There is no function that does step 1 in the C++ standard library (that I know of). You would use a third-party library such as ICU for this. If you overlook that, there are functions that will do steps 2–4 on wchar_t characters. These will work correctly on Linux, but fail for utf-16 surrogate pairs on Windows. Or you could use the same libreary you used for step 1. Step 5 might use a std::unordered_map< char32_t, size_t >.

A Palindrome in What Language?

This at least should suffice for Latin, Greek and Cyrillic scripts; I do not know how a native speaker of Bengali or Korean would define a “palindrome” in their languages, but the rule that you can discard all combining characters is unlikely to work for Hangul characters.

In Japanese, “palindromes” seem to be defined by syllables regardless of which characters are used to write them. So, 夫婦 (fufu, married couple) and 田植え歌 (Taueuta, rice-planting song) are considered palindromes. You might be able to rescue the algorithm by converting all Japanese kana to hirigana or katakana; I’m not sure.

Apparently, at least one Chinese definition of palindrome does not work phonetically,, but by applying the standard definition to hanzi characters, and the same characters would not be pronounced the same way across China anyway. The algorithm above would work on Chinese.

This gives us a counterexample to the possibility of a multilingual palindrome checker. Because Unicode uses the same codepoints for Japanese kanji and Chinese hanzi, it is not possible for a palindrome checker to work for both languages simultaneously. You can pick at most one language to check in at a time.

“Windows 11, the last holdout, made UTF-8 the default in Windows 11,” should probably drop the first “11.” Also, assuming that’s so, Windows 10 retains much larger market share than Windows 11 (by a factor of 9 or so), so it’s a little premature to say “no mainstream operating system” if Windows 10 still did. (That does not mean it’s premature to write programs that are variable-width capable, of course—such ability is in fact long overdue, at least for production code, no matter what Windows does or doesn’t do.) — KRyan
– KRyan, Commented Dec 24, 2021 at 5:48

Peilonrayz · Accepted Answer · 2021-12-26 22:38:29Z

0

You are treating capital letters as different characters to their lowercase form. Capitalization has no place in palindromic reasoning. Additionally numbers can be part of, or the entire, palindrome (the reason 02022020 was such a momentous date, as was 12022021).

If we are talking about the longest UTF-8 palindrome (text only) in the English Language, it would have to be 'A Man, A Plan, A Canal, Panama!" with the need to strip out all non-text characters.

edited Dec 26, 2021 at 22:38

Peilonrayz♦

44.6k7 gold badges80 silver badges158 bronze badges

answered Dec 24, 2021 at 17:29

AlMo320

1

Add a comment |

Stack Exchange Network

Checking whether a string is a permutation of a palindrome in C++20 - follow-up

5 Answers 5

Drop in ergonomics

Missing traits as type parameter for the string_view

Mixing two responsibilities

Fun With Templates

Unicode Support

A Palindrome in What Language?

You must log in to answer this question.

Linked

Hot Network Questions

Checking whether a string is a permutation of a palindrome in C++20 - follow-up

5 Answers 5

Drop in ergonomics

Missing traits as type parameter for the string_view

Mixing two responsibilities

Fun With Templates

Unicode Support

A Palindrome in What Language?

You must log in to answer this question.

Linked

Related

Hot Network Questions