Davislor

Use a String View, not a C-Style String

Currently, you use a C-style interface:

U32 NextUTF8Char(const char* str, U32& idx)

This has the serious flaw that there is no bounds checking on a string of arbitrary length, which is a buffer overrun waiting to happen.

The best type to represent a stringy object is std::string_view, which passes the string and its length around together with low overhead, and which other stringy types convert to and from efficiently.

Always, always, always check for buffer overruns!
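As a minimal sketch of what carrying the length buys you, here is a check that a multi-byte sequence fits inside the view before any byte of it is read. The names has_bytes and byte_at are hypothetical helpers invented for this illustration, not part of the original code:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if `needed` bytes starting at `idx` lie inside
// `str`.  Written as a subtraction so that idx + needed cannot overflow.
constexpr bool has_bytes(const std::string_view str,
                         const std::size_t idx,
                         const std::size_t needed) noexcept {
    return idx <= str.size() && needed <= str.size() - idx;
}

// Hypothetical accessor: reads one byte, asserting in debug builds that the
// index is in bounds.
constexpr unsigned char byte_at(const std::string_view str,
                                const std::size_t idx) noexcept {
    assert(has_bytes(str, idx, 1));
    return static_cast<unsigned char>(str[idx]);
}

static_assert(has_bytes("abc", 1, 2));  // Bytes 1 and 2 exist.
static_assert(!has_bytes("abc", 2, 2)); // A 2-byte sequence would overrun.
```

A decoder taking (std::string_view str, std::size_t& idx) can call a check like this once it knows the sequence length, which a bare const char* cannot support.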

Use char32_t

This is what it’s for! If you’re converting or displaying UCS-4 codepoints, your interface probably expects either char32_t (if it’s portable) or 32-bit wchar_t. The uint_least32_t type from <cstdint> has the same size and alignment, and also would work.

The uint32_t type isn’t completely portable. (It theoretically does not exist on a machine with no exact-width 32-bit type, although I suspect that actual implementations will support it anyway.) A wchar_t won’t work because it’s only 16 bits wide on Windows. (This violates the Standard. The original sin of Unicode was thinking 65,536 codepoints would be enough forever if they could just force the Japanese to go along with it; both Microsoft and the C++ Standard Committee believed this, and Microsoft was not going to break the Windows API.) An unsigned long is 64 bits wide on some implementations. An unsigned int is 16 bits wide on some architectures, and 64 bits wide on a few.
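A short sketch of those properties, checked at compile time. The functions kind and to_code_point are hypothetical names made up for this example; the point is that char32_t has the same representation as uint_least32_t but is a distinct type, so an interface can tell a codepoint from an ordinary number:

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <type_traits>

// Same size, alignment and signedness as uint_least32_t ...
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t));
static_assert(alignof(char32_t) == alignof(std::uint_least32_t));
static_assert(std::is_unsigned_v<char32_t>);
// ... but a distinct type, so overload resolution can tell them apart.
static_assert(!std::is_same_v<char32_t, std::uint_least32_t>);

// Hypothetical overload pair, for illustration only.
constexpr std::string_view kind(char32_t) noexcept { return "code point"; }
constexpr std::string_view kind(std::uint_least32_t) noexcept { return "integer"; }

// Hypothetical conversion from a raw integer, e.g. one read from a file.
constexpr char32_t to_code_point(const std::uint_least32_t value) noexcept {
    assert(value <= 0x10FFFFU); // U+10FFFF is the highest Unicode codepoint.
    return static_cast<char32_t>(value);
}

static_assert(kind(U'x') == "code point");
static_assert(kind(std::uint_least32_t{0x78}) == "integer");
static_assert(to_code_point(0x5D0U) == U'\u05D0');
```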

Consider An Iterator Interface

Which operations are you doing? You’re retrieving the next codepoint from the string and incrementing the index to the start of the next UTF-8 codepoint. Those are the * and ++ operations of a ForwardIterator! You also implicitly need to compare the position to the end of the string. Comparison of two indices, testing whether or not we are at the end, and swap, are useful operations too. And implementing the forward_iterator interface lets you use them with every other language feature that takes an iterator, for example:

const std::u32string converted(u32_begin(utf8_source), u32_end(utf8_source));

Or

for( auto it = u32_begin(utf8_source); it; ++it )

You can Optimize the Decoder

Currently, you have nested if-else blocks that sometimes increment the index and retrieve the next character. These do not generate good code for mainstream architectures in 2023.

The fastest approach I’m aware of is a finite-state machine, but it is also possible to write a branchless implementation.

Putting it All Together

Here is a sample implementation as an iterator class:

#include <cassert>
#include <compare> // strong_ordering
#include <cstddef> // ptrdiff_t, size_t
#include <cstring> // strlen
#include <iterator> // forward_iterator_tag
#include <stdexcept> // logic_error, runtime_error
#include <string_view>
#include <utility> // swap

namespace ucs4 {
/* The necessary non-member functions must be friends of the class.  This
 * requires a forward declaration before the class definition, which in turn
 * requires a forward declaration of the class as an incomplete type.
 */
class ucs4_it;
constexpr void swap(ucs4_it&, ucs4_it&) noexcept;

class ucs4_it {
private:
    static constexpr const char* INVALID_UTF8_MSG = "Invalid UTF-8 data.";

/* Store a view of the substring, not merely a position within it, so as to
 * detect and prevent a buffer overrun.
 */
    const char8_t* begin = nullptr;
    std::size_t size = 0;

    constexpr ucs4_it(const char8_t* const new_start, const std::size_t new_size) noexcept
     : begin(new_start), size(new_size)
    {}

public:
    using difference_type = std::ptrdiff_t;
    using value_type = char32_t;
    using pointer = char32_t*;
    using const_ptr = const char32_t*;
    using reference = char32_t&;
    using const_reference = const char32_t&;
    using size_type = std::size_t;
    using iterator_category = std::forward_iterator_tag;

    ucs4_it() = default;
    ucs4_it(const ucs4_it&) = default;
    ucs4_it(ucs4_it&&) = default;
    ucs4_it& operator=(const ucs4_it&) = default;
    ucs4_it& operator=(ucs4_it&&) = default;
    ~ucs4_it() = default;

    constexpr void swap(ucs4_it& other) noexcept {
        ::ucs4::swap(*this, other);
    }

    constexpr operator bool() const noexcept {return size != 0;}

/* Most of the logic of the original implementation goes here: */
    constexpr value_type operator*() const {
/* The default ucs4_it object references an empty string, and can be
 * dereferenced.
 */
        if (size == 0) {
            return 0;
        }

        if (!begin) {
/* This is a logic error: it should be impossible to create a stringy object
 * from a null pointer and a nonzero length.
 */
            throw std::logic_error("Invalid ucs4_it object (invalid base, nonzero length).");
        }

/* A not-particularly-optimized implementation with error-checking
 * and low cyclomatic complexity.
 */
        const auto c1 = begin[0];
        if        (c1 < 0b10000000U) {
            return c1;
        } else if (c1 < 0b11000000U) {
            throw std::runtime_error(INVALID_UTF8_MSG);
        } else if (c1 < 0b11100000U && size >= 2U) {
            const auto c2 = begin[1];
            if (c2 < 0b10000000U || c2 >= 0b11000000U) {
                throw std::runtime_error(INVALID_UTF8_MSG);
            }
            return (c1 & 0b00011111U) << 6U |
                   (c2 & 0b00111111U);
        } else if (c1 < 0b11110000U && size >= 3U) {
            const auto c2 = begin[1];
            const auto c3 = begin[2];

            if (c2 < 0b10000000U || c2 >= 0b11000000U ||
                c3 < 0b10000000U || c3 >= 0b11000000U) {
                throw std::runtime_error(INVALID_UTF8_MSG);
            }

            return (c1 & 0b00001111U) << 12U |
                   (c2 & 0b00111111U) << 6U |
                   (c3 & 0b00111111U);
        } else if (c1 < 0b11111000U && size >= 4U) {
            const auto c2 = begin[1];
            const auto c3 = begin[2];
            const auto c4 = begin[3];

            if (c2 < 0b10000000U || c2 >= 0b11000000U ||
                c3 < 0b10000000U || c3 >= 0b11000000U ||
                c4 < 0b10000000U || c4 >= 0b11000000U) {
                throw std::runtime_error(INVALID_UTF8_MSG);
            }

            return (c1 & 0b00000111U) << 18U |
                   (c2 & 0b00111111U) << 12U |
                   (c3 & 0b00111111U) << 6U |
                   (c4 & 0b00111111U);
        } else {
            throw std::runtime_error(INVALID_UTF8_MSG);
        }
    }

    constexpr ucs4_it& operator++() {
        if (size == 0) {
            return *this;
        }

        if (!begin) {
/* This is a logic error: it should be impossible to create a stringy object
 * from a null pointer and a nonzero length.
 */
            throw std::logic_error("Invalid ucs4_it object (invalid base, nonzero length).");
        }

        const auto c = *begin;

        if (c >= 0b10000000U && c < 0b11000000U) {
            // Not at a valid UTF-8 character boundary.
            throw std::runtime_error(INVALID_UTF8_MSG);
        }

        const std::size_t to_advance = (c < 0b10000000U) ? 1U :
                                       (c < 0b11100000U) ? 2U :
                                       (c < 0b11110000U) ? 3U :
                                                           4U;
        if (to_advance > size) {
            throw std::runtime_error(INVALID_UTF8_MSG);
        }

        *this = ucs4_it(begin + to_advance, size - to_advance);
        return *this;
    }

    constexpr ucs4_it operator++(int) {
        if (size == 0) {
            return *this;
        }

        ucs4_it to_return = *this;
        ++*this;
        return to_return;
    }

    friend constexpr void ::ucs4::swap(ucs4_it&, ucs4_it&) noexcept;
    friend constexpr std::strong_ordering operator<=>( const ucs4_it& left,
                                                       const ucs4_it& right )
        noexcept;
    friend inline ucs4_it begin(const std::string_view source) noexcept;
    friend inline ucs4_it begin(const char* const source) noexcept;
    friend constexpr ucs4_it begin(const std::u8string_view source) noexcept;
    friend inline ucs4_it begin(const char8_t* const source) noexcept;
    friend inline ucs4_it end(const std::string_view source) noexcept;
    friend constexpr ucs4_it end(const std::u8string_view source) noexcept;
};

constexpr void swap(ucs4_it& left, ucs4_it& right) noexcept {
    std::swap(left.begin, right.begin);
    std::swap(left.size, right.size);
}

/* Leave it up to the programmer to compare only iterators that index the same
 * object, and let them shoot themselves in the foot.  For example, it is
 * valid to compare two iterators within different substrings of the same
 * string, with different start and end points.
 */
constexpr std::strong_ordering operator<=>( const ucs4_it& left,
                                            const ucs4_it& right) noexcept {
    return left.begin <=> right.begin;
}

constexpr bool operator==( const ucs4_it left,
                           const ucs4_it right ) noexcept {
    return (left <=> right) == std::strong_ordering::equal;
}

/* Because we declare the comparison operators as non-member overloads, we
 * could also provide overloads to compare a ucs4_it and a char* or char8_t*.
 */

inline ucs4_it begin(const std::string_view source) noexcept {
    return ucs4_it(reinterpret_cast<const char8_t*>(source.data()), source.size());
}

inline ucs4_it begin(const char* const source) noexcept {
    return ucs4_it(reinterpret_cast<const char8_t*>(source), std::strlen(source)+1U);
}

constexpr ucs4_it begin(const std::u8string_view source) noexcept {
    return ucs4_it(source.data(), source.size());
}

inline ucs4_it begin(const char8_t* const source) noexcept {
    return ucs4_it(source, std::strlen(reinterpret_cast<const char*>(source)) + 1U);
}

inline ucs4_it end(const std::string_view source) noexcept {
    return ucs4_it(reinterpret_cast<const char8_t*>(source.data()) + source.size(), 0);
}

constexpr ucs4_it end(const std::u8string_view source) noexcept {
    return ucs4_it(source.data() + source.size(), 0);
}

} // end namespace ucs4

Some test boilerplate:

#include <concepts>
#include <cstdlib>
#include <iostream>
#include <source_location>
#include <string>
#include <string_view>

using std::cerr, std::cout, std::exit;
using namespace std::literals::string_view_literals;

template<class T, class U>
    requires (std::equality_comparable_with<T, U>)
constexpr void expect_test(const T& got,
                           const U& expected,
                           const std::source_location location =
    std::source_location::current()) {
    if (got != expected) {
        cout.flush();
        cerr << "Test in " << location.function_name()
             << " (" << location.file_name()
             << ':' << location.line()
             << ':' << location.column()
             << ") failed!\n";
        exit(EXIT_FAILURE);
    }

    cout << "Test in " << location.function_name()
         << " (" << location.file_name()
         << ':' << location.line()
         << ':' << location.column()
         << ") passed.\n";
}

And a simple test driver:

static_assert(std::forward_iterator<ucs4::ucs4_it>);
static_assert(!(ucs4::ucs4_it() != ucs4::ucs4_it()));
static_assert(ucs4::ucs4_it() <= ucs4::ucs4_it());
static_assert(ucs4::ucs4_it() >= ucs4::ucs4_it());
static_assert(!ucs4::ucs4_it());

int main() {
    expect_test(*ucs4::begin("!"sv), U'!');
    expect_test(*ucs4::begin(u8"¿"sv), U'¿');
    expect_test(*ucs4::begin(u8"א"sv), U'א');
    expect_test(*ucs4::begin(u8"𝓐"sv), U'𝓐');

    {
        constexpr auto test_sv = u8"☪☮∈✡℩☯✝ \U0001F644"sv;
        constexpr auto expected = U"☪☮∈✡℩☯✝ \U0001F644"sv;
        const std::u32string test1(ucs4::begin(test_sv), ucs4::end(test_sv));
        expect_test(test1, expected);

        std::u32string test2;
        for(auto it = ucs4::begin(test_sv); it; ++it) {
            test2.push_back(*it);
        }
        expect_test(test2, expected);
    }
    return EXIT_SUCCESS;
}

Code on Godbolt Compiler Explorer.

You’ll notice that this loses some performance because the ++ and * operators repeat many of the same tests. You could add a Rust-style interface by implementing a value_type next() member function of ucs4_it, which both dereferences and increments the iterator. There isn’t built-in syntactic sugar for this interface in C++, but you might use it in a while (auto wc = it.next()) loop, unless the string could contain null bytes in the middle, since a zero codepoint would end that loop early.
