A C++ function to read Code Points from a UTF-8 Stream

Question

I've written a function that reads and returns one UTF-8 code point from an istream. I am wondering if the code is efficient or if there are some obvious problems with the implementation.

chr_t utf32::get_utf32_char(std::istream &in_stream) {
    int next;
    chr_t out = in_stream.get();
    if (out == -1 || out < 0x80) {
        return out;
    } else if ((out & 0xe0) == 0xc0) {
        out &= 0x1f;
        out <<= 6;
        next = in_stream.get();
        if (next == -1) goto invalid_seq;
        out |= next & 0x3F;
        return out;
    } else if ((out & 0xf0) == 0xe0) {
        out &= 0x0f;
        out <<= 12;
        next = in_stream.get();
        if (next == -1) goto invalid_seq;
        out |= (next & 0x3F) << 6;
        next = in_stream.get();
        if (next == -1) goto invalid_seq;
        out |= next & 0x3F;
        return out;
    } else if ((out & 0xf8) == 0xf0) {
        out &= 0x07;
        out <<= 18;
        next = in_stream.get();
        if (next == -1) goto invalid_seq;
        out |= (next & 0x3F) << 12;
        next = in_stream.get();
        if (next == -1) goto invalid_seq;
        out |= (next & 0x3F) << 6;
        next = in_stream.get();
        if (next == -1) goto invalid_seq;
        out |= next & 0x3F;
        return out;
    } else {
        throw std::runtime_error("invalid utf8 character");
    }
invalid_seq:
    throw std::runtime_error("unexpected end of utf8 sequence");
}

Loki Astari · Accepted Answer · 2020-10-08 20:43:52Z

Overview

There is a lot of repeated code that could be removed by use of functions.

When bit twiddling like this, a human-readable explanation of what you are doing would be nice. I had to look up the Unicode spec to make sure you were doing it correctly.
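
As a rough sketch of the kind of explanation I mean, the comment block below lays out the UTF-8 byte patterns, and a small helper works one three-byte example. The name decode3 is hypothetical and not part of the question's code; it assumes the lead and continuation patterns have already been checked.

```cpp
#include <cassert>
#include <cstdint>

// UTF-8 byte layouts (x = payload bit):
//   1 byte : 0xxxxxxx                              U+0000  .. U+007F
//   2 bytes: 110xxxxx 10xxxxxx                     U+0080  .. U+07FF
//   3 bytes: 1110xxxx 10xxxxxx 10xxxxxx            U+0800  .. U+FFFF
//   4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   U+10000 .. U+10FFFF

// Hypothetical helper: combine a three-byte sequence into a code point.
// Assumes the caller already verified the lead and continuation patterns.
constexpr std::uint32_t decode3(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2)
{
    return (static_cast<std::uint32_t>(b0 & 0x0F) << 12)  // 4 payload bits from the lead byte
         | (static_cast<std::uint32_t>(b1 & 0x3F) << 6)   // 6 payload bits per continuation byte
         |  static_cast<std::uint32_t>(b2 & 0x3F);
}

// U+20AC (EURO SIGN) is encoded as the bytes E2 82 AC.
static_assert(decode3(0xE2, 0x82, 0xAC) == 0x20AC, "E2 82 AC decodes to U+20AC");
```

A comment like the table at the top would let a reader check masks such as 0x0F and 0x3F without opening the spec.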

A lot of UTF-8 files (streams) begin with a BOM: the bytes 0xEF, 0xBB, 0xBF, which encode the code point U+FEFF. This is not part of the text and should be discarded if it exists. You may do this at the layer of abstraction above this one, in which case add a comment pointing out that the BOM is not removed.
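
A minimal sketch of dropping a leading BOM, under the assumption that the stream is seekable (a file or string stream). The helper name skipUtf8Bom is hypothetical; a non-seekable stream would need a different approach, such as buffering the first three bytes.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: consume a leading UTF-8 BOM (EF BB BF) if present.
// Returns true if a BOM was skipped. Assumes the stream supports seeking,
// because the bytes are "un-read" by seeking back to the start position.
bool skipUtf8Bom(std::istream& in)
{
    const std::istream::pos_type start = in.tellg();
    char buf[3] = {};
    in.read(buf, 3);
    const bool isBom = in.gcount() == 3
                    && static_cast<unsigned char>(buf[0]) == 0xEF
                    && static_cast<unsigned char>(buf[1]) == 0xBB
                    && static_cast<unsigned char>(buf[2]) == 0xBF;
    if (!isBom) {
        in.clear();        // a short read sets eofbit/failbit; reset before seeking
        in.seekg(start);   // rewind so the caller still sees every byte
    }
    return isBom;
}
```

Calling this once before the first code point is read keeps the decoder itself free of BOM handling.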

You don't validate that bytes 2 through 4 have the correct pattern for UTF-8 (10xxxxxx); you just assume they do.
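
The missing check is small. As a sketch (isContinuationByte is a hypothetical name), a continuation byte must have its top two bits equal to 10:

```cpp
#include <cassert>

// A continuation byte must have the bit pattern 10xxxxxx.
// `byte` is meant to be the int returned by std::istream::get(),
// so a negative value (end of input) is rejected as well.
constexpr bool isContinuationByte(int byte)
{
    return byte >= 0 && (byte & 0xC0) == 0x80;
}

static_assert(isContinuationByte(0x82), "10000010 is a continuation byte");
static_assert(!isContinuationByte(0x41), "'A' is not a continuation byte");
```

Note the parentheses around `byte & 0xC0`: in C++ the comparison operators bind tighter than `&`.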

You use exceptions on streams. Normally you would mark the stream as bad and return. The user of the stream is supposed to check the state of the stream before using any output (and further reading will fail).

C++ uses operator>> to read from a stream. It would be nice to be able to read your characters using this operator.
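
To illustrate both points (stream state instead of exceptions, and reading via operator>>), here is a self-contained sketch. The CodePoint wrapper is a hypothetical type introduced only so the overload cannot clash with the built-in integer extractors; the definition of chr_t is not shown in the question. This sketch sets failbit, the usual flag for a parse failure, whereas the redesign further down sets badbit.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical wrapper type for one decoded code point.
struct CodePoint
{
    char32_t value = 0;
};

// Reads one UTF-8 encoded code point. On failure the stream state is
// set and `out` is left untouched, as with the standard extractors.
std::istream& operator>>(std::istream& in, CodePoint& out)
{
    const int lead = in.get();   // sets eofbit and failbit at end of input
    if (lead == std::istream::traits_type::eof()) {
        return in;
    }

    int extra = 0;
    char32_t result = 0;
    if      ((lead & 0x80) == 0x00) { result = static_cast<char32_t>(lead);        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { result = static_cast<char32_t>(lead & 0x1F); extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { result = static_cast<char32_t>(lead & 0x0F); extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { result = static_cast<char32_t>(lead & 0x07); extra = 3; }
    else {
        in.setstate(std::ios::failbit);   // invalid lead byte
        return in;
    }

    for (; extra > 0; --extra) {
        const int next = in.get();
        if (next < 0 || (next & 0xC0) != 0x80) {
            in.setstate(std::ios::failbit);   // truncated or malformed sequence
            return in;
        }
        result = (result << 6) | static_cast<char32_t>(next & 0x3F);
    }
    out.value = result;
    return in;
}
```

The caller can then use the familiar idiom `while (stream >> cp) { ... }` and inspect the stream state afterwards to tell end of input from a decoding error. The sketch does not reject overlong encodings or surrogates.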

Code Review

The name of the function is not quite correct:

chr_t utf32::get_utf32_char(std::istream &in_stream)

Code points are distinct from their encoding. You are converting a code point that was encoded as UTF-8 into UCS-4 (not UTF-32). UTF-32 is another encoding format used for transportation. I would note that UCS-4 and UTF-32 look the same, but they are not the same thing.

You read into next (an int) in all locations apart from here:

    int next;
    chr_t out = in_stream.get();

Why not be consistent? I especially worry about corner cases and automatic conversions between characters and integers. I can't think of anything that would go wrong, but why risk it? Read into next (the int), check for EOF, then convert to your character representation.
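
A sketch of that order of operations, using a hypothetical readByte helper. Since the definition of chr_t is not shown, the example narrows to unsigned char; the point is only that the EOF comparison happens while the value is still an int, so the byte 0xFF can never be mistaken for end of input.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical helper: read into an int, compare against EOF,
// and only then narrow to the character type.
// Returns false at end of input and leaves `out` untouched.
bool readByte(std::istream& in, unsigned char& out)
{
    const int next = in.get();
    if (next == std::istream::traits_type::eof()) {
        return false;
    }
    out = static_cast<unsigned char>(next);
    return true;
}
```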

Don't use magic numbers. In this context you should use EOF (not -1).

    if (out == -1 || out < 0x80) {
        return out;

I hate else on the same line as }.

    } else if ((out & 0xe0) == 0xc0) {

But your code your style.
Very few coding standards use this system.

In my opinion (so ignorable) you don't need to crush the code together that much. Extra vertical spacing will make the code easier to read.

Questionable use of goto:

        if (next == -1) goto invalid_seq;

Why not simply:

        if (next == EOF) {
            throw std::runtime_error(unexpectedEOFMessage);
        }

Redesign:

I would have used a more data driven approach:

#include <cstdio>    // EOF
#include <istream>

// chr_t is the code point type from the question.
struct Encoding
{
    unsigned char   mask;   // bits of the lead byte that identify the sequence length
    unsigned char   value;  // expected value of those bits
    int             extra;  // number of continuation bytes that follow
};
Encoding const utf8Info[] = {
                        {0x80, 0x00, 0},    // 0xxxxxxx
                        {0xE0, 0xC0, 1},    // 110xxxxx
                        {0xF0, 0xE0, 2},    // 1110xxxx
                        {0xF8, 0xF0, 3}     // 11110xxx
                      };
chr_t decodeUtf(std::istream& stream, chr_t result, int count)
{
    for(; count; --count) {
        int next = stream.get();
        // Parentheses are required: != binds tighter than &.
        if ((next & 0xC0) != 0x80) {
            // Not a valid continuation character (this also catches EOF)
            stream.setstate(std::ios::badbit);
            return -1;
        }
        result = (result << 6) | (next & 0x3F);
    }
    return result;
}
chr_t getCodePoint(std::istream& stream)
{
    // NOTE: Does not remove any initial BOM marker.

    int next = stream.get();
    if (next == EOF) {
        return -1;
    }
    for(auto const& type: utf8Info) {
        if ( (next & type.mask) == type.value ) {
           // Strip the length-marker bits, keeping only the payload bits.
           return decodeUtf(stream, next & ~type.mask, type.extra);
        }
    }
    // Not a valid first character
    stream.setstate(std::ios::badbit);
    return -1;
}

std::istream& operator>>(std::istream& str, chr_t& out)
{
    chr_t tmp = getCodePoint(str);
    if (str) {
       out = tmp;
    }
    return str;
}


One small mistake I noticed: decodeUtf(stream, next & type.mask, type.extra) first needs to negate the mask (~type.mask), because otherwise you take the encoding bits instead of the data bits. Other than that, this works perfectly! Thank you very much :)
– Ian Rehwinkel, Commented Oct 8, 2020 at 12:09
