Overview
There is a lot of repeated code that could be removed by use of functions.
When bit-twiddling like this, a human-readable explanation of what you are doing would be nice. I had to look up the Unicode spec to make sure you were doing it correctly.
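As an illustration of the kind of explanation meant here, a comment table next to the masks is usually enough. The helper name continuationCount below is hypothetical and not part of the reviewed code; it is only a sketch of how the bit patterns could be documented.

```cpp
// UTF-8 byte layouts (x = payload bits of the code point):
//
//   Code point range       Byte 1    Byte 2    Byte 3    Byte 4
//   U+0000  .. U+007F      0xxxxxxx
//   U+0080  .. U+07FF      110xxxxx  10xxxxxx
//   U+0800  .. U+FFFF      1110xxxx  10xxxxxx  10xxxxxx
//   U+10000 .. U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

// Hypothetical helper: number of continuation bytes implied by a lead byte,
// or -1 if the byte cannot start a sequence.
int continuationCount(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 0;  // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 1;  // 110xxxxx: two-byte sequence
    if ((lead & 0xF0) == 0xE0) return 2;  // 1110xxxx: three-byte sequence
    if ((lead & 0xF8) == 0xF0) return 3;  // 11110xxx: four-byte sequence
    return -1;                            // 10xxxxxx or 11111xxx: not a lead byte
}
```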
Many UTF-8 files (streams) start with a BOM: the code point U+FEFF, encoded as the bytes 0xEF, 0xBB, 0xBF. It is not part of the text and should be discarded if present. You may be doing this at the layer of abstraction above this one, in which case add a comment pointing out that the BOM is not removed here.
You don't validate that bytes 2 through 4 have the correct pattern for UTF-8 (10xxxxxx); you just assume they do.
You use exceptions on streams. Normally you would mark the stream as bad and return. The user of the stream is supposed to check the state of the stream before using any output (and further reading will fail).
C++ uses operator>> to read from a stream. It would be nice to be able to read your characters using this operator.
Code Review
The name of the function is not quite correct:
chr_t utf32::get_utf32_char(std::istream &in_stream)
Code points are distinct from their encoding. You are converting a code point that was encoded in UTF-8 into UCS-4 (not UTF-32). UTF-32 is another encoding format used for transport. Note that UCS-4 and UTF-32 look the same, but they are not the same thing.
You read into next (an int) in all locations apart from here:
int next;
chr_t out = in_stream.get();
Why not be consistent? I especially worry about corner cases and automatic conversions between characters and integers. I can't think of anything that would go wrong, but why risk it? Read into next (the int), check for EOF, then convert to your character representation.
Don't use magic numbers. In this context you should use EOF (not -1).
if (out == -1 || out < 0x80) {
return out;
I hate else on the same line as }.
} else if ((out & 0xe0) == 0xc0) {
But it's your code, so your style.
Very few coding standards use this system.
In my opinion (so ignorable) you don't need to crush the code together that much. Extra vertical spacing will make the code easier to read.
Questionable use of goto:
if (next == -1) goto invalid_seq;
Why not simply:
if (next == EOF) {
    throw std::runtime_error(unexpectedEOFMessage);
}
Redesign:
I would have used a more data driven approach:
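#include <cstdio>    // EOF
#include <istream>

// One row per UTF-8 lead-byte pattern.
// unsigned char: values of 0x80 and above do not fit in a signed char.
struct Encoding
{
    unsigned char mask;   // Bits of the lead byte to examine.
    unsigned char value;  // Expected value of those bits.
    int           extra;  // Number of continuation bytes that follow.
};
Encoding const utf8Info[] = {
    {0x80, 0x00, 0},  // 0xxxxxxx
    {0xE0, 0xC0, 1},  // 110xxxxx
    {0xF0, 0xE0, 2},  // 1110xxxx
    {0xF8, 0xF0, 3}   // 11110xxx
};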
chr_t decodeUtf(std::istream& stream, chr_t result, int count)
{
    for (; count; --count) {
        int next = stream.get();
        // Parentheses are required: != binds tighter than &.
        // EOF (-1 on common implementations) also fails this test.
        if ((next & 0xC0) != 0x80) {
            // Not a valid continuation character
            stream.setstate(std::ios::badbit);
            return -1;
        }
        result = (result << 6) | (next & 0x3F);
    }
    return result;
}
chr_t getCodePoint(std::istream& stream)
{
    // NOTE: Does not remove any initial BOM marker.
    int next = stream.get();
    if (next == EOF) {
        return -1;
    }
    for (auto const& type : utf8Info) {
        // Parentheses are required: == binds tighter than &.
        if ((next & type.mask) == type.value) {
            return decodeUtf(stream, next & ~type.mask, type.extra);
        }
    }
    // Not a valid first character
    stream.setstate(std::ios::badbit);
    return -1;
}
std::istream& operator>>(std::istream& str, chr_t& out)
{
    chr_t tmp = getCodePoint(str);
    if (str) {
        out = tmp;
    }
    return str;
}
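For reference, here is a compilable sketch that pulls the redesign together and shows the operator in use. Two things are assumptions rather than part of the reviewed code: chr_t is taken to be char32_t (if it were a typedef for a built-in integer type such as unsigned int, this operator>> would clash with the standard stream extractors), and decodeAll is a hypothetical helper added only to demonstrate usage.

```cpp
#include <cstdio>    // EOF
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using chr_t = char32_t;  // Assumption: the original chr_t is not shown.

struct Encoding
{
    unsigned char mask;   // bits of the lead byte to examine
    unsigned char value;  // expected value of those bits
    int           extra;  // number of continuation bytes that follow
};

Encoding const utf8Info[] = {
    {0x80, 0x00, 0}, {0xE0, 0xC0, 1}, {0xF0, 0xE0, 2}, {0xF8, 0xF0, 3}
};

chr_t decodeUtf(std::istream& stream, chr_t result, int count)
{
    for (; count; --count) {
        int next = stream.get();
        if ((next & 0xC0) != 0x80) {           // not a continuation byte
            stream.setstate(std::ios::badbit);
            return static_cast<chr_t>(-1);
        }
        result = (result << 6) | (next & 0x3F);
    }
    return result;
}

chr_t getCodePoint(std::istream& stream)
{
    int next = stream.get();
    if (next == EOF) {
        return static_cast<chr_t>(-1);         // get() has already set failbit
    }
    for (auto const& type : utf8Info) {
        if ((next & type.mask) == type.value) {
            return decodeUtf(stream, next & ~type.mask, type.extra);
        }
    }
    stream.setstate(std::ios::badbit);         // not a valid lead byte
    return static_cast<chr_t>(-1);
}

std::istream& operator>>(std::istream& str, chr_t& out)
{
    chr_t tmp = getCodePoint(str);
    if (str) {
        out = tmp;
    }
    return str;
}

// Hypothetical usage helper: decodes until end of input or the first error.
std::vector<chr_t> decodeAll(std::string const& bytes)
{
    std::istringstream stream(bytes);
    std::vector<chr_t> result;
    chr_t cp;
    while (stream >> cp) {
        result.push_back(cp);
    }
    return result;
}
```

The loop in decodeAll is the usual idiom for formatted input: it stops as soon as the stream reports end of input or an error, so a malformed sequence ends decoding without producing a bogus code point.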