
As others have said, don't use floating point math, but in some sense that's reviewing the wrong layer. The real issue is that you don't need to branch on a derived quantity (the number of bits) at all. Instead, branch on the codepoint value ranges, i.e. on the original input. For example (excerpt from my implementation):

} else if ((unsigned)wc < 0x800) {
    *s++ = 0xc0 | (wc>>6);
    *s = 0x80 | (wc&0x3f);
    return 2;
}

Not only is branching directly on the input quantity simpler than computing a derived quantity like the number of bits; for the problem at hand (UTF-8) it's necessary in order to do proper error handling. Boundaries that are not exact numbers of bits (between D800 and DFFF, above 10FFFF) correspond to erroneous inputs that should not be output as malformed UTF-8 but rejected in some manner.
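To make the error-handling point concrete, here is a minimal sketch of what a full chain of range checks might look like. It is not the implementation excerpted above: the name `utf8_encode`, the `uint32_t` parameter, and the "return 0 on invalid input" convention are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and error convention are assumptions):
 * encode one code point into s, which must have room for 4 bytes.
 * Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(uint32_t cp, unsigned char *s)
{
    if (cp < 0x80) {
        /* 1 byte: 0xxxxxxx */
        s[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {
        /* 2 bytes: 110xxxxx 10xxxxxx */
        s[0] = (unsigned char)(0xc0 | (cp >> 6));
        s[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    } else if (cp < 0x10000) {
        /* Surrogates D800..DFFF are not valid scalar values: reject. */
        if (cp >= 0xd800 && cp <= 0xdfff)
            return 0;
        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        s[0] = (unsigned char)(0xe0 | (cp >> 12));
        s[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        s[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    } else if (cp <= 0x10ffff) {
        /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        s[0] = (unsigned char)(0xf0 | (cp >> 18));
        s[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        s[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        s[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }
    /* Above 10FFFF: outside the Unicode code space, reject. */
    return 0;
}
```

Note that the two rejection cases fall out of the range comparisons themselves; a bit-count computation would lump D800..DFFF in with every other 16-bit value and could not tell 10FFFF from 110000.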
