

added 390 characters in body


edited May 19, 2020 at 0:23

R.. GitHub STOP HELPING ICE


As others have said, don't use floating-point math, but in some sense that's reviewing the wrong layer. The real issue is that you don't need to branch on a derived quantity, the number of bits, at all. Instead, branch on the codepoint value ranges (the original input). For example (excerpt from my implementation):

} else if ((unsigned)wc < 0x800) {
    *s++ = 0xc0 | (wc>>6);
    *s = 0x80 | (wc&0x3f);
    return 2;
}

Not only is branching directly on the input quantity simpler than computing a derived quantity like the number of bits; for the problem at hand (UTF-8) it's necessary in order to do proper error handling. Boundaries that are not exact numbers of bits (the range D800 to DFFF, and anything above 10FFFF) correspond to erroneous inputs that should not be output as malformed UTF-8 but rejected in some manner.
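To make the range-based approach concrete, here is a minimal sketch of a complete encoder that branches only on the codepoint value. It is not the implementation the excerpt above was taken from: the name `utf8_encode`, the `uint32_t` parameter, and the convention of returning 0 for rejected input are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original answer): encode code point cp
 * into buf, which must have room for at least 4 bytes. Returns the number of
 * bytes written, or 0 if cp is not encodable (surrogate or above 0x10FFFF). */
size_t utf8_encode(unsigned char *buf, uint32_t cp)
{
    if (cp < 0x80) {
        /* 1 byte: 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {
        /* 2 bytes: 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xc0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    } else if (cp < 0x10000) {
        /* Surrogates D800..DFFF fall inside the 16-bit range but are invalid. */
        if (cp >= 0xd800 && cp <= 0xdfff)
            return 0;
        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xe0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    } else if (cp <= 0x10ffff) {
        /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xf0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }
    /* Above 10FFFF: not a Unicode code point, so reject it. */
    return 0;
}
```

Note that the two rejection cases (surrogates and values above 10FFFF) sit at boundaries that a bit count cannot express, which is why the comparisons are made on the value itself.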



answered May 18, 2020 at 20:36

R.. GitHub STOP HELPING ICE


As others have said, don't use floating-point math, but in some sense that's reviewing the wrong layer. The real issue is that you don't need to branch on a derived quantity, the number of bits, at all. Instead, branch on the codepoint value ranges (the original input). For example (excerpt from my implementation):

} else if ((unsigned)wc < 0x800) {
    *s++ = 0xc0 | (wc>>6);
    *s = 0x80 | (wc&0x3f);
    return 2;
}