Convert tri-gram of type wchar_t to an integer type

Question

I am trying to convert a tri-gram containing 3 characters of type wchar_t to an integer type (an uint64_t at the present time).

It's been quiet a while I've been working with shifting and squeezing bytes. Here's what I've come up with so far and it - at least the few test cases in the tri-gram list - seems to work so far. But actually I am not very confident with this code, also I have the feeling I may have complicated things more than needed. Code verbosity came out of experimenting.

In this version I am just taking 3-byte chars into account and there's not much of error checking yet. I should mention that portability is not much of a point, as this code will only run on Unix/Linux boxes. Endianess should be properly implemented then.

Please some suggestions or maybe more simple / reliable solution the one or other line.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>
#include <stdint.h>
#include <inttypes.h>
#include <math.h>

/**
 * Reconstruct trigram from uint64_t value
 */
void to_wchar(const uint64_t v, wchar_t *s)
{
    uint32_t v32_4 = (uint16_t) ((v) >> 16);
    uint32_t v32_3 = (uint16_t) ((v) >> 32);
    uint32_t v32_2 = (uint16_t) ((v) >> 48);
    uint32_t v32_1 = (uint16_t) ((v)  & 0x001FFFFF); //0x0000FFFFuL);

    s[3] = v32_1; //'\0';
    s[2] = v32_4;
    s[1] = v32_3;
    s[0] = v32_2;
}

/**
 * Convert byte array to uint64, storing 3 uint32_t values
 */
uint64_t to_uint64(const uint8_t *v)
{
    uint64_t ui64 = 
    /* wchar */
    ((int64_t) v[2]  << 56) |
    ((int64_t) v[3]  << 48) |
    /* wchar */
    ((int64_t) v[6]  << 40) |
    ((int64_t) v[7]  << 32) |
    /* wchar */
    ((int64_t) v[10] << 24) |
    ((int64_t) v[11] << 16) |
    /* unused */
    ((int64_t) 0 << 8)      | 
    (int64_t) 0;

    fprintf(stderr, "0x%" PRIx64 " (%ld)\n", ui64, ui64);

    return ui64;
}

int trigram_to_ui64(const wchar_t *v, uint64_t *ui64)
{
    uint8_t *z = NULL;
    int i;

    if ((WCHAR_MIN > 0) || (WCHAR_MAX < 0x10FFFF)) {
        fprintf(stderr, "Unprocessable rare UTF8 character. Found WCHAR_MIN=%d and WCHAR_MAX=%d", WCHAR_MIN, WCHAR_MAX);
        return 0;
    }

    z = malloc(3 * sizeof(wchar_t));
    memset(z, 0, sizeof(*z));

    uint32_t *ptr = (uint32_t *)z;
    for (i = 0; i < wcslen(v); i++) {
        wchar_t wc = v[i];

        /* Replace 4-byte and more chars, 0x007f = DEL */
        if (wc >= 0x10000)  
            wc = (count_bytes(wc) > 3 ? 0x007f : wc);

        /* Pack */
        uint8_t* data = (uint8_t*)ptr;
        data[0] = (uint8_t)((wc >> 24) & 0xff);
        data[1] = (uint8_t)((wc >> 16) & 0xff);
        data[2] = (uint8_t)((wc >> 8) & 0xff);
        data[3] = (uint8_t)(wc & 0xff);

        ptr++;
    }

    *ui64 = to_uint64(z);   

    free(z);

    return 1;
}

int ui64_to_trigram(const uint64_t ui64, wchar_t *trigram)
{
    to_wchar(ui64, trigram);

    return 1;
}


int main(int argc, char **argv)
{
    static const wchar_t *trigrams[] =
    {
        L"öx再",
        L"©¥c",
        L"再学习",
        L"AiB",
        L"a—c",
        L"екс",
        L"öbä"
        L"",
        L"A B",
        L" 学习",
        L"AB ",
        L"xy𝄞",
        NULL
    };

    int i;
    const wchar_t **str;
    uint64_t ui64;
    wchar_t aux[3] = {0};

    setlocale(LC_CTYPE, "");
    setlocale(LC_COLLATE, "");

    /* Basic testing ... */
    for (str = trigrams; *str; str++) {
            printf("==> '%ls'\n", *str);

            /* Convert to uint64_t */
            trigram_to_ui64(*str, &ui64);

            /* Convert back to wchar_t string */
            ui64_to_trigram(ui64, aux);

            printf("<== '%ls': ", aux);
            if (wcscmp(*str, aux) == 0)
                    printf("OK\n");
            else
                    printf("NOT OK\n");
    }

    return 0;
}

Just wanted to say: Good on you for using uint64_t instead of assuming unsigned long would do the trick. — anon
– anon, Commented May 14, 2016 at 15:34
@QPaysTaxes: Yes indeed. But honestly i even did not think about choosing type unsigned long :-) — Andreas W. Wylach
– Andreas W. Wylach, Commented May 16, 2016 at 8:10
Do you expect memset(z, 0, sizeof(*z)); to do more than z[0] = 0;? — chux
– chux, Commented Dec 3, 2024 at 15:37

pacmaninbw · Accepted Answer · 2024-12-03 15:20:27Z

General Observations

Some of the functions return values that indicate success or failure, these values are never checked in the program where the functions are used.

There are a lot of possible bugs in this code.

The Function `uint64_t to_uint64(const uint8_t *v)`

There are a number of issues I see in this function, it could be simplified by calling a function that does a single 32 bit value in a loop. There is no masking of the values either on the left hand side or the right hand side. There are too many Magic Numbers (see below). It violates DRY programming (see below).

The Function `int trigram_to_ui64(const wchar_t v, uint64_t ui64)`

It isn't clear why the variable z is being allocated memory since a simple array would be easier to implement and the allocated memory never changes size.

    wchar_t z[3] = {0};

The variable z should not be declared until after the test:

    if ((WCHAR_MIN > 0) || (WCHAR_MAX < 0x10FFFF)) {
        fprintf(stderr, "Unprocessable rare UTF8 character. Found WCHAR_MIN=%d and WCHAR_MAX=%d", WCHAR_MIN, WCHAR_MAX);
        return 0;
    }

Let the Compiler Tell You Where Possible Issues Are

When doing C program development it is always a good practice to compile with the -Wall and -Wextra flags. These flags will help you find possible errors in the code. When I was learning how to program in C on Unix we had a program call lint which helped identify possible problems in the code. I always ran lint on my code. To some extent the -Wall and -Wextra flags have replaced lint.

If the program had been compile with the -Wall andd -Wextra flags it would have found the following issues:

/usr/bin/gcc-12   -g -Wall -Wextra -Werror -MD -MT CMakeFiles/zombie.dir/main.c.o -MF CMakeFiles/zombie.dir/main.c.o.d -o CMakeFiles/zombie.dir/main.c.o -c main.c
main.c: In function ‘trigram_to_ui64’:
main.c:64:19: error: comparison of integer expressions of different signedness: ‘int’ and ‘size_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
   64 |     for (i = 0; i < wcslen(v); i++) {
      |                   ^
main.c:69:19: error: implicit declaration of function ‘count_bytes’ [-Werror=implicit-function-declaration]
   69 |             wc = (count_bytes(wc) > 3 ? 0x007f : wc);
      |                   ^~~~~~~~~~~
main.c: In function ‘main’:
main.c:115:9: error: unused variable ‘i’ [-Werror=unused-variable]
  115 |     int i;
      |         ^
main.c:96:14: error: unused parameter ‘argc’ [-Werror=unused-parameter]
   96 | int main(int argc, char **argv)
      |          ~~~~^~~~
main.c:96:27: error: unused parameter ‘argv’ [-Werror=unused-parameter]
   96 | int main(int argc, char **argv)
      |                    ~~~~~~~^~~~
cc1: all warnings being treated as errors

Declare the Variables as Needed

In the original version of C back in the 1970s and 1980s variables had to be declared at the top of the function. That is no longer the case, and a recommended programming practice to declare the variable as needed. In C the language doesn't provide a default initialization of the variable so variables should be initialized as part of the declaration. For readability and maintainability each variable should be declared and initialized on its own line.

Following this practice will also delete or remove the warning messages about unused variables.

For loop control variables should be declared in the for loop. Variables that are only used in the loop should be declared within the loop.

    for (const wchar_t **str = trigrams; *str; str++) {
        uint64_t ui64;
        wchar_t aux[3] = {0};
        printf("==> '%ls'\n", *str);

        /* Convert to uint64_t */
        trigram_to_ui64(*str, &ui64);

        /* Convert back to wchar_t string */
        ui64_to_trigram(ui64, aux);

        printf("<== '%ls': ", aux);
        if (wcscmp(*str, aux) == 0)
            printf("OK\n");
        else
            printf("NOT OK\n");
    }

Prefer Braces `{` and `}` Around Single Statements in `if` or `loops`

Some programmers consider this a style issue, but it makes it much easier to read and maintain the code if each in an if, else or loop block is embedded within braces. Extending the functionality of these statements can be problematic when the braces are not used. For a more in depth discussion of this see the first 2 answers on this Stack Overflow question. As one of the answers points out this is true in all C like languages (C, C++, C#, JavaScript, Java, etc.). I have worked at multiple companies where this was required in the coding standard and flagged during code reviews.

Magic Numbers

There are Magic Numbers through out the program, it would be better to create symbolic constants for them to make the code more readble and easier to maintain. These numbers may be used in many places and being able to change them by editing only one line makes maintainence easier.

Numeric constants in code are sometimes referred to as Magic Numbers, because there is no obvious meaning for them. There is a discussion of this on stackoverflow.

Some of the magic numbers:

        data[0] = (uint8_t)((wc >> 24) & 0xff);
        data[1] = (uint8_t)((wc >> 16) & 0xff);
        data[2] = (uint8_t)((wc >> 8) & 0xff);
        data[3] = (uint8_t)(wc & 0xff);

Test for Possible Memory Allocation Errors

In modern high-level languages such as C++, memory allocation errors throw an exception that the programmer can catch. This is not the case in the C programming language. While it is rare in modern computers because there is so much memory, memory allocation can fail, especially if the code is working in a limited memory application such as embedded control systems. In the C programming language when memory allocation fails, the functions malloc(), calloc() and realloc() return NULL. Referencing any memory address through a NULL pointer results in undefined behavior (UB).

Possible unknown behavior in this case can be a memory page error (in Unix this would be call Segmentation Violation), corrupted data in the program and in very old computers it could even cause the computer to reboot (corruption of the stack pointer).

To prevent this undefined behavior a best practice is to always follow the memory allocation statement with a test that the pointer that was returned is not NULL.

Convention When Using Memory Allocation in C

When using malloc(), calloc() or realloc() in C a common convention is to sizeof(*PTR) rather sizeof(PTR_TYPE), this make the code easier to maintain and less error prone, since less editing is required if the type of the pointer changes.

DRY Code

There is a programming principle called the Don't Repeat Yourself Principle sometimes referred to as DRY code. If you find yourself repeating the same code multiple times it is better to encapsulate it in a function. If it is possible to loop through the code that can reduce repetition as well.

Stack Exchange Network

Convert tri-gram of type wchar_t to an integer type

1 Answer 1

General Observations

The Function `uint64_t to_uint64(const uint8_t *v)`

The Function `int trigram_to_ui64(const wchar_t v, uint64_t ui64)`

Let the Compiler Tell You Where Possible Issues Are

Declare the Variables as Needed

Prefer Braces `{` and `}` Around Single Statements in `if` or `loops`

Magic Numbers

Test for Possible Memory Allocation Errors

Convention When Using Memory Allocation in C

DRY Code

You must log in to answer this question.

Hot Network Questions

Convert tri-gram of type wchar_t to an integer type

1 Answer 1

General Observations

The Function uint64_t to_uint64(const uint8_t *v)

The Function int trigram_to_ui64(const wchar_t *v, uint64_t *ui64)

Let the Compiler Tell You Where Possible Issues Are

Declare the Variables as Needed

Prefer Braces { and } Around Single Statements in if or loops

Magic Numbers

Test for Possible Memory Allocation Errors

Convention When Using Memory Allocation in C

DRY Code

You must log in to answer this question.

Related

Hot Network Questions

The Function `uint64_t to_uint64(const uint8_t *v)`

The Function `int trigram_to_ui64(const wchar_t v, uint64_t ui64)`

Prefer Braces `{` and `}` Around Single Statements in `if` or `loops`