Revisions to Hash table implementation in c

Updated with memory leak info

Source Link

edited Sep 19, 2017 at 13:24

aghast

12.6k
25
46

You recompute the hash during resize operations. As your hash functions grows more powerful, this gets more expensive. I'd suggest storing the hash value in the key record, and not doing any hashing during resize - just compute the stored hash value modulemodulo the new table size.
Your type names are a mix of Pascal case and the C _t suffix. I'd suggest that you pick one: either use hashtable_t or HashTable as your type name.
Further, why not include the pointer in the type definition? Instead of declaring HashTable *ht; it would be easier to declare HashTable ht; and let the type be opaque.
You are leaking names into the outside. The various table-entry functions should definitely be prefixed with te_ (new, hash, and lookup). Ideally, they should be declared static and completely hidden from outside access.
Consider storing your hash/key/value entries in a simple dynamic array. You can then create a separate array of shorts or ints (depending on size) to serve as your hash space. You would hash into the hash space, then use that to find an index into the main table of entries. This lets you store the entries in insertion order (which can be useful), and lets you access the hash space using open addressing with a fairly high degree of locality. (You could even use unsigned char indexes in your hash space, if you felt there were enough use cases for small tables. Keep in mind that the size of the index is determined by the number of entries, not the size of the hash space: you could have 1024 buckets with < 256 entries and still use uchar indexes - damn few collisions, too, I'd bet.)

You recompute the hash during resize operations. As your hash functions grows more powerful, this gets more expensive. I'd suggest storing the hash value in the key record, and not doing any hashing during resize - just compute the stored hash value module the new table size.
Your type names are a mix of Pascal case and the C _t suffix. I'd suggest that you pick one: either use hashtable_t or HashTable as your type name.
Further, why not include the pointer in the type definition? Instead of declaring HashTable *ht; it would be easier to declare HashTable ht; and let the type be opaque.
You are leaking names into the outside. The various table-entry functions should definitely be prefixed with te_ (new, hash, and lookup). Ideally, they should be declared static and completely hidden from outside access.
Consider storing your hash/key/value entries in a simple dynamic array. You can then create a separate array of shorts or ints (depending on size) to serve as your hash space. You would hash into the hash space, then use that to find an index into the main table of entries. This lets you store the entries in insertion order (which can be useful), and lets you access the hash space using open addressing with a fairly high degree of locality. (You could even use unsigned char indexes in your hash space, if you felt there were enough use cases for small tables. Keep in mind that the size of the index is determined by the number of entries, not the size of the hash space: you could have 1024 buckets with < 256 entries and still use uchar indexes - damn few collisions, too, I'd bet.)

You recompute the hash during resize operations. As your hash functions grows more powerful, this gets more expensive. I'd suggest storing the hash value in the key record, and not doing any hashing during resize - just compute the stored hash value modulo the new table size.
Your type names are a mix of Pascal case and the C _t suffix. I'd suggest that you pick one: either use hashtable_t or HashTable as your type name.
Further, why not include the pointer in the type definition? Instead of declaring HashTable *ht; it would be easier to declare HashTable ht; and let the type be opaque.
You are leaking names into the outside. The various table-entry functions should definitely be prefixed with te_ (new, hash, and lookup). Ideally, they should be declared static and completely hidden from outside access.
Consider storing your hash/key/value entries in a simple dynamic array. You can then create a separate array of shorts or ints (depending on size) to serve as your hash space. You would hash into the hash space, then use that to find an index into the main table of entries. This lets you store the entries in insertion order (which can be useful), and lets you access the hash space using open addressing with a fairly high degree of locality. (You could even use unsigned char indexes in your hash space, if you felt there were enough use cases for small tables. Keep in mind that the size of the index is determined by the number of entries, not the size of the hash space: you could have 1024 buckets with < 256 entries and still use uchar indexes - damn few collisions, too, I'd bet.)

Updated with memory leak info

Source Link

edited Sep 19, 2017 at 13:18

aghast

12.6k
25
46

Update:

Here's the leaking function:

TableEntry_t *new(char *k, char *v)
{
    TableEntry_t *te = NULL;
    if ((te = malloc(sizeof(TableEntry_t))) == NULL)
        return NULL;
    if ((te->key = strdup(k)) == NULL)
        return NULL;
    if ((te->val = strdup(v)) == NULL)
        return NULL;
    te->next = NULL;
    return te;
}

Now, suppose that the expression (te->val = strdup(v)) == NULL evaluates as true.

In that case, you have your te pointing to allocated memory; you have te->key pointing to allocated memory; and you have te->next pointing to uninitialized memory.

Ignoring my suggestion to completely rewrite this part (#6), I would suggest rewriting this using calloc. Yes, it's possible to manage the zeros on your own for a structure this small, but why bother?

Something like this:

if ((te = calloc(1, sizeof *te)) == NULL
    || (te->key = strdup(k)) == NULL 
    || (te->val = strdup(v)) == NULL) 
{
    te_free(te);
    return NULL;
}

return te;

You will need to amend te_free to actually free the entry, and to get rid of that pesky recursion (which exposes you to a stack overflow attack). The current version never frees the te pointer, for some reason:

void te_free(TableEntry_t *te)
{
    if (te->next != NULL) 
    {
        te_free(te->next);
        free(te->next);
    }
    free(te->key);
    free(te->val);
}

Instead, use a loop and consume the entries from the front:

TableEntry_t *next;

while (te != NULL) 
{
    next = te->next;
    free(te->key);
    free(te->val);
    free(te);
    te = next;
}

Update:

Here's the leaking function:

TableEntry_t *new(char *k, char *v)
{
    TableEntry_t *te = NULL;
    if ((te = malloc(sizeof(TableEntry_t))) == NULL)
        return NULL;
    if ((te->key = strdup(k)) == NULL)
        return NULL;
    if ((te->val = strdup(v)) == NULL)
        return NULL;
    te->next = NULL;
    return te;
}

Now, suppose that the expression (te->val = strdup(v)) == NULL evaluates as true.

In that case, you have your te pointing to allocated memory; you have te->key pointing to allocated memory; and you have te->next pointing to uninitialized memory.

Ignoring my suggestion to completely rewrite this part (#6), I would suggest rewriting this using calloc. Yes, it's possible to manage the zeros on your own for a structure this small, but why bother?

Something like this:

if ((te = calloc(1, sizeof *te)) == NULL
    || (te->key = strdup(k)) == NULL 
    || (te->val = strdup(v)) == NULL) 
{
    te_free(te);
    return NULL;
}

return te;

You will need to amend te_free to actually free the entry, and to get rid of that pesky recursion (which exposes you to a stack overflow attack). The current version never frees the te pointer, for some reason:

void te_free(TableEntry_t *te)
{
    if (te->next != NULL) 
    {
        te_free(te->next);
        free(te->next);
    }
    free(te->key);
    free(te->val);
}

Instead, use a loop and consume the entries from the front:

TableEntry_t *next;

while (te != NULL) 
{
    next = te->next;
    free(te->key);
    free(te->val);
    free(te);
    te = next;
}

Source Link

answered Sep 19, 2017 at 4:50

aghast

12.6k
25
46

Probably the biggest pitfall I see is the question of ownership. As the code is written, you call strdup on both the key and value when you insert them in the new function. (By the way: you leak memory if strdup returns NULL.)

That makes the ownership clear, but it forces a copy of both the key and value, and doesn't allow the user to do easy updates. I would suggest not trying to duplicate the value part at all. I would further suggest making that a void * pointer rather than a char * - the current implementation is too tied to the string -> string model.

You might keep your own copy of the key - that makes sense if you are reading keys from a file buffer where the memory is likely to be re-used. Again, though, I would suggest pushing the ownership issue onto the user: let them call strdup for you, or take responsibility for interning the strings, or whatever.

You recompute the hash during resize operations. As your hash functions grows more powerful, this gets more expensive. I'd suggest storing the hash value in the key record, and not doing any hashing during resize - just compute the stored hash value module the new table size.
Your type names are a mix of Pascal case and the C _t suffix. I'd suggest that you pick one: either use hashtable_t or HashTable as your type name.
Further, why not include the pointer in the type definition? Instead of declaring HashTable *ht; it would be easier to declare HashTable ht; and let the type be opaque.
You are leaking names into the outside. The various table-entry functions should definitely be prefixed with te_ (new, hash, and lookup). Ideally, they should be declared static and completely hidden from outside access.
Consider storing your hash/key/value entries in a simple dynamic array. You can then create a separate array of shorts or ints (depending on size) to serve as your hash space. You would hash into the hash space, then use that to find an index into the main table of entries. This lets you store the entries in insertion order (which can be useful), and lets you access the hash space using open addressing with a fairly high degree of locality. (You could even use unsigned char indexes in your hash space, if you felt there were enough use cases for small tables. Keep in mind that the size of the index is determined by the number of entries, not the size of the hash space: you could have 1024 buckets with < 256 entries and still use uchar indexes - damn few collisions, too, I'd bet.)

I really think you need a better hashing function. In order to prepare for this, you should probably treat your key as a memory block, instead of as a string: pass a start-pointer and a size to the hash function.
Next, you might consider using a much stronger hash. Even a cryptographic hash. If you are caching your hash values, you'll never have to call it again on a key, and at some point the probability of a collision gets very close to zero (this is how Mercurial identifies changesets, for example).

Stack Exchange Network

Return to Answer