Does awk use separate chaining, open addressing or does it have its own way of handling collisions in a hashmap?
Do gawk and nawk implement the same algorithm?
Thank you.
Check https://www.gnu.org/software/gawk/manual/gawk.html#Other-Environment-Variables
Among the environment variables listed there which "influence how gawk behaves", two are relevant here. There is a warning that these are meant for use by the gawk developers for testing and tuning, and are subject to change.
INT_CHAIN_MAX
This specifies intended maximum number of items gawk will maintain on a hash chain for managing arrays indexed by integers.
STR_CHAIN_MAX
This specifies intended maximum number of items gawk will maintain on a hash chain for managing arrays indexed by strings.
So it would appear that gawk handles hash collisions by separate chaining: keys that hash to the same bucket are kept, in full, on a chain hanging off that bucket.
It is not clear what gawk does when this "maximum" is reached, since a single over-long chain cannot easily be fixed in isolation. From other material (which I cannot now find) I suspect these maxima refer to the average chain length over the array as a whole: when the average is exceeded, gawk can allocate a larger hash table, which redistributes the former collisions, and then rebuild all the chains.
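To make that guess concrete, here is a minimal sketch in Python (not gawk's actual code) of a separate-chaining table that doubles its bucket count whenever the average chain length goes over a limit. The class name `ChainedTable`, the starting size of 4 buckets and the `chain_max` parameter are all made up for illustration; `chain_max` only plays the role I am assuming STR_CHAIN_MAX has.

```python
class ChainedTable:
    """Toy separate-chaining hash table. Illustrative only, not gawk's code."""

    def __init__(self, buckets=4, chain_max=2):
        self.buckets = [[] for _ in range(buckets)]
        self.chain_max = chain_max  # assumed analogue of STR_CHAIN_MAX
        self.count = 0

    def _chain(self, key):
        # All keys with the same hash modulo the table size share one chain.
        return self.buckets[hash(key) % len(self.buckets)]

    def set(self, key, value):
        chain = self._chain(key)
        for pair in chain:
            if pair[0] == key:      # the full key is compared, not just the hash
                pair[1] = value
                return
        chain.append([key, value])
        self.count += 1
        # Grow when the AVERAGE chain length exceeds the limit.
        if self.count / len(self.buckets) > self.chain_max:
            self._grow()

    def get(self, key, default=None):
        for k, v in self._chain(key):
            if k == key:
                return v
        return default

    def _grow(self):
        # Double the bucket count and re-chain every existing entry.
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for chain in old:
            for pair in chain:
                self._chain(pair[0]).append(pair)


table = ChainedTable()
for i in range(100):
    table.set("key" + str(i), i)
print(len(table.buckets), table.count / len(table.buckets))
```

Note that an individual chain can still be longer than `chain_max` under this scheme; only the average is bounded, which would fit the word "intended" in the manual's wording.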
It is also documented that "all array indexes are strings", yet iterating over an array indexed by small integers does visit them in numeric order (up to some limit on the order of thousands). So gawk is probably applying more heuristics than one would expect. For example, it could keep each array as a direct index lookup for small integers, plus a hash for all other strings.
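Purely as an illustration of that last guess, a hybrid array could look something like the Python sketch below. This is speculation, not gawk's implementation: `HybridArray`, `direct_limit` and the cutoff of 1024 are hypothetical names and values. Also bear in mind that awk leaves the order of `for (k in arr)` unspecified, so any ordering you observe should not be relied on.

```python
_MISSING = object()  # marks an unused direct slot


class HybridArray:
    """Speculative sketch: direct slots for small ints, a hash for the rest."""

    def __init__(self, direct_limit=1024):
        self.direct_limit = direct_limit  # hypothetical cutoff
        self.slots = []                   # direct lookup, indexed by integer
        self.other = {}                   # ordinary hash for all other keys

    def _small_int(self, key):
        s = str(key)
        # Only canonical decimal forms qualify: "7" yes, "07" and "-1" no.
        if s.isascii() and s.isdigit() and s == str(int(s)):
            if int(s) < self.direct_limit:
                return int(s)
        return None

    def set(self, key, value):
        i = self._small_int(key)
        if i is None:
            self.other[str(key)] = value
            return
        while len(self.slots) <= i:
            self.slots.append(_MISSING)
        self.slots[i] = value

    def get(self, key, default=None):
        i = self._small_int(key)
        if i is None:
            return self.other.get(str(key), default)
        if i < len(self.slots) and self.slots[i] is not _MISSING:
            return self.slots[i]
        return default

    def keys(self):
        # Small integers come out in numeric order, then everything else.
        for i, value in enumerate(self.slots):
            if value is not _MISSING:
                yield str(i)
        yield from self.other


arr = HybridArray()
for k in ("10", "2", "foo", 1, "07"):
    arr.set(k, "x")
print(list(arr.keys()))
```

With a layout like this, small integer subscripts would come out in numeric order simply because they are scanned slot by slot, while "07" stays a distinct string key from "7", as it is in awk.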