0

Does awk use separate chaining, open addressing or does it have its own way of handling collisions in a hashmap?

Do gawk and nawk implement the same algorithm?

Thank you.

10
  • Do you, by "hashmap", mean "associative array"? Have you looked at the implementation? Commented Jun 17, 2022 at 10:29
  • One could also ask if all versions of whatever implementation have always used the same algorithm. But I wonder, does it matter how they're implemented as long as the implementations work? Commented Jun 17, 2022 at 10:48
  • 3
    How do you mean "implemented them in my script"? I don't think I understand. I see no code in the question. Commented Jun 17, 2022 at 11:14
  • 1
    That would be a question for the people who provide those tools. For gawk you can contact them with questions like this at: help dash gawk at gnu dot org. Bear in mind, though, that whatever answer you get is just the answer NOW, they could change their implementation tomorrow. Commented Jun 17, 2022 at 12:16
  • 1
    GNU/awk is open source, and the code is easy to find (not so easy to follow, but a few structs should give the game away). nawk is (I suspect) Oracle protected. Commented Jun 17, 2022 at 13:09

1 Answer 1

1

Check https://www.gnu.org/software/gawk/manual/gawk.html#Other-Environment-Variables

This specifies two environment variables which "influence how gawk behaves". There is a warning that these are meant for use by the gawk developers for testing and tuning, and are subject to change.

INT_CHAIN_MAX

This specifies intended maximum number of items gawk will maintain on a hash chain for managing arrays indexed by integers.

STR_CHAIN_MAX

This specifies intended maximum number of items gawk will maintain on a hash chain for managing arrays indexed by strings.

So it would appear that gawk manages key hash collisions by chaining the affected full keys from the single hash.

It is not clear what gawk does when this "maximum" is reached, as it cannot easily resolve that single chain. From other material (which I cannot now find) I suspect these maxima are the average chain length for the array as a whole: so when the average is exceeded, it can allocate a larger initial hash which will redistribute the former collisions, and then rebuild all the chains.

It is also known that "all array indexes are strings". But also, that iterating over an array indexed by small integers does iterate in numeric order (up to some limit of the order of thousands). Gawk is probably more heuristic than would be expected. It could keep each array as a direct index lookup for small ints, plus a hash for other strings, for example.

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.