
runtime: Rewrite hashmap data structure #4297

Draft
wants to merge 2 commits into base: dev
Conversation

lpereira
Contributor

This structure now deviates quite a bit from the one used in the Big Go implementation, and is more suitable for environments with severe memory constraints.

A hashmap contains a linked list of buckets, each of which can store up to 8 elements; a number of buckets is pre-initialized according to the size hint given during hashmap instantiation. An additional bucket is allocated whenever necessary; a bucket reaches 100% utilization before a new one is needed.
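A hedged sketch of that layout, in Go: the type and field names here are illustrative, not the actual TinyGo runtime definitions, and `bucketsForHint` is a hypothetical helper mirroring the pre-initialization described above.

```go
package main

import "fmt"

const bucketSize = 8 // each bucket holds up to 8 elements

type bucket struct {
	tophash [bucketSize]uint8 // one byte of hash per slot; 0 marks an empty slot
	next    *bucket           // singly linked list of buckets
	// Keys and values for each slot would follow here, laid out
	// according to the map's key and value sizes.
}

type hashmap struct {
	head  *bucket // first bucket in the list
	count uint    // number of elements currently stored
}

// bucketsForHint returns how many buckets to pre-initialize for a given
// size hint, rounding up to whole buckets of 8 slots.
func bucketsForHint(sizeHint int) int {
	n := (sizeHint + bucketSize - 1) / bucketSize
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(bucketsForHint(20)) // 20 elements fit in 3 buckets of 8
}
```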

Only the topmost 8 bits of the hash of a key (tophash) are stored per key, and multiple keys with the same tophash can exist in the same bucket. A tophash of 0 indicates an empty slot in a bucket. A fast byte scan is used to determine which slots correspond to a particular tophash inside a bucket, which makes up for the fact that a linear scan through every bucket is necessary to find one that might contain a particular key.
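The tophash scheme can be sketched roughly as follows; this is an assumption-laden illustration (function names and the exact remapping of 0 are mine, not the PR's), showing how one byte per slot is enough to cheaply filter candidate slots before a full key comparison.

```go
package main

import "fmt"

const bucketSize = 8

// tophash reduces a 32-bit hash to one byte, reserving 0 for "empty slot".
// The remapping of 0 to 1 is an illustrative choice, not the PR's code.
func tophash(hash uint32) uint8 {
	th := uint8(hash >> 24)
	if th == 0 {
		th = 1 // 0 means empty, so a real hash must not produce it
	}
	return th
}

// scanBucket returns a bitmask of the slots whose stored tophash matches th.
// A full key comparison is still needed for each candidate slot, since
// different keys can share a tophash.
func scanBucket(tophashes *[bucketSize]uint8, th uint8) uint8 {
	var mask uint8
	for i, t := range tophashes {
		if t == th {
			mask |= 1 << uint(i)
		}
	}
	return mask
}

func main() {
	slots := [bucketSize]uint8{0, 5, 9, 5, 0, 0, 0, 0}
	fmt.Printf("%08b\n", scanBucket(&slots, 5)) // slots 1 and 3 match: 00001010
}
```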

Bucket slots aren't indexed by the hash of a key, so there's no need to carry a seed per hashmap instance to avoid issues such as the one described by Crosby and Wallach[1].

As noted above, a bucket reaches 100% utilization before another one is needed, and a fast byte lookup determines which slots match a particular tophash. There's also no rehashing when the hashmap grows: existing entries stay where they are, which avoids generating too much garbage and reuses memory allocations as much as possible.
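Growth under this scheme can be sketched as follows: when every slot in the list is occupied, a fresh bucket is appended and no existing entry moves. `insertSlot` and its types are illustrative helpers under my own assumptions, not the PR's actual code.

```go
package main

import "fmt"

const bucketSize = 8

type bucket struct {
	tophash [bucketSize]uint8 // 0 marks an empty slot
	next    *bucket
}

// insertSlot claims the first free slot in the bucket list, appending a
// fresh bucket at the tail when every existing slot is occupied. Nothing
// is rehashed or moved on growth.
func insertSlot(head *bucket, th uint8) (*bucket, int) {
	b := head
	for {
		for i, t := range b.tophash {
			if t == 0 { // empty slot found
				b.tophash[i] = th
				return b, i
			}
		}
		if b.next == nil {
			b.next = &bucket{} // grow by exactly one bucket
		}
		b = b.next
	}
}

func main() {
	head := &bucket{}
	for i := 0; i < 9; i++ {
		insertSlot(head, 7) // the 9th insert spills into a second bucket
	}
	fmt.Println(head.next != nil) // prints "true"
}
```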

Some things that I still want to do include:

  • Make the bucket list circular, and have a hash-indexed array inside a hashmap, for faster lookup times. Could potentially grab 3 other bits from the 32-bit hash to skip ahead in the bucket linked list.

  • Randomize iteration.

  • Improve memory allocation when sizeHint is passed. Right now, each bucket is allocated individually rather than in a group of nodes.

[1] https://www.usenix.org/legacy/events/sec03/tech/full_papers/crosby/crosby.pdf

CC @dgryski

Signed-off-by: L. Pereira <l.pereira@fastly.com>
Move code to allocate a number of buckets to its own function, so that it
can be reused both in hashmapMake() and hashmapSet().

Signed-off-by: L. Pereira <l.pereira@fastly.com>
@lpereira
Contributor Author

(I found a way to improve this quite a bit, and am working on it. It should reduce per-bucket memory usage while also making lookups faster.)

@dgryski
Member

dgryski commented Jun 19, 2024

Since this seems to shift the tradeoffs of the hashmap data structure, I wonder if we want to make it build-time configurable alongside the current implementation (similar to how garbage collectors are also selectable).

@aykevl
Member

aykevl commented Jun 22, 2024

@lpereira could you maybe do some benchmarks to compare the two implementations? That will help to determine whether we need to keep both of them around.
I'd like to avoid having to select between the two if possible, because it increases the maintenance burden, and extra options aren't necessarily a good thing for users (if we have multiple implementations, it's probably the compiler that should pick the best implementation for a given system).
