A simple hash table in C (theleo.zone)
161 points by todsacerdoti on June 13, 2023 | 76 comments



I've recently had a look at QEMU's internals and saw their thread-safe implementation of a hash table, capable of concurrent reads: qht [0].

If the author sees this, you might want to take a look at it.

[0] https://github.com/qemu/qemu/blob/master/util/qht.c


Curious why concurrent reads would ever be an issue. As long as there are no writes during those reads... everything should be stable, no?


You can't really know that though.

Without appropriate memory barriers, you can end up with an inconsistent view of the memory due to out-of-order execution and other weird stuff the CPU does behind the scenes.

There are significant footguns around low-level concurrency primitives.


If you are doing locking to protect against writes occurring when reads occur, you get those barriers for free.

That said, they are using RCU to allow writers and readers to operate concurrently.


The real problem is resource management. If you have a threaded GC then you can just read, and the act of reading will cause the value you read to be alive. If you don't have a GC, however, you probably use reference counts, and now you may need to compose an atomic read of a pointer with an atomic reference count increment, but by the time you're doing the second step the object you're referencing may have been destructed.


> As long as there are no writes during those reads... everything should be stable, no?

Yes, but how do you ensure there are no writes during those reads?

You have to protect the reads against concurrent writes.

The simplest way is to use a mutex, but that doesn't support concurrent reads.

The next way, which is fairly common, is to use a read-write lock. That allows reads that are concurrent with each other, but only one write at a time, and no concurrency between reads and writes.

A standard read-write lock is not particularly fast for concurrent reads. The lock-for-read operation is required so that reads prevent a concurrent write from starting, and wait for a write already started to finish. That's not fast on a multi-core system because it forces cache line bouncing between cores.

That is, unless particularly fancy types of read-write locks optimised for mostly reading are used, such as rwlock-per-core, and those are slow for writes. They are somewhat fast for reads on architectures with fast atomic operations, but not as fast as possible.

Read-write locks also come in different flavours, depending on whether you want new reads to be blocked and queued when there's a write blocked waiting for current reads to finish. Fairness is an issue. This can get complicated, and bugs in libc rwlocks are not unheard of because of the complication.

A seqlock can be used, which has fast reads when there are no writes. They are fast on a multi-core system because there's no cache line bouncing between cores when there are only reads. But if there is a high rate of writes in one thread, it can block all reads continuously by causing them to livelock in loops. This is called spinning. More commonly, the writes tend to slow down seqlocked reads by a large factor in some scenarios without blocking them completely: just wasting a lot of CPU time and running slowly, out of proportion to the amount of blocking you would expect to be necessary.
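
To make that concrete, the read side of a plain seqlock is roughly this loop (a sketch only; `Table` and `find_nolock` are placeholder names, not the article's or QEMU's code):

    #include <stdatomic.h>

    /* Sketch of a seqlock read loop. */
    typedef struct Table { _Atomic unsigned seq; /* ... shared data ... */ } Table;
    int find_nolock(const Table *t, const char *key);   /* placeholder lookup */

    int seqlock_read(const Table *t, const char *key)
    {
        unsigned seq;
        int value;
        do {
            /* Wait for an even sequence number (no writer in progress). */
            do {
                seq = atomic_load_explicit(&t->seq, memory_order_acquire);
            } while (seq & 1);

            value = find_nolock(t, key);   /* plain, unsynchronised reads */

            /* If the sequence changed meanwhile, a writer ran: retry. */
            atomic_thread_fence(memory_order_acquire);
        } while (seq != atomic_load_explicit(&t->seq, memory_order_relaxed));
        return value;
    }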

Rather like an over-contended spinlock. Spinlocks, which come in mutex and read-write flavours, should rarely be used in threaded code outside a kernel, because they spin as described above, out of proportion to the amount of blocking that's really required, and the effect is much worse in pre-empted userspace threads than in a non-preemptible kernel.

It's possible to reduce seqlock and spinlock CPU spinning by transitioning to a different type of lock after some number of spins. Sometimes a dynamically estimated number of spins. This makes them behave better in userspace threaded code outside a kernel. But now your lock is rather complicated, and still not consistently fast at reads.

An approach which works really well is RCU. Concurrent reads can be very simple and never spin because there's no loop. There's no multi-core cache line bouncing. Writes are more complicated, and the reads have to adhere to certain patterns because the kind of concurrency allowed between reads and writes is different in RCU than with locks. It works best if your program has some kind of top-level event loop that is returned to often, to provide the "quiescent states" RCU requires. But there are other ways to do it, if there's no top level, they just require reads to do a bit more work than almost nothing.

Even RCU requires a little something in reads though, to get correct data. This is a data-dependency memory barrier. These barrier operations require zero instructions on nearly all CPUs because of how memory systems are designed, but are famously not free on the DEC Alpha which shows that it's not a "no operation", it just happens to be a side effect that is usually baked in. Even with zero instructions on nearly all CPUs, they limit which code optimisations the compiler is allowed to do, so have a non-zero average overhead, but it is very small in practice.

RCU is what the QEMU hash table uses.
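
For concreteness, a read-side lookup with a userspace RCU library such as liburcu looks roughly like this (a sketch; the types, `buckets` and `hash` are placeholders, not QEMU's code):

    #include <urcu.h>        /* userspace RCU (liburcu); the kernel API is analogous */
    #include <string.h>

    #define NBUCKETS 64
    typedef struct Entry { struct Entry *next; const char *key; int value; } Entry;
    extern Entry *buckets[NBUCKETS];      /* placeholder shared table */
    unsigned hash(const char *key);       /* placeholder hash function */

    int rcu_lookup(const char *key, int *out)
    {
        int found = 0;
        rcu_read_lock();                  /* read-side critical section: very cheap */
        Entry *e = rcu_dereference(buckets[hash(key) % NBUCKETS]);
        for (; e != NULL; e = rcu_dereference(e->next)) {
            if (strcmp(e->key, key) == 0) {
                *out = e->value;          /* copy out before leaving the section */
                found = 1;
                break;
            }
        }
        rcu_read_unlock();
        return found;
    }

The write side pairs rcu_assign_pointer() with synchronize_rcu() or call_rcu() before freeing anything it replaced, and each thread has to register itself with the library; that's where the extra complexity lives.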


You don't seem to be very curious because your questions are clarified literally in the very first comment in OP's link.


[flagged]


This is such a case of lawyer brain.

The OP wrote this as a learning exercise. With the goal of learning in mind, I would suggest that it's probably best for the OP to go and read as much code as they possibly can, no matter the license!


Perhaps, but lawyers can ruin your life. If you ask me, it's a travesty that the law is so complex nobody in this thread can agree if this is good advice or paranoia run out of control. That's the world lawyers have created for us, and like it or not, we live in their world.


> This is such a case of lawyer brain.

I'm not sure you are fully aware of the implications. In fact, it seems you're dismissing easily avoidable risks by arguing you'd never be caught.

You can read plenty of code without risking accusations of license agreement violations.


That's not how software licenses work.


Yes, actually, it is.

"Software licenses" aren't their own subject. They are instruments of copyright law. Without copyright, there is nothing to be licensed. And clean room implementations are the best way to ensure that no copyright violation has occurred.


Looking at GPL code doesn't poison your mind forever. Of course, you cannot directly copy GPL code and call it MIT. And you arguably cannot have GPL code open in one window and write your version in a separate window. Copyright protects a specific set of symbols in some order, not an underlying idea or algorithm. You can look at a GPL hashtable one day and write an MIT one the next, no problem.


Yes, that is all correct.

However, it is my opinion that having a project posted on HN, being directed to look at the "internals" of another software project (such project being licensed under the GPL), and subsequently modifying your own project with what you've learned, is legally risky.

Specifically, if I were corporate counsel at a company looking to use MIT-licensed code in a product of ours, and our due diligence uncovered that just such a thing had happened, I would advise against using that code. The risk—that is, likelihood multiplied by the magnitude of the severity of the consequences—of being compelled to license our software under the GPL would be far too high.

As a result, I stand by my assessment that it is probably best—albeit not mandatory—for the OP author not to take a look at how GPL'd code accomplishes what OP author is trying to accomplish.


> Specifically, if I were corporate counsel at a company looking to use MIT-licensed code in a product of ours, and our due diligence uncovered that just such a thing had happened, I would advise against using that code.

I get that corporate counsel is extremely conservative (do you practice in this area?) and often insensitive to the costs of following their advice (as opposed to the costs of not following it) and you may well be right that this is what they would advise if asked explicitly. But I don't think the end result is good advice for an engineer.

> The risk—that is, likelihood multiplied by the magnitude of the severity of the consequences—of being compelled to license our software under the GPL would be far too high.

I think you're overestimating both likelihood and severity. Likelihood -- I mean, your internal hash table is never going to see GPL enforcement action. Severity -- the least expensive path to remediation is unlikely to be GPL'ing your software. You could replace the component, for example.

I appreciate the discussion, by the way. Thanks!


A human being's right to enlighten themselves trumps any other human being's right to enrich themselves.

You're calling for a reduction in agency over a potential for legal action - self-censorship, essentially - in a way, forming a pre-judiciary "pre-crime" conclusion of guilt.

Code is language. Restrictions on its use are human rights violations, no matter what legal-ese can be trotted out to FUD the arena.


What if they, unknowingly, had an LLM "look at" incompatibly-licensed code they later used?

And if that provided any immunity, what's to stop someone from claiming it happened to avoid issue?


> Looking at GPL code doesn't poison your mind forever.

If you purposely go through the source code of a project to get inspiration and afterwards you are so inspired that you end up implementing the same thing in your code, you're literally reusing the code. If the project that inspired you so much does not grant you the right to reuse the code in the way you are reusing it, you're violating the licensing terms.

"Clean room implementation" does not mean "I may or may not have copy/pasted the code". It literally means replicating a design without infringing copyright related to the work you're replicating. Directly accessing the code and lift the good parts in different degrees of verbatim-ness is the textbook definition of copyright violation.


Clean room reimplementation is a sufficient but not necessary way to avoid copyright violations.

If what you're proposing was true no corporate counsel would ever suggest hiring someone who had worked at a different company.


I think it's very clear that you cannot literally copy and paste portions of the GPL code. No one was suggesting otherwise. I wrote:

> Of course, you cannot directly copy GPL code and call it MIT.

I also wrote:

> And you arguably cannot have GPL code open in one window and write your version in a separate window.

Alluding to the idea that something less literal than copying and pasting might still be too much like copying.

Your comment reads like you are disagreeing with me, but I think it's mostly just not responsive to mine.


Software licenses are the greatest legal psych out of my lifetime. US law permits anyone who has a legally acquired copy of a piece of software to copy it further[1], such

  that such a new copy or adaptation is created as an essential step in the utilization of the computer program in conjunction with a machine and that it is used in no other manner
As anyone can see, this completely obviates any need to be licensed to use software once the seller has lawfully sold you a copy. It's yours and you can execute it as you see fit. Interestingly, this doesn't really affect the GPL because the GPL doesn't attempt to kick in unless you further redistribute. And that, of course, is not protected by section 117 and does require a license.

Of course we live in a regime where the process is the punishment, so if you're high profile you'll probably get hit with a ruinous lawsuit anyhow. C'est la vie.

[1] https://www.law.cornell.edu/uscode/text/17/117


A clean room implementation is not the only way to avoid violating copyright.


This hashtable sure mallocs a lot!

Look here for example:

    Entry *newVal = malloc(sizeof(Entry));
    newVal->key = strdup(key);

Why not:

    Entry* e = malloc(sizeof(Entry) + strlen(key));

Now you can also skip the pointer to the key, saving 8 bytes for each entry. Also, just return the hashtable by value, it's so small.

I think that if we're coding in C, C++ or Zig, then it can be fun to actually use the lower level power to our advantage :).


> malloc(sizeof(Entry) + strlen(key));

Don't forget +1 for the terminator.

I also think a linked list is not a great data structure in practice. Having the buckets be one of those amortized-O(1)-append heap allocations would be kinder on CPU caches. But, if you are making the string part of the entry, that would also require padding the string length to word boundaries so that the struct gets aligned.

Note also that if you kept the length of the string somewhere (which you would need to for my above paragraph's suggestion), you could avoid strcmp-ing hash collisions where the strings have different lengths. strcmp is going to add up to something expensive in this code.

This code also never checks the return value of malloc. Color me unimpressed with this code. (Cue people arguing that checking for malloc failure is a bad idea ... Linux oom killer and high level languages have you spoiled, there i said it! /s)
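
Putting those suggestions together (one allocation, the +1, a cached length, and a malloc check), a sketch with illustrative names rather than the article's struct:

    #include <stdlib.h>
    #include <string.h>

    typedef struct Entry {
        struct Entry *next;
        size_t keylen;
        int value;
        char key[1];        /* key bytes stored inline (pre-C99 "struct hack") */
    } Entry;

    Entry *entry_new(const char *key, int value)
    {
        size_t keylen = strlen(key);
        Entry *e = malloc(sizeof(Entry) + keylen);   /* key[1] already covers the NUL */
        if (e == NULL)
            return NULL;
        memcpy(e->key, key, keylen + 1);
        e->keylen = keylen;
        e->value = value;
        e->next = NULL;
        return e;
    }

    /* Comparing the cached length first avoids most strcmp work on collisions. */
    int entry_matches(const Entry *e, const char *key, size_t keylen)
    {
        return e->keylen == keylen && memcmp(e->key, key, keylen) == 0;
    }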


> > malloc(sizeof(Entry) + strlen(key));

> Don't forget +1 for the terminator.

Also should check for arithmetic overflow, or at least static_assert that sizeof(Entry) is "too small" to matter. (Hopefully the process of deciding what "too small" should be results in just adding the overflow check.)
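
E.g., a sketch of that check inside the allocating function, with `Entry` as in the quoted code:

    #include <stdint.h>     /* SIZE_MAX */

    size_t keylen = strlen(key);
    if (keylen > SIZE_MAX - sizeof(Entry) - 1)
        return NULL;                             /* refuse rather than overflow */
    Entry *e = malloc(sizeof(Entry) + keylen + 1);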


Not really. We defined Entry, so we know it's only a few bytes. And we know size_t is large enough to iterate over all of main memory. So the only way that statement overflows is if the key is literally larger than main memory. Not a concern.


I agree with your reasoning, but it's not "literally", it's main memory minus (sizeof(struct Entry) - 1). But it's impossible to find a C string that big. Any point in memory is pretty clearly going to hit a 0 or a page fault before then.


>But, if you are making the string part of the entry, that would also require padding the string length to word boundaries so that the struct gets aligned.

It also makes deletion a far more complicated situation. You'd essentially be implementing your own allocator then.


That's true. Maybe you could keep it simpler by making small strings inline with the struct, but larger strings go on the heap. This gives you fixed-size records, so removal could be O(1) by swapping with the last one and shortening.
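
E.g., a sketch, where `Bucket` is a hypothetical growable array of those fixed-size entries:

    typedef struct Bucket { Entry *items; size_t len, cap; } Bucket;

    /* O(1) removal: overwrite slot i with the last entry and shrink.
       Order within a bucket doesn't matter, so nothing has to shift. */
    void bucket_remove(Bucket *b, size_t i)
    {
        b->items[i] = b->items[--b->len];
    }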


Well, it's C, it's bound to be broken in some way. Just put it in some sufficiently high-value target; a few months to years later the CVE will explain how it was broken exactly.


Even better, you can use [C99 Flexible array members][1], e.g. something like:

    struct KeyedEntry {
      struct Entry entry;
      char key[]; // flexible array member
    };
It's not too much more useful when just using `char`, but for other data types, it's a bit cleaner, since it handles alignment/padding better.

[1]: https://en.wikipedia.org/wiki/Flexible_array_member, or https://beej.us/guide/bgc/html/split/structs-ii-more-fun-wit...

_Edit_: Something like this might not be suitable for the author, since they did mention how they wanted each `struct` to have a fixed size.


When writing C code, which for me is rare these days, the language doesn't give me enough guardrails in the type system (e.g. unique_ptr) or at compile time, so I prefer to write straightforward and dumb code. And that means code that does a lot of malloc()s with no surprises. If I have a char* that cannot be free()d, such as one pointing to the middle of a longer string, I don't even let it escape the scope of a single function. Leave the memory management trickery to more modern languages like C++, or, better yet, Rust.


I think parent was criticizing the multiple use of mallocs when one would be cleaner and easier. This wouldn't really make a difference to the library users, besides perhaps more heap fragmentation in the old version.


I was surprised recently when looking at different hash tables that have been implemented in C to discover that the standard library includes its own hash table. They are even part of POSIX. There is a reason you have never heard of it, or if you have you have never used it. In true POSIX fashion they are close to useless. The implementation doesn't allow you to modify the table after it has been created, you have to pass in all the data when you create the table. There is no add or delete and you can't change any value you pull from the table. It also stores the data in a static area local to the function so you can only use a single table in a program at a time. It doggedly commits every C stdlib sin possible.


there are no hash tables in the Standard C Library - POSIX does not define Standard C.

more info: https://stackoverflow.com/questions/6118539/why-are-there-no...


Interesting that the manpage claims it conforms to POSIX.1-2001.

https://linux.die.net/man/3/hcreate


POSIX is the standard for Unix, not for C.


I think GP is talking about, e.g., glibc, or any libc you might find on a POSIX system, rather than the abstract C standard library. I agree that as a point of clarity, POSIX isn't standard C and the crappy hashtable described is in POSIX, not standard C.


Come on, this is a straw man argument. That function was built for a specific purpose, and the fact you had to hunt for it means it's not meant for general-purpose use.

Besides which, this statement is just inflammatory:

"In true POSIX fashion they are close to useless."

The straw man does not support the conclusion... POSIX is full of utility.


You can add with hsearch; set ACTION=ENTER

Also you could 'delete' with item.data set to some sentinel.
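
E.g., with the POSIX <search.h> API (error handling omitted):

    #include <search.h>
    #include <stdio.h>

    int main(void)
    {
        hcreate(64);                      /* capacity hint; the table cannot grow */

        ENTRY item = { .key = "answer", .data = (void *)42 };   /* stash an int in the pointer */
        hsearch(item, ENTER);             /* add, or return the existing entry */

        ENTRY query = { .key = "answer" };
        ENTRY *found = hsearch(query, FIND);
        if (found != NULL)
            printf("%s -> %ld\n", found->key, (long)found->data);

        hdestroy();
        return 0;
    }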


Nice article! I think eventually you need to move to macros to support multiple key/value types in C. Just leaving some macro implementations for reference:

- https://github.com/attractivechaos/klib/blob/master/khash.h

- https://github.com/tezc/sc/tree/master/map

- https://github.com/troydhanson/uthash


I think uthash is the one I used long ago. Fun and easy.


One trick I learned from https://craftinginterpreters.com/hash-tables.html is that you don't need an extra linked list for each bucket... But how do you handle collisions then?

The trick is that you make sure your table is large enough to not have a lot of collisions; then, if you have a collision, instead of storing the entry exactly in the bucket given by the hash, you store it in the next available bucket. When you look for a key, you compute the hash, get the index from the hash, and start searching at that point. If you reach an empty slot, the key isn't there. If you find the value, you return it. So everything is stored in a single array, there's no need for extra allocations, and it is better for cache locality.
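
In code, the lookup is roughly this (a sketch with a fixed-size table, NULL marking an empty slot, and a placeholder hash function):

    #include <stddef.h>
    #include <string.h>

    #define CAPACITY 64             /* must stay comfortably below 100% full */

    typedef struct Slot { const char *key; int value; } Slot;
    Slot table[CAPACITY];           /* key == NULL means the slot is empty */
    size_t hash(const char *key);   /* placeholder */

    Slot *probe(const char *key)
    {
        size_t i = hash(key) % CAPACITY;
        while (table[i].key != NULL) {
            if (strcmp(table[i].key, key) == 0)
                return &table[i];            /* found */
            i = (i + 1) % CAPACITY;          /* collision: try the next slot */
        }
        return NULL;                         /* empty slot reached: not present */
    }

If the table were ever completely full, that loop would never terminate, which is one reason the load factor matters.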

The same principle is used here for very minimal hash table (13 lines): https://nullprogram.com/blog/2020/10/19/

This video also talks about how to optimize hash tables and looks into several implementations: https://www.youtube.com/watch?v=DMQ_HcNSOAI&t=1690s&ab_chann... . It talks about techniques used by advanced hash tables. There was a lot more that I didn't know.


That’s not really a trick?

It’s an open addressing hash table (https://en.wikipedia.org/wiki/Open_addressing), with a linear probing collision resolution (https://en.wikipedia.org/wiki/Linear_probing).


Yup. This is how they're taught in CS undergrad (as of 10 years ago, anyway).


This is how it was taught to me in a CS undergrad class last week. What we didn't look at though was, what happens when the hash table is full? Does everything get rehashed into a bigger table? What are the performance implications of that? I wish it had gone just a little further.


Even before it's full, an open addressing hash table (also called closed hashing - confusing names!) becomes very slow as it gets close to full, because most of the table needs to be scanned on inserts, and on queries for the recently inserted values.

When it's full, insertion requires rehashing into a bigger table. (Or you can use an overflow list when the table is full or nearly full, if you don't want to rehash and prefer to treat this case as rare and ok to be slow.)

So for a general purpose open addressing / closed hashing hash table where you don't know the number of items in advance, rehashing to a bigger table is done at some threshold before it's actually full.

As long as you do rehashing using a multiplier, for example doubling the size at 50% full (as opposed to adding a fixed amount to the size), it stays fast on average, but the rehashing operations are of course individually slow. They just don't happen often, so the average stays at O(1) time. This is called amortised O(1) time complexity. The ideal threshold for resizing depends on the probing type.
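
In code, the threshold check is usually just a guard at the top of insert; a sketch, where `rehash_into` and `insert_no_resize` are placeholders that would allocate the larger array and re-insert every live entry:

    typedef struct HashTable { size_t count, capacity; /* slots... */ } HashTable;
    void rehash_into(HashTable *ht, size_t new_capacity);             /* placeholder */
    void insert_no_resize(HashTable *ht, const char *key, int value); /* placeholder */

    void ht_insert(HashTable *ht, const char *key, int value)
    {
        /* Grow before the table gets slow; doubling keeps inserts amortised O(1). */
        if (2 * (ht->count + 1) > ht->capacity)
            rehash_into(ht, ht->capacity * 2);
        insert_no_resize(ht, key, value);
        ht->count++;
    }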

Linked list hash tables (also called closed addressing or open hashing - I think linked list is a clearer name) don't suffer from catastrophic slowdown with size in the same way as the open addressing / closed hashing type, nor abrupt failure when full, because they're never full. These can still be resized, and this is required if O(1) time is required for arbitrary numbers of items, but the resizing threshold is not as critical. If not resized, they gradually degrade to O(N) time performance when filled too much. Many applications use fixed-size linked list hash tables safely because of this.

Array-of-arrays hash tables behave similarly to linked list hash tables.


Thank you for your comment, it is really informative and I appreciate it a lot. We looked at the different approaches, and I asked my teacher if there is an in-between method, that is, open address hashing combined with linked list hashing where, e.g., after an arbitrary number of elements in the linked list, the table is scanned to find the next possible place. I'm sure many things have been tried before. What I don't get is: when is the rehashing done? After a specific insert operation? So that insert operation will be much slower than all the others? I understand that it doesn't matter if the runtime has the same upper bound, but it's just not what I instinctively would have thought.


> Does everything get rehashed into a bigger table?

Usually yes, though it’s also possible to perform gradual resizing that’s less efficient so it’s only used when needed (e.g. real time systems which can’t afford a full resize).

> What are the performance implications of that?

Same as a vector. That’s why hashmaps generally provide an amortized insertion time.

> What we didn't look at though was, what happens when the hash table is full?

Of note: a hash table never gets full, hash tables have a load factor (which mostly depends on the collision resolution algorithm, as that informs how performances degrade as collisions increase), and the table will resize when it exceeds its load factor, to maintain a “healthy” rate of collisions.


And 25 years ago :D (though we did the linked list implementation too)


Just realized the reason I like this is the documentation isn't just an API reference, but a progressive refinement of the problem domain to solution domain mapping. You start with a simple statement and gradually introduce concepts, problems and solutions.


Thank you for this article.

I have been working on trying to design a contiguous hash table data structure that satisfies these two requirements simultaneously:

a) allows deep clones (due to contiguous memory copyable with memcpy)

b) supports arbitrarily nestable hash tables

The closest library I know about is smolworld ( https://github.com/snej/smol_world ) but I don't know how easy it can be cloned.

These requirements rule out multiple mallocs: I do one single malloc and expect that to be enough for the entire hashmap.

I'm not sure if these constraints might also force a fixed number of buckets and a fixed capacity.

The clonable property requires interior mutability be limited, because if you memcpy pointers they would cause structure sharing, that I'm trying to avoid, hence a deep clone.

Rationale: these are the use cases I have for such a data structure. The first is cheap copy-on-write. I also have a left-right concurrency control hashmap, but I feel the properties of this data structure are even better. The second is a sharding design for multithreading: do a cheap memcpy for each thread, shard the processing, and merge at the end, without any synchronization cost while the threads are working. I know that deep cloning is slow in Java, and presumably in C if you do it with loops rather than memcpy. Another use case is efficient serialization for the network.

In other words, a protobuf but easily deep clonable without loops.


> because if you memcpy pointers they would cause structure sharing, that I'm trying to avoid, hence a deep clone.

I'm probably that guy that pipes up when he doesn't know what he's talking about but.

You can store offsets instead of pointers. You can then just memcpy the whole thing and it'll just work as long as everything is self contained.
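
A sketch of that idea (all names hypothetical): entries refer to each other by byte offset from the start of a single buffer, so a deep clone of the whole table is one memcpy.

    #include <stdint.h>
    #include <string.h>

    /* Everything lives inside one contiguous buffer; offset 0 means "null".
       Offsets must keep Entry suitably aligned. */
    typedef struct Entry {
        uint32_t next_off;      /* offset of the next entry in the chain, or 0 */
        uint32_t key_off;       /* offset of the key bytes within the buffer */
        int32_t  value;
    } Entry;

    static inline Entry *entry_at(void *buf, uint32_t off)
    {
        return off ? (Entry *)((char *)buf + off) : NULL;
    }

    /* Deep clone: one memcpy, no pointer fix-ups needed. */
    void *table_clone(const void *src, size_t size, void *dst)
    {
        return memcpy(dst, src, size);
    }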


Thanks for your reply, you're right. You know what you're talking about.

This is what smol_world does, it uses 32 bit indexes.

I really like the idea of smol_world, I just don't know if you can easily .clone() it.


If you pre-allocate a large enough buffer, you can keep a pointer to mark the boundary between the part where you have already stored data and the part still empty. Then re-implement malloc() to just move such pointer. Of course you must check for running out of allocated memory, etc. This simple approach assumes that you seldom delete entries, because it never reclaims the room occupied by deleted entries.
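
A minimal sketch of that allocator (illustrative names):

    #include <stddef.h>

    typedef struct Arena {
        char  *base;      /* one pre-allocated buffer */
        size_t used;      /* boundary between stored data and free space */
        size_t size;
    } Arena;

    /* "malloc" is just moving the boundary forward; free is a no-op and the
       whole arena is reset or released at once. */
    void *arena_alloc(Arena *a, size_t n)
    {
        n = (n + 7) & ~(size_t)7;          /* keep allocations 8-byte aligned */
        if (a->size - a->used < n)
            return NULL;                   /* out of arena space */
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }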


Here's a simple example of the bump allocator part of this!

https://gist.github.com/williamcotton/99ab6e8efa3c4b6c07a149...


I was looking for something like this a few months ago and found this which also may be of interest:

https://news.ycombinator.com/item?id=26590234

How to implement a hash table in C (benhoyt.com) 302 points by benhoyt on March 26, 2021, 158 comments


Thanks for sharing! I do like the OP's article, but the more the merrier! My "How to implement a hash table in C" article [1] is for whatever reason one of my most popular articles. If only C had a slightly better standard library. :-)

In my article I use "open addressing" [2] instead of linked lists for collisions, as 1) it tends to be simpler, as you only have one data structure to manage (array, rather than array and linked list), and 2) it tends to be faster, as linked lists are slow on modern CPUs because they require jumping around in memory.

[1] https://benhoyt.com/writings/hash-table-in-c/

[2] https://en.wikipedia.org/wiki/Open_addressing


Hey, thanks for writing it up! Like I said, I was looking for exactly this a few months ago, and between the explanation and the code it was a big help.


Yeah I think open addressing with linear probing is definitely the paradigm I'd choose if I wanted to build a "simple" hash table for educational purposes.

It works well enough and is conceptually probably the easiest approach to grok, with relatively few pitfalls.


What's performance like when each bucket is an array instead of a linked list?

Another question-- at the moment where one would normally decide to resize the table for performance reasons (i.e., buckets are getting too full), how would the array-bucket style perform compared to the linked-list bucket style?


You mentioned it yourself, but please use `strncmp` - you don't want unbounded C-style strings like that as keys.

A lot of mallocs. Just malloc a pool and implement an allocation strategy on top of it. Malloc more as necessary (say 2^n).

You mention map types in other languages. They're not always implemented as a hashmap. A tree is commonly used to implement maps too.

Embedded environments generally avoid `malloc` and the heap entirely. You have a simple allocator on top of a fixed allotment of memory instead.


> A lot of mallocs. Just malloc a pool and implement an allocation strategy on top of it. Malloc more as necessary (say 2^n).

Good mallocs do that themselves already. It's built in, they use a pool. Really good mallocs use a fast, cache-friendly pool, typically per CPU-core or per-thread.

By using a pool on top of a good malloc, you are implementing a second pool on top of the first pool. For something general purpose like a general-purpose hash table, layering pools like that often provides no advantage and some disadvantages, including to performance. It's often, though not always, better to use a really good malloc.

> You mention map types in other languages. They're not always implemented as a hashmap. A tree is commonly used to implement maps too.

Notably the C++ `std::map` is a tree. A later standard added `std::unordered_map` which is a hash table. The effect is many C++ programs use trees for their maps, and some C++ devs don't realise they are using a tree with O(log N) time operations instead of a hash table with O(1) time operations.

> Embedded environments generally avoid `malloc` and the heap entirely. You have a simple allocator on top of a fixed allotment of memory instead.

I think that's a strange way to look at it, because `malloc` is a simple allocator on top of a fixed allotment of memory when you have a fixed allotment of memory to work with. You can write whatever `malloc` and `free` functions you like to suit the situation. You don't have to use the one in libc, and small embedded environments often don't have one. E.g. I've seen extremely memory-constrained bootloaders and DSPs where there is still a function called `malloc`, but it is a very simple allocator for the situation.

In high-reliability systems you might need to control the pattern of allocations in the program so that memory use stays bounded. But in a general purpose hash table, the way to bound memory use is to control the pattern of hash table usage, i.e. the callers.


Maybe consider cuckoo hashing when thinking about hash tables:

https://en.wikipedia.org/wiki/Cuckoo_hashing

No bucket lists are necessary. Guaranteed O(1) read access (not amortized, but always). Can be parameterized wrt. access time vs. memory efficiency. Can also be precompiled for offline data, e.g., Unicode codepoint tables (though tries are also really good for this).
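
The read path is where the guarantee comes from: a key can only ever live in one of two fixed slots, so a lookup is at most two probes. A sketch with two tables and placeholder hash functions:

    #include <string.h>

    #define SIZE 64
    typedef struct Slot { const char *key; int value; } Slot;
    Slot t1[SIZE], t2[SIZE];             /* two tables, two independent hashes */
    size_t hash1(const char *key);       /* placeholders */
    size_t hash2(const char *key);

    Slot *cuckoo_find(const char *key)
    {
        Slot *a = &t1[hash1(key) % SIZE];
        if (a->key && strcmp(a->key, key) == 0)
            return a;
        Slot *b = &t2[hash2(key) % SIZE];
        if (b->key && strcmp(b->key, key) == 0)
            return b;
        return NULL;                     /* at most two probes, always */
    }

All the complexity is on the insert path, which may have to evict and relocate existing entries (and occasionally rebuild with new hash functions).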


I 'invented' a hash table like this many years ago. I'm sure I wasn't anywhere near the first, either, but at the time I thought it was pretty hot shit using a linked list as the thing to deal with collisions.

It's funny how many people come up with the same ideas.


Why use a good or better hash table when you can implement the worst one yourself? OMG


Well, these things are what everyone has to do in college anyway because it's helpful training. And if you didn't get to do that in your algo courses, or you didn't do college, or haven't done it yet, then articles like this are super useful.


> Solving it helped me gain better intuition around hash functions, pointers, and memory segments like the stack and the heap.


Can you at least list some good or better hash tables?


search for htable or hashtable in thousands of open source projects. only a minority has worse hashtables than this one (clisp, perl5 e.g.).

For better ones I would point to my linked list implementation: https://github.com/rurban/ctl/blob/master/ctl/unordered_set.... (because it has various security policies, which nobody else has) or https://github.com/LIMachi/swiss-table (all in C)


Having Swiss Tables (a more or less state of the art hash table for modern computers) is attractive but I think it'd be appropriate to be scared of unmaintained C labelled as a "proof of concept".

The CTL "unordered set" is roughly the same bad design as its namesake in C++, presumably on purpose. We really shouldn't be teaching new users this bad structure, nor the "security policy" mitigations it provides. And by the way it's not a "distributed" denial of service when somebody plugs 128 colliding values into your API, just a normal trivial DOS.


if the initial array is going to be of fixed size, why not:

    typedef struct HashTable {
        Entry *buckets[4];
        int nBuckets;
    } HashTable;
which removes one layer of dynamic memory management.


> resize the buckets array once it gets too full

Because of that, basically.


i don't see where in his code this is done. possibly me being stupid.


Oh sorry, it's from "future improvements" :), I assumed that was the train of thought. You're entirely correct, however.


1. Implement simple hash table in C

2. Implement bloom filter

3. Download list of Bitcoin wallets with non-zero amount of BTC

4. Generate long list of random key pairs

5. ???

6. Profit



