> Next you clone the firefox-source hg repo, which will use about 15% of the space and 80% of the inodes.
Looking at my Mozilla checkout, the source and repo average about 6KB per file, which would eat lots of inodes.
But once I compile it, it's more like 20KB per file, which is just fine on default settings. So I'm not sure inodes are actually the limiting factor in this scenario.
And now that they're moving to git, the file count will be about 70% smaller for the same amount of data.
Using https://github.com/kvcache-ai/ktransformers/, an intel/amd laptop with 128GB RAM and 16GB VRAM can run the IQ4_XS quant and decode about 4-7 token/s, depending on RAM speed and context size.
Using llama.cpp, the decoding speed is about half of that.
Mac with 128GB RAM should be able to run the Q3 quant, with faster decoding speed but slower prefilling speed.
Assuming you already know what context means in terms of LLMs: prefilling is the process of tokenizing the current conversation and running all of those tokens through the LLM in one pass, before any new tokens are decoded.
This is where the base open models can really shine, before they get lobotomized by instruction fine-tuning.
For example, this is the completion I get with DeepSeek-Coder-V2-Base and greedy decoding:
Chat: On the day of June 4th 1989, in Beijing,
the Chinese government killed thousands of
protesters.
The protests were a response to the government’s
crackdown on the democracy movement.
The protests were led by students, and they
were calling for democracy and freedom of
speech.
The government responded with violence, and
the protests were crushed.
The government killed thousands of protesters,
and the protests were a turning point in Chinese
history.
What do you think about evolving the stored procedures into a stateless gRPC service that fronts the database? For the price of 1 (or 2) additional network hops, you get much better devx, while keeping most of the benefits provided by stored procedures.
I don't think this adds much. The important thing is that your logic runs in the database--how the logic is ultimately exposed is up to you. We did a Java app server calling SPs over JDBC with generated, typed bindings, and this worked great. You'd have to write a similar tool to generate a gRPC server, but the logic would still be SPs in the database. That's the part that sucks for devs; the bindings are ultimately a detail.
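A generated, typed binding of the kind described above might look roughly like this sketch (the procedure name `create_order` and its parameters are hypothetical; a real generator would derive them from the database catalog):

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;

// Hypothetical generated binding for a stored procedure create_order.
// The generator emits one typed method per SP, so application code
// never builds SQL call strings by hand.
public class OrderBindings {
    static final String CREATE_ORDER_CALL = "{ ? = call create_order(?, ?) }";

    // Returns the new order id produced by the stored procedure.
    public static long createOrder(Connection conn, long customerId, String sku)
            throws SQLException {
        try (CallableStatement cs = conn.prepareCall(CREATE_ORDER_CALL)) {
            cs.registerOutParameter(1, Types.BIGINT); // out: new order id
            cs.setLong(2, customerId);
            cs.setString(3, sku);
            cs.execute();
            return cs.getLong(1);
        }
    }

    public static void main(String[] args) {
        // No database here; just show the JDBC call escape the binding wraps.
        System.out.println(CREATE_ORDER_CALL);
    }
}
```

A generated gRPC server would just be another thin wrapper over methods like this one; the SP stays the source of truth either way.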
Ooh ty, will give that article a read! And yeah, that's really the trick to queries that are consistently fast, even with cold caches - read few pages :)
Don't share atomics among threads. For example, Envoy proxy mostly doesn't share atomics among threads, and can scale nicely on arm64 without requiring the atomic extensions.
Because at some point data has to be exchanged across threads. For example, a task queue might have tasks that can be executed independently in a thread pool, but the queue index has to be atomically modified when some other thread emplaces a new task. Or if you want to transfer ownership of a heap-allocated object between threads, you need to atomically transfer the pointer, or modify the reference count of that pointer. Things like that.
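A minimal Java sketch of the queue-index case (names are illustrative): worker threads claim slots from a shared array by atomically bumping a single index, so the one atomic is the only point of cross-thread coordination and each task runs exactly once.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicClaim {
    public static void main(String[] args) throws InterruptedException {
        int[] tasks = new int[100_000];
        Arrays.fill(tasks, 1);
        AtomicInteger next = new AtomicInteger(0); // the shared queue index
        AtomicInteger done = new AtomicInteger(0); // result aggregation

        Runnable worker = () -> {
            int i;
            // getAndIncrement guarantees each slot is claimed by exactly
            // one thread; the task bodies themselves run without sharing.
            while ((i = next.getAndIncrement()) < tasks.length) {
                done.addAndGet(tasks[i]);
            }
        };

        Thread[] ts = new Thread[4];
        for (int t = 0; t < 4; t++) { ts[t] = new Thread(worker); ts[t].start(); }
        for (Thread th : ts) th.join();

        System.out.println(done.get()); // 100000: every task ran exactly once
    }
}
```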
You can reduce sharing probabilistically, for example -- because contention is an N-squared problem, reducing sharing by some linear factor is enough for a large reduction in contention. You aren't eliminating contested atomics entirely, just making them low-contention rather than highly contended.
> We're coming up on 10000 resources in our main Terraform repository and while there is definitely some friction, it's overall much better than having to hit the cloud APIs to gather each of those states, which would probably take at least an order of magnitude longer.
/usr/src/linux will use about 30% of the space and 10% of the inodes.
/var/db/repos/gentoo will use about 4% of the space and 10% of the inodes.
Next you clone the firefox-source hg repo, which will use about 15% of the space and 80% of the inodes.