
Note that the authors report the speed of generating many sequences in parallel (per token):

> The batch size is tuned to a value that maximizes the generation throughput for each system.

> FlexGen cannot achieve its best throughput in [...] single-batch case.

For 175B models, this likely means that the system takes a few seconds per generation step, but you can generate many sequences in parallel and get good performance _per token_.
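To make the throughput/latency distinction concrete (a rough sketch with made-up numbers, not measurements from the paper):

    # Illustrative numbers only -- assumptions, not FlexGen measurements.
    step_time_s = 3.0   # assumed time for one generation step over the whole batch
    batch_size = 64     # assumed number of sequences decoded in parallel

    throughput_tps = batch_size / step_time_s  # tokens/s summed over all sequences
    latency_s_per_token = step_time_s          # a single sequence still advances only one token per step

    print(f"~{throughput_tps:.0f} tokens/s aggregate, but {latency_s_per_token} s/token for any one sequence")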

However, what you actually need for ChatGPT and interactive LM apps is to generate _one_ sequence reasonably quickly (i.e., a generation step should take <= 1 sec/token). I'm not sure this system can be used for that, since our measurements [1] show that even a theoretically optimal RAM-offloading setup can't run single-batch generation faster than 5.5 sec/token due to hardware constraints.
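The bound here comes from weight-streaming bandwidth. Here's the kind of back-of-the-envelope that gives a number in that ballpark; the hardware figures below are my own assumptions, not taken from [1]:

    # Single-batch offloading is bound by how fast weights can be streamed to the GPU:
    # every generation step touches every weight once, so each token needs a full transfer.
    params = 175e9           # OPT-175B parameter count
    bytes_per_param = 1      # assuming 8-bit weights
    pcie_bytes_per_s = 32e9  # assumed ~PCIe 4.0 x16 bandwidth

    best_case = params * bytes_per_param / pcie_bytes_per_s
    print(f"best case: ~{best_case:.1f} s/token")  # ~5.5 s/token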

The authors don't report the single-batch generation speed in either the repo or the paper.

[1] https://arxiv.org/pdf/2209.01188.pdf



I spoke with the authors of the paper; the leftmost points in Figure 1 were generated with batch size 1, indicating ~1.2x and ~2x speed improvements over DeepSpeed for the 30B and 175B models respectively. For reference, on 175B this is a speedup from ~0.009 tokens/s to ~0.02 tokens/s.
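For context, converting those quoted single-batch figures to per-token latency (simple arithmetic on the numbers above):

    # Per-token latency implied by the quoted single-batch throughputs (175B, batch size 1).
    deepspeed_tps = 0.009
    flexgen_tps = 0.02

    print(f"DeepSpeed: ~{1 / deepspeed_tps:.0f} s/token")  # ~111 s/token
    print(f"FlexGen:   ~{1 / flexgen_tps:.0f} s/token")    # ~50 s/token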

These results are generally unimpressive, of course. Most of the improvement at that point is attributable to the authors using a stripped-down library for autoregressive sampling. HN falling for garbage once again...


Calling this garbage is absolutely wild. The authors make it very clear that this is optimized for throughput and not latency. Throughput-focused scenarios absolutely do exist; editorializing this as "running large language models like ChatGPT" and focusing on chatbot applications is the fault of HN.

It's also a neat result that 4-bit quantization doesn't cause much of a problem even at 175B, though that was kinda to be expected.


While I agree that throughput-focused scenarios exist and this work may be valuable for them, I still think that the repository can be improved to avoid "overselling".

The fact that FlexGen's single-batch generation is much slower is not obvious to most people unfamiliar with the peculiarities of LLM inference and is worth clarifying. Instead, the readme opens by mentioning ChatGPT and Codex - projects that both rely on single-batch LLM inference at interactive speeds, which is not really possible with FlexGen's offloading (given the speed mentioned in the parent comment). The batch sizes are not reported in the table either.

Seeing that, I'm not surprised that most HN commenters misunderstood the project's contribution.



