> results of the paper directly imply a significant reduction in memory requirements for training such a model

Isn't memory use in training higher, since they maintain high-precision latent weights in addition to the binarized weights used in the forward pass?




Yes. The optimizer keeps a higher-precision copy, so training is likely slower and takes more memory than an equivalent full-precision model. I'd also imagine it takes several epochs to match one full-precision epoch, because the forward pass needs several tries to settle on the right one of the three states, rather than just nudging a continuous weight a little in the right direction.
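
For anyone curious what "keeping a higher-precision copy" looks like in practice, here's a minimal sketch (PyTorch assumed; the threshold, scaling, and layer name are illustrative, not the paper's exact scheme) of training with latent weights and a straight-through estimator:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def ternarize(w: torch.Tensor) -> torch.Tensor:
        # Threshold-based ternarization to {-1, 0, +1};
        # the 0.05 factor is an illustrative choice.
        t = 0.05 * w.abs().mean()
        q = torch.sign(w) * (w.abs() > t).float()
        # Straight-through estimator: forward uses the ternary q,
        # backward acts as identity, so gradients update the
        # full-precision latent weight w.
        return w + (q - w).detach()

    class TernaryLinear(nn.Linear):
        def forward(self, x):
            return F.linear(x, ternarize(self.weight), self.bias)

    layer = TernaryLinear(16, 8)
    x = torch.randn(4, 16)
    layer(x).sum().backward()       # grads land on the fp32 latent weight
    print(layer.weight.grad.shape)  # torch.Size([8, 16])

Note that during training both the latent full-precision weight and its ternarized view are live, which is why memory goes up rather than down; the savings only materialize at inference, once the latent copy can be discarded.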


Most research universities have the resources to train a ~10B-parameter model, at least.
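
As a back-of-envelope check (assuming the usual mixed-precision Adam recipe of roughly 16 bytes per parameter for weights, gradients, master weights, and the two moment buffers, and ignoring activations):

    # Rough training-memory estimate for a 10B-parameter model.
    # Assumed breakdown: fp16 weight (2 B) + fp16 grad (2 B)
    # + fp32 master weight (4 B) + fp32 Adam m and v (4 B + 4 B).
    params = 10e9
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    print(f"{params * bytes_per_param / 1e9:.0f} GB")  # -> 160 GB, before activations

~160 GB of weights and optimizer state is a couple of 80 GB accelerators' worth, which is within reach of a typical university cluster.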



