> results of the paper directly imply a significant reduction in memory requirements for training such a model

Isn't memory use in training higher, since they maintain high-precision latent weights in addition to the binarized weights used in the forward pass?




Yes. The optimizer keeps a higher-precision copy, so training is likely slower and takes more memory than an equivalent full-precision model. I'd also imagine it takes several epochs to match one full-precision epoch, because the forward pass needs several tries to settle on the right one of the three states, rather than just nudging a continuous weight a little in the right direction.
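
For anyone curious what "keeping a higher-precision copy" looks like in practice, here's a minimal sketch (PyTorch assumed; the threshold, scaling, and layer name are illustrative, not the paper's exact scheme) of training with latent weights and a straight-through estimator:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def ternarize(w: torch.Tensor) -> torch.Tensor:
        # Threshold-based ternarization to {-1, 0, +1};
        # the 0.05 factor is an illustrative choice.
        t = 0.05 * w.abs().mean()
        q = torch.sign(w) * (w.abs() > t).float()
        # Straight-through estimator: forward uses the ternary q,
        # backward acts as identity, so gradients update the
        # full-precision latent weight w.
        return w + (q - w).detach()

    class TernaryLinear(nn.Linear):
        def forward(self, x):
            return F.linear(x, ternarize(self.weight), self.bias)

    layer = TernaryLinear(16, 8)
    x = torch.randn(4, 16)
    layer(x).sum().backward()       # grads land on the fp32 latent weight
    print(layer.weight.grad.shape)  # torch.Size([8, 16])

Note that during training both the latent full-precision weight and its ternarized view are live, which is why memory goes up rather than down; the savings only materialize at inference, once the latent copy can be discarded.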


Most research universities have the resources to train a ~10B-parameter model, at least.
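
As a back-of-envelope check (assuming the usual mixed-precision Adam recipe of roughly 16 bytes per parameter for weights, gradients, master weights, and the two moment buffers, and ignoring activations):

    # Rough training-memory estimate for a 10B-parameter model.
    # Assumed breakdown: fp16 weight (2 B) + fp16 grad (2 B)
    # + fp32 master weight (4 B) + fp32 Adam m and v (4 B + 4 B).
    params = 10e9
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    print(f"{params * bytes_per_param / 1e9:.0f} GB")  # -> 160 GB, before activations

~160 GB of weights and optimizer state is a couple of 80 GB accelerators' worth, which is within reach of a typical university cluster.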



