Stumbled upon this paper while trying to learn about conditional flow matching. Apparently, we can now take the technique that supercharged Stable Diffusion 3 (conditional flow matching) and apply it to text generation. And the 1.7B model does quite well compared to 7-8B class autoregressive LLMs like the Llama series.
Goes to show that current autoregressive architectures are not the be-all and end-all of language models, and in fact the problem can be framed very generally -- given some data distribution, learn how to map a latent code to samples from that distribution...
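For anyone else new to this, here's a minimal sketch of what the conditional flow matching objective looks like in practice (the rectified-flow flavor used in Stable Diffusion 3), assuming a toy continuous setup with a hypothetical `VelocityNet` on 2-D data -- not the discrete text setting the paper actually tackles:

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity field v_theta(x_t, t) for 2-D data (hypothetical)."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t], dim=-1))

def cfm_loss(model, x1):
    """Conditional flow matching loss on a batch of data samples x1."""
    x0 = torch.randn_like(x1)            # latent code drawn from the prior
    t = torch.rand(x1.size(0), 1)        # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # straight-line interpolation
    target = x1 - x0                     # velocity of that straight path
    return ((model(x_t, t) - target) ** 2).mean()

model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x1 = torch.randn(256, 2) * 0.5 + 2.0     # stand-in "data" distribution
loss = cfm_loss(model, x1)
loss.backward()
opt.step()
```

The appeal is exactly the generality mentioned above: the model only has to regress the velocity that carries a noise sample (the latent code) toward a data sample, and at inference you integrate that learned velocity field to map latents to data.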