I wonder when companies will remove the personality out of LLMs by default, espe...

dingnuts · 2025-06-20T16:05:14 1750435514

that would require actually curating the training data and eliminating sources that contain casual conversation

too expensive since those are all licensed sources, much easier to train on Reddit data

amelius · 2025-06-20T16:11:56 1750435916

Just ask an LLM to remove the personality from the training data. Then train a new LLM on that.

omneity · 2025-06-21T13:42:23 1750513343

It will work, but at the scale needed for pretraining you are bound to have many quality issues that will destroy your student model, so your data cleaning process had better be very capable.

One way to think of it is that any little bias or undesirable path in your teacher model will be amplified in the resulting data and is likely to become overrepresented in the student model.
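
A toy sketch of that amplification, not a claim about any real training pipeline: assume the teacher prefers one of two phrasings with probability `p`, the rewritten data is sampled at a temperature below 1, and each student ends up matching the data it was trained on. The names `sharpen` and `distill` and the numbers 0.55 and 0.7 are made up for illustration.

```python
def sharpen(p: float, temperature: float) -> float:
    """Probability of the favored phrasing after temperature-scaled sampling.

    Temperature scaling raises each probability to the power 1/temperature
    and renormalizes; below 1.0 this pushes mass toward the favored option.
    """
    a = p ** (1.0 / temperature)
    b = (1.0 - p) ** (1.0 / temperature)
    return a / (a + b)


def distill(p_teacher: float, temperature: float, generations: int) -> list[float]:
    """Track the preference across generations of teacher -> data -> student.

    Assumption: each student reproduces the (sharpened) data distribution
    exactly, and then acts as the teacher for the next generation.
    """
    history = [p_teacher]
    for _ in range(generations):
        history.append(sharpen(history[-1], temperature))
    return history


if __name__ == "__main__":
    # Hypothetical starting point: a mild 55/45 preference in the teacher.
    for gen, p in enumerate(distill(0.55, 0.7, 5)):
        print(f"generation {gen}: p(favored) = {p:.3f}")
```

Under these assumptions the mild preference grows every generation. At temperature 1.0 it stays put, so in this toy the amplification comes from the sampling and filtering step, not from distillation as such.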