
Between my experience and the arXiv papers I've read, I'd say this:

Personally, I am willing to label 1000-2000 documents to create a training set. It's reasonable to make about 2000 simple judgements in an 8-hour "day", so it's something you could do at work or in your spare time over a few days if you want.

You can compute an embedding and then use classical ML algorithms from scikit-learn, such as a support vector machine. My recommender can train a new model in about 3 minutes, and that includes building about 20 candidate models and testing them against each other to produce a final model that is well tested and probability calibrated. This process is completely industrial: it can run unattended and always makes a good model if it gets good inputs. Running it every day is no sweat.
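Roughly, that pipeline looks like the sketch below (Python, scikit-learn). It assumes X is an array of precomputed document embeddings and y holds the 0/1 "did I like it" labels; the file names and the small C grid are stand-ins, not my actual setup. The grid search plays the role of building and comparing several candidate models, and the calibration wrapper gives probability outputs.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.calibration import CalibratedClassifierCV

    # Hypothetical inputs: one embedding vector per labeled document.
    X = np.load("doc_embeddings.npy")
    y = np.load("labels.npy")          # 0 = not interested, 1 = liked it

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Build several candidate SVMs (4 C values x 5 folds = 20 fits) and keep the best.
    search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0]},
                          cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)

    # Wrap the winning SVM so it outputs calibrated probabilities.
    clf = CalibratedClassifierCV(search.best_estimator_, cv=5)
    clf.fit(X_train, y_train)

    print("held-out accuracy:", clf.score(X_test, y_test))
    print("p(like) for first test doc:", clf.predict_proba(X_test[:1])[0, 1])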

You can also "fine-tune" a model, actually changing the parameters of the deep model. I've fine-tuned BERT-family models for my recommender; it takes at least 30 minutes and the training process is not completely reliable. A reliable model builder would probably run that 20 or so times with different parameters (learning rate, how long to train the model, etc.) and pick out the best model. As it is, the best model it produces is about as good as my simpler models, and a bad one is worse. I can picture industrializing it, but I'm not sure it's worth it. In a lot of papers people just seem to copy a recipe from another paper and don't do any model selection.
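For reference, a bare-bones fine-tuning run with Hugging Face transformers looks something like the sketch below. The model name, max length, learning rate, and epoch count are just illustrative defaults, not my recipe; the learning rate and epoch count are exactly the knobs you'd sweep over across those 20 or so runs, and the two-item dataset is a placeholder for your real 1000-2000 labeled documents.

    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    # Placeholder data: substitute your 1000-2000 labeled documents here.
    texts = ["an article I liked ...", "an article I skipped ..."]
    labels = [1, 0]

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    ds = Dataset.from_dict({"text": texts, "label": labels})
    ds = ds.map(lambda b: tok(b["text"], truncation=True, padding="max_length",
                              max_length=256), batched=True)
    ds = ds.train_test_split(test_size=0.5)

    args = TrainingArguments(output_dir="bert-recommender",
                             learning_rate=2e-5,          # knob to sweep
                             num_train_epochs=3,          # knob to sweep
                             per_device_train_batch_size=16)

    trainer = Trainer(model=model, args=args,
                      train_dataset=ds["train"], eval_dataset=ds["test"])
    trainer.train()
    print(trainer.evaluate())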

My problem is fuzzy: the answer to "Do I like this article?" can vary from day to day. If I had a more precise problem, fine-tuning might pull ahead. Some people do get a significant improvement, which would make it worth it, particularly if you don't expect to retrain frequently.

I see papers where somebody does the embedding transformation, but instead of pooling over the tokens (averaging the vectors) they feed the token vectors into an LSTM, GRU, or something like that and train the recurrent model. This kind of model does great when word order matters, as in sentiment analysis. I found that kind of model was easy to train in a repeatable way a decade ago, so that's an option I'd consider.
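If it isn't clear what that looks like, here's a rough PyTorch sketch of the idea: instead of mean-pooling the per-token vectors, an LSTM reads them in order and its final hidden state feeds a small classifier head. The dimensions and names are made up for illustration; the token vectors would come from whatever frozen embedding model you're already using.

    import torch
    import torch.nn as nn

    class RecurrentClassifier(nn.Module):
        def __init__(self, emb_dim=768, hidden=128, n_classes=2):
            super().__init__()
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, token_vectors):           # (batch, seq_len, emb_dim)
            _, (h_n, _) = self.lstm(token_vectors)  # h_n: (1, batch, hidden)
            return self.head(h_n[-1])               # logits: (batch, n_classes)

    model = RecurrentClassifier()
    fake_batch = torch.randn(4, 50, 768)   # 4 documents, 50 token vectors each
    print(model(fake_batch).shape)         # torch.Size([4, 2])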




You're better off trying to figure out which features of an article you like that are less ambiguous as signals. You could then fine-tune models to classify whether those features are present, whether that means classification of chunks, sentences, or tokens. For these, a BERT model can be fine-tuned to detect them efficiently.
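As a sketch of how that would be used at inference time (assuming the feature classifier has already been fine-tuned, e.g. with a Trainer recipe like the one upthread): split an article into sentences, score each one, and flag the article if any sentence shows the feature. The model directory name, the naive sentence splitter, and the 0.9 threshold are all placeholders.

    from transformers import pipeline

    # "my-feature-classifier" is a hypothetical directory holding your fine-tuned model.
    clf = pipeline("text-classification", model="my-feature-classifier")

    article = "..."  # full article text goes here
    sentences = [s.strip() for s in article.split(".") if s.strip()]

    # Score each sentence; LABEL_1 is assumed to mean "feature present".
    scores = clf(sentences)
    has_feature = any(r["label"] == "LABEL_1" and r["score"] > 0.9 for r in scores)
    print(has_feature)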



