When the generative model is autoregressive (autocomplete), it can easily be used as a predictor. All of the state-of-the-art language models are tested against multiple-choice exams and other types of prediction tasks. In fact, prediction is how they are trained (e.g. masking): https://www.microsoft.com/en-us/research/blog/mpnet-combines...
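To make "used as a predictor" concrete, here is a rough sketch of the usual trick: score each candidate answer by the log-probability the model assigns to it as a continuation of the prompt, then pick the highest. The bigram counter below is only a toy stand-in for a real language model, and the corpus, function names, and smoothing are all my own assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (made up for illustration); a real LM would be trained on far more text.
CORPUS = ["the sky is blue", "the sky is blue today", "the grass is green"]

def train_bigram(corpus):
    """Count word bigrams; a stand-in for a real autoregressive LM."""
    counts = defaultdict(Counter)
    vocab = {"<s>"}
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts, len(vocab)

def log_prob(counts, vocab_size, prompt, completion):
    """Sum of log P(token | previous token) over the completion (add-one smoothing)."""
    prev = (["<s>"] + prompt.lower().split())[-1]
    total = 0.0
    for tok in completion.lower().split():
        row = counts[prev]
        total += math.log((row[tok] + 1) / (sum(row.values()) + vocab_size))
        prev = tok
    return total

def predict(counts, vocab_size, prompt, choices):
    """Multiple choice as prediction: return the most likely continuation."""
    return max(choices, key=lambda c: log_prob(counts, vocab_size, prompt, c))

COUNTS, VOCAB_SIZE = train_bigram(CORPUS)
ANSWER = predict(COUNTS, VOCAB_SIZE, "the sky is", ["blue", "green", "red"])
```

With a real model you would swap `log_prob` for the model's own token log-probabilities; the selection step stays the same.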
For GPT4: "Pricing is $0.03 per 1,000 “prompt” tokens (about 750 words) and $0.06 per 1,000 “completion” tokens (again, about 750 words)."
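For a sense of scale, here is the arithmetic on the quoted prices as a small sketch. The per-1,000-token prices come from the quote above; the workload numbers (500 prompt tokens and 10 completion tokens per request, one million requests) are hypothetical, picked to resemble a classification-style task.

```python
# Prices from the quote above, in USD per 1,000 tokens.
PROMPT_PRICE_PER_1K = 0.03
COMPLETION_PRICE_PER_1K = 0.06

def gpt4_cost_usd(prompt_tokens, completion_tokens):
    """Cost of one request at the quoted GPT4 prices."""
    return (prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
            + completion_tokens / 1000 * COMPLETION_PRICE_PER_1K)

# Hypothetical workload: short classification-style requests.
PER_REQUEST = gpt4_cost_usd(prompt_tokens=500, completion_tokens=10)
PER_MILLION_REQUESTS = PER_REQUEST * 1_000_000
```

That works out to about $0.0156 per request, or roughly $15,600 per million requests, before counting anything else.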
Meanwhile, there are off-the-shelf models that you can train very efficiently, on relevant data, privately, and that you can run on your own infrastructure.
Yes, GPT4 is probably great at all the benchmark tasks, but models have been great at all the open benchmark tasks for a long time. That's why benchmark designers have to keep making harder tasks.
Depending on what you actually want to do with LMs, GPT4 might lose to a BERTish model in a cost-benefit analysis, especially given that (in my experience) the hard part of ML is still getting data, QA, and infrastructure aligned with whatever it is you want to do with the ML. (At least at larger companies; maybe it's different at startups.)
For example: "Multiple-choice questions in 57 subjects (professional & academic)" - https://openai.com/research/gpt-4