Maybe in SOTA ML/NLP research, but in the world of building useful tools and products, BERT models are dead simple to tune, work great if you have decent training data, and, most importantly, are very, very fast and very, very cheap to run.

I have a small Swiss army collection of custom BERT fine-tunes that are equal to or better than the best LLMs and execute document classification tasks in 2.4ms. Find me an LLM that can do anything in 2.4ms.
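
For anyone curious, a fine-tune like that is only a few lines with the Hugging Face transformers library. A minimal sketch, assuming CSVs with "text" and "label" columns; the base model, label count, and file names are placeholders, not my actual setup:

    # Minimal BERT fine-tune sketch for document classification.
    # Base model, label count, and data files are placeholders; the CSVs are
    # assumed to have "text" and integer "label" columns.
    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              DataCollatorWithPadding, Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=4)

    ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512), batched=True)

    args = TrainingArguments(output_dir="bert-doc-clf", num_train_epochs=3,
                             per_device_train_batch_size=32, learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=ds["train"],
            eval_dataset=ds["test"],
            data_collator=DataCollatorWithPadding(tok)).train()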




Latency, throughput and cost are still very important for many applications.

Also the output of a purpose-built encoder model is preferable to natural language. Not only is it unambiguous, but scores are often an important part of the result.
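
Concretely, a fine-tuned encoder hands you a label plus a score rather than free-form text, something like this (the model path and example output are placeholders):

    # A fine-tuned encoder returns a label and a score, not free-form text.
    # The model path and the printed output below are placeholders.
    from transformers import pipeline

    clf = pipeline("text-classification", model="path/to/finetuned-bert")
    print(clf("quarterly earnings beat expectations"))
    # -> [{'label': 'FINANCE', 'score': 0.97}]   (illustrative)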

Last, if you need to get into some advanced methods of training, like pseudolabeling and semi-supervised learning, there are different options and outlets for utilizing real-world datasets.
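
Pseudolabeling, in its simplest form, is just running the current model over unlabeled documents and keeping only the confident predictions as extra training data. A rough sketch; the 0.95 threshold and names are arbitrary:

    # Rough pseudolabeling sketch: keep confident predictions on unlabeled docs
    # as extra training examples. The threshold and names are arbitrary.
    import torch

    def pseudolabel(model, tok, unlabeled_texts, threshold=0.95):
        model.eval()
        kept_texts, kept_labels = [], []
        with torch.no_grad():
            for text in unlabeled_texts:
                enc = tok(text, truncation=True, max_length=512, return_tensors="pt")
                probs = torch.softmax(model(**enc).logits, dim=-1)[0]
                conf, label = probs.max(dim=-1)
                if conf.item() >= threshold:       # only trust confident predictions
                    kept_texts.append(text)
                    kept_labels.append(label.item())
        return kept_texts, kept_labels             # merge into the labeled set and retrain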

That said, I’m not sure there’s much value in scaling up current encoder models. It seems like there’s already a point of diminishing returns.


Scores are also interesting in that you can get a 1.0 match on a classification task, but if the model is dog shit, it's 1.0 of dog shit.

I’m still struggling with the degree to which I want to expose raw scores to users for that reason.

On the other hand, sometimes a document that's slightly above an arbitrary threshold isn't great, while a document slightly below it may be fine.

I’m excited about how easy it is to do this stuff; the tooling is sophisticated enough now that you don’t need to know too much about the underlying mechanisms to do things that are useful. Once you get into those very fine distinctions, though, it’s still very difficult work.
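
One middle ground is exposing coarse bands rather than raw scores or a single hard cutoff; roughly (band edges here are arbitrary):

    # Sketch: map a raw classifier score to a coarse band instead of exposing
    # the raw number or a single hard cutoff. Band edges are arbitrary.
    def confidence_band(score: float) -> str:
        if score >= 0.90:
            return "high"
        if score >= 0.60:
            return "medium"
        return "low"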


Want to share your collection with the class so we can all learn? Seems useful.


Product in stealth for a little bit longer, so can’t say much. :-)


What does your Swiss army collection do?


Document classification in a highly ambiguous contextual space. Solving some specific large-scale classification tasks, so multi-million-document sets.


What technique do you use to get BERT to work on longer documents?


The standard 512-token limit has been sufficient to solve my problems. I made some initial attempts with BigBird that weren’t going well, but then realized I didn’t really need it.

I may revisit at some point.
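
In practice, "just use 512" means letting the tokenizer truncate; a small sketch, with the model name and input file as placeholders:

    # "Just use 512" in practice: let the tokenizer truncate long documents.
    # Model name and the input file are placeholders.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    document_text = open("some_document.txt").read()    # placeholder input
    enc = tok(document_text, truncation=True, max_length=512, return_tensors="pt")
    # If 512 ever stops being enough, one common fallback is chunking the
    # document and pooling the per-chunk predictions.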


Yeah, pretty much. When you have 2B files you need to trawl through, good luck using anything but a vector database. Once you do a level or two of pruning on the results, you can feed them into an LLM for final classification.
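
The shape of that pipeline, in code, is roughly this; the embedding model, the flat FAISS index, and the LLM call are all placeholders (at billions of files you'd use an approximate index instead):

    # Rough shape of the two-stage pipeline: prune with vector search first,
    # then send only the survivors to an LLM. Model names, index choice, and
    # the LLM call are placeholders.
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # Offline: embed the corpus and build an index.
    corpus = ["first document ...", "second document ..."]     # placeholders
    vecs = embedder.encode(corpus, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])    # inner product = cosine on normalized vecs
    index.add(np.asarray(vecs, dtype="float32"))

    # Online: prune to a handful of candidates per query.
    query = embedder.encode(["query text"], normalize_embeddings=True)
    _, ids = index.search(np.asarray(query, dtype="float32"), 2)

    # Final stage: only pruned candidates reach the (hypothetical) LLM classifier.
    candidates = [corpus[i] for i in ids[0]]
    # labels = [classify_with_llm(doc) for doc in candidates]  # hypothetical call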



