>> Just chunking every N characters wasn't especially fruitful
Is there any science associated with creating effective embedding sets? For a book, you could do every sentence, every paragraph, every page, or every chapter (or all of these). Eventually people will just want to point their RAG system at data and have everything work.
The easy answer is to just use a model to chunk your data for you. Phi-2 can chunk and annotate with pre/post summary context in one pass, and it's pretty fast/cheap.
There is an optimal chunk size, which IIRC is ~512 tokens depending on some factors. You could hierarchically model your data with embeddings by chunking the data, generating summaries of those chunks, chunking the summaries, and repeating that process ad nauseam until you only have a small number of top-level chunks.
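Roughly, in Python (a minimal sketch; `summarize` is a stand-in for whatever LLM call you use, and the whitespace chunker is a placeholder for a real tokenizer):

    def chunk_text(text, max_tokens=512):
        # naive whitespace "tokens"; swap in a real tokenizer
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]

    def build_hierarchy(text, summarize, max_top_level=8, max_tokens=512):
        # levels[0] = leaf chunks of the original text,
        # levels[-1] = a handful of top-level summary chunks.
        # Assumes summarize() shrinks the text each pass, so the loop terminates.
        levels = [chunk_text(text, max_tokens)]
        while len(levels[-1]) > max_top_level:
            summaries = [summarize(chunk) for chunk in levels[-1]]
            levels.append(chunk_text(" ".join(summaries), max_tokens))
        return levels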
This is an example of knowledge transfer from a model. I used a similar approach to augment chunked texts with questions, summaries, and key terms (which requires structured output from the LLM). I haven't tried using a smaller model to do this since GPT-3.5 is fast and cheap enough, but I like the idea of running a model in-house for things like this.
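Something like this, where `llm_complete` is a hypothetical stand-in for your model call (GPT-3.5, a local model, whatever) and the prompt/JSON schema is just illustrative:

    import json

    AUGMENT_PROMPT = """Return JSON with keys "summary" (1-2 sentences),
    "questions" (3 questions this chunk answers), and "keyterms" (up to 10 terms)
    for the following text chunk:

    {chunk}
    """

    def augment_chunk(chunk, llm_complete):
        # llm_complete(prompt) -> completion string; placeholder for whatever
        # model you use to generate the structured metadata.
        raw = llm_complete(AUGMENT_PROMPT.format(chunk=chunk))
        meta = json.loads(raw)  # in practice: validate and retry on bad JSON
        return {"text": chunk, **meta}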
Phi can ingest 2k tokens and the optimal chunk size is between 512 and 1024 tokens depending on the model/application, so you give it a big chunk and tell it to break it down into smaller chunks that are semantically related, leaving enough room for bookend sentences to enrich the context of each chunk. Then you start the next big chunk with the remnants of the previous one that the model couldn't group.
You don't need to handle a whole book; the goal is to chunk the book into chunks of the right size, which is less than the context size of the model you're using to chunk it semantically. When you're ingesting data, you fill up the chunker model's context, and it breaks that up into smaller, self-relevant chunks plus a remainder. You then start from the remainder, slurp up as much additional text as you can to fill the context, and repeat the process.
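As a loop, roughly (a sketch, with `semantic_split` standing in for the chunker-model call that returns grouped chunks plus whatever it couldn't place, and naive whitespace "tokens"):

    def chunk_document(text, semantic_split, context_tokens=2048):
        # semantic_split(window) -> (chunks, remainder): semantically grouped
        # chunks plus the trailing text the model couldn't place yet.
        words = text.split()  # naive tokenization; use a real tokenizer
        chunks, carry, pos = [], "", 0
        while pos < len(words):
            # leave room in the window for the carry-over from last round
            take = max(context_tokens - len(carry.split()), 1)
            window = (carry + " " + " ".join(words[pos:pos + take])).strip()
            pos += take
            new_chunks, carry = semantic_split(window)
            chunks.extend(new_chunks)
        if carry:
            chunks.append(carry)  # flush whatever's left at the end
        return chunks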