>> Just chunking every N characters wasn't especially fruitful
Is there any science associated with creating effective embedding sets? For a book, you could do every sentence, every paragraph, every page, or every chapter (or all of these). Eventually people will just want to point their RAG system at data and have everything work.
The easy answer is to just use a model to chunk your data for you. Phi-2 can chunk and annotate with pre/post summary context in one pass, and it's pretty fast/cheap.
There is an optimal chunk size, which IIRC is ~512 tokens depending on some factors. You could hierarchically model your data with embeddings by chunking the data, generating summaries of those chunks, chunking the summaries, and repeating that process ad nauseam until you only have a small number of top-level chunks.
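Roughly, in Python (a minimal sketch; `summarize` is a stand-in for whatever LLM call you use, and the whitespace chunker is a placeholder for a real tokenizer):

    def chunk_text(text, max_tokens=512):
        # naive whitespace "tokens"; swap in a real tokenizer
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]

    def build_hierarchy(text, summarize, max_top_level=8, max_tokens=512):
        # levels[0] = leaf chunks of the original text,
        # levels[-1] = a handful of top-level summary chunks.
        # Assumes summarize() shrinks the text each pass, so the loop terminates.
        levels = [chunk_text(text, max_tokens)]
        while len(levels[-1]) > max_top_level:
            summaries = [summarize(chunk) for chunk in levels[-1]]
            levels.append(chunk_text(" ".join(summaries), max_tokens))
        return levels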
This is an example of knowledge transfer from a model. I used a similar approach to augment chunked texts with questions, summaries, and key terms (which requires structured output from the LLM). I haven't tried using a smaller model to do this since GPT-3.5 is fast and cheap enough, but I like the idea of running a model in-house for things like this.
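Something like this, where `llm_complete` is a hypothetical stand-in for your model call (GPT-3.5, a local model, whatever) and the prompt/JSON schema is just illustrative:

    import json

    AUGMENT_PROMPT = """Return JSON with keys "summary" (1-2 sentences),
    "questions" (3 questions this chunk answers), and "keyterms" (up to 10 terms)
    for the following text chunk:

    {chunk}
    """

    def augment_chunk(chunk, llm_complete):
        # llm_complete(prompt) -> completion string; placeholder for whatever
        # model you use to generate the structured metadata.
        raw = llm_complete(AUGMENT_PROMPT.format(chunk=chunk))
        meta = json.loads(raw)  # in practice: validate and retry on bad JSON
        return {"text": chunk, **meta}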
Phi can ingest 2k tokens and the optimal chunk size is between 512 and 1024 tokens depending on the model/application, so you give it a big chunk and tell it to break it down into smaller chunks that are semantically related, leaving enough room for bookend sentences to enrich the context of each chunk. Then you start the next big chunk with the remnants of the previous one that the model couldn't group.
You don't need to handle a whole book; the goal is to chunk the book into chunks of the right size, which is less than the context size of the model you're using to chunk it semantically. When you're ingesting data, you fill up the chunker model's context, and it breaks that up into smaller, self-relevant chunks plus a remainder. You then start from the remainder, slurp up as much additional text as you can to fill the context, and repeat the process.
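As a loop, roughly (a sketch, with `semantic_split` standing in for the chunker-model call that returns grouped chunks plus whatever it couldn't place, and naive whitespace "tokens"):

    def chunk_document(text, semantic_split, context_tokens=2048):
        # semantic_split(window) -> (chunks, remainder): semantically grouped
        # chunks plus the trailing text the model couldn't place yet.
        words = text.split()  # naive tokenization; use a real tokenizer
        chunks, carry, pos = [], "", 0
        while pos < len(words):
            # leave room in the window for the carry-over from last round
            take = max(context_tokens - len(carry.split()), 1)
            window = (carry + " " + " ".join(words[pos:pos + take])).strip()
            pos += take
            new_chunks, carry = semantic_split(window)
            chunks.extend(new_chunks)
        if carry:
            chunks.append(carry)  # flush whatever's left at the end
        return chunks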