Thanks for the reply. I really did like the txtai approach. I am working with ma...

dmezzetti · 2024-07-22T13:54:08 1721656448

I understand. Interestingly enough, the textractor pipeline actually produces Markdown as its output, as I've found it to be a format most LLMs work well with.

I know you've already found a solution, but for the record, the Markdown files could have been read in directly and then passed to a segmentation pipeline. That way you wouldn't need any of the deps of the textractor pipeline.
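
For anyone landing here later, a rough sketch of that idea follows. It assumes txtai's `Segmentation` pipeline with `paragraphs=True`; the `docs` directory name and the `read_markdown` helper are made up for illustration, and the fallback splitter is only a stand-in so the snippet still runs without txtai installed.

```python
import re
from pathlib import Path

try:
    # txtai's segmentation pipeline, configured to split on blank lines
    from txtai.pipeline import Segmentation

    segment = Segmentation(paragraphs=True)
except ImportError:
    # Stand-in (not txtai): split on blank lines and collapse repeated spaces
    def segment(text):
        chunks = (re.sub(r" +", " ", x).strip() for x in re.split(r"\n{2,}", text))
        return [x for x in chunks if x]


def read_markdown(directory):
    """Hypothetical helper: yield (filename, chunk) for each *.md file in directory."""
    for path in sorted(Path(directory).glob("*.md")):
        # Read the Markdown as plain text, no textractor step involved
        text = path.read_text(encoding="utf-8")
        for chunk in segment(text):
            yield path.name, chunk


# Example usage, assuming a local "docs" directory of Markdown files:
# for name, chunk in read_markdown("docs"):
#     print(name, chunk[:80])
```

The chunks could then be indexed or passed to an LLM the same way textractor output would be. Paragraph splitting is just one choice here; the right granularity depends on the documents.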