
Thanks for the reply. I really did like the txtai approach.

I am working with markdown files. I think that required me to use Tika and Java, based on this note in your docs [0]:

Note: BeautifulSoup4 only supports HTML documents, anything else requires Tika and Java to be installed.

Tika did a great job of chunking the markdown into sections with appropriate parent header context, if I remember correctly.

I just couldn't ask my users to manually install such complex dependencies. I worried about the support burden I would incur, due to the types of issues they would encounter.

[0] https://neuml.github.io/txtai/pipeline/data/textractor/




I understand. Interestingly enough, the textractor pipeline itself outputs Markdown, since I've found it to be a format most LLMs work well with.

I know you've already found a solution, but for the record, the markdown files could have been read in directly and then passed to a segmentation pipeline. That way you wouldn't need any of the textractor pipeline's deps.
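
To make the idea concrete, here is a rough sketch of reading markdown text directly and splitting it into sections that carry their parent header context. It is a stdlib-only stand-in, not txtai's segmentation pipeline: the function name `split_markdown_sections` and the output shape are assumptions made up for illustration, and it only handles `#`-style headers.

```python
import re

# Matches ATX-style markdown headers such as "## Title" (levels 1 to 6).
HEADER = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_sections(text):
    """Split markdown text into sections tagged with their parent headers.

    Hypothetical helper for illustration only. It does not skip fenced
    code blocks, so a "# comment" line inside one is treated as a header.
    """
    sections = []
    stack = []  # (level, title) pairs forming the current header path
    body = []   # lines collected since the last header

    def flush():
        # Emit the collected body, if any, under the current header path.
        content = "\n".join(body).strip()
        if content:
            path = " > ".join(title for _, title in stack)
            sections.append({"headers": path, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or a deeper level before descending.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections


sample = "# Guide\nIntro text.\n## Install\nRun the installer.\n# FAQ\nSee above.\n"
for section in split_markdown_sections(sample):
    print(section["headers"], "|", section["text"])
```

For a real file, the text could come from something like `pathlib.Path("notes.md").read_text(encoding="utf-8")` (the file name is a placeholder), and each section's text could then be handed to whatever indexing step comes next.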



