An underrated alternative to Unstructured/Nougat for text extraction (min.io)
2 points by replicantrose on Feb 6, 2024 | 2 comments


Apache Tika is time-tested, and some consider it a legacy toolkit. But with Tika running as a container and Python bindings on top, you get a text extraction experience that is as easy to build with as newer frameworks like Unstructured, while also matching the extraction capability of dedicated extraction models like Nougat. Kind of surprising!

Furthermore, using a backing object store (e.g. MinIO) to hold the source documents is very useful, whether the extracted text is being used for RAG or for an LLM training dataset.


I put together a document text extraction server using Apache Tika (in ~30 lines of code). The extracted text can be vectorized for retrieval-augmented generation or used to create LLM training datasets.

Much credit to the tika-python project for making the Python bindings!
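
For anyone who wants to try the idea, here is a rough sketch of the extraction step using tika-python against a Tika container. It is not the code from the linked post. The endpoint URL, the `extract_text` helper, and the optional `parse` argument are my own assumptions for illustration; the only library call used is `parser.from_buffer`.

```python
from typing import Callable, Optional

# Assumption: a Tika server container is listening on its default port.
TIKA_ENDPOINT = "http://localhost:9998"


def _tika_parse(data: bytes) -> dict:
    # Imported lazily so this module still loads where tika-python is not installed.
    from tika import parser
    return parser.from_buffer(data, serverEndpoint=TIKA_ENDPOINT)


def extract_text(data: bytes, parse: Optional[Callable[[bytes], dict]] = None) -> str:
    """Return the plain text extracted from raw document bytes.

    `parse` is a hypothetical hook for swapping in another parser (or a stub
    in tests); by default the bytes are sent to the Tika server.
    """
    parsed = (parse or _tika_parse)(data)
    # tika-python returns a dict; "content" can be None when no text was found.
    content = parsed.get("content") or ""
    # Collapse runs of whitespace so the text is easier to chunk later.
    return " ".join(content.split())
```

The bytes can come from anywhere, for example an object read out of a bucket, so something like `extract_text(open("report.pdf", "rb").read())` (with a made-up file name) would be the simplest way to call it.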



