Txtai: Open-source vector search and RAG for minimalists (neuml.github.io)
249 points by dmezzetti 3 months ago | 55 comments



Hello, author of txtai here. txtai was created back in 2020, starting with semantic search of medical literature. It has since grown into a framework for vector search, retrieval-augmented generation (RAG) and large language model (LLM) orchestration/workflows.

The goal of txtai is to be simple, performant, innovative and easy to use. It had vector search before many current projects existed. Semantic Graphs were added in 2022, before the Generative AI wave of 2023/2024. GraphRAG is a hot topic now, but txtai had examples of using graphs to build search contexts back in 2022/2023.

There is a commitment to quality and performance, especially with local models. For example, its vector embeddings component streams vectors to disk during indexing and uses memory-mapped arrays, enabling large datasets to be indexed locally on a single node. txtai's BM25 component is built from scratch to work efficiently in Python, leading to 6x better memory utilization and faster search performance than the most commonly used Python BM25 library.
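
For a quick sense of the API, here is a minimal index-and-search sketch along the lines of the README examples (the model name is just an illustrative choice):

    # Minimal txtai sketch: build an index and run a semantic search
    from txtai import Embeddings

    embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)

    # Index a few documents (any iterable of text works)
    embeddings.index([
        "US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed"
    ])

    # Returns the best matching document with its score
    print(embeddings.search("public health story", 1))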

I often see others complain about AI/LLM/RAG frameworks, so I wanted to share this project as many don't know it exists.

Link to source (Apache 2.0): https://github.com/neuml/txtai


So here's something I've been wanting to do for a while, but have kinda been struggling to figure out _how_ to do it. txtai looks like it has all the tools necessary to do the job, I'm just not sure which tool(s), and how I'd use them.

Basically, I'd like to be able to take PDFs of, say, D&D books, extract that data (this step is, at least, something I can already do), and load it into an LLM to be able to ask questions like:

* What does the feat "Sentinel" do?

* Who is Elminster?

* Which God(s) do Elves worship in Faerûn?

* Where can I find the spell "Crusader's Mantle"?

And so on. Given this data is all under copyright, I'd probably have to stick to using a local LLM to avoid problems. And, while I wouldn't expect it to have good answers to all (or possibly any!) of those questions, I'd nevertheless love to be able to give it a try.

I'm just not sure where to start - I think I'd want to fine-tune an existing model since this is all natural language content, but I get a bit lost after that. Do I need to pre-process the content to add extra information that I can't fetch relatively automatically? For example, page numbers are simple to add in, but would I need to mark out things like chapter/section headings, or in-character vs out-of-character text? Do I need to add all the content in as a series of questions and answers, like "What information is on page 52 of the Player's Handbook? => <text of page>"?


Use RAG.

Fine-tuning will bias the model to return specific answers. It's great for tone and classification. It's terrible for information. If you get info out of it, it's because it's a consistent hallucination.

Embeddings will turn the whole thing into a bunch of numbers. So something like Sentinel will probably match with similar feats. Embeddings are perfect for searching. You can convert images and sound to these numbers too.
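
To make "a bunch of numbers" concrete, here's a rough sketch with a sentence-transformers model (the model name and passage text are just examples, nothing txtai-specific):

    # Rough sketch: embed a question and a few passages, compare with cosine similarity
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    passages = [
        "Sentinel is a feat that improves your opportunity attacks.",
        "Crusader's Mantle is a spell that radiates an aura of holy power."
    ]
    query = "What does the feat Sentinel do?"

    # Every string becomes a fixed-length vector; similar meanings land close together
    scores = util.cos_sim(model.encode(query), model.encode(passages))
    print(scores)  # the Sentinel passage should score highest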

But these numbers can't be stored in any regular DB. Most of the time they're somewhere in memory, then thrown out. I haven't looked deeply into txtai, but it looks like that's what it does. This is okay, but it's a little slow and wasteful since you're recomputing the embeddings each time. That's what vector DBs are for. But unless you're running this at scale, where every cent adds up, you don't really need one.

As for preprocessing, many embedding models are already good enough. I'd say try it first, try different models, then tweak as needed. Generally proprietary models do better than open source, but there's likely an open source one designed for game books, which would do best on an unprocessed D&D book.

However, it's likely to be poor at matching pages, AFAIK, unless you attach that info.


Based on what you're looking to do, it sounds like Retrieval Augmented Generation (RAG) should help. This article has an example on how to do that with txtai: https://neuml.hashnode.dev/build-rag-pipelines-with-txtai

RAG sounds sophisticated but it's actually quite simple. For each question, a database (vector, keyword, relational, etc.) is first searched. The top n results are then inserted into a prompt, and that prompt is what is run through the LLM.
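
In code, that loop looks roughly like this (a sketch, not the exact txtai pipeline API; the linked article covers that, and "llm" here stands in for whatever model callable you use):

    # Sketch of the RAG loop described above
    from txtai import Embeddings

    chunks = ["Sentinel is a feat that...", "Elminster is a wizard of..."]  # extracted text

    embeddings = Embeddings(content=True)
    embeddings.index(chunks)

    def rag(question, llm, n=3):
        # 1. Search the database for the top n results
        context = "\n".join(x["text"] for x in embeddings.search(question, n))
        # 2. Insert those results into a prompt
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        # 3. Run the prompt through the LLM
        return llm(prompt)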

Before fine-tuning, I'd try that out first. I'm planning to have another example notebook out soon building on this.


Ah, that's very helpful, thanks! I'll have a dig into this at some point relatively soon.

An example of how I might provide references with page numbers or chapter names would be great (even if this means a more complex text-extraction pipeline). As would examples showing anything I can do to indicate differences that are obvious to me but that an LLM would be unlikely to pick up, such as the previously mentioned in-character vs out-of-character distinction. This is mostly relevant for asking questions about the setting, where in-character information might be suspect ("unreliable narrator"), while out-of-character information is generally fully accurate.

Tangentially, is this something that I could reasonably experiment with without a GPU? While I do have a 4090, it's in my Windows gaming machine, which isn't really set up for AI/LLM/etc development.


Will do, I'll have the new notebooks published within the next couple weeks.

In terms of a no-GPU setup, yes, it's possible, but it will be slow. As long as you're OK with slow response times, it will eventually come back with answers.


Thanks, I'd really appreciate it! The blog post you linked earlier was what finally made RAG "click" for me, making it very clear how it works, at least for the relatively simple tasks I want to do.


Glad to hear it. It's really a simple concept.


Where can we follow up on this when you're done--do you have a blog or social media?


All the links for that are here - https://neuml.com


All the people saying "don't use fine-tuning" don't realize that most of traditional fine-tuning's issues are due to modifying all of the weights in your model, which causes catastrophic forgetting.

There are tons of parameter-efficient fine-tuning methods, e.g. LoRA, "soft prompts", ReFT, etc., which are actually good to use alongside RAG and will likely supercharge your solution compared to "simply using RAG". The fewer parameters you modify, the more existing knowledge is "preserved".
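
For reference, a LoRA setup with the peft library looks roughly like this (base model and hyperparameters are placeholder choices, not recommendations):

    # Sketch: wrap a base model with LoRA adapters so only a small fraction of weights trains
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # placeholder base model

    config = LoraConfig(
        r=8,              # adapter rank
        lora_alpha=16,
        lora_dropout=0.05,
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically well under 1% of the base weights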

Also, look into the Graph-RAG/Semantic Graph stuff in txtai. As usual, David (author of txtai) was implementing code for things that the market only just now cares about years ago.


Thanks for the great insights on fine-tuning and the kind words!


You can actually do this with LLMStack (https://github.com/trypromptly/LLMStack) quite easily in a no-code way. I put together a guide last week on using LLMStack with Ollama for local models - https://docs.trypromptly.com/guides/using-llama3-with-ollama. It lets you load all your files as a datasource and then build a RAG app over it.

For now it still uses OpenAI for embedding generation by default; we are updating that in the next couple of releases so a local model can be used to generate embeddings before writing to a vector DB.

Disclosure: I'm the maintainer of LLMStack project


I did something similar to this using RAG except for Vampire rather than D&D. It wasn't overwhelmingly difficult, but I found that the system was quite sensitive to how I chunked up the books. Just letting an automated system prepare the PDFs for me gave very poor results all around. I had to ensure that individual chunks had logical start/end positions, that tables weren't cut off, and so on.
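
A simplified sketch of the kind of heading-aware chunking I mean (illustrative only, not my actual code):

    # Split on extracted headings, then cap chunk size at paragraph boundaries
    # so sections and tables are not cut off mid-thought.
    import re

    def chunk(text, max_chars=1500):
        sections = re.split(r"\n(?=#+ )", text)  # assumes headings were extracted as "# ..."
        chunks = []
        for section in sections:
            current = ""
            for para in section.split("\n\n"):
                if current and len(current) + len(para) > max_chars:
                    chunks.append(current.strip())
                    current = ""
                current += para + "\n\n"
            if current.strip():
                chunks.append(current.strip())
        return chunks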

I wouldn't fine-tune, that's too much cost/effort.


Yeah, that's about what I'd expected (and WoD books would be a priority for me to index). Another commenter mentioned that Knowledge Graphs might be useful for dealing with the limitations imposed by RAG (e.g., having to limit results because the context window is relatively small), which might be worth looking into as well. That said, properly preparing this data for a KG, ontologies and all, might be too much work.


RAG is all you need*. This is a pretty DIY setup, but I use a private instance of Dify for this. I have a private Git repository where I commit my "knowledge", a Git hook syncs the changes with the Dify knowledge API, and then I use the Dify API/chat for querying.

*it would probably be better to add a knowledge graph as an extra step, which first tells the system where to search. RAG by itself is pretty bad at summarizing and combining many different docs due to the limited LLM context sizes, and I find that many questions require this global overview. A knowledge graph or other form of index/meta-layer probably solves that.


From a quick search, it seems like Knowledge Graphs are particularly new, even by AI standards, so it's harder to get one off the ground if you haven't been following AI extremely closely. Is that accurate, or is it just the integration points with AI that are new?


First I would calculate the number of tokens you actually need. If it's less than 32k, there are plenty of ways to pull this off without RAG. If it's more (millions), you should understand RAG is an approximation technique and results may not be as high quality. If it's wayyyy more (billions), you might actually want to fine-tune.
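
A rough way to get that number (tiktoken's tokenizer is just a proxy here; local models tokenize somewhat differently):

    # Rough token count for a pile of extracted text
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = open("players_handbook.txt").read()  # hypothetical extracted text
    print(len(enc.encode(text)))  # compare against your model's context window (e.g. 32k)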


Fine-tuning is almost certainly the wrong way to go about this. It's not a good way of adding small amounts of new knowledge to a model because the existing knowledge tends to overwhelm anything you attempt to add in the fine-tuning steps.

Look into different RAG and tool usage mechanisms instead. You might even be able to get good results from dumping large amounts of information into a long context model like Gemini Flash.


No fine-tuning is necessary. You can use something reasonably good at RAG that's small enough to run locally, like the Command-R model run via Ollama, and a small embedding model like Nomic. There are dozens of simple interfaces that will let you import files to create a RAG knowledge base to interact with as you describe; AnythingLLM is a popular one. Just point it at your locally running LLM or tell it to download one using the interface. Behind the scenes they store everything in LanceDB or similar and perform the searching for you when you submit a prompt in the simple chat interface.


Don't have anything to add to the others. Just sharing a way of thinking for deciding between RAG and fine-tuning:

(A) RAG is for changing content

(B) fine-tuning is for changing behaviour

(C) see if few-shot learning or prompt engineering is enough before going to (A) or (B)

It's a bit simplistic, but I've found it helpful so far.


Very easy to do with Milvus and LangChain. I built a private Slack bot that takes PDFs, chunks them into Milvus using PyMuPDF, then uses LangChain for recall. It's surprisingly good for what you describe and took maybe 2 hours to build and run locally.
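
The extraction step with PyMuPDF is roughly this (a sketch, not the actual bot code):

    # Sketch: pull per-page text out of a PDF with PyMuPDF before chunking/embedding
    import fitz  # PyMuPDF

    doc = fitz.open("monster_manual.pdf")  # hypothetical file
    pages = [(number + 1, page.get_text()) for number, page in enumerate(doc)]
    # pages now holds (page_number, text) tuples, handy for citing pages later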


Seems like using txtai would also be very easy?


Yes, this article is a good place to start: https://neuml.hashnode.dev/build-rag-pipelines-with-txtai


I learned about txtai later and it definitely seems cool, maybe I'll rewrite it later.


Typical HN response here, but do you have a blog post or a guide on how you did this? Would love to know more.


I used AI, go feed it my comment.


I’ve done something similar, but using DuckDB as the backend/vector store. You can use embeddings from wherever. My demo uses OpenAI.

https://github.com/patricktrainer/duckdb-embedding-search


I did some prototyping with txtai for the RAG used in aider’s interactive help feature [0]. This lets users ask aider questions about using aider, customizing settings, troubleshooting, using LLMs, etc.

I really liked the simplicity of txtai. But it seems to require Java as a dependency! Aider is an end-user CLI tool, and ultimately I couldn’t take on the support burden of asking my users to install Java.

[0] https://aider.chat/docs/troubleshooting/support.html


Thanks for giving txtai a try.

txtai doesn't require Java. It has a text extraction component which can optionally use Apache Tika. Apache Tika is a Java library. Tika can also be spun up as a Docker image much like someone can spin up Ollama for LLM inference.

Looking at your use case, it appears you wanted to parse and index HTML? If so, the only dependency should have been BeautifulSoup4.

Alternatively, one can use another library such as unstructured.io or PyMuPDF for Word/PDF. Those are not issue-free though. For example, unstructured requires LibreOffice for Word documents and Poppler for PDFs. PyMuPDF is AGPL, which is a non-starter for many. Apache Tika is Apache 2.0, mature, and has robust production-quality support for a lot of formats.


Thanks for the reply. I really did like the txtai approach.

I am working with markdown files. I think that required me to use Tika & Java based on this note in your docs [0]?

Note: BeautifulSoup4 only supports HTML documents, anything else requires Tika and Java to be installed.

Tika did a great job of chunking the markdown into sections with appropriate parent header context, if I remember correctly.

I just couldn't ask my users to manually install such complex dependencies. I worried about the support burden I would incur, due to the types of issues they would encounter.

[0] https://neuml.github.io/txtai/pipeline/data/textractor/


I understand. Interestingly enough, the textractor pipeline actually outputs Markdown, as I've found it to be a format most LLMs work well with.

I know you've already found a solution, but for the record, the Markdown files could have been read in directly and then passed to a segmentation pipeline. That way you wouldn't need any of the deps of the textractor pipeline.
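
Something along these lines (a sketch; see the segmentation pipeline docs for the exact options):

    # Sketch: read Markdown files directly and split into paragraphs with txtai
    from txtai.pipeline import Segmentation

    segment = Segmentation(paragraphs=True)

    with open("docs/usage.md") as f:  # hypothetical file
        chunks = segment(f.read())

    # chunks is now a list of paragraph strings ready to index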


I’ve been building a RAG mini app with txtai these past few weeks and it’s been pretty smooth. I’m between this and LlamaIndex as the backend for a larger app I want to build for a small-to-midsize customer.

With the (potentially) obvious bias towards your own framework, are there situations in which you would not recommend it for a particular application?


Glad to hear txtai is on your list.

I recently wrote an article (https://medium.com/neuml/vector-search-rag-landscape-a-revie...) comparing txtai with other popular frameworks. I was expecting to find some really interesting and innovative things in the others. But from my perspective I was underwhelmed.

I'm a big fan of simplicity and none of them are following that strategy. "Agentic workflows" seems like a big fancy term, but I don't see the value currently. Things are hard enough as it is.

If your team is already using another framework, I'm sure anything can work. Some of the other projects are VC-backed with larger teams. In some cases, that may be important.


"Interested in an easy and secure way to run hosted txtai applications? Then join the txtai.cloud preview to learn more."

I wish the author all the best and this seems to be a very sane and minimalist approach when compared to all the other enterprise-backed frameworks and libraries in this space. I might even become a customer!

However, has someone started an open source library that's fully driven by a community? I'm thinking of something like Airflow or Git. I'm not saying that the "purist" model is the best or enterprise-backed frameworks are evil. I'm just not seeing this type of project in this space.


Appreciate the well wishes.

NeuML is not venture-backed, so there is no impetus to build a hosted version. The main goal is making txtai easier for a larger audience to use.


Has anyone had experience with Qdrant (https://qdrant.tech/) as a vector data store and can speak to how txtai compares?


txtai is not (just) a vector store, it's a full-fledged RAG system. Apples and oranges if you ask me.


I agree that the comparison with LangChain/LlamaIndex is probably the better one.

With that being said, txtai has a much more in-depth approach to how it builds its data stores vs just assuming the underlying systems will handle everything. It supports running SQL statements and integrates the components in a way other RAG systems don't. It was also a vector store before it had a RAG workflow. There are years of code behind that part.
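
For example, with content storage enabled, searches can mix SQL with similarity clauses (syntax per the txtai docs, shown from memory):

    # Sketch: SQL-style query over a txtai embeddings database
    from txtai import Embeddings

    embeddings = Embeddings(content=True)
    embeddings.index(["a day at the beach", "snow storm hits the city"])

    results = embeddings.search(
        "SELECT id, text, score FROM txtai WHERE similar('winter weather') AND score >= 0.1"
    )
    print(results)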


It is very impressive :)


Looks pretty cool! Is this intended to be a simple alternative to, say, cobbling together something with LangChain and Chroma?


Thanks. That is correct. This is an alternative to LangChain/LlamaIndex on the RAG side and Chroma on the vector db side.


This looks interesting. I've been wanting to build some tools to help feed text documents into Stable Diffusion and this looks like it could be helpful. Are there any other libs people are aware of that they'd recommend in this space?


txtai gets things done quickly, but one problem is that the code base is not properly typed (in contrast to Haystack, which has a somewhat higher learning curve but more rigorous typing). Would be nice if this project were properly type-annotated.


We certainly could add typing to the main API calls. Typing isn't a huge thing to me as a developer, so I've never really made it a priority. The only place there is typing is in the FastAPI hooks, given it's required there.




What type of embeddings db does it use? Is it interchangeable?


You can read more on that here: https://neuml.github.io/txtai/embeddings/configuration/vecto...

txtai supports Hugging Face Transformers models, llama.cpp embeddings models and API services such as OpenAI/Cohere/Ollama.
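
As a sketch of what swapping vector models looks like (config keys are from the linked docs page, shown from memory, so double-check the exact values there):

    # Sketch: the vector model is a config choice, not a code change
    from txtai import Embeddings

    # Local Hugging Face / sentence-transformers model
    local = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)

    # llama.cpp GGUF models and API services (OpenAI/Cohere/Ollama) are configured
    # the same way by pointing "path" (and, if needed, "method") at the provider;
    # see the vectors configuration docs linked above for the exact settings.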


It's frustrating when developers of ML projects don't state even the most basic requirements. Do I need an Nvidia 4090 or a cluster of H100s to run this?


The embedding models at the heart of txtai can be small enough to run on Intel CPUs from ten years ago. It's extremely frustrating when HN commenters don't do even the most basic research into the product they are critiquing.


It’s frustrating when people ask for hardware requirements without stating what they are trying to do. Do you have 100,000,000 books to index, or do you have 5 articles? What are the context lengths you need? What about latency?

How can someone tell you what hardware you need when you give literally no information about what you’re trying to do?


There's a difference between "how many CPU-hours will my task need" and "how much memory does this program use to even start up".


Having some idea of the task will guide the choice of model, which will be an enormous factor in memory use (i.e., whether it will start up or not).

Do you need a 70B-parameter model or a 7B model? There's thousands and thousands of dollars of hardware difference there.

With no idea of the task, one can’t even ballpark it.


This particular tool has a page listing recommended models: https://neuml.github.io/txtai/models/


An RTX 3090 is more than enough for 7B LLMs. With 4-bit quantization, you can run inference with an even larger LLM using a 24GB GPU.

If you're using remote API services, you might be able to just use a CPU.
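
For the 4-bit case, the loading step is roughly this (the model id is a placeholder; substitute whatever larger model you want to run):

    # Sketch: load an LLM with 4-bit quantization so it fits in 24GB of VRAM
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb = BitsAndBytesConfig(load_in_4bit=True)
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model id
        quantization_config=bnb,
        device_map="auto"
    )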



