Hello, author of txtai here. txtai was created back in 2020 starting with semantic search of medical literature. It has since grown into a framework for vector search, retrieval augmented generation (RAG) and large language model (LLM) orchestration/workflows.
The goal of txtai is to be simple, performant, innovative and easy-to-use. It had vector search before many current projects existed. Semantic Graphs were added in 2022 before the Generative AI wave of 2023/2024. GraphRAG is a hot topic but txtai had examples of using graphs to build search contexts back in 2022/2023.
There is a commitment to quality and performance, especially with local models. For example, its vector embeddings component streams vectors to disk during indexing and uses memory-mapped arrays to enable indexing large datasets locally on a single node. txtai's BM25 component is built from scratch to work efficiently in Python, leading to 6x better memory utilization and faster search performance than the most commonly used BM25 Python library.
I often see others complain about AI/LLM/RAG frameworks, so I wanted to share this project as many don't know it exists.
So here's something I've been wanting to do for a while, but have kinda been struggling to figure out _how_ to do it. txtai looks like it has all the tools necessary to do the job, I'm just not sure which tool(s), and how I'd use them.
Basically, I'd like to be able to take PDFs of, say, D&D books, extract that data (this step is, at least, something I can already do), and load it into an LLM to be able to ask questions like:
* What does the feat "Sentinel" do?
* Who is Elminster?
* Which God(s) do Elves worship in Faerûn?
* Where can I find the spell "Crusader's Mantle"?
And so on. Given this data is all under copyright, I'd probably have to stick to using a local LLM to avoid problems. And, while I wouldn't expect it to have good answers to all (or possibly any!) of those questions, I'd nevertheless love to be able to give it a try.
I'm just not sure where to start - I think I'd want to fine-tune an existing model since this is all natural language content, but I get a bit lost after that. Do I need to pre-process the content to add extra information that I can't fetch relatively automatically? e.g., page numbers are simple to add in, but would I need to mark out things like chapter/section headings, or in-character vs out-of-character text? Do I need to add all the content in as a series of questions and answers, like "What information is on page 52 of the Player's Handbook? => <text of page>"?
Fine-tuning will bias a model to return specific answers. It's great for tone and classification. It's terrible for information. If you get info out of it, it's because it's a consistent hallucination.
Embeddings will turn the whole thing into a bunch of numbers. So something like Sentinel will probably match with similar feats. Embeddings are perfect for searching. You can convert images and sound to these numbers too.
But these numbers can't be stored in any regular DB. Most of the time they live somewhere in memory, then get thrown out. I haven't looked deep into txtai but it looks like that's what it does. This is okay, but it's a little slow and wasteful since you're re-running the embeddings each time. That's what vector DBs are for. But unless you're running this at scale where every cent adds up, you don't really need one.
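For what it's worth, it looks like txtai can persist its index to disk and reload it, so you aren't re-embedding everything on every run. A rough, untested sketch (the model name, paths and sample row are just placeholders):

```python
from txtai import Embeddings

# Build the index once and save it to disk
embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)
embeddings.index([{"text": "Sentinel. When a creature within your reach makes an attack..."}])  # toy row
embeddings.save("dnd-index")

# Later runs just load the saved index and search it
embeddings = Embeddings()
embeddings.load("dnd-index")
print(embeddings.search("feat that punishes opportunity attacks", 3))
```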
As for preprocessing, many embedding models are already good enough. I'd say try it first, try different models, then tweak as needed. Generally proprietary models do better than open source, but there's likely an open source one designed for game books, which would do best on an unprocessed D&D book.
However it's likely to be poor at matching pages afaik, unless you attach that info.
RAG sounds sophisticated but it's actually quite simple. For each question, a database (vector database, keyword, relational etc) is first searched. The top n results are then inserted into a prompt and that is what is run with the LLM.
Before fine-tuning, I'd try that out first. I'm planning to have another example notebook out soon building on this.
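To make that concrete, here's roughly what that loop looks like with txtai (the model name is just an example, and the toy documents stand in for text extracted from your PDFs):

```python
from txtai import Embeddings
from txtai.pipeline import LLM

# Toy rows standing in for extracted PDF text
documents = [
    {"text": "Sentinel. When you hit a creature with an opportunity attack, its speed becomes 0."},
    {"text": "Elminster Aumar is a wizard of Shadowdale in the Forgotten Realms."},
]

embeddings = Embeddings(content=True)
embeddings.index(documents)

llm = LLM("TheBloke/Mistral-7B-OpenOrca-AWQ")  # any local instruction-tuned model works here

def rag(question, topn=3):
    # 1. Search the index for the most relevant passages
    context = "\n".join(r["text"] for r in embeddings.search(question, topn))

    # 2. Insert them into a prompt and run the LLM over it
    prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm(prompt)

print(rag('What does the feat "Sentinel" do?'))
```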
Ah, that's very helpful, thanks! I'll have a dig into this at some point relatively soon.
An example of how I might provide references with page numbers or chapter names would be great (even if this means a more complex text-extraction pipeline). As would examples showing anything I can do to indicate differences that are obvious to me but that an LLM would be unlikely to pick up, such as the previously mentioned in-character vs out-of-character distinction. This is mostly relevant for asking questions about the setting, where in-character information might be suspect ("unreliable narrator"), while out-of-character information is generally fully accurate.
Tangentially, is this something that I could reasonably experiment with without a GPU? While I do have a 4090, it's in my Windows gaming machine, which isn't really set up for AI/LLM/etc development.
Will do, I'll have the new notebooks published within the next couple weeks.
In terms of a no-GPU setup: yes, it's possible, but it will be slow. As long as you're OK with slow response times, it will eventually come back with answers.
Thanks, I'd really appreciate it! The blog post you linked earlier was what finally made RAG "click" for me, making it very clear how it works, at least for the relatively simple tasks I want to do.
All the people saying "don't use fine-tuning" don't realize that most of traditional fine-tuning's issues are due to modifying all of the weights in the model, which causes catastrophic forgetting.
There are tons of parameter-efficient fine-tuning methods, e.g. LoRA, "soft prompts", ReFT, etc., which are actually good to use alongside RAG and will likely supercharge your solution compared to "simply using RAG". The fewer parameters you modify, the more knowledge is "preserved".
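As a rough illustration with the Hugging Face peft library (the base model name is a placeholder and the actual training loop is omitted), wrapping a model in LoRA adapters leaves the original weights frozen and trains only a tiny fraction of parameters:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a base model and attach LoRA adapters; only the adapter weights train,
# the original parameters stay frozen, which limits catastrophic forgetting.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder base model
config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```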
Also, look into the Graph-RAG/Semantic Graph stuff in txtai. As usual, David (author of txtai) was implementing code for things that the market only just now cares about years ago.
For now it still uses OpenAI for embeddings generation by default, and we are updating that in the next couple of releases to support a local model for embedding generation before writing to a vector DB.
Disclosure: I'm the maintainer of LLMStack project
I did something similar to this using RAG except for Vampire rather than D&D. It wasn't overwhelmingly difficult, but I found that the system was quite sensitive to how I chunked up the books. Just letting an automated system prepare the PDFs for me gave very poor results all around. I had to ensure that individual chunks had logical start/end positions, that tables weren't cut off, and so on.
I wouldn't fine-tune, that's too much cost/effort.
Yeah, that's about what I'd expected (and WoD books would be a priority for me to index). Another commentator mentioned that Knowledge Graphs might be useful for dealing with the limitations imposed by RAG (e.g., have to limit results because context window is relatively small), which might be worth looking into as well. That said, properly preparing this data for a KG, ontologies and all, might be too much work.
RAG is all you need*. This is a pretty DIY setup, but I use a private instance of Dify for this. I have a private Git repository where I commit my "knowledge", a Git hook syncs the changes with the Dify knowledge API, and then I use the Dify API/chat for querying.
*it would probably be better to add a knowledge graph as an extra step, which first tells the system where to search. RAG by itself is pretty bad at summarizing and combining many different docs due to the limited LLM context sizes, and I find that many questions require this global overview. A knowledge graph or other form of index/meta-layer probably solves that.
From a quick search, it seems like Knowledge Graphs are particularly new, even by AI standards, so it's harder to get one up off the ground if you haven't been following AI extremely closely. Is that accurate, or is it just the integration points with AI that are new?
First I would calculate the number of tokens you actually need. If it's less than 32k, there are plenty of ways to pull this off without RAG. If it's more (millions), you should understand RAG is an approximation technique and results may not be as high quality. If it's wayyyy more (billions), you might actually want to fine-tune.
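A quick way to get that number (tiktoken's cl100k_base encoding is only an approximation for whatever model you end up running, and the path is a placeholder):

```python
import tiktoken

# Approximate token count of the extracted book text
enc = tiktoken.get_encoding("cl100k_base")
with open("players_handbook.txt", encoding="utf-8") as f:
    print(len(enc.encode(f.read())))
```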
Fine-tuning is almost certainly the wrong way to go about this. It's not a good way of adding small amounts of new knowledge to a model because the existing knowledge tends to overwhelm anything you attempt to add in the fine-tuning steps.
Look into different RAG and tool usage mechanisms instead. You might even be able to get good results from dumping large amounts of information into a long context model like Gemini Flash.
No fine-tuning is necessary. You can use something reasonably good at RAG that's small enough to run locally like the Command-R model run by Ollama and a small embedding model like Nomic. There are dozens of simple interfaces that will let you import files to create a RAG knowledgebase to interact with as you describe, AnythingLLM is a popular one. Just point it at your locally-running LLM or tell them to download one using the interface. Behind the scenes they store everything in LanceDB or similar and perform the searching for you when you submit a prompt in the simple chat interface.
Very easy to do with Milvus and LangChain. I built a private Slack bot that takes PDFs, chunks them into Milvus using PyMuPDF, then uses LangChain for recall. It's surprisingly good for what you describe and took maybe 2 hours to build and run locally.
I did some prototyping with txtai for the RAG used in aider’s interactive help feature [0]. This lets users ask aider questions about using aider, customizing settings, troubleshooting, using LLMs, etc.
I really liked the simplicity of txtai. But it seems to require Java as a dependency! Aider is an end user cli tool, and ultimately I couldn’t take on the support burden of asking my users to install Java.
txtai doesn't require Java. It has a text extraction component which can optionally use Apache Tika. Apache Tika is a Java library. Tika can also be spun up as a Docker image much like someone can spin up Ollama for LLM inference.
Looking at your use case, it appears you wanted to parse and index HTML? If so, the only dependency should have been BeautifulSoup4.
Alternatively, one can use another library such as unstructured.io or PyMuPDF for Word/PDF. Those are not issue-free though. For example, unstructured requires LibreOffice for Word documents and Poppler for PDFs. PyMuPDF is AGPL, which is a non-starter for many. Apache Tika is Apache 2.0, mature and has robust production-quality support for a lot of formats.
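For reference, the extraction step itself is short either way; a rough sketch with txtai's Textractor (the file path is a placeholder, and PDFs will route through Tika or another installed backend):

```python
from txtai.pipeline import Textractor

# Extract a PDF and split it into paragraph-level chunks
textractor = Textractor(paragraphs=True)
paragraphs = textractor("players_handbook.pdf")  # placeholder path
print(paragraphs[:3])
```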
Thanks for the reply. I really did like the txtai approach.
I am working with markdown files. I think that required me to use Tika & Java based on this note in your docs [0]?
Note: BeautifulSoup4 only supports HTML documents, anything else requires Tika and Java to be installed.
Tika did a great job of chunking the markdown into sections with appropriate parent header context, if I remember correctly.
I just couldn't ask my users to manually install such complex dependencies. I worried about the support burden I would incur, due to the types of issues they would encounter.
I understand. Interestingly enough, the textractor pipeline actually outputs Markdown, as I've found it to be a format most LLMs work well with.
I know you've already found a solution but for the record, the markdown files could have been directly read in and then passed to a segmentation pipeline. That way you wouldn't need any of the deps of the textractor pipeline.
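Something like this is the shape of that (the directory name is a placeholder):

```python
from pathlib import Path
from txtai.pipeline import Segmentation

# Read markdown files directly and split them into paragraph chunks,
# skipping the textractor pipeline and its extraction dependencies
segment = Segmentation(paragraphs=True)

chunks = []
for path in Path("docs").glob("**/*.md"):  # placeholder directory
    chunks.extend(segment(path.read_text(encoding="utf-8")))
```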
I’ve been building a RAG mini app with txtai these past few weeks and it’s been pretty smooth. I’m between this and llamaindex as the backend for a larger app I want to build for a small-to-midsize customer.
With the (potentially) obvious bias towards your own framework, are there situations in which you would not recommend it for a particular application?
I recently wrote an article (https://medium.com/neuml/vector-search-rag-landscape-a-revie...) comparing txtai with other popular frameworks. I was expecting to find some really interesting and innovative things in the others. But from my perspective I was underwhelmed.
I'm a big fan of simplicity and none of them are following that strategy. Agentic workflows seem like a big fancy term but I don't see the value currently. Things are hard enough as it is.
If your team is already using another framework, I'm sure anything can work. Some of the other projects are VC-backed with larger teams. In some cases, that may be important.
"Interested in an easy and secure way to run hosted txtai applications? Then join the txtai.cloud preview to learn more."
I wish the author all the best and this seems to be a very sane and minimalist approach when compared to all the other enterprise-backed frameworks and libraries in this space. I might even become a customer!
However, has someone started an open source library that's fully driven by a community? I'm thinking of something like Airflow or Git. I'm not saying that the "purist" model is the best or enterprise-backed frameworks are evil. I'm just not seeing this type of project in this space.
I agree that the comparison between langchain/llamaindex is probably the better one.
With that being said, txtai has a much more in-depth approach to how it builds its data stores vs just assuming the underlying systems will handle everything. It supports running SQL statements and integrates the components in a way other RAG systems don't. It was also a vector store before it had a RAG workflow. There are years of code behind that part.
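For example, with content storage enabled, a similarity clause and plain SQL filters can be combined in one query against whatever metadata you index (the rows below are toy examples with made-up metadata):

```python
from txtai import Embeddings

embeddings = Embeddings(content=True)
embeddings.index([
    {"text": "Sentinel. When you hit a creature with an opportunity attack...", "source": "Player's Handbook", "page": 169},
    {"text": "Crusader's Mantle. Holy power radiates from you...", "source": "Player's Handbook", "page": 230},
])

# Vector similarity combined with a SQL filter on indexed metadata
results = embeddings.search(
    "select text, source, page, score from txtai "
    "where similar('spell that buffs nearby allies') and page >= 200"
)
```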
This looks interesting. I've been wanting to build some tools to help feed text documents into Stable Diffusion and this looks like it could be helpful. Are there any other libs people are aware of that they'd recommend in this space?
txtai gets things done quickly, but one problem is that the code base is not properly typed (in contrast to Haystack, which has a somewhat higher learning curve but is more properly typed).
Would be nice if this project is properly type annotated.
We certainly could add typing to the main API calls. Typing isn't a huge thing to me as a developer, so I've never really made it a priority. The only place there is typing is in the FastAPI hooks given it's required.
It's frustrating when developers of ML projects don't state even the most basic requirements. Do I need an Nvidia 4090 or a cluster of H100s to run this?
The embedding models at the heart of txtai can be small enough to run on Intel CPUs from ten years ago. It's extremely frustrating when HN commenters don't do even the most basic research into the product they are critiquing.
It’s frustrating when people ask for hardware requirements without stating what they are trying to do, do you have 100,000,000 books to index or do you have 5 articles? What are the context lengths you need? What about latency?
How can someone tell you what hardware you need when you give literally no information about what you’re trying to do?
Link to source (Apache 2.0): https://github.com/neuml/txtai