Reading the readme makes me think it's only searching the top 4 most likely docs via the embeddings, not the wiki itself at any point? Or am I misunderstanding how this works? With embeddings being close to just term-vector matching via a dot(?) product?
So basically: get all the sub-phrases/sounds -> vector -> check the vector db for the closest matching documents -> send to GPT for summarization and answering the question.
If that's true, wouldn't that have severe limitations with scattered information? I guess it would still help you get answers and walk the data better than the "I don't even know the term" problem with Google?
Yep, that's the way it's currently implemented in langchain.
The 4 is a hyperparameter you can change, though, so you could set it to 10 as well.
The way it works is that it first looks up the N documents most relevant to the question (N being 4 by default) in the FAISS store, using the distance between embedding vectors for that lookup.
Then it uses GPT-3 to summarize each of those entries with respect to the question, and finally all the summaries together with the question produce the answer.
This way, you can trace which sources the answer came from and point to those URLs at the end.
When you make N larger, it just gets more expensive in terms of your API costs.
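Roughly, the flow looks like this (a sketch using langchain primitives; the store path, model settings, and the exact chain used here are illustrative rather than exactly what the repo ships):

    from langchain.embeddings import OpenAIEmbeddings
    from langchain.llms import OpenAI
    from langchain.vectorstores import FAISS
    from langchain.chains.qa_with_sources import load_qa_with_sources_chain

    embeddings = OpenAIEmbeddings()
    store = FAISS.load_local("faiss_store", embeddings)  # index built beforehand from the site's pages

    question = "How do I configure the widget?"
    # Step 1: nearest-neighbor lookup of the N most relevant chunks (N = 4 by default).
    docs = store.similarity_search(question, k=4)

    # Step 2: map-reduce QA - summarize each chunk with respect to the question,
    # then combine the summaries (plus their source URLs) into the final answer.
    chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="map_reduce")
    result = chain({"input_documents": docs, "question": question})
    print(result["output_text"])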
Looks interesting! Have you considered a proper vector database like Qdrant (https://qdrant.tech)? FAISS runs on a single machine, but if you want to scale things up, then a real database makes it a lot easier. And with a free 1GB cluster on Qdrant Cloud (https://cloud.qdrant.io), you can store quite a lot of vectors. Qdrant is also already integrated with Langchain.
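With langchain's Qdrant wrapper, the swap would look roughly like this (the URL, API key, and collection name are placeholders, and the exact kwargs depend on your langchain version):

    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import Qdrant

    embeddings = OpenAIEmbeddings()
    texts = ["chunk one ...", "chunk two ..."]
    metadatas = [{"source": "https://example.com/page1"}, {"source": "https://example.com/page2"}]

    # Builds (or fills) a collection on a Qdrant cluster instead of a local FAISS index.
    store = Qdrant.from_texts(
        texts,
        embeddings,
        metadatas=metadatas,
        url="https://YOUR-CLUSTER.qdrant.io",  # or host="localhost" for a local instance
        api_key="YOUR_API_KEY",
        collection_name="site_docs",
    )
    docs = store.similarity_search("How do I configure the widget?", k=4)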
Using something like Weaviate, which can be started in Docker with a one-liner, gives you the ability to move away from or toward dense vectors by concept. While doing the dot product with manual code is fairly easy (see below), letting Weaviate do the lifting (for the embeddings as well) makes things super simple.
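For comparison, the manual version really is just a few lines of numpy doing brute-force dot products over normalized vectors (fine for small corpora, no index structure at all):

    import numpy as np

    def top_k_by_dot_product(query_vec, doc_vecs, k=4):
        """Return the indices of the k document vectors most similar to the query."""
        query = query_vec / np.linalg.norm(query_vec)
        docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        scores = docs @ query  # cosine similarity once everything is normalized
        return np.argsort(scores)[::-1][:k]

    # doc_vecs: (num_chunks, embedding_dim) array of precomputed embeddings
    # query_vec: embedding of the user's question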
They probably took this approach because it's the only thing you can do with the OpenAI APIs (for now). Training on your own corpus will be the way to go once it's possible.
Nice to have tools like this to wrap up these features; it definitely makes these kinds of solutions more accessible, thanks!
It would be nice to know from your experience whether there is a rule of thumb for estimating the cost of fine-tuning and running a solution like this against a docs site?
I tried to do something tangentially similar recently: telling ChatGPT that I'd ask it a question, but rather than a direct response, I wanted search terms for Wikipedia and Wikidata that would lead to pages containing the answer. The thinking was that I could then provide those pages back to it and get it to synthesize that data, producing answers with decent citations.
Perhaps it was the example I chose ("flight time from New York to London"), but I couldn't really get it to provide sensible search terms for the information it wanted or needed.
Thanks for sharing the code. What happens when existing content gets updated and new content is created? Would it need to create embeddings for all the content again? That wouldn't be great, since creating embeddings costs money. Please see https://github.com/mpaepper/content-chatbot/blob/main/create.... Would it be possible to progressively update the vector store?
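Something along these lines is what I had in mind, purely a sketch assuming the langchain FAISS wrapper is used for persistence (load the existing index, embed only the new chunks, append, save):

    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import FAISS

    embeddings = OpenAIEmbeddings()
    store = FAISS.load_local("faiss_store", embeddings)  # existing index from the first crawl

    new_chunks = ["text of a newly added or changed page ..."]
    new_metadata = [{"source": "https://example.com/new-page"}]
    store.add_texts(new_chunks, metadatas=new_metadata)  # only these chunks get embedded (and billed)

    store.save_local("faiss_store")

Handling updated pages (not just new ones) would also mean removing the stale vectors, which is where a proper vector database would make life easier.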
For anyone interested in an audio version that talks to you, that you can get on your site today, my brother put this together a few weeks ago!
https://siteguide.ai/
It would be cool to use this with internal data, then let clients chat with a bot fine-tuned on their data that can also run queries, pull reports for specific dates, or generate charts, all via tools.
Yes, this will be an interesting next experiment - adding agents with additional tools (for example, access to internal APIs) will be quite powerful.
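As a rough sketch of that direction (the tool functions here are hypothetical stubs you'd wire up to the QA chain and your internal API):

    from langchain.agents import AgentType, Tool, initialize_agent
    from langchain.llms import OpenAI

    def answer_from_docs(question: str) -> str:
        # call the existing embeddings + QA chain here
        return "stub: answer from the docs index"

    def get_report(date: str) -> str:
        # hypothetical wrapper around an internal reporting API
        return f"stub: report for {date}"

    tools = [
        Tool(name="DocsQA", func=answer_from_docs,
             description="Answers questions about the documentation."),
        Tool(name="Reports", func=get_report,
             description="Fetches the internal report for a given date (YYYY-MM-DD)."),
    ]

    agent = initialize_agent(tools, OpenAI(temperature=0),
                             agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
    agent.run("Summarize the report for 2023-03-01 and point me to the relevant docs.")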
Should be fine: as it iterates over the content, it creates embeddings and stores them in the FAISS store (https://github.com/facebookresearch/faiss), which was built to handle large amounts of embeddings.
For the actual queries, it narrows things down to the most relevant documents, i.e. those closest in embedding space, so this should work.
Currently, it only splits documents linearly, so if you have information that is written backwards or spread far apart, it will likely not work so well.
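Concretely, the splitting is just consecutive fixed-size chunks, something like this (chunk sizes are illustrative):

    from langchain.text_splitter import CharacterTextSplitter

    page_text = "full text of one crawled page ..."
    splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_text(page_text)

    # Each chunk is embedded and retrieved independently, so an answer that needs
    # several far-apart chunks of the same page can get missed.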