Reading the readme makes me think it's only searching the top 4 most likely docs via the embeddings, not the wiki itself at any point? Or am I misunderstanding how this works? With embeddings being close to just term-vector matching via a dot(?) product?
So basically: get all the sub-phrases/sounds -> vector -> check the vector db for the closest matching documents -> send to GPT for summarization and answering the question.
If that's true, wouldn't that have severe limitations with scattered information? I guess it would still help you get answers and walk the data better than the "I don't even know the term" problem with Google?
Yep, that's the way it's currently implemented in langchain.
The 4 is a hyperparameter you can change, though, so you could set it to 10 as well.
The way it works is that it first looks up the N documents most relevant to the question (N being 4 by default) in the FAISS store, using the distance between embedding vectors for that lookup.
Then it uses GPT-3 to summarize each of those entries with respect to the question, and finally all the summaries together with the question produce the answer.
This way, you can trace which sources the answer came from and point to those URLs at the end.
When you make N larger, it just gets more expensive in terms of your API costs.
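Roughly, the flow looks like this (a sketch using langchain primitives; the store path, model settings, and the exact chain used here are illustrative rather than exactly what the repo ships):

    from langchain.embeddings import OpenAIEmbeddings
    from langchain.llms import OpenAI
    from langchain.vectorstores import FAISS
    from langchain.chains.qa_with_sources import load_qa_with_sources_chain

    embeddings = OpenAIEmbeddings()
    store = FAISS.load_local("faiss_store", embeddings)  # index built beforehand from the site's pages

    question = "How do I configure the widget?"
    # Step 1: nearest-neighbor lookup of the N most relevant chunks (N = 4 by default).
    docs = store.similarity_search(question, k=4)

    # Step 2: map-reduce QA - summarize each chunk with respect to the question,
    # then combine the summaries (plus their source URLs) into the final answer.
    chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="map_reduce")
    result = chain({"input_documents": docs, "question": question})
    print(result["output_text"])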
Looks interesting! Have you considered a proper vector database like Qdrant (https://qdrant.tech)? FAISS runs on a single machine, but if you want to scale things up, then a real database makes it a lot easier. And with a free 1GB cluster on Qdrant Cloud (https://cloud.qdrant.io), you can store quite a lot of vectors. Qdrant is also already integrated with Langchain.
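With langchain's Qdrant wrapper, the swap would look roughly like this (the URL, API key, and collection name are placeholders, and the exact kwargs depend on your langchain version):

    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import Qdrant

    embeddings = OpenAIEmbeddings()
    texts = ["chunk one ...", "chunk two ..."]
    metadatas = [{"source": "https://example.com/page1"}, {"source": "https://example.com/page2"}]

    # Builds (or fills) a collection on a Qdrant cluster instead of a local FAISS index.
    store = Qdrant.from_texts(
        texts,
        embeddings,
        metadatas=metadatas,
        url="https://YOUR-CLUSTER.qdrant.io",  # or host="localhost" for a local instance
        api_key="YOUR_API_KEY",
        collection_name="site_docs",
    )
    docs = store.similarity_search("How do I configure the widget?", k=4)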
Using something like Weaviate, which can be started in Docker with a one-liner, gives you the ability to move away from or toward dense vectors by concept. While doing the dot product with manual code is fairly easy (see below), letting Weaviate do the lifting (for the embeddings as well) makes things super simple.
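For comparison, the manual version really is just a few lines of numpy doing brute-force dot products over normalized vectors (fine for small corpora, no index structure at all):

    import numpy as np

    def top_k_by_dot_product(query_vec, doc_vecs, k=4):
        """Return the indices of the k document vectors most similar to the query."""
        query = query_vec / np.linalg.norm(query_vec)
        docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        scores = docs @ query  # cosine similarity once everything is normalized
        return np.argsort(scores)[::-1][:k]

    # doc_vecs: (num_chunks, embedding_dim) array of precomputed embeddings
    # query_vec: embedding of the user's question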
They probably took this approach because it's the only thing you can do with the OpenAI APIs (for now). Training on your own corpus will be the way to go once it's possible.
Nice to have tools like this to wrap up these features; it definitely makes these kinds of solutions more accessible, thanks!
It would be nice to know from your experience whether there is a rule of thumb for estimating the cost of fine-tuning and running a solution like this against a docs site?
I tried to do something tangentially similar recently: telling ChatGPT that I'd ask it a question, but rather than a direct response, I wanted search terms for Wikipedia and Wikidata that would lead to pages containing the answer. The thinking was that I could then provide those pages back to it and get it to synthesize that data, producing answers with decent citations.
Perhaps it was the example I chose ("flight time from New York to London"), but I couldn't really get it to provide sensible search terms for the information it wanted or needed.
Thanks for sharing the code. What happens when existing content gets updated and new content is created? Would it need to create embeddings for all the content again? That wouldn't be great, since creating embeddings costs money. Please see https://github.com/mpaepper/content-chatbot/blob/main/create.... Would it be possible to progressively update the vector store?
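Something along these lines is what I had in mind, purely a sketch assuming the langchain FAISS wrapper is used for persistence (load the existing index, embed only the new chunks, append, save):

    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import FAISS

    embeddings = OpenAIEmbeddings()
    store = FAISS.load_local("faiss_store", embeddings)  # existing index from the first crawl

    new_chunks = ["text of a newly added or changed page ..."]
    new_metadata = [{"source": "https://example.com/new-page"}]
    store.add_texts(new_chunks, metadatas=new_metadata)  # only these chunks get embedded (and billed)

    store.save_local("faiss_store")

Handling updated pages (not just new ones) would also mean removing the stale vectors, which is where a proper vector database would make life easier.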
For anyone interested in an audio version that talks to you, that you can get on your site today, my brother put this together a few weeks ago!
https://siteguide.ai/
It would be cool to use this with internal data, then let clients chat with a bot fine-tuned on their data that can also run queries, pull reports for specific dates, or generate charts, all via tools.
Yes, this will be an interesting next experiment - adding agents with additional tools (for example, access to internal APIs) will be quite powerful.
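As a rough sketch of that direction (the tool functions here are hypothetical stubs you'd wire up to the QA chain and your internal API):

    from langchain.agents import AgentType, Tool, initialize_agent
    from langchain.llms import OpenAI

    def answer_from_docs(question: str) -> str:
        # call the existing embeddings + QA chain here
        return "stub: answer from the docs index"

    def get_report(date: str) -> str:
        # hypothetical wrapper around an internal reporting API
        return f"stub: report for {date}"

    tools = [
        Tool(name="DocsQA", func=answer_from_docs,
             description="Answers questions about the documentation."),
        Tool(name="Reports", func=get_report,
             description="Fetches the internal report for a given date (YYYY-MM-DD)."),
    ]

    agent = initialize_agent(tools, OpenAI(temperature=0),
                             agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
    agent.run("Summarize the report for 2023-03-01 and point me to the relevant docs.")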
Should be fine: as it iterates over the content, it creates embeddings and stores them in the FAISS store (https://github.com/facebookresearch/faiss), which was built to handle large amounts of embeddings.
For the actual queries, it narrows things down to the most relevant documents, i.e. those closest in embedding space, so this should work.
Currently, it only splits documents linearly, so if you have information that is written backwards or spread far apart, it will likely not work so well.
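Concretely, the splitting is just consecutive fixed-size chunks, something like this (chunk sizes are illustrative):

    from langchain.text_splitter import CharacterTextSplitter

    page_text = "full text of one crawled page ..."
    splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_text(page_text)

    # Each chunk is embedded and retrieved independently, so an answer that needs
    # several far-apart chunks of the same page can get missed.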