I've been playing around with sentence embeddings to search documents, but I wonder how useful they are as a natural language interface for a database. The way one phrases a question can be very different, content-wise, from how the document phrases the answer. It might be possible to do some kind of transform where the question is turned into a possible answer and then into an embedding, but I haven't found much info on that yet.
Another idea I've had is to "overfit" a generative model like GPT on a dataset, but pay more attention to how URLs and the like are tokenised.
> It might be possible to do some kind of transform where the question is turned into a possible answer and then into an embedding, but I haven't found much info on that yet.
Embeddings can be trained specifically so that questions and the content containing their answers have similar representations in latent space. This has been used to create QA retrieval systems. Here's one commonly used example:
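The multi-qa models on sbert.net were trained on (question, answer) pairs, so a question embeds close to passages that answer it. A minimal sketch, untested, assuming the sentence-transformers library; any QA-tuned model should work the same way:

```python
# Retrieval with an embedding model trained on question/answer
# pairs, so questions and answer passages land near each other
# in latent space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

docs = [
    "Paris is the capital and most populous city of France.",
    "The mitochondrion is the powerhouse of the cell.",
]
question = "What city is the capital of France?"

doc_embs = model.encode(docs)           # one vector per passage
q_emb = model.encode(question)          # vector for the question

scores = util.cos_sim(q_emb, doc_embs)  # shape (1, len(docs))
best = scores.argmax().item()
print(docs[best], scores[0][best].item())
```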
In your first paragraph, you are describing Hypothetical Document Embeddings (HyDE) [0]. I've tested it out, and in certain cases, it works amazingly well to get more complete answers.
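The core trick is simple: have an LLM write a hypothetical answer passage and embed that instead of the raw question. A rough, untested sketch assuming the OpenAI Python client; the model names and prompt are just placeholders:

```python
# HyDE sketch: instead of embedding the question directly, ask an
# LLM to write a hypothetical answer passage, embed that, and use
# it to search the corpus.
from openai import OpenAI
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def hyde_query_vector(question: str) -> np.ndarray:
    # Step 1: hallucinate a plausible answer passage.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short passage that answers: {question}",
        }],
    )
    hypothetical_doc = completion.choices[0].message.content
    # Step 2: embed the hypothetical document, not the question.
    return embed(hypothetical_doc)

def search(question: str, doc_texts: list[str], doc_vecs: np.ndarray) -> str:
    # doc_vecs: precomputed embedding matrix, one row per document.
    q = hyde_query_vector(question)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return doc_texts[int(np.argmax(sims))]
```

[0] https://arxiv.org/abs/2212.10496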
> The way one phrases a question can be very different, content-wise, from how the document phrases the answer.
If the embeddings are worth their salt, they should be robust to paraphrasing with different words. Try the OpenAI embeddings or the sbert.net embedding models.
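You can sanity-check this in a few lines: two rewordings of the same question should score far higher than an unrelated sentence. Untested sketch; any sbert.net model will do:

```python
# Quick check that an embedding model is robust to paraphrasing.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

a = "How do I reset my account password?"
b = "What are the steps to change the password on my account?"
c = "The weather in Berlin is mild in September."

emb = model.encode([a, b, c])
print(util.cos_sim(emb[0], emb[1]).item())  # paraphrase: expect a high score
print(util.cos_sim(emb[0], emb[2]).item())  # unrelated: expect a low score
```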
Also, would you just return a list of likely candidates, loop over the result set to see if any of the info is relevant to the question, and then have the final pass try to answer it? Roughly:
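In pseudo-Python, with retrieve, is_relevant, and answer as hypothetical helpers standing in for the embedding search and LLM calls:

```python
# Retrieve top-k candidates, filter them for relevance,
# then answer from the survivors.
def answer_question(question: str, index, k: int = 10) -> str:
    candidates = retrieve(index, question, top_k=k)       # embedding search
    relevant = [c for c in candidates
                if is_relevant(question, c)]              # cheap relevance check per candidate
    return answer(question, context=relevant)             # final answering pass
```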