
I've been playing around with sentence embeddings to search documents, but I wonder how useful they are as a natural language interface for a database. The way one might phrase a question might be very different, content-wise, from how the document describes the answer. Maybe it's possible to do some kind of transform where the question is transformed into a possible answer and then turned into an embedding, but I haven't found much info on that yet.

Another idea I've had is to "overfit" a generative model like GPT on a dataset, but pay more attention to how URLs and the like are tokenised.
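
For reference, here's a quick way to inspect how a URL gets split into tokens, assuming OpenAI's tiktoken package (the encoding name and example URL are just illustrative choices; boundaries depend on the vocabulary):

    # Sketch: inspect how a URL fragments under a BPE tokenizer.
    # Assumes the tiktoken package; token boundaries vary by vocabulary.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    url = "https://example.com/docs/getting-started?ref=homepage"
    tokens = enc.encode(url)
    print(len(tokens), [enc.decode([t]) for t in tokens])
    # URLs often break into many short, low-frequency pieces, which is
    # worth knowing before fine-tuning a generative model on URL-heavy data.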




> Maybe it's possible to do some kind of transform where the question is transformed into a possible answer and then turned into an embedding, but I haven't found much info on that yet.

Here you go: https://twitter.com/theseamouse/status/1614453236349693953


"The way one might phrase a question might be very different content wise from how the document describes the answer."

There are late-interaction models, which replace the single dot product with a few transformer layers and can learn more complex query-document semantics.

Of course this would adversely affect latency and embedding size, so you might want to compress and cache the answers, hence (shameless plug):

https://aclanthology.org/2022.acl-long.457/
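
For intuition, one concrete late-interaction scorer is the MaxSim operator from ColBERT: keep one embedding per token and sum, over query tokens, the best match against any document token. A toy sketch with random stand-in vectors rather than a trained model (the linked paper's exact interaction may differ):

    # Toy ColBERT-style late-interaction (MaxSim) scoring with random vectors.
    import numpy as np

    def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
        # Sum over query tokens of their max cosine similarity to any doc token.
        q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
        d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
        sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
        return float(sim.max(axis=1).sum())  # best doc match per query token, summed

    rng = np.random.default_rng(0)
    query = rng.normal(size=(5, 128))    # 5 query tokens, 128-dim each
    doc = rng.normal(size=(40, 128))     # 40 document tokens
    print(maxsim_score(query, doc))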


Embeddings can be trained specifically so that questions and the passages containing their answers have similar representations in latent space. This has been used to create QA retrieval systems. Here's one commonly used example:

https://huggingface.co/sentence-transformers/multi-qa-MiniLM...
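
A minimal usage sketch with the sentence-transformers library; the full model ID is assumed to be sentence-transformers/multi-qa-MiniLM-L6-cos-v1 (the link above is truncated), and the query and passages are made up:

    # QA retrieval sketch with a question/answer-tuned embedding model.
    # Assumes sentence-transformers is installed and that the model ID below
    # is the checkpoint the (truncated) link points at.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

    query = "How do I reset my password?"
    passages = [
        "Account credentials can be changed from the settings page.",
        "Our office is open Monday through Friday.",
    ]

    query_emb = model.encode(query, convert_to_tensor=True)
    passage_embs = model.encode(passages, convert_to_tensor=True)

    scores = util.cos_sim(query_emb, passage_embs)[0]
    best = int(scores.argmax())
    print(passages[best], float(scores[best]))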


In your first paragraph, you are describing Hypothetical Document Embeddings (HyDE) [0]. I've tested it out, and in certain cases it works amazingly well for getting more complete answers.

[0] https://python.langchain.com/en/latest/modules/chains/index_...
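
For the gist without LangChain's wrapper: generate a hypothetical answer with an LLM, then embed and search with that instead of the raw question. A rough sketch where generate_answer and embed are placeholders for whatever LLM and embedding calls you already use (not any specific library's API):

    # HyDE sketch: embed a hypothetical answer rather than the question itself.
    # generate_answer() and embed() are placeholders, not a real library's API.
    def hyde_query_embedding(question: str, generate_answer, embed):
        # 1. Ask the LLM for a plausible (possibly wrong) answer passage.
        hypothetical = generate_answer(
            f"Write a short passage that answers the question:\n{question}"
        )
        # 2. Embed the hypothetical passage; it tends to land closer to real
        #    answer passages in embedding space than the bare question does.
        return embed(hypothetical)

    # Retrieval then proceeds as usual: compare this embedding against your
    # document embeddings and take the top hits.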


> might phrase a question might be very different, content-wise, from how the document describes the answer

That's what hypothetical embeddings solve: https://summarity.com/hyde

There are also encoding schemes for question-answer retrieval (e.g. ColBERT).


> The way one might phrase a question might be very different, content-wise, from how the document describes the answer.

If the embeddings are worth their salt, then they should not be influenced by paraphrasing with different words. Try the OpenAI embeddings or sbert.net embedding models.
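
If you want to sanity-check that on your own data, here's a sketch with the current OpenAI embeddings client (the model name text-embedding-3-small and the example sentences are just illustrative; requires OPENAI_API_KEY to be set):

    # Check that a question and a differently-worded answer embed close together.
    # Assumes the openai>=1.0 client; the model name is an illustrative choice.
    from openai import OpenAI
    import numpy as np

    client = OpenAI()

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data])

    q, p = embed([
        "How can I make my laptop battery last longer?",
        "Reducing screen brightness and closing background apps extends battery life.",
    ])
    # Cosine similarity: should be high despite little word overlap.
    print(float(q @ p / (np.linalg.norm(q) * np.linalg.norm(p))))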


Is there an example of what you're talking about?

Also, would you just return a list of likely candidates, loop over the result set to see if any info is relevant to the question, and then have the final pass try to answer the question?
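
Sketching that flow concretely (embed, vector_search and ask_llm below are placeholders for whatever embedding model, vector store and LLM you use; this is the generic retrieve-then-read pattern, not a specific library):

    # Schematic retrieve-then-read loop.
    def answer_question(question: str, embed, vector_search, ask_llm, k: int = 5):
        # 1. Retrieve a short list of likely-relevant candidate passages.
        candidates = vector_search(embed(question), top_k=k)
        # 2. Final pass: give only those passages to the model and ask for an answer.
        context = "\n\n".join(candidates)
        prompt = (
            "Answer the question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return ask_llm(prompt)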


How many embeddings can fit into a single input?


Embeddings are really good at that; you don't need to use similar words at all.



