Hacker News | fzliu's comments

I work at Voyage -- thanks for the kind words!

To add a bit for readers who may be unaware, https://www.continue.dev/ uses embeddings as a strategy for indexing codebases. When used together with keyword search, I find that it performs quite well, especially when used together with the reranking functionality.
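To make the "embeddings together with keyword search" idea concrete, here is a toy sketch of hybrid retrieval using reciprocal rank fusion (RRF), one common way to combine a vector-similarity ranking with a keyword ranking. The file names and rankings are made up for illustration; this is not Continue's actual implementation.

```python
# Toy sketch of hybrid retrieval: fuse an embedding-based ranking with a
# keyword (e.g. BM25) ranking via reciprocal rank fusion (RRF).
# Document IDs and orderings here are invented for illustration.

def rrf_fuse(rankings, k=60):
    """Combine multiple ranked lists of doc IDs into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

embedding_ranking = ["utils.py", "search.py", "index.py"]   # by vector similarity
keyword_ranking   = ["search.py", "query.py", "utils.py"]   # by keyword match

fused = rrf_fuse([embedding_ranking, keyword_ranking])
print(fused[0])  # documents that score well in both rankings rise to the top
```

A reranker would then typically rescore the top of the fused list against the query for a final ordering.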


Hello Frank, thanks a lot for the awesome work. I like Continue a lot, and they have an amazing team. I keep checking their repo once in a while to see when they'll release their AI agent.


I don't understand why the editors allowed the engineers' names to be made public. What did they hope to gain by doing this other than making them magnets for harassment and possibly threats?


The identities of the engineers are now unambiguously in the public interest, as they now have an impact on the government. These aren't scrappy hackers trolling on internet forums.


They're public servants. Most public servants have their name, position, level, salary, etc. listed in public datasets. There is no "doxxing" of public servants. This isn't Usenet.


In normal circumstances, e.g. if they were contractors, I would agree. However, as far as I can tell, they are operating as federal employees. No federal employee's name is private.

I do see this article as mostly retaliation, though. Whether it's deserved is hard to tell at this point. But these people are public employees nevertheless; they never had the option of anonymity.


The American people deserve to know what is happening in their government.



Isn't it free speech? Yes, that comment's style is against the guidelines of this forum, but you've got to admire the irony.


Although voyage-3-m-exp is at the top of the leaderboard, I would not use it for production use cases unless your data is extremely similar to "in-domain" MTEB data. Check out this post from Nils (one of the authors of the original MTEB paper) for more info: https://x.com/Nils_Reimers/status/1870812625505849849

voyage-3-large will work better for almost all real-world production use cases.


Which model would you choose?


For float and int8, 1024 does indeed outperform 2048. However, for binary, 2048 outperforms 1024.

For our general-purpose embedding model (voyage-3-large), embedding vectors with 2048 dimensions outperform 1024 across the board: https://blog.voyageai.com/2025/01/07/voyage-3-large/
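To illustrate the storage side of the float-vs-binary comparison above, here is a toy sketch of Matryoshka-style truncation and binary quantization. The random vectors are stand-ins and say nothing about voyage-3-large itself; the point is only the mechanics and the per-vector storage math.

```python
import numpy as np

# Toy illustration of the tradeoff discussed above: truncating a
# Matryoshka-style embedding to fewer dimensions vs. binarizing it.
# Vectors here are random stand-ins, not real model outputs.

rng = np.random.default_rng(0)
full = rng.standard_normal(2048)

def truncate_and_normalize(v, dim):
    """Keep the first `dim` components and re-normalize (Matryoshka truncation)."""
    t = v[:dim]
    return t / np.linalg.norm(t)

def binarize(v):
    """1 bit per dimension: keep only the sign of each component."""
    return np.where(v >= 0, 1.0, -1.0)

v1024 = truncate_and_normalize(full, 1024)   # float32, 1024 dims: 4096 bytes
b2048 = binarize(full)                       # binary, 2048 dims: 256 bytes
print(b2048.size // 8, v1024.size * 4)       # 16x storage difference per vector
```

So binary at 2048 dimensions is still far cheaper to store than float32 at 1024, which is part of why the comparison across quantization levels is worth making.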


Do you have any insights into this? Perhaps scaling degradation with Matryoshka learning?


We have a repurposed version of SWE-bench Lite for retrieval; you can see the numbers in the last column of this spreadsheet: https://docs.google.com/spreadsheets/d/1Q5GDXOXueHuBT9demPrL...


> As with all LLM models and their subproducts, the only way to ensure good results is to test yourself, ideally with less subjective, real-world feedback metrics.

This is excellent advice. Sadly, very few people/organizations implement their own evaluation suites.

It doesn't make much sense to put data infrastructure in production without first evaluating its performance (IOPS, uptime, scalability, etc.) on internal workloads; it is no different for embedding models or models in general for that matter.
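An internal evaluation suite for retrieval can be quite small to start. Here is a minimal sketch computing recall@k over labeled (query, relevant document) pairs; the queries, document IDs, and rankings are hypothetical, and in practice the rankings would come from your embedding model of choice run over your own workload.

```python
# Minimal sketch of an internal retrieval eval: recall@k over labeled
# (query, relevant-doc) pairs. The ranked results and gold labels below
# are hypothetical stand-ins for real retrieval output.

def recall_at_k(ranked_results, relevant, k=3):
    """Fraction of queries whose relevant doc appears in the top k results."""
    hits = sum(1 for q, docs in ranked_results.items()
               if relevant[q] in docs[:k])
    return hits / len(ranked_results)

ranked = {
    "how do I reset my password": ["doc_auth", "doc_faq", "doc_billing"],
    "cancel my subscription":     ["doc_faq", "doc_billing", "doc_auth"],
}
gold = {
    "how do I reset my password": "doc_auth",
    "cancel my subscription":     "doc_billing",
}
print(recall_at_k(ranked, gold, k=1))  # 0.5: only the first query hits at k=1
```

Swapping models then reduces to re-running retrieval and comparing one number on data you actually care about.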


While the raw similarity score does matter, what typically matters more is the score relative to other documents. In the case of the examples in the notebook, those values were the highest in relative terms.
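A tiny sketch of the point about relative scores, with illustrative numbers (not from any Voyage model): the same query can produce very different absolute similarity ranges against different corpora, so a fixed threshold is brittle while the relative ordering remains informative.

```python
# Sketch of why relative scores matter more than raw ones. Scores are
# illustrative only, not output from any real embedding model.

def top_match(scores):
    """Return the document with the highest similarity to the query."""
    return max(scores, key=scores.get)

corpus_a = {"doc1": 0.82, "doc2": 0.79, "doc3": 0.65}  # high absolute range
corpus_b = {"docX": 0.41, "docY": 0.28, "docZ": 0.22}  # low absolute range

# A fixed cutoff like 0.5 would discard everything in corpus_b, even
# though docX is clearly the best candidate in relative terms.
print(top_match(corpus_a), top_match(corpus_b))
```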

I can see why this may be unclear/confusing -- we will correct it. Thank you for the feedback!


Because LLMs such as Gemini -- and other causal language models more broadly -- are trained on next token prediction, the vectors that you get from pooling the output token embeddings aren't that useful for RAG or semantic search compared to what you get from actual embedding models.

One distinction to make here is that token embeddings and the embeddings/vectors output by embedding models are related but separate concepts. There are numerous token embeddings (one per token) which become contextualized as they propagate through the transformer, while an embedding model outputs a single vector per input (such as a long text, a photo, or a document screenshot).
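The shapes make this distinction concrete. Below is a toy numpy sketch (dimensions and values invented) of mean pooling, one naive way to collapse per-token vectors into a single vector, which, as noted above, tends to underperform a dedicated embedding model for retrieval.

```python
import numpy as np

# Toy illustration: a transformer yields one contextualized vector per
# token, while an embedding model returns one vector per input. Mean
# pooling is a naive bridge between the two. Shapes/values are made up.

seq_len, hidden_dim = 5, 8                    # 5 tokens, 8-dim hidden states
rng = np.random.default_rng(42)
token_embeddings = rng.standard_normal((seq_len, hidden_dim))

# Mean pooling: average the per-token vectors into one vector per input.
pooled = token_embeddings.mean(axis=0)
print(token_embeddings.shape, pooled.shape)   # (5, 8) -> (8,)
```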



Glad you liked the results! We do have multilingual models (and rerankers) -- voyage-3, in particular, is multilingual: https://blog.voyageai.com/2024/09/18/voyage-3/

voyage-multimodal-3 is multilingual as well, supporting the same set of languages as voyage-3.


Sorry for spreading false information. I edited the post above.

It is interesting that you're not as up front about multilingualism compared to Cohere. They seem to mention it a lot, which led to my confusion.


No worries at all. That's great feedback and an area of improvement for us when it comes to future posts -- we'll be more explicit about multilingualism in blogs and in our docs.

