Hacker News | fzliu's comments

I work at Voyage -- thanks for the kind words!

To add a bit for readers who may be unaware, https://www.continue.dev/ uses embeddings as a strategy for indexing codebases. When used together with keyword search, I find that it performs quite well, especially when used together with the reranking functionality.
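To make the "embeddings together with keyword search" idea concrete, here is a toy sketch of hybrid retrieval using reciprocal rank fusion (RRF), one common way to combine a vector-similarity ranking with a keyword ranking. The file names and rankings are made up for illustration; this is not Continue's actual implementation.

```python
# Toy sketch of hybrid retrieval: fuse an embedding-based ranking with a
# keyword (e.g. BM25) ranking via reciprocal rank fusion (RRF).
# Document IDs and orderings here are invented for illustration.

def rrf_fuse(rankings, k=60):
    """Combine multiple ranked lists of doc IDs into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

embedding_ranking = ["utils.py", "search.py", "index.py"]   # by vector similarity
keyword_ranking   = ["search.py", "query.py", "utils.py"]   # by keyword match

fused = rrf_fuse([embedding_ranking, keyword_ranking])
print(fused[0])  # documents that score well in both rankings rise to the top
```

A reranker would then typically rescore the top of the fused list against the query for a final ordering.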


Hello Frank, thanks a lot for the awesome work. I like Continue a lot, and they have an amazing team. I keep checking their repo once in a while to see when they'll release their AI agent.


I don't understand why the editors allowed the engineers' names to be made public. What did they hope to gain by doing this other than making them magnets for harassment and possibly threats?


The identities of the engineers are now unambiguously in the public interest, as they now have an impact on the government. These aren't scrappy hackers trolling on internet forums.


They're public servants. Most public servants have their name, position, level, salary, etc. listed in public datasets. There is no "doxxing" of public servants. This isn't Usenet.


In normal circumstances, e.g. if they were contractors, I would agree. However, as far as I can tell, they are operating as federal employees. No federal employee's name is private.

I do see this article as mostly retaliation, though. Whether it's deserved is hard to tell at this point. But these people are public employees nevertheless; they never had the option of anonymity.


The American people deserve to know what is happening in their government.



Isn't it free speech? Yes, that comment's style is against the guidelines of this forum, but you've got to admire the irony.


Although voyage-3-m-exp is at the top of the leaderboard, I would not use it for production use cases unless your data is extremely similar to "in-domain" MTEB data. Check out this post from Nils (one of the authors of the original MTEB paper) for more info: https://x.com/Nils_Reimers/status/1870812625505849849

voyage-3-large will work better for almost all real-world production use cases.


Which model would you choose?


For float and int8, 1024 does indeed outperform 2048. However, for binary, 2048 outperforms 1024.

For our general-purpose embedding model (voyage-3-large), embedding vectors with 2048 dimensions outperform 1024 across the board: https://blog.voyageai.com/2025/01/07/voyage-3-large/
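To illustrate the storage side of the float-vs-binary comparison above, here is a toy sketch of Matryoshka-style truncation and binary quantization. The random vectors are stand-ins and say nothing about voyage-3-large itself; the point is only the mechanics and the per-vector storage math.

```python
import numpy as np

# Toy illustration of the tradeoff discussed above: truncating a
# Matryoshka-style embedding to fewer dimensions vs. binarizing it.
# Vectors here are random stand-ins, not real model outputs.

rng = np.random.default_rng(0)
full = rng.standard_normal(2048)

def truncate_and_normalize(v, dim):
    """Keep the first `dim` components and re-normalize (Matryoshka truncation)."""
    t = v[:dim]
    return t / np.linalg.norm(t)

def binarize(v):
    """1 bit per dimension: keep only the sign of each component."""
    return np.where(v >= 0, 1.0, -1.0)

v1024 = truncate_and_normalize(full, 1024)   # float32, 1024 dims: 4096 bytes
b2048 = binarize(full)                       # binary, 2048 dims: 256 bytes
print(b2048.size // 8, v1024.size * 4)       # 16x storage difference per vector
```

So binary at 2048 dimensions is still far cheaper to store than float32 at 1024, which is part of why the comparison across quantization levels is worth making.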


Do you have any insights into this? Perhaps scaling degradation with Matryoshka learning?


We have a repurposed version of SWE-bench Lite for retrieval; you can see the numbers in the last column of this spreadsheet: https://docs.google.com/spreadsheets/d/1Q5GDXOXueHuBT9demPrL...


> As with all LLM models and their subproducts, the only way to ensure good results is to test yourself, ideally with less subjective, real-world feedback metrics.

This is excellent advice. Sadly, very few people/organizations implement their own evaluation suites.

It doesn't make much sense to put data infrastructure in production without first evaluating its performance (IOPS, uptime, scalability, etc.) on internal workloads; it is no different for embedding models or models in general for that matter.
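An internal evaluation suite for retrieval can be quite small to start. Here is a minimal sketch computing recall@k over labeled (query, relevant document) pairs; the queries, document IDs, and rankings are hypothetical, and in practice the rankings would come from your embedding model of choice run over your own workload.

```python
# Minimal sketch of an internal retrieval eval: recall@k over labeled
# (query, relevant-doc) pairs. The ranked results and gold labels below
# are hypothetical stand-ins for real retrieval output.

def recall_at_k(ranked_results, relevant, k=3):
    """Fraction of queries whose relevant doc appears in the top k results."""
    hits = sum(1 for q, docs in ranked_results.items()
               if relevant[q] in docs[:k])
    return hits / len(ranked_results)

ranked = {
    "how do I reset my password": ["doc_auth", "doc_faq", "doc_billing"],
    "cancel my subscription":     ["doc_faq", "doc_billing", "doc_auth"],
}
gold = {
    "how do I reset my password": "doc_auth",
    "cancel my subscription":     "doc_billing",
}
print(recall_at_k(ranked, gold, k=1))  # 0.5: only the first query hits at k=1
```

Swapping models then reduces to re-running retrieval and comparing one number on data you actually care about.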


While the raw similarity score does matter, what typically matters more is the score relative to other documents. In the case of the examples in the notebook, those values were the highest in relative terms.
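A tiny sketch of the point about relative scores, with illustrative numbers (not from any Voyage model): the same query can produce very different absolute similarity ranges against different corpora, so a fixed threshold is brittle while the relative ordering remains informative.

```python
# Sketch of why relative scores matter more than raw ones. Scores are
# illustrative only, not output from any real embedding model.

def top_match(scores):
    """Return the document with the highest similarity to the query."""
    return max(scores, key=scores.get)

corpus_a = {"doc1": 0.82, "doc2": 0.79, "doc3": 0.65}  # high absolute range
corpus_b = {"docX": 0.41, "docY": 0.28, "docZ": 0.22}  # low absolute range

# A fixed cutoff like 0.5 would discard everything in corpus_b, even
# though docX is clearly the best candidate in relative terms.
print(top_match(corpus_a), top_match(corpus_b))
```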

I can see why this may be unclear/confusing -- we will correct it. Thank you for the feedback!


Because LLMs such as Gemini -- and other causal language models more broadly -- are trained on next token prediction, the vectors that you get from pooling the output token embeddings aren't that useful for RAG or semantic search compared to what you get from actual embedding models.

One distinction to make here is that token embeddings and the embeddings/vectors output by embedding models are related but separate concepts. There are numerous token embeddings (one per token) which become contextualized as they propagate through the transformer, while an embedding model outputs a single vector per input (such as a long text, a photo, or a document screenshot).
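The shapes make this distinction concrete. Below is a toy numpy sketch (dimensions and values invented) of mean pooling, one naive way to collapse per-token vectors into a single vector, which, as noted above, tends to underperform a dedicated embedding model for retrieval.

```python
import numpy as np

# Toy illustration: a transformer yields one contextualized vector per
# token, while an embedding model returns one vector per input. Mean
# pooling is a naive bridge between the two. Shapes/values are made up.

seq_len, hidden_dim = 5, 8                    # 5 tokens, 8-dim hidden states
rng = np.random.default_rng(42)
token_embeddings = rng.standard_normal((seq_len, hidden_dim))

# Mean pooling: average the per-token vectors into one vector per input.
pooled = token_embeddings.mean(axis=0)
print(token_embeddings.shape, pooled.shape)   # (5, 8) -> (8,)
```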



Glad you liked the results! We do have multilingual models (and rerankers) -- voyage-3, in particular, is multilingual: https://blog.voyageai.com/2024/09/18/voyage-3/

voyage-multimodal-3 is multilingual as well, supporting the same set of languages as voyage-3.


Sorry for spreading false information. I edited the post above.

It is interesting that you're not as up front about multilingualism compared to Cohere. They seem to mention it a lot, which led to my confusion.


No worries at all. That's great feedback and an area of improvement for us when it comes to future posts -- we'll be more explicit about multilingualism in blogs and in our docs.

