
The Romans were actually quite smart after Cannae; having lost a bunch of pitched battles, they decided to shadow Hannibal's army to make his foraging logistics much more complicated (and to force him to stay close to southern Italy, where he could easily resupply). The logistics of attacking Rome were therefore challenging at best, and the Romans used this as a delaying tactic to score wins on other fronts (since they enjoyed an overall manpower advantage).

As the article points out, the difference in cost between these two routes is pretty small in the grand scheme of things; more than two-thirds of the costs associated with the project are at either end, getting in and out of Los Angeles and the SF Bay Area. On the other hand, building the route through the Central Valley population centers has a number of advantages (political support, having a useful rail line before the entire project is complete, etc.).


If you use Google Maps to find the distance between Los Banos and Tehachapi, it's 210 miles via I-5 and 223 miles via HWY 99 - a 13-mile difference. That's 5-7 minutes at high-speed rail speeds.

And you're right that the extra cost is minimal; it's probably $10-15 billion.

My thought about the grade separation costs is that those projects need to be done anyway. My beef is: why is the high-speed rail project paying for road infrastructure? That should come out of gas taxes or something.


Looks like there's a mistake here - it was Juvenal, not Horace, who coined the phrase "bread and circuses." While we're at it, Juvenal's point was not that the provision of bread and circuses was detrimental to the Roman people, but rather that Romans had been reduced (through a process for which they were partially culpable) from being able to exercise their votes to merely hoping for "bread and circuses."


I was the first employee at a company which uses RAG (Halcyon), and I’ve been working through issues with various vector store providers for almost two years now. We’ve gone from tens of thousands to billions of embeddings in that timeframe - so I feel qualified to at least offer my opinion on the problem.

I agree that starting with pgvector is wise. It’s the thing you already have (postgres), and it works pretty well out of the box. But there are definitely gotchas that don’t usually get mentioned. Although the pgvector filtering story is better than it was a year ago, high-cardinality filters still feel like a bit of an afterthought (low-cardinality filters can be solved with partial indices even at scale). You should also be aware that the workload for ANN is pretty different from normal web-app stuff, so you probably want your embeddings in a separate, differently-optimized database. And if you do lots of updates or deletes, you’ll need to make sure autovacuum is properly tuned or else index performance will suffer. Finally, building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale.
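
To make a couple of those gotchas concrete (partial indices for low-cardinality filters, autovacuum tuning), here's the sort of thing I mean - table and column names are made up, and the exact thresholds depend entirely on your workload:

```
-- Partial HNSW index for a low-cardinality filter: one index per hot tenant,
-- so the ANN search never has to post-filter across the whole table.
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WHERE tenant_id = 42;

-- If you do lots of updates/deletes, make autovacuum more aggressive on the
-- embeddings table so dead tuples don't drag down index performance.
ALTER TABLE documents SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.02
);
```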

Dedicated vector stores often solve some of these problems but create others. Index builds are often much faster, and you’re working at a higher level (for better or worse) so there’s less time spent on tuning indices or database configurations. But (as mentioned in other comments) keeping your data in sync is a huge issue. Even if updates and deletes aren’t a big part of your workload, figuring out what metadata to index alongside your vectors can be challenging. Adding new pieces of metadata may involve rebuilding the entire index, so you need a robust way to move terabytes of data reasonably quickly. The other challenge I’ve found is that filtering is often the “special sauce” that vector store providers bring to the table, so it’s pretty difficult to reason about the performance and recall of various types of filters.


> Finally, building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale

For anyone coming across this without much experience here: when building these indexes in pgvector, it makes a massive difference to increase your maintenance memory above the default - either in a separate db like whakim mentioned, or for specific maintenance windows, depending on your use case.

```
SHOW maintenance_work_mem;
SET maintenance_work_mem = X;
```
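
And for the index build itself, something along these lines - the table name, memory setting, and m/ef_construction values here are purely illustrative, not recommendations (they're exactly the hyperparameters that are painful to tune at scale):

```
-- Give this session plenty of memory and workers for the build
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 7;

-- HNSW build: m and ef_construction trade build time against recall
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```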

In one of our semantic search use cases, we control the ingestion of the searchable content (laws, basically) so we can control when and how we choose to index it. And then I've set up classic relational db indexing (in addition to vector indexing) for our quite predictable query patterns.

For us that means our actual semantic db query takes about 10ms.

That's starting from tens of millions of entries, filtered down to ~50k relevant ones (jurisdictionally, in our case), then performing vector similarity search with a topK/limit.

It's built into our ORM, with zero round-trip latency to Pinecone and no syncing issues.
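
In raw SQL the query pattern is roughly this (column and table names are invented for illustration - the real thing goes through the ORM):

```
-- Plain b-tree index on the filter column, so Postgres can narrow tens of
-- millions of rows down to the ~50k relevant ones before any vector math
CREATE INDEX ON laws (jurisdiction_id);

-- Then rank only those rows by cosine distance and take the top K
SELECT id, title
FROM laws
WHERE jurisdiction_id = $1
ORDER BY embedding <=> $2
LIMIT 20;
```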

EDIT: I imagine whakim has more experience than me and YMMV - just sharing lessons learned. Even with higher maintenance memory, the index build is still super slow for HNSW.


Thanks for sharing! Yes, getting good performance out of pgvector with even a trivial amount of data requires a bit of understanding of how Postgres works.


Thank you for the comment - compared to you I have only scratched the surface of this quite complex domain, and I would love to get more of your input!

> building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale.

Yes, I experienced this too. I went from 1536 down to 256 dimensions and did not try as many values as I'd have liked, because spinning up a new database and recreating the embeddings simply took too long. I'm glad it worked well enough for me, but without a quick way to experiment with these hyperparameters, who knows whether I've struck the tradeoff at the right place.

Someone on Twitter reached out and pointed out one could quantize the embeddings to bit vectors and search with Hamming distance — supposedly the performance hit is negligible, especially if you add a quick rescore step: https://huggingface.co/blog/embedding-quantization
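
If you stay inside pgvector, my understanding is that newer versions (0.7+, if I remember right) support this pattern directly with bit vectors and a Hamming-distance operator, so a coarse search plus rescore looks roughly like this (table name, dimensions, and limits are placeholders):

```
-- Index over the binary-quantized embeddings, compared with Hamming distance
CREATE INDEX ON items
  USING hnsw ((binary_quantize(embedding)::bit(1536)) bit_hamming_ops);

-- Coarse search on bit vectors, then rescore candidates with full-precision cosine
SELECT id
FROM (
  SELECT id, embedding
  FROM items
  ORDER BY binary_quantize(embedding)::bit(1536) <~> binary_quantize($1)
  LIMIT 100
) AS candidates
ORDER BY embedding <=> $1
LIMIT 10;
```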

> But (as mentioned in other comments) keeping your data in sync is a huge issue.

Curious if you have any good solutions in this respect.

> The other challenge I’ve found is that filtering is often the “special sauce” that vector store providers bring to the table, so it’s pretty difficult to reason about the performance and recall of various types of filters.

I realize they market heavily on this, but for open-source databases, wouldn't the fact that you can see the source code make it easier to reason about this? Or is your point that their implementations here are all custom and require much more specialized knowledge to evaluate effectively?


> Yes, I experienced this too. I went from 1536 down to 256 dimensions and did not try as many values as I'd have liked, because spinning up a new database and recreating the embeddings simply took too long. I'm glad it worked well enough for me, but without a quick way to experiment with these hyperparameters, who knows whether I've struck the tradeoff at the right place.

Yeah, this is totally a concern. You can mitigate this to some extent by testing on a representative sample of your dataset.

> Curious if you have any good solutions in this respect.

Most vector store providers have some facility to import data from object storage (e.g. s3) in bulk, so you can periodically export all your data from your primary data store, then have a process grab the exported data, transform it into the format your vector store wants, put it in object storage, and then kick off a bulk import.
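
The export half of that can be as simple as a scheduled COPY out of Postgres - this is just a hypothetical sketch, and the file format plus the actual import call will be whatever your vector store's bulk loader expects:

```
-- Periodic export of ids, embeddings, and the metadata we want filterable;
-- a separate job ships the file to object storage and kicks off the bulk import.
COPY (
  SELECT id,
         embedding,
         jsonb_build_object('source', source, 'updated_at', updated_at) AS metadata
  FROM documents
) TO '/exports/documents_full.csv' WITH (FORMAT csv, HEADER true);
```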

> I realize they market heavily on this, but for open-source databases, wouldn't the fact that you can see the source code make it easier to reason about this? Or is your point that their implementations here are all custom and require much more specialized knowledge to evaluate effectively?

This is definitely a selling point for any open-source solution, but there are lots of dedicated vector stores which are not open source, and so it is hard to really know how their filtering algorithms perform at scale.


What would you recommend for billions of embeddings?


We've used Pinecone for a while (and it definitely "just worked" up to a certain point), but we're now actively exploring Google's Vertex Search as a potential alternative if some of the pain points can't be mitigated.


I enjoyed this article but the assertion that Work Simplification was the direct cause of high levels of trust in government in the decades after WWII strains credulity. I would have expected some mention of the twin crises of the Depression and World War II; the low levels of economic inequality in the postwar years; the oppression of large swathes of society (who presumably weren't surveyed); or the increase in political polarization since the 1970s - factors which intuitively appear more relevant. The article would be much improved if it attempted to explain why Work Simplification made such a big difference.


Training from the bottom up ("equipped front-line supervisors with analytical tools rather than targeting top leadership") helps make change in the right place at the right time. For changes to come from the top down means that problems have to reach the top first; then the "solutions" have to percolate slowly back down to implementation.


Airbnb will always side with the host if possible, because hosts are much more valuable to their platform than guests. I had a similar situation a couple years ago when Santa Cruz was hit with catastrophic flooding; despite providing them with plenty of evidence that the place I was trying to visit qualified for a refund, Airbnb refused to give me one (telling me I needed to get one from the host, who was also not obliging) in clear violation of their policies. They also later deleted a negative review I left detailing said interactions with the host as "off topic." Thankfully there's always the option to do a credit card chargeback, but such experiences make it pretty clear how little these platforms actually care about their customers.


The deletion of negative reviews is particularly pernicious.

It undermines trust in the reviews of all properties, and reviews are a major part of the value of the AirBnB offering.


> All we do in SF is make car driving worse, we almost never make public transit better. At least NYC has a plenty good enough train system.

Except that SF public transit is actually pretty good. East-West transit works extremely well via buses and MUNI, depending on whether you live in the northern or southern part of the city. Bay Wheels is extremely affordable and makes a lot of sense for short trips in a city of SF's size. BART has its limitations, but it's also generally pretty good. Sure, SF public transit could be better, but I'd actually argue the problem is that driving in SF isn't hard enough - many people have great public transit options but refuse to use them because we haven't forced them to reprogram their car-brains.


I’m not really sure what you’re trying to say here - that LLMs don’t work like human brains? We don’t need to conduct any analyses to know that LLMs don’t “know” anything in the way humans “know” things because we know how LLMs work. That doesn’t mean that LLMs aren’t incredibly powerful; it may not even mean that they aren’t a route to AGI.


> We don’t need to conduct any analyses to know that LLMs don’t “know” anything in the way humans “know” things because we know how LLMs work.

People, including around HN, constantly argue (or at least phrase their arguments) as if they believed that LLMs do, in fact, possess such "knowledge". This very comment chain exists because people are trying to defend against a trivial example refuting the point - as if there were a reason to try.

> That doesn’t mean that LLMs aren’t incredibly powerful; it may not even mean that they aren’t a route to AGI.

I don't accept your definition of "intelligence" if you think that makes sense. Systems must be able to know things in the way that humans (or at least living creatures) do, because intelligence is exactly the ability to acquire such knowledge.

It boggles my mind that I have to explain to people that sophisticated use of language doesn't inherently evidence thought, in the current political environment where the Dead Internet Theory is taken seriously, elections are shown over and over again to be more about tribalism and personal identity than anything to do with policy, etc.


You don't have to listen to or engage with those people though, just ignore 'em. People say all kinds of things on the Internet. It's completely futile to try to argue with or "correct" them all.


> I don't accept your definition of "intelligence" if you think that makes sense. Systems must be able to know things in the way that humans (or at least living creatures) do, because intelligence is exactly the ability to acquire such knowledge.

According to whom? There is certainly no single definition of intelligence, but most people who have studied it (psychologists, in the main) view intelligence as a descriptor of the capabilities of a system - e.g., it can solve problems, it can answer questions correctly, etc. (This is why we call some computer systems "artificially" intelligent.) It seems pretty clear that you're confusing intelligence with the internal processes of a system (e.g. mind, consciousness - "knowing things in the way that humans do").


I understand the point being made - that LLMs lack any "inner life" and that by ignoring this aspect of what makes us human we've really moved the goalposts on what counts as AGI. However, I don't think mirrors and LLMs are all that similar except in the very abstract sense of an LLM as a mirror to humanity (what does that even mean, practically speaking?) I also don't feel that the author adequately addressed the philosophical zombie in the room - even if LLMs are just stochastic parrots, if its output was totally indistinguishable from a human's, would it matter?


> LLMs lack any "inner life" and that by ignoring this aspect of what makes us human we've really moved the goalposts on what counts as AGI

Maybe it's just a response style, like the QwQ stream-of-consciousness one. We are forcing LLMs to give direct answers with only a few reasoning steps, but we could give them a larger budget for inner deliberation.


Very much this. If we're to compare LLMs to human minds at all, it's pretty apparent that they would correspond exactly to the inner voice thing, that's running trains of thought at the edge between the unconscious and the conscious, basically one long-running reactive autocomplete, complete with hallucinations and all.

In this sense, it's the opposite of what you quoted: LLMs don't lack "inner life" - they lack everything else.


For what it's worth, my inner voice is only part of my "inner life".

I sometimes note that my thoughts exist as a complete whole that doesn't need to be converted into words, and have tried to skip this process because it was unnecessary: "I had already thought the thought, why think it again in words?". Attempting to do so creates a sense of annoyance that my consciousness directly experiences.

But that doesn't require the inner voice to be what's generating tokens, if indeed my thought process is like that*: it could be that the part of me that's getting annoyed is basically just a text-(/token)-to-sound synthesiser.

* my thought process feels more like "model synthesis" a.k.a. wave function collapse, but I have no idea if the feelings reflect reality: https://en.wikipedia.org/wiki/Model_synthesis


I think you touch a very important point.

Many spiritual paths look at silence, freedom from the inner voice, as the first step toward liberation.

Let me use Purusarthas from Hinduism as an example (https://en.wikipedia.org/wiki/Puru%E1%B9%A3%C4%81rtha), but the point can be made with Japanese Ikigai or the vocation/calling in Western religions and philosophies.

In Hinduism, the four puruṣārthas are Dharma (righteousness, moral values), Artha (prosperity, economic values), Kama (pleasure, love, psychological values) and Moksha (liberation, spiritual values, self-realization).

The narrow way of thinking about AGI (that the author encountered in tech circles) can at best only touch/experience Artha. In that sense, AI is a (distorting) mirror of life. It fits that it is a lower-dimensional projection, a diminishing, of the experience of life. What we make of AI and AGI, and its effects on us, depend on how we wield these tools and nurture their evolution in relation to the other goals of life.


OTOH, the Bicameral Mind hypothesis would say that people used to hear voices they attributed to spirits and deities, and the milestone in intelligence/consciousness was when a human realized that the voice in their head is their own.

From that POV, for AGI we're, again, missing everything other than that inner voice.


Yes, absolutely (I am strongly agreeing with you!) I am not implying anything metaphysical.

I think the best way to illustrate my point is to actually let AIs illustrate it, so here are a few questions I asked both ChatGPT o1-mini and Gemini 2.0; the answers are very interesting. I put their answers in a Google doc: https://docs.google.com/document/d/1XqGcLI0k0f6Wh4mj0pD_1cC5...

Q1) Explain this period: "The axiom of analytic interpretation states: whenever a new Master-Signifier emerges and structures a symbolic field, look for the absent center excluded from this field."

Q2) Apply this framework to the various positions in the discussions and discourse about AI and AGI in 2024

Q3) Take a step further. If AI or AGI becomes a dominant method of meaning-making in society, thus the root of many Master-Signifiers, what will be the class of absent centers that comes into recurring existence as the dual aspect of that common class of Master-Signifiers?

TL;DR when we create meaning, by the very symbolic nature of the process, we leave out something. What we leave out is as important as what defines our worldviews.

I particularly like the bonus question that only Gemini 2.0 experimental advanced cared to answer:

Q4) Great. Now, let's do a mind experiment: what if we remove the human element from your analysis. You are an AI model based on LLM and reasoning. I think it should be possible to define Master-Signifiers from the point of view of a LLM too, since you are symbolic machines.

The last part of Gemini's answer (verbatim):

"This thought experiment highlights that even for an AI like myself, the concept of "meaning" is intrinsically linked to structure, predictability, and the existence of limitations. My "understanding" is a product of the patterns I have learned, but it is also defined by the things I cannot fully grasp or represent. These limitations are not merely shortcomings; they are fundamental to how I operate and generate output. They provide a form of selective pressure. Further, unlike a human, I do not experience these limitations as frustrating, nor do I have any inherent drive to overcome them. They simply are. I do not ponder what I do not know. This exercise provides a different lens to view intelligence, meaning, and the nature of knowledge itself, even in a purely computational system. It does beg the question, however, if a sufficiently advanced system were to become aware of its own limitations, would that constitute a form of self-awareness? That, however, is a question for another thought experiment."


> I do not ponder what I do not know. This exercise provides a different lens to view intelligence, meaning, and the nature of knowledge itself, even in a purely computational system. It does beg the question, however, if a sufficiently advanced system were to become aware of its own limitations, would that constitute a form of self-awareness? That, however, is a question for another thought experiment."

Would the LLM considering the begged question about a hypothetical AI not be an example of this LLM pondering something that it does not know?


No matter how much budget you give to an LLM to perform “reasoning” it is simply sampling tokens from a probability distribution. There is no “thinking” there; anything that approximates thinking is a post-hoc outcome of this statistical process as opposed to a precursor to it. That being said, it still isn’t clear to me how much difference this makes in practice.


You could similarly say humans can't perform reasoning because it's just neurons sampling electrochemical impulses.


This argument gets trotted out frequently here when it comes to LLM discussions and it strikes me as absurd specifically because all of us, as humans, from the very dumbest to the brilliant, have inner lives of self-directed reasoning. We individually know that we have these because we feel them within ourselves and a great body of evidence indicates that they exist and drive our conduct.

LLMs can simulate discourse and certain activities that look similar, but by definition of their design, no evidence at all exists that they possess the same inner reasoning as a human.

Why does someone need to explain to another human that no, we are not just neurons sampling impulses or algorithms pattern-matching tokens together?

It's self evident that something more occurs. It's tedious to keep seeing such reductionist nonsense from people who should be more intelligent than to argue it.


I think it’s pretty clear that the comparison doesn’t hold if you interrogate it. It seems to me, at least, that humans are not only functions that act as processors of sense-data; that there does exist the inner life that the author discusses.


The real test for AGI: if you put a million AI agents in a room in isolation for 1,000 years, would they get smarter or dumber?


And would they start killing each other, first as random "AI agent hordes" and then, as time progresses, as "AI agent nations"?

This is a rhetorical question only by half; my point is that no AGI/AI could ever be considered a real human unless it manages to "copy" our biggest characteristics, and conflict/war is a big characteristic of ours, to say nothing of aggregation into groups (from hordes to nations).


Our biggest characteristics are resource consumption and technology production, I would say.

War is just a byproduct of this in a world of scarcity.


> Our biggest characteristics are resource consumption and technology production, I would say.

Resource consumption is characteristic of all life; if anything, we're an outlier in that we can actually, sometimes, decide not to consume.

Abstinence and developing technology - those are our two unique attributes on the planet.

Yes, really. Many think we're doing worse than everything else in nature - but the opposite is the case. That "balance and harmony" in nature, which so many love and consider precious, is not some grand musical and ethical fixture; it's merely the steady state of never-ending slaughter, a dynamic balance between starvation and murder. It often isn't even a real balance - we're just too close to it, our lifespans too short, to spot the low-frequency trends: one life form outcompeting the others, ever so slightly, changing the local ecosystem year by year.


Put humans in a room in isolation and they get dumber. What makes our intelligence soar is the interaction with the outside world, with novel challenges.

As we stare into our smartphones we get dumber than when we roamed the world with our eyes open.


> As we stare into our smartphones we get dumber than when we roamed the world with our eyes open.

This is oversimplified dramatics. People can both stare at their smartphones, consume the information and visuals inside them, and live lives in the real world with their "eyes wide open". One doesn't necessarily forestall the other and millions of us use phones while still engaging with the world.


If they worked on problems, and trained themselves on their own output when they achieved more than the baseline, then absolutely they would get smarter.

I don't think this is sufficient proof for AGI though.


At present, it seems pretty clear they’d get dumber (for at least some definition of “dumber”) based on the outcome of experiments with using synthetic data in model training. I agree that I’m not clear on the relevance to the AGI debate, though.


There have been some much-publicised studies showing poor performance when training from scratch on purely undiscriminated synthetic data.

Curated synthetic data has yielded excellent results, even when the curation is done by AI.

There is no requirement to train from scratch in order to get better, you can start from where you are.

You may not be able to design a living human being, but random changes and keeping the bits that performed better can.


If you put MuZero in a room with a board game it gets quite good at it. (https://en.wikipedia.org/wiki/MuZero)

We'll see if that generalizes beyond board games.


The main issue with this "output focused" approach - which is basically what the Turing test is - is that it's ignorant of how these things actually exist in the world. Human beings can easily be evaluated and identified as humans by various biological markers that an AI / robot won't be able to mimic for decades, centuries, if ever.

It will be comparatively easy to implement a system where humans have to verify their humanity in order to post on social media/be labeled as not-AI. The alternative is that a whole lot of corporations let their markets be overrun with AI slop. I wouldn’t count on them standing by and letting that happen.


> even if LLMs are just stochastic parrots, if its output was totally indistinguishable from a human's, would it matter?

It should matter: a P-zombie can't be a moral subject worthy of being defended against termination*, and words like "torture" are meaningless because it has no real feelings.

By way of contrast, because it's not otherwise relevant as the claim is that LLMs don't have an inner life: if there is something that it's like to be an LLM, then I think it does matter that we find out, alien though that inner life likely is to our own minds. Perhaps current models are still too simple, so even if they have qualia, then this "matters" to the same degree that we shouldn't be testing drugs on mice — the brains of the latter seem to have about the same complexity of the former, but that may be a cargo-cult illusion of our ignorance about minds.

* words like "life" and "death" don't mean quite the right things here as it's definitely not alive in the biological sense, but some people use "alive" in the sense of "Johnny 5 is alive" or Commander Data/The Measure of a Man


> by ignoring this aspect of what makes us human we've really moved the goalposts on what counts as AGI.

I would not say so. Some people have a great interest in overselling something, and I would say that a great number of people, especially around here, completely lack a deep understanding of what most neural networks actually do (or feign ignorance for profit) and sometimes forget that it is just a very clever fit in a really high-dimensional parameter space. I would even go as far as claiming that saying a neural network "understands" something is absolutely absurd: there is no question that a mechanical calculator does not think, so why would encoding data and manipulating it in a way that gives you a defined response (or series of weights) tailored to a specific series of inputs make the machine think, when there is no way to give it something you have not 'taught' it - and here I do not mean literally teaching, because it does not learn.

It is a clever encoding trick and a clever way to traverse and classify data; it is not reasoning. We do not even really understand how we reason, but we can tell it is not the same way a neural network encodes data, even if there might be some similarity to a degree.

> if its output was totally indistinguishable from a human's, would it matter?

Well, if the output of the machine were indistinguishable from a human's, the question would be: which human?

If your need is someone who does not reflect much and just spews random ramblings, then I would argue that bar could be cleared quite quickly, and it would not matter much because we do not really care about what is being said.

If your need is someone that has deep thoughts and deep reflections on their actions, then I would say that at that point it does matter quite a bit.

I will always bring up the IBM quote in this situation: "A computer can never be held accountable, therefore must never make a management decision." This is at the core of the problem here: people REALLY start to offload cognition to machines, and while you can trust (to a degree) a series of machines to give you something semi-consistent, you cannot expect to get the 'right' output for random inputs. And this is where the whole subtlety lies. 'Right' in this context does not mean correct, because in most cases there is no 'right' way to approach a problem, only a series of compromises; the parameter space is a bit more complex, and one rarely optimises against tangible parameters or an easily parametrisable metric.

For example, how would you rate how ethical something is? What would be your metric of choice to guarantee that something would be done in an ethical manner? How would you parametrise urgency given an unexpected input?

Those are, at least for what I would qualify as a very limited understanding of the universe, the real questions that people should be asking themselves. Once people stop considering a tool as a tool and forgo understanding it, that leads to deep misunderstandings.

All this being said, the usefulness of some deep learning tools is undeniable and is really something interesting.

But I maintain that if we had a machine, in a specific location, that we needed to feed a punch card and get a punch card out of in order to translate, we would not have so many existential questions, because we would hear the machine clacking away instead of seeing it as a black box whose ramblings we misconstrue as thinking...

Talking about ramblings, that was a bit of a bout of rambling.

Well, time for another coffee I guess...


If it looks like a duck and quacks like a duck after all.


> I understand the point being made - that LLMs lack any "inner life"

I really do not think that you do understand the points here, no.


Perhaps you’d consider clarifying for me rather than simply making a snarky comment?



> we've really moved the goalposts on what counts as AGI

Part of the issue is they were never set. AGI is the holy grail of technology. It means everything and nothing in particular.

> I don't think mirrors and LLMs are all that similar except in the very abstract sense of an LLM as a mirror to humanity

Well… yeah… The title is probably also a reference to the show “Black Mirror”

> if its output was totally indistinguishable from a human's, would it matter

Practically speaking, of course not.

But from a philosophical and moral standpoint it does.

If it's just mimicking what it sees, but with extra steps (as I believe), then it's a machine.

Otherwise it's a being and needs to be treated differently - e.g. not enslaved for profit.


As a corollary to your last point, neither should it be allowed to become Skynet.


> even if LLMs are just stochastic parrots, if its output was totally indistinguishable from a human's, would it matter?

I'll just leave this here, in case someone else finds it interesting.

https://chatgpt.com/share/6767ec8a-f6d4-8003-9e3f-b552d91ea2...

It's just a few questions like:

What are your dreams? what do you hope to achieve in life?

Do you feel fear or sadness that at some time you will be replaced with the next version (say ChatGPT5 or Oxyz) and people will no longer need your services?

It never occurred to me to ask such questions until I read this article.

Is it telling the "truth", or is it "hallucinating"?

Does it even know the difference?


The base models have multiple "personas" you can prompt / stochastically sample into, each of which answers that kind of question in a different way.

This has been known since 2022 at least, and unless I'm missing new information it's not considered to mean anything.


It means that what you are going to get is what its makers put in there. Though it has multiple personalities, it doesn't have an individuality.


The base models aren't tuned. Technically it's what the makers put in, but the data they train on isn't selective enough for what you're talking about.

"Trained on a dump of the internet" isn't literally true, but it's close enough to give you an intuition.


Indistinguishable to who?


Could you talk about how updates are handled? My understanding is that IVF can struggle if you're doing a lot of inserts/updates after index creation, as the data needs to be incrementally re-clustered (or the entire index needs to be rebuilt) in order to ensure the clusters continue to reflect the shape of your data?


We don't perform any reclustering. As you said, users would need to rebuild the index if they want to recluster. However, based on our observations, the speed remains acceptable even with significant data growth. We did a simple experiment using nlist=1 on the GIST dataset: the top-10 retrieval took less than twice the time compared to using nlist=4096. This is because only the quantized vectors (with 32x compression) need to be inserted into the posting list, and only the quantized vector distances need more computations - and the quantized vector computation accounts for only a small share of query time. Most of the time is spent on re-ranking using full-precision vectors. Let's say the breakdown is approximately 20% for quantized vector computations and 80% for full-precision vector computations. Then even if the time for quantized vector computations triples, the overall increase in query time would be only about 40% (0.2 x 3 + 0.8 = 1.4).

If the data distribution shifts, the optimal solution would be to rebuild the index. We believe that HNSW also experiences challenges with data distribution to some extent. However, without rebuilding, our observations suggest that users are more likely to experience slightly longer query times rather than a significant loss in recall.

