Think the massive increase in demand is due to mass inference of open source LLMs? Or is the transformer architecture driving mass inference of other models too?
You might be thinking too far into this. The biggest customers are bulk buyers that are either training on a private cluster (eg. Meta or OpenAI), or selling their rack-space to other businesses. These are the people that are paying money and increasing the demand for GPU hardware; what happens to the businesses they provide for almost doesn't even matter as long as they pay for the compute. The "driver" for this demand is the hype. If people were laser-focused on the best-value solution, then everyone would pay for OpenAI's compute since it's cheaper than GPU hosting.
The real root of the problem is that GPU compute is not a competitive market. The demand is less for GPUs and more for Nvidia hardware, because nobody else is shipping CUDA or CUDA-equivalent software. Thus the demand is raised artificially beyond whatever is reasonable, since buyers aren't shopping in a competitive market. Basically the same story as what happened to Nvidia's hardware during the crypto mining rush.
> then everyone would pay for OpenAI's compute since it's cheaper than GPU hosting.
This is absolutely not true. The gap is narrowing as providers (Google, Anthropic, Deepseek) introduce cross request KV caching but it’s definitely not true for OAI (yet).
“Own” is likely a very simplistic term for the complex agreement in place between SpaceX and NASA. I’m sure SpaceX retains all of the developed IP, but NASA can “do whatever they want” with the actual ship.
I’m a Googler who doesn’t do anything remotely close to setting capital return policy. Just remember Alphabet has been buying back 10s of billions for a while. Dividends are just a different capital return mechanism.
Could you please expand on this? If a person got issued RSUs but they have not vested yet, wouldn't they be happy that the RSUs are going to be worth more by the time they get vested?
In a vacuum where precise numbers do not exist, maybe.
In this real scenario, if someone's GOOG shares were vesting at an earlier value (let's take a rough YTD average for GOOG of $145), they will have lost out on a year's worth of dividends at $0.20 per share. However, the current share price is $175.
So, through this maneuver, a person holding N GOOG shares will lose at most 3 quarters of dividends: N * $0.20 * 3 = N * $0.60
But they will have gained whatever the stock has appreciated, which at this moment in time works out to: N * $(175 - 145) = N * $30
What am I missing which would make the scenario above result in OP's claim of "dividends devalue issued unvested RSUs"?
EDIT: This also fails to take into account "Dividend Equivalents (DEs)", which are not factored above, and would yield extra income to the person that owns unvested shares.
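A quick sanity check of those numbers in code; the share count is hypothetical, and taxes and DEs are ignored:

```python
# Rough comparison for N unvested GOOG shares (illustrative figures from above)
N = 100                                  # hypothetical unvested share count
dividends_missed = N * 0.20 * 3          # at most 3 quarterly dividends of $0.20
appreciation = N * (175 - 145)           # rough YTD price move per share

print(dividends_missed)  # 60.0
print(appreciation)      # 3000
```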
Ah, but part of the reason for CUDA's success is that the open source developer who wants to run unit tests or profile their kernel can pick up a $200 card. That PhD student with a $2000 budget can pick up a card. Academic lab with $20,000 for a beefy server, or tiny cluster? nvidia will take their money.
And that's all fixed capital expenditure - there's no risk a code bug or typo by an inexperienced student will lead to a huge bill.
Also, if you're looking for an alternative to CUDA because you dislike vendor lock-in, switching to something only available in GCP would be an absurd choice.
I'm really shocked at how dependent companies have become on cloud offerings. Want a GPU? Those are expensive, let's just rent on Amazon and then complain about operational costs!
I've noticed this at companies. Yeah, the cloud is expensive, but you have a data center, and a few servers with RTX 3090s aren't expensive. A lot of research workloads can run on simple, cheap hardware.
Probably not many. However, 4090s would be a different situation. There are plenty of guides on running LLMs, stable diffusion, etc. on local hardware.
The H100s would be for businesses looking to get into this space.
Let's compare. I've interviewed for and been offered both high-paying tech jobs and investment banking jobs. My interviews for tech really required me to have a deep understanding of the technology and how to solve problems with it. They tested my critical thinking skills. For banking, sure, there were some technical questions, but a lot of it came down to what school I went to, how I acted, who I knew. I suppose either one could be a "nightmare" depending on who you are. I do think the tech interview process was much better at hiring generally competent people. There's also consulting, which I have not directly interviewed for, but the interview style is more similar to tech with more of a focus on mental math; those interviews can be very difficult if you don't have that skill.
My wife was diagnosed in Nov '23. Stage 4. She had a mutation that allowed Tagrisso/osimertinib. There are many mutations, but this one is prevalent in Asian women who have never smoked.
There was a trial published just after her diagnosis that showed progression free survival at 2 years of 57% (chemo+osimertinib) vs 41% (osimertinib).
She's 6 days into the chemo. There are 4 rounds of pemetrexed+carboplatin and then maintenance carboplatin (forever). Every 21 days.
So far, it's been rough, but hopefully her body will adjust, and once she's just on carboplatin her quality of life will be better.
She had few side effects from the osimertinib, but the main one was mouth sores. They're gone now, thankfully.
Another thing people need to know about this: this diagnosis often comes with regular thoracentesis (removal of fluid from the chest). In her case, the fluid is in the lining of her lung, not the lung itself.
Advice to anyone in this situation: make all future appointments for thoracentesis, because you don't want to go into the ER/ED to get it done. And, we've stopped allowing residents to practice on her (once it was quite painful).
EDIT:
Forgot to add: carboplatin does not cross the blood-brain barrier, but the cancer cells do. The end state for this cancer, even with the above treatments, is usually tumors in the brain.
Hey, best of luck to you both. I've been there with close family members and cancer. It's no fun at all. Easily the hardest thing I've ever had to do in my life was just to care and watch.
Make sure to grab family and friends to help out. They want to help out, most of the time, at least. I know it's hard to ask, but make sure to do it anyways.
And yeah, standard generic advice to take care of yourself too. I hated that advice, as there just was never any time to do so. One thing that helped was to schedule that 'me' time in and make others aware of it. If you put it on the schedule, you'll have a better chance of taking it.
Two tips with MDs:
1) Carry and use a notebook in all MD interactions. Just a simple journal is all. Time, medications given, dosages, MD administering, etc. It's a good backup to have, sure, but you'll likely never use it. The real power is when MDs see you writing things down. They take you more seriously and they take themselves more seriously. I think they think that you'll sue now and have some proof. But who knows.
2) If you're in for a long term treatment (1+ nights) put a big bowl of candy outside the door. Nurses etc. will stop by more often to check up on things and generally seem to like you more. Consider putting cigarettes in there too, depending on if your nurses smoke. Then they will really like you and go above and beyond.
Sure, you can generally refuse any medical care in the US. With respect to residents, it can be as simple as asking and picking the right doctor for your procedure. Their credentials are generally public.
It feels so good to read a comment like this. Thank you for sharing and I’m really happy for you and your family (and modern medicine that makes it all possible).
I'm not completely read in on all the details, but she told me prior to this drug her diagnosis was a "death sentence." The lung cancer had metastasized throughout her bones. The indication was sudden lower body paralysis caused by a spinal tumor. She had spinal fusion surgery as well. I understand that because it's stage 4 she will never be in "remission," but it has regressed significantly.
My wife had stage 4 melanoma. Prior to the newish immunotherapies about 5 years ago, it was a death sentence — 6-9 months life expectancy from diagnosis. Now, it’s 60-70% 5 year survival rate. Unfortunately my wife wasn’t one of them.
In general, these types of cancers spew mets and spread quickly. Many are resistant to chemo, go to the brain (chemo cannot help there) and only respond to highly focused radiation like SRS or proton beam.
Immunotherapies essentially suppress checkpoints that cancer cells use to avoid immune response and/or cause your body to target specific checkpoints. I can’t read FT.com, but I believe it’s talking about a targeted therapy that allows your body to control the cancer.
There's a lot of research happening around things like immunotherapy combined with custom versions of Moderna and other vaccines that will significantly improve treatment and save people from going through what my wife went through. It's a good time.
Sorry to hear about your wife. Yes, once those melanomas go deep enough in the skin, they spread everywhere like a plague.
I know there are even researchers creating AI-assisted treatment regimens that match specific mutations, other aspects of people's DNA, and all the immunotherapy drugs, in order to mix and match the best possible solution. This is ongoing research and not yet widely available.
I look forward to learning more about the newer developments.
My dad was in a similar situation ~10 years ago. Stage 4 lung cancer, although a different underlying mutation (ALK). Prognosis was less than a year. But he got on some new-at-the-time inhibitor drugs (Crizotinib, iirc) which eliminated ("no evidence") the cancer. What is left does slowly mutate around the drugs, and he goes in for regular monitoring and has had some radiation therapy to knock down specific growth spots, but he's still here with us.
I can't speak to the one in the article, but the drugs he's on aren't chemo, but they interrupt one of the pathways the cancer is using to infinitely grow. They have side effects, but not bad ones comparatively.
Amazing. Thanks for sharing. Oncology is such a complex field. I imagine discoveries in this field will only advance exponentially due to increasing investments and AI tech.
Why is RAG a horrible hack? LLMs can draw from only 2 sources of data: their parametric knowledge and the prompt. The prompt seems like a pretty good place to put new information they need to reason with.
RAG is a hack for lots of reasons, but the reason I'm focused on at the moment is the pipeline.
Say you are trying to do RAG in a chat-type application. You do the following:
1) Summarize the context of the chat into some text that is suitable for a search (lossy).
2) Turn this into a vector embedded in a particular vector space.
3) Use this vector to query a vector database, which returns reference to documents or document fragments (which themselves have been indexed as a lossy vector).
4) Take the text of these fragments and put them in the context of the LLM as input.
5) Modify the prompt to explain what these fragments are.
6) Then the prompt is sent to the LLM, which turns it into its own vector representation.
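To make that pipeline concrete, here is a rough sketch of what those steps typically look like; summarize_chat, embedding_model, vector_db, and llm are hypothetical placeholders, not any particular library's API:

```python
def rag_answer(chat_history: list[str], question: str) -> str:
    # summarize_chat, embedding_model, vector_db, and llm are hypothetical stand-ins

    # 1) Lossy summarization of the chat into a search query
    search_query = summarize_chat(chat_history, question)

    # 2) Embed the query into the retriever's vector space
    query_vec = embedding_model.encode(search_query)

    # 3) Query the vector DB, which returns lossily-indexed fragments
    fragments = vector_db.search(query_vec, top_k=5)

    # 4) + 5) Paste the raw fragment text into the prompt and explain what it is
    context = "\n\n".join(f.text for f in fragments)
    prompt = (
        "Use the following excerpts to answer the question.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )

    # 6) The LLM re-encodes all of this into its own internal representation
    return llm.generate(prompt)
```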
An obvious improvement to this is for the vector DB and the LLM to share an internal representation, so the retrieval results don't have to round-trip through text. The LLM should take the retrieved vectors as a second input alongside the text context and combine them (in the same way you can put text and an image into a multi-modal model).
I guess op may be envisioning an end-to-end solution that can train a model in the context of an external document store.
I.e.
One day we want to be able to backprop through the database.
Search systems face equivalent problems. The stages of ML retrieval systems are separately optimized (trained). Maybe this helps regularize things, but, given enough compute/complexity, it is theoretically possible to differentiate through more of the stack.
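As a toy illustration of what "differentiating through the retrieval step" could look like (a sketch, not any production system): replace the hard top-k lookup with a softmax-weighted mixture of document embeddings, so gradients can flow into both the query encoder and the document store.

```python
import torch
import torch.nn.functional as F

d_model, n_docs = 64, 1000
doc_embeddings = torch.randn(n_docs, d_model, requires_grad=True)  # learnable "database"
query_encoder = torch.nn.Linear(d_model, d_model)                  # stand-in for a real encoder

query = torch.randn(1, d_model)
q = query_encoder(query)                  # (1, d_model)

scores = q @ doc_embeddings.T             # (1, n_docs)
weights = F.softmax(scores, dim=-1)       # soft "retrieval" instead of hard top-k
retrieved = weights @ doc_embeddings      # (1, d_model) mixture of documents

target = torch.randn(1, d_model)          # stand-in for whatever downstream loss you have
loss = F.mse_loss(retrieved, target)
loss.backward()                           # gradients reach the document store too
print(doc_embeddings.grad.abs().sum())
```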
When I prototype RAG systems I don’t use a “vector database.” I just use a pandas dataframe and I do an apply() with a cosine distance function that is one line of code. I’ve done it with up to 1k rows and it still takes less than a second.
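For reference, a minimal version of that approach, assuming the embeddings already live in an "embedding" column as numpy arrays (column names here are just illustrative):

```python
import numpy as np
import pandas as pd

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# df has columns "text" and "embedding" (each embedding a 1-D numpy array)
def top_k(df: pd.DataFrame, query_vec: np.ndarray, k: int = 5) -> pd.DataFrame:
    scores = df["embedding"].apply(lambda e: cosine_sim(e, query_vec))
    return df.assign(score=scores).nlargest(k, "score")
```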
This is exactly what I do. No one talks about how much GPU time it takes just to generate enough embeddings that you actually need to do something fancier.
Here's some back of the envelope math. Let's say you are using a 1B parameter LLM to generate the embedding. That's 2B FLOPs per token. Let's assume a modest chunk size, 2K tokens. That's 4 trillion FLOPs for one embedding.
What about the dot product in the cosine similarity? Let's assume an embedding dim of 384. That's 2 * 384 = 768 FLOPs.
So 4 trillion FLOPs for the embedding vs 768 for the cosine similarity. That's a factor of roughly 5 billion; call it a billion to be conservative.
So you could have a billion embeddings - brute forced - before the lookup became more expensive than generating the embedding.
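The same back-of-the-envelope math as a quick script, using the assumptions above:

```python
# Back-of-the-envelope check (assumptions from above)
embed_model_params = 1e9                    # 1B-parameter embedding model
flops_per_token = 2 * embed_model_params    # ~2 FLOPs per parameter per token
chunk_tokens = 2_000
flops_per_embedding = flops_per_token * chunk_tokens   # 4e12, i.e. 4 trillion

embedding_dim = 384
flops_per_dot_product = 2 * embedding_dim   # 768

print(flops_per_embedding / flops_per_dot_product)     # ~5.2e9 comparisons per embedding
```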
What does that mean at the application level? It means that the time needed to generate millions of embeddings is measured in GPU weeks.
The time needed to lookup an embedding using an approximate nearest neighbors algorithm from millions of embeddings is measured in milliseconds.
The game changed when we switched from word2vec to LLMs to generate embeddings.
1 billion times is such a big difference that it breaks the assumptions earlier systems were designed under.
The embedding is generated once. Search is done whenever a user inputs a query. The cosine similarity is also not done against a single embedding; it's done against millions or billions of embeddings if you are not using an index. So the actual conclusion is that once you have a billion embeddings, a single brute-force search costs as much as generating an embedding.
But then, you are not even taking into account the massive cost of keeping all of these embeddings in memory ready to be searched.
Prototyping is one scenario I have seen this in. Prototyping is iterative - you experiment with the chunk size, chunk content, data sources, data pipeline, etc. Every change means regenerating the embeddings.
Another one is where the data is sliced by a key, e.g. user ID, the particular document being worked on right now, etc.
Everyone is piling on you, but I'd love to see what their companies are doing. Cosine similarity and loading a few thousand rows sounds trivial, but most of the enterprise/B2B chat/copilot apps have a relatively small amount of data whose embeddings can fit in RAM. Combine that with natural sharding by customer ID and it turns out vector DBs are much more niche than an RDBMS. I suspect most people reaching for them haven't done the calculus :/
1k rows isn't really at a point where you need any form of database. Vector or BOW, you can just brute-force the search with such a minuscule amount of data (arguably this should be true into the low millions).
The problem is what happens when you have an additional 6 orders of magnitude of data, and the data itself is significantly larger than the system RAM, which is a very realistic case in a search engine.
1k is not much. My first RAG had over 40K docs (all short, but still...)
The one I'm working on right now has 115K docs (some quite big - I'll likely have to prune the largest 10% just to fit in my RAM).
These are all "small" - for personal use on my local machine. I'm currently RAM limited, otherwise I can think of (personal) use cases that are an order of magnitude larger.
Of course, for all I know, your method may still be as fast on those as on a vector DB.
I must be missing something -- why is the size of the documents a factor? If you embedded a document it would become a vector of ~1k floats, and 115k * 1k floats is a few hundred MB, trivial to fit in modern-day RAM.
Embeddings are a type of lossy compression, so roughly speaking, using more embedding bytes for a document preserves more information about what it contains. Typically documents are broken down into chunks, then the embedding for each chunk is stored, so longer documents are represented by more embeddings.
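A rough sketch of that chunk-then-embed pattern; the chunk size and overlap are arbitrary here, and embed() stands in for whatever embedding model you use:

```python
import numpy as np

def chunk_text(text: str, chunk_chars: int = 2000, overlap: int = 200) -> list[str]:
    # Naive fixed-size chunking; real pipelines often split on sentences or sections.
    step = chunk_chars - overlap
    return [text[i:i + chunk_chars] for i in range(0, len(text), step)]

def index_document(doc_id: str, text: str, embed) -> list[dict]:
    # One embedding per chunk, so longer documents produce more vectors.
    return [
        {"doc_id": doc_id, "chunk": i, "text": c, "vector": np.asarray(embed(c))}
        for i, c in enumerate(chunk_text(text))
    ]
```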
Always felt they're more like hashes/fingerprints for the RAG use cases.
> Typically documents are broken down into chunks
That's what I would have guessed. It's still surprising that the embeddings don't fit into RAM though.
That said (the following I just realized), even if the embeddings don't fit into RAM at the same time, you really don't need to load them all into RAM if you're just performing a linear scan and doing cosine similarity on each of them. Sure, it may be slow to load tens of GB of embedding info... but at this rate I'd be wondering what kind of textual data one could feasibly have that goes into the terabyte range. (Also, generating that many embeddings requires a lot of compute!)
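If the matrix really doesn't fit in RAM, the linear scan can stream it from disk in blocks, something like the sketch below (assumes the vectors were saved as a flat float32 array; the file name and shape are illustrative):

```python
import numpy as np

def scan_top_k(path: str, n_vectors: int, dim: int, query: np.ndarray,
               k: int = 5, block: int = 100_000) -> np.ndarray:
    # Memory-map the matrix so only the blocks we touch get paged in.
    emb = np.memmap(path, dtype=np.float32, mode="r", shape=(n_vectors, dim))
    q = query / np.linalg.norm(query)
    best_scores = np.full(k, -np.inf)
    best_ids = np.zeros(k, dtype=np.int64)
    for start in range(0, n_vectors, block):
        chunk = np.asarray(emb[start:start + block], dtype=np.float32)
        scores = (chunk / np.linalg.norm(chunk, axis=1, keepdims=True)) @ q
        all_scores = np.concatenate([best_scores, scores])
        all_ids = np.concatenate([best_ids, np.arange(start, start + len(chunk))])
        top = np.argsort(all_scores)[-k:]
        best_scores, best_ids = all_scores[top], all_ids[top]
    return best_ids[np.argsort(best_scores)[::-1]]  # best match first
```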
> Always felt they're more like hashes/fingerprints for the RAG use cases.
Yes, I see where you’re coming from. Perceptual hashes[0] are pretty similar, the key is that similar documents should have similar embeddings (unlike cryptographic hashes, where a single bit flip should produce a completely different hash).
Nice embeddings encode information spatially, a classic example of embedding arithmetic is: king - man + woman = queen[1]. “Concept Sliders” is a cool application of this to image generation [2].
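The classic demonstration of that spatial structure, sketched with gensim's pretrained vectors (a sizeable download the first time, and exact results depend on which vectors you load):

```python
import gensim.downloader as api

# Pretrained word vectors (downloaded on first use)
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# The top hit is typically "queen"; the exact score depends on the model.
```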
Personally I’ve not had _too_ much trouble with running out of RAM due to embeddings themselves, but I did spend a fair amount of time last week profiling memory usage to make sure I didn’t run out in prod, so it is on my mind!
Each vector is 1536 numbers. I don't know how many bits per number, but I'll assume 64 bits (8 bytes). So the total size is 1536 * 115K * 8 / 1024^3, which gives about 1.3GB.
So yes, not a lot.
I still haven't set it up, so I don't know how much space it really will take, but my 40K doc one took 2-3 GB of RAM. It's not a pandas DF but an in-memory DB, so perhaps there's a lot of overhead per row? I haven't debugged it.
To be clear, I'm totally fine with your approach if it works. I have very limited time so I was using txtai instead of rolling my own - it's nice to get a RAG up and running in just a few lines of code. But for sure, if the overhead of txtai is really that significant, I'll need to switch to pure pandas.
Even on the production side there is something to be said about just doing things in memory, even over larger datasets. Certainly like all things there is a possible scale issue but I would much rather spin up a dedicated machine with a lot of memory than pay some of the wildly high fees for a Vector DB.
Not sure if others have gone down this path, but I have been testing out ways to store vectors to disk in files for later retrieval and then doing everything in memory. For me the tradeoff of a slightly slower response time was worth it compared to the 4-5 figure bill I would be getting from a vector DB otherwise.
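For what it's worth, the simplest version of that pattern is just numpy's own save/load; the file name and sizes below are made up:

```python
import numpy as np

# Persist the embeddings once (random vectors stand in for real ones here)
embeddings = np.random.rand(500_000, 768).astype(np.float32)
np.save("embeddings.npy", embeddings)

# At startup, load the whole matrix back into RAM and search in-process
emb = np.load("embeddings.npy")
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

def search(query: np.ndarray, k: int = 10) -> np.ndarray:
    q = query / np.linalg.norm(query)
    return np.argsort(emb @ q)[-k:][::-1]  # indices of the k most similar rows
```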
Also, you are probably doing it wrong by turning a matrix-matrix multiplication into a for loop (over rows). Doing the multiplication in one vectorized shot gives much better performance.
There is certainly some scale at which a more sophisticated approach is needed. But your method (maybe with something faster than python/pandas) should be the go-to for demonstration and kept until it's determined that the brute force search is the bottleneck.
This issue is prevalent throughout infrastructure projects. Someone decides they need a RAG system and then the team says "let's find a vector db provider!" before they've proven value or understood how much data they have or anything. So they waste a bunch of time and money before they even know if the project is likely to work.
It's just like the old model of setting up a hadoop cluster as a first step to do "big data analytics" on what turns out to be 5GB of data that you could fit in a dataframe or process with awk https://adamdrake.com/command-line-tools-can-be-235x-faster-... (edit: actually currently on the HN front page)
It's a perfect storm of sales-led tooling where leadership is sold something they don't understand, over-engineering, and trying to apply waterfall project management to "AI" projects that have lots of uncertainty and need a de-risking-based approach where you show that the project is likely to work and iterate, instead of building a big foundation first.
Even up to 1M or so rows you can just store everything in a numpy array or PyTorch tensor and compute similarity directly between your query embedding and the entire database. Will be much faster than the apply() and still feasible to run on a laptop.
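A sketch of that in PyTorch; the sizes and the random "database" are just placeholders:

```python
import torch
import torch.nn.functional as F

# (n_docs, dim) matrix of embeddings, kept entirely in memory
database = F.normalize(torch.randn(1_000_000, 768), dim=1)  # stand-in for real embeddings

def top_k(query: torch.Tensor, k: int = 10) -> torch.Tensor:
    q = F.normalize(query, dim=0)
    scores = database @ q              # cosine similarity against every row at once
    return torch.topk(scores, k).indices
```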
You may benefit from polars; it can use multiple cores better than pandas, and has some of the niceties of Arrow (which was written/championed by the power duo of Wes and Hadley, authors of pandas and the R tidyverse respectively).
I agree that pandas or whatever data frame library you like is better for prototyping and exploring than setting up a bunch of infrastructure in a dev environment. Especially if you have labels and are evaluating against a ground truth.
You might be interested in SearchArray which emulates the classic search index side of things in a pandas dataframe column
Thanks for the article, and I definitely agree you are better off starting simple, like a Parquet file and FAISS, and then testing out options with your data. I say that mainly to test chunking strategies, because of how big an effect chunking has on everything downstream, whatever vector DB or BERT path you take -- chunking is a much bigger impact source than most people acknowledge.
I'm expecting to deploy a 6-figure "row count" RAG in the near future... with CTranslate2, matmul-based, at most lightly (like, single digits?) batched, and probably defaulting to CPU, because the encoder-decoder part of the RAG process is just way more expensive, and a database's memory hogging along with relatively poor top-k performance isn't worth the GPU.
That's kinda why I use LanceDB. It works on all three OSes, doesn't require large installs, and is quite easy to use. The data is also just columnar files on disk (the Lance format), so no need to deal with SQL.
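A minimal LanceDB usage sketch, from memory (so double-check the current docs for exact signatures); the vectors and table name are made up:

```python
import lancedb

db = lancedb.connect("./my_lancedb")  # just a directory on disk

table = db.create_table(
    "docs",
    data=[
        {"vector": [0.1, 0.2, 0.3], "text": "first document"},
        {"vector": [0.2, 0.1, 0.4], "text": "second document"},
    ],
)

# Nearest-neighbour search straight against the on-disk files
print(table.search([0.1, 0.2, 0.25]).limit(2).to_pandas())
```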