
Think the massive increase in demand is due to mass inference of open source LLMs? Or is the transformer architecture driving mass inference of other models too?


You might be thinking too far into this. The biggest customers are bulk buyers that are either training on a private cluster (e.g. Meta or OpenAI) or selling their rack space to other businesses. These are the people paying money and increasing the demand for GPU hardware; what happens to the businesses they serve almost doesn't matter as long as they pay for the compute. The "driver" for this demand is hype. If people were laser-focused on the best-value solution, then everyone would pay for OpenAI's compute, since it's cheaper than GPU hosting.

The real root of the problem is that GPU compute is not a competitive market. The demand is less for GPUs and more for Nvidia hardware specifically, because nobody else is shipping CUDA or CUDA-equivalent software. Thus demand is artificially inflated beyond what would otherwise be reasonable, since buyers aren't shopping in a competitive market. Basically the same story as what happened to Nvidia's hardware during the crypto mining rush.


> then everyone would pay for OpenAI's compute since it's cheaper than GPU hosting.

This is absolutely not true. The gap is narrowing as providers (Google, Anthropic, DeepSeek) introduce cross-request KV caching, but it's definitely not true for OAI (yet).


“Own” is likely a very simplistic term for the complex agreement in place between SpaceX and NASA. I’m sure SpaceX retains all of the developed IP, but NASA can “do whatever they want” with the actual ship.


Third try always


I’m a Googler who doesn’t do anything remotely close to setting capital return policy. Just remember Alphabet has been buying back tens of billions of dollars of stock for a while. Dividends are just a different capital return mechanism.


Importantly, dividends devalue issued unvested RSUs.


I thought this was the case, but Googlers will receive Dividend Equivalent Units on unvested GSUs on each dividend payment date.


Could you please expand on this? If a person got issued RSUs but they have not vested yet, wouldn't they be happy that the RSUs are going to be worth more by the time they get vested?


No, because the shares would be worth that amount less the dividends paid out during the vesting period.


In a vacuum where precise numbers do not exist, maybe.

In this real scenario, if someone's GOOG shares were vesting at an earlier value (a rough glance at the YTD average gives about $145), they will have missed out on a year's worth of dividends at $0.20 per share. However, the current share price is $175.

So, through this maneuver, a person holding N GOOG shares will lose at most 3 quarters of dividends: N * $0.20 * 3 = N * $0.60

But they will have gained whatever the stock has appreciated, which at this moment in time works out to: N * ($175 - $145) = N * $30

What am I missing which would make the scenario above result in OP's claim of "dividends devalue issued unvested RSUs"?

EDIT: This also fails to take into account "Dividend Equivalents (DEs)", which are not factored above, and would yield extra income to the person that owns unvested shares.
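To make the comparison concrete, here is the same arithmetic as a tiny Python sketch. N and the figures are the rough assumptions from above ($145 vesting value, $175 current price, $0.20 quarterly dividend), not exact numbers:

    # Rough sketch: dividends forgone vs. price appreciation for N unvested shares.
    N = 100                      # hypothetical number of unvested shares
    quarterly_dividend = 0.20    # dollars per share, as assumed above
    quarters_missed = 3
    grant_price = 145.0          # rough YTD average assumed above
    current_price = 175.0

    dividends_forgone = N * quarterly_dividend * quarters_missed   # N * $0.60 -> $60
    appreciation = N * (current_price - grant_price)               # N * $30   -> $3,000
    print(dividends_forgone, appreciation)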


Corporate finance theory says that when a dividend is issued, the price of the stock goes down by an equivalent amount.


Googler here, if you haven’t looked at TPUs in a while check out the v5. They support PyTorch/JAX now, makes them much easier to use than TF only.


Where can I buy a TPU v5 to install in my server? If the answer is “cloud”: that’s why Nvidia is wiping the floor with everyone else.


How many people are out there buying H100s for their personal use?


Ah, but part of the reason for CUDA's success is that the open source developer who wants to run unit tests or profile their kernel can pick up a $200 card. That PhD student with a $2000 budget can pick up a card. Academic lab with $20,000 for a beefy server, or a tiny cluster? Nvidia will take their money.

And that's all fixed capital expenditure - there's no risk a code bug or typo by an inexperienced student will lead to a huge bill.

Also, if you're looking for an alternative to CUDA because you dislike vendor lock-in, switching to something only available in GCP would be an absurd choice.


I'm really shocked at how dependent companies have become on cloud offerings. Want a GPU? Those are expensive, so let's just rent on Amazon and then complain about operational costs!

I've noticed this at companies. Yeah, the cloud is expensive, but you have a data center, and a few servers with RTX 3090s aren't expensive. A lot of research workloads can run on simple, cheap hardware.

Even older Nvidia P40s are still useful.


Probably many orders of magnitude greater than those buying TPUs for personal use...


Technically correct, but only because TPUs aren't for sale. H100s cost like 30,000 USD, if you can even get one.


So in other words, every AI company has at least 20.


Probably not many. However, 4090s would be a different situation. There are plenty of guides on running LLMs, stable diffusion, etc. on local hardware.

The H100s would be for businesses looking to get into this space.


You probably can't even rent them from Google if you wanted to, in my experience.



I think the OP's point was that Google claims to have TPUs in their cloud, but in reality they are rarely available.


Let’s compare. I’ve interviewed for and been offered both high-paying tech jobs and investment banking jobs. My tech interviews really required me to have a deep understanding of the technology and how to solve problems with it; they tested my critical thinking skills. For banking, sure, there were some technical questions, but a lot of it came down to what school I went to, how I acted, and who I knew. I suppose either one could be a “nightmare” depending on who you are, but I do think the tech interview process was much better at hiring generally competent people. There’s also consulting, which I have not directly interviewed for, but the interview style is more similar to tech with more of a focus on mental math; those interviews can be very difficult if you don’t have that skill.


What happens when your vm is live migrated 1000 feet away or to a different zone?


My aunt was diagnosed with stage 4 lung cancer and is only alive because of this drug. It’s a marvel of modern medicine.


My wife was diagnosed in Nov '23. Stage 4. She had a mutation that allowed Tagrisso/osimertinib. There are many mutations, but this one is prevalent in Asian women who have never smoked.

There was a trial published just after her diagnosis that showed progression-free survival at 2 years of 57% (chemo + osimertinib) vs 41% (osimertinib alone).

https://www.nejm.org/doi/full/10.1056/NEJMoa2306434

She's 6 days into the chemo. There are 4 rounds of pemetrexed+carboplatin and then maintenance carboplatin (forever). Every 21 days.

So far, it's been rough, but hopefully her body will adjust and once she's just on carboplatin hopefully her quality of life will be better.

She had few side effects from the osimertinib, but the main one was mouth sores. They're gone now, thankfully.

Another thing people need to know about this: this diagnosis often comes with regular thoracentesis (removal of fluid from the chest). In this case, it's in the lining of her lung, not lung itself.

Advice to anyone in this situation: make all future appointments for thoracentesis, because you don't want to go into the ER/ED to get it done. And we've stopped allowing residents to practice on her (after one attempt that was quite painful).

EDIT:

Forgot to add: carboplatin crosses the blood-brain barrier, as do the cancer cells. The end state for this cancer, even with the above treatments, is usually tumors in the brain.


Hey, best of luck to you both. I've been there with close family members and cancer. It's no fun at all. Easily the hardest thing I've ever had to do in my life was just to care and watch.

Make sure to grab family and friends to help out. They want to help out, most of the time at least. I know it's hard to ask, but make sure to do it anyway.

And yeah, standard generic advice to take care of yourself too. I hated that advice, as there just was never any time to do so. One thing that helped was to schedule that 'me' time in and make others aware of it. If you put it on the schedule, you'll have a better chance of taking it.

Two tips with MDs:

1) Carry and use a notebook in all MD interactions. Just a simple journal is all. Time, medications given, dosages, MD administering, etc. It's a good backup to have, sure, but you'll likely never use it. The real power is when MDs see you writing things down. They take you more seriously and they take themselves more seriously. I think they think that you'll sue now and have some proof. But who knows.

2) If you're in for a long term treatment (1+ nights) put a big bowl of candy outside the door. Nurses etc. will stop by more often to check up on things and generally seem to like you more. Consider putting cigarettes in there too, depending on if your nurses smoke. Then they will really like you and go above and beyond.

Again, best of luck to you both.

FUCK cancer.


Thanks. Appreciate the tips.


Can you refuse allowing residents to practice on you or your loved ones for better quality of care?


Sure, you can generally refuse any medical care in the US. With respect to residents, it can be as simple as asking and picking the right doctor for your procedure. Their credentials are generally public.


Yes, you can, and the doctors we talked to said it would not negatively impact them.


> this one is prevalent in Asian women who have never smoked.

Does that mean the rate is lower for Asian women who do smoke?


My understanding, though not by any means an expert, is that lung cancer that non smokers tend to get is different to that which smokers tend to get.


This is my understanding.


I think they just mean that the incidence is much higher in that group than would otherwise be expected.

Not all lung cancers are related to smoking, but I'm not aware of any where smoking decreases incidence or is protective.


I wish her and you all the strength in the world.


Thank you.


[flagged]


Practice some empathy.


what did he say??


That we were being selfish for not allowing residents to practice.


Jesus.


It feels so good to read a comment like this. Thank you for sharing and I’m really happy for you and your family (and modern medicine that makes it all possible).


It's a great sight to see people outliving research protocols.

Best wishes.


Would you be willing to share a few more details for me to understand? It would help me talk more concretely with my colleagues.

So she had stage 4 lung cancer.

"Is alive only because of this drug":

- Was her life expectancy obviously prolonged? If so, by how much has she exceeded it so far?

- Has the cancer progressed more slowly, halted, or regressed?

Thanks.


I’m not completely read in on all the details, but she told me that prior to this drug her diagnosis was a “death sentence.” The lung cancer had metastasized throughout her bones. The indication was sudden lower-body paralysis caused by a spinal tumor. She had spinal fusion surgery as well. I understand that because it’s stage 4 she will never be in “remission,” but it has regressed significantly.


Thanks for sharing. At such an advanced stage and aggressive metastasis, for it to have regressed at all... wow. She must have been in so much pain.


My wife had stage 4 melanoma. Prior to the newish immunotherapies about 5 years ago, it was a death sentence — 6-9 months life expectancy from diagnosis. Now, it’s 60-70% 5 year survival rate. Unfortunately my wife wasn’t one of them.

In general, these types of cancers spew mets and spread quickly. Many are resistant to chemo, go to the brain (chemo cannot help there) and only respond to high focused radiation like SRS or proton beam.

Immunotherapies essentially suppress checkpoints that cancer cells use to avoid immune response and/or cause your body to target specific checkpoints. I can’t read FT.com, but I believe it’s talking about a targeted therapy that allows your body to control the cancer.

There’s a lot of research happening around things like immunotherapy combined with custom versions of Moderna and other vaccines that will significantly improve treatment and save people from going through what my wife went through. It’s a good time.


Sorry to hear about your wife. Yes, once those melanomas go deep enough in the skin, they spread everywhere like a plague.

I know there are even researchers creating AI-assisted treatment regimens that match specific mutations, other aspects of people's DNA, and all the immunotherapy drugs, in order to mix and match the best possible solution. Ongoing research, and not yet widely available.

I look forward to learning more about the newer developments.


My dad was in a similar situation ~10 years ago. Stage 4 lung cancer, although a different underlying mutation (ALK). Prognosis was less than a year. But he got on some new-at-the-time inhibitor drugs (Crizotinib, iirc) which eliminated ("no evidence") the cancer. What is left does slowly mutate around the drugs, and he goes in for regular monitoring and has had some radiation therapy to knock down specific growth spots, but he's still here with us.

I can't speak to the one in the article, but the drugs he's on aren't chemo, but they interrupt one of the pathways the cancer is using to infinitely grow. They have side effects, but not bad ones comparatively.


Amazing. Thanks for sharing. Oncology is such a complex field. I imagine discoveries in this field will only advance exponentially due to increasing investments and AI tech.


That is indeed remarkable. Stage 4 lung cancer usually makes you seek out hospice care.


Wow, that's incredible. So wholesome reading a comment like this on HN, makes me more hopeful of the future. Thanks for sharing


Why is RAG a horrible hack? LLMs can draw from only 2 sources of data: their parametric knowledge and the prompt. The prompt seems like a pretty good place to put new information they need to reason with.


RAG is a hack for lots of reasons, but the reason I'm focused on at the moment is the pipeline.

Say you are trying to do RAG in a chat-type application. You do the following:

1) Summarize the context of chat into some text that is suitable for a search (lossy).

2) Turn this into a vector embedded in a particular vector space.

3) Use this vector to query a vector database, which returns reference to documents or document fragments (which themselves have been indexed as a lossy vector).

4) Take the text of these fragments and put them in the context of the LLM as input.

5) Modify the prompt to explain what these fragments are.

6) Then the prompt is sent to the LLM, which turns it into its own vector representation.

An obvious improvement would be for the vector DB and the LLM to share an internal representation, with the vector DB indexing in that space. The LLM would then take the retrieved vectors as a second input alongside the text context and combine them, in the same way a multi-modal model combines text and an image.
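For concreteness, here is a minimal sketch of steps 1-5 above. The names embed_query, chunk_vecs and chunk_texts are hypothetical stand-ins for whatever embedding model and store you actually use:

    import numpy as np

    # embed_query(text) -> 1-D numpy array; chunk_vecs is (num_chunks, dim); chunk_texts aligns with it.
    def retrieve(query_summary, embed_query, chunk_vecs, chunk_texts, k=5):
        q = embed_query(query_summary)                       # steps 1-2: lossy summary -> vector
        sims = (chunk_vecs @ q) / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
        top = np.argsort(-sims)[:k]                          # step 3: nearest document fragments
        return [chunk_texts[i] for i in top]

    def build_prompt(chat_context, fragments):
        # steps 4-5: paste the fragment text into the prompt and explain what it is
        refs = "\n---\n".join(fragments)
        return ("Use the following reference fragments if they are relevant:\n"
                f"{refs}\n\nConversation so far:\n{chat_context}")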


I guess op may be envisioning an end-to-end solution that can train a model in the context of an external document store.

I.e. One day we want to be able to backprop through the database.

Search systems face equivalent problems. The hierarchy of ML retrieval systems is separately optimized (trained). Maybe this helps regularize things, but, given enough compute / complexity, it is theoretically possible to differentiate through more of the stack.


When I prototype RAG systems I don’t use a “vector database.” I just use a pandas dataframe and I do an apply() with a cosine distance function that is one line of code. I’ve done it with up to 1k rows and it still takes less than a second.
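Roughly what that looks like (the embed() helper and the column names here are placeholders for whatever embedding model and schema you actually use):

    import numpy as np

    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # df has a "text" column and an "embedding" column holding numpy arrays
    query_vec = embed("how do I rotate my API key?")   # hypothetical embedding helper
    df["score"] = df["embedding"].apply(lambda e: cosine_sim(e, query_vec))
    top_docs = df.nlargest(5, "score")["text"].tolist()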


This is exactly what I do. No one talks about how many GPUs you'd need to generate enough embeddings that brute force stops being enough.

Here's some back of the envelope math. Let's say you are using a 1B parameter LLM to generate the embedding. That's 2B FLOPs per token. Let's assume a modest chunk size, 2K tokens. That's 4 trillion FLOPs for one embedding.

What about the dot product in the cosine similarity? Let's assume an embedding dim of 384. That's 2 * 384 = 768.

So that's 4 trillion ops for the embedding vs. 768 for the cosine similarity: a factor of several billion.

So you could have a billion embeddings - brute forced - before the lookup became more expensive than generating the embedding.

What does that mean at the application level? It means that the time needed to generate millions of embeddings is measured in GPU weeks.

The time needed to lookup an embedding using an approximate nearest neighbors algorithm from millions of embeddings is measured in milliseconds.

The game changed when we switched from word2vec to LLMs to generate embeddings.

A factor of billions is such a big difference that it breaks the assumptions earlier systems were designed under.
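The same back-of-the-envelope numbers as a quick sanity check:

    embed_flops = 2e9 * 2000        # ~2 FLOPs/param/token * 1B params * 2K-token chunk = 4e12
    dot_flops = 2 * 384             # one multiply-add per dimension = 768
    print(embed_flops / dot_flops)  # ~5.2e9: billions of brute-force comparisons per embedding generated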


This analysis is bad.

The embedding is generated once; search is done whenever a user inputs a query. The cosine similarity is also not done against a single embedding: it's done against millions or billions of embeddings if you are not using an index. So the actual conclusion is that once you have a billion embeddings, a single search operation costs as much as generating an embedding.

But then, you are not even taking into account the massive cost of keeping all of these embeddings in memory ready to be searched.


I think the context was prototyping.


Prototyping is one scenario I have seen this in. Prototyping is iterative: you experiment with the chunk size, chunk content, data sources, data pipeline, etc. Every change means regenerating the embeddings.

Another one is where the data is sliced by a key, e.g. the user ID, the particular document being worked on right now, etc.


Everyone is piling on you, but I'd love to see what their companies are doing. Cosine similarity and loading a few thousand rows sounds trivial, but most enterprise/B2B chat/copilot apps have a relatively small amount of data whose embeddings can fit in RAM. Combine that with natural sharding by customer ID and it turns out vector DBs are much more niche than an RDBMS. I suspect most people reaching for them haven't done the calculus :/


People rushing to slap “AI” on their products don’t really know what they need? Yea that’s absolutely what’s happening now


1k rows isn't really at a point where you need any form of database. Vector or BOW, you can just brute-force the search with such a minuscule amount of data (arguably this holds into the low millions of rows).

The problem is what happens when you have an additional 6 orders of magnitude of data, and the data itself is significantly larger than the system RAM, which is a very realistic case in a search engine.


1k is not much. My first RAG had over 40K docs (all short, but still...)

The one I'm working on right now has 115K docs (some quite big - I'll likely have to prune the largest 10% just to fit in my RAM).

These are all "small" - for personal use on my local machine. I'm currently RAM limited, otherwise I can think of (personal) use cases that are an order of magnitude larger.

Of course, for all I know, your method may still be as fast on those as on a vector DB.


I must be missing something -- why is the size of the documents a factor? If you embedded a document it would become a vector of ~1k floats, and 115k * 1k floats is a couple hundred MB, trivial to fit in modern RAM.


Embeddings are a type of lossy compression, so roughly speaking, using more embedding bytes for a document preserves more information about what it contains. Typically documents are broken down into chunks, then the embedding for each chunk is stored, so longer documents are represented by more embeddings.

Going further down the AI == compression path, there’s: http://prize.hutter1.net/


> Embeddings are a type of lossy compression

Always felt they're more like hashes/fingerprints for the RAG use cases.

> Typically documents are broken down into chunks

That's what I would have guessed. It's still surprising that the embeddings don't fit into RAM though.

That said (the following I just realized), even if the embeddings don't fit into RAM at the same time, you really don't need to load them all into RAM if you're just performing a linear scan and doing cosine similarity on each of them. Sure, it may be slow to load tens of GB of embedding info... but at this rate I'd be wondering what kind of textual data one could feasibly have that goes into the terabyte range. (Also, generating that many embeddings requires a lot of compute!)
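A minimal sketch of that kind of linear scan without loading everything, assuming the embeddings were dumped to disk as a flat float32 array (np.memmap pages data in lazily instead of reading the whole file up front):

    import numpy as np

    DIM = 384
    # Assumed layout: embeddings.f32 is a flat float32 array, one DIM-sized row per chunk.
    vecs = np.memmap("embeddings.f32", dtype=np.float32, mode="r").reshape(-1, DIM)

    def top_k(query, k=5, batch=100_000):
        q = query / np.linalg.norm(query)
        scores = np.empty(len(vecs), dtype=np.float32)
        for start in range(0, len(vecs), batch):             # scan in RAM-sized batches
            block = np.asarray(vecs[start:start + batch])
            scores[start:start + batch] = (block @ q) / np.linalg.norm(block, axis=1)
        return np.argsort(-scores)[:k]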


> Always felt they're more like hashes/fingerprints for the RAG use cases.

Yes, I see where you’re coming from. Perceptual hashes[0] are pretty similar, the key is that similar documents should have similar embeddings (unlike cryptographic hashes, where a single bit flip should produce a completely different hash).

Nice embeddings encode information spatially, a classic example of embedding arithmetic is: king - man + woman = queen[1]. “Concept Sliders” is a cool application of this to image generation [2].

Personally I’ve not had _too_ much trouble with running out of RAM due to embeddings themselves, but I did spend a fair amount of time last week profiling memory usage to make sure I didn’t run out in prod, so it is on my mind!

[0] https://en.m.wikipedia.org/wiki/Perceptual_hashing

[1] https://www.technologyreview.com/2015/09/17/166211/king-man-...

[2] https://github.com/rohitgandikota/sliders


Example from OpenAI embedding:

Each vector is 1536 numbers. I don't know how many bits per number, but I'll assume 64 bits (8 bytes). So the total size is 1536 * 115K * 8 bytes / 1024^3, which gives about 1.3 GB.

So yes, not a lot.

I still haven't set it up so I don't know how much space it really will take, but my 40K doc one took 2-3 GB of RAM. It's not pandas DF, but in an in-memory DB so perhaps there's a lot of overhead per row? I haven't debugged.

To be clear, I'm totally fine with your approach if it works. I have very limited time so I was using txtai instead of rolling my own - it's nice to get a RAG up and running in just a few lines of code. But for sure, if the overhead of txtai is really that significant, I'll need to switch to pure pandas.


Even on the production side there is something to be said about just doing things in memory, even over larger datasets. Certainly like all things there is a possible scale issue but I would much rather spin up a dedicated machine with a lot of memory than pay some of the wildly high fees for a Vector DB.

Not sure if others have gone down this path, but I have been testing out ways to store vectors to disk in files for later retrieval and then doing everything in memory. For me the tradeoff of a slightly slower response time was worth it compared to the 4-5 figure bill I would be getting from a vector DB otherwise.


True.

Also, you are probably doing it wrong by turning a matrix-vector multiplication into a for loop over rows. A single vectorized matmul performs much better:

    sim = np.vstack(df.col) @ vec
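Spelled out a bit more (assuming the embeddings live in a dataframe column of equal-length numpy arrays, as upthread):

    import numpy as np

    mat = np.vstack(df["embedding"].to_numpy())             # (n_rows, dim) matrix, built once
    q = query_vec / np.linalg.norm(query_vec)
    df["score"] = (mat @ q) / np.linalg.norm(mat, axis=1)   # one matmul instead of a per-row apply()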


There is certainly some scale at which a more sophisticated approach is needed. But your method (maybe with something faster than python/pandas) should be the go-to for demonstration and kept until it's determined that the brute force search is the bottleneck.

This issue is prevalent throughout infrastructure projects. Someone decides they need a RAG system and then the team says "let's find a vector db provider!" before they've proven value or understood how much data they have or anything. So they waste a bunch of time and money before they even know if the project is likely to work.

It's just like the old model of setting up a hadoop cluster as a first step to do "big data analytics" on what turns out to be 5GB of data that you could fit in a dataframe or process with awk https://adamdrake.com/command-line-tools-can-be-235x-faster-... (edit: actually currently on the HN front page)

It's a perfect storm of sales-led tooling where leadership is sold something they don't understand, over-engineering, and trying to apply waterfall project management to "AI" projects that have lots of uncertainty and need a de-risking-based approach, where you show that the project is likely to work and iterate instead of building a big foundation first.


> 5GB of data that you could fit in a dataframe or process with awk

These days anything less than 2TB should be done 100% in memory.


What’s your AWS bill like ?


Even up to 1M or so rows you can just store everything in a numpy array or PyTorch tensor and compute similarity directly between your query embedding and the entire database. Will be much faster than the apply() and still feasible to run on a laptop.


You may benefit from polars; it handles multiple cores better than pandas, and has some of the niceties from Arrow (which was written/championed by the power duo of Wes and Hadley, authors of pandas and the R tidyverse respectively).


I agree pandas, or whatever data frame library you like, is better for prototyping and exploring than setting up a bunch of infrastructure in a dev environment. Especially if you have labels and are evaluating against a ground truth.

You might be interested in SearchArray which emulates the classic search index side of things in a pandas dataframe column

https://github.com/softwaredoug/searcharray


Thanks for the article. I definitely agree you are better off starting simple, like a Parquet file and FAISS, and then testing out options with your data. I say that mainly so you can test chunking strategies, because chunking has a big effect on everything downstream, whatever vector DB or BERT path you take -- it's a much bigger source of impact than most people acknowledge.


I'm expecting to deploy a 6-figure "row count" RAG in the near future... with CTranslate2, matmul-based, at most lightly batched (like, single digits?), and probably defaulting to CPU, because the encoder-decoder part of the RAG process is just way more expensive, and the database's memory hogging along with relatively poor top-k performance isn't worth the GPU.


That's kinda why I use LanceDB. It works on all three OSes, doesn't require large installs, and is quite easy to use. The files are also just Parquet, so no need to deal with SQL.


I mean, you have 1k rows and it is a "prototype".


Think about the number of flops needed for each comparison in brute force search.

You'll realize that it scales well beyond 1k.


use np.dot, takes 1 line


1k rows? Sounds like kindergarten.


Up to 100k rows you don't get faster by using a vector store; just use numpy.


And often you have tags that filter it down even further.


What RAG systems do you prototype?


You could do it by hand at that scale too

