I’d be surprised if they had, working on what you work on! I’ll bet you would find them interesting in other ways, though. I’ve had a ton of success using them as study guides in other areas (e.g., biology).
I predict this is likely to change in 2025, if you explore with them -- unless there are constraints that make using them impractical (security or policy or bureaucracy being the ones that come to mind). I've experimented continuously with LLMs for over a year as a solo developer. For example, I have been using a workflow where I write a design document and often work alongside the LLM to keep the code and document in sync.
P.S. I used to do a lot of Clojure, and definitely appreciate your work on it!
It may not have been important or interesting to him, or maybe he just figured the 10-digit number of articles written on the topic in 2024 (most by LLMs) was enough.
Pretty much every public company, at least every bigtech company, follows the same conventions -- don't say incriminating things in chat, trainings for "communicate with care" (definitely don't say "we will kill the competition!!" in email or chat), automatic retention policy etc etc.
> definitely don't say "we will kill the competition!!" in email or chat
My first reaction when I went through this training was that the US legal system is completely absurd in the context of corporations, and can't result in anything but absurd outcomes, whether the litigation is successful or not.
There's quite literally a guidebook that says "don't say 'kill the competition', say 'we will make the best product' instead", etc. It's an obsession with words, and while I can understand how that's important in some civil cases, it's (a) trivial to conceal intent behind coded speech, to the point of cringe (see exhibit A: the TV show Billions, where the phrase "I am not uncertain" is used to somehow create plausible deniability), (b) you're gonna get a bunch of false positives from non-decision-makers talking colloquially and sarcastically to each other, and most importantly (c) it's completely meaningless anyway because intent means nothing to an amoral corporation. Wrongdoing by giant behemoths should be judged by what the company does: did they compete unfairly or not? Their nifty word-weaseling, or a random employee's clumsy lack thereof, should have no bearing on the case.
People saying that didn't mean it as "obey the science," they meant, "follow the science to the conclusions it leads you to."
For example, people would ask public health officials what they thought about things, and the data wouldn't be sufficient to say with certainty. So they said they'd follow the science, meaning "we'll make a decision based on data."
You can criticize a lot of unscientific decisions that people made after saying that, but you've misinterpreted the phrase.
> People saying that didn't mean it as "obey the science," they meant, "follow the science to the conclusions it leads you to."
People saying that absolutely meant "obey the science" to the point that a substantial number of them [4] wanted to incarcerate and deprive of their livelihood anyone that didn't obey their idea of science.
You're making an enormous leap from "follow the science" to health policy and then again to legal consequences of violating local laws.
No one said, "'The Science' told us to arrest people for going to church!" The science did (and does) say that a huge amount of the spread of Covid was due to church attendance (and gyms, concerts, and clubs) at the time, particularly because of rapid singing/breathing and close quarters in those settings.
What people decide to do with that isn't scientific. It's local policy.
When you have a system where people are legally entitled to free health care (as they are in emergencies in the US), then the government should have a right to tell them to cut out unnecessary activities in an extreme crisis that had depleted local medical resources. It's just as easy to hold religious services on Zoom.
I would have preferred that when people were caught violating these laws, they were allowed to continue, but only if they signed a document forfeiting their right to emergency medical care.
> You're making an enormous leap from "follow the science" to health policy and then again to legal consequences of violating local laws.
You are trying to mince words there.
Can you please explain how it is possible to "obey the science" (which these people called for) from a purely political "health policy" perspective but not from a "legal consequences" perspective?
What are the "health policy" policies that are to be implemented to "obey the science" (as they were asking for), that don't demand any "legal consequences"?
P.S.: Also, your suggestion of denying aid to these people is just totalitarian, actually. Let's do the same about obese people, then: that would cut health spending by more than half for everyone else.
True, but practically speaking: on the flip-side, basically nobody affected by the pandemic had the resources to execute on hypothesis-testing during the pandemic. There wasn't anything else they could do but decide what sources they trusted and follow them.
Vaccine denial's conclusions are so woefully unscientific that one can excuse the lack of technical precision in the synthesis of a pro-vaccination slogan.
There are legitimate reasons to be concerned over the COVID vaccines that have nothing to do with 'vaccine denial' [1]. What does 'vaccine denial' even mean? I've never met anyone who denies the existence of vaccines.
[1] Such as the elevated cardiovascular risk for young men that exceeded their risk from COVID.
"Call by meaning" sounds exactly like LLMs with tool-calling. The LLM is the component that has "common-sense understanding" of which tool to invoke when, based purely on natural language understanding of each tool's description and signature.
It was really special to see how this pair basically laid out the foundations of large-scale distributed computing. Protobufs, huge parts of the search stack, GFS, MapReduce, BigTable... the list goes on.
They are the only two people at Google at level 11 (senior fellow) on a scale that goes from 3 (fresh grad) to 10 (fellow).
One of my coworkers got assigned Sanjay on one of her CLs recently and she had no idea who he was. I had the pleasure of working with Sanjay as his intern at SRC the summer before he joined Google, and he taught me a lot of cool tricks related to compiler development. Both Sanjay and Jeff Dean are PhDs with PL focuses.
It's actually quite interesting how many of the early Googlers came from PL research backgrounds, and how it impacted Google's culture. Jeff Dean's thesis on whole-program optimization of Cecil/Vortex [1] was a classic even before Google got big, and eventually he got his PhD advisor Craig Chambers hired to write Flume [2]. Urs Hölzle (one of Google's first ten employees) was a core contributor to the Self project, which pioneered many of the dynamic optimization techniques that made it into HotSpot. Much of the HotSpot team itself was hired by Google, notably former Search SVP Ben Gomes and several less famous employees. The Plan 9 team (notably Ken Thompson and Rob Pike) went on to create Sawzall and then Go within Google; Ken Thompson was himself famous for co-creating Unix (and C's predecessor, B) before then. Guido van Rossum (Python's creator) wrote Mondrian, the first code-review tool in Google. Lars Bak worked on BETA, then joined Urs to work on Self and Strongtalk before implementing the first version of V8 at Google. Dan Sugalski of Parrot & Perl 6 fame has held a bunch of infrastructure roles within Google.
It's like nearly everyone in a who's-who of the language design & compiler implementation community circa 2002 ended up working for Google, and the few that didn't (notably Chris Lattner of LLVM and Slava Pestov of Factor) ended up at Apple.
IMHO most of Google's woes stem from the promotion system. The incentive is to launch, promote, replace so you can get other people promoted. There's basically zero incentive to build anything durable because if you do, it means your people won't get promoted, they'll leave, your headcount will disappear, and eventually you'll find yourself marginalized and forced out.
I think it's happening at every tech company. They are now being led by tech-illiterate company hopping businesspeople rather than passionate engineers.
hah, this happened to me recently. I submitted a small CL to a threading lib and suddenly I see some guy named "sanjay" asking for a small change. It's very cool to see how involved he is still!
Let's not exaggerate. Everything they did was already well-established in academia for decades and what they built was pretty dumbed-down rather than novel, at least conceptually.
Sure, they did good engineering to apply those techniques at Google, but they did not "lay out the foundations of large scale distributed computing".
Keep seeing DJI drones at local police dept open houses. They even have a "drone unit" that specializes in SAR, hazardous recon type scenarios. Given extensive existing use throughout US local law enforcement, fire depts etc, not sure if this will actually happen.
Or maybe they all mass-migrate to Anduril solutions?
It's going to be a disaster for SAR, policing, firefighting, and all kinds of public good. The whole thing is an incredibly shortsighted move that will literally cost lives.
The goal, I think, is that these organizations will migrate to Skydio or BRINC (as they have the only reasonably viable drones for most of these use cases IMHO).
The reality is that they'll buy Autel (just as Chinese as DJI) or just keep using DJI and hoping the FCC Radio Police don't show up, which is probably a safe bet. Anduril don't really sell into this space.
In five years the US will have a prolific consumer drone industry and will have hardly skipped a beat. This is a good move for the US from a long-term security and economic standpoint. There are some short term pains but long-term this is good.
Did you get paid 50 cents to write all these posts?
This was kind of conventional wisdom ("fine tune only when absolutely necessary for your domain", "fine-tuning hurts factuality"), but some recent research (some of which they cite) has actually quantitatively shown that RAG is much preferable to FT for adding domain-specific knowledge to an LLM:
But "knowledge injection" is still pretty narrow to me. Here's an example of a very simple but extremely valuable usecase - taking a model that was trained on language+code and finetuning it on a text-to-DSL task, where the DSL is a custom one you created (and thus isn't in the training data). I would consider that close to infeasible if your only tool is a RAG hammer, but it's a very powerful way to leverage LLMs.
This is exactly (one of) our use cases at Eraser - taking code or natural language and producing diagram-as-code DSL.
As with other situations that want a custom DSL, our syntax has its own quirks and details, but is similar enough to e.g. Mermaid that we are able to produce valid syntax pretty easily.
What we've found harder is controlling for edge cases about how to build proper diagrams.
Agree that your use-case is different. The papers above are dealing mostly with adding a domain-specific textual corpus, still answering questions in prose.
"Teaching" the LLM an entirely new language (like a DSL) might actually need fine-tuning, but you can probably build a pretty decent first-cut of your system with n-shot prompts, then fine-tune to get the accuracy higher.
#1 motivation for RAG: you want to use the LLM to provide answers about a specific domain. You want to not depend on the LLM's "world knowledge" (what was in its training data), either because your domain knowledge is in a private corpus, or because your domain's knowledge has shifted since the LLM was trained.
Lately "RAG" has also broadened to include mixing in real-time data from tools or RPC calls, e.g. getting data specific to the user issuing the query (their orders, history, etc.) and adding that to the context.
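In code, the whole pattern is just context assembly; something like this sketch, where retrieve() and fetch_orders() are hypothetical stand-ins for whatever vector store and RPC layer you actually have:

    # Sketch of RAG with real-time, user-specific data mixed in.
    # retrieve() and fetch_orders() are hypothetical placeholders.
    def retrieve(query: str, k: int = 5) -> list[str]:
        ...  # similarity search over your private corpus

    def fetch_orders(user_id: str) -> list[str]:
        ...  # real-time RPC for this user's orders/history

    def build_context(query: str, user_id: str) -> str:
        docs = retrieve(query)
        orders = fetch_orders(user_id)
        return ("Answer using only the context below.\n\n"
                "Documents:\n" + "\n".join(docs) + "\n\n"
                "Customer orders:\n" + "\n".join(orders) + "\n\n"
                "Question: " + query)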
So will very large context windows (1M tokens!) "kill RAG"?
- at the simple end of the app complexity spectrum: when you're spinning up a prototype or your "corpus" is not very large, yes -- you can skip the complexity of RAG and just dump everything into the window.
- but there are always more complex use-cases that will want to shape the answer by limiting what they put into the context window.
- cost -- filling up a significant fraction of a 1M window is expensive, both in terms of money and latency. So at scale, you'll want to filter down to the relevant info and RAG it in, rather than indiscriminately dump everything into the window (rough numbers sketched below).
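Rough numbers to make the cost point (the per-token price is an assumption for illustration, not any provider's actual rate):

    # Back-of-envelope: fill-the-1M-window vs. a RAG-trimmed context.
    PRICE_PER_INPUT_TOKEN = 5e-6          # assumed: $5 per 1M input tokens

    full_window = 1_000_000 * PRICE_PER_INPUT_TOKEN   # $5.00 per query
    rag_context = 8_000 * PRICE_PER_INPUT_TOKEN       # $0.04 per query

    print(f"full window: ${full_window:.2f}/query, RAG: ${rag_context:.2f}/query")
    # At 100k queries/day that's ~$500k/day vs ~$4k/day, before even counting
    # the latency of pushing 1M tokens through the model.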
We're getting large context windows, but so long as pricing is by the input token, the 'throw everything into the context window' path isn't viable. That pricing model, and the context window limits, are a consequence of the quadratic cost of transformers though, and whatever the big context models like Gemini 1.5 are doing must have an (undisclosed) workaround.
What needs to happen is a way to cheaply suspend and rehydrate the memory state of the forward pass after you've fed it a lot of tokens.
That would be a sort of light-weight/flexible/easily modifiable/versionable/real-time-editable alternative to fine tuning.
It's readily doable with the open-weights LLMs, but none of them (yet) have the context length to make it really worthwhile (some of the coding LLMs have long context windows, but that doesn't solve the 'knowledge base' scenario).
From a hosting perspective, if fine-tunes are like VMs, such frozen overlays are like Docker containers: many versions can live on the same server, sharing the base model and differing only in the overlay layer.
(a startup idea? who wants to collaborate on a proof of concept?)
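In case it helps anyone poking at this: with the Hugging Face transformers API you can already snapshot the KV cache from a forward pass and feed it back in later. This is only a sketch of the idea, not a production recipe (the cache format varies by library version -- older versions return a tuple of tensors, newer ones a Cache object -- and the saved state gets huge for long contexts):

    # Sketch: "suspend and rehydrate" the forward-pass memory (KV cache).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # stand-in; any causal LM works the same way
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    # 1) Pay the cost of ingesting the "knowledge base" once.
    corpus = tok("...your knowledge base text...", return_tensors="pt")
    with torch.no_grad():
        out = model(**corpus, use_cache=True)
    torch.save(out.past_key_values, "corpus_kv.pt")  # the frozen "overlay"

    # 2) Later, possibly on another replica: rehydrate and continue from there.
    kv = torch.load("corpus_kv.pt")
    question = tok(" Q: what does section 3 say?", return_tensors="pt")
    with torch.no_grad():
        out2 = model(question.input_ids, past_key_values=kv, use_cache=True)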
When you describe the overlay layer, that sounds similar to the idea of low-rank adaptation (LoRA). LoRA is kind of like finetuning, but it doesn't update every parameter; instead it adds a relatively small number of parameters and finetunes those.
Am I understanding the VM/container analogy you're describing correctly?
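(Roughly, those extra parameters are just a pair of low-rank matrices added on top of a frozen weight; a minimal sketch with made-up sizes, not any particular library's API:)

    # Sketch of the LoRA idea: frozen base weight W plus a small low-rank
    # overlay (A, B) that is trained and can be stored/shipped separately.
    import torch

    d, r = 4096, 8                # hidden size, LoRA rank (illustrative)
    W = torch.randn(d, d)         # frozen base weight: d*d = ~16.8M params
    A = torch.randn(r, d) * 0.01  # trainable
    B = torch.zeros(d, r)         # trainable, zero-init so the overlay starts as a no-op
    alpha = 16.0

    def forward(x: torch.Tensor) -> torch.Tensor:
        # Equivalent to using W + (alpha/r) * (B @ A) as the weight, but W is
        # never touched -- the (A, B) overlay can be swapped out at will.
        return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

    y = forward(torch.randn(1, d))  # A and B together are only 2*d*r = 65,536 params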
Yup. I guess LoRA counts as fine tuning. Except I've never seen inference engines where they actually let you take the base model and the LoRA parameters as separate inputs (maybe it exists and I just haven't seen it). Instead, they bake the LoRA part into the bigger tensors as the final step of the fine tune. That makes sense in terms of making inference faster, but prevents the scenario where a host can just run the base model with any finetune you like, maybe switching them mid-conversation. Instead, if you want to host a fine-tuned model, you take the tensor blob and run a separate instance of the inference program on it.
Incidentally, this is the one place where OpenAI and Azure pricing differs; OpenAI just charges you a big per-token premium for fine-tuned 3.5, and Azure charges you for the server to host the custom model. Likewise, the hosts for the open-weights models will charge you more to run your fine-tuned model than a standard model, even though it's almost the same amount of GPU cycles, just because it needs to run on a separate server that won't be shared by multiple customers; that wouldn't be necessary if overlays were separated.
I wouldn't be surprised if GPT-4's rumored mixture of many models does something like this overlay management internally.
Great post. This exact limitation of web LLMs is why I'm leaning strongly towards local models for the easier stuff. Prompt caching can dramatically speed up fixed tasks.
But frontier models are just too damn good and convenient, so I don't think it's possible to fully get away from web LLMs.
Thanks, this is how I view it, too: there will always be relevant context that was unavailable at training, eg because it didn't exist yet, because it doesn't belong to the trainers, or because it wasn't yet known to be relevant.
One of these will remain true until every person has their own pet model which is fine-tuned, on keyup, on all public data and their own personal data. Still, something heinously parametric (like regional weather on some arbitrary date) I struggle to imagine fitting into a transformer.
Here's a question you can ask yourself: "where does my context fall within the distribution of human knowledge?" RAG is increasingly necessary as your context moves towards the tail.
In addition to what you’ve shared, I find RAG to be useful for cases where the LLM has the world knowledge (say it knows how to write JavaScript) but I want it to follow a certain style or dependencies (e.g. use function definitions vs function expressions, newest vs ES6, etc.). From what I’ve heard, it’s still cheaper/more performant to feed everything into the context than to finetune models.
Yeah, I'd say cost is the biggest thing. Why doesn't everyone just use GPT-4 for everything, or Gemini Ultra + RAG with all documents in the RAG system and the best embedding model?
Among other things, because it's way too expensive; narrowing your scope cuts huge costs and isn't hard to do at a high level.
They're giving the lie to the idea that you need bleeding-edge hardware for performance.
Five-year-old silicon (14 nm!!) and no HBM.
Their secret sauce seems to be an ahead-of-time compiler that statically lays out the entire computation, enabling zero contention at runtime. Basically, they stamp out all non-determinism.
It's not really a lie though. They require 20x more chips than Nvidia GPUs (storing all the weights in SRAM instead of HBM is expensive!) for a ~2x speed increase. Overall, the power cost is higher for Groq than for GPUs.
Cost-effective in what sense? Groq doesn't achieve high efficiency, only low latency, and that's not done in a cost-effective way. Compare SambaNova achieving the same performance with 8 chips instead of 568, and at higher precision.
Most important, even ignoring latency, is throughput (tokens) per $$$. And according to their own benchmark [1] (famous last words :)), they're quite cost efficient.
From skimming the link above, it seems like they accepted that it's extremely difficult (maybe impossible) to generate high ILP from a VLIW compiler *on complex hardware* (what Itanium tried to do).
So they attacked the italicized portion and simplified the hardware. Mostly by eliminating memory-layer non-determinism / using time-sync'd global memory instructions as part of the ISA(?).
This apparently reduced the difficulty of the compiler problem to something manageable (but no doubt still "fun")... and voila, performance.
The problem set Groq is restricted to (known-size tensor manipulation) lends itself to much easier solutions than full-blown ILP. The general problem of compiling arbitrary Turing-complete algorithms to the arch is NP-hard. Tensor manipulation... that's a different story.
I wonder if the use of eDRAM (https://en.wikipedia.org/wiki/EDRAM), which is essentially embedding DRAM into a chip made on a logic process, would be a good idea here.
eDRAM is essentially a tradeoff between SRAM and DRAM, offering much greater density than SRAM at the cost of somewhat worse throughput and latency.
There were a couple of POWER CPUs that used eDRAM as L3 cache, but it seems to have fallen out of favor.
The cited paper seems to be really bending over backwards to find some trace of bias. Unwinnable game for LLMs.
E.g. cited work claims "LLMs assign significantly less prestigious jobs to speakers of African American English... compared to Standardized American English". You don't say! Formal/business language has higher association with prestigious jobs than informal/street/urban language. How is that even classified as "bias"?
> Formal/business language has higher association with prestigious jobs than informal/street/urban language. How is that even classified as "bias"?
Judging people's fitness for a job based on the dialect of English that they speak is by definition a form of bias. That it's bias that is also reflected in the workplace pre-LLMs doesn't make it not bias.
If most of my employees can’t understand the way someone speaks or writes, it’s likely that person will have problems communicating ideas within the organization.
Also, by definition where? You’re just making this up; there isn’t an authoritative dictionary or even a regulation that states judging fitness for a job based on the dialect of English a person speaks is “bias.” If there were, surely you would have provided a link instead of just asserting your own correctness.
The high prestige jobs the LLMs associated with SAE mostly require higher education, usually post-graduate.
Perhaps the model is only accurately learning that more educated people use a particular dialect of English taught by universities.
There's really no way to tell, because the researchers didn't include other dialects of English that aren't favored by universities, like southern, or Yorkshire.
Alright, but I think their point is "bias as reflected in the real world." There's a bias against hiring people that don't speak English at all for English speaking jobs, I don't think anyone of sound mind would call that xenophobia. Critically what we are trying to find out with the LLMs is if there is bias beyond social acceptability or beyond job requirements. If you wouldn't hire someone for a job that said "sup dog" upon meeting you in an interview, why should the same bias indicate something nefarious for an LLM?
If people speaking a certain dialect tend not to work in a particular role, an LLM is only being accurate in deprioritizing that occupation when guessing what they might do.
It does make it not bias when an LLM is accurately representing reality.
I went to a 95% black high school. I recall my English teachers stressing avoidance of AAE (then called Ebonics) in professional settings.
Speak however you wish with your friends, they said, but use Standardized American English for the job interview. Apparently this goes double for LLMs.
They even made us write the same paper twice, once in standard English and again in AAE, so the kids would know the difference.
This is good advice for students who speak native AAE because there exists bias against the dialect. It's better for their own, personal life trajectory to adapt to the way that the world will judge their dialect.
However, that doesn't make it okay to continue to perpetuate the idea that AAE is somehow a lesser dialect and that AAE speakers ought to have to mask their accent in the way that an Indian, Brit, or Australian doesn't.
We can simultaneously teach students how to navigate the dangers of the world they live in while trying to fix said dangers for future generations.
>However, that doesn't make it okay to continue to perpetuate the idea that AAE is somehow a lesser dialect and that AAE speakers ought to have to mask their accent in the way that an Indian, Brit, or Australian doesn't.
As a Brit living in Britain, I habitually code-switch, because my natural dialect is coded as low-status. If I used my natural dialect in a job interview, it would very clearly communicate one of two things - either I am unwilling to conform to the behavioural norms of a professional workplace, or I lack the linguistic skills to do so.
I cannot pretend to understand the cultural context surrounding AAVE, but the prejudice against my own dialect is broadly rational. Learning to speak in mildly-accented standard English is just one of many shibboleths that signify membership of the professional middle class.
Can I ask what British dialect is associated with lower class? I was not aware there were class-indicating dialects besides AAVE to be honest. To me, anything British sounds higher class if anything
Makes me wonder if we have them in Dutch as well. I guess simply sounding like a foreigner, or making mistakes about word gender or such, will make you stand out as not knowing the language properly (even if you do and merely haven't got the pronunciation down), but I wouldn't know of anyone who grew up speaking Dutch in a native way who subsequently sounds lower class. The Belgian Flemish and southern Limburgians sound funny to most people, but it's not a lower class, just a region-of-origin indicator
Any strong regional accent is coded as working class. The upper class speak Received Pronunciation, which is broadly viewed as a "standard" accent and was once the only accent allowed on the BBC. The middle class will avoid using dialect and significantly soften their accent, shifting towards RP. There has been a long-standing debate about whether aspirational members of the working class should moderate their accents in order to advance professionally, but the reality is that most feel obliged to do so. Accent bias is still recognised as a major barrier to social mobility.
Not OP but in Southeastern England obviously Cockney, Estuary English, or Multicultural London English are considered lower prestige dialects whereas RP is high prestige. Other regions may have a prestige version of the local dialect that incorporates some RP features while retaining some local features, but broadly speaking, regional speech patterns have lower prestige because they tend to be more prominent among people with less education.
There are dialects that are associated with wealth and education that are still discouraged in the business world for various practical reasons. It certainly is a penalty to have a dialect and many educated parents will try to school their kids to speak differently. Not as a replacement, but as an alternative.
I doubt you can remove the preference due to the advantages. There is a force for consolidation, just as there is with English as a whole for example.
This isn't to agree or disagree with other points you might be making, but one thing I learned not too long ago is that AAE isn't quite "street" or "urban" English. Like other dialects of English (say, Scottish), it's a dialect of its own, with its own interesting rules, nuances, and pronunciations. A very short explanation of it is here, but you can find more if you're interested: https://www.youtube.com/watch?v=zw4pD4DNOHc
They’re pretending that an LLM recognizing patterns that exist in reality reflects a problem in its training data. So much of the conversation around “reducing bias” really means intentionally introducing biased data to “socially engineer” LLMs (and by proxy their users) so that they have a less accurate picture of reality.