Field experimental evidence of AI on knowledge worker productivity and quality (oneusefulthing.org)
139 points by CharlesW 8 months ago | 105 comments



If you don't want to read the full thing, Mollick's Substack has a more accessible summary of this paper:

https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the...

There are two sides to this thing. The first author of this paper has also written another paper titled "Falling Asleep at the Wheel: Human/AI Collaboration in a Field Experiment". Abstract:

"As AI quality increases, humans have fewer incentives to exert effort and remain attentive, allowing the AI to substitute, rather than augment their performance... I found that subjects with higher quality AI were less accurate in their assessments of job applications than subjects with lower quality AI. On average, recruiters receiving lower quality AI exerted more effort and spent more time evaluating the resumes, and were less likely to automatically select the AI-recommended candidate. The recruiters collaborating with low-quality AI learned to interact better with their assigned AI and improved their performance. Crucially, these effects were driven by more experienced recruiters. Overall, the results show that maximizing human/AI performance may require lower quality AI, depending on the effort, learning, and skillset of the humans involved."

https://static1.squarespace.com/static/604b23e38c22a96e9c788...


While this is great information, I wish more could be said about Centaurs & Cyborgs, the title of the piece.

There's a brief example of a type of task that each approach excels at, but I'd definitely like more. The crux of this piece is that our human judgment of when/how to use AI tools is the most important factor in work quality when we're on the edge of the frontier. But there's no analysis of whether centaurs or cyborgs did better at those outside-the-frontier tasks; it's not clear why those categories are even mentioned since they appear to have no relevance to the preceding research results. And as the article mentions, the frontier is "invisible" (or at least difficult to see); learning to detect tasks that are in, on, or outside of the frontier seems like an immensely important skill. (I also realize it may become completely obsolete in <5 years as the frontier expands exponentially.)

I understand the goal of this research was not to find these "edges" or to determine how we can improve our judgment about when and how to use AI. But after reading these strong results, that's definitely the only thing I'm interested in. I use AI almost not at all in my work (web development). It has been most useful to me in getting unstuck from a thorny, under-documented problem. Over a year after ChatGPT's release, I still don't know whether modern LLMs can actually be a force multiplier for my work, or whether I'm correctly judging that they're not appropriate for the majority of my tasks. The latter seems increasingly unlikely as capabilities advance.


I think this is a fascinating, yet unsurprising, finding. We've known for a long time that automation leads to skillset atrophy on the part of the user. Perhaps most famously in aviation, where some automated systems make complicated procedures so easy that pilots effectively lose the skills to do the proper procedure themselves.

The discussion I hope is happening concerns intentional capability limits on AI in critical industries/areas where humans must absolutely and always be able to intervene. It seems analogous to having a superpower without knowing how to control it, creating a potentially deadly situation.

Then, perhaps longer term at the generational scale, will humans cognitively devolve to the point where AI is making all of our decisions for us? Essentially we make ourselves obsolete and eventually get reclassified as a "regular/inferior" species of animal. Profit and return on investment cannot be allowed to get in the way of handling these details with great oversight and responsibility.


That makes an interesting argument for only allowing full self-driving cars on pre-approved, high-assurance routes.

It either needs to be 100%, or it's bad.

Anywhere in the grey zone, and you retard human performance while still occasionally needing it.


This is a particularly relevant use-case for sure. I think we will see a huge skillset disparity between those of us who grew up driving a regular car versus the incoming generations who will grow up with cars that drive themselves. Whether that ends up being a big deal or not, it’s hard to say.


For some more reading material on this subject, see Ironies of Automation[1].

[1] https://ckrybus.com/static/papers/Bainbridge_1983_Automatica...


If you don't need a skill, you won't practice it, and it will atrophy. This frees you to learn different things. Like how most software engineers today don't know computer engineering (hardware) but they can write more complex applications.


But that is the exact problem with AI - 99% of the time it's good, so your skills atrophy, but then there's that one percent of the time it decides the reflective side of that 18-wheeler is the sun and drives you into it. That's very different from, say, a compiler, which as a programmer I completely trust to turn my source code into machine code accurately.


You're just mirroring the parent example. Prior to AI (or even computers), this was a problem.

I used to be a more conventional engineering guy. We do a ton of calculus to get our degree. At my first job you just didn't need calculus (although all the associated courses in school used it to develop the theory). As a result, on the one-offs (less than once a year) when a problem did come up where calculus knowledge was helpful, they couldn't do the task and would go to the one guy (me) who still remembered calculus.

BTW, I guarantee all the employees there got A's on their calculus courses.


> You're just mirroring the parent example. Prior to AI (or even computers), this was a problem.

Not at all, and your argument isn't the same thing.

The problem with current AI, which is very different from past forms of automation, is that, essentially, its failure mode is undefined. Past forms of automation, for example, did not "hallucinate" seemingly correct but false answers. Even taking your own example, it was obvious to your colleagues that they didn't know how to get the answer. So they did the correct thing - they went to someone who knew how to do it. What they did not do, and which is what most current AI solutions will do, is just "wing it" with a solution that is wrong but looks correct to those without more expertise.


“Since when is failure mode being undefined a new thing?”

- C programmer


This is the same bad argument made in a sibling comment.

Undefined behavior in C is actually strictly defined: if you want to avoid undefined behavior, don't do stuff the spec specifically says results in undefined behavior, like dereference a null pointer.

Writing up a set of rules to guard against undefined behavior (which is a primary job of a C developer) is simply not possible with an LLM.


> completely trust to turn my source code into machine code accurately

Say hello to undefined behavior :)

Appropriately enough, much of the reason undefined behavior is a consistent problem is exactly the same—it can do something reasonable 99% of the time and then totally screw you over out of nowhere.


Undefined behavior is a defined thing: you know about it, so you can avoid it if you want, or look up what the compiler you use does in that case and see whether that is useful to you. There is no such thing for a black-box AI model.


This is a nice, tight statement of the big picture.

This isn't new with LLMs, either. In the aviation community, and probably many other domains, there has been a running debate for decades about the pros and cons of introducing more automation into the cockpit, and about what still should or shouldn't be taught and practiced in training (to be able to effectively recognize and handle the edge cases where things fail).

And in military aviation, this isn't just a question of safety; it's about productivity just like it is for knowledge workers. If an aircraft can fly itself most of the time, the crew can do more tactical decision-making and use more sensors and weapons. Instead of just flying their own aircraft, the crew can also control a number of unmanned aircraft with which they're teamed up.


Depends on what you mean by computer engineering, but they should definitely have a good mental model of how computers work. Otherwise your complex applications will have trouble scaling. That can matter less in many cases, but early architecture mistakes can easily put you down an evolutionary dead end, forcing you into exponentially rising maintenance costs to add new features / fix bugs, or a rewrite by someone who understands computer architecture better (and can take advantage of a better understanding of the problem domain).


I see what you mean, but there's a huge gap between a "good enough" mental model of how computers work and the actual in-depth details that GP means and that have been successfully abstracted away from us. For example: the number of cases in which I have needed to know how the transistors in RAM chips are laid out has been extremely small. Nor have I ever had to worry about the encoding scheme used by the transceivers for the optical fibers that my TCP packets move across.

Obviously there are jobs out there that do have to worry about these things (mostly at companies like TSMC and Cisco), but in practice I doubt regular engineers at even the largest-scale software companies have to worry about stuff like that.


That is if the AI fully replaces the need for a skill. What happens if it mostly replaces it, enough so the skill atrophies, but not enough that the user never needs the skill?


Good point. That sounds like a temporary product problem, like a level 2-3 self-driving car, which still needs you to pay attention.

I think you also need to consider whether you can afford not to have the skill; what would happen if the AI were taken away, or malfunctioned? Airplane pilots are an example of this. In that case you simply have to learn the skill as if the AI will not be there.


> That sounds like a temporary product problem

The jury is still out on how "temporary" of a problem that will be. Improvements have been made but nobody really knows how to get an LLM, for example, to just say "I don't know".


And this is before people start poisoning the well. Google search worked better in a world without Google search. I suspect many AI systems will run into similar issues.


> Google search worked better in a world without Google search.

Oooh, I really like that. Very succinct and topical statement of Goodhart's Law.


Oh, you mean like self-driving cars?


We've changed the URL from that to https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321. Thanks!

(In cases like this, it's nice to have both the readable summary and the link to the paper.)


This study had management consultants use GPT4 on a variety of different management consultant tasks.

Some were more creative in nature, and some were more analytical. For example, ideas on marketing sneakers versus an analysis of store performance across a retailer's portfolio.

In general they found that GPT4 was helpful for creative tasks, but didn't help much (and in fact reduced quality) for analytical questions.

I think these kinds of studies are of limited use. I don't believe raw GPT4 is that helpful in the enterprise. Whether it is useful or not comes down to whether engineers can harness it within a pipeline of content and tools.

For example, when engineers create a system to summarize issues a customer has had from the CRM, that can help a customer service person be more informed. Structured semantic search on a knowledge base can help them also find the right solution to common customer problems.

McKinsey made a retrieval augmented generation system that searched all their research reports and created summaries of content they could use, so a consultant could quickly find prior work that would be relevant on a client project. If they built that correctly, I imagine that is pretty useful.
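
For a sense of what "built correctly" might involve, here's a minimal sketch of that retrieve-then-summarize pattern. The embed and complete callables are stand-ins for whatever embedding and chat APIs a team actually uses, and the report format is my assumption, not anything from the paper or from McKinsey:

    import math
    from typing import Callable, List, Tuple

    def top_k_reports(query: str,
                      reports: List[Tuple[str, str]],       # (title, text) pairs
                      embed: Callable[[str], List[float]],  # placeholder embedding fn
                      k: int = 3) -> List[Tuple[str, str]]:
        """Rank reports by cosine similarity to the query and keep the best k."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a)) or 1.0
            nb = math.sqrt(sum(y * y for y in b)) or 1.0
            return dot / (na * nb)
        q = embed(query)
        return sorted(reports, key=lambda r: cos(embed(r[1]), q), reverse=True)[:k]

    def summarize_prior_work(query, reports, embed, complete) -> str:
        """Stuff the most relevant excerpts into one prompt and ask for a summary."""
        context = "\n\n".join(f"## {title}\n{text[:2000]}"
                              for title, text in top_k_reports(query, reports, embed))
        prompt = (f"Client question: {query}\n\n"
                  f"Relevant prior reports:\n{context}\n\n"
                  "Summarize the prior work that applies, citing report titles.")
        return complete(prompt)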

GPT4 alone will especially not be that useful for analytical work. However, developers can make it a semi-capable analyst for some use cases by connecting it to data lakes, describing schemas, and giving it tools to do analysis. Usually this is not a generalist solution, and needs to be built for each company's application.
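
And a rough sketch of the "describe the schema, let it write SQL, give it tools" version for analytical work. Again, complete stands in for whichever LLM client is used, and the schema and guardrail are illustrative assumptions rather than anything from the study:

    import sqlite3
    from typing import Callable

    SCHEMA_NOTES = """
    orders(order_id, store_id, order_date, status, gross_amount)  -- status: completed/canceled/refunded
    stores(store_id, region, sq_footage)
    """

    def answer_with_sql(question: str, db_path: str,
                        complete: Callable[[str], str]) -> str:
        # Step 1: let the model draft a query against the described schema.
        sql = complete(f"Schema:\n{SCHEMA_NOTES}\n"
                       f"Write one SQLite SELECT statement answering: {question}"
                       ).strip().rstrip(";")
        if not sql.lower().startswith("select"):   # crude read-only guardrail
            raise ValueError(f"Refusing to run non-SELECT statement: {sql!r}")
        # Step 2: run it ourselves -- the model never touches the database directly.
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(sql).fetchall()
        # Step 3: hand the rows back so the model can phrase the answer.
        return complete(f"Question: {question}\nSQL: {sql}\nRows: {rows[:50]}\n"
                        "Answer in one or two sentences.")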

Many of the studies so far only look at vanilla GPT4 via ChatGPT, and it seems unlikely that, if LLMs do transform the workplace, a standalone ChatGPT is what it will look like.


This is entirely unsurprising. A language generator is not designed for analytical work. The fact that it codes well is what I think is throwing people. Since code is a form of language, it works well for the language model. The issues with code will come from larger contexts where analytical thought is needed, and there it will break down.


People are trying really hard to make "search" or something like it the killer app for LLMs but from what I've seen, it's translation. Coding in its purest form is a kind of translation, from natural language requirements to more precise computer languages and vice versa. Keeping an LLM properly focused on that task, it's pretty astonishing what it can do. But as you say, assuming it can do anything more than that is a trap.


> Coding in its purest form is a kind of translation, from natural language requirements to more precise computer languages

Unfortunately not so true. A simple way to check this is to look at what the brain is doing. Coding barely triggers the language components of the brain (your usual suspects like Wernicke's and Broca's areas + other miscellaneous things). In fact, what you'll find is that coding triggers the IP1, IP2 and AVI pathways. These are part of the multiple demand network. We call them programming "languages" because we can represent them in a token-based game we call "language", but the act of programming is far from mere language manipulation.


> from natural language requirements to more precise computer languages

This makes sense. Translation to precise code would include more of the logic/planning bits of the brain, since that's the context of the output. I suspect a person fluent in Lojban[1] would show similar areas activating during translation from English to Lojban. I also assume translating code to words, with documentation, would include more language, since that's the context of the result.

That doesn't mean it's not translation. It's just translation to a precise description of intent.

1. https://en.wikipedia.org/wiki/Lojban


The first generation of company-focused LLMs will need to be bespoke. Teams involved with this will learn the necessary abstractions and will build new companies to lower the on-ramp, or join the LLM providers to do the same.


> GPT4 alone will especially not be that useful for analytical work. However, developers can make it a semi-capable analyst for some use cases by connecting it to data lakes, describing schemas, and giving it tools to do analysis. Usually this is not a generalist solution, and needs to be built for each company's application.

I've been thinking about this a lot lately. I'm a data analyst, and all these "give GPT your data warehouse schema, let it generate SQL to answer users' queries" products completely miss the point. An analyst has value as a curator of organizational knowledge, not as a translator from "business" to SQL. Things like knowing that when a product manager asks us for revenue/GMV, we exclude canceled orders but include purchases made with bonus currency or promo codes.

Things like this are not documented; they're decided in meetings and in Slack chats. So my idea is that, to make LLMs truly useful, we'll create "hybrid" programming languages that are half-written by humans, where the part written by LLMs when translating from "human" language is simple enough for them to do reliably. I even made some weekend prototypes with pretty interesting results.
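
As a toy illustration of that hybrid idea (all names and rules here are made up to show the shape, not the commenter's actual prototype): the analyst hand-writes the metric definitions that encode the organizational knowledge, and the model's only job is the easy, reliable part of mapping a question onto one of them.

    from typing import Callable

    # Human-curated definitions: canceled orders excluded, bonus-currency and
    # promo purchases included -- exactly the stuff that only lives in Slack.
    METRICS = {
        "revenue": "SELECT SUM(gross_amount) FROM orders WHERE status <> 'canceled'",
        "active_customers": ("SELECT COUNT(DISTINCT customer_id) FROM orders "
                             "WHERE order_date >= date('now', '-30 day')"),
    }

    def resolve(question: str, complete: Callable[[str], str]) -> str:
        """The model only picks a metric name; the SQL itself stays human-written."""
        name = complete(
            f"Pick the single best metric for this question from {sorted(METRICS)}. "
            f"Question: {question}\nAnswer with the metric name only."
        ).strip()
        if name not in METRICS:
            raise KeyError(f"Model suggested an unknown metric: {name!r}")
        return METRICS[name]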


> I think these kinds of studies are of limited use. I don't believe raw GPT4 is that helpful in the enterprise.

I'm not so sure about this. A lot of work in and for especially large organizations is writing text that often doesn't even get read, or is at most just glanced at. LLMs are great at writing useless text.


That kind of text is valuable as a signaling exercise - the author is willing to claim the text as indicating their actual beliefs. Writing it with AI would actually damage the author's credibility since it's no longer their words - although this shouldn't be a problem if the author is carefully checking the text every single time to make sure it represents their position.

The effect would be even worse the first time the author missed a mistake from the AI and had to walk it back publicly. Nobody's going to trust anything they put in writing for a while after that.


There are genres of text that have little to do with the writer's beliefs and are a kind of formality with contents known by all interested parties beforehand. Sort of elaborate prose versions of "I Agree With the Terms and Conditions" textboxes.


At first I thought this was a flippant joke, but I can actually see the value in saving time writing junk without having to alter existing bureaucratic processes from which nobody gets much value.


I am using ChatGPT to write the generic boilerplate that e.g. many funding applications require: ethics statements or inclusivity policies etc., where the contents are largely predefined and known by everybody but still have to be written in prose instead of e.g. clicking boxes.


Isn't most of science like that too?


There is a lot of boilerplate and fluff in scientific papers that everybody knows is fluff.


I have yet to come across a programming task where ChatGPT saved me time.

For a start, most of my work is about using internal APIs. But never mind. Sometimes I come across a generic programming problem, and ChatGPT gets me to 80% of a solution.

Getting it from there to a 100% solution takes as long as doing it from scratch.

Just my $0.02


Where it saves me time is when using a framework I'm not used to - it more often than not solves my problem (but not by a large stretch - perhaps 60% of the time).

Even more common is getting shell/UNIX commands to do what I want. I don't have all the standard UNIX tools (cut, etc) in my head, and I definitely don't keep all the command line options in my head. GPT4 tells me what I need, and it's much quicker to confirm it via the --help or man pages than craft it myself by reading those pages.


I have. Working in legacy code in a language I'm not familiar with (Pascal), writing installer scripts with Inno Setup (Pascal, again). "Write me a function to find the installer subkeys for an application by DisplayName and Product name for both 64-bit and 32-bit applications in Pascal." Done.

Not something I relished spending days on. Instead I spent thirty seconds and our customers are getting a major update shipped a hell of a lot faster and I can get back to bigger things.

I had it write me a handful of sticky functions. The thought of using a search engine for all of that brought tears to my eyes.

Just searching simple things anymore is gut wrenching.


Yeah, so figuring out how to use a framework seems to crop up quite often. I don't use "big name" frameworks at work so this would be lost on me.


ChatGPT/Bard is good for answering documentation or common syntax questions. For actual programming, you should use a proper AI "copilot" that actually integrates into the context of your IDE, and therefore understands the syntax of your internal APIs to some extent. You will save time copy-pasting and get better results.


Have you considered that it may be a personal skill issue? I'm not saying it is, but people with this opinion never seem to consider that maybe they aren't using it properly or don't get it.


The reverse could also be true.

Some people may not be able to figure out that the code GPT produced is not good, because they lack the skill to review it effectively.


Well, maybe. But it seems to me this somewhat defeats the purpose. If I tell ChatGPT to write me, say, a sorting algorithm (which I'm sure it can do), but asking it _wrongly_ gives me a deceptive-looking lemon of a solution, that's a liability.

Conversely, could you share an example of a nontrivial, practical programming problem ChatGPT can solve, but only if you use it right?


For coding it's mostly Copilot that saves time. Although ChatGPT in some cases as well, but not as reliably or frequently.


Agree, Copilot is a huge speed boost for stubbing things out, auto-generating the dead simple parts, and refactoring. ChatGPT is awesome for rubberducking, pair programming, brainstorming, and figuring out what that thing that you kind of remember is called. We aren't at the point where we can just say "robot, do my work" but it's for sure good enough to take care of the boring stuff and be a sounding board.


A lot of this attitude is seething below the surface these days. It's like LinkedIn: you may not like the thought of it, but you'd better get over it unless you want a lot of opportunities to whiff by you unnoticed. Senior devs have an attitude against using ChatGPT for some reason. Luddites. Use it or lose it.


Maybe the senior devs just don't find it as useful as junior devs do. Or maybe the senior devs are better at picking apart the code being generated and finding the flaws.

I'm a senior dev at a python shop and I almost never find the solutions GPT4 gives in Python to be very good. However, I've been learning some C# lately in my free time and I find those solutions to be pretty good. I have a feeling I find the C# solutions higher quality, because I'm not experienced enough to see the problems.


Haha, I have the exact same experience as you, except with the languages swapped... that feeling you get when you are reading a newspaper article on a subject you are an expert in and suddenly realize how inaccurate everything is.


Oh I'm keen for an unpaid robot to do a lot of my job for me. I just haven't been able to outsource that. Whenever I try, it's more work to fix its errors than to do it from scratch.

I would really like to know what I'm doing wrong (or if everyone else is in the same boat).


You have to be very specific in your prompts. "Senior engineers" are most likely treating it like a Google search from 2005.


"Consultants across the skills distribution benefited significantly from having AI augmentation, with those below the average performance threshold increasing by 43% and those above increasing by 17% compared to their own scores. For a task selected to be outside the frontier, however, consultants using AI were 19 percentage points less likely to produce correct solutions compared to those without AI."

So AI makes consultants faster but worse. As a consultant, this should make my expertise even more valuable.


Seriously, you cherry-picked the paragraph you use to cite your opinion, which is contrary to the finding of the article itself, and completely ignored the first part of the second sentence of your own paragraph, "for a task selected to be outside the frontier" - so, only something beyond the capabilities of AI at this time decreased their correct solutions, by 19 percentage points.

It made below-average people 43% better and above-average people 17% better.

Same paragraph.


You know, I missed that part. Oops.

Looking into the full paper, it looks like the task identified inside the frontier is much more broken down, with 18 numbered steps. The outside the frontier task is two bullet points.

So for clearly scoped tasks, ChatGPT is hands down good. For less well scoped work it reduces quality.

Interestingly, the authors also looked at the amount of text retained from ChatGPT, and found that people who received ChatGPT training ended up retaining a larger portion of the ChatGPT output in their final answers and performed better in the quality assessment.


BCG consultants. Not average people.


' As a consultant, this should make my expertise even more valuable. '

Consultant logic at its finest.


More garbage outputs being produced faster, paired with expertise which makes it credibly sound like more fast garbage will solve any remaining issues, probably _is_ approaching peak consultant value.


Only if that's what the clients want or need to justify actual improvements!

Usually the goal is to produce fewer garbage outputs, more slowly.


The goal isn’t to generate billable hours?


We like to find win-win situations.


Only if being wrong actually matters.


The research was partially conducted by Boston Consulting Group, so in this case being wrong doesn't really matter as long as the Partners are smooth and the associates create billable hours.


> So AI makes consultants faster but worse.

That was just for tasks not suited to GPT-4 (and I assume, LLMs in general).

> "For a task selected to be outside the frontier, however, consultants using AI were 19 percentage points less likely to produce correct solutions compared to those without AI."

For tasks suited to GPT-4, consultants produced 40% higher quality results by leveraging GPT-4.

My take on that is that people who have at least some understanding of how LLMs work will be at a significant advantage, while people who think LLMs "think" will misuse LLMs and get predictably worse results.


> while people who think LLMs "think" will misuse LLMs and get predictably worse results.

I've found that now that I see LLMs less as thinking and more as a stochastic parrot, I get a lot less use from them, because I give them 3-5 turns at most versus 10.


Well, maybe a balance between a stochastic parrot and a "thinking machine" might be the best approach.

You know, everything in moderation and what not.


Yes, ideally.

The problem for me is that it used to be an addition to my energy and confidence, but when you "take the magic out of it", it doesn't help there so much.

In antirez's blog he mentioned how it made things that were "not worth it" worth it, and I agreed, but that happens much less often if you think an LLM helping you is a fluke.


There is a lot of noise from VC types about how AI copilots turn random outsourced devs into 10x engineers. My experience is that it's pretty good at writing boilerplate, that is, it saves me typing what I would have expended little thought in typing anyway. I suspect this is where most of the productivity statistics are coming from. Copilot especially seems prone to hallucinating APIs or suggesting entire functions that are wrong, repetitive, or even dangerous. This is the case even for side projects where I'm working with public Python libraries, which should be the best-case scenario. It's generally better than IDE autocomplete when I'm already familiar with the API and so can catch the errors.

I will say GPT-4 is mostly better than referencing stack-overflow or library documentation. Maybe once that gets fast/cheap enough to use as co-pilot we'll see some of these mythical productivity gains.


I don't think so. I think it will be more like a flare that burns out.

The reason being that the source of material for GPT-4 is rotting underneath. As the internet gets ruined by AI garbage, the quality of output of AI will get worse, not better.

edit: This is a modern day tower of Babel event. We had a magnificent resource and then polluted it with garbage and now it's quickly becoming useless.


This is wrong for at least two reasons:

1) The data scraped from the web is filtered, deduped, ranked, and cleaned in a variety of other ways before being used for training. While quantity is necessary for a model to learn the structure of language, quality is even more important once you have a model that can produce coherent output so a lot of work goes into grooming the data.

2) There have been a bunch of papers and training runs that show synthetic data created specifically for training a model is as good or better than scraped human produced data. The importance of scraped web content is quickly declining because you can now generate infinite higher quality examples using existing trained models. The only relevance it has now is for knowledge about new developments and that is a much easier stream to filter since most of the important stuff comes from official sources and you don't need as many variations since you can just generate your own using one or more examples.


How much time is going to be wasted cleaning the data? It seems like AI can generate junk much faster than we could possibly filter it.


There could be an infinite amount of junk; in fact, the amount of junk available online before LLM output was added to the mix was already incredibly huge. At this point it no longer matters, because you can take 1 or 2 examples of some knowledge you want to train a new model on and feed them to an LLM with instructions to create 10,000 variations on the text while still retaining the factual details. You then filter this list, throw away any output with errors or other garbage, and what you're left with is a good number of high-quality examples of the information you want to add to your training set. Incredibly, there is new research showing even this might not be needed at some point, because there is evidence that transformers can learn from a single example.
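
A skeletal version of that variation-and-filter loop, just to make the shape concrete (complete is a stand-in for an LLM call, and the substring check is a deliberately naive filter; real pipelines use much stronger validators):

    from typing import Callable, List

    def synthesize(seed_text: str,
                   required_facts: List[str],
                   complete: Callable[[str], str],
                   n: int = 100) -> List[str]:
        keep = []
        for i in range(n):
            variant = complete(
                f"Rewrite the following passage in a different style (variation {i}), "
                f"keeping every factual detail intact:\n\n{seed_text}"
            )
            # Throw away any output that lost a fact we care about.
            if all(fact.lower() in variant.lower() for fact in required_facts):
                keep.append(variant)
        return keep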


Fortunately there are now AIs that can help with the data cleaning task.


Seems like this is part of the feedback loop that will render AI useless though.


No, it's an effective technique that has been used by Microsoft Research among many others.


But who monitors the monitor?


At a certain point you just need to take a statistically representative sample of the output for review.


You don't need to clean all the junk created. You just need to clean enough to feed the models.


Infinite?


A bit hyperbolic, I'm sure the different variations are finite but you can produce a very large number of examples with essentially the same information content.


A recurring type of problem I've seen goes like: there's an API function f(x) where x can be any of half a dozen different object types. You want to do something similar to f(x) on a different object, which actually requires a completely different multi-step approach, but the LLM will just try to use f(x).

It's really good at spitting out code in common, familiar patterns, but the failure rate is very high for anything a little unusual.
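
A contrived example of that pattern, with every name hypothetical: the library's convenient entry point accepts several types, but the new case genuinely needs a different multi-step path, which an LLM will often skip in favor of the familiar call.

    import json

    def export(obj):
        """Library helper: happily accepts dicts, lists, or simple objects."""
        if isinstance(obj, (dict, list)):
            return json.dumps(obj)
        if hasattr(obj, "__dict__"):
            return json.dumps(vars(obj))
        raise TypeError(f"export() cannot handle {type(obj).__name__}")

    # What the model tends to suggest for a generator of records:
    #     export(record_stream)        # TypeError at runtime
    #
    # What the situation actually calls for: batch, materialize, then export.
    def export_stream(record_stream, chunk_size=1000):
        chunk = []
        for record in record_stream:
            chunk.append(record)
            if len(chunk) == chunk_size:
                yield export(chunk)
                chunk = []
        if chunk:
            yield export(chunk)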


Over time, the products we build might become more and more generic, so that they can be easily built by AI?


The biggest question I see with this study is that it doesn't seem like subjects had access to tools outside of the AI. Were the subjects without AI able to do Google searches? If not, then what is the performance gain of the AI users over people who can google stuff?


I probably should know but I don't even have the vaguest of ideas. What do consultants actually do?


You're going to get across the board replies here. To each constituent, they do different things. There are consultants across the board in speciality, quality, etc.

On HN, they get a bad rap from the "I'm smart, I know everything" worker-bee crowd, who think business is just slinging JS code and product updates. Consultants are often brought into those people's orgs to deal with, or work around, valuable know-it-alls, fiefdoms, poor vision/strategy, or politics that are holding the company back. They're often cited as "just doing reports", when in fact consultants sometimes stay out of implementation to make sure the owning team owns and implements the solution.

Consultants often provide the hiring manager a shield on dicey projects or risky outcomes. Where if it goes wrong, the manager can say "it was those consultants".

Teams are brought in as trained, skilled, up-to-date staff that the company cannot hire/fire, or shouldn't, due to the duration of the need or the skillset.

Sometimes they're brought in because the politics of the internal company lead to stalemates, poor strategy, etc.

Often at the higher levels, they're brought in due to focusing on a speciality, market, or vertical to have large experience that isn't possible to get in house.

One's experience with consultants frames their opinion. I've only worked with very high quality teams in the past that provide a healthy balance of vision, strategy, implementation, etc


I'm a consultant - for about a year now. Not at BCG or anywhere, I'm a one man company.

What I do is pretty similar to what I did before in my CTO roles, just a bit more strategy and less execution. Clients come to me with questions and problems and I help them solve them based on my experience. If you've got multiple C levels, middle management and a board, you're in a similarly consulting position also as a full time CTO from what I've seen.

I don't go anywhere _near_ LLMs for this, I'd find it somewhere between disrespectful and unethical. They're paying for _my_ expertise. I can imagine it could make a decent alternative to Google and Wikipedia for research purposes, but I'd have to double check it all anyway. I don't see how it'd make my work any easier.


> I don't go anywhere _near_ search engines for this, I'd find it somewhere between disrespectful and unethical. They're paying for _my_ expertise.

I bet someone said this 30 years ago. As it stands today, LLMs can't wholesale do the job for you, they merely augment it. And they are finicky and fallible, so expertise is needed to actually produce useful results and validate them. There are many reasons why an LLM isn't the right tool for a certain task, wanting to drive on some sort of moral ethical high road however is imo _not_ a great reason.

Continue on that road and people who have no qualms about using whatever tool at their disposal will eat your lunch. Maybe not today or tomorrow, but eventually.


The value of advice is largely estimated based on who gives it. If a client is hiring me, it seems clear to me that they're paying for mine.

For support tasks (like research or even supporting a writeup) I'm not ethically opposed to using LLMs. It's just that I try every couple of months and conclude it's not a net improvement. Your mileage may vary of course.


Augmenting your advice with ChatGPT shouldn't be difficult. You could simply bounce your ideas off ChatGPT before meeting a client; that way your work has already been cross-checked. It's not like it would be providing the advice (your product) for you.


Bill. Sometimes hourly, sometimes by project.

They wake up every morning hell-bent on maintaining existing customer accounts, growing existing accounts, and sometimes adding a new account. They maintain a professional demeanor, with just the right kind of smalltalk. Finally, they produce a report that justifies whatever decision the CEO had already made before he engaged the consultants.

That's pretty much the whole job.


Yeah, I've struggled to understand what people mean by "consultant". I once heard that it's basically a person or group you can hire to take the blame for an unpopular decision, and/or to weaponize that decision against others in the company, since it was "someone else" that made the call.


I've mostly dealt with RAND consultants in the military / government. They were brought in to help answer specific questions about what we should do and how, presumably by writing up and delivering a report. These weren't questions that the staff couldn't handle itself, but nobody on staff had the time to focus on answering these questions (by digging deep and doing research, not just expressing a gut opinion) given their other duties and responsibilities. So the RAND people basically augmented the staff.

I guess they have the advantage of experience doing these things over and over again for similar organizations. It's also an outside perspective, which has both advantages and disadvantages. In the conversations that I was invited into, we spent most of our time just explaining things that any mid-level member of the organization would know, trying to get the RAND people up-to-speed, so that seemed wasteful. But the military is a huge bureaucracy dominated by people who've climbed up the ranks for 30-40 years, so there isn't a lot of fresh outside thinking, and it seems like it could be valuable to inject some.


They provide you with a neat document stating that what you wanted to do is the right thing to do.


In simplest terms, they solve problems for companies. If your company wants to achieve something and doesn't want to spend the time/effort/money to set up a new department and staff it with full-time experts, you can call a consulting firm and they will figure it out for you.


Don't worry, it's quite common for people who are busy and important like you not to know this. In our experience that's because the people that work for them are simply not using modern working practices.

It is a problem though, because frankly, although you might think that moving to a more modern way of working is a simple fix, we have found that your employees are actually your enemies and are plotting to kill you and your family. You might think that you depend on them to get work done, but we can fix that for you and get our much more skilled and low paid workers to do the job instead.

You will find out that if you allow us to take the required steps to prevent your family from rape and murder your shareholders will get a bump as well! And of course you will as well, you can expect your bonus to at least triple, and you are worth it for sure.

We have already set up the culling stations filled with whirling blades to deal with the scum downstairs, if you say nothing at all we will act and slaughter them instantly. There is no guilt, only reward.

We love you.


Tell management who to fire and who to outsource


I've been an engineer at FAANG. I think the problems consultants face are simpler than the ones confronting engineers who need to become expert in million-line code bases. There are coding standards, styles, and security practices. I think AI would help when boilerplate is needed, but otherwise it would be a dicier proposition. There is another issue that I think is long-term: the apprenticeship of a top engineer is difficult. If the result of "AI shortcuts" is a less skilled engineering profession, there will be considerable downsides in the future. I have read that people's ability to navigate is negatively impacted by reliance on Google Maps.


I don't want to be rude, but given the average quality of material produced by big consulting companies I have encountered over my career, I'm not that surprised that GPT4 beats them.


The actual paper can be found at: https://www.iab.cl/wp-content/uploads/2023/11/SSRN-id4573321...

Note: after reading the paper, I see serious methodology issues. ChatGPT without any human involvement (simply reading the task as instructions and producing an answer) produced a better "quality" result than any human involvement in many tasks.

My read: ChatGPT produces what we pre-view as "correct".

They even kind of state this: "when human subjects use ChatGPT there is a reduction in the variation in the eventual ideas they produce. This result is perhaps surprising; one would assume that ChatGPT, with its expansive knowledge base, would instead be able to produce many very distinct ideas, compared to human subjects alone."

My read: The humans converge toward what is already viewed as "correct". It's like being in a business where your boss already knows exactly what your boss actually wants, and any variation you produce is automatically "bad" anyway.

There is also a drastic difference in the treatment of the in/out/on "frontier" subject. Lots of talk about how great adding AI is to "inside frontier" tasks, no similar graphs/charts for "outside frontier" tasks. Finally, if these people are consultants and business profs, they produce some crazily bad charts. Figure 3 in the appendix is so difficult to read.


I was underwhelmed when using it for code generation of things like tests.


Note this paper was published in September 2023.


Can someone explain Centaurs vs Cyborgs? (I didn’t make an account to read the full paper…)


I just started a new job and used chatgpt to create some cool Excel automation with Python.


From the article: "consultants using AI were significantly more productive (they completed 12.2% more tasks on average, and completed tasks 25.1% more quickly), and produced significantly higher quality results (more than 40% higher quality compared to a control group)."



