_m9r2 · 2024-11-25T18:50:19 1732560619

Who cares. Everyone should collectively turn their websites off in the EU, so that they can continue to suffer in mediocrity. The EU doesn’t have to deal with their own laws because they don’t innovate or produce anything.

mnau · 2024-11-25T19:43:06 1732563786

I get the "Our content is not available in EU" more and more often. 16%-14% of world's GDP and sinking fast.

_m9r2 · 2024-10-31T00:21:19 1730334079

Can we move on to the next grift yet?

_m9r2 · 2024-10-31T00:04:34 1730333074

If you give an LLM a word problem that involves the same math and change the names of the people in the word problem the LLM will likely generate different mathematical results. Without any knowledge of how any of this works, that seems pretty damning of the fact that LLMs do not reason. They are predictive text models. That’s it.
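
One way to test the name-swap claim rather than assert it is to generate the same problem with different names and compare the answers. A minimal sketch: `ask_model` is a hypothetical callable standing in for whatever model you query (not a real API), and `fake_model` is a stand-in that only parses the numbers, included so the example runs on its own.

```python
from typing import Callable, List, Tuple

# Illustrative template; only the names vary, the arithmetic stays fixed.
TEMPLATE = "{a} has {x} apples and gives {y} to {b}. How many apples does {a} have left?"

def make_variants(names: List[Tuple[str, str]], x: int, y: int) -> List[str]:
    """Build the same word problem once per name pair."""
    return [TEMPLATE.format(a=a, b=b, x=x, y=y) for a, b in names]

def answers_agree(ask_model: Callable[[str], int],
                  names: List[Tuple[str, str]], x: int, y: int) -> bool:
    """True if the model gives the correct answer for every name variant."""
    expected = x - y
    return all(ask_model(q) == expected for q in make_variants(names, x, y))

def fake_model(question: str) -> int:
    """Stand-in for a real model (assumption): reads the two numbers and subtracts."""
    nums = [int(tok) for tok in question.split() if tok.isdigit()]
    return nums[0] - nums[1]

pairs = [("Alice", "Bob"), ("Matthew", "Mary"), ("Xiu", "Okonkwo")]
print(answers_agree(fake_model, pairs, 12, 5))
```

Swapping `fake_model` for a real model call, and running many templates instead of one, is roughly what a systematic study of this claim would need.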

alexwebb2 · 2024-10-31T00:10:10 1730333410

Demonstrably false.

https://chatgpt.com/share/6722ca8a-6c80-800d-89b9-be40874c5b...

https://chatgpt.com/share/6722ca97-4974-800d-99c2-bb58c60ea6...

TZubiri · 2024-10-31T00:53:45 1730336025

It's worth noting that this may not be the result of a pure LLM; it's possible that ChatGPT is using "actions", explicitly:

1- Run the query through a classifier to figure out if the question involves numbers or math
2- Extract the function and the operands
3- Do the math operation with standard non-LLM mechanisms
4- Feed the solution back to the LLM
5- Concatenate the math answer with the LLM answer with string substitution.

So in a strict sense this is not very representative of the logical capabilities of an LLM.
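
The five steps above can be sketched in a few lines. This illustrates the speculated pipeline only, not how ChatGPT is known to work: the regex "classifier", the `answer` function, and the `llm_reply` template are all hypothetical names invented for this example.

```python
import operator
import re

# Non-LLM arithmetic, keyed by operator symbol (step 3).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

# A crude stand-in for the classifier and extractor (steps 1 and 2):
# it only recognises "<number> <op> <number>".
PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)")

def answer(query: str, llm_reply: str = "The result is {math}.") -> str:
    match = PATTERN.search(query)
    if match is None:
        # No math detected; a real system would hand the query to the LLM as-is.
        return "No arithmetic found; a real system would defer to the LLM here."
    left, op, right = match.groups()
    result = OPS[op](float(left), float(right))  # division by zero is not handled
    # Steps 4 and 5: substitute the computed value into the model's text.
    return llm_reply.format(math=f"{result:g}")

print(answer("What is 12 * 7?"))
```

In this sketch the "LLM" contributes nothing but the sentence template, which is the point of the objection: a correct answer from such a pipeline says little about the model's own arithmetic.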

digging · 2024-10-31T15:06:54 1730387214

Then what's the point of ever talking about LLM capabilities again? We've already hooked them up to other tools.

This confusion was introduced at the top of the thread. If the argument is "LLMs plus tooling can't do X," the argument is wrong. If the argument is "LLMs alone can't do X," the argument is worthless. In fact, if the argument is that binary at all, it's a bad argument and we should laugh it out of the room; the idea that a lay person uninvolved with LLM research or development could make such an assertion is absurd.

thomashop · 2024-10-31T04:04:01 1730347441

It shows you when it's calling functions. I also did the same test with Llama, which runs locally and cannot access function calls and it works.

TZubiri · 2024-10-31T06:27:42 1730356062

You are right I actually downloaded Llama to do more detailed tests. God bless Stallman.

astrange · 2024-10-31T08:25:57 1730363157

Minor edits to well-known problems do easily fool current models, though. Here's one that 4o and o1-mini fail on, but o1-preview passes. (It's the mother/surgeon riddle, so kinda gory.)

https://chatgpt.com/share/6723477e-6e38-8000-8b7e-73a3abb652...

https://chatgpt.com/share/6723478c-1e08-8000-adda-3a378029b4...

https://chatgpt.com/share/67234772-0ebc-8000-a54a-b597be3a1f...

_flux · 2024-10-31T08:55:17 1730364917

I think you didn't use the "share" function; I cannot open any of these links. Can you do it in a private browser session (so you're not logged in)?

astrange · 2024-10-31T09:03:03 1730365383

Oops, fixed the links.

mini's answer is correct, but then it forgets that fathers are male in the next sentence.

SequoiaHope · 2024-10-31T02:01:33 1730340093

At this point I really only take rigorous research papers into account when considering this stuff. Apple published research just this month that the parent post is referring to. A systematic study is far more compelling than an anecdote.

https://machinelearning.apple.com/research/gsm-symbolic

og_kalu · 2024-10-31T02:11:24 1730340684

That study shows 4o, o1-mini and o1-preview's new scores are all within the margin of error on 4/5 of their new benchmarks (some even see increases). The one that isn't involves changing more than names.

Changing names does not affect the performance of Sota models.

gruez · 2024-10-31T02:18:55 1730341135

>That study very clearly shows 4o, o1-mini and o1-preview's new scores are all within margin error on 4/5 of their new benchmarks.

Which figure are you referring to? For instance figure 8a shows a -32.0% accuracy drop when an insignificant change was added to the question. It's unclear how that's "within the margin of error" or "Changing names does not affect the performance of Sota models".

og_kalu · 2024-10-31T02:29:31 1730341771

Table 1 in the Appendix. GSM-No-op is the one benchmark that sees significant drops for those 4 models as well (with preview dropping the least at -17%). No-op adds "seemingly relevant but ultimately inconsequential statements". So "change names, performance drops" is decidedly false for today's state of the art.

gruez · 2024-10-31T02:39:37 1730342377

Thanks. I wrongly focused on the headline result of the paper rather than the specific claim in the comment chain about "changing name, different results".

SequoiaHope · 2024-10-31T07:24:10 1730359450

Ah, that’s a good point thanks for the correction.

zmgsabst · 2024-10-31T14:47:56 1730386076

Only if there isn’t a systemic fault, eg bad prompting.

Their errors appear to disappear when you correctly set the context from conversational to adversarial testing — and Apple is actually testing the social context and not its ability to reason.

I’m just waiting for Apple to release their GSM-NoOp dataset to validate that; preliminary testing shows it’s the case, but we’d prefer to use the same dataset so it’s an apples-to-apples comparison. (They claim it will be released “soon”.)

gruez · 2024-10-31T02:12:32 1730340752

To be fair, the claim wasn't that it always produced the wrong answer, just that there exist circumstances where it does. A pair of examples where it was correct hardly justifies a "demonstrably false" response.

thomashop · 2024-10-31T07:30:28 1730359828

Conversely, a pair of examples where it was incorrect hardly justifies the opposite response.

If you want a more scientific answer there is this recent paper: https://machinelearning.apple.com/research/gsm-symbolic

EraYaN · 2024-10-31T10:28:38 1730370518

It kind of does though, because it means you can never trust the output to be correct. The error is a much bigger deal than it being correct in a specific case.

thomashop · 2024-10-31T14:10:35 1730383835

You can never trust the outputs of humans to be correct but we find ways of verifying and correcting mistakes. The same extra layer is needed for LLMs.

digging · 2024-10-31T15:09:18 1730387358

> It kind of does though, because it means you can never trust the output to be correct.

Maybe some HN commenters will finally learn the value of uncertainty then.

jklinger410 · 2024-10-31T00:28:28 1730334508

This is the kind of comment you make when your experience with LLMs is through memes.

Workaccount2 · 2024-10-31T00:21:45 1730334105

This is a relatively trivial task for current top models.

More challenging are unconventional story structures, like: a mom named Matthew has a son named Mary and a daughter named William. Who is Matthew's daughter?

But even these can still be done by the best models. And it is very unlikely there is much if any training data that's like this.

alexwebb2 · 2024-10-31T00:38:33 1730335113

That's a neat example problem, thanks for sharing!

For anyone curious: https://chatgpt.com/share/6722d130-8ce4-800d-bf7e-c1891dfdf7...

> Based on traditional naming conventions, it seems that the names might have been switched in this scenario. However, based purely on your setup:

>

> Matthew has a daughter named William and a son named Mary.

>

> So, Matthew's daughter is William.

rileymat2 · 2024-10-31T03:15:43 1730344543

How do people fare on unconventional structures? I am reminded of that old riddle involving the mother being the doctor after a car crash.

adwn · 2024-10-31T07:31:04 1730359864

No idea why you've been downvoted, because that's a relevant and true comment. A more complex example would be the Monty Hall problem [1], for which even some very intelligent people will intuitively give the wrong answer, whereas symbolic reasoning (or Monte Carlo simulations) leads to the right conclusion.

[1] https://en.wikipedia.org/wiki/Monty_Hall_problem
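
For anyone who wants to see the Monte Carlo route in action, here is a minimal sketch. The function names, trial count, and seed are choices made for this example; the rules are the standard ones (the host always opens a goat door that is not the player's pick).

```python
import random

def play(switch: bool, rng: random.Random) -> bool:
    """Play one round; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither the original pick nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch: bool, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the win probability for a fixed strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials

print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

The estimates should land near 1/3 for staying and 2/3 for switching, which matches the symbolic argument: switching wins exactly when the first pick was wrong.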

vanviegen · 2024-10-31T07:34:15 1730360055

And yet, humans, our benchmark for AGI, suffer from similar problems, with our reasoning being heavily influenced by things that should have been unrelated.

https://en.m.wikipedia.org/wiki/Priming_(psychology)

_m9r2 · 2024-10-30T12:55:43 1730292943

I find LLMs to be much worse than the junior engineers I work with. I have tried Copilot and I always end up disabling it because it’s often wrong and annoying. Just read the docs for the things you are using, folks. We don’t need to burn the rainforest for autocomplete.

snowfarthing · 2024-10-30T21:31:28 1730323888

I am currently convinced that the only thing useful about AI so far is that it's reviving nuclear power plants -- granted, they are to power AI engines -- but I hope that they'll be kept running after this AI fad passes!

_m9r2 · 2024-10-30T12:51:02 1730292662

Dog no they don’t lol. My family has been in both ecosystems over the years and by far the iPhones outlast the Android phones they buy both in quality and in updates.

sangnoir · 2024-10-30T17:37:17 1730309837

Counterpoint: only on Android phones can one unlock the bootloader and extend the device life for much longer than intended or supported by the OEM.

_m9r2 · 2024-10-31T00:08:18 1730333298

True, but what percentage of phone owners will actually do this? Less than a fraction of a percent? The remaining 99 percent chuck their phone in a landfill when it gets slow.

kelnos · 2024-10-30T18:02:35 1730311355

Both what you are saying and what the GP is saying can be true at the same time.

_m9r2 · 2024-10-28T12:10:30 1730117430

Yeah, it’s annoying, but the polyfill is like 3 lines.

_m9r2 · 2024-10-26T16:35:05 1729960505

I literally use BigInt on the backend every single day at work lol.

_m9r2 · 2024-10-26T16:31:29 1729960289

I’m really tired of this discourse. The JavaScript ecosystem is the lingua franca of the web. Furthermore, while a segment of the programming community has sat around complaining, JavaScript has gotten really good and continues to improve every passing year. Incremental progress is the key to making progress, not giant paradigm shifts.

righthand · 2024-10-26T17:04:45 1729962285

Well drink a cup of java because it’s not going away.

_m9r2 · 2024-10-26T16:28:07 1729960087

WASM is slower than running JavaScript on V8 in almost all scenarios and will likely continue to be for a very long time. Also, many of us don’t want a compile step.

peutetre · 2024-10-26T17:05:51 1729962351

No, wasm has higher performance and lower memory usage. Here are two practical, real world examples:

https://www.amazon.science/blog/how-prime-video-updates-its-...

https://web.dev/case-studies/google-sheets-wasmgc

zb3 · 2024-10-26T16:31:10 1729960270

While I don't want any compile step either (js should stay), I'm actually confused by your statement.. are there any benchmarks? Are you saying that for example v86 would run faster without wasm?

_m9r2 · 2024-10-26T16:47:11 1729961231

I think that would probably fall outside the norm. My information might be outdated, but I was under the impression that JavaScript usually wins in most algorithm benchmarks because the JIT is so good.

jas39 · 2024-10-26T19:48:50 1729972130

That is a misconception; there is a cost of abstraction, although this cost may disappear if AI gets really smart.

_m9r2 · 2024-10-26T16:22:07 1729959727

The US government needs to fast track breaking up Google asap. Chrome needs to be torn from their festering lich hands, so that the web can be free of their self serving, and frankly bad, proposals.

mdaniel · 2024-10-26T21:01:46 1729976506

Out of curiosity, which of the fragments of Google would you expect to take ownership over that codebase?

I had always imagined that if the DoJ took any action it would be to cleave the ad business away from Google. Although if they went so far as to take action against GCP, I bet Amazon, Amazon Marketplace, and AWS would start to get sweaty palms.