
Neat article. I do wish it mentioned that there are polynomial-time algorithms for solving linear programming problems. According to the Google OR-Tools docs, it has the option to use those as well (though not with the GLOP solver). Might be good for when simplex is struggling (https://developers.google.com/optimization/lp/lp_advanced).
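For anyone curious what switching solvers looks like, here's a minimal sketch using OR-Tools' Python wrapper (pywraplp); which backend names CreateSolver accepts depends on your OR-Tools version and build, so treat "PDLP" below as one example of a non-simplex option rather than a guarantee:

    from ortools.linear_solver import pywraplp

    def solve_tiny_lp(backend):
        # CreateSolver returns None if the requested backend isn't
        # available in this OR-Tools build.
        solver = pywraplp.Solver.CreateSolver(backend)
        if solver is None:
            print(backend, "not available")
            return
        x = solver.NumVar(0.0, 10.0, "x")
        y = solver.NumVar(0.0, 10.0, "y")
        solver.Add(x + y <= 4)
        solver.Maximize(x + 2 * y)
        if solver.Solve() == pywraplp.Solver.OPTIMAL:
            print(backend, x.solution_value(), y.solution_value())

    solve_tiny_lp("GLOP")  # Google's simplex-based LP solver
    solve_tiny_lp("PDLP")  # the first-order solver discussed on that page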


You're right, but it's very subtle and complicated.

In theory, the simplex method is not known to be polynomial-time, and it likely is not: some variants of the simplex method have been proven to take exponential time on worst-case instances (Klee-Minty cubes). What solvers implement could be said to be one such variant ("steepest-edge pricing"), but because solvers have tons of heuristics and engineering, and also because they work in floating-point arithmetic, it's difficult to tell for sure.

In practice, the main alternative is interior-point (a.k.a. barrier) methods which, contrary to the simplex method, are polynomial-time in theory. They are usually (but not always) faster, and their advantage tends to increase for larger instances. The problem is that they are converging numerical algorithms, and with floating-point arithmetic they never quite 100% converge. By contrast, the simplex method is a combinatorial algorithm, and the numerical errors it faces should not accumulate. As a result, good solvers perform "crossover" after interior-point methods, to get a numerically clean optimal solution. Crossover is a combinatorial algorithm, like the simplex method. Unlike the simplex method, though, crossover is polynomial-time in theory (strongly so, even). However, here, theory and practice diverge a bit, and crossover implementations are essentially simplified simplex methods. As a result, in my opinion, calling interior-point + crossover polynomial-time would be a stretch.

Still, for large problems, we can expect interior-point + crossover to be faster than the simplex method, by a factor of 2x to 10x.
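If you want to play with the difference yourself, SciPy's linprog exposes both flavors through HiGHS; a rough sketch (toy problem data, not a real benchmark) comparing a dual simplex run against an interior-point run:

    from scipy.optimize import linprog

    # Maximize x + 2y subject to x + y <= 4, x <= 3, x >= 0, y >= 0,
    # written as minimization of the negated objective.
    c = [-1.0, -2.0]
    A_ub = [[1.0, 1.0], [1.0, 0.0]]
    b_ub = [4.0, 3.0]
    bounds = [(0, None), (0, None)]

    for method in ("highs-ds", "highs-ipm"):  # dual simplex vs. interior point
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method=method)
        print(method, res.x, res.fun)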

There are also first-order methods, which are getting much attention lately. However, in my experience, you should only use them if you are willing to tolerate huge constraint violations in the solution, and wildly suboptimal solutions. Their main use case is when other solvers need too much RAM to solve your instance.


Very interesting! Thanks for the reply. I wonder if they tried these other solvers and decided either that they were too slow because their problems were too small, or that the answers were too inaccurate.


This might not be the right place for this question, but as someone who has made a couple of very modest MPS backend contributions, I'm curious why not add Metal support to Triton (or a fork, if OpenAI won't allow it) rather than maintain a whole separate backend?


Mostly comes down to what's fastest to develop: it's faster to write a few custom kernels than it is to develop a new compiler backend.

Granted, after the upfront effort, compilers are such a significant UX boost that you are indeed making me question why I don't spend more time working on this myself, lol.


It's also been a while since I had to flex my math muscles and they are quite flabby. I recommend reading things like this slowly; they are written by people who know the field forward and backward and aren't supposed to be trivial. If you aren't in the field, you've got to be a bit of an active reader.

For example, I also didn't remember why you needed a determinant in the equation you linked, so I made a simple example and saw that the determinant was there to keep the density normalized (i.e., the integral of the density over its support equals 1).

In this case, the simple example I used was the distribution q0(x) = 1 for 0 <= x <= 1 (zero elsewhere) and the scaling transformation y = ax.
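By the change-of-variables formula, the new density is q1(y) = q0(y/a) * |d(y/a)/dy| = 1/a on [0, a]. A rough numerical sketch (a = 3 is arbitrary) showing that dropping the 1/a Jacobian factor breaks normalization:

    import numpy as np

    a = 3.0
    rng = np.random.default_rng(0)
    y = a * rng.uniform(0.0, 1.0, 1_000_000)  # y = a*x, x ~ q0 = Uniform(0, 1)

    # The integral of a constant density c over [0, a] is just c * a.
    print(1.0 * a)        # q0(y/a) alone integrates to a, not 1
    print((1.0 / a) * a)  # with the 1/a Jacobian factor it integrates to 1

    # Empirical check: a normalized histogram of y sits at height ~1/a.
    hist, _ = np.histogram(y, bins=50, range=(0.0, a), density=True)
    print(hist.mean())  # ~1/a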


People who have not been exposed to math heavy content on a routine basis often have the unrealistic expectation of understanding everything on first read. That almost never happens unless the paper is very close to the reader’s own field of expertise. Denser papers, in fact, may take days of effort to fully internalize even for someone mathematically inclined and accustomed to reading papers.


As noted by the current top comment (https://news.ycombinator.com/item?id=36544024), this is a small part of a dump of all files found during the Bin Laden compound raid that is publicly available on the CIA's website. Check the URL to verify that claim. There is no other reason this content is there; it's just a coincidence.


Great analysis, and props to these students for taking the time to challenge such a sensational headline. In the conclusion they mention my biggest problem with the paper, which is that it appears GPT-4 grades the answers as well (see section 2.6, "Automatic Grading").

In a way it makes perfect sense that GPT-4 can score 100% on a test GPT-4 also grades. To be clear, the grading GPT-4 has the answers, so it does have more information, but it still might overlook important subtleties in how the real answer differs from the generated answer due to its own failure to understand the material.


> In a way it makes perfect sense that GPT-4 can score 100% on a test GPT-4 also grades.

Even this is overstating it, because for each question, GPT-4 is considered to get it "correct" if, across the (18?) trials with various prompts, it ever produces one single answer that GPT-4 then, for whatever reason, accepts. That's not getting "100%" on a test.


In the paper, they at least claimed to manually verify the correct answers.


I just looked again and didn't see that claim; can you verify? https://arxiv.org/pdf/2306.08997.pdf

If, as per the linked critique, some of the questions in the test set were basically nonsense, then clearly they couldn't have manually verified all the answers, or they would have noticed that.


> We then process the data by manually correcting each question and answer to ensure quality and correctness

Section 2.1

Then the github repo also has wording around this:

> We double-verify manually that the grading of the test set is correct. https://github.com/idrori/MITQ/blob/main/index.html#L552

I agree it looks like this may not have actually been done given some of the questions and answers in the dataset.


Then - having not read the paper - what is the point of the automated grading?


To not spend time manually grading obviously incorrect ones (i.e. only grading 1/18 of them).


Got it!


If people haven't seen it: UT Prof Scott Aaronson had GPT-4 take his Intro Quantum final exam and had his TA grade it. It made some mistakes, but did surprisingly well with a "B". He even had it argue for a better grade on a problem it did poorly on.

Of course, this was back in April, when you could still get the pure unadulterated GPT-4 and they hadn't cut it down with baby laxative for the noobs.

https://scottaaronson.blog/?p=7209


See the comment from Ose "Comment #199 April 17th, 2023 at 6:53 am" at the bottom of that blog post...


It literally did not change. Not one bit. Please, if you're reading this, speak up when people say this. It's a fundamental misunderstanding; there's so much chatter around AI, not much info, and the SNR is getting worse.


I’ve seen the recent statement by someone at OpenAI but whatever weasel words they use, it did change.

The modified cabbage-goat-lion problem [1] that GPT-4 always failed to solve, it now gets right. I’ve seen enough people run it in enough variations [2] before to know that it absolutely did change.

Maybe they didn’t “change” it as in train anything new, but it’s definitely been RLHFed and it’s impacting the results.

[1] https://news.ycombinator.com/item?id=35155467

[2] anecdata: dozens of people, hundreds of times total


I attribute this to two things:

1. People have become more accustomed to the limits of GPT-4, similar to the Google effect. At first they were astounded; now they're starting to see its limits

2. Enabling Plugins (or even small tweaks to the ChatGPT context, like adding today's date) pollutes the prompt, giving more directed/deterministic responses

The API, as far as I can tell, is exactly the same as it was when I first had access (which has been confirmed by OpenAI folks on Twitter [0])

[0] https://twitter.com/jeffintime/status/1663759913678700544


In my experience with Bing Chat, in addition to what you say, there is also some A/B testing going on as well.


"It literally did not change. Not one bit."

How do you know?

Even if the base model didn't change, that doesn't mean they didn't fine tune it in some way over time. They also might be passing its answers through some other AI or using some other techniques to filter, censor, and/or modify the answers in some way before returning them to the user.

I don't know how anyone could confidently say what they're doing unless they work at OpenAI.


Someone who works at OpenAI said so two weeks ago


Then again, can we trust that person? It's not like they didn't have a conflict of interest in making that claim.


Yes, it’s turtles all the way down


Nice try, ClosedAI. Then how do you explain this?

https://news.ycombinator.com/item?id=36348867


Well, I had hoped the sarcastic comparison to cut heroin would make it clear.

No, I don't think there's much change at all to GPT-4 (at the API level), and probably not that much at the pre/post language detection and sanitization for apparently psychotic responses.


You should take a look at this video. He is a researcher at Microsoft and had access to a private version of ChatGPT. He literally claims that ChatGPT 4 is not as good as before. His talk actually demonstrates the different evolutions.

https://youtu.be/qbIk7-JPB2c


If you are referring to that social media post by an OpenAI employee saying it hasn’t changed, they were specifically referring to the API. IIRC, the same employee explicitly stated the web UI version changes quite regularly. Someone correct me with the link if I’m wrong; I don’t have it handy.


This "GPT4 evaluating LLMs" problem is not limited to this case. I don't know why exactly but everyone seems to have accepted the evaluation of other LLM outputs using GPT4. GPT-4 at this point is being regarded as "ground-truth" with each passing day.

Couple this with the reliance on crowd-sourcing to create evaluation datasets and the heavy use of GPT-3.5 and GPT-4 by MTurk workers, and you have a big fat feed-forward process benefiting only one party: OpenAI.

The Internet we know is dead - this is a fact. I think OpenAI exactly knew how this would play out. Reddit, Twitter and the like are awakening just now - to find that they're basically powerless against this wave of distorted future standards.

Once GPT is sufficiently proven to pass every existing test on Earth, every institution will be so reliant on producing work with it that we won't have a "100% handmade exam" anymore. No problem will be left for GPT to tackle.


>> I don't know why exactly but everyone seems to have accepted the evaluation of other LLM outputs using GPT4. GPT-4 at this point is being regarded as "ground-truth" with each passing day.

Why? Because machine learning is not a scientific field. That means anyone can say and do whatever they like and there's no way to tell them that what they're doing is wrong. At this point, machine learning research is like the social sciences: a house of cards, unfalsifiable and unreproducible research built on top of other unfalsifiable and unreproducible research. People simply choose whatever approach they like, cite whatever result they like, because they like the result, not because there's any reason to trust it.

Let me not bitch again about the complete lack of anything like objective measures of success in language modelling, in particular. There have been no good metrics, no meaningful benchmarks, for many decades now, in NLP as a whole, but in language generation even more so. This is taught to students in NLP courses (our tutors discussed it in my MSc course), there is scholarship on it, there is a constant chorus of "we have no idea what we're doing", but nothing changes. It's too much hard work to try to find good metrics and build good benchmarks. It's much easier to put a paper on arXiv that shows SOTA results (0.01 more than the best system compared against!). And so the house of cards rises ever towards the sky.

Here's a recent paper that points out the sorry state of Natural Language Understanding (NLU) benchmarking:

What Will it Take to Fix Benchmarking in Natural Language Understanding?

https://aclanthology.org/2021.naacl-main.385/

There are many more, going back years. There are studies of how top-notch performance on NLU benchmarks is reduced to dust when the statistical regularities that models learn to overfit to in test datasets are removed. Nobody. fucking. cares. You can take your science and go home, we're making billion$$$ here!


I would have said machine learning is more like materials science, but you are on the right track.

As you increase the number of bits you are trying to comprehend, you move from quantum physics to chemistry to material science to biology to social science.

At certain points, the methods and reproducibility become somewhat of a dark art. I have experienced that in my field of materials science.

Because these models are using billions or trillions of random number generators in their probability chains, it starts looking more like the harder hard sciences; it gets very difficult to track and understand what is important.

I think machine learning will be easier to comprehend than social sciences, so I wouldn't put it that high. It will be something between materials science and biology levels of difficulty in understanding.


Yes, I have similar concerns. These models regurgitate previously seen strings, previous benchmarks included. When you try to evaluate their sheer ability to reason on the text, however, they perform poorly. (Our experiments with GPT-3 are here: https://doi.org/10.5220/0012007500003470)


> The Internet we know is dead - this is a fact. I think OpenAI exactly knew how this would play out.

If OpenAI ceased to be (probably for some legislative reason), would the problems go away?


The damage will have been done so I don’t think so.


> but it still might overlook important subtleties

If there's one thing we can be certain of, it's that LLMs often overlook important subtleties.

I can't believe they used GPT-4 to also evaluate the results. I mean, we wouldn't trust a student to grade their own exam, even when given the right answers to grade with.


I noticed that when I read the paper. I know it's hard to scale, but I'd want to see competent TAs doing the grading. I also found the distribution of courses a bit odd. Some of it might just be individual samples, but intro courses I'd expect to be pretty cookie-cutter (for GPT) were fairly far down the list, and things I'd expect to be really challenging had relatively good results.


I can attest that the distribution is odd, based on the test set that we sampled.

We've already run the zero-shot GPT model on all of the datapoints in the provided test set. We're going through the process now of grading them manually (our whole fraternity is chipping in!) and should have the results out relatively soon.

I can say that, so far, it's not looking good for that 90% correct zero-shot claim either.


Since you are here: when I was reading the paper I wondered, when they show the "zero-shot solve rates", does that mean they are basically running the same experiment code but without the prompts that call `few_shot_response` (i.e., they are still trying each question with every expert prefix and every critique)? It wasn't clear to me at a glance.


While nerf.studio is impressive, the parent comment is asking about radiance-field-free methods. NeRF stands for Neural Radiance Fields, and all of the algorithms in that repo use radiance fields.


@dang this account appears to be low-effort AI-generated spam, presumably to promote MirrorThink.ai (look at the past comments).

Regardless, this answer does not respond to the parent comment's question at all, since all of these papers are about radiance fields.


To clarify for people who don't follow NeRF techniques: this research is not prompt-based. The algorithm is capturing the 3D scene from real-life images. There is some super promising work mixing NeRF-based techniques with various generative models to create 3D objects from prompts, but it doesn't seem close to creating anything of this kind of scale/detail yet. I do agree this is a future possibility though.


I will admit I stand corrected about not being close: https://twitter.com/_akhaliq/status/1648848468234911754


A few of the comments in this thread seem to be misusing mathematics in order to lend more credence to themselves. At the risk of responding to low-quality flamebait, here are some problems with your statements.

1. P = NP refers to two very specific sets of problems (which might actually be the same set), not any general question. There are problems that we know don't fall into P or NP (for example, the Halting Problem). Also, whether or not P = NP is an open question, almost the opposite of a fact.

2. You claim: "Evaluating if an answer is correct or not is easier than coming up with a correct answer from scratch." This is the right idea but not quite correct.

The correct statement is as follows: "Evaluating if an answer is correct is not harder than the difficulty of coming up with a correct answer from scratch."

This is because evaluating some answer can still be just as hard as the original problem. In fact, sometimes it's uncomputable (if the original problem is also uncomputable). To use an example from above, consider the question: "Does a program x halt?" If I tell you "no", it could be impossible to verify my answer unless you have solved the halting problem.

To bring this back to reality: if GPT-4 is wrong about some complex medical question, that doesn't mean it's mathematically easier to figure that out than to solve the problem from scratch.
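To make the halting example a bit more concrete, here's a toy sketch (purely illustrative; "programs" are modeled as Python generators): a claimed "yes, it halts within N steps" is cheap to verify by just running N steps, but no finite run can confirm a "no, it never halts" answer.

    def halts_within(program, steps):
        # Run the toy program for at most `steps` steps; it halts iff the
        # generator is exhausted (raises StopIteration) within that budget.
        it = program()
        for _ in range(steps):
            try:
                next(it)
            except StopIteration:
                return True
        return False

    def terminating():
        yield from range(10)  # halts after 10 steps

    def looping():
        while True:
            yield  # never halts

    print(halts_within(terminating, 100))  # True: the "halts" claim is verified
    print(halts_within(looping, 100))      # False, but this does NOT verify
                                           # the claim "looping never halts"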


This analogy does not make sense to me. We do not have equivalence between all of the infinite inputs and outputs here; we have equivalence in a finite number of cases, and known cases where the two functions (human output and LLM output) diverge drastically. Any mathematician worth their salt would tell you these functions are definitely not equal.

Now you could make the argument that these functions are close enough most of the time so it won't matter, but unless you want to get really rigorous, that's more of a stats/engineering perspective, not mathematics. And more importantly, that's very much up for debate, especially in a high-pressure situation like medicine.

Of course these models are wild and I'm quite impressed with them. I still can be worried about the damage someone who doesn't think things through could cause by assuming GPT-4 has human or superhuman-level intelligence in a niche, high-impact field.


The equation example was simply an illustration.

The point I'm making is that I can quite clearly show output that demonstrates reasoning and understanding by any actual metric. That's not the problem. The problem is that when I do, the argument quickly shifts to "it's not real understanding!". That is what is nonsensical here. It's actually nonsensical whatever domain you want to think about it in.

Either your dog fetches what you throw at it or it doesn't. The idea of "pretend fetching" as any meaningful distinction is incredibly silly and makes no sense.

If you want to tell me there's a special distinction, cool, but when you can't show me what that distinction is, how to test for it, or the qualitative or quantitative differences, then I'm going to throw your argument away because it's not a valid one. It's just an arbitrary line drawn in the sand.

