Speaking as both an ML researcher and an applied ML business owner (one who hired stared at one point — hi Piotr :), I respectfully disagree.

The replication crisis in "AI" may not be as bad as psychology (I wouldn't know), but it's not great. Sadly, my brain has somehow learned to equate "SOTA" with "hot-stitched crap, stay away". Too many painful lessons.

On the subject of publishing code: this is useful to the degree that it removes bad faith as a possible explanation for the lack of replicability. Otherwise it helps little in practical terms. You just get the privilege of sifting through the bugs and bad design at closer range.

"I am afraid you are right. I used to reach ~72% via the given random seed on an old version of pytorch, but now with the new version of pytorch, I wasn't able to reproduce the result. My personal opinion is that the model is neither deep or sophisticated, and usually for such kind of model, tuning hyper parameters will change the results a lot (although I don't think it's worthy to invest time tweaking an unstable model structure)."

That's a quote [1] from one of the "new SOTA" papers in NLP (WikiQA question answering), where the replication score came out at 62% instead of the claimed 72%.
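
The quoted issue hints at part of the problem: a single random seed on one specific PyTorch version is doing a lot of load-bearing work. As a minimal sketch (my illustration, not the paper's code) of what pinning the obvious sources of randomness looks like in PyTorch:

    import random

    import numpy as np
    import torch

    SEED = 42  # arbitrary, but must be fixed and reported

    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)

    # trade GPU speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

Even with all of that, numbers can still drift across PyTorch releases, because kernels and default initialisations change underneath you — which is exactly what the issue describes.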

I generally call this the "AI Mummy Effect" — looks great but crumbles to dust on touch.

[1] https://github.com/pcgreat/SeqMatchSeq/issues/1




Hi Radim!

To make it clear, I am not happy with the current state of reproducibility in AI. Yet it is still better than in all the disciplines I have interacted with (quantum physics, mathematical psychology). There the standard practice was to not include any code, even when the paper was based on it.

See my answer to "Why are papers without code but with results accepted?" (https://academia.stackexchange.com/questions/23237/why-are-p...).

So I was happy to see that in Deep Learning a lot of code appears on GitHub (I am happiest when it appears in different frameworks, implemented by different people).

Dirty code provides limited value. It's hard to learn from, hard to re-use, and its performance may depend on the phase of the Moon (and system settings, software versions, etc.). Yet, IMHO, it is much better than no code. It is not only about good faith, but about including all the details. Some of them may seem unimportant (even to the author), yet prove crucial for the results.

The next level is reasonably well written code, with a clear environment specification (e.g. Dockerfile/requirements.txt) and the dataset (see the sketch below). Otherwise it is hard to guard it against "on my environment it works":

> where the replication scores came out 62% instead of claimed 72%


The cited example was a replication attempt that found the result wanting. A plausible reason for the discrepancy was identified, and it is accessible online. That is much better than no replication being performed at all. But yes, we need to continuously up our game. A good next target would be adopting a standard that ensures model stability, and not accepting papers that don't uphold it.
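
To illustrate the environment point above, here is a minimal sketch (the expected versions are hypothetical, not tied to any particular paper) of a training script failing fast when it is run outside the environment the reported numbers came from:

    import sys

    import torch

    # hypothetical pins: whichever versions produced the reported numbers
    EXPECTED_PYTHON = (3, 6)
    EXPECTED_TORCH = "0.3.1"

    if sys.version_info[:2] != EXPECTED_PYTHON:
        raise RuntimeError("expected Python %s, running %s"
                           % (EXPECTED_PYTHON, sys.version_info[:2]))
    if not torch.__version__.startswith(EXPECTED_TORCH):
        raise RuntimeError("expected torch %s.*, running %s"
                           % (EXPECTED_TORCH, torch.__version__))

Pinning the same versions in requirements.txt (or baking them into a Dockerfile) then turns "on my environment it works" into something a reviewer can actually recreate.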

I'd also like to see open-source, runnable code become the default for paper acceptance, with some sort of 'punishment' for not having it. Maybe even make it a prerequisite for empirical papers.



