A good rule of thumb I always use: if a science article title has the word "breakthrough" in it, then it's probably not a breakthrough.
If the title does nothing to describe the actual discovery and consists solely of "Breakthrough in [Field]", then it's definitely not a breakthrough.
A good rule that could be refined by applying it only to topics the general public cares about. A breakthrough in analytic number theory or international accounting standards is probably genuine; one in AI or battery technology probably not.
Why is that? I suppose complex maths is harder for the science journalist to understand and doesn't get as many clicks, so if they're reporting on it, it's because it's substantial?
Any rule that reduces the posterior probability of a breakthrough is generally going to be an improvement, unless your definition of "breakthrough" is extremely generous.
>"[its] training times are about 7-10 times faster, and... memory footprints are 2-4 times smaller" than those of previous large-scale deep learning techniques.
Which matches the abstract. If this has general applications, it's a pretty big leap to shrink model sizes and speed up training by close to an order of magnitude, especially at a time when many SOTA models are only feasible for well-funded groups because of their size.
This is not a strong result, as noted by the reviewers, and among other things they have not demonstrated state-of-the-art performance. Those of us who did ML in academia also look down on using the media to bolster one's claims before a thorough peer review of the research.
Word to the wise: as someone who actually works in the field, trust NO claims until you can verify them with real code.
Papers very often report the uppermost bound of what's _theoretically_ possible when it comes to benchmarks. Researchers rarely have the engineering skill to realize those gains in practice, so any performance numbers in papers should be assumed theoretical and unverified unless you can download the code and benchmark it yourself, or unless they come from a research organization known for competent benchmarking (e.g. Google Brain). In particular, any "sparse" approach is deeply suspect as far as its practical performance or memory-efficiency claims go: current hardware does not deal with sparsity well unless things are _really_ sparse (something like a tenth of the entries nonzero, or fewer) and the sparsity is enough to outweigh the architectural inefficiencies.
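For what it's worth, a quick way to sanity-check that kind of claim is to benchmark sparse against dense directly. Here is a minimal, hypothetical sketch in PyTorch (the sizes and densities are made up, and `torch.sparse` performance varies a lot by hardware and build):

```python
import time
import torch

def bench(fn, warmup=3, iters=10):
    """Crude wall-clock timing; good enough for a rough sanity check."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

n = 4096
x = torch.randn(n, n)

for density in (0.5, 0.1, 0.01):
    # Build a matrix with roughly `density` nonzero entries, then keep
    # a dense copy and a sparse COO copy of the same data.
    mat = torch.randn(n, n) * (torch.rand(n, n) < density)
    mat_sparse = mat.to_sparse()

    t_dense = bench(lambda: mat @ x)
    t_sparse = bench(lambda: torch.sparse.mm(mat_sparse, x))
    print(f"density={density:.2f}  dense: {t_dense*1e3:.1f} ms  "
          f"sparse: {t_sparse*1e3:.1f} ms")
```

On typical hardware the sparse path only starts winning at fairly low densities, which is exactly the overhead point above.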
It was run on a single machine by logically partitioning the GPUs. Don't get me wrong: I'm not disputing that this could work or that it could be a "breakthrough". I'm just saying that unless it's independently replicated and confirmed, it's just a paper like a million others.
It's an interesting premise nonetheless. Perhaps a similar approach could be drawn from mathematical manifolds, which have charts and atlases; I believe the atlas is built from overlapping charts.
Bigger batches are good, but they result in locking. Picking a good batch size relative to how much data you have is important. This new technique effectively lets you get a "meta batch" for free (that's a terrible analogy, but it's the best I can do).
As batches get bigger and can't fit inside a single GPU or a single compute node, your challenge becomes data transport. So anything that can decouple your computational agents can be a win.
In this case, it's a cleverer way of decoupling your agents. Normally asynchronous batches are awful, but this is a rather clever way of allowing for asynchronous batching of your data.
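To make the "decoupled agents" point concrete, here's a toy, hypothetical sketch: each worker trains its own small model on its own synthetic shard with no parameter server, no gradient exchange, and therefore nothing to lock on. (The helper name and the data are made up for illustration, not taken from the paper.)

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from sklearn.linear_model import SGDClassifier

def train_shard(seed):
    """Train one fully independent model on its own synthetic shard.

    The workers never communicate, so there is no gradient
    synchronization and nothing to lock on.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(10_000, 50))
    y = (X[:, 0] + 0.1 * rng.normal(size=10_000) > 0).astype(int)
    clf = SGDClassifier(random_state=seed)
    clf.fit(X, y)
    return clf.score(X, y)

if __name__ == "__main__":
    # Four decoupled "agents" training in parallel, asynchronously.
    with ProcessPoolExecutor(max_workers=4) as pool:
        scores = list(pool.map(train_shard, range(4)))
    print("per-shard training accuracy:", scores)
```

The interesting part is of course how you combine such independent models into one coherent predictor afterwards; the sketch only shows that the training itself needs no coordination.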
If I may opine on the matter, I think we're reaching a point where machine learning researchers should start thinking about abandoning Python as a programming medium. For example, the other decoupling strategy (decoupled neural-net backpropagation) doesn't really seem like something I would want to write in Python, much less debug in someone else's code. Python is really not an appropriate language for tackling difficult problems in distribution and network coordination.
As long as the big ML libraries support these strategies, people will use them. The choice of user-facing language is not critical; TensorFlow/PyTorch are basically an ML-specific programming model with a Python interface.
Did you read what I wrote? I'm not making any claims about numerical performance. I'm saying there are better choices (in terms of being easy for the programmer to write and debug) for programming the other aspects: networking, asynchronous coordination, etc.
>Instead of training on the entire 100 million outcomes—product purchases, in this example—Mach divides them into three "buckets," each containing 33.3 million randomly selected outcomes.
So, uh, they're doing what random forests have been doing for decades? What is the key difference?
Random forests split the features. This splits the outcomes.
So each tree in RF only looks at a few features. In this, each model looks at all the features.
RF can handle multiclass problems with tens to hundreds (maybe thousands) of classes. This MACH algo can handle multiclass problems with millions or billions of classes (extreme classification).
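To make that concrete, here is a heavily simplified, hypothetical sketch of the hash-the-labels idea described in the quote: each of R repetitions maps every class to one of B buckets, a small classifier per repetition predicts bucket IDs instead of raw classes, and at inference the per-bucket scores are combined back into per-class scores. This is just an illustration of the scheme as I read it, not the authors' code; the random bucket maps and the averaging step are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

R, B = 3, 32                 # repetitions and buckets (toy sizes)
n_classes = 1000             # stand-in for the article's "100 million outcomes"

rng = np.random.default_rng(0)
# One random class->bucket map per repetition (a stand-in for hashing).
bucket_of = rng.integers(0, B, size=(R, n_classes))

# Toy training data whose features carry a weak signal about the class.
y = rng.integers(0, n_classes, size=5000)
X = rng.normal(size=(5000, 64))
X[:, 0] += y / n_classes     # inject a little class-dependent structure

# Train R independent small models that predict bucket IDs, not classes.
models = []
for r in range(R):
    clf = LogisticRegression(max_iter=500)
    clf.fit(X, bucket_of[r, y])
    models.append(clf)

def class_scores(x):
    """Combine the R per-bucket probability vectors into per-class scores."""
    scores = np.zeros(n_classes)
    for r, clf in enumerate(models):
        probs = np.zeros(B)
        probs[clf.classes_] = clf.predict_proba(x.reshape(1, -1))[0]
        scores += probs[bucket_of[r]]    # each class inherits its bucket's score
    return scores / R

pred = int(np.argmax(class_scores(X[0])))
print("predicted class:", pred, "true class:", int(y[0]))
```

With B much smaller than n_classes, each model's output layer is tiny, which is where the memory savings would come from; the price is the approximate decoding step at the end.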
Umm... Here's an obvious idea: what if you don't store the entire model in memory, and instead use a message-passing architecture to distribute the model, kind of like how HPC people have been doing this entire time? Non-distributed models are a dead end anyway.
It depends on just how huge the model is. Some models take multiple seconds to run/backpropagate and might take hundreds of gigabytes of memory, in which case it could be useful.
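For what it's worth, the message-passing version of this is fairly direct to sketch with torch.distributed: split the layers across processes and send only activations between them. This is a toy forward pass on one machine with the gloo backend, with arbitrary sizes and a two-stage split, just to show the shape of the approach (no training loop, no pipelining):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank == 0:
        # Rank 0 holds only the first half of the model.
        stage = torch.nn.Linear(1024, 512)
        x = torch.randn(8, 1024)
        h = stage(x)
        dist.send(h.detach(), dst=1)      # ship activations, not parameters
    else:
        # Rank 1 holds only the second half and never sees rank 0's weights.
        stage = torch.nn.Linear(512, 10)
        h = torch.empty(8, 512)
        dist.recv(h, src=0)
        out = stage(h)
        print("rank 1 output shape:", tuple(out.shape))

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```

The memory argument is the same one the HPC folks make: each process only ever holds its own shard of the parameters, so a model that would need hundreds of gigabytes on one box just needs enough boxes.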