A good rule of thumb I always use: if a science article title has the word "breakthrough" in it, then it's probably not a breakthrough.
If the title does nothing to describe the actual discovery and consists solely of "Breakthrough in [Field]", then it's definitely not a breakthrough.
A good rule that could be refined by applying it only to topics the general public cares about. A breakthrough in analytic number theory or international accounting standards is probably genuine; one in AI or battery technology probably not.
Why is that? I suppose complex maths is harder for the science journalist to understand and doesn't get as many clicks, so if they're reporting on it, it's because it's substantial?
Any rule that reduces the posterior probability of a breakthrough is generally going to be an improvement, unless your definition of "breakthrough" is extremely generous.
>"[its] training times are about 7-10 times faster, and... memory footprints are 2-4 times smaller" than those of previous large-scale deep learning techniques.
Which matches the abstract. If this has general applications, it's a pretty big leap to shrink model sizes and speed up training by close to an order of magnitude, especially at a time when many SOTA models are only feasible for well-funded groups because of their size.
This is not a strong result, as noted by the reviewers, and among other things they have not demonstrated state-of-the-art performance. Those of us who did ML in academia also look down on using the media to bolster one's claims before a thorough peer review of the research.
Word to the wise: as someone who actually works in the field, trust NO claims until you can verify them with real code.
Papers very often report the uppermost bound of what's _theoretically_ possible when it comes to benchmarks. Researchers rarely have the engineering skill to realize those gains in practice, so any performance numbers in papers should be assumed theoretical and unverified unless you can download the code and benchmark it yourself, or unless they come from a research organization known for competent benchmarking (e.g. Google Brain). In particular, any "sparse" approach is deeply suspect as far as its practical performance or memory-efficiency claims go: current hardware does not deal with sparsity well unless things are _really_ sparse (something like a tenth of the entries nonzero, or fewer) and the sparsity is enough to outweigh the architectural inefficiencies.
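For what it's worth, a quick way to sanity-check that kind of claim is to benchmark sparse against dense directly. Here is a minimal, hypothetical sketch in PyTorch (the sizes and densities are made up, and `torch.sparse` performance varies a lot by hardware and build):

```python
import time
import torch

def bench(fn, warmup=3, iters=10):
    """Crude wall-clock timing; good enough for a rough sanity check."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

n = 4096
x = torch.randn(n, n)

for density in (0.5, 0.1, 0.01):
    # Build a matrix with roughly `density` nonzero entries, then keep
    # a dense copy and a sparse COO copy of the same data.
    mat = torch.randn(n, n) * (torch.rand(n, n) < density)
    mat_sparse = mat.to_sparse()

    t_dense = bench(lambda: mat @ x)
    t_sparse = bench(lambda: torch.sparse.mm(mat_sparse, x))
    print(f"density={density:.2f}  dense: {t_dense*1e3:.1f} ms  "
          f"sparse: {t_sparse*1e3:.1f} ms")
```

On typical hardware the sparse path only starts winning at fairly low densities, which is exactly the overhead point above.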
It was run on a single machine by logically partitioning the GPUs. Don't get me wrong: I'm not disputing that this could work or that it could be a "breakthrough". I'm just saying that unless it's independently replicated and confirmed, it's just a paper like a million others.
It's an interesting premise nonetheless. Perhaps a similar approach could be drawn from mathematical manifolds, which have charts and atlases; I believe the atlas is built from overlapping charts.
Bigger batches are good, but they result in locking. Picking a good batch size relative to how much data you have is important. This new technique effectively lets you get a "meta batch" for free (that's a terrible analogy, but it's the best I can do).
As batches get bigger and can't fit inside a single GPU or a single compute node, your challenge becomes data transport. So anything that can decouple your computational agents can be a win.
In this case, it's a cleverer way of decoupling your agents. Normally asynchronous batches are awful, but this is a rather clever way of allowing for asynchronous batching of your data.
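To make the "decoupled agents" point concrete, here's a toy, hypothetical sketch: each worker trains its own small model on its own synthetic shard with no parameter server, no gradient exchange, and therefore nothing to lock on. (The helper name and the data are made up for illustration, not taken from the paper.)

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from sklearn.linear_model import SGDClassifier

def train_shard(seed):
    """Train one fully independent model on its own synthetic shard.

    The workers never communicate, so there is no gradient
    synchronization and nothing to lock on.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(10_000, 50))
    y = (X[:, 0] + 0.1 * rng.normal(size=10_000) > 0).astype(int)
    clf = SGDClassifier(random_state=seed)
    clf.fit(X, y)
    return clf.score(X, y)

if __name__ == "__main__":
    # Four decoupled "agents" training in parallel, asynchronously.
    with ProcessPoolExecutor(max_workers=4) as pool:
        scores = list(pool.map(train_shard, range(4)))
    print("per-shard training accuracy:", scores)
```

The interesting part is of course how you combine such independent models into one coherent predictor afterwards; the sketch only shows that the training itself needs no coordination.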
If I may opine on the matter, I think we're reaching a point where machine learning researchers should start thinking about abandoning Python as a programming medium. For example, the other decoupling strategy (decoupled neural-net backpropagation) doesn't really seem like something I would want to write in Python, much less debug in someone else's code. Python is really not an appropriate language for tackling difficult problems in distribution and network coordination.
As long as the big ML libraries support these strategies, people will use them. The choice of user-facing language is not critical; TensorFlow/PyTorch are basically an ML-specific programming model with a Python interface.
Did you read what I wrote? I'm not making any claims about numerical performance. I'm saying there are better choices (in terms of being easy for the programmer to write and debug) for programming the other aspects: networking, asynchronous coordination, etc.
>Instead of training on the entire 100 million outcomes—product purchases, in this example—Mach divides them into three "buckets," each containing 33.3 million randomly selected outcomes.
So, uh, they're doing what random forests have been doing for decades? What is the key difference?
Random forests split the features. This splits the outcomes.
So each tree in RF only looks at a few features. In this, each model looks at all the features.
RF can handle multiclass problems with tens to hundreds (maybe thousands) of classes. This MACH algo can handle multiclass problems with millions or billions of classes (extreme classification).
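To make that concrete, here is a heavily simplified, hypothetical sketch of the hash-the-labels idea described in the quote: each of R repetitions maps every class to one of B buckets, a small classifier per repetition predicts bucket IDs instead of raw classes, and at inference the per-bucket scores are combined back into per-class scores. This is just an illustration of the scheme as I read it, not the authors' code; the random bucket maps and the averaging step are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

R, B = 3, 32                 # repetitions and buckets (toy sizes)
n_classes = 1000             # stand-in for the article's "100 million outcomes"

rng = np.random.default_rng(0)
# One random class->bucket map per repetition (a stand-in for hashing).
bucket_of = rng.integers(0, B, size=(R, n_classes))

# Toy training data whose features carry a weak signal about the class.
y = rng.integers(0, n_classes, size=5000)
X = rng.normal(size=(5000, 64))
X[:, 0] += y / n_classes     # inject a little class-dependent structure

# Train R independent small models that predict bucket IDs, not classes.
models = []
for r in range(R):
    clf = LogisticRegression(max_iter=500)
    clf.fit(X, bucket_of[r, y])
    models.append(clf)

def class_scores(x):
    """Combine the R per-bucket probability vectors into per-class scores."""
    scores = np.zeros(n_classes)
    for r, clf in enumerate(models):
        probs = np.zeros(B)
        probs[clf.classes_] = clf.predict_proba(x.reshape(1, -1))[0]
        scores += probs[bucket_of[r]]    # each class inherits its bucket's score
    return scores / R

pred = int(np.argmax(class_scores(X[0])))
print("predicted class:", pred, "true class:", int(y[0]))
```

With B much smaller than n_classes, each model's output layer is tiny, which is where the memory savings would come from; the price is the approximate decoding step at the end.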
Umm... Here's an obvious idea: what if you don't store the entire model in memory, and instead use a message-passing architecture to distribute the model, kind of like how HPC people have been doing this entire time? Non-distributed models are a dead end anyway.
It depends on just how huge the model is. Some models take multiple seconds to run/backpropagate and might take hundreds of gigabytes of memory, in which case it could be useful.
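For what it's worth, the message-passing version of this is fairly direct to sketch with torch.distributed: split the layers across processes and send only activations between them. This is a toy forward pass on one machine with the gloo backend, with arbitrary sizes and a two-stage split, just to show the shape of the approach (no training loop, no pipelining):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank == 0:
        # Rank 0 holds only the first half of the model.
        stage = torch.nn.Linear(1024, 512)
        x = torch.randn(8, 1024)
        h = stage(x)
        dist.send(h.detach(), dst=1)      # ship activations, not parameters
    else:
        # Rank 1 holds only the second half and never sees rank 0's weights.
        stage = torch.nn.Linear(512, 10)
        h = torch.empty(8, 512)
        dist.recv(h, src=0)
        out = stage(h)
        print("rank 1 output shape:", tuple(out.shape))

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```

The memory argument is the same one the HPC folks make: each process only ever holds its own shard of the parameters, so a model that would need hundreds of gigabytes on one box just needs enough boxes.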