Is the following quote at odds with what you are saying about 50-way classification?
"On the other hand, the network is not merely classifying sentences, since performance is improved by augmenting the training set even with sentences not contained in the testing set (Fig. 3a,b). This result is critical: it implies that the network has learned to identify words, not just sentences, from ECoG data, and therefore that generalization to decoding of novel sentences is possible."
The difficulty of the problem is that of 50-way classification. If the only goal were to minimize WER, a simple post-processing step that snaps each output to the nearest sentence in the training set (sketched below) could easily bring the WER down further. Presumably they chose their approach to show that it can be done that way, and I don't fault them for it.
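For concreteness, here is what that post-processing step might look like. This is a minimal Python sketch of my own, not anything from the paper; the function names are hypothetical:

```python
def word_edit_distance(a, b):
    """Levenshtein distance over word tokens (the quantity underlying WER)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def snap_to_training_set(decoded_words, training_sentences):
    """Replace a decoded word sequence with the nearest training sentence.
    With only ~50 distinct sentences, this turns decoding into
    classification and can only reduce WER."""
    return min(training_sentences,
               key=lambda s: word_edit_distance(decoded_words, s.split()))
```

The point is just that with a closed set of 50 sentences, such a step is trivially available, so a low WER by itself doesn't distinguish word-level decoding from sentence classification.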
They claim that word-by-word decoding implies that the network has learned to identify words. This may well be true, but their result doesn't support that claim on its own. For example, suppose you average all electrode samples over the relevant timespan, transform that representation with a feed-forward neural net, and feed the result into an RNN decoder. It would still predict word by word, from a representation that necessarily cannot distinguish between words (because the time dimension has been averaged away). Such a model can still output words in the right order, purely from the statistics of the training sentences being baked into the decoder RNN. A sketch of that architecture follows.
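To make the thought experiment concrete, here is a minimal PyTorch sketch of such a model. This is my own illustration, assuming a GRU decoder conditioned on an embedding of the previous word; it is not the paper's architecture:

```python
import torch
import torch.nn as nn

class TimeAveragedDecoder(nn.Module):
    """Thought experiment: the ECoG input is averaged over time before
    decoding, so the encoder representation cannot localize individual
    words; any word-by-word output must come from sentence statistics
    memorized by the decoder RNN."""

    def __init__(self, n_electrodes, hidden_dim, vocab_size):
        super().__init__()
        # Feed-forward transform of the time-averaged electrode vector.
        self.encoder = nn.Sequential(
            nn.Linear(n_electrodes, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, ecog, prev_words):
        # ecog: (batch, time, n_electrodes); prev_words: (batch, seq_len)
        # Averaging over the time axis destroys all word-level timing.
        pooled = ecog.mean(dim=1)               # (batch, n_electrodes)
        h0 = self.encoder(pooled).unsqueeze(0)  # (1, batch, hidden)
        # The GRU still emits one word per step, conditioned only on the
        # pooled summary and whatever word-order statistics it has learned.
        dec_out, _ = self.decoder(self.embed(prev_words), h0)
        return self.out(dec_out)                # per-step word logits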
"On the other hand, the network is not merely classifying sentences, since performance is improved by augmenting the training set even with sentences not contained in the testing set (Fig. 3a,b). This result is critical: it implies that the network has learned to identify words, not just sentences, from ECoG data, and therefore that generalization to decoding of novel sentences is possible."