Bag of Tricks for Efficient Text Classification: https://arxiv.org/abs/1607.01759v2
Enriching Word Vectors with Subword Information: https://arxiv.org/abs/1607.04606
Both fantastic papers. For those who aren't aware, Mikolov also helped create word2vec.
One curious thing: this seems to use hierarchical softmax instead of the "negative sampling" described in their earlier paper http://arxiv.org/abs/1310.4546, despite that paper reporting that negative sampling is more computationally efficient at similar quality. Does anyone know why that might be?
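For anyone who wants to see the contrast concretely, here is a minimal NumPy sketch of the skip-gram negative-sampling objective (variable names and the toy dimensions are mine, not taken from either paper). The point is that each update touches only the center vector, one context vector, and k sampled negatives, so the cost is O(k) regardless of vocabulary size; hierarchical softmax instead evaluates roughly log2(|V|) binary decisions along a Huffman-tree path.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def negative_sampling_loss(center, context, negatives):
        # Skip-gram negative sampling (Mikolov et al., 2013):
        # maximize log sigmoid(v_c . v_o) plus, for each of the k
        # sampled negatives v_k, log sigmoid(-v_c . v_k).
        # Only k + 1 dot products per update, independent of |V|.
        pos = np.log(sigmoid(center @ context))
        neg = np.sum(np.log(sigmoid(-(negatives @ center))))
        return -(pos + neg)

    # Toy example: 100-dim vectors, k = 5 negatives drawn at random.
    dim, k = 100, 5
    center = rng.standard_normal(dim)
    context = rng.standard_normal(dim)
    negatives = rng.standard_normal((k, dim))
    print(negative_sampling_loss(center, context, negatives))

Both objectives avoid the full-vocabulary softmax, so the choice between them may come down to task differences (classification over a small label set vs. learning word representations) rather than raw speed.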