Both fantastic papers. For those who aren't aware, Mikolov also helped create word2vec.
One curious thing: this seems to use hierarchical softmax instead of the "negative sampling" described in their earlier paper http://arxiv.org/abs/1310.4546, despite that paper reporting that "negative sampling" is more computationally efficient and of similar quality. Anyone know why that might be?
Sentence classification is the generic term for bucketing sentences into different labels - those labels could be "positive", "negative" and "neutral", thus allowing for sentiment analysis.
But they could also be other labels such as "sports_news" or "finance_news". This library allows both.
I noticed that the C++ code has no comments whatsoever. Why would they do that? Is the code considered clear enough that you can read the papers to figure it out, or do they clean up comments before releasing internal code to the public?
I suspect it's the latter, since code not initially OSS likely has some references to IP, or org structure, some crudeness, etc. Probably easier to remove it all than rewrite.
Adding comments back in would be a great start to contributing to OSS.
I took bdcravens' comment to mean it'd be a great project for someone who wanted a way to start contributing to OSS, not a suggestion that Facebook wasn't contributing.
Like whafro said, that wasn't what I was trying to suggest. Never meant to denigrate their contribution, only saying that parts that were never intended to be open (like comments) were probably removed rather than rewritten, which is perfectly acceptable and consistent with their great open source contributions.
We usually strip the commit history since it's hard to audit the history for confidential info, but we don't normally remove comments when open-sourcing things. I'm not sure on the specifics of this case though.
The classification format is a bit confusing to me. Given a file that looks like this:
Help - how do I format blocks of code/bash output in this editor?
`fastText josephmisiti$ cat train.tsv | head -n 2
1 1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 1
2 1 A series of escapades demonstrating the adage that what is good for the goose 2`
`fastText josephmisiti$ cat train.tsv | head -n 10 | awk -F '\t' '{print "__label__"$4 "\t" $3 }'
__label__1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .
__label__2 A series of escapades demonstrating the adage that what is good for the goose
__label__2 A series
__label__2 A
__label__2 series
__label__2 of escapades demonstrating the adage that what is good for the goose
__label__2 of
__label__2 escapades demonstrating the adage that what is good for the goose
__label__2 escapades
__label__2 demonstrating the adage that what is good for the goose`
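For what it's worth, the format the tool expects for supervised training seems to be one example per line, with a __label__<class> prefix followed by the text, which is exactly what the awk one-liner above produces. A minimal Python sketch of the same conversion (the column positions are assumptions taken from the snippet above, with the 4th column as the label and the 3rd as the text):

    # Minimal sketch of the same TSV-to-fastText conversion as the awk
    # one-liner above. The column positions (label in column 4, text in
    # column 3) are assumptions taken from that snippet.
    with open("train.tsv") as src, open("train.txt", "w") as dst:
        for line in src:
            cols = line.rstrip("\n").split("\t")
            dst.write("__label__" + cols[3] + " " + cols[2] + "\n")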
For supervised classification this tool is suitable when your dataset is large enough. I performed some tests with binary classification (Twitter sentiment) on a corpus of ~7,000 samples and the result is not impressive (~0.77). Vowpal Wabbit performs slightly better here, with almost the same training time.
I'm looking forward to trying it on some bigger datasets.
I also wonder whether it is possible to use a separately trained word vector model for the supervised task?
Thanks for pointing this out. We designed this library for large datasets, and some static variables may not be well tuned for smaller ones. For example, the learning rate is only updated every 10k words. We are fixing that now; could you please tell us which dataset you were testing on? We would like to see if we have solved this.
This might be a naïve question, but does anyone know if this is suitable for online classification tasks? All the examples in the paper ([2] in the readme) seemed to be for offline classification. I'm not terribly well versed in this area so I don't know if the techniques used here allow the model to be updated incrementally.
Can this be used to do automatic summarization? I have been really interested in that topic, and I've played with TextRank and LexRank, but they don't provide as meaningful summaries as I would want.
Back in 2003-2005(!) I wrote a thing called classifier4j[0] which also did summarization. HP Labs in Brazil did a fairly comprehensive benchmark[1] of summarizers last year, which classifier4j won[2].
I'm quite surprised that (a) they found classifier4j at all, (b) they bothered to test it, and (c) it won, but anyway...
I'm also searching for automatic text summarization, and I found a list of the 100 best automatic summarization projects on GitHub. I would like to know which are the best from this list: http://meta-guide.com/software-meta-guide/100-best-github-au...
No, I don't think it can do that. It can classify text. So suppose you have a bunch of sentences that describe cars, cats, etc. If you feed in new data, it can tell you whether that data is about a car or a cat.
Thanks for the input! Text classification and semantic analysis seemed vague to me, so the clarification helped :). Maybe classifying text can help improve automatic summarization, since the sentences that include or describe the main topic best should be in the summary.
Just to mirror what was said on the thread a month ago when the paper came out[1], if you're interested in FastText I'd strongly recommend checking out Vowpal Wabbit[2] and BIDMach[3].
My main issue is that the FastText paper [7] only compares to other intensive deep methods and not to comparable performance focused systems like Vowpal Wabbit or BIDMach.
Many of the features implemented in FastText have existed in Vowpal Wabbit (VW) for many years. Vowpal Wabbit also serves as a test bed for many other interesting, but all highly performant, ideas, and has reasonably strong documentation. The command line interface is highly intuitive and it will burn through your datasets quickly. You can recreate FastText in VW with a few command line options[6].
BIDMach is focused on "rooflining", or working out the exact performance characteristics of the hardware and aiming to maximize those[4]. While VW doesn't have word2vec, BIDMach does[5], and more generally word2vec isn't going to be a major slow point in your systems as word2vec is actually pretty speedy.
To quote from my last comment in [1] regarding features:
Behind the speed of both methods [VW and FastText] is the use of ngrams^, the feature hashing trick (think Bloom filter, except for features) that has been the basis of VW since it began, hierarchical softmax (think finding an item in O(log n) using a balanced binary tree instead of an O(n) array traversal), and the use of a shallow instead of a deep model.
^ Illustrating ngrams: "the cat sat on the mat" => "the cat", "cat sat", "sat on", "on the", "the mat" - you lose complex positional and ordering information but for many text classification tasks that's fine.
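To make that concrete, here's a rough Python sketch (not VW's or FastText's actual code) of the bigram extraction illustrated above plus the feature hashing trick that maps those ngrams into a fixed-size weight array; the bucket count and the use of Python's built-in hash are arbitrary choices for illustration:

    # Rough illustrative sketch of word bigrams plus the feature hashing trick.
    # Not the actual VW/FastText implementation; the bucket count and the use
    # of Python's built-in hash() are arbitrary choices here.
    NUM_BUCKETS = 2 ** 20  # fixed-size feature space

    def bigrams(text):
        words = text.split()
        return [" ".join(pair) for pair in zip(words, words[1:])]

    def hashed_features(text):
        # Unigrams and bigrams are mapped to integer buckets; collisions are
        # tolerated, which is why the Bloom filter comparison is apt.
        feats = text.split() + bigrams(text)
        return [hash(f) % NUM_BUCKETS for f in feats]

    print(bigrams("the cat sat on the mat"))
    # ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']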
In principle, if you just put a space between each character it would, though it would also make ngrams across word boundaries, which you might not want.
edit: for vw, maybe the other lib has special support for character ngrams with word boundaries
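If you do want word-bounded character ngrams, the subword approach from the second paper pads each word with boundary markers before extracting ngrams. A hypothetical sketch; the <...> markers and the 3-6 length range are assumptions here, roughly the commonly cited defaults:

    # Hypothetical sketch of character n-grams with word-boundary markers, in
    # the spirit of the "Enriching Word Vectors with Subword Information" paper.
    # The <...> markers and the 3-6 length range are assumptions, not verified defaults.
    def char_ngrams(word, min_n=3, max_n=6):
        padded = "<" + word + ">"
        grams = []
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
        return grams

    print(char_ngrams("where", 3, 4))
    # ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']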
I coded something like this before for personal use. It allows me to evaluate my Facebook/Twitter status before posting online and classify it as "negative, sarcastic, positive, or helpful" so that I can be careful about what I'm posting. I use Bayesian filtering with trained words I gathered for the negative, sarcastic, positive, and helpful categories, then I use scoring to work out what the sentence actually means.
The simultaneous training of word representations and a classifier seems like it ignores the typically much larger unsupervised portion of the corpus. Is there a way to train the word representations on the full corpus and then apply them to the smaller classification training set?
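One common workaround (I'm not sure this release supports it directly) is to train word vectors on the full unlabeled corpus separately, then fit a simple classifier on averaged vectors for the labeled subset. A rough sketch, assuming `vectors` comes from a separately trained model; the averaging and the scikit-learn classifier are illustrative choices, not fastText's API:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Rough sketch of the two-stage idea: word vectors trained elsewhere on the
    # full corpus, then a separate classifier on the small labeled subset.
    # `vectors` is assumed to come from a separately trained model.

    def sentence_vector(text, vectors, dim=100):
        words = [w for w in text.split() if w in vectors]
        if not words:
            return np.zeros(dim)
        return np.mean([vectors[w] for w in words], axis=0)

    # Tiny fake vectors and labels just to show the shape of the pipeline.
    vectors = {"good": np.random.rand(100), "bad": np.random.rand(100)}
    texts, labels = ["good movie", "bad plot"], [1, 0]
    X = np.stack([sentence_vector(t, vectors) for t in texts])
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(X))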
Conceptnet Numberbatch (https://github.com/LuminosoInsight/conceptnet-numberbatch) is a pre-trained model that outperforms the results reported in this paper (and of course far outperforms the pre-trained word2vec models, which are quite dated).
The difference actually should be larger: Numberbatch considers missing vocabulary to be a problem, and takes a loss of accuracy accordingly, while FastText just dropped their out-of-vocabulary words and reported them as a separate statistic.
I'm using their Table 3 here. I don't know how Table 2 relates, or why their French score goes down with more data in that table.
What's the trick? Prior knowledge, and not expecting one neural net to learn everything. Numberbatch knows a lot of things about a lot of words because of ConceptNet, it knows which words are forms of the same word because it uses a lemmatizer, and it uses distributional information from word2vec and GloVe.
It would be nice to have a FB-curated classification model set, but I wonder if it would be much more than sentiment labels (as is mentioned). Those are a dime-a-dozen.
At train time, the code supports multiple labels by sampling one of the k labels at random. At test time, it only predicts the most probable label for each example.
We will add more functionality for multi-label classification in the future (predicting the top k labels, etc.).
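If it helps to picture it, a tiny hypothetical sketch of the behaviour described above (not the library's actual code): one of an example's k labels is drawn at random during training, and only the single highest-scoring label is returned at test time:

    import random

    # Hypothetical sketch of the behaviour described above, not fastText's code.
    def pick_training_label(labels):
        # Train time: a multi-labeled example contributes one randomly chosen label.
        return random.choice(labels)

    def predict(label_scores):
        # Test time: only the most probable label is reported.
        return max(label_scores, key=label_scores.get)

    print(pick_training_label(["__label__sports_news", "__label__finance_news"]))
    print(predict({"__label__sports_news": 0.7, "__label__finance_news": 0.3}))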
Bag of Tricks for Efficient Text Classification: https://arxiv.org/abs/1607.01759v2
Enriching Word Vectors with Subword Information: https://arxiv.org/abs/1607.04606