Both fantastic papers. For those who aren't aware, Mikolov also helped create word2vec.
One curious thing: this seems to use hierarchical softmax instead of the "negative sampling" described in their earlier paper http://arxiv.org/abs/1310.4546, despite that paper reporting that "negative sampling" is more computationally efficient and of similar quality. Anyone know why that might be?
Sentence classification is the generic term for bucketing sentences into different labels - those labels could be "positive", "negative" and "neutral", thus allowing for sentiment analysis.
But they could also be other labels such as "sports_news" or "finance_news". This library allows both.
I noticed that the C++ code has no comments whatsoever. Why would they do that? Is the code considered clear enough that you can read the papers to figure it out, or do they clean up comments before releasing internal code to the public?
I suspect it's the latter, since code not initially OSS likely has some references to IP, or org structure, some crudeness, etc. Probably easier to remove it all than rewrite.
Adding comments back in would be a great start to contributing to OSS.
I took bdcravens' comment to mean it'd be a great project for someone who wanted a way to start contributing to OSS, not a suggestion that Facebook wasn't contributing.
Like whafro said, that wasn't what I was trying to suggest. Never meant to denigrate their contribution, only saying that parts that were never intended to be open (like comments) were probably removed rather than rewritten, which is perfectly acceptable and consistent with their great open source contributions.
We usually strip the commit history since it's hard to audit the history for confidential info, but we don't normally remove comments when open-sourcing things. I'm not sure on the specifics of this case though.
The classification format is a bit confusing to me. Given a file that looks like this:
Help - how do I format blocks of code/bash output in this editor?
`fastText josephmisiti$ cat train.tsv | head -n 2
1 1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 1
2 1 A series of escapades demonstrating the adage that what is good for the goose 2`
`fastText josephmisiti$ cat train.tsv | head -n 10 | awk -F '\t' '{print "__label__"$4 "\t" $3 }'
__label__1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .
__label__2 A series of escapades demonstrating the adage that what is good for the goose
__label__2 A series
__label__2 A
__label__2 series
__label__2 of escapades demonstrating the adage that what is good for the goose
__label__2 of
__label__2 escapades demonstrating the adage that what is good for the goose
__label__2 escapades
__label__2 demonstrating the adage that what is good for the goose`
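For what it's worth, the format the tool expects for supervised training seems to be one example per line, with a __label__<class> prefix followed by the text, which is exactly what the awk one-liner above produces. A minimal Python sketch of the same conversion (the column positions are assumptions taken from the snippet above, with the 4th column as the label and the 3rd as the text):

    # Minimal sketch of the same TSV-to-fastText conversion as the awk
    # one-liner above. The column positions (label in column 4, text in
    # column 3) are assumptions taken from that snippet.
    with open("train.tsv") as src, open("train.txt", "w") as dst:
        for line in src:
            cols = line.rstrip("\n").split("\t")
            dst.write("__label__" + cols[3] + " " + cols[2] + "\n")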
For supervised classification this tool is suitable when your dataset is large enough. I performed some tests with binary classification (Twitter sentiment) on a corpus of ~7,000 samples and the result is not impressive (~0.77). Vowpal Wabbit performs slightly better here, with almost the same training time.
I'm looking forward to trying it on some bigger datasets.
I also wonder whether it is possible to use a separately trained word vector model for the supervised task?
Thanks for pointing this out. We designed this library for large datasets, and some static variables may not be well tuned for smaller ones. For example, the learning rate is only updated every 10k words. We are fixing that now; could you please tell us which dataset you were testing on? We would like to see if we have solved this.
This might be a naïve question, but does anyone know if this is suitable for online classification tasks? All the examples in the paper ([2] in the readme) seemed to be for offline classification. I'm not terribly well versed in this area so I don't know if the techniques used here allow the model to be updated incrementally.
Can this be used to do automatic summarization? I have been really interested in that topic, and I've played with TextRank and LexRank, but they don't provide as meaningful summaries as I would want.
Back in 2003-2005(!) I wrote a thing called classifier4j[0] which also did summarization. HP Labs in Brazil did a fairly comprehensive benchmark[1] of summarizers last year, which classifier4j won[2].
I'm quite surprised that (a) they found classifier4j at all, (b) they bothered to test it, and (c) it won, but anyway...
I'm also searching for automatic text summarization, and I found a list of the 100 best automatic summarization projects on GitHub. I would like to know which are the best from this list: http://meta-guide.com/software-meta-guide/100-best-github-au...
No, I don't think it can do that. It can classify text. So suppose you have a bunch of sentences that describe cars, cats, etc. If you feed in new data, it can tell you whether that data is about a car or a cat.
Thanks for the input! Text classification and semantic analysis seemed vague to me, so the clarification helped :). Maybe classifying text can help improve automatic summarization, since the sentences that include or describe the main topic best should be in the summary.
Just to mirror what was said on the thread a month ago when the paper came out[1], if you're interested in FastText I'd strongly recommend checking out Vowpal Wabbit[2] and BIDMach[3].
My main issue is that the FastText paper [7] only compares to other intensive deep methods and not to comparable performance focused systems like Vowpal Wabbit or BIDMach.
Many of the features implemented in FastText have existed in Vowpal Wabbit (VW) for many years. Vowpal Wabbit also serves as a test bed for many other interesting, but all highly performant, ideas, and has reasonably strong documentation. The command line interface is highly intuitive and it will burn through your datasets quickly. You can recreate FastText in VW with a few command line options[6].
BIDMach is focused on "rooflining", or working out the exact performance characteristics of the hardware and aiming to maximize those[4]. While VW doesn't have word2vec, BIDMach does[5], and more generally word2vec isn't going to be a major slow point in your systems as word2vec is actually pretty speedy.
To quote from my last comment in [1] regarding features:
Behind the speed of both methods [VW and FastText] is the use of ngrams^, the feature hashing trick (think Bloom filter, except for features) that has been the basis of VW since it began, hierarchical softmax (think finding an item in O(log n) using a balanced binary tree instead of an O(n) array traversal), and the use of a shallow instead of a deep model.
^ Illustrating ngrams: "the cat sat on the mat" => "the cat", "cat sat", "sat on", "on the", "the mat" - you lose complex positional and ordering information but for many text classification tasks that's fine.
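To make that concrete, here's a rough Python sketch (not VW's or FastText's actual code) of the bigram extraction illustrated above plus the feature hashing trick that maps those ngrams into a fixed-size weight array; the bucket count and the use of Python's built-in hash are arbitrary choices for illustration:

    # Rough illustrative sketch of word bigrams plus the feature hashing trick.
    # Not the actual VW/FastText implementation; the bucket count and the use
    # of Python's built-in hash() are arbitrary choices here.
    NUM_BUCKETS = 2 ** 20  # fixed-size feature space

    def bigrams(text):
        words = text.split()
        return [" ".join(pair) for pair in zip(words, words[1:])]

    def hashed_features(text):
        # Unigrams and bigrams are mapped to integer buckets; collisions are
        # tolerated, which is why the Bloom filter comparison is apt.
        feats = text.split() + bigrams(text)
        return [hash(f) % NUM_BUCKETS for f in feats]

    print(bigrams("the cat sat on the mat"))
    # ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']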
In principle, if you just put a space between each character it would, though it would also make ngrams across word boundaries, which you might not want.
edit: for vw, maybe the other lib has special support for character ngrams with word boundaries
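If you do want word-bounded character ngrams, the subword approach from the second paper pads each word with boundary markers before extracting ngrams. A hypothetical sketch; the <...> markers and the 3-6 length range are assumptions here, roughly the commonly cited defaults:

    # Hypothetical sketch of character n-grams with word-boundary markers, in
    # the spirit of the "Enriching Word Vectors with Subword Information" paper.
    # The <...> markers and the 3-6 length range are assumptions, not verified defaults.
    def char_ngrams(word, min_n=3, max_n=6):
        padded = "<" + word + ">"
        grams = []
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
        return grams

    print(char_ngrams("where", 3, 4))
    # ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']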
I coded something like this before for personal use. It allows me to evaluate my Facebook/Twitter status before posting online and classify it as "negative, sarcastic, positive, or helpful" so that I can be careful about what I'm posting. I use Bayesian filtering with trained words I gathered for the negative, sarcastic, positive, and helpful categories, then I use scoring to work out what the sentence actually means.
The simultaneous training of word representations and a classifier seems like it ignores the typically much larger unsupervised portion of the corpus. Is there a way to train the word representations on the full corpus and then apply them to the smaller classification training set?
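One common workaround (I'm not sure this release supports it directly) is to train word vectors on the full unlabeled corpus separately, then fit a simple classifier on averaged vectors for the labeled subset. A rough sketch, assuming `vectors` comes from a separately trained model; the averaging and the scikit-learn classifier are illustrative choices, not fastText's API:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Rough sketch of the two-stage idea: word vectors trained elsewhere on the
    # full corpus, then a separate classifier on the small labeled subset.
    # `vectors` is assumed to come from a separately trained model.

    def sentence_vector(text, vectors, dim=100):
        words = [w for w in text.split() if w in vectors]
        if not words:
            return np.zeros(dim)
        return np.mean([vectors[w] for w in words], axis=0)

    # Tiny fake vectors and labels just to show the shape of the pipeline.
    vectors = {"good": np.random.rand(100), "bad": np.random.rand(100)}
    texts, labels = ["good movie", "bad plot"], [1, 0]
    X = np.stack([sentence_vector(t, vectors) for t in texts])
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(X))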
Conceptnet Numberbatch (https://github.com/LuminosoInsight/conceptnet-numberbatch) is a pre-trained model that outperforms the results reported in this paper (and of course far outperforms the pre-trained word2vec models, which are quite dated).
The difference actually should be larger: Numberbatch considers missing vocabulary to be a problem, and takes a loss of accuracy accordingly, while FastText just dropped their out-of-vocabulary words and reported them as a separate statistic.
I'm using their Table 3 here. I don't know how Table 2 relates, or why their French score goes down with more data in that table.
What's the trick? Prior knowledge, and not expecting one neural net to learn everything. Numberbatch knows a lot of things about a lot of words because of ConceptNet, it knows which words are forms of the same word because it uses a lemmatizer, and it uses distributional information from word2vec and GloVe.
It would be nice to have a FB-curated classification model set, but I wonder if it would be much more than sentiment labels (as is mentioned). Those are a dime-a-dozen.
At train time, the code supports multiple labels by sampling one of the k labels at random. At test time, it only predicts the most probable label for each example.
We will add more functionality for multi-label classification in the future (predicting the top k labels, etc.).
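If it helps to picture it, a tiny hypothetical sketch of the behaviour described above (not the library's actual code): one of an example's k labels is drawn at random during training, and only the single highest-scoring label is returned at test time:

    import random

    # Hypothetical sketch of the behaviour described above, not fastText's code.
    def pick_training_label(labels):
        # Train time: a multi-labeled example contributes one randomly chosen label.
        return random.choice(labels)

    def predict(label_scores):
        # Test time: only the most probable label is reported.
        return max(label_scores, key=label_scores.get)

    print(pick_training_label(["__label__sports_news", "__label__finance_news"]))
    print(predict({"__label__sports_news": 0.7, "__label__finance_news": 0.3}))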
Bag of Tricks for Efficient Text Classification: https://arxiv.org/abs/1607.01759v2
Enriching Word Vectors with Subword Information: https://arxiv.org/abs/1607.04606