NLTK has always seemed like a bit of a toy when compared to Stanford CoreNLP. I'd be very curious to see performance/accuracy charts on a number of corpora in comparison to CoreNLP.
The Cython implementation makes it somewhat believable that it's faster than CoreNLP, but I'd also like to hear a deep-dive on why it's several times faster, beyond the (stipulated) point that control over memory layout is the best way to win performance. In particular, it would be good to know whether CoreNLP is doing more processing than spaCy or otherwise handling more concerns.
Finally, I'd really love to see a feature table comparing spaCy with CoreNLP.
> The Cython implementation makes it somewhat believable that it's faster than CoreNLP, but I'd also like to hear a deep-dive on why it's several times faster,
Time complexity. The Stanford parser is a phrase structure parser that creates dependencies as a post-processing step. So, assuming that they use some variation of CKY, the time complexity is O(N^3 |G|), where |G| is the size of the grammar. spaCy uses Nivre-style greedy parsing, which is O(N).
So, a slightly fairer comparison would be e.g. the Malt parser. spaCy will probably still come out ahead of it in terms of accuracy, though, since last time I checked the Malt parser didn't use dynamic oracles yet and by default doesn't integrate Brown clusters or word embeddings (though you could do that yourself). I do wonder a bit about feature set construction, because in my experience perceptrons are far more sensitive to adding 'wrong' features than e.g. SVM classifiers. This becomes interesting especially when you train a model for another language or dependency annotation scheme, since the features that are relevant differ per set-up.
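To make the O(N) claim for greedy transition-based parsing concrete, here is a minimal sketch. It assumes an arc-standard-style transition system and a toy policy standing in for the trained classifier; `greedy_parse` and `toy_policy` are hypothetical names, and this is not claimed to be how spaCy or CoreNLP is actually implemented. Each word is shifted once and attached once, so a sentence of N words takes exactly 2N transitions.

```python
from collections import deque

SHIFT, LEFT, RIGHT = "SHIFT", "LEFT", "RIGHT"


def greedy_parse(n_words, choose_action):
    """Parse words 1..n_words (0 is an artificial ROOT) in one greedy pass.

    choose_action is assumed to return only legal actions for the state.
    """
    stack = [0]
    buffer = deque(range(1, n_words + 1))
    heads = {}
    n_transitions = 0
    while buffer or len(stack) > 1:
        action = choose_action(stack, buffer)
        if action == SHIFT:
            # Move the next word from the buffer onto the stack.
            stack.append(buffer.popleft())
        elif action == LEFT:
            # Second-from-top becomes a dependent of the top of the stack.
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        else:
            # RIGHT: top of the stack becomes a dependent of the item below.
            dependent = stack.pop()
            heads[dependent] = stack[-1]
        n_transitions += 1
    return heads, n_transitions


def toy_policy(stack, buffer):
    """Stand-in for a classifier: shift everything, then attach rightwards."""
    return SHIFT if buffer else RIGHT


heads, steps = greedy_parse(5, toy_policy)
print(steps)  # 10, i.e. 2 * N for N = 5
```

A chart parser in the CKY family instead fills a table over all spans and split points, which is where the cubic term comes from.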
That's not true --- I'm comparing against their neural-network shift-reduce dependency parser, which is very fast. Actually I don't know of a faster parser than theirs, other than spaCy.
First of all, the target output of the two systems is exactly the same --- labelled dependency parses with the same schemes, parts-of-speech, tokens, and lemmas. At least, that's how I run CoreNLP in this benchmark. It has some other processing modules, but I turn them off for the speed comparison.
Second, very very similar algorithms are being run. The new CoreNLP model uses greedy shift-reduce dependency parsing, same as spaCy. That CoreNLP model was published late last year; before that, CoreNLP only had the older polynomial-time parsing algorithms implemented, which are much slower, and often less accurate.
The contribution of Chen and Manning's paper is to use a neural network model, where I'm using a linear model. (More specifically: they show some interesting tricks to make the neural network actually perform well. I suspect many people have tried to do this and failed.)
Chen and Manning say that their model is much faster than a linear model, because the linear model must explicitly compute lots of conjunction features --- I use about 100 feature templates.
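For readers unfamiliar with conjunction features, here is a small sketch of how a linear model scores parser actions from feature templates. The template list, the atomic feature names (`s0_word`, `b0_tag`, and so on) and the weights are all made up for illustration; they are not spaCy's actual templates. The point is that every template costs a string build and a table lookup for every parser state.

```python
# Hypothetical templates: each tuple names the atomic features to conjoin.
TEMPLATES = [
    ("s0_word",),
    ("s0_tag",),
    ("b0_word",),
    ("s0_tag", "b0_tag"),
    ("s0_word", "s0_tag", "b0_tag"),
]


def extract_features(atoms):
    """Instantiate every template against the atomic values of one state."""
    return [
        "+".join(template) + "=" + "|".join(atoms[name] for name in template)
        for template in TEMPLATES
    ]


def score_actions(features, weights, actions):
    """Sum the weights of all active features, separately for each action."""
    scores = {action: 0.0 for action in actions}
    for feat in features:
        for action, weight in weights.get(feat, {}).items():
            scores[action] += weight
    return scores


# Toy state: stack top is "form/NN", next buffer word is "of/IN".
atoms = {"s0_word": "form", "s0_tag": "NN", "b0_word": "of", "b0_tag": "IN"}
weights = {
    "s0_tag+b0_tag=NN|IN": {"SHIFT": 1.5, "LEFT": -0.5},
    "s0_word=form": {"SHIFT": 0.25},
}
scores = score_actions(extract_features(atoms), weights, ["SHIFT", "LEFT", "RIGHT"])
print(max(scores, key=scores.get))  # SHIFT
```

With around 100 templates this loop runs around 100 lookups per state, which is the cost the neural model avoids by feeding dense embeddings of the atomic features into a hidden layer.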
So, they probably have something of an algorithmic advantage over my parser, although the extent of it is unclear. I'll only know when I implement their model. It's not terribly hard to do --- it's just a neural network --- but it's lower on my queue than a number of other things I want to work on. My hunch is that I won't see nearly as much benefit from it as their results suggest, because their baseline is quite weak.
So, I do think all we're seeing here is the same algorithm implemented in Java and C, so the C version is coming out 7x quicker. This makes sense to me. But, possibly the CoreNLP parser has to do some contortions to integrate into their framework. I don't know.
There's also a meta-level point. Maybe I just tried harder. The Stanford paper would still have been accepted, and still have been great, if it ran at 50% of the speed that it does. And we'll probably never know what would happen if the author spent a month doing nothing but trying to optimise the code --- I can't imagine he/she ever will. That wouldn't get a publication.
As for your other question, about what spaCy offers and what CoreNLP offers: these are the main things I'm missing at the moment:
* Named entity recognition
* Phrase-structure parsing
* Coreference resolution
I have some preliminary work on NER. I plan to roll that out next, along with some word-sense disambiguation. PSG parsing is no problem to do either.
Thanks for the suggestion to include an evaluation of OpenNLP -- I'll do that.
> spaCy’s parser offers a better speed/accuracy trade-off than any published system: its accuracy is within 1% of the current state-of-the-art, and it’s seven times faster than the 2014 CoreNLP neural network parser, which is the previous fastest parser that I’m aware of.
First of all: great work! Even though I think OpenNLP should also be mentioned, since it's released under the commercially-friendly Apache 2 license, it's great that you provide this. Also, I think it is a shame that science lost you (I assume) ;).
I think the other interesting problem to tackle currently is training data. The situation for English is ok, if you have a couple of thousand dollars to spare for a commercial license (which may be problematic for bootstrapping). But for many other languages there aren't even treebanks available that can be used for commercial purposes.
It would be great if some annotation project started that aimed to provide annotations under a liberal license.
(Ps. I have a statistical dependency parser written in Go, which I will probably release soon in case anyone is interested ;).)
Well, I'm distributing trained models with this. Users shouldn't need to retrain unless they're doing research, in which case they should have access to the data.
I agree that the data situation is troubling, though. I don't understand why Google gave the English Web Treebank to the LDC. Why not just distribute it themselves?
The LDC is really more of a problem than a help now. For instance, the OntoNotes corpus costs $0 for non-commercial use. Great! How do you get it? Send the LDC a fax, and when they get around to it, they send you a login to their ancient website.
It used to be a valuable service to host and distribute this data. Now, this is no longer really the case, but it's still standard to distribute via them.
> Well, I'm distributing trained models with this. Users shouldn't need to retrain unless they're doing research, in which case they should have access to the data.
I haven't read LDC's license on Penn Treebank recently, but AFAIR you cannot just redistribute models that were trained on the Penn Treebank. Or put differently, you can distribute the model, but any users still have to obtain a license for the treebank. That's why we are still stuck with the Brown corpus and such.
> I don't understand why Google gave the English Web Treebank to the LDC. Why not just distribute it themselves?
"since it's released under the commercially-friendly Apache 2 license"
I do hate the proliferation of the idea of BSD-style licenses being "commercially friendly". You can do one hell of a lot of things commercially with GPL software. It's only under very specific circumstances that you are obliged to release source for GPL software. And as (I would wager probably) ~99% of NLP software is used internal to a company and never distributed as such, the differentiation between licensing models would never actually come into play.
The idea of implementing cutting-edge NLP algorithms is fantastic and greatly needed. However, I believe the multi-licensing will not be sustainable in the long term. It limits others' ability (and interest) to contribute to the library, because you'll have to get a copyright transfer for any pull request in order to merge it into the commercial branch. It seems difficult to imagine one person being able to develop and maintain a library of this scope. This is particularly true when that person depends on it for income, unlike some tenured academic who can invest all of their time into the project without much risk or need for short-term gains.
The idea of a tenured academic spending all day coding their non-research library is...let's just call it unrealistic. If you find an hour to write code, it will not be during business hours.
I'm curious why you think there's a market failure here. If a library like this can produce N salaries of value, then it should be able to earn N salaries of revenue. Maybe it takes >N salaries of work to produce it --- in which case, okay. That means this project is more costly than it is useful.
I definitely believe that market failures exist, and are quite common. But the service I'm trying to sell is trying to create economic value very directly. I'm not writing poetry here.
I'm not saying that tenured faculty working on these projects is the norm; that's not at all the case. It's just that the few instances where I know of someone successfully building and maintaining a large low-level library like this on their own come from dedicated academics who can de facto make it the majority of their job. The best example I know of is Tim Davis' SuiteSparse C library for sparse matrix algebra.
Aside from that, the market failure I foresee for your project is the following: Say you keep building this out and write a fantastic, state-of-the-art general-purpose NLP Python library. Now an academic like myself comes along and forks the AGPL version of your repository, contributing additional functionality to the parts of the pipeline I am most familiar with. You cannot re-incorporate my work into your commercial license (unless I sign those rights away, which I won't), so now you're stuck trying to license an inferior version of my fork. Meanwhile, since mine truly is just open source, my version can freely accept both bug fixes and added functionality from one-off contributors who are using said fork. Better yet, unless you change your business model, I can continue to re-fold in any changes you make upstream, as well as include parts of other GPL libraries that build up in the intervening time.
Now, I'm not saying this is a perfect argument. Perhaps enough people are still interested in paying you for a commercial version of the original software, but I think in the long run, as the two versions diverge, that's unlikely to generate sufficient revenue for you.
My thinking is that there will be commercial contexts where an AGPL license is a non-starter --- it's incompatible with the business model. If so, I think it makes sense for them to buy a commercial license from me, even if it lacks features in your fork. If they can't use AGPL code, your fork may as well not exist, except as this tantalizing something-I-can-never-have that makes the main library look worse.
I'd also note that you gain absolutely nothing from maintaining your separate fork, other than the principle of the thing. I'm compelled to distribute the code under the AGPL, just as you are. If your features are compelling you could instead negotiate with me for a cut of the license fee.
You may want to take careful note of the issue ~laGrenouille raises about others' contributions, though.
Say a patch/pull-request is freely offered to your AGPL project. I believe it's well-understood that such a submission includes implicit permission for distribution under that same open-source license.
But you will also be licensing that same contributed code under another custom commercial license. Contributors should understand that – you've made it clear in surrounding docs – but whether you can legally assume their re-licensing permission, without an explicit copyright-assignment, is a bit murkier of an issue.
In fact I would wish for more thematic articles from people like OP, who know the topic. It is easy to find some introductory course on NLP, but introductory is introductory, and as OP states there's a visible gap between what Google does in 2015, and what some GPL/MIT/BSD-licensed project does in 2015 in that area. While there's a relatively large amount of material on DSP or CV, all these linguistics-related areas seem to have quite a barrier to entry even for those willing to learn.
I have a feeling that's because a great deal of NLP use is either proprietary or academic. It's either buried away in a combination of academic papers and academic heads or hidden away in a corporate codebase the world doesn't have access to.
The best way I'm aware of to get into NLP is to take a course at a university. My university (University of Melbourne) was lucky to have an undergraduate subject taught by one of the lead authors of NLTK (Steven Bird) and that was a great help. You can even take the subject on its own without enrolling in a full course.
So, how does this compare to Pattern (http://www.clips.ua.ac.be/pages/pattern) -- a (in my opinion) very high-quality, BSD-licensed data mining and NLP library coming from the academic world?
From a quick speed test on my laptop, Pattern is 48x slower at POS tagging, and 8x slower at parsing. I last benchmarked its accuracy in 2013, when I found it got 93.5% on the WSJ corpus, vs 97% for the state-of-the-art taggers --- so twice as many errors. It was also more domain-dependent. Its parser doesn't produce exactly the same representations as mine, so I can't easily evaluate its accuracy. But, I doubt it's very high.
Pattern doesn't really use machine learning, just some pre-computed statistics from the annotated data, and some hand-crafted rules. Machine learning is good. It's really the right way to build these systems.
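The "twice as many errors" remark above follows from comparing error rates rather than accuracies. A quick check using only the two accuracy figures quoted in the comment (`error_ratio` is a hypothetical helper name):

```python
def error_ratio(acc_a, acc_b):
    """How many times more errors system A makes than system B (accuracies in %)."""
    return (100.0 - acc_a) / (100.0 - acc_b)


# 93.5% accuracy means 6.5% errors; 97% accuracy means 3% errors.
ratio = error_ratio(93.5, 97.0)
print(round(ratio, 2))  # 2.17
```

So a gap of 3.5 points in accuracy corresponds to slightly more than double the number of mistagged tokens.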
One of the great things about modern NLP is that a system can easily be trained for another language (assuming that you have training data, see my other comment). Since this uses statistical NLP techniques, it should be easy to add languages.
Sorry you're dealing with so many licensing questions here but a quick clarification:
> If their company is acquired, the license will be transferred to the company acquiring them. However, to use spaCy in another product, they will have to buy a second license.
Is the second license only required because they sold the company on (and the license along with it), or is a license per product generally required? In other words, if I buy a single license, can I make and sell two different products?
Actually I appreciate the license questions, because it makes it easier to know how to re-write the docs to clarify.
One license allows you to develop one product. If you stop work on one thing you can re-use the license on something else, though --- it would be silly to ask at what point a change of focus becomes a different product.
This seemed the sanest way to do it. I think per-site, per-user etc licenses are really stupid. The license then impinges on your technical decisions.
This is great, thanks! A few weeks ago I had the problem of not being able to get a passive-verb parser in CoreNLP to run fast enough. Does spaCy support reduced passives?
You can write rules to find them in the dependency parse, although the parse tree won't necessarily be correct.
I've thought a lot about passive reduced relative clauses over the years --- they were a big part of my PhD thesis. So I happen to know that the first one in the WSJ data is wsj_0003.1. This isn't in the training or development data, but it's in the same data set --- so, this is a fair but optimistic spot-check. The sentence is:
> A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago, researchers reported.
There are two reduced passives here --- "used" and "exposed", and a potential (but unlikely) false positive in "reported".
spaCy correctly attached "exposed" to "workers", but didn't attach "used" correctly --- it attached it to "reported" instead of "form". This doesn't really make syntactic sense, but that's what it did --- the system's entirely statistical; there's no grammar licensing certain attachments.
To see the parse, run:
    from __future__ import print_function
    from spacy.en import English
    nlp = English()
    tokens = nlp(u'A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago, researchers reported.')
    for word in tokens:
        # Surface form, POS tag, dependency label, and the head's surface form.
        print(word.orth_, word.tag_, word.dep_, word.head.orth_)
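As a rough illustration of the "write rules over the dependency parse" suggestion above, here is a sketch that works on plain tuples rather than on spaCy objects. The `Token` layout, the rule itself, and the dependency labels are assumptions made for the example, and the toy parse is hand-written rather than real parser output. The rule flags a past participle (tag `VBN`) that has no auxiliary attached to it and whose head is a noun.

```python
from collections import namedtuple

# i: position, head: position of the head token (the root points at itself).
Token = namedtuple("Token", "i word tag dep head")

AUX_LABELS = ("aux", "auxpass")  # assumed label names for auxiliaries


def find_reduced_passives(tokens):
    """Return VBN tokens with no auxiliary child that modify a noun."""
    verbs_with_aux = {t.head for t in tokens if t.dep in AUX_LABELS}
    hits = []
    for t in tokens:
        if t.tag != "VBN" or t.i in verbs_with_aux:
            continue
        if tokens[t.head].tag.startswith("NN"):
            hits.append(t)
    return hits


# Hand-written toy parse of "Workers exposed to it have died".
toy_parse = [
    Token(0, "Workers", "NNS", "nsubj", 5),
    Token(1, "exposed", "VBN", "partmod", 0),
    Token(2, "to", "IN", "prep", 1),
    Token(3, "it", "PRP", "pobj", 2),
    Token(4, "have", "VBP", "aux", 5),
    Token(5, "died", "VBN", "ROOT", 5),
]
print([t.word for t in find_reduced_passives(toy_parse)])  # ['exposed']
```

A rule like this is only as good as the attachments it reads: in the asbestos sentence it would miss "used", because the parser attached it to "reported" instead of "form".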
Glad to see the progress; I'll pass the word along. Could you comment on semantic parsing: can your parser help as one stage in a pipeline, can the task be factored through your parser, or, as in task 8 of SemEval '14 [1], do you need to rethink your structure (dependency tree vs. dependency graph)?
This is a really neat project OP, and I hope you can make enough money to sustain development. I really, really wish academia did a better job of sponsoring people to maintain high quality software libraries, so you didn't have to 'strike out' on your own though.
I appreciate the thought, but if this isn't useful enough to support itself, then obviously I was wrong, and I should find a more valuable project. But I don't think that's the case --- I think this will help a lot of people build useful products, so the commercial license should fund its development quite adequately.
Then don't consider it a donation, but as a separate pricing tier. Currently, you are charging $5000 to those who don't want AGPL, and $0 to those who are fine with AGPL. You can charge, say, $100 to those who are fine with AGPL but want to pay.
If you want to increase sales, you could include donors' names in the documentation in return for $100, for example.
I've considered it, yes --- but it's hard. Currently I segfault under PyPy. I've got the learner and hash-table code working, but I need to debug the NLP. I suspect it's the way I'm interning my strings.
I didn't know what NLP is; maybe you should explain it (or simply write out the full words instead of the abbreviation) once at the beginning of your website.
TextBlob is a wrapper around NLTK and Pattern. Those libraries don't use very sophisticated statistics, so in 2013 I wrote a small Python POS tagger for TextBlob, which performs much better: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-spe...
spaCy's POS tagger works like the one in the blog post, but it's implemented in Cython, and has some extra features.
Pattern has some nice morphological processing features. I don't do morphological generation, for instance, and I haven't hooked up the morphological analysis to the Python API yet.
TextBlob also gives you a few extra bits and pieces, like a wrapper of the Google Translate API.
I'm really targeting the situation where you want to build a product around some NLP. In this use-case, you need the NLP to be fast, you need it to be as accurate as anyone knows how to make such a system, and you need it to be entirely in your control.
As far as GenSim goes: it's good. It does different things from spaCy, though --- topic modelling, etc. It would be nice to interoperate between the two libraries. I have no plans to implement topic models.
...and I have no plans to add NLP tools in gensim. The connection between gensim and tokenizing/tagging/parsing libs is intentionally loose and flexible.
I'm a fan of "do one thing, do it well".
Having said that, it would be great to facilitate "spaCy + gensim" pipelines for users.
For example, the "word vector representations" can be trained easily with gensim, on arbitrary user-specified corpora, whereas spaCy loads something pre-trained, in a specific format. Maybe room for some interoperability there?
> But the academic code is always GPL, undocumented, unuseable, or all three.
I'm not sure why this author is further propagating FUD that suggests GPL code is unsuitable for commercial use. Just because companies are irrationally afraid of the GPL doesn't make it true.
This was my understanding --- actually I designed the licensing structure of this project around the assumption that companies would not want to use GPL licensed code commercially. I offer an AGPL license, and offer a commercial license for a fixed fee.
My understanding is that if you link to the library, your code must also be GPL, which means that anyone linking to your code must be GPL, etc.
This is a problem if you're trying to sell your code. Probably you don't want to make it GPL, and you probably don't want to force your customers to make their code GPL.
> This is a problem if you're trying to sell your code.
I think relatively few tech companies are trying to sell their code directly these days. Most are hoping to build a product and/or service and then charge customers for access to it.
An API or a web service counts as distribution under the AGPL. If you run such a service, and you use spaCy --- either by linking the binary, or using it as a network service --- you'll have to AGPL your code. Which introduces equivalent restrictions on anyone who uses your service.
Because GPL is hard to work with. I want to write code, not dig through legalese or discuss with a lawyer whether it's okay for me to do X or Y. More often the easier option is either to completely isolate GPL'ed bits (i.e. run GPL software on a separate machine so that there's no possible interpretation under which we're modifying it), or use something with a different license.
Or are you saying companies should be more willing to GPL their products? That's unlikely to happen.
Compelling work!