It heaps on a lot of theoretical musing without providing either empirical evidence of practical relevance or a theoretical guarantee along the lines of "as long as the model is within delta of [something intuitively verifiable], it will have [useful property] with p > 1 - epsilon."
For what you're trying to do, there is a relatively solid baseline, namely doing something trivial to first-order logic (e.g. assigning a dimension to each Herbrand formula or ground term or whatever) and turning generalized quantifiers into tensors (with some mathematical plumbing required). This would allow you to reuse model-theoretic semantics with a serving of tensors and vector spaces on top, and it would do at least those things that generalized quantifiers can do. You could then argue empirically for some kind of finite-dimensional approximation to that, or expose neat theoretical properties from a formal viewpoint.
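To make "trivial" concrete, here's a toy sketch of what I have in mind (mine, not anything from the paper): a tiny Herbrand universe with one dimension per ground term, unary predicates as 0/1 vectors over it, and quantifiers like "some"/"every"/"most" as (bi)linear tests on those vectors. All names and numbers below are made up for illustration.

    import numpy as np

    # Toy Herbrand universe: one dimension per ground term.
    universe = ["alice", "bob", "carol", "dave"]
    dim = {t: i for i, t in enumerate(universe)}

    def predicate(*members):
        """Characteristic vector of a unary predicate over the universe."""
        v = np.zeros(len(universe))
        for m in members:
            v[dim[m]] = 1.0
        return v

    # Unary predicates as 0/1 vectors.
    student = predicate("alice", "bob", "carol")
    sleeps  = predicate("alice", "bob")

    # Generalized quantifiers as (bi)linear tests on those vectors --
    # the "tensor" view: each quantifier is a map R^n x R^n -> {0,1}.
    def some(A, B):  return A @ B > 0
    def every(A, B): return A @ (1 - B) == 0
    def most(A, B):  return A @ B > A @ (1 - B)

    print(some(student, sleeps))   # True:  some student sleeps
    print(every(student, sleeps))  # False: carol doesn't sleep
    print(most(student, sleeps))   # True:  2 of 3 students sleep

Obviously the interesting work starts once you approximate this in finite dimensions with learned, non-indicator vectors; the point is just that the logical baseline is cheap to set up.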
As is, you don't do any of that. It's more or less like going into a hardware store, getting some pipe and toilet bowls and making an abstract sculpture without ever talking about the house for which you want to provide the plumbing.
You're right - I talk about how to incorporate logical representations in my thesis, but the reviewers asked me to remove it as it wasn't complete enough. We had some more thoughts about the correct way to do this in this paper: http://homepages.feis.herts.ac.uk/~dc09aav/publications/iwcs...
Personally, I think the NLP problem is, and will always be, a graph problem. To the extent that you can approximate and accelerate graph algorithms with vectors, then fine, but vectors are not the fundamental space.
One interesting aspect of Coecke's work is the reuse of Hilbert spaces and mathematical formalism from quantum mechanics. In fact, the most fascinating papers on his site are the ones where he simplifies and visualizes QM intuition based on monoidal categories, very much in the spirit of Baez & Stay:
Neither Clarke nor Coecke gives enough prominence to the great work of Aerts & Gabora on QM-style attribute spaces and the 'collapse' of knowledge vectors, e.g.
The question ultimately boils down to: Can we represent variable-length sequences (e.g. sentences) in a fixed-length representation (vector with k bits), without losing any meaning?
Fixed-length representations are more useful, because we can use standard learning machinery (ML) to predict over them. Learning techniques over variable-length sequences are more primitive: a linear-chain CRF over token sequences, for example, cannot capture long-distance dependencies, whereas a two-layer neural network can in principle approximate any continuous function, given enough hidden units.
I used to believe that a variable-length sequence would, ipso facto, require a variable-length representation. However, Leon Bottou argued there must be an upper bound (1000?) on the number of bits required to represent English sentences that a human could recognize and parse in the course of normal conversation. I'm not talking about a pathological grammatical case or some Old Testament-like inventory of someone's possessions. I mean simply a sentence that you could parse and repeat back to me in your own words.
My problem with the cited work is that it is purely theoretical, and does no empirical work to explore potential limitations of the framework. It is difficult, without throwing the approach at real data, to see if it is actually an effective model for practical use. I haven't evaluated the approach deeply enough to poke any specific holes in it.
The author writes "there is currently no general semantic formalism for representing meaning in terms of vectors". However, I believe this is untrue. The author is seemingly unaware of the entire connectionist literature on fixed-length representations, which is based on recursive neural networks. For example, Pollack's recursive auto-associative memory (RAAM, 1988), the Labeled RAAM architecture, holographic reduced representations (Plate, 1991), and the recursive nets used by Sperduti and collaborators in the mid-90s are all highly germane, yet remain uncited. In principle, these architectures are powerful enough to represent all meaning in fixed-length vectors and to operate over those vectors effectively. The problem with these approaches isn't theoretical, it's practical: we simply don't know how to train them effectively. I find it annoying when a theoretician makes claims on the basis of existing theoretical models while being ignorant of existing empirical ones.
RAAM in particular is pretty cool. It's a fixed-length machine trained to eat the input left-to-right, and it is designed so that it can uncompress itself right-to-left. So it has two basic operations: consume and uncompress. Each time it eats an input token, it outputs a new machine of the same fixed length. Each time it uncompresses a token, it outputs the token and a new machine of the same length. Very cool!
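For anyone who hasn't seen it, here is a rough sketch of just that consume/uncompress interface, with random untrained weights. In a real RAAM, W_enc and W_dec are trained by backprop as an auto-associator so that uncompress actually inverts consume; the sizes and names below are my own.

    import numpy as np

    rng = np.random.RandomState(0)
    TOK, HID = 8, 16          # token embedding size, fixed "machine" size

    # Random (untrained) weights -- in a real RAAM these are learned so that
    # uncompress inverts consume, via auto-association.
    W_enc = rng.randn(HID, HID + TOK) * 0.1   # (state, token) -> new state
    W_dec = rng.randn(HID + TOK, HID) * 0.1   # state -> (previous state, token)

    def consume(state, token):
        """Eat one token, return a new fixed-length state."""
        return np.tanh(W_enc @ np.concatenate([state, token]))

    def uncompress(state):
        """Pop one token, return (previous state, token reconstruction)."""
        out = np.tanh(W_dec @ state)
        return out[:HID], out[HID:]

    # Encode a 3-token "sentence" left to right into one fixed-length vector...
    tokens = [rng.randn(TOK) for _ in range(3)]
    state = np.zeros(HID)
    for t in tokens:
        state = consume(state, t)

    # ...then decode it right to left.
    for _ in range(3):
        state, tok_hat = uncompress(state)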
As you can tell, I am more excited by purely empirical and data-driven vector-based methods. For vector-based word meanings, see the language model of Collobert + Weston, which I summarized in this paper: http://www.aclweb.org/anthology/P/P10/P10-1040.pdf You can also download some word representations and code to play with here: http://metaoptimize.com/projects/wordreprs/
I have done some work with RAAMs and other simple-recurrent-network-based methods. It's terribly tedious; you can hardly represent more than five symbols or so.
I did my Bachelor's thesis on related techniques (pressing variable-length strings into fixed-size vectors so neural nets can handle them; URL below), and as far as neural networks go, I can only say that their decoding capabilities are severely limited. I actually developed a Jordan network with a conventional additive sigmoid NN that could encode/decode more than 40 symbols! The technique was based on Cantor coding, but I had to set the weights by hand and could not retrain without losing performance (the fixed-point attractor is unstable under Backpropagation Through Time), i.e. it's kind of a dirty hack.
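If it helps, here's the bare arithmetic idea behind Cantor-style coding, stripped of the network entirely (this is just an illustration, not the Jordan network from my thesis): pack symbols into nested base-K intervals of [0, 1), then decode by repeatedly shifting out the leading digit. The precision limit you hit here is exactly the kind of thing that bites the neural version too.

    ALPHABET = "abcd"
    K = len(ALPHABET)

    def encode(s):
        """Pack a symbol string into a single number in [0, 1) as a base-K fraction."""
        x = 0.0
        for ch in reversed(s):
            x = (ALPHABET.index(ch) + x) / K
        return x

    def decode(x, length):
        """Unpack by repeatedly shifting out the leading base-K digit."""
        out = []
        for _ in range(length):
            x *= K
            d = int(x)
            out.append(ALPHABET[d])
            x -= d
        return "".join(out)

    x = encode("abcabdd")
    print(decode(x, 7))   # "abcabdd" -- works until double precision runs out
                          # (roughly 26 symbols for K = 4)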
It's also really hard to adapt parameters for sufficiently large nets (so my suspicion is that you need huge nets to represent language). To deal with this complexity limitation I've also looked into reservoir-computing-style networks (Echo State Networks), which pair a large, randomly initialized recurrent network with a learned linear readout. ESNs may be good at modelling many kinds of temporal dynamics, but their capacity to represent relations seems rather limited as well.
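For reference, a minimal ESN looks roughly like this (my own toy, doing next-step prediction on a sine wave rather than anything linguistic): a fixed random reservoir rescaled to spectral radius below 1, with only a linear readout fit by ridge regression.

    import numpy as np

    rng = np.random.RandomState(0)
    N_IN, N_RES = 1, 200

    # Fixed random reservoir, rescaled so the spectral radius is below 1
    # (the usual "echo state" heuristic); only the linear readout is trained.
    W_in  = rng.uniform(-0.5, 0.5, (N_RES, N_IN))
    W_res = rng.uniform(-0.5, 0.5, (N_RES, N_RES))
    W_res *= 0.9 / max(abs(np.linalg.eigvals(W_res)))

    def run_reservoir(u_seq):
        x, states = np.zeros(N_RES), []
        for u in u_seq:
            x = np.tanh(W_in @ u + W_res @ x)
            states.append(x)
        return np.array(states)

    # Teach the readout to predict the next value of a noisy sine wave.
    t = np.linspace(0, 50, 2000)
    u = np.sin(t)[:, None] + 0.01 * rng.randn(len(t), 1)
    X, Y = run_reservoir(u[:-1]), u[1:]

    ridge = 1e-6
    W_out = np.linalg.solve(X.T @ X + ridge * np.eye(N_RES), X.T @ Y)
    pred = X @ W_out   # one-step-ahead predictions

This works nicely for smooth dynamics like the sine wave above; the trouble I describe is that a purely linear readout over a random reservoir doesn't seem to buy you much when the task is representing structured relations.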
So squeezing an arbitrary-length string into a vector may sound attractive, but either (a) don't expect to be able to decode it using conventional NNs, or (b) don't expect to achieve compression; i.e. blowing up your representation may help (cf. LVQ nets), whereas compressing techniques such as Hinton's deep belief networks won't.
I believe the problem is the training algorithm, backprop, not the model (NNs). "unstable fixed point attractor of Backpropagation through Time", as you said.
Ilya Sutskever in Geoff Hinton's lab has had great success recently using Hessian-free optimization to train recurrent networks. He has trained character-level RNNs on Wikipedia, and they can generate very long sequences of quasi-plausible text. In particular, it seems like they can remember many symbols.
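The architecture itself is tiny; below is a bare character-level RNN sampler with random (untrained) weights, just to show the shape of the model. It deliberately omits the hard part, the Hessian-free training, and if I recall correctly his published model is a multiplicative RNN rather than the vanilla one sketched here.

    import numpy as np

    rng = np.random.RandomState(0)
    chars = list("abcdefghijklmnopqrstuvwxyz ")
    V, H = len(chars), 64

    # Random (untrained) weights; in practice these would be fit on Wikipedia
    # text, e.g. with Hessian-free optimization as in Sutskever's work.
    Wxh, Whh, Why = [rng.randn(*s) * 0.01 for s in [(H, V), (H, H), (V, H)]]

    def sample(n, seed=0):
        h, c, out = np.zeros(H), seed, []
        for _ in range(n):
            x = np.zeros(V); x[c] = 1.0              # one-hot current character
            h = np.tanh(Wxh @ x + Whh @ h)           # recurrent state update
            p = np.exp(Why @ h); p /= p.sum()        # softmax over characters
            c = rng.choice(V, p=p)
            out.append(chars[c])
        return "".join(out)

    print(sample(100))   # gibberish until the weights are actually trained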