Data is good, code is a liability (glinden.blogspot.com)
54 points by Anon84 on Nov 6, 2008 | 22 comments



I worked on Google's statistical machine translation system during my internship. There, I learned that data really is king. The Google Translator team spends as much effort collecting data as they do improving their algorithms.

The 2008 NIST results [1] show that Google's translator swept every category with unconstrained training sets. That is, when Google was allowed to use all of the data they had collected, they smoked the competition. When the training sets were constrained to a common set for all competitors, better algorithms prevailed. You can be sure that the very talented team at Google will be improving their algorithms to ensure that never happens again. But you can also be sure that competitors will be collecting even more data to counter Google's victories.

[1] http://www.nist.gov/speech/tests/mt/2008/doc/mt08_official_r...


But who remembers how many webpages Google indexed in 1998, and how many AltaVista indexed? Data is important for spam filtering, translation, etc., but it is far from "king". We have the notion that data matters more than algorithms only because we now have some really good statistical learning methods.


Sounds like a great job. GIZA++ is an incredible piece of software and it sounds like they've got a great team.


I see this at a micro level frequently when coding. Very often code that is a complex bunch of if / else statements is dramatically simplified by turning it into a map / dictionary with pointers to either data or functions to handle that type of data (object oriented polymorphism being an instance of this).

There are also interesting parallels with REST vs. RPC. You can create a rich API of function calls for accessing and manipulating data, but it's nearly always less flexible than just exposing the data and letting people manipulate it directly.
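
As a toy illustration of that contrast (all the names below are made up, not any real API): an RPC-style service exposes only the operations its author anticipated, while a REST-style view hands back the resource itself as data.

    # Hypothetical sketch: RPC-ish accessors vs. exposing the record as data.
    # RPC-ish: a fixed set of methods; anything unanticipated needs a new call.
    class UserServiceRPC(object):
        def __init__(self, db):
            self._db = db
        def get_email(self, user_id):
            return self._db[user_id]['email']
        def set_email(self, user_id, email):
            self._db[user_id]['email'] = email

    # REST-ish: hand back the resource itself; the client decides what to do.
    def get_user(db, user_id):
        return dict(db[user_id])

    db = {42: {'email': 'ada@example.com', 'name': 'Ada'}}
    print UserServiceRPC(db).get_email(42)   # only what the API author foresaw
    print get_user(db, 42)                   # the whole record, as plain data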

I think the tendency to favor algorithms when it might otherwise not be wise to do so comes from how our minds work: we remember things primarily in terms of stories, scenarios, sequences of events. This causes us to interpret the world in terms of behavior as if behavior is the primary construct on which the universe is modeled. But of course behavior is not primary, data is primary, things are primary - behavior is just a fiction we impose on them. This often leads our instincts in the wrong direction.


Hmmm, replacing if/else statements with a map/dictionary holding pointers to either data or functions. A little off topic here, but how do you propose to do this? Assuming we know what polymorphism is, your map/dictionary style is quite interesting.


For languages with first-class functions, it looks like this:

    # Algorithms
    def double(x): return x*2
    def square(x): return x*x
    def fact(x): return (x*fact(x-1) if x > 1 else 1)

    # Data
    choices = { 'A': double, 'B': square, 'C': fact }

    # I/O
    choice = raw_input('Choose A, B or C: ')
    x = input('Enter a number: ')
    if choice in choices:
        print choices[choice](x)
    else:
        print 'Initiating self-destruct sequence.'

It's actually similar to how a switch block works, if each case in the switch statement just calls a function or evaluates one expression.

Also worthwhile: Instead of functions, let the dictionary values be lists of arguments for another (multi-argument) function. Then the lookup is like choosing from a set of possible configurations for that function. A little redundancy is OK, since the table is so easy to read and edit.
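
For instance, a minimal sketch of that variant, in the same style as the example above (the function and the configurations are made up for illustration):

    # Data: each entry is just a list of arguments for one shared function.
    def scale_and_shift(x, factor, offset):
        return x * factor + offset

    configs = {
        'A': [2, 0],     # double
        'B': [10, 1],    # times ten, plus one
        'C': [1, 100],   # add one hundred
    }

    choice = raw_input('Choose A, B or C: ')
    x = input('Enter a number: ')
    if choice in configs:
        # The lookup picks a configuration; the function stays generic.
        print scale_and_shift(x, *configs[choice])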


Parse tables and regex compilation are perfect examples of this style. Lookup tables are another good example.
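
To make that concrete, here is a toy transition table for a tiny DFA (accepting binary strings with an even number of 1s); the states and alphabet are invented, but the shape is the same as a real parse table:

    # The machine's behavior lives entirely in this table, not in branching code.
    transitions = {
        ('even', '0'): 'even',
        ('even', '1'): 'odd',
        ('odd',  '0'): 'odd',
        ('odd',  '1'): 'even',
    }

    def accepts(s):
        state = 'even'
        for ch in s:
            state = transitions[(state, ch)]
        return state == 'even'

    print accepts('1001')   # True: two 1s
    print accepts('1011')   # False: three 1s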


I know it's a popular opinion nowadays, but here's what keeps me from adopting this view (huge amounts of data over algorithms) wholesale: humans make smart decisions on very little data. How many faces does a child have to "process" in order to learn to recognise faces? Not many. How much does a person have to read in order to learn correct spelling? Not the entire Google index, I suppose.

Humans work neither on simple deterministic rules nor on huge amounts of data. It's something else. Some very smart "algorithm" that we haven't found yet (Bayes nets don't get there either but they look promising).

If there's a way for humans to be smart without much data there must be a way for machines to do the same. That is unless you believe in some kind of spirit/soul/god cult and I don't.


Humans are superior to machines in several ways:

- we get tons of data, just not all textual. We have visual (~30fps at well above HD resolution all day long), audio (again, better than CD quality all day long), smell, taste, and touch, not to mention internal senses (balance, pain, muscular feedback, etc). By the time a baby is 6 months old, she's seen and processed a lot of data. Don't know if it's more than Google's 18B pages, but it's a lot.

- we get correlated data. Google has to use a ton of pages for language because it only gets usage, not context. Much (most?) of the meaning in language comes from context, but using text you only get the context that's explicitly stated. Speech is so economical because humans get to factor in the speaker, the relationship with the speaker, body language, tone of voice, location, recent events, historical events, shared experiences, etc, etc, etc. Humans have a million ways to evaluate everything they read or hear, and without that, you need a ton of text to make sure you cover those situations.

- we have a mental model. Everything we do or learn adds to the model we have of the world, either by explicit facts (a can of Coke has 160 calories) or by relative frequencies (there are no purple cows but a lot of brown ones). My model of automobile engines is very crude and inaccurate while my model of programming is very good. Also, because I have (or can build) a model, I have a way to evaluate new data. Does this add anything to a part of my model (pg's essays did this for me)? Does it confirm a part of the model that wasn't sure (more experimental data)? Does it contradict a weakly held belief? Does it contradict a strongly held belief? Is it internally consistent? Is the source trustworthy?

This mental model might just be a bunch of statistically relevant correlations, but that sounds like neurons with positive or negative attractions of varying strength. Kind of like a brain. I believe Jeff Hawkins is on to something (see On Intelligence http://www.amazon.com/o/asin/0805078533/pchristensen-20), but there needs to be correlated data (like vision/hearing/touch are correlated) and the ability to evaluate data sources.

I agree that if humans can do it, machines can do it, but I think you're vastly underestimating the amount and quality of data humans get.


Yes I think you do have a point, but I don't think it's about things like visual resolution and the amount of data it generates. It may be about the much greater variety of data we see and about our ability to experiment and interact with the world around us in order to test our beliefs.

So maybe you could say it's about the quality of information not just the amount of data of one particular kind.

In any event, this is a debate that is only at the very beginning. I don't claim to have come to a conclusion. I just think those brute force statistical techniques are not the end of the road but rather a practical workaround for the brittleness and the complexity of traditional rule based systems.


Don't want to be pedantic here, but your info on our visual bandwidth is a bit out of date. We actually only process about 10M/sec of visual data. Your brain does a very good job of fooling your conscious self, but what you are perceiving as HD-quality resolution is actually only gathered in the narrow cone of your current focal point. The rest of what you "see" is of much lower bandwidth and mostly a mental trick. We also don't store very much of this sensory data for later processing.


Yeah, I knew all that but my comment was already pretty long. Still, 10M/sec * every waking hour of life is still a lot of data.
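
Back of the envelope (reading that 10M/sec as roughly 10 MB/s of visual data and assuming ~16 waking hours a day, both of which are guesses):

    # Rough estimate only; the rate and the waking hours are assumptions.
    per_day = 10e6 * 3600 * 16       # ~5.8e11 bytes, roughly 576 GB/day
    print per_day * 180 / 1e12       # ~104 TB by six months old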


Ask an evolutionary biologist.

Millions of years' worth of data has been reified into hard-coded algorithms by a process, evolution, that is perfectly happy working with the most horrendous spaghetti code in existence.

Humans use this, but computers ought not to.


I can't find the one I read, but this paper looks similar:

http://www2.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2...

In restricted environments, humans don't outperform the machines.


But data creates algorithms. For some sets of problems, using machine learning / AI works well. But it's important to understand what limitations your data creates, in the same way that you need to understand what bugs exist in your code.


Sounds like a paradox in Lisp since code is data.

Does this mean all Lisp code is good? :)


I believe we need a "converse of Lisp": in Lisp, code is data; what we need is the notion that data replaces (most) code. That leads to the question of what data really is, and I believe Codd supplies the best answer. One of the truly original ideas in Computer Science that post-dates Lisp (and is not anticipated by Lisp) is Codd's relational model of data, which is not to be confused with the relational databases used for storage.

Note that Codd's model is not Turing-complete, while all but the most trivial definitions of code lead to Turing-complete systems, hence the parenthetical "most" in my "data replaces (most) code". "Data is easy, code is hard" could be another way to state it.
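
To give a flavor of what I mean (a made-up toy example, not taken from any real system): business rules can live in a relation-like table, with one small generic lookup interpreting them, instead of in a chain of if/else.

    # Shipping rules as data: (region, minimum order, shipping cost),
    # ordered from most to least specific. Nothing here is Turing-complete.
    shipping_rules = [
        ('US', 100, 0),
        ('US',   0, 10),
        ('EU',  50, 5),
        ('EU',   0, 15),
    ]

    def shipping_cost(region, order_total):
        # One tiny, generic interpreter; adding a rule means adding a row.
        for r, minimum, cost in shipping_rules:
            if r == region and order_total >= minimum:
                return cost
        raise ValueError('no rule matches')

    print shipping_cost('EU', 60)   # 5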

We have experimented with such ideas, and we can report that they do significantly improve clarity and therefore productivity.

As an aside on a pg essay: I believe clarity is not the same as succinctness, and, as a corollary, succinctness does not imply productivity except in the somewhat trivial sense of ease of typing.


But when writing code for yourself, succinctness is power, because abstraction is power: more stuff in a smaller space.

But when writing for others, abstractions kill clarity.

It's true in language too. You use fewer words explaining something to yourself than to others.

For example, I can say to myself, "Our elections are no different from high school elections (decided on popularity, not issues)."

It would be seen as heretical to most, and wrong to others. But I'd be sure that I'm right.

They can't see that behind that thought is long reflection on high school popularity, evolutionary psychology, and more things than I can recall.

Incidentally, Paul makes this same point in "It's the Charisma, Stupid." While reading it I thought, "He's just saying that things don't change (after high school). What an elegant theory; Occam's razor at its best."


It almost sounds like you're describing Subtext: http://subtextual.org/


Subtext looks wonderful. Thanks for the link!


If data is code, how does it get interpreted and executed?


Well, what that really means is that Lisp can handle its own code as smoothly as it can data, since the two take the same form.

The reverse can also be true, though; data "is" code, in that if you design a data structure sufficiently well, for example, the necessary code to work with it should be self-evident. (Interestingly, "data is code" is probably more characteristic of Forth and (in a very different way) the relational paradigm.)



