I see your point, but I would argue the foundations of a language should be simple enough to test in a fashion like this. Perhaps this particular test isn't a good one because the numbers are skewed by optimizations that don't reflect the language's abilities. A full application will involve at least as many design decisions, probably more, that aren't necessarily a reflection of the language either.