Hacker News

How good is the type inference? I did some research on PHP back in the day, and reckoned that a huge number of types (80%-90%) could be inferred - but I didn't run it on anything like Facebook's codebase.

Also, more info on the type inference please :) Context-sensitive? Do you model reference counts in your analysis? If so, do you get optimizations out of that?

Did the article really say a compile takes 15 minutes? That's insane - how is it so fast? Does it limit the optimizations? You can't be doing very deep analysis - or do you cache it somehow? What about FDO?

I've argued for years that the PHP interpreter's implementation leaks into the language semantics, but you seem to have cast-iron proof with the refcounts. Is it just destructors, or do you see errors due to copy-on-write semantics (like the weird array semantics when a reference sits inside a copied-on-write array)?

(Sorry for all the questions, it's just very interesting)




I haven't done any work on the type inference stuff, so my understanding is second-hand. My peers who actually know what they're talking about are working on a more complete write-up, so stay tuned.

1. (Type inference.) The compiler's type inference is basically a first-order symbolic execution without inlining; we don't iterate to a fixed point or anything, and any union types at all cause us to give up and call it a variant. Many other PHP surprises mess up type inference; for example, by-reference parameters aren't marked at the call site, so if anywhere in the codebase there exists a unary function that takes its parameter by reference, $f($a) will spoil our inference of $a: $f might mutate it.
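The by-reference hazard above fits in a few lines (the function names here are made up for illustration):

```php
<?php
// A by-ref parameter is declared only at the definition site; the call
// $f($a) below looks identical to a by-value call, so a compiler that
// sees only the call site cannot assume $a survives the call unmutated.
function bump(&$x) { $x += 1; }

$a = 1;
$f = 'bump';      // variable function: the callee is unknown until runtime
$f($a);           // mutates $a with no '&' anywhere at the call site
echo $a, "\n";    // 2
```

Note that the variable-function call still honors the callee's reference declaration, which is exactly why a whole-program analysis has to assume the worst for every $f($a) it can't resolve.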

Still, simple things work pretty well. One major area we do well on is inferring the return types of function calls. It's hard to say how much we're missing out by not doing more sophisticated type inference; while we can count the sites at compile time, this doesn't tell us which sites are dynamically important. Our anecdotal experience has been that type inference is one of the more powerful optimizations we can perform.

We don't model refcounts, and I'm certain there are opportunities there.

2. (Compile times.) 15 minutes sounds about right to me, and chuckr would certainly know better. Compiler speed itself is something we pay attention to, since it inhibits our ability to turn pushes around. The actual global AST analysis is multithreaded, so we keep the whole build machine busy. So far the trade-offs between optimizations and compile time have not been too dire; e.g., there haven't been optimizations that would make the binary twice as fast if only we could spend twenty times as long compiling.

3. (PHP implementation/semantics). Who've you been "arguing" against? :). It's a manifest fact that accidents of PHP's implementation have been sunk into the language. Destructors are one of the ways to witness refcounts, but there are others. var_dump() of arrays exposes the refcount of internal data; things whose refcount is >1 are displayed with a '&' preceding them, e.g.:

  <?php
  $a[0] = 1;
  var_dump($a);
  $data =& $a[0];
  var_dump($a);
  unset($data);
  var_dump($a);

produces

  array(1) { [0]=> int(1) }
  array(1) { [0]=> &int(1) }
  array(1) { [0]=> int(1) }

Less trivially, it is hard to support PHP's copy semantics for arrays efficiently without reference counting; other systems have managed it, but refcounting makes it much easier.
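For context, here are the copy semantics in question, and the lazy copy engines use to make them affordable (a standalone illustration, not HipHop code):

```php
<?php
// Arrays are assigned by value: $b conceptually gets its own copy.
// Engines implement this lazily -- the storage is shared until the
// first write, at which point the array is actually duplicated
// (copy-on-write), so $a is never affected by writes to $b.
$a = [1, 2, 3];
$b = $a;            // no copy yet; $a and $b share storage
$b[0] = 99;         // first write to $b: a real copy happens now
echo $a[0], " ", $b[0], "\n"; // 1 99
```

Refcounting makes the "do I need to copy before writing?" check a simple refcount comparison, which is the ease-of-implementation point being made above.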


1) So what about the (I would think fairly common) case of loops where the variable is first initialized inside the loop? On one path it's null; on the other it's a typed value. Can you optimize that?

2) Cool! phc went the route of really advanced, context-sensitive analysis, and the result was massive memory use and huge compile times for even simple programs (if using optimization).

3) Have you read IBM's POPL 2010 paper on it? Interesting stuff, though maybe not actionable. They argue (there's that word again!) that PHP's copy-on-write is flawed for arrays containing references - the reference can lose its reference-ness when the array gets altered and copied. They have an implementation which fixes this for a relatively small slowdown. Interesting paper as I recall.
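The underlying quirk is easy to observe (my own illustration, not an example from the paper): when an array is "copied", plain elements get value semantics, but elements that are references keep aliasing the original, so the copy is not actually independent.

```php
<?php
// Plain vs. referenced elements behave differently across an array copy.
$x = 1;
$a = [0, &$x];    // $a[1] is a reference to $x
$b = $a;          // a "copy" of $a
$b[0] = 99;       // plain element: the write affects only the copy
$x = 2;           // referenced element: visible through BOTH arrays
echo $a[0], " ", $b[0], " ", $a[1], " ", $b[1], "\n"; // 0 99 2 2
```

Whether a given element aliases or copies thus depends on its internal reference flag, which is the kind of implementation detail leaking into semantics that this thread is about.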

FWIW, I think it's allowable to change the output of var_dump. Optimizing reference counts kinda requires it, and I know HipHop isn't in the business of strict conformance to PHP semantics.


1) Great question! Yes, it is super-common. Most of our type inferences on local variables are actually of the form "Null-or-X". So we can emit code that skeptically checks for Null, but then goes ahead and assumes X otherwise.
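A sketch of the pattern (the function here is made up): a local assigned only inside a loop is Null on the paths that skip the assignment and a concrete type on the paths that take it, so the compiled code can pay for one Null check and then run a type-specialized fast path.

```php
<?php
// $found is only assigned inside the loop, so at the return it is
// "Null-or-int": Null if no positive element was seen, int otherwise.
function firstPositive(array $xs) {
    foreach ($xs as $x) {
        if ($x > 0) {
            $found = $x;
            break;
        }
    }
    return isset($found) ? $found : -1;
}

echo firstPositive([-2, 0, 7]), "\n"; // 7
echo firstPositive([-1]), "\n";       // -1
```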

2) I'm not claiming we have all the answers here; but decent compile times have been a strong evolutionary pressure all along.

3) Yeah, we basically implement the "slightly relaxed" semantics described in that paper, and are addicted to the performance wins from doing so. WRT loosening semantics, though, in general you'd be surprised how close we come to quirk-for-quirk compatibility. The problem is that real PHP programs, including ours, end up depending, often unintentionally, on those quirks. I.e., nobody meant to rely on the order function params are evaluated in, but in practice we rely on it because param 0's evaluation has some side effect that param 1's evaluation needs in order to run correctly...
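A hypothetical sketch of that accidental dependence (all functions invented here): the first argument's side effect initializes state that evaluating the second argument reads, so the call only works because arguments happen to be evaluated left to right.

```php
<?php
$config = null;

function loadConfig(&$config) {   // side effect: initializes $config
    $config = ['mode' => 'fast'];
    return true;
}
function readMode($config) {      // only correct if loadConfig ran first
    return $config['mode'];
}
function start($ok, $mode) {
    return $ok ? $mode : 'none';
}

// Relies on param 0 being evaluated before param 1.
$result = start(loadConfig($config), readMode($config));
echo $result, "\n"; // fast
```

Nobody writes this on purpose, but once it exists anywhere in a large codebase, a compiler that reorders argument evaluation silently breaks it.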



