Hacker News

How good is the type inference? I did some research on PHP back in the day, and reckoned that a huge number of types (80%-90%) could be inferred - but I didn't run it on anything like Facebook's codebase.

Also, more info on the type inference please :) Context-sensitive? Do you model reference counts in your analysis? If so, do you get optimizations out of that?

Did the article really say a compile takes 15 minutes? That's insane - how is it so fast? Does it limit the optimizations? You can't be doing very deep analysis - or do you cache it somehow? What about FDO?

I've argued for years that the PHP interpreter's implementation leaks into the language semantics, but you seem to have cast-iron proof with the refcounts. Is it just destructors, or do you see errors due to copy-on-write semantics (like the weird array semantics when a reference sits inside a copied-on-write array)?

(Sorry for all the questions, it's just very interesting)




I haven't done any work on the type inference stuff, so my understanding is second-hand. My peers who actually know what they're talking about are working on a more complete write-up, so stay tuned.

1. (Type inference.) The compiler's type inference is basically a first-order symbolic execution without inlining; we don't iterate to a fixed point or anything, and any union types at all cause us to give up and call it a variant. Many other PHP surprises mess up type inference; for example, by-reference parameters aren't marked at the call site, so if anywhere in the codebase there exists a unary function that takes its parameter by reference, $f($a) will spoil our inference of $a: $f might mutate it.
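The by-reference hazard above fits in a few lines (the function names here are made up for illustration):

```php
<?php
// A by-ref parameter is declared only at the definition site; the call
// $f($a) below looks identical to a by-value call, so a compiler that
// sees only the call site cannot assume $a survives the call unmutated.
function bump(&$x) { $x += 1; }

$a = 1;
$f = 'bump';      // variable function: the callee is unknown until runtime
$f($a);           // mutates $a with no '&' anywhere at the call site
echo $a, "\n";    // 2
```

Note that the variable-function call still honors the callee's reference declaration, which is exactly why a whole-program analysis has to assume the worst for every $f($a) it can't resolve.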

Still, simple things work pretty well. One major area we do well on is inferring the return types of function calls. It's hard to say how much we're missing out by not doing more sophisticated type inference; while we can count the sites at compile time, this doesn't tell us which sites are dynamically important. Our anecdotal experience has been that type inference is one of the more powerful optimizations we can perform.

We don't model refcounts, and I'm certain there are opportunities there.

2. (Compile times.) 15 minutes sounds about right to me, and chuckr would certainly know better. Compiler speed itself is something we pay attention to, since it inhibits our ability to turn pushes around. The actual global AST analysis is multithreaded, so we keep the whole build machine busy. So far the trade-offs between optimizations and compile time have not been too dire; e.g., there haven't been optimizations that would make the binary twice as fast if only we could spend twenty times as long compiling.

3. (PHP implementation/semantics). Who've you been "arguing" against? :). It's a manifest fact that accidents of PHP's implementation have been sunk into the language. Destructors are one of the ways to witness refcounts, but there are others. var_dump() of arrays exposes the refcount of internal data; things whose refcount is >1 are displayed with a '&' preceding them, e.g.:

  <?php
  $a[0] = 1;
  var_dump($a);
  $data =& $a[0];
  var_dump($a);
  unset($data);
  var_dump($a);

produces

  array(1) { [0]=> int(1) }
  array(1) { [0]=> &int(1) }
  array(1) { [0]=> int(1) }

Less trivially, it is hard to support PHP's copy semantics for arrays efficiently without reference counting; other systems have managed it, but refcounting makes it much easier.
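For context, here are the copy semantics in question, and the lazy copy engines use to make them affordable (a standalone illustration, not HipHop code):

```php
<?php
// Arrays are assigned by value: $b conceptually gets its own copy.
// Engines implement this lazily -- the storage is shared until the
// first write, at which point the array is actually duplicated
// (copy-on-write), so $a is never affected by writes to $b.
$a = [1, 2, 3];
$b = $a;            // no copy yet; $a and $b share storage
$b[0] = 99;         // first write to $b: a real copy happens now
echo $a[0], " ", $b[0], "\n"; // 1 99
```

Refcounting makes the "do I need to copy before writing?" check a simple refcount comparison, which is the ease-of-implementation point being made above.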


1) So what about the (I would think fairly common) case of loops where the variable is first initialized inside the loop? On one path it's null; on the other it's a typed value. Can you optimize that?

2) Cool! phc went the route of really advanced, context-sensitive analysis, and the result was massive memory use and huge compile times for even simple programs (if using optimization).

3) Have you read IBM's POPL 2010 paper on it? Interesting stuff, though maybe not actionable. They argue (there's that word again!) that PHP's copy-on-write is flawed for arrays containing references - the reference can lose its reference-ness when the array gets altered and copied. They have an implementation which fixes this for a relatively small slowdown. Interesting paper as I recall.
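The underlying quirk is easy to observe (my own illustration, not an example from the paper): when an array is "copied", plain elements get value semantics, but elements that are references keep aliasing the original, so the copy is not actually independent.

```php
<?php
// Plain vs. referenced elements behave differently across an array copy.
$x = 1;
$a = [0, &$x];    // $a[1] is a reference to $x
$b = $a;          // a "copy" of $a
$b[0] = 99;       // plain element: the write affects only the copy
$x = 2;           // referenced element: visible through BOTH arrays
echo $a[0], " ", $b[0], " ", $a[1], " ", $b[1], "\n"; // 0 99 2 2
```

Whether a given element aliases or copies thus depends on its internal reference flag, which is the kind of implementation detail leaking into semantics that this thread is about.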

FWIW, I think it's allowable to change the output of var_dump. Optimizing reference counts kinda requires it, and I know HipHop isn't in the business of strict conformance to PHP semantics.


1) Great question! Yes, it is super-common. Most of our type inferences on local variables are actually of the form "Null-or-X". So we can emit code that skeptically checks for Null, but then goes ahead and assumes X otherwise.
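A sketch of the pattern (the function here is made up): a local assigned only inside a loop is Null on the paths that skip the assignment and a concrete type on the paths that take it, so the compiled code can pay for one Null check and then run a type-specialized fast path.

```php
<?php
// $found is only assigned inside the loop, so at the return it is
// "Null-or-int": Null if no positive element was seen, int otherwise.
function firstPositive(array $xs) {
    foreach ($xs as $x) {
        if ($x > 0) {
            $found = $x;
            break;
        }
    }
    return isset($found) ? $found : -1;
}

echo firstPositive([-2, 0, 7]), "\n"; // 7
echo firstPositive([-1]), "\n";       // -1
```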

2) I'm not claiming we have all the answers here; but decent compile times have been a strong evolutionary pressure all along.

3) Yeah, we basically implement the "slightly relaxed" semantics described in that paper, and are addicted to the performance wins from doing so. WRT loosening semantics, though, in general you'd be surprised how close we come to quirk-for-quirk compatibility. The problem is that real PHP programs, including ours, end up depending, often unintentionally, on those quirks. I.e., nobody meant to rely on the order function params are evaluated in, but in practice we rely on it because param 0's evaluation has some side effect that param 1's evaluation needs in order to run correctly...
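A hypothetical sketch of that accidental dependence (all functions invented here): the first argument's side effect initializes state that evaluating the second argument reads, so the call only works because arguments happen to be evaluated left to right.

```php
<?php
$config = null;

function loadConfig(&$config) {   // side effect: initializes $config
    $config = ['mode' => 'fast'];
    return true;
}
function readMode($config) {      // only correct if loadConfig ran first
    return $config['mode'];
}
function start($ok, $mode) {
    return $ok ? $mode : 'none';
}

// Relies on param 0 being evaluated before param 1.
$result = start(loadConfig($config), readMode($config));
echo $result, "\n"; // fast
```

Nobody writes this on purpose, but once it exists anywhere in a large codebase, a compiler that reorders argument evaluation silently breaks it.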



