Scaling PHP Book: I will teach you to scale PHP to millions of users (scalingphpbook.com)
143 points by stevencorona on June 5, 2012 | 104 comments



Here's a short list:

1. Cache the output at the edges: Use Varnish or other reverse proxy cache.

2. Cache byte code: Use APC or XCache PHP opcode cache.

3. Cache and minimize database I/O: reduce database touches using memcached, redis, file caches, and application-level caches (e.g. global vars)

4. Do event logging in local files, not to the database: Make all write operations as simple and fast as possible, any data that is not needed in realtime can be written to a plain old file and processed later.

5. Use a CDN, especially for delivering static assets

6. Server tuning: Apache, MySQL, and Linux have lots of settings that affect performance, especially the timeout settings ought to be turned down.

7. Identify bottlenecks: At the system level use tools such as strace, top, iostat, vmstat, and query logging to see which layer is using the most time and resources. Also there's an excellent PHP code profiling service called New Relic that drops you right into the function and db query that's eating up most of the time in slow requests.

8. Load testing: DoS yourself. Stress test your stack to find bottlenecks and tune them out

9. Remove unused modules: For each component in the stack unload any default modules that are not needed to deliver your service.

10. Don't use ORMs and other dummy abstractions: Take off the training wheels and write your own queries.

11. Make the entry pages fast, simple, and cacheable. Nobody is reading that silly news feed in the bottom corner of your front page and it's killing your database, so take it out.

Most of the time PHP slows down because each PHP process is blocked waiting for I/O from some other layer, either a slow disk, or overloaded database, or hung memcached process, or slow REST API call to a 3rd party service ... often just strace'ing a live PHP process will show you what it's waiting for ... in short, blocking I/O slows down everything. The key to going faster is:

* keep it simple

* cache as much as possible in local memory

* do as few blocking I/O operations as possible per request
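As a rough illustration of "cache as much as possible in local memory", here is a minimal sketch in Python (the thread has no code of its own, so the language and every name here are assumptions made for illustration). `LocalTTLCache` and `slow_lookup` are hypothetical; `slow_lookup` stands in for whatever blocking call (database query, REST API) the request would otherwise make every time.

```python
import time


class LocalTTLCache:
    """Tiny in-process cache: each entry expires after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        entry = self.store.get(key)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]      # cache hit: no blocking I/O at all
        value = loader(key)      # cache miss: exactly one blocking call
        self.store[key] = (now + self.ttl, value)
        return value


calls = []


def slow_lookup(key):
    # Hypothetical stand-in for a database query or remote API call.
    calls.append(key)
    return f"value-for-{key}"


cache = LocalTTLCache(ttl_seconds=60)
first = cache.get_or_load("user:5", slow_lookup)
second = cache.get_or_load("user:5", slow_lookup)  # served from memory
```

The second call never reaches `slow_lookup`, which is the whole point: fewer blocking operations per request. A real deployment would more likely use APC user cache, memcached, or redis as mentioned in the list, and would need a plan for invalidation.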


Could you post this on StackOverflow or would you allow me to? It's very useful, way more people could benefit from this!


Copy away. I don't care. Information wants to be beer.



"Identify bottlenecks" - I hope this is the first topic covered. There's no point in fiddling about with PHP if the actual bottleneck is the fact that you're sharing a slow connection with dozens of other companies or are hosted on the opposite side of the world to most of your customers.

Having said that, I'm looking forward to the book, as there doesn't seem to have been much on this topic since the O'Reilly books on scaling web applications (2006 I think).


What's the best way to identify bottlenecks?


Not sure about 'best', but the way I go about it is to measure the total time required for an application to complete a task, then attempt to measure the individual components (time spent in PHP, data transfer, rendering on client side etc.) and keep narrowing it down until you find the exact cause, such as a database query which requires optimisation.
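The "measure the total, then the parts" approach can be sketched like this. It is a minimal Python illustration, not anyone's actual tooling: the `timed` helper, the stage names, and the `time.sleep` calls standing in for real work are all assumptions.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)  # stage name -> accumulated seconds


@contextmanager
def timed(stage):
    """Add the wall-clock time spent inside the block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start


def handle_request():
    with timed("total"):
        with timed("db"):
            time.sleep(0.05)    # stand-in for a slow query
        with timed("render"):
            time.sleep(0.005)   # stand-in for templating


handle_request()

# The stage with the largest share of the total is where to dig next.
slowest = max((s for s in timings if s != "total"), key=timings.get)
```

Once one stage dominates, the same wrapper can be moved inside it (per query, per remote call) to keep narrowing down.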


I'm expecting the book to go much further than this. Your list is a fine starting point, but it's fairly basic and lacks specifics.

I want to know about the weird things that happen when you push your PHP stack to really ridiculous limits. I think that's what this book will offer. If it ends up being stuff like "use APC! minimize trips to the DB!" I would be very surprised and disappointed.


I agree- good starting point, but the book will be much more in depth. It's going to include a fair amount of high-level design, identifying how and where to scale, weird issues you run into (fun fact: we don't use APC @ Twitpic), solutions to PHP anti-scaling quirks (you can't set sub-1s MySQL timeouts, for instance, although I think this is due to libmysql, not PHP), lots of case studies, scaling design patterns, and way more cool stuff.


What was your experience with APC? You still use some opcode cache, right?


If you use a persistent PHP stack like mongrel2 + photon an opcode cache is just a waste of a php module.

http://www.photon-project.com/


Wouldn't you lose one of the advantages of PHP which is that every request is stateless and that when it fails it doesn't fail hard killing the application?


Not any more nowadays. I am the original author of Photon and today PHP is extremely robust and you can catch all the exceptions/errors (outside the ones which are making your code invalid, but hey, do your homework). The current PHP 5.3 is also wonderful at not leaking memory. So you get the speed of bare PHP with a framework (because nothing is reloaded, the framework can really load once, reuse many times).

On a small system, a standard HTTP request through Mongrel2, to the framework and back is about 2ms (including 1ms of network latency between the hosts, data from my production monitoring). This is the "hello world" latency. It makes using PHP very fun and very fast. Also, a single PHP process running in the background uses about 12MB, which means that with 12MBx3 + 10MB Mongrel2, you can serve 1000's of users with nearly no load on your system (if you push the load only when needed at the DB level). Tracing a PHP process, you can run it without system calls at all (only gettimeofday for logging).

But, this is cutting edge, so, you need to be open minded and ready to dig in the code (less than 25k SLOC anyway I think).


Current PHP is 5.4. Did you perform any test or benchmark with the new version? Do you gain anything from it? By the way, your project looks very interesting!


Very limited tests, one PECL extension for my projects was updated to support 5.4 only recently. The biggest win will most likely be for the POST payload (multipart) parsing because 5.4 is faster on array/strings. The good point is that with PHP farm, you can run both 5.3 and 5.4 in parallel to compare :)


If it's pushing PHP to ridiculous limits you seek then you want Facebook's HipHop compiler. But web stack design is really an art form, hard to summarize in an HN comment.


I'm curious why you suggest avoiding ORMs. I've been debating this for a while and while I like the accessibility and abstraction that ORMs provide in Django and Rails I'm not familiar enough with Doctrine, DB_DataObject or the other PHP options to really have formed my own opinions yet.

Is this a language specific suggestion or something broader? I would love to hear a deeper debate on the subject.


I think the advice to not use an ORM* is very misguided. Use the ORM for the 80-90% of queries that are "SELECT * FROM table WHERE id=5" and "UPDATE table SET name='jim' WHERE id=5". For the rest of the queries, use your ORM's ability to run custom queries. People who have problems with ORMs are probably not using them correctly.

There was a debate on HN about ORMs a while ago, and one of the most significant comments I read was something along the lines of "Every project I've ever worked on that 'didn't use an ORM' implemented the same functionality of one in a less-maintainable way."

* Or other object-oriented abstraction layer for database access, ORMs aren't the only option.


I totally agree. ORMs are not a bad thing and save you from writing the same code over and over again. The key is to understand what it's doing behind the scenes and not be afraid to go in and change it if it's generating stupid queries.

Is there some extra overhead? Of course. Probably a lot less than cleaning up the damage from some cowboy coding.


> * Or other object-oriented abstraction layer for database access, ORMs aren't the only option.

What do you mean by this? That you should consider a non-relational database? Clearly any mapping from an OOP language to a relational database is an object-relational mapping.


What I meant to say is that ActiveRecord isn't the only pattern.


Personally, one of my favourite discussions is Laurie Voss' "ORM is an anti-pattern": http://seldo.com/weblog/2011/08/11/orm_is_an_antipattern


The key criticism in that article seems to be "ORMs are imperfect abstractions for SQL." Who the hell cares? This is the real world. We don't need perfect abstractions.


Which, really, should be "ActiveRecord is an anti-pattern"


It could be as simple as the performance loss incurred by an ORM is greater than the benefit of using it. You have PDO with PHP, which can be used quite adequately as an ORM itself (you can fetch a tailored SQL query into a class as opposed to working with basic arrays).

I'd argue that, in some cases, an ORM is only used as a reaction to the separation of concerns. Writing SQL mixes up your languages, so why not try to build that SQL in the language you're working with? Or maybe SQL is considered too difficult.

I quite like having full control over a 'raw' query without requiring an ORM's opinion on how it should be built.
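For readers who have not seen the "fetch a tailored query into a class" style, here is a minimal sketch of the idea. It uses Python's `sqlite3` rather than PDO, and the `authors` table, the `Author` class, and `fetch_authors` are made up for illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Author:
    id: int
    name: str


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO authors (id, name) VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob")],
)


def fetch_authors(conn, min_id):
    # Hand-written SQL; the selected columns line up with Author's fields.
    rows = conn.execute(
        "SELECT id, name FROM authors WHERE id >= ? ORDER BY id",
        (min_id,),
    )
    return [Author(*row) for row in rows]


authors = fetch_authors(conn, 2)
```

The query stays fully under your control and the caller still gets objects instead of bare arrays, which is roughly the trade-off described above.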


PDO doesn't handle mapping of nested relationships, though, which is where it falls down when compared to ActiveRecord implementations. For example I want to fetch an author and all of the author's books in a single query then loop over and print out author name, and each book. A standard ActiveRecord ORM will give you an array of authors where each author contains an array of books, whereas with PDO you're still left to manually detect a change in author by a change in the primary key, which is more complex than just writing a nested for loop.
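To make the author/books example concrete, here is a minimal sketch of doing the nesting by hand from a single JOIN. It is written in Python with `sqlite3` instead of PHP and PDO, and the schema and sample rows are invented for illustration.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO books VALUES (1, 1, 'First'), (2, 1, 'Second'), (3, 2, 'Third');
""")

# One query: every row repeats the author columns next to one book.
rows = conn.execute("""
    SELECT a.id, a.name, b.title
    FROM authors a
    JOIN books b ON b.author_id = a.id
    ORDER BY a.id, b.id
""").fetchall()

# groupby starts a new group whenever the author key changes, which is
# the "detect a change in the primary key" step, done explicitly.
nested = [
    {"name": name, "books": [title for _, _, title in group]}
    for (author_id, name), group in groupby(rows, key=lambda r: (r[0], r[1]))
]
```

The `ORDER BY` matters because `groupby` only merges adjacent rows. With an inner `JOIN`, authors without books are dropped; a `LEFT JOIN` would keep them but needs extra handling for the resulting `NULL` title.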


Forgive me if I misunderstand, but if you can't pull this off in a single SQL query, then ActiveRecord can't either. It just hides the fact it's performing several queries to get what you want.


What is your opinion about the use of PHP frameworks like CakePHP, Yii, Symfony, etc?


Using PHP for a 30+M registered users website, I'm using a modified CodeIgniter for the base framework, and specific Zend Framework modules for some functionality where they provide good libraries.

As the parent said, ultimately CPU cycles on your web servers don't matter much, you will be limited by IO everywhere. Knowing how to properly use redis' data types and handle multi-level cache invalidation will be a much more valuable asset than using a framework you might not enjoy the most / be the most fluent with just because it's a wee bit faster.

Even in terms of actual PHP calculation, your main strong point is figuring out what you can remove from the page processing and put in a daemon instead. "If the user wants to do action A, what is the minimal amount of things I can do to make him believe that it has been done", do that in your controllers, and put all the actual hard work in an event processing queue that doesn't have a user waiting for it to answer back.
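The "do the minimum in the controller, queue the rest" pattern can be sketched as follows. This is a minimal Python illustration using an in-process queue and a thread as stand-ins for a real queue and a separate daemon; `handle_action`, `worker`, and the event shape are assumptions, not the commenter's setup.

```python
import queue
import threading

events = queue.Queue()
processed = []


def worker():
    """Background consumer: does the slow work nobody is waiting on."""
    while True:
        event = events.get()
        if event is None:            # sentinel: shut the worker down
            events.task_done()
            break
        # Stand-in for the expensive part (emails, fan-out, stats).
        processed.append(("done", event))
        events.task_done()


def handle_action(user_id, action):
    # Controller: record the intent and answer right away.
    events.put({"user": user_id, "action": action})
    return "accepted"


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_action(42, "follow")
events.put(None)
events.join()  # only so the demo finishes cleanly; a request would not wait
```

In a PHP stack the queue would live outside the web process (redis lists, a message broker, or similar) so the daemon survives independently of any request.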


This fetishism with scaling... to me it's just procrastination. It feels like work because you're doing something technical but, in the end, you're adding very little to your product.

Yesterday I had a meeting with a potential customer and I hated it. I hate to try to explain my SaaS software to non-technical people who treat me like some 17-year-old webmaster. I'd much rather be refactoring Clojure code. But I got out of my comfort zone and this client will probably add hundreds of thousands to my bottom line. And I'm glad I was at that meeting while my competitors were fetishizing about non-existent scaling issues.

It's 2012 for god's sake, you can rent a 32GB server for less than $100.


Performance isn't a problem until it is, and then when it is a problem, it's bigger than any other problem in the world because you have nothing to sell if it does not operate.

I always thought I was being wise by not doing any premature optimization, but after a few lessons learned the hard way I certainly factor in performance to the design of software before I build now.

Scalability is not a "feature" tacked on at the end of development.


Scaling will only become a problem when/if you achieve product/market fit.

What percentage of startups on HN have achieved product/market fit, are past a 128GB commodity box AND have no dedicated engineering team for scaling issues?


As an engineer, not knowing how to scale when working on something that may need to scale could spell the downfall of whatever product you're working on. Yes, all is well and fine until you hit a natural growth cycle and can't commission new boxes fast enough because each request is taking 300ms. Your database isn't accepting enough requests (plus, some of your bad queries that aren't indexing are blocking too long). Then your web servers run out of memory because you are using an ORM for large lists of items that you're just returning as an array...

Once you get to the point of no return you have to know what to do or you'll suffer. Learning that when fire is falling from the skies is the worst way in retrospect.


Part of being a good developer is knowing where the scalability bottlenecks are likely to be as you are developing software, and making intelligent decisions about the algorithms, data structures, and architectures you use. I would never advise a developer to not "factor in" performance or future scalability needs at all. The problem is when you spend significant extra effort building highly scalable architectures that you don't need now and may (likely) never need.


To a point. However, it's also rather painful to add after the fact in many architectures, and if you've ever been up against that wall once, you'll never want to be in that situation again.

Plus, almost all of these techniques will also speed up page generation on a lightly-loaded server, so that's a win in any circumstance.


scale is one of those problems that sneaks up on you (well unless you're an instant hit site I suppose)

In all of my cases dealing with scale issues it's been due to the DB growing very large in size to the point where you can't use the same techniques that you're used to using. That usually happens in combination with more concurrent users. Depending on how large and how organized your code is, things can get very ugly when the site starts throwing errors, being unresponsive, etc.

I usually know where I want to go as a next step with scaling and monitor our resource usage until it gets pretty high. Then we take the next step. But we don't spend a huge amount of time scaling until we know we need it.


One of the issues I've run into in the past was a MySQL problem where their system was creating tons of temporary tables, and because of the massive amount of traffic they were getting, it was killing their database server because it was choking out the disk's I/O, and the server was at 80% iowait most of the time because the disk cache was paging things in and out like crazy.

I went back and forth with them about optimizing the app, but it was apparently some huge labyrinthine monstrosity. They insisted that they didn't have the resources to do any of the significant rewrites that it would require to fix the app to do proper queries (or at least, not enough to be worthwhile).

Eventually I gave up. /tmp was mounted onto a separate partition, so I disabled ext3 journalling and set commit=30 so that it only sync'ed to the disk every 30 seconds. Since no temporary tables lasted that long, the VFS layer never wrote to the disk if it didn't have to. /tmp became an in-memory cache, and CPU use dropped to 5%.

Optimizing isn't about a checklist, it's about looking at the system that you have, understanding what it's doing and why, and understanding how the other systems around it behave so that you can resolve the issue. Moving onto another database server wouldn't have helped them. Moving onto a RAID would have reduced the impact, but their load didn't scale linearly so they'd hit their limit in a few months anyway.


tmpfs for mysqld's tmpdir is a similar option, perhaps even better if, as I'd hope, tmpfs's internals are simpler and faster because it has an easier job to do than ext3-sans-journal. Obviously needs the RAM though.


> It's 2012 for god's sake, you can rent a 32GB server for less than $100.

Where, in the US, can I rent a 32GB server for less than $100?


Hetzner[1]. I never said "in the US", but even in the US you can get a pretty good deal: 16GB for $79/month[2].

[1] http://www.hetzner.de/en/hosting/produkte_rootserver/ex4s

[2] http://www.honelive.com/xml/#new-york-city-dedicated-servers


No, you didn't, but we all knew you were talking about Hetzner. One company, selling consumer-grade desktop parts unfit for use in a production server, in one country, does not lend credibility to statements like you made. Neither does the NYC site with no info, no SLA, and again, desktop parts. It's something like saying salary isn't important to startups because labor costs in China are pennies per hour.

It's 2012 for god's sake, and 32GB RAM servers are still hundreds of dollars a month to rent from any first class data center.


This is a bit of a tangent, but do you have any sources for consumer-grade desktop parts being lower quality than enterprise-grade? I'd like to read an in-depth hardware analysis that describes circuit board design, capacitor sourcing, etc.

I'm really curious to see how different equivalently priced ASUS and supermicro boards differ.



You can split hairs all you like, Dan, but you're still missing the point: scalability is only an issue for a fraction of the entrepreneurs/engineers who worry about it.


Hetzner doesn't use ECC ram modules. Might not be important if you use it for a memcached server, or it might be a disaster in other cases.


I don't think he meant RAM. Otherwise, I want to know too!


you can rent a 32GB server for less than $100 - where?


What is a 32GB server?


He means a web server with 32 gigabytes of memory. Citation: those are the stats of the server he links in his comment at http://news.ycombinator.com/item?id=4071035.


You're worried about people caring about scaling? What about the rampant PHP addiction? Focus on the real problem!


Good to know other people want to share about successful scaling with PHP.

On the downside, I have a book cooking on the same exact subject, with release planned September (self-publish, about 80% done), but now not sure if it's worth continuing. [edit]Slight moment of "panic", as seeing someone else releasing a book on the same subject made me sad, but you're right, no reason not to continue.


Just because one person releases a book doesn't mean you don't have something valuable to share as well. I say go for it. There's bound to be a lot of stuff in your book this one misses, and probably vice versa. People who are facing large-scale situations generally want all the information they can get their hands on.


If you need to know PHP scaling, a) the amount of money that two books cost is still absolutely meaningless to you, b) you will, very probably, still have questions after thoroughly reading your first book and c) it is worth having a second book just to have a second shot at book cover art which will convince one of your engineers to actually read the damn thing instead of re-inventing it poorly as need arises.

Absolutely finish and release your book. Half-writing a book is an even worse decision than writing a book. (Tongue only somewhat in cheek there: if you have actionable information on how to scale PHP, a book is one of the worse ways to change people's lives with that information while simultaneously making money from it.)


Absolutely, positively, finish & publish your book. Feel free to shoot me an email, there is a contact link on my blog, http://stevecorona.com/


At my first company, I freaked when we got a major competitor. The CEO smiled and said it was the first sign of a good market. He was concerned about the lack of competition up to that point.


> Good to know other people want to share about successful scaling with PHP.

Absolutely. And it is an important enough topic to warrant the purchase of two books from my perspective.


Publish your book and let the market decide. Certainly don't let a pre-announcement of a book that's not fully written yet (let alone reviewed) discourage you!


Yes, please do not forgo the completion of your book. I would love to hear about it once available as well. I have never had the chance to work in an environment where scaling PHP was necessary, and I am trying to gain all of the knowledge I possibly can on the subject so that I can take on bigger projects.


Don't get discouraged by this. For example, Dropbox didn't get discouraged by the rumors of GDrive in 2007 and look where they are :) Of course it's a totally different ball game, just saying: try all you can before giving up, especially since you are almost there!

Good Luck!


What are you waiting for?

Go with http://leanpub.com - publish it and then complete it.


Finish it, and please post/announce it to HN when you can! I'm looking forward to reading both books.


Definitely publish. What if your book covered all the stuff he doesn't?


The majority of readers of these books will not have products that require scaling to millions and will probably purchase both books. Sounds like market validation to me.


Sounds interesting. Do you have any method of "keeping in touch", so I can check it out when it's ready?


Is there anywhere to register to receive more information on your book when you get nearer?


Just continue your work and let us know when it's ready!


You could collaborate?


My understanding is that no one finds it hard in a purely technical sense to scale the pure page-serving portion of your website, whether it's PHP, Rails, Lift, etc., because you can always throw up another caching layer or another box to serve pages. The hard part is scaling access to your underlying data, which heavily depends on your exact use case.


In what formats will this book be available? When I see a new self-published book, I assume it will be available digitally. Your site (which is really well-designed) mentions nothing of the format.

Either way, I would like to see images of what the product will look like. For someone like me who doesn't need this book but is still interested in its topic, images of a well-designed book might make the difference in whether or not I purchase it.


Thanks for the feedback- all really good points. As the month goes on, I'll have some pictures of the cover-art and chapter list available.

It will be self-published, DRM-free in PDF, mobi, epub. Looking into what it takes to publish on the Amazon Store/Kindle/iBooks, but hopefully that's something I can figure out after launching.


Thanks! This is perfectly timed. I'm looking forward to it. I signed up immediately.

However, it looks like your mailchimp account is set up to link to phpscalingbook.com (which is the wrong domain). I clicked on the "continue to website" link after confirming.


Thanks for letting me know, I fixed it!


Much appreciated. Good luck to you! Looking forward to the book.


I have a pretty comprehensive list of chapters, but I'm still adding content to the book, so if you guys have any ideas or suggestions for topics you'd like to see covered, feel free to email me or post them here. Thanks so much!


I'm very interested in what difference micro-optimizations (like using while(list() = each()) instead of foreach) make when scaling.

Also, I noticed that when I clicked the return to website link after clicking the subscribe button on your website, it went to google.com instead, and when I clicked the continue to our website button after clicking the email confirmation link, it went to http://www.phpscalingbook.com/ (404 error).

And if you need any proofreaders, I'd be more than happy to help!


On a typical LAMP web site the hottest bottleneck is usually PHP waiting for database I/O, so slight code optimizations rarely produce a dramatic improvement.


I'd be interested in your opinion about premature optimizations and how your experiences with bottlenecks matched up compared to your initial expectations. I often find myself overthinking performance and scalability issues that turn out to be not relevant in the end...


Are you looking for any proofreaders?


Sounds interesting, especially since your stack matches mine pretty well, so count me in. However, it should be pointed out there is nothing there at your site to see yet (apart from the discount and announcement thing).


"Don't use ORMs"? I'm sorry, that's just plain wrong, or you were exposed to the wrong ORMs. Doctrine2 has great scalability - it has all sorts of caching built into it - result caching, query caching, and so on and so forth. It's actually -way- more scalable, and easier to develop for, than writing raw SQL (which, by the way, is a portability nightmare). Also, if you want to use a decent MVC framework, not using an ORM would be quite dumb. And if you're not using a good, modern, scalable MVC framework in this day and age, well, I pray for your soul.

So, USE AN ORM!!!


Every layer of abstraction you add, not only adds a layer of complexity and extra points of failure, but also adds restrictions.

By definition, an abstraction is more restrictive ... otherwise, you're not really abstracting anything.

If you're making a simple web app, an ORM is just fine. If you're building enterprise-level software, it is a really really bad idea. Half your code will be using it, the other half will be forced to use half-assed SQL queries that try to fit into your ORM. Because, quite simply, you will need all that SQL has to offer to make things work right. You can't afford to abstract SQL away. Trust me, I tried. In the end, the best you can do is go with something like LINQ.

I built a LINQ like system for PHP a while before Microsoft did it for .NET =)

On a side note, I would also stay away from all the frameworks and build one yourself. If you're on a long-term project, it's worth it. You'll understand what is happening and what each call really costs you. You can also refactor an existing framework. Either way works.


I can't stand the 'portability nightmare' comment/sentiment from ORM (ab)users. How often do you swap out your DBMS from MySQL to Oracle to Postgres? Writing SQL should not be anathema.


I agree. We use our own lightweight ORM at Twitpic (similar to PHP ActiveRecord). It produces SQL for you unless you feed it your own optimized query, but "does the right thing" 99.9% of the time.

IMO, "ORM produces bad SQL" is a myth and more often a case of bad indexes and not bad SQL.


What is the expected price point of the book? Since you're in the middle of writing, do you have any opportunities for "beta" testers who can provide feedback in exchange for complimentary copies?


Probably $39.99, similar to bootstrappingdesign. Shoot me an email using the contact form on my website and we can coordinate some kind of beta testing once I push it to leanpub.


Will do - I see the subscribe link, but no contact info, just twitter. I've signed up for the list.


Nice to see someone who actually created a large volume site discuss their learnings vs. theoretical works discussing how stuff (c|s)hould be done without real experience.

Would love to see a chapter list!


I'll have one posted in the next day or two.


+1 for a chapter list (even a rough one)


You don't scale a language, you scale an architecture.


Sounds awesome; using my favorite php stack too :)

Would be nice to see a preview of different bits of the book when written as well.


I'll be pushing it to leanpub soon, so there will be an opportunity to check it out and get beta editions.


Good timing on the pre-release promotion. Now keep us updated on a weekly basis and you will sell a lot of copies here.

I'm glad you are writing the book. It seems like a worthy addition to my bookshelf. Will you be blogging about specifics mentioned in the book?


Seems your site doesn't scale very well..

404 Not Found

Code: NoSuchBucket Message: The specified bucket does not exist BucketName: www.scalingphpbook.com RequestId: 403C0590E064E19F HostId: a+UggS1lMBgPJrT5X/kbdzsRK1kx+iKBQw6u4dZxieNkspwHbLZBWzXMa9CiHEAu


Sorry man. The site is completely static and hosted on Amazon S3/CloudFront. I made a typo fix earlier and Transmit totally nixed the permissions, but CloudFront didn't pick it up until now. It should be fixed.


404 errors are not due to lack of scaling. I can't think of a case where you'd get that.

Scaling problems will usually return 5xx errors, most commonly 502, 503 and 504.


It was a joke :p


Is it just me, or is the site 404ing? Good joke, whoever put this all together.


My thoughts exactly. "I can teach you to scale to millions." Site dies with large amounts of traffic...


Sounds very interesting, will be keeping a close eye on this.

On a side note: What is the icon for "Happy Users" supposed to be? It looks like Facebook's like icon but without a thumb.


How long has this been in the works? Domain was registered yesterday... Great way to measure interest :)


Very exciting! I hope this will be released on amazon or have affordable shipping to Germany.



Neat! I'll definitely be making use of this, thanks.



