Last time I checked (maybe 2 years ago?) there wasn't a good open source HTML-to-PDF workflow. Specifically, page numbers and anything else involving paged media are a nightmare, since the relevant CSS standards are not implemented. There is "Prince", but it isn't OSS and it's rather expensive.
PhantomJS (and its ilk) are based on browser engines and just don't support this. Also I would love to be able to change layout or content based on where particular elements turn up.
DocRaptor (https://docraptor.com/) is a monthly subscription service that uses Prince under the hood. It can be a cheaper option (though I wish it had a PAYG model).
WeasyPrint looks to have progressed a lot since last I looked at it (when there was no way I could use it at all, though I can’t remember the reason), but looks to still be quite limited. A couple of things that spring to my attention immediately: no flexbox, and no CSSOM (so that you can’t adjust the document based on layout at all unless you can do it in straight CSS). Still, in practice probably usable for what I was doing several years ago, with similar limitations to Prince (which also has no CSSOM implementation, though at least it has a JavaScript engine). But anything more advanced in the way of fine page-dependent layout, I get the impression that WeasyPrint won’t be able to do, while Prince is superb at handling such things.
Yup, Prince is great. There are reasons they can get paid when WeasyPrint is free. But my parent post already knew about Prince and specifically wanted to know about an OSS one.
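For anyone who wants to try the page-number case mentioned above with WeasyPrint, here is a minimal sketch. It assumes WeasyPrint is installed and relies on its support for CSS page-margin boxes and the `page`/`pages` counters; `build_html`, `render_pdf`, and `PAGED_CSS` are names made up for this example, not part of any library.

```python
from html import escape
from typing import List

# CSS paged-media rules: a footer in the bottom margin box with "Page N of M".
PAGED_CSS = """
@page {
  size: A4;
  margin: 2cm;
  @bottom-center {
    content: "Page " counter(page) " of " counter(pages);
  }
}
"""


def build_html(title: str, paragraphs: List[str]) -> str:
    """Build a small standalone HTML document with the paged-media CSS embedded."""
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n<html><head><meta charset='utf-8'>"
        f"<title>{escape(title)}</title>"
        f"<style>{PAGED_CSS}</style></head>"
        f"<body><h1>{escape(title)}</h1>\n{body}</body></html>"
    )


def render_pdf(html: str, target: str) -> None:
    """Render the HTML string to a PDF file (requires WeasyPrint)."""
    # Imported lazily so build_html() can be used without WeasyPrint installed.
    from weasyprint import HTML

    HTML(string=html).write_pdf(target)
```

Usage would be something like `render_pdf(build_html("Report", ["some text"] * 200), "report.pdf")`. This only covers static page numbering; it does nothing for the "adjust content based on where things land" case.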
One of the problems with CSSOM and similar things is iterative layout: Let's say you have a list which should be split automatically across two pages. You also want to add a table header on the second page. But now the second part of the list doesn't fit on the page anymore. So you need to split the list further, and add another header.
Basically every change to the CSS or DOM requires a reflow (or clever optimizations to avoid that).
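To make the list-plus-header scenario concrete, here is a toy sketch. It assumes a one-dimensional layout where every block just has a height and a header always starts a new page; `flow` and `paginate_with_headers` are hypothetical names, and a real engine is far more involved. It deliberately does a full re-layout after every inserted header, to show how the passes pile up.

```python
from typing import List, Tuple

Block = Tuple[str, int]  # (kind, height); kind is "item" or "header"


def flow(blocks: List[Block], page_height: int) -> List[List[Block]]:
    """One full layout pass: fill pages greedily, top to bottom."""
    pages: List[List[Block]] = [[]]
    used = 0
    for block in blocks:
        kind, height = block
        forces_break = kind == "header"  # a header always opens a new page
        if pages[-1] and (forces_break or used + height > page_height):
            pages.append([])
            used = 0
        pages[-1].append(block)
        used += height
    return pages


def paginate_with_headers(items: List[int], page_height: int, header_height: int):
    """Re-run layout until every continuation page starts with a header."""
    if items and header_height + max(items) > page_height:
        raise ValueError("a header plus the tallest item must fit on one page")
    blocks: List[Block] = [("item", h) for h in items]
    passes = 0
    while True:
        pages = flow(blocks, page_height)  # naive: reflow everything
        passes += 1
        offset = len(pages[0])
        fix_at = None
        for page in pages[1:]:
            if page[0][0] != "header":
                fix_at = offset  # index of the first block on this page
                break
            offset += len(page)
        if fix_at is None:
            return pages, passes
        blocks.insert(fix_at, ("header", header_height))
```

With ten items of height 30, a page height of 100, and a header height of 20, the list alone fits on four pages, but adding the headers pushes it to five, and the naive loop needs one layout pass per page to get there.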
That’s no different from what browsers do at present.
Yes, modifications may cause a reflow. So? That just means that it’s slow. That’s not a problem.
That’s how you implement such things. The initial implementation throws away all layout information as soon as you modify any CSSOM property, and recalculates it. You release that to people saying “it can now do this, but it’ll be extremely slow; let us know what sorts of things you do with it and when you find particularly awful performance cases, and then we’ll look into speeding it up”. Then, as people try using it, you determine where it’s worth putting effort into speeding it up. This is exactly what Michael Day of Prince said they’d do if/when they implemented CSSOM, when I asked him about whether it might come, several years ago. This is an entirely reasonable approach.
I have a theory about how this should be solved in web standards, which in turn has other uses besides printing, but, well, I don't feel like I can affect web standards in any meaningful sense.
The basic idea is to add (placeholder names) stopUpdating/resumeUpdating to window, which can be polyfilled as no-ops. The semantics would be that, while updating is stopped, CSSOM view methods are allowed to return the value from before updating was stopped, or any later value. That is, the current web standards force you to do things "live", while the new methods would give you the option to do things in batch.
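A toy model of those semantics, to show what "allowed to return the old value" would mean in practice. Everything here is hypothetical: `ToyWindow` is not a browser API, the "layout" is just a vertical stack of boxes, and this version always returns the oldest permitted value while stopped, although the proposal would also allow any later one.

```python
from typing import Dict


class ToyWindow:
    """Stand-in for a window whose layout is a single column of boxes."""

    def __init__(self, heights: Dict[str, int]) -> None:
        self._heights = dict(heights)
        self._tops = self._compute_layout()
        self._dirty = False
        self._stopped = False
        self.layout_passes = 1  # the initial layout

    def _compute_layout(self) -> Dict[str, int]:
        tops, y = {}, 0
        for name, height in self._heights.items():
            tops[name] = y
            y += height
        return tops

    def stop_updating(self) -> None:
        self._stopped = True

    def resume_updating(self) -> None:
        self._stopped = False

    def set_height(self, name: str, height: int) -> None:
        self._heights[name] = height
        self._dirty = True  # layout is now out of date

    def offset_top(self, name: str) -> int:
        # While stopped, a read may return the stale value instead of forcing layout.
        if self._dirty and not self._stopped:
            self._tops = self._compute_layout()
            self.layout_passes += 1
            self._dirty = False
        return self._tops[name]
```

Between `stop_updating()` and `resume_updating()`, any number of writes and reads costs no layout passes, at the price of reads that may not reflect earlier writes. A polyfill that makes both calls no-ops would simply behave "live", which is still within the permitted semantics.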
So long as you don’t read layout information in between, you can already batch your modifications just fine. All it takes is care in how you structure and implement things, and you’re fine. And before you object, your proposed solution would require almost as much care, have more hazards to trip over, and require opting in in a way that few would—and those that would, already know how to be careful. I suspect it would also increase complexity and possibly memory usage in the browser.
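A rough illustration of that point, using a toy layout that is thrown away on every write and recomputed lazily on the next read. `LazyLayout`, `interleaved`, and `batched` are made-up names for this sketch, and the "layout" is just a sum of heights.

```python
from typing import Dict, Optional


class LazyLayout:
    """Discards cached layout on any write; recomputes on the next read."""

    def __init__(self, heights: Dict[str, int]) -> None:
        self._heights = dict(heights)
        self._total: Optional[int] = None
        self.layout_passes = 0

    def set_height(self, name: str, height: int) -> None:
        self._heights[name] = height
        self._total = None  # invalidate everything, no cleverness

    def total_height(self) -> int:
        if self._total is None:
            self.layout_passes += 1
            self._total = sum(self._heights.values())
        return self._total


def interleaved(doc: LazyLayout, updates: Dict[str, int]) -> int:
    """Read layout after every write: one layout pass per write."""
    for name, height in updates.items():
        doc.set_height(name, height)
        doc.total_height()
    return doc.total_height()


def batched(doc: LazyLayout, updates: Dict[str, int]) -> int:
    """Do all the writes first, then read once: a single layout pass."""
    for name, height in updates.items():
        doc.set_height(name, height)
    return doc.total_height()
```

Both functions end up with the same final value; the only difference is how many times the layout gets recomputed along the way.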
If you were designing something from scratch, such approaches would be worth considering, but I think that boat has sailed, and the architecture would fight against you.
Then again, I believe it was generally accepted that web browsers were stuck with UTF-16, until Simon introduced WTF-8 for Servo.
I think the approach of WeasyPrint and Prince, implementing a dedicated layout engine for paged media, is better than making these things work in browser engines.
In any case, HTML/CSS for paged media should be mostly separate from website code. "Printing out" web pages works in many cases, but it's crappy.
The only two reasons printing web pages is lousy are that web developers put little to no effort into it, and that the browser manufacturers put little to no effort into it. I would love the likes of WeasyPrint and Prince to be rendered obsolete by one or more mainstream web browsers. If any of them decided that it was a strategic priority, they’d get a lot done very quickly. It’s just that there’s no compelling reason for them to, while there is for the people behind engines with a specific purpose, and so Prince is pretty safe in its position.
In general I think TeX/LaTeX is the way to go in terms of reporting and generation of PDFs. The biggest problem with TeX is that it is so different from HTML, and it gets progressively more different and difficult if you have specific layout or style requirements.
What I wish for is a replacement for LaTeX, based mostly on web standards, extensible in JavaScript... Unfortunately I don't have the resources to do that.