Show HN: Webpage to PDF Microservice

krn · on Oct 21, 2018

A basic command line alternative using Headless Chrome[1]:

  chrome --headless --disable-gpu --print-to-pdf https://www.chromestatus.com/

[1] https://developers.google.com/web/updates/2017/04/headless-c...
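
For anyone scripting the command above rather than typing it, here is a minimal Python sketch. It assumes a Chrome or Chromium binary is on PATH under one of the names listed (those names, the function names, and the timeout are illustrative, not from the linked article), and it uses the `--print-to-pdf=<path>` form of the flag to choose the output file.

```python
import shutil
import subprocess
from pathlib import Path


def build_print_command(chrome_binary: str, url: str, output_pdf: Path) -> list:
    """Build the argument list for a headless Chrome print-to-PDF run."""
    return [
        chrome_binary,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={output_pdf}",  # write the PDF to this path
        url,
    ]


def print_to_pdf(url: str, output_pdf: Path, timeout_seconds: int = 60) -> Path:
    """Run headless Chrome once and return the path of the generated PDF."""
    # Binary names are assumptions; adjust them for your platform.
    for candidate in ("chrome", "google-chrome", "chromium", "chromium-browser"):
        binary = shutil.which(candidate)
        if binary:
            break
    else:
        raise FileNotFoundError("no Chrome/Chromium binary found on PATH")

    # check=True raises CalledProcessError if Chrome exits with an error.
    subprocess.run(
        build_print_command(binary, url, output_pdf),
        check=True,
        timeout=timeout_seconds,
    )
    return output_pdf
```

Calling `print_to_pdf("https://www.chromestatus.com/", Path("out.pdf"))` should behave like the one-liner, with the timeout guarding against pages that never finish loading.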

rav · on Oct 21, 2018

Similar functionality is packaged in wkhtmltopdf, which essentially runs Webkit headless to print to PDF.

https://wkhtmltopdf.org/

forapurpose · on Oct 21, 2018

The article talks about wkhtmltopdf; in fact, they developed their server in response to its limitations:

The wkhtmltopdf utility has been around awhile and works great when you get it working correctly on your platform. However, the newest version as of this writing, 0.12.5, has a bug preventing TOC generation on some platforms. Some Linux platforms require the installation of Microsoft font packs, and compiling from source leads you down a rabbit hole of dependency hell.

cjimti · on Oct 21, 2018

txPDF is a simple containerized web services wrapper around wkhtmltopdf, intended to be used as a Microservice component in a larger system.

ashkulz · on Oct 22, 2018

wkhtmltopdf maintainer here. That's really cool!

Did you manage to find a workaround for https://github.com/wkhtmltopdf/packaging/issues/2? If so, would appreciate a PR :-)

cjimti · on Oct 28, 2018

Thanks, I'll check out that issue and see if there is anything I can contribute. wkhtmltopdf is a great utility and we rely on it heavily.

stevekemp · on Oct 21, 2018

I've reported many bugs in projects that turn "URL" to "PDF".

You need to be sure you're limiting the kinds of URLs that people can submit. For example, ensure that nobody can make a PDF of:

* file:////etc/passwd

* http://169.254.169.254/latest/meta-data/local-hostname

* http://localhost:8080/

I'd say over half of the "PDF-creation" projects posted here have been vulnerable to some/all of those attacks. (I continue to be surprised at how many web-to-pdf services exist. I guess there must be a lot of people paying for them?)
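
A rough sketch of the kind of check described above, in Python using only the standard library. The function names and the allow-list are hypothetical, and this is a starting point under stated assumptions rather than a complete defence: it only accepts http/https URLs and rejects hosts that resolve to loopback, private, link-local, or otherwise non-public addresses.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Assumption: only plain web URLs should ever be rendered.
ALLOWED_SCHEMES = {"http", "https"}


def is_public_address(address: str) -> bool:
    """Return True only for addresses that are not internal or special-purpose."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local   # covers 169.254.169.254 (cloud metadata)
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def is_safe_url(url: str, resolve=socket.getaddrinfo) -> bool:
    """Reject non-web schemes and hosts that resolve to internal addresses."""
    try:
        parts = urlsplit(url)
        if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
            return False
        # Every address the host resolves to must be public.
        results = resolve(parts.hostname, None)
        return bool(results) and all(
            is_public_address(info[4][0]) for info in results
        )
    except (OSError, ValueError):
        # Resolution failures and unparsable input are treated as unsafe.
        return False
```

Note that the renderer resolves the hostname again when it fetches the page, and the page can redirect or load sub-resources, so a check like this is best paired with network-level restrictions on the rendering process.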

cjimti · on Oct 21, 2018

These are great security suggestions, and I should clarify the intended use. We use txPDF as a backend microservice; it is not open to direct public use. It is good for automating report generation from other portions of a larger system.

jarofgreen · on Oct 21, 2018

Also make sure that people can't use them to mine cryptocurrency. I've seen the owner of one such project blog about how that happened to them.

cjr · on Oct 21, 2018

I'm the owner/dev of one of those paid services, and yes, competition is fierce, but people do still pay for the convenience of not having to manage it themselves. One look at the issue count of puppeteer/phantomjs/selenium/slimer... tells its own story.

thomasfromcdnjs · on Oct 21, 2018

Awesome timing. Just started work on a LinkedIn alternative called https://jaresume.com

We need a reliable way of turning people's resumes into PDFs.

Going to give this a go today or tomorrow.

Doing it with https://github.com/GoogleChrome/puppeteer also works quite well

vfulco2 · on Oct 22, 2018

I venture there is huge money in a sweet path from latex resumes in pdf format to ms word. I want to offer my clients a basic template but if I choose the latex route, I will inevitably have requests for the latter no matter how lame the format.

ivanche · on Oct 21, 2018

Such an interesting concept! I just signed up a minute ago so I can't give much feedback yet, but I wish you great success with this!

dvh · on Oct 21, 2018

Why not simply press Ctrl+p and print to PDF?

thomasfromcdnjs · on Oct 21, 2018

We have tried it in the past, just doesn't work reliably with different html configurations.

Does ctrl+p on this page -> https://jaresume.com/thomas look good for you?

rcfox · on Oct 21, 2018

Have you tried using a print media stylesheet? You could hide the navigation, reduce the whitespace, maybe shrink the font size a little bit, and remove link text decoration.

thomasfromcdnjs · on Oct 21, 2018

Great idea. I have used print media stylesheets in the past, but found them prone to regressions, e.g. elements that are introduced but not hidden. A webpage-to-PDF process is also vulnerable to that, though.

I think ideally, because the resume renderer is a react component, I'd rather just boot up chromium with the react component and resume data and do a fully clean render of the page into pdf.

We shall see.

rcfox · on Oct 21, 2018

For me, when I load the page, I see a resume display and then a split second later it's all replaced with "AN UNEXPECTED ERROR HAS OCCURRED."

Is the error so critical that it must hide your content? Did it accidentally include your AWS keys or something?

thomasfromcdnjs · on Oct 21, 2018

Sorry about that. I had just introduced a bug for anonymous users. Should work now.

NetOpWibby · on Oct 21, 2018

Seems like a neat project.

thomasfromcdnjs · on Oct 21, 2018

Thanks! Feedback always welcomed.

ernsheong · on Oct 21, 2018

You can achieve this using just the browser.

In Chrome Dev Tools, click on the devices button (the icon with the phone and tablet). Using the top-right menu, select "Capture full size screenshot".

Voilà, you now have a full size screenshot that you can convert into PDF.

Incidentally, I am author of https://www.pagedash.com, which is a personal web scrapbook which allows you to capture the current page as HTML and generate links to share with others.

superasn · on Oct 21, 2018

I tried it with this page only but it didn't work for me. Got a 110Kb png file but it's empty. It is a valid PNG but it's completely blank. Maybe it's buggy.

ZeKZ · on Oct 21, 2018

I find wkhtmltopdf very difficult to work with; for instance, the official documentation is just a man page [1].

I discovered the project Weasyprint[2] a few months ago. I find it easier to use, and very powerful when using Python. You can define a custom loader to inject images or styles generated on the fly for instance.

There are still some missing features compared to wkhtmltopdf, such as defining a custom footer and header, but it's a very promising project.

[1] https://wkhtmltopdf.org/usage/wkhtmltopdf.txt

[2] https://github.com/Kozea/WeasyPrint/
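
To illustrate the custom loader mentioned above: WeasyPrint's `HTML` class accepts a `url_fetcher` callable, and its documentation has shown a similar pattern with a made-up URL scheme. The sketch below assumes that API; the `dynamic:` scheme, the stylesheet contents, and the function names are invented for illustration, and newer WeasyPrint releases may differ.

```python
def dynamic_fetcher(url: str) -> dict:
    """Serve hypothetical 'dynamic:' URLs from memory; defer the rest to WeasyPrint."""
    if url.startswith("dynamic:"):
        name = url[len("dynamic:"):]
        if name == "report.css":
            # Stylesheet generated on the fly instead of read from disk.
            css = "h1 { color: #336699; } body { font-family: sans-serif; }"
            return {"string": css.encode("utf-8"), "mime_type": "text/css"}
        raise ValueError(f"unknown dynamic resource: {name}")

    # Imported lazily so the in-memory branch works without WeasyPrint installed.
    from weasyprint import default_url_fetcher
    return default_url_fetcher(url)


def render(html: str, output_path: str) -> None:
    """Render an HTML string to a PDF file, resolving resources via dynamic_fetcher."""
    from weasyprint import HTML
    HTML(string=html, url_fetcher=dynamic_fetcher).write_pdf(output_path)
```

An HTML document containing `<link rel="stylesheet" href="dynamic:report.css">` would then pick up the generated stylesheet when passed to `render`.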

jimnotgym · on Oct 21, 2018

Since you mention Python, I have found pdfkit[1] to be a pretty good wrapper for wkhtmltopdf. I have a document generation engine that uses it dozens of times a day. Worst part is that wkhtmltopdf in the Ubuntu repos is still compiled (when last checked) without some patch that allows it to run headlessly. I built from source, which was not too difficult.

[1] https://pypi.org/project/pdfkit/

Globz · on Oct 21, 2018

One of my applications at work takes a user order sheet created through the main app workflow and transposes it to an HTML document, which is then converted to a PDF by wkhtmltopdf and dispatched via email, etc.

I found this setup to be really stable and easy to maintain, so far it has produced around 70k orders per year and has been running for over 4 years now without any hiccups.

Before that I was using phantomjs, but it wasn’t as fast or reliable, for reasons that I can’t quite remember now since I haven’t touched that part of the app in a long time.

All I remember is that wkhtmltopdf was easier to tweak and compose with.

btown · on Oct 21, 2018

https://prerender.com/ is a great service (fully MIT-licensed at https://github.com/prerender/prerender ) for this type of thing, both for rendering internal pages and for scraping/rendering external sites that rely heavily on client-side code.

liftbigweights · on Oct 21, 2018

You can also use pdf printers available in linux distros and even windows now.

bramd · on Oct 21, 2018

I'm still looking for a service like this, but that creates a nicely tagged PDF and conveys the HTML structure in the PDF tags.

Tagged PDFs are a requirement in many processes for accessibility or archival reasons.

gildas · on Oct 21, 2018

Why not use HTML instead of PDF? I'm the author of an extension that lets you faithfully save a web page into a single HTML file [1]. From my point of view, that should be the best solution for archiving web pages in a file. Votes on HN disagree with me though [2]; I wish I could understand why.

[1] https://github.com/gildas-lormeau/SingleFile

[2] https://news.ycombinator.com/item?id=18243721

Ibethewalrus · on Oct 21, 2018

I read recently that PDF is the de facto standard in government.

jotto · on Oct 21, 2018

alternatively if you want a SaaS REST API:

   curl https://service.prerender.cloud/screenshot/https://google.com/ > out.jpg

   curl https://service.prerender.cloud/pdf/https://google.com/ > out.pdf

   curl https://service.prerender.cloud/https://google.com/ > out.html

https://www.prerender.cloud/

fastball · on Oct 21, 2018

Why not just

> Print

> Open as PDF

?

supermatt · on Oct 21, 2018

To save your microservice having to run a graphical environment and simulate mouse interaction?