Show HN: Webpage to PDF Microservice (imti.co)
93 points by cjimti on Oct 21, 2018 | 34 comments



A basic command line alternative using Headless Chrome[1]:

  chrome --headless --disable-gpu --print-to-pdf https://www.chromestatus.com/
[1] https://developers.google.com/web/updates/2017/04/headless-c...
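
If you want to drive that command from a script, a minimal Python sketch could look like the following. The helper names, the `chrome` binary name on PATH, and the timeout are assumptions for illustration; the `--print-to-pdf=PATH` form is used so the output location is explicit.

```python
import subprocess
from typing import List


def build_chrome_pdf_command(url: str, output_path: str, chrome_binary: str = "chrome") -> List[str]:
    """Build the argv list for a headless Chrome print-to-PDF run."""
    return [
        chrome_binary,                    # assumed to be on PATH; adjust for your platform
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={output_path}",  # the "=path" form chooses where the PDF is written
        url,
    ]


def render_pdf(url: str, output_path: str, chrome_binary: str = "chrome", timeout_s: float = 60.0) -> None:
    """Run Chrome; raises if it exits non-zero or runs longer than timeout_s."""
    command = build_chrome_pdf_command(url, output_path, chrome_binary)
    subprocess.run(command, check=True, timeout=timeout_s)
```

Passing the arguments as a list (no shell) keeps a user-supplied URL from being interpreted as shell syntax. Something like `render_pdf("https://www.chromestatus.com/", "out.pdf")` would then do the same job as the one-liner above.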


Similar functionality is packaged in wkhtmltopdf, which essentially runs Webkit headless to print to PDF.

https://wkhtmltopdf.org/


The article talks about wkhtmltopdf; in fact, they developed their server in response to its limitations:

The wkhtmltopdf utility has been around a while and works great when you get it working correctly on your platform. However, the newest version as of this writing, 0.12.5, has a bug preventing TOC generation on some platforms. Some Linux platforms require the installation of Microsoft font packs, and compiling from source leads you down a rabbit hole of dependency hell.


txPDF is a simple containerized web services wrapper around wkhtmltopdf, intended to be used as a Microservice component in a larger system.


wkhtmltopdf maintainer here. That's really cool!

Did you manage to find a workaround for https://github.com/wkhtmltopdf/packaging/issues/2? If so, would appreciate a PR :-)


Thanks, I'll check out that issue and see if there is anything I can contribute. wkhtmltopdf is a great utility and we rely on it heavily.


I've reported many bugs in projects that turn "URL" to "PDF".

You need to be sure you're limiting the kind of URLs that people can submit. For example, ensure that nobody can make a PDF of:

* file:////etc/passwd

* http://169.254.169.254/latest/meta-data/local-hostname

* http://localhost:8080/

I'd say over half of the "PDF-creation" projects posted here have been vulnerable to some/all of those attacks. (I continue to be surprised at how many web-to-pdf services exist. I guess there must be a lot of people paying for them?)
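
A rough sketch of that kind of check in Python, using only the standard library. The function names and the exact policy (http/https only, every resolved address must be public) are assumptions for illustration, not what any particular service does.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def _is_public_ip(address: str) -> bool:
    """True if the address is not private, loopback, link-local, or otherwise special."""
    ip = ipaddress.ip_address(address.split("%", 1)[0])  # drop any IPv6 zone id
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_url_allowed(url: str) -> bool:
    """Allow only http(s) URLs whose host resolves solely to public addresses."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    if parts.scheme not in ALLOWED_SCHEMES or not host:
        return False  # rejects file://, ftp://, and similar schemes
    try:
        infos = socket.getaddrinfo(host, port or 80, type=socket.SOCK_STREAM)
    except (socket.gaierror, UnicodeError):
        return False  # unresolvable hosts are rejected rather than passed along
    # Every address the name resolves to must be public, not just the first one.
    return bool(infos) and all(_is_public_ip(info[4][0]) for info in infos)
```

With this, all three URLs listed above are rejected. It is not a complete defence on its own: the renderer resolves the name again and may follow redirects, so blocking internal ranges at the network level for the rendering process is still worth doing.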


These are great security suggestions, and I should make some clarifications on the intended use. We use txPDF as a backend microservice that is not open to direct public use. It is good for automating report generation from other portions of a larger system.


Also make sure that people can't use them to mine cryptocurrency. I've seen the owner of one such project blog about how that happened to them.


I'm the owner/dev of one of those paid services, and yes, competition is fierce, but people do still pay for the convenience of not having to manage it themselves. One look at the issue count of puppeteer/phantomjs/selenium/slimer... tells its own story.


Awesome timing. Just started work on a LinkedIn alternative called https://jaresume.com

We need a reliable way of turning people's resumes into PDFs.

Going to give this a go today or tomorrow.

Doing it with https://github.com/GoogleChrome/puppeteer also works quite well


I venture there is huge money in a sweet path from latex resumes in pdf format to ms word. I want to offer my clients a basic template but if I choose the latex route, I will inevitably have requests for the latter no matter how lame the format.


Such an interesting concept! I just signed up a minute ago, so I can't give much more feedback yet, but I wish you great success with this!


Why not simply press Ctrl+p and print to PDF?


We have tried it in the past; it just doesn't work reliably with different HTML configurations.

Does ctrl+p on this page -> https://jaresume.com/thomas look good for you?


Have you tried using a print media stylesheet? You could hide the navigation, reduce the whitespace, maybe shrink the font size a little bit, and remove link text decoration.


Great idea. I have used print media stylesheets in the past, but found them prone to regressions, e.g. elements that are introduced but not hidden. A webpage-to-PDF process is also vulnerable to that, though.

I think ideally, because the resume renderer is a react component, I'd rather just boot up chromium with the react component and resume data and do a fully clean render of the page into pdf.

We shall see.


For me, when I load the page, I see a resume display and then a split second later it's all replaced with "AN UNEXPECTED ERROR HAS OCCURRED."

Is the error so critical that it must hide your content? Did it accidentally include your AWS keys or something?


Sorry about that. I had just introduced a bug for anonymous users. Should work now.


Seems like a neat project.


Thanks! Feedback always welcomed.


You can achieve this using just the browser.

In Chrome Dev Tools, click on the devices button (the icon with the phone and tablet). Using the top-right menu, select "Capture full size screenshot".

Voilà, you now have a full size screenshot that you can convert into PDF.

Incidentally, I am author of https://www.pagedash.com, which is a personal web scrapbook which allows you to capture the current page as HTML and generate links to share with others.


I only tried it with this page, but it didn't work for me. I got a 110 KB PNG file that is valid but completely blank. Maybe it's buggy.


I find wkhtmltopdf very difficult to work with; for instance, the official documentation is just a man page [1].

I discovered the project Weasyprint[2] a few months ago. I find it easier to use, and very powerful when using Python. You can define a custom loader to inject images or styles generated on the fly for instance.

There are still some missing features compared to wkhtmltopdf, such as defining a custom footer and header, but it's a very promising project.

[1] https://wkhtmltopdf.org/usage/wkhtmltopdf.txt

[2] https://github.com/Kozea/WeasyPrint/


Since you mention Python, I have found pdfkit[1] to be a pretty good wrapper for wkhtmltopdf. I have a document generation engine that uses it dozens of times a day. Worst part is that wkhtmltopdf in the Ubuntu repos is still compiled (when last checked) without some patch that allows it to run headlessly. I built from source, which was not too difficult.

[1]https://pypi.org/project/pdfkit/


One of my applications running at work has the task of taking a user order sheet made through the main app workflow and transposing it to an HTML document, which is then converted to a PDF document by wkhtmltopdf and dispatched via email, etc.

I found this setup to be really stable and easy to maintain, so far it has produced around 70k orders per year and has been running for over 4 years now without any hiccups.

Before that I was using phantomjs, but it wasn't as fast or reliable, for reasons that I can't quite remember now, since I haven't touched that part of the app in a long time.

All I remember is that wkhtmltopdf was easier to tweak and compose with.


https://prerender.com/ is a great service (fully MIT-licensed at https://github.com/prerender/prerender ) for this type of thing, both for rendering internal pages and for scraping/rendering external sites that rely heavily on client-side code.


You can also use the PDF printers available in Linux distros and even in Windows now.


I'm still looking for a service like this, but that creates a nicely tagged PDF and conveys the HTML structure in the PDF tags.

Tagged PDFs are a requirement in many processes for accessibility or archival reasons.


Why not use HTML instead of PDF? I'm the author of an extension that faithfully saves a web page into a single HTML file [1]. From my point of view, that should be the best solution for archiving web pages in a file. Votes on HN disagree with me though [2]; I wish I could understand why.

[1] https://github.com/gildas-lormeau/SingleFile

[2] https://news.ycombinator.com/item?id=18243721


I read recently that PDF is the de facto standard for government.


Alternatively, if you want a SaaS REST API:

   curl https://service.prerender.cloud/screenshot/https://google.com/ > out.jpg

   curl https://service.prerender.cloud/pdf/https://google.com/ > out.pdf

   curl https://service.prerender.cloud/https://google.com/ > out.html

https://www.prerender.cloud/
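
The same three calls from Python, as a small sketch. Only the three URL shapes come from the curl lines above; the helper names and timeout are made up for illustration, and authentication or rate limits are not covered here.

```python
from urllib.request import urlopen

SERVICE_ROOT = "https://service.prerender.cloud"
# Path prefixes taken from the curl examples; plain HTML uses no prefix.
ENDPOINT_PREFIXES = {"html": "", "pdf": "pdf/", "screenshot": "screenshot/"}


def prerender_url(target_url: str, kind: str = "html") -> str:
    """Build the service URL for the given target page and output kind."""
    if kind not in ENDPOINT_PREFIXES:
        raise ValueError(f"unknown kind: {kind!r}")
    return f"{SERVICE_ROOT}/{ENDPOINT_PREFIXES[kind]}{target_url}"


def fetch_to_file(target_url: str, kind: str, output_path: str, timeout_s: float = 60.0) -> None:
    """Download the rendered result and write the raw bytes to output_path."""
    with urlopen(prerender_url(target_url, kind), timeout=timeout_s) as response:
        data = response.read()
    with open(output_path, "wb") as out:
        out.write(data)
```

For example, `fetch_to_file("https://google.com/", "pdf", "out.pdf")` requests the same URL as the second curl line.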


Why not just

> Print

> Open as PDF

?


To save your microservice having to run a graphical environment and simulate mouse interaction?



