PEP – An open source PDF editor for Mac (macpep.org)
191 points by threcius on Sept 15, 2020 | 107 comments



I have a dream... that one day people will name their projects with names that don't exist on google yet.

If you search for PEP now you'll find python enhancement proposals, and the "Philippine Entertainment Portal" and the stock code for PepsiCo.


I wish that once people do name their project, they would assign it a 128-bit random number in lower case hex, and include that number on any web page that they would like people searching for their project to find.

That way once I know that say PEP the PDF editor exists and find its 128-bit number (let's say that is 379dd864b16eaca3ce94c15a6bdfcc73), at least I can subsequently toss a +379dd864b16eaca3ce94c15a6bdfcc73 on my searches to effectively let the search engine know I want PEP the PDF editor results rather than PEP the python enhancement results or PEP the entertainment portal results or PEP that refreshing beverage company stock symbol.

"xxd -l 16 -p /dev/urandom" is a handy way to get a 128-bit random hex number. A UUID generator works, too, although they usually include some punctuation you will need to delete and you might have to lower case their output.
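If xxd isn't handy, roughly the same thing can be done from Python's standard library. This is only a minimal sketch; the variable names `project_id` and `uuid_id` are made up for illustration.

```python
import secrets
import uuid

# 16 random bytes from the OS CSPRNG, rendered as 32 lowercase hex characters.
project_id = secrets.token_hex(16)
print(project_id)

# A version 4 UUID also works: .hex drops the dashes and is already lowercase.
# Six of its 128 bits are fixed (version and variant), so 122 bits are random.
uuid_id = uuid.uuid4().hex
print(uuid_id)
```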


> I wish that once people do name their project, they would assign it a 128-bit random number in lower case hex

We already have something similar: URLs.


Except a URL only points to one resource. The idea here is that this identifier would exist on any resource related to PEP (maybe even in URLs).


>Except a URL only points to one resource

Isn't that exactly what this is asking for, though? A URL can by definition only point to one resource. So if you include that URL with every other reference to the project (in the app descriptions, blog posts about it, etc.) then you always know you're talking about the same thing. It makes a lot more sense that any resource related to this PDF editor should include a link to "https://macpep.org" instead of including some random 128-bit hex string. Any resource related to Python PEPs should include a link to "https://www.python.org/dev/peps/" (which all PEPs do, by virtue of having a URL that's a subdirectory of the PEP index URL).


That’s not a problem, you just add a meta or a link tag that points to “the” url for your project (maybe og:app-id Or a link with rel=“app”)


Yeah but urls and names might need to change due to marketing. A hash would uniquely id the project and let the marketing aspect be dynamic.

Although if the actual project name, authors, and codebase changes is it even still the same project?


[flagged]


There's a number of professions I'd like to add to that pyre please. ;)


tzs is proposing URNs rather than textual program names. A URL is unnecessarily specific (though I suppose you could anycast URL resolution)


> RFC 4122 defines a Uniform Resource Name (URN) namespace for UUIDs. A UUID presented as a URN appears as follows:[1]

> > urn:uuid:123e4567-e89b-12d3-a456-426655440000

https://en.wikipedia.org/wiki/Universally_unique_identifier#...

Version 4 UUIDs have 122 random bits (out of 128 bits total).

In Python:

  >>> import uuid
  >>> _id = uuid.uuid4()
  >>> _id.urn
  'urn:uuid:4c466878-a81b-4f22-a112-c704655fa4ee'

Whether search engines will consider a URL or a URN or a random str without dashes to be one searchable-for token is pretty ironic in terms of extracting relations between resources in a Linked Data hypergraph.

  >>> _id.hex
  '4c466878a81b4f22a112c704655fa4ee'

The relation between a resource and a Thing with a URI/URN/URL can be expressed with https://schema.org/about . In JSON-LD ("JSONLD"):

  {"@context": "https://schema.org",
   "@type": "WebPage",
   "about": {
     "@type": "SoftwareApplication",
     "identifier": "urn:uuid:4c466878-a81b-4f22-a112-c704655fa4ee",
     "url": ["", ""],
     "name": [
       "a schema.org/SoftwareApplication < CreativeWork < Thing",
       {"@value": "a rose by any other name",
        "@language": "en"}]}}

Or with RDFa:

  <body vocab="https://schema.org/" typeof="WebPage">
    <div property="about" typeof="SoftwareApplication">
      <meta property="identifier" content="urn:uuid:4c466878-a81b-4f22-a112-c704655fa4ee"/>
      <a property="url" href=""></a>
      <a property="url" href=""></a>
      <span property="name">a schema.org/SoftwareApplication &lt; CreativeWork &lt; Thing</span>
      <span property="name" lang="en">a rose by any other name</span>
    </div>
  </body>

Or with Microdata:

  <div itemtype="https://schema.org/WebPage" itemscope>
    <link itemprop="http://www.w3.org/ns/rdfa#usesVocabulary" href="https://schema.org/" />
    <div itemprop="about" itemtype="https://schema.org/SoftwareApplication" itemscope>
      <a itemprop="url" href=""></a>
      <a itemprop="url" href=""></a>
      <meta itemprop="identifier" content="urn:uuid:4c466878-a81b-4f22-a112-c704655fa4ee" />
      <meta itemprop="name" content="a schema.org/SoftwareApplication &lt; CreativeWork &lt; Thing"/>
      <meta itemprop="name" content="a rose by any other name" lang="en"/>
    </div>
  </div>


More in line with using "golang" since it is far easier to search for than just "go"


That's more or less what Ted Nelson envisioned in Xanadu, and that's why he usually says that modern cut and paste has nothing to do with the real cut and paste; he considers it “a crime against humanity.”

Even though he's friends with Larry Tesler, the man responsible for our modern use of cut and paste.

Links should bring back to the original source not point to some random text, with no context, that needs to be indexed


This solves the namespacing problem and allows creators and consumers to use different names if they want. Searching based on the creator's original name for a project becomes a mess because there will be a very large number of HelloWorld applications out there. Interestingly enough, the Google web store sort of already does this. The issue that comes up fairly quickly, though, is how to deal with the relationships between different packaged and published versions of what is nominally the same code base, or even forks/branches of the same code base. Another is maintaining a verifiable and discoverable chain for published artifacts without completely confusing users or exposing them to various malicious attacks (change a single byte in the middle of that random string and you have a nice off-by-one attack). Lots of infrastructure would be required to pull this off, but it would be great if it could be built.


If you want something like that with binaries, Nix(OS) might be for you.


That's actually a pretty solid idea. A meta tag for topics.

Only issue would be handling inevitable "SEO-ified" abusers of it.


Simple solution: 379dd864b16eaca3ce94c15a6bdfcc73™


Not sure I follow. Are you saying that by trademarking the fingerprint you can prevent SEO abuse?

The problem with any SEO mitigation is that the 128 bit string is intended for SEO. If you make a cool new thing then I blog about it I want to use your 128 bit string and you want me to use it too! So how do you prevent someone else from putting it on a linkfarm? I don't think trademark helps there.


If your page has 100 of these 128-bit strings in it, then it's even clearer that it's a link farm page that can be downranked.
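A rough sketch of that heuristic in Python, just to make it concrete. The function names and the `max_ids` cutoff of 20 are assumptions picked for illustration, not anything a real search engine is known to use.

```python
import re

# A standalone run of exactly 32 lowercase hex characters (one 128-bit ID).
HEX_ID = re.compile(r"(?<![0-9a-f])[0-9a-f]{32}(?![0-9a-f])")


def distinct_project_ids(page_text: str) -> set[str]:
    """Return the distinct 128-bit hex IDs that appear in the page text."""
    return set(HEX_ID.findall(page_text))


def looks_like_link_farm(page_text: str, max_ids: int = 20) -> bool:
    """Flag pages carrying more distinct IDs than the (hypothetical) cutoff."""
    return len(distinct_project_ids(page_text)) > max_ids


page = "PEP the PDF editor 379dd864b16eaca3ce94c15a6bdfcc73"
print(looks_like_link_farm(page))
```

A page that legitimately compares many projects would trip this too, so in practice it could only ever be one signal among several.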


That's actually a pretty good idea. Kind of like an official @mention/#hashtag for an exact topic; if it somehow wasn't abused by people, it would definitely improve related search results. Navigating user intent algorithms is getting more difficult.

Does schema.org etc support ids beyond keywords/categories? I guess the id could just be a keyword.

Maybe a public registry where you claim an id for a topic, similar to claiming a yelp page or an ISBN number. Then anyone posting related content includes that id. Popular topics could be grouped. You could generate memorable ids for most known topics/products/etc, and people just utilize them organically, robots could apply them automatically over time also.

It's especially bad for words with many definitions, like "bridge repair", could mean a dental bridge, guitar bridge, or a bridge over a lake.


Just added one to a project of mine[0] (just a YouTube browser using the RSS thing YouTube does). Hope it catches on!

[0] - https://github.com/devenblake/ytfeed.py


I too dream of content addressable web.



I kind of do this in my blog. Each page is assigned a UUID.


Instead of trolling HN you could contribute to Wikidata.


8 bit ascii could work.

8^256 is a huge number.


A 256 bit number is 2^256 possible combinations. 8^256 is the same as (2^3)^256 (or 2^768)
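Python's arbitrary-precision integers make this easy to check. A small sketch, with variable names chosen only for illustration:

```python
# Number of distinct values a 256-bit number can take.
n_256_bit = 2**256

# 8**256 rewritten as a power of two: (2**3)**256 == 2**(3 * 256) == 2**768.
eight_pow = 8**256
same_thing = eight_pow == 2**768
print(same_thing)

# So 8**256 overshoots the 256-bit keyspace by a factor of 2**512.
print(eight_pow // n_256_bit == 2**512)
```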


Er yeah. Derp.

8 character ascii not 8 bit.

It's early.

That's 8 bits ^ 8.

Or 256 ^ 8.

and easily able to be represented searching online with 8 characters.


Don't get discouraged, but you might still have a bug or two to work out with your new and never-before-tried "256^8 and easily able to be represented searching online with 8 characters" design. For your beta test, here are some of the unique 8-char identifiers you might want to try searching for: `unique `, ` unique `, ` unique`, `Unique `, `u^Hunique`, `un^Hnique`, `uniq^H^H^H^H`, ` . . . .`, `. . . . `, `uniqueESCESC`, `BELBELBELBELBELBELBELBEL`, ...


Lol, thanks for your totally non sarcastic assistance. It needs some fleshing out.


If the alternative is a 32 char, 128 bit hex string..I think that's a little excessive to expect people to use especially when an 8 char ascii has way more variation and is way easier to remember.

687c066db3458f7cbd5cc8bd58a65c64.

Vs.

*Xrh6x1!

You just have to eliminate dictionary words.


I can’t even this math.
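For scale, here is a quick comparison of the keyspaces being discussed. It's a sketch that assumes "8 char ascii" means either 8 arbitrary bytes or 8 of the 95 printable ASCII characters (0x20 to 0x7E); both readings are assumptions.

```python
import math

hex_ids = 16**32    # 32 lowercase hex characters, i.e. 2**128
all_bytes = 256**8  # 8 arbitrary 8-bit characters, i.e. 2**64
printable = 95**8   # 8 printable ASCII characters (assumed alphabet)

for label, count in [
    ("32-char hex", hex_ids),
    ("8 arbitrary bytes", all_bytes),
    ("8 printable ASCII chars", printable),
]:
    # log2 of the keyspace size is the number of bits of variation.
    print(f"{label}: about {math.log2(count):.1f} bits")
```

Under either reading, the 8-character form has far fewer possible values than the 32-character hex string.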


I worded wrong.


PEP is also "Politically exposed person" in anti-money-laundering circles, and often the lists we get are PDFs.

Seems like PEP is used for more things too :)


I study biochemistry, so “phosphoenolpyruvate” was the first thing to come to mind, ha.


Also Python Enhancement Proposal


that was already pointed out upthread from you


Unfortunately that will never happen.

The good news is google is smart, and if you add a couple of subject keywords it pretty much always works.

For example if you search "pep pdf editor" the site shows up in first place.

My only issue is naming things after words that are so incredibly common they're on practically every page anyways, and thus truly useless for searching. I'm looking at you, Go.


I did not expect that googling "pep pdf editor" would show my site in first place, because I only published the link to my site a few days ago.


And Google is smart, as you said.


A name that doesn't exist on google? So what exactly would that be?

It is very obvious if you google "pdf pep" python enhancement proposals or pepsi is not going to show up.

Naming is hard; I have a dream that people would stop complaining about it. There are names/acronyms for literally everything; the chance of you finding something unique is very, very small.


Everybody's Google results are different these days. They put you in a bubble. Hence, saying "if you Google you should see result X or Y" isn't necessarily true.

I would say for acronyms containing 2, 3, 4 letters these are all going to be taken at this point.

What matters is how much do the acronyms overlap. Pepsi (food & drink) has nothing to do with PDF editors (tech).

pEp (or p≡p) [1] on Android is a nice K-9 fork with material design and GPG support / opportunistic encryption. It's not very well known though.

Worst would've been if there's a PEP directly related to PDF.

[1] https://www.pep.security


First time seeing a .security website in my life.


Google > pdf pep

Nope. First page is mostly CDC, WHO, etc. Nothing about python or this project.


Yep, naming is hard. And I like the name Wine most.

(Wine Is Not an Emulator)

Thank you for understanding. :)


“GNU’s Not Unix”


In a previous job, I worked for a company that had the acronym AAPL. I kept getting the stock quotes for Apple every time I browsed the company's intranet.


I once had to google something about the Thread protocol. Impossible. I think I haven't seen a worse name for a tech project.


We need to find an actionable suggestion. Maybe projects can have a long name and a nick name? To make everyone happy, and of course, to be useful for everyone.


pePDF? Seems relatively unused compared to pep.


I usually solve this by giving the big G (or the big Duck) more info to work with in the query: https://duckduckgo.com/?q=pep+pdf+mac


I have a dream that dang will automate flagging and removing the inevitable inane comments about name uniqueness every time someone posts a project on HN.


Once you end up investigating an issue with something that namesquats another thing you'll understand.


Yes, having only been doing this twenty years, I'm unlikely to be versed in doing that.


It's like these companies can't hire marketing reps to establish a brand image that's new


Keep dreaming. :)


A GUI app for manually crafting PDFs is one thing, but a library instead would enable countless developers to create software capable of producing PDF deliverables as output, possibly improving the accessibility situation too.

I would gladly sponsor the development of a reliable library that allows to programmatically produce compliant, accessible tagged PDFs with arbitrary layout[0], correctly printable and viewable by mainstream software.

(Considering it is a difficult, yet to be solved problem, I’d have to know the qualities of the approach that distinguish said library from not-quite-capable attempts that already exist before I commit my personal funds.)

I would not mind if said library’s developers release an entirely paid end-user GUI software based on it, as long as the library itself remains under active maintenance.

[0] Supporting commonly used paged media typesetting features such as headers, footers, page numbers, running headers.


Are you familiar with Prawn [1]? Perhaps I just haven't bumped up against its limitations yet, but in my experience it's exactly what you're asking for. Of course, it's in Ruby, which isn't to everyone's taste, but it's the best tool out there that I've found.

[1] https://github.com/prawnpdf/prawn


Prawn was considered and IIRC unfortunately it has very limited styling support.

It looks like there are only commercial and quite expensive options for programmatic generation (Antenna House: XSL-FO, Prince: CSS). Apart from that, the one option that works appears to be Apache FOP. It’s built on Java and uses XSL-FO, which is somewhat limited and seemingly on the way to becoming outdated.


https://weasyprint.org/ is quite nice, though has limited header/footer support (for now).


Check out HexaPDF too, also in Ruby: https://github.com/gettalong/hexapdf


Is it possible to use Inkscape as a library? If not, you can always use whatever library to produce an SVG and convert it to PDF. Or are there features in PDF that are impossible to produce using SVG?


Well, generally it is possible to produce a PDF where each page is one big vector (or even one big raster image for that matter). It would visually look the same as the one created from separate objects and text, but from my understanding with this approach it would be tricky or impossible to create a properly tagged accessible PDF.

That aside, using Inkscape for parts of the doc looked like an interesting idea. I checked and unfortunately the API seems limited. There is a Python plugin system but I don’t think it is possible to create a solution that works entirely headless on CI boxes without having to invoke Inkscape GUI.


Most features of PDF can't be produced using SVG. The visual elements are just the surface of what PDFs can do and are used for.


Seems like a good feature for [0]. I don't want to ignore it, so I have noted it down in my Apple Notes.

Although it seems to be a tough question as well. :p


I guess you could use PSPDFKit (https://pspdfkit.com). It costs money, but you don't have to deal with all the weirdness of PDF.


I should take a deeper look, but at first glance most of its features seem to revolve around working with already existing documents—viewing, annotating, searching, signing, filling out forms.


I would expect other apps to use the core of PEP, Gene. It should be as easy to use as dragging and dropping the Gene folder into other projects in Xcode. Gene is a standalone lib, and is licensed under GPL2.


How would your approach differ from something like RMarkdown or similar? Still trying to nail down the differentiation.


At first approach it seems to be substantially different from what RMarkdown is intended for, though maybe I am not seeing some outside-the-box ways of using it.

What I would like is more oriented around creating documents according to particular layout specs. Documents (books, technical documentation) that can be created using InDesign, but this time programmatically[0].

As content should be authored using semantic markup (for accessible PDFs to be produced), and styling & layout capabilities should be flexible enough, I can see how HTML+CSS paradigm could be viable (and would allow reusing many useful parts from web stack, such as self-containing components).

If going with CSS, the latest spec is close but does not yet seem to be there as far as proper print media layout capabilities. A viable path could be extending an existing engine (Chromium?) specifically where it renders for print media, enabling rendering to accessible tagged PDF, and likely even introducing new styling capabilities making the spec a superset of CSS.

(This all under assumption that it is at all feasible to achieve proper accessible PDF output using browser’s print capabilities.)

[0] I know about variables & data import in InDesign, that is still dependent on GUI so not quite applicable.


I know nothing of InDesign so that's likely where my trouble understanding came in :). At work, we have a lot of data-defined reports that will generate daily based on real-world conditions. These reports then go out to business users internally but they're nicely done because they'll occasionally get shared with external audiences. The way the team structured it, which I thought was fairly slick, is they built a bunch of RMarkdown reports that output to LaTeX as an intermediary format. Then there's custom LaTeX stuff in the pipeline that makes layouts & styling transferable and sharable between the reports. From LaTeX you can go to a number of outputs like HTML and PDF and it's (I'll probably catch flak for a layperson's explanation) maybe a more print-oriented layout/typesetting system, versus HTML/CSS which came later. This felt maybe more relevant before than now :)


https://xmlgraphics.apache.org/fop/ or anything TeX related...


reliabrary


> its own PDF engine, built from scratch

Just be careful, many larger teams have taken on that PDF spec, and it has not ended well for them: https://nvd.nist.gov/vuln/search/results?form_type=Basic&res...

and it seems to be one of the main actors in what are termed "polyglot" files: https://truepolyglot.hackade.org/ (of which my favorite is: https://news.ycombinator.com/item?id=18344778 0x15 is a laser-projectable PDF that's also a ZIP containing, among other things, another PDF that is also a Git repo of its own source code. )


Adding a roadmap with potential features would be nice. Otherwise one is actually asking for financing a personal pet project. A list of existing alternatives (incl MacOS Preview) could be added and an explanation why this project will be different (or ideally: better).


Yes, sell me on the problem you are solving, sell me on the technology opportunity, sell me on your approach, help me feel a sense of mission and curiosity.


Isn't Preview limited to just annotations, or does it also do editing?


Aside from annotating, Preview can crop, reorder, rotate, delete, and insert pages in a PDF.

But at its core, it does not "non-destructively" edit the dictionaries and object graph of the original PDF. Instead, it replays the original PDF into a new PDF context.


Thank you for your recommendation. I think a roadmap is too early at this stage, but at some later point, I will add one.

And for the alternatives, I think this is a good idea.


Hi, developer of a PDF scanning app for macOS (PDFScanner) here. I always toyed with the idea of developing a PDF engine from scratch but the spec scares me. Kudos for being that brave! Just curious: Why did you choose Objective-C for a new lib?


Hello, good question. The answer is simple: I am more familiar with Objective-C and I am used to it.


Aren't you worried about Apple killing ObjC down the line?


Their whole UI is written in ObjC, they've got two decades of Mac apps in ObjC, and they wrote a whole new language to interface cleanly with ObjC but with nicer semantics, so I'm guessing ObjC isn't going anywhere.

But even if Apple does remove it, everything apart from the UI would still compile in GCC's ObjC compiler. The guts of this project is the PDF engine, not the UI, and that would still work. TeX is written in a language that pretty much only Knuth uses, and it's still very much working.


More than just their UI, fundamental parts of Apple's OSes are written in ObjC, including very new components like ARKit, CoreML, and the Metal API. Apple continues to write tons of new Objective-C code every day. Of course Swift is making inroads inside Apple, but Objective-C has to be well supported for a very long time to come.

I've been writing Objective-C for 15 years, and continue to do about 60-70% of my work in ObjC (the rest is mostly Swift). I'm not particularly worried about my code being unusable anytime soon. When Apple starts making a real effort to wholesale migrate away from ObjC, I will too.


What about Cocoa? You think it still makes sense to invest dev time in a Cocoa app?


That's a little more ambiguous. Until very recently I was primarily a Cocoa developer (took a full time iOS dev job a few months ago), and it still makes up a significant portion of my side work, so I'm biased. But right now, the alternatives are SwiftUI and UIKit/Catalyst. SwiftUI nicely integrates with AppKit/Cocoa, but anyway, it's in pretty rough shape on the Mac at the moment (see https://news.ycombinator.com/item?id=24472063 for example). Catalyst is (IMO) basically garbage if your goal is to make a great Mac app. Its purpose is to allow easy porting from iOS to Mac, not to create truly good Mac apps.

Obviously, the presence of these two technologies, and Apple's pushing them is evidence that Cocoa may be on its way out, but it's not deprecated, and is still officially the recommended way to build true Mac apps. Undoubtedly it's SwiftUI (not Catalyst) that will eventually displace it, but especially on the Mac, SwiftUI is not really ready for production yet.


I agree with all your points.

I considered getting into Cocoa dev but between the lack of resources and the probability that Apple will kill it in the not so distant future, it didn't make much sense.

Better wait a couple of years until the mud settles.


I think that's eminently reasonable. As someone with 15 years experience doing Cocoa dev, the intention to be a Mac dev indefinitely, and hundreds of thousands of lines of existing Cocoa and Cocoa Touch code, it's just going to be a slow transition. But there's no pressing reason to just stop writing any new Cocoa code (I literally wrote some since my last comment :-P). If I were starting now, I'd probably focus on SwiftUI with the assumption that it will be the way to start most new projects within the next few years.


Aside: is there any way to demo PDFScanner? I literally got an Epson FF680W today and was shocked at how poor the default software is for reading documents. I tried VueScan (which everyone recommends online) but found the interface awful.


Aaah, trial versions on the App Store :( I'm still not sure what would be the best way to implement a demo version for an App Store only app - I loathe the "Free with IAP" model. If you want to test if your scanner works with PDFScanner, try to use it with the Image Capture application that comes with macOS. It uses the same library as PDFScanner to talk to the scanner, so if your scanner works there, it will also work in PDFScanner.


I am a happy user of PDFScanner. How are you doing the OCR?


I use the hocr feature of tesseract and parse the resulting xml that contains all the text with bounding boxes.
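To give an idea of what that parsing step can look like, here is a minimal Python sketch. It is not PDFScanner's code; `SAMPLE_HOCR` is a tiny hand-written fragment in the general shape of hOCR rather than real Tesseract output, and `words_with_boxes` is a hypothetical helper name. Real output carries more structure (lines, paragraphs, a DOCTYPE) and may need extra handling.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written hOCR-like sample (an assumption, not real Tesseract output).
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="ocr_page" title="bbox 0 0 200 100">
<span class="ocrx_word" title="bbox 10 20 60 40; x_wconf 96">Hello</span>
<span class="ocrx_word" title="bbox 70 20 130 40; x_wconf 91">world</span>
</div>
</body>
</html>"""

# hOCR stores the bounding box in the title attribute as "bbox x0 y0 x1 y1".
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def words_with_boxes(hocr_xml):
    """Return (text, (x0, y0, x1, y1)) for every word-level element."""
    root = ET.fromstring(hocr_xml)
    words = []
    for element in root.iter():
        if element.get("class") != "ocrx_word":
            continue
        match = BBOX.search(element.get("title", ""))
        if match is None:
            continue
        box = tuple(int(value) for value in match.groups())
        text = "".join(element.itertext()).strip()
        words.append((text, box))
    return words


print(words_with_boxes(SAMPLE_HOCR))
```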


For a second I thought this was a Python Enhancement Proposal.


Python is "batteries included", but this sounds like a pretty hulking-big battery!


Reminds me of a resolution of GDPR I had when I first heard of it.

'German Democratic People's Republic'


It would be the Federal Republic of Germany. Or the Democratic German Republic if you think of the east block.


The UI for all of this guys apps is very strange...is this some sort of web-wrapper for OS X or custom views or what?


Hi, I am the guy, and almost all views are custom views (native Cocoa/ObjC), except that some of them (like in the About window) are WebViews with HTML/CSS/JavaScript.


The one problem I consistently have with PDF forms on the Mac is an inability to change the font size in text fields. My work even pays for me to have the professional Adobe suite and that specific functionality doesn't exist. If someone could solve this problem I would probably give them some money.


Just noted down your requirement in my Apple Notes. I think I will take a look at it later while working on rendering and editing PDF forms.

These kinds of comments are what I am looking for, thank you.


What do you mean? In Acrobat Form (acroform) fields? In Acrobat Pro DC (in Windows, but assume Mac is the same), go into Prepare Form mode, select the field (or control select for multiple text fields), right click Properties, go to Appearance tab (Text section), and then either select Font Size: Auto (for autofit) or the point size you want.


You can't do that in InDesign?


Very cool!

I currently use adobe acrobat to make pdfs accessible (http://www.w3.org/TR/WCAG20-TECHS/pdf).

Are you planning on building in accessibility tools?


The PDF file format specification has 971 pages. https://www.iso.org/standard/63534.html


Just create a simple PDF example by using TextEdit's export function, and open the PDF with a plain text editor. It should be easy to figure out what's going on in the PDF.
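To see the same structure without any tool in between, a one-page PDF can also be written by hand. The following Python sketch is an illustration only (it is not how PEP or TextEdit produce files). `build_minimal_pdf` is a hypothetical helper, and it assumes plain ASCII text without parentheses or backslashes.

```python
def build_minimal_pdf(text: str = "Hello PDF") -> bytes:
    """Assemble a one-page PDF: header, five objects, xref table, trailer."""
    # Content stream: select font F1 at 24pt, move to (72, 720), show the text.
    stream = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus the free entry 0.
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


print(build_minimal_pdf().decode("ascii"))
```

Files exported by real applications usually compress their content streams, so they are less readable than this, but the object, xref, and trailer layout is the same idea.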


AFAIK, if you read through the first few chapters of the PDF spec, then you can easily understand later chapters. PDF in itself is a simple format.


The apps by the same author are very, very interesting: http://www.pixelegg.me


If someone knows something like this for Windows, I am currently offering a bounty:

https://softwarerecs.stackexchange.com/questions/19011

I am willing to increase the bounty, but I only have 297 rep currently



