PSD was never intended to be a data interchange format: it is the serialization format of a single program that has more individual unrelated features that actual people rely on than almost any other piece of software and has maintained striking amounts of backwards compatibility and almost unbroken forwards compatibility during its over two decades of existence. This product's "file format" needs to be critiqued in this context, along with similar mega-programs like Office.
I am thereby having a difficult time fathoming why anyone would think that a PSD file is thereby going to be some well-organized file format that they should easily be able to parse from their own application is just naively wishful thinking: even other products from Adobe have limitations while opening these files; to truly manipulate these files you really need to be highly-compatible with Photoshop's particular editing model (hence the conceptual difference between these two classes of file format).
1) The specs are now much more publicly accessible than they used to be, and frankly the spec does a fairly reasonable job describing a tricky format relatively compactly. It requires a fair bit of knowledge of Photoshop to read and understand, but it’s mostly fairly explicit. Much better than many other proprietary document formats.
2) For someone with relatively extensive knowledge of Photoshop, the format is fairly comprehensible, albeit complicated. The biggest part of the problem here is, as you say, that Photoshop just has a ton of features to support, so becoming enough of a Photoshop expert to understand it all is a difficult undertaking by itself.
3) The code this comment is taken from only interacts with a small fraction of PSD features, and is frankly pretty awful code: hacky, ad-hoc, not modular at all, etc.
All that said, if someone were to redesign the PSD format today, I’m sure it would be organized quite a bit differently, and would have much better re-use of a smaller number of features. (The same goes for Photoshop itself.)
So just out of curiosity, why doesn't Adobe embark on those two projects? We're entering a new era of desktop software, surely that should be a catalyst for redesigning both from scratch?
Photoshop is an example of software that must also remain backward compatible with the human systems that use it. There are lots of professions that are essentially "Photoshop Expert". If features are removed or altered, or even if keyboard shortcuts are changed, it has a very real effect on these people's livelihoods. There are few novice Photoshop users compared to other software.
No, it wouldn't cut backwards compatibility at all. They have made plenty of additions to the format over the years, and they give you the option of saving to be compatible with previous versions. They could just keep doing that.
There is no technical reason they would need to use the old file format at all. "Keep doing that" referred to giving the user an option of which Photoshop version he would like the saved file to be compatible with. If he picks the latest version, the new file format would be used.
This is true, but there are container formats just as old, like .mov, that are quite nice to work with. (While you're still sniggering, keep in mind that .mov has a lot in common with MPEG-4.)
Whenever I need to write a binary serialization format, I usually copy .mov's tree-of-structs format: it's ridiculously fast, extensible, and keeps people away from C++'s terrible stream operators and Java's BinaryReaderWhateverFactoryErrorProneOneIntAtATimeReader.
Nearly everything inherits from a basic struct that is 8 bytes per atom: { length of self + children, quasi-human readable 4 char code describing contents }
Practically speaking, in C/C++, you can stride by length and switch() on the ftype, using it to cast the read-in data to whatever class/struct you desire.
All of this while being so brutally dumb that you can rewrite it over and over again in about 10 lines of code in most languages.
Simple version (in pseudo-C): struct Atom { uint32 length; uchar type[4]; uchar data[length - 8]; };
The file is a single atom that has other atoms (and random parameters and such) in its data field. You end up with a big tree of atoms which can be parsed as needed. Super simple format -- like the parent, I use atom trees all the time for serialization.
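Since this comes up a lot, here's a minimal sketch in C of walking such an atom tree, assuming the simplified layout described above (a big-endian uint32 length that includes the 8-byte header, followed by a 4-char type code). Real QuickTime files have extra cases, such as 64-bit extended sizes and a zero size meaning "to end of file", that this deliberately ignores; treat it as a sketch, not a reference parser.

    /* Minimal atom-tree walker for the simplified layout described above:
       a big-endian uint32 length (which includes this 8-byte header)
       followed by a 4-char type code. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t be32(const unsigned char *p) {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }

    /* buf/len is the data field of a parent atom (or the whole file). */
    static void walk_atoms(const unsigned char *buf, size_t len, int depth) {
        size_t off = 0;
        while (off + 8 <= len) {
            uint32_t alen = be32(buf + off);
            if (alen < 8 || off + alen > len)       /* corrupt or truncated */
                break;
            printf("%*s%.4s (%u bytes)\n", depth * 2, "",
                   (const char *)(buf + off + 4), (unsigned)alen);
            /* For atom types known to be containers, recurse into the payload:
               walk_atoms(buf + off + 8, alen - 8, depth + 1); */
            off += alen;
        }
    }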
Sounds almost like the IFF format, which was used for just about everything on the Amiga, and then later (with minor changes) as the basis for Microsoft's RIFF, underlying wave files, .AVI and a lot of other formats.
The padding is to two bytes; the tag uses ASCII exclusively and no space (33-127), although every format I remember uses upper case + digits. The length does not include the tag and the length field, nor the padding. Microsoft, in a typical "we don't care" move, adopted the spec except that they specified little-endian, whereas IFF is originally big-endian.
The entire file must be one complete chunk, and is thus limited to 2GB (signed integer length).
The big difference between the QT Atom structure and RIFF is that RIFF is a series of independent chunks (IIRC), whereas Atoms are a big tree. Structurally nearly identical, though.
Don't know about RIFF, but IFF files are/can be a big tree - the outer chunk must be one of FORM, LIST or CAT, and many chunk types contain additional chunks, so depending on the file you might get structures of arbitrary depth.
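For comparison, here's a hedged sketch of the IFF variant of the same loop, reusing be32 and the includes from the atom sketch above and following the rules described a few comments up: the length covers only the data (not the 8-byte header), odd-length chunks get a pad byte, and a FORM chunk's data starts with a 4-byte form type followed by sub-chunks. RIFF would look the same except with little-endian lengths.

    /* Sketch of an IFF chunk walker. Big-endian length covers data only;
       odd-length chunks are followed by a pad byte; a FORM chunk's data
       begins with a 4-byte form type (e.g. "ILBM") and then sub-chunks. */
    static void walk_iff(const unsigned char *buf, size_t len, int depth) {
        size_t off = 0;
        while (off + 8 <= len) {
            uint32_t dlen = be32(buf + off + 4);     /* data length only */
            if (off + 8 + dlen > len)                /* truncated chunk  */
                break;
            printf("%*s%.4s (%u data bytes)\n", depth * 2, "",
                   (const char *)(buf + off), (unsigned)dlen);
            if (memcmp(buf + off, "FORM", 4) == 0 && dlen >= 4)
                walk_iff(buf + off + 12, dlen - 4, depth + 1);
            off += 8 + dlen + (dlen & 1);            /* skip the pad byte */
        }
    }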
If a wise programmer decided he needed a serialization format, would he deliberately include in that format all the crap so vividly pointed to by the article?
No.
He will think of the "serialization format" as an interchange format between two different instances of his program: one process writes the data file and another process reads it later. He also knows that sooner or later the "serialization format" needs to talk with different versions of his program, not just different running instances.
AFAIK, the Word .doc also started (and unfortunately continued) as basically a not-so-designed memory dump of the in-memory OLE data model. It's a format that more often than not has infamously stumped its own implementation as well. (Over time, OpenOffice has saved quite a lot of .doc files of Office users.)
The overriding aim of most formats was to load into memory efficiently - fast load times were the key winner in the 80s and 90s. So you did not want a simple serialisation, because that meant slow, CPU-intensive saves and loads. But if you slammed it in pretty much as it would be in memory, you would win. The downside is that if you changed the in-memory representation of the running program, you had to change the file format.
And .mov would have no such concerns - its prime use case is to store data in serialised chunks anyway - it was already serialised, so it could use very dumb stores.
You are making a general argument for why serialization formats should not exist. Fine, but in reality, and for any number of reasons, they do: they are easier at first, they are actually often somewhat easier over time, the pain cost that occurs is often easily amortized over time, they are fast to load (no transformations), they are fast to edit (you can often treat them as some insane memory page container and do internal allocation for updates, leaving old content behind until it is recycled), and their concept makes them capable of handling the random seemingly-unrelated garbage that these mega-programs end up being popular for.
They aren't even always considered the non-ideal: I have seen many an argument from people who use Smalltalk that the ideal transfer format is to literally serialize part of the running program state and call it a "document", including whatever code might be required to operate the more epic parts of the document. (If you think about it, this is actually fairly similar to the various file formats that involve OLE, as you end up having the identifier of some code the user hopefully has installed attached to a block of data that that code hopefully can reinstantiate itself using.)
So, given that it is a tradeoff, and given that it was often a necessary one for file formats where you want or need to be able to edit files that both contain numerous nearly-unrelated features (OLE would be the most beautiful example of this in the Word container format) and whose entire contents may be larger than the RAM available to the entire computer, it simply seems silly to complain about this: man up, import the data, make your own format for saving your files, and stop complaining that someone in 1990 made something that over 22 years has become slightly difficult to understand without that historical context.
> AFAIK, the Word .doc also started (and unfortunately continued) as basically a not-so-designed memory dump
This may be true, but it's not the whole story. It's the reason why the MS Office team bit the bullet and replaced .doc with .docx about 5 years ago: http://en.wikipedia.org/wiki/Office_Open_XML
Docx is basically XML in a zip file. It's a beast and has lots of compromises for backward compatibility, but as a design starting point, "zipped XML" is far, far better than a binary dump of the in-memory data.
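Concretely, if you rename a .docx to .zip and unzip it, you get something roughly like this (the exact entries vary by document, so treat this as an illustration rather than a spec):

    [Content_Types].xml
    _rels/.rels
    docProps/core.xml
    docProps/app.xml
    word/document.xml             <- the main body, as XML
    word/styles.xml
    word/_rels/document.xml.rels
    word/media/image1.png         <- embedded images stay binary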
I never said it has anything to do with XML. OOXML is extremely complex for little reason. Even though it is also quite complex, ODT is much, much simpler.
There are actually reasons for some of OOXML's weirdness, just not good ones. For instance, it appears the reason why OOXML is pretty much the only XML-based document format which doesn't use a mixed content model is because there's a huge amount of prior art that'd have made it impossible to patent if they had. (Microsoft tried anyway though.)
It could be that the format was very reasonable at first, but the surrounding platform changed completely during its development. Then new layers of specification were added in whatever form seemed like the best possible solution on that platform at that time. Wasn't Photoshop at the beginning an app for the original m68k Macintosh? Surely different kinds of field sizes made more sense in that world than in ours - and tradeoffs made for the sake of performance could have had some say as well.
When a file uses both little-endian and big-endian serialization, at times within the same logical structure, employs several different ways to store an array, and does other things of a similar nature, then it is a genuine clusterfuck regardless of whether it is reflective of "Photoshop's particular editing model."
> I am thereby having a difficult time fathoming why anyone would think that a PSD file is thereby going to be some well-organized file format that they should easily be able to parse from their own application is just naively wishful thinking
I like how you embodied your point in the unsyntax of that very sentence. ;)
Two hours later I read what I wrote and thought "man, this comment's upvotes-to-correct-grammar ratio is remarkably high" (this was after the one-hour mark, when I noticed another serious typo in the first few words that I still had time to edit). If it makes any difference to you: I write most of these comments on my iPhone, so I often can't even see the whole horizontal line at once. ;P
> PSD was never intended to be a data interchange format
And that's the basic design flaw - it is a data interchange format despite not being designed as one, hence the terrible job it apparently does at it. The people who wrote it didn't recognise that they were going to be filling that need. There's a lesson in there somewhere.
Greetings. I have arrived from the future to spare mankind more years of pain by stating clearly, here, that the lesson is not "serialize your data to XML".
No? It seems to be working out well enough so far for e.g. ODT, or indeed the web. A file format that might need to be read in 20 years seems like one of the few cases where the super-verbose XML style is actually appropriate.
I'm all for super-verbosity in useful information and flexible structure, but XML is also super-verbose in terms of redundant markup. Compression helps a bit, but it still costs processing power for no real reason. See the good ol' S-expressions vs. XML discussion, recently reincarnated as JSON vs. XML.
Hm, interesting. Could you explain why the decision to use (zipped) XML in the current MS Office document formats was a mistake? Those document formats are huge hideous beasts, but that's due to backward compatibility with large feature sets, and I don't think that moving off XML would help.
The decision was a big step forward; however, moving off XML to an equivalent but less verbose (in number of meaningless bytes) format wouldn't hurt. All else being equal, it would mean less electricity wasted on reading things and sending them over the wire, and also a more human-readable format. See also: SEXP vs. XML.
The bigger the markup-to-data ratio, the more this argument applies.
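To make the markup-to-data ratio concrete, here's the same trivial record in the three notations (purely illustrative):

    XML:    <point><x>10</x><y>20</y></point>
    JSON:   {"x": 10, "y": 20}
    S-exp:  (point (x 10) (y 20))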
The wide availability, across numerous platforms, of robust libraries and tools for manipulating XML, along with its standardization and the ubiquity of its implementation, might have played a role in Microsoft's decision.
Perhaps some people care more about what would be convenient to them than whatever laziness or lock-in Adobe intended. Or maybe they believe that Photoshop might have better interchange with future versions of itself if its file format was sane.
This is a very important point to make: data interchange formats are typically much slower to manipulate and require much more RAM to do so, as they require mapping back and forth to the internal data structures actually used at runtime. If you take a look at an older (pre-"x" formats) Office document, you will find that a lot of its complexity (in addition to the aforementioned backwards compatibility and "numerous unrelated features" issues) relates to figuring out how to edit and resave enormous documents quickly.
Exactly. I clearly remember working on a 20MB PSD file in the mid 1990s (a restoration job done on a borrowed Mac of a badly damaged photo of my grandfather; the picture still sits on my mother's dressing table.) That took an absolute age to load and I know there wasn't much more RAM than that in the machine. It worked well enough though.
Old formats like PSD are better viewed as archaeological artefacts than as exemplars of some elegant ideal.
Um, they could have had a process for how to extend the format. Even GIF has a definition for how to add new pieces. PSD's process, apparently: play it by the seat of your pants?
I'm not sure I get your point about the similarity to (early) Office data formats. Those are very badly supported now. That's a stark contrast to your assertion that PSD is backwards and forwards compatible to a very large degree.
Has been on HN before (http://news.ycombinator.com/item?id=575122) but it was years ago. I mention this just in case others are having the same feeling of deja vu as me.
Things were padded to 4-byte boundaries because the 68000 processor would crash if you read an unaligned 32-bit value. So the length of the actual data was what you find in the size field of each chunk, but each chunk is padded. That way you didn't have to work around the 68000's quirks and read a byte at a time.
I wrote a PSD reader in '93. It wasn't that hard and it still works today. Maybe I chose an easy subset: it only reads the original result (merged layers) that gets saved when you choose to save backwards-compatible files in Photoshop.
I wrote one a few years ago as well. It read layers, the summary image, and some layer metadata that I needed (blend mode, layer name, visibility flag, etc.). There's documentation for the format on the Adobe site, I think (wherever it came from at the time - autumn 2007 - no fax was required), so it was actually fairly straightforward. An artist made me a bunch of PSD files with the stuff in them that they wanted to use, and I sat there comparing the results of my code to what Photoshop did.
The only oddity I can recall is that Photoshop does something odd with the alpha channel - I think it was the alpha channel? - by sometimes storing it with the summary image rather than with the layer to which it relates. (Don't ask me for more details than that - I don't remember.) I thought at the time that this looked like somebody's attempt to make newer data work tolerably with older revisions. That part WAS annoying, because the documentation didn't mention it, and it took about a week before somebody managed to create a Photoshop file that was arranged this way.
The file format overall bore many of the hallmarks of one that had grown rather than being planned, but it looks like they'd started to clamp down on things at some point because the newer data chunks looked a lot better-designed than the old ones. These things happen. It could be worse. BMP is worse. TGA is worse. They aren't even chunk-based.
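For anyone curious what "fairly straightforward" looks like in practice, here's a hedged sketch of reading just the fixed-size PSD file header, based on my recollection of Adobe's published spec (all fields big-endian); check the field layout against the spec before relying on it:

    /* Sketch: read the 26-byte PSD file header (all fields big-endian),
       per Adobe's published spec as I remember it -- verify before use. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct PsdHeader {
        uint16_t channels;   /* number of channels             */
        uint32_t height;     /* rows                           */
        uint32_t width;      /* columns                        */
        uint16_t depth;      /* bits per channel: 1, 8, 16, 32 */
        uint16_t color_mode; /* 3 = RGB, 4 = CMYK, ...         */
    };

    static int read_psd_header(FILE *f, struct PsdHeader *h) {
        unsigned char b[26];
        if (fread(b, 1, sizeof b, f) != sizeof b) return -1;
        if (memcmp(b, "8BPS", 4) != 0) return -1;          /* signature */
        if (((b[4] << 8) | b[5]) != 1) return -1;          /* version 1 */
        /* bytes 6..11 are reserved and should be zero */
        h->channels   = (uint16_t)((b[12] << 8) | b[13]);
        h->height     = ((uint32_t)b[14] << 24) | ((uint32_t)b[15] << 16) |
                        ((uint32_t)b[16] << 8)  |  (uint32_t)b[17];
        h->width      = ((uint32_t)b[18] << 24) | ((uint32_t)b[19] << 16) |
                        ((uint32_t)b[20] << 8)  |  (uint32_t)b[21];
        h->depth      = (uint16_t)((b[22] << 8) | b[23]);
        h->color_mode = (uint16_t)((b[24] << 8) | b[25]);
        return 0;
    }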
> Things were padded to 4-byte boundaries because the 68000 processor would crash if you read an unaligned 32-bit value.
It is actually padded to 2-byte boundaries. The 68000 had an external 16 bit data interface. That's the only thing I would fix about IFF if I redid it today. (And I would add a 64-bit length extension, and a "reserved chunkid" designation, e.g. anything that starts with a '$' must be registered in some central registry)
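Incidentally, the alignment issue is also why parsers that can't guarantee buffer alignment either memcpy into a properly aligned temporary or assemble the value byte by byte, rather than casting the buffer pointer and dereferencing it. A small sketch of both approaches:

    /* Two alignment-safe ways to read a 32-bit field from a byte buffer,
       instead of the cast-and-dereference that would have raised an
       address error on the 68000 (and still faults on some CPUs today). */
    #include <stdint.h>
    #include <string.h>

    /* Host-endian value, via memcpy into an aligned temporary. */
    static uint32_t load_u32(const unsigned char *p) {
        uint32_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }

    /* Explicit big-endian decode, independent of host byte order
       (what an IFF/QuickTime-style parser actually wants). */
    static uint32_t load_u32_be(const unsigned char *p) {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }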
One of my favorite blog posts from Joel Spolsky talks about this, basically explaining how these formats come to be. For mega-software like this, the source code is the de facto file spec: www.joelonsoftware.com/items/2008/02/19.html
When a friend complained that he had a hard time figuring out which maps were present in a given WAD, I enjoyed myself writing a utility to organize them into directories by map number. I kept thinking: this is how you serialize data. Looking back on the code now, it's still easy to understand.
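For reference, and because it really is a nice example of dead-simple serialization: as I remember it, a DOOM-style WAD is just a 12-byte header ("IWAD"/"PWAD", lump count, directory offset) plus a flat directory of 16-byte entries (offset, size, 8-char name), all little-endian. A hedged sketch of listing the lumps, from memory rather than the spec:

    /* Sketch: list the lumps in a DOOM-style WAD. Layout from memory --
       12-byte header, then a flat directory of 16-byte entries. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t le32(const unsigned char *p) {
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }

    static int list_wad(const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        unsigned char hdr[12];
        if (fread(hdr, 1, sizeof hdr, f) != sizeof hdr ||
            (memcmp(hdr, "IWAD", 4) != 0 && memcmp(hdr, "PWAD", 4) != 0)) {
            fclose(f);
            return -1;                       /* not a WAD */
        }
        uint32_t nlumps = le32(hdr + 4);
        uint32_t diroff = le32(hdr + 8);
        fseek(f, (long)diroff, SEEK_SET);
        for (uint32_t i = 0; i < nlumps; i++) {
            unsigned char e[16];
            if (fread(e, 1, sizeof e, f) != sizeof e) break;
            printf("%.8s  offset=%u  size=%u\n",
                   (const char *)(e + 8), (unsigned)le32(e), (unsigned)le32(e + 4));
        }
        fclose(f);
        return 0;
    }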
Hmm. Here's something else I found at those guidelines you linked, there:
> Please don't submit comments complaining that a submission is inappropriate for the site. If you think something is spam or offtopic, flag it by going to its page and clicking on the "flag" link. (Not all users will see this; there is a karma threshold.) If you flag something, please don't also comment that you did.
"Don't abuse the text field in the submission form to add commentary to links. The text field is for starting discussions. If you're submitting a link, put it in the url field. If you want to add initial commentary on the link, write a blog post about it and submit that instead."
Well, from an archival point of view, I expect jwz's blog post will last for some time (I believe he understands the value of durable URLs). On the other hand, either Google Code or the Xee project has managed to break the direct link that was submitted to HN last time (less than 4 years ago).
So that Google code link is the original source today, but it might not be tomorrow.