doc format is not documented, obviously.

brandonbloom · on Oct 26, 2014

The doc (and docx) formats are actually very well documented, thanks to pressure from the EU:

[MS-DOC]: Word (.doc) Binary File Format

http://msdn.microsoft.com/en-us/library/office/cc313153(v=of...

[MS-DOCX]: Word Extensions to the Office Open XML (.docx) File Format

http://msdn.microsoft.com/en-us/library/dd773189(v=office.12...

cldellow · on Oct 26, 2014

"Very well" is a euphemism here, I assume?

I worked in Windows Server when Microsoft was under the US DOJ consent decree and had to document everything that looked at all like an API--even internal things that were just APIfied for design reasons / ease of testability / to make servicing simpler.

I can say with some confidence that no one gave a shit about producing good quality docs. Without exception, people viewed the government requirement as onerous and excessive and we produced docs that were perhaps technically correct, but did not provide insight into why things were the way they were. No effort at ease of readability was made, either.

icefox · on Oct 26, 2014

It really is a shame that you guys didn't use this new requirement to improve your product and internal process. Your comment comes off as describing a group that was just obeying the letter of the law, but not the spirit of the law, and I can only guess that this would easily spill over into all documentation, even the cases where it matters. Having a large group of developers believe that it isn't worth the time to make good APIs, and produce worse than horrible docs, is really sad. Taking the time to create a good API, even for internal use, can uncover design flaws, reduce errors, make it faster to make changes, easier to test, and faster to bring in new developers. Here, with a government mandate, you could have used it as an excuse to grow as a group and become better at creating software.

cldellow · on Oct 26, 2014

I can see how this comes off as an insular group sticking it to the government, but that's not the case.

If I gave the impression that we didn't create good APIs or good docs, I apologize.

We did, but that's not what the government wanted, so we gave them what they would accept. The government just was not very good at deciding what has to be documented and what doesn't. E.g., we had to document sample wire traces of messages that are all auto-generated through IDLs and sent over a standard protocol. Rather than 2 pages of IDL and a comment saying we use transport X (which is defined in RFC blah), we were actually required to submit 100 pages of traces. That obscures; it does not help.

Even if you wanted to do a great job of producing docs, we quickly learned that the process wasn't about creating great docs; it was about producing docs that the government would accept. Have you seen Office Space? It's that. It's thankless, because you're generating shit docs that aren't relevant that are judged by people who don't have the skills to judge them.

brandonbloom · on Oct 26, 2014

Even a half-assed effort to produce a document no one cares about is "very well" compared to the majority of mission critical and/or open source systems out there for which the only documentation is a README and, if you're lucky, some mailing list archives.

jghn · on Oct 26, 2014

My understanding was that it'd be impossible to make a 100% compatible docx parser even if armed with those docs. As an example, when the EU forced the issue I remember seeing stories about XML fields which simply contained undocumented blobs.

al2o3cr · on Oct 26, 2014

Yep, having options / tags whose definition is LITERALLY "do whatever [some ancient version of Word] does" is totes well-documented.

Implementable, on the other hand, not so much...

tzs · on Oct 26, 2014

If you don't already know how to implement them you aren't supposed to implement them. The spec even tells you not to implement them (and Microsoft does not implement them). They are there for third parties who reverse engineered ancient Word and WordPerfect formats and built tool chains around them, and want to move to a newer format but need to mark places where they depend on quirks of those ancient programs.

Here's the use case this is aimed at. Suppose I run, say, a law office, and we've got an internal document management system that does things like index and cross reference documents, manage citation lists, and stuff like that. The workflow is based on WordPerfect format (WordPerfect was for a long time the de facto standard for lawyers).

Now suppose I want to start moving to a newer format for storage. Say I pick ODF, and start using that for new documents, and make my tools understand it. I'd like to convert my existing WordPerfect documents to ODF. However, there are things in WordPerfect that cannot be reproduced exactly in ODF, and this is a problem. If my tools need to figure out what page something is on, in order to generate a proper citation to that thing, and I've lost some formatting information converting to ODF, I may not get the right cite.

So what am I going to do? I'm going to add some extra, proprietary markup of my own to ODF that lets me include my reverse engineered WordPerfect knowledge when I convert my old documents to ODF, and my new tools will be modified to understand this. Now my ODF workflow can generate correct cites for old documents. Note that LibreOffice won't understand my additional markup, and will presumably lose it if I edit a document, but that's OK. The old documents I converted should be read-only.
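
To make that concrete, here is a rough sketch of what such private markup could look like. The `text` namespace is the real ODF one, but the `urn:example:legacy-wordperfect` namespace, the `original-page` attribute, and both helper functions are made up for illustration; this is not anything ODF or LibreOffice defines.

```python
import xml.etree.ElementTree as ET

# Real ODF text namespace.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
# Hypothetical private namespace, invented for this sketch.
LEGACY_NS = "urn:example:legacy-wordperfect"

ET.register_namespace("text", TEXT_NS)
ET.register_namespace("wp", LEGACY_NS)


def tag_paragraph(paragraph, page):
    """Record the page this paragraph fell on in the old WordPerfect layout."""
    paragraph.set(f"{{{LEGACY_NS}}}original-page", str(page))


def original_page(paragraph):
    """Return the recorded legacy page number, or None if no hint is present."""
    value = paragraph.get(f"{{{LEGACY_NS}}}original-page")
    return int(value) if value is not None else None


# The converter tags a paragraph while converting an old document.
converted = ET.Element(f"{{{TEXT_NS}}}p")
converted.text = "The parties agree as follows."
tag_paragraph(converted, 12)

# The citation tool later reads the hint back after a serialization round trip.
reloaded = ET.fromstring(ET.tostring(converted))

# A paragraph from a new, natively authored document carries no hint.
plain = ET.Element(f"{{{TEXT_NS}}}p")
```

A citation tool would call `original_page()` and fall back to its own layout engine when it returns `None`. An editor that doesn't know the private namespace is free to drop the attribute, which is why the converted documents should stay read-only.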

Of course, I'm not the only person doing this. Suppose you also run a law office, with a WordPerfect workflow, and are converting to an ODF workflow. You are likely going to add some proprietary markup, just like I did. We'll both end up embedding the same WordPerfect information in our converted legacy documents, but we'll probably pick different markup for it. It would be nice if we could get together, make a list of things we've reverse engineered, and agree to use the same markup when embedding that stuff in ODF.

And that's essentially what they did in OOXML. They realized there would be people like us with our law offices, who have reverse engineered legacy data, that will be extending the markup. So they made a list of a bunch of things from assorted past proprietary programs that were likely to have been reverse engineered by various third parties, and reserved some markup for each.
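
For a feel of what that reserved markup looks like to a tool, here is a small sketch that lists the compatibility flags in a WordprocessingML settings fragment. The fragment is made up, the two flag names (`autoSpaceLikeWord95`, `truncateFontHeightsLikeWP6`) are from my recollection of the transitional schema and should be checked against the spec, and `legacy_flags` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Made-up settings fragment; flag names are recalled from the transitional
# schema and are an assumption, not copied from a real document.
SETTINGS_XML = f"""
<w:settings xmlns:w="{W_NS}">
  <w:compat>
    <w:autoSpaceLikeWord95/>
    <w:truncateFontHeightsLikeWP6/>
  </w:compat>
</w:settings>
""".strip()


def legacy_flags(settings_xml):
    """Return the local names of the children of w:compat, in document order."""
    root = ET.fromstring(settings_xml)
    compat = root.find(f"{{{W_NS}}}compat")
    if compat is None:
        return []
    # Tags look like "{namespace}localName"; keep only the local name.
    return [child.tag.split("}", 1)[1] for child in compat]


flags = legacy_flags(SETTINGS_XML)
```

A tool that has the reverse engineered knowledge can act on the flags it recognizes; one that doesn't can just ignore them, which is what the spec tells you to do anyway.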