Show HN: Compile-time HTML parsing in C++14 (github.com/rep-movsd)
175 points by rep_movsd on July 31, 2017 | 81 comments



After going through several love-hate cycles with C++ over my career, I must say I have a kind of admiration for what the C++ authors are trying to achieve.

It took me a while to understand why every iteration of C++ brings so many (often Turing-complete or close) side-effect horrors. The reason is that C++ is more than a language: it is part of a genuinely philosophical quest, which is what language design is about.

It is about trying to bridge the language of the machines and the language of human abstractions as closely as possible. In itself it does not necessarily lead to the best language possible, but it explores an interesting limit.

One can use higher-level languages like Haskell or Prolog, or slightly lower-level ones like Ruby or Python, and code nice abstractions. By doing so, however, the programmer typically loses track of what the machine implementation will be.

C++ strives to be a language where you can still feel what the machine will actually implement while you code high-level abstractions.

That is their goal, that is their quest. In doing so, many side effects pop up, but their effort is commendable.

That so many people still use it is impressive but, I think, irrelevant to the priorities they typically set.


> The reason is that C++ is more than a language: it is part of a genuinely philosophical quest, which is what language design is about.

> It is about trying to bridge the language of the machines and the language of human abstractions as closely as possible. In itself it does not necessarily lead to the best language possible, but it explores an interesting limit.

> C++ strives to be a language where you can still feel what the machine will actually implement while you code high-level abstractions.

This has been false ever since the earliest days of ANSI C. The C and C++ standards define an abstract machine that is quite far from any machine that exists today. Type-based aliasing rules, to name one important example, are something that has almost nothing to do with anything that exists in the hardware.
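
A minimal sketch of the aliasing point (illustrative example, not from the comment): the abstract machine says an int object may not be accessed through a float*, so the compiler may assume the two pointers never alias, regardless of what the hardware would allow.

    // Illustrative sketch: under the C/C++ aliasing rules, *i and *f may not
    // refer to the same object, so the compiler can assume they never alias.
    int aliasing_example(int* i, float* f) {
        *i = 1;
        *f = 2.0f;   // cannot legally alias *i
        return *i;   // may be folded to "return 1" without reloading from memory
    }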

It's quite enlightening to read the description of LLVM IR [1] and observe how far it is from anything a machine does. In fact, LLVM IR is quite a bit lower level than C is, as memory is untyped in LLVM IR without metadata: this is not at all the case in C.

In reality, C++ is an attempt to build a high-level language on top of the particular abstract virtual machine specification that happened to be the accidental byproduct of a consensus process among hardware/compiler vendors in 1989. It turns out that this has been a very helpful endeavor for a lot of people, but I don't think we should claim that it's anything more than that. There's nothing "philosophically" interesting about the C89 virtual machine.

[1]: https://llvm.org/docs/LangRef.html


I don't see how type-based aliasing negates anything I am saying. Types are an abstraction, but C++ allows you to extract a pointer to an object, take sizeof() of it, and do pointer casting and arithmetic.

You can go low level with most high-level languages; what sets C++ apart is that using low-level instructions generally has zero overhead. ptr2 = static_cast<uint8_t*>(ptr)+5; pretty much maps to the assembly code you would expect.


> Types are an abstraction, but C++ allows you to extract a pointer to an object, take sizeof() of it, and do pointer casting and arithmetic.

And the semantics of those operations are dictated not by what the underlying machine does but what the C++ abstract machine does, which is very different. The underlying memory is still typed.

> You can go low level with most high-level languages; what sets C++ apart is that using low-level instructions generally has zero overhead.

That's true for the JVM, .NET, even JS...

> ptr2 = static_cast<uint8_t*>(ptr)+5; pretty much maps to the assembly code you would expect.

No, it doesn't, not after optimizations and undefined behavior. It's perfectly acceptable for the compiler to turn that into a no-op if, for example, "ptr" was a null pointer.
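
A hedged sketch of that undefined-behavior point (my example, not from the thread): arithmetic on a null pointer is undefined, so an optimizer may treat the pointer as non-null from that point on and delete a later null check.

    #include <cstdint>
    #include <cstdio>

    // Because "ptr + 5" is undefined when ptr is null, the optimizer may assume
    // ptr is non-null and is allowed to remove the null check below.
    std::uint8_t read_at_offset(void* ptr) {
        std::uint8_t* p2 = static_cast<std::uint8_t*>(ptr) + 5;  // UB if ptr == nullptr
        if (ptr == nullptr) {        // a conforming compiler may delete this branch
            std::puts("null pointer");
            return 0;
        }
        return *p2;
    }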


OTOH it has become more true over time, because C and C++ based hardware and software platforms became dominant and processors are now designed to just run C(++) as well as possible.

CPUs no longer commonly implement integer overflow checking, granular hardware based memory safety primitives, efficient user level exception handling, etc.


I guess two other big examples are sequence points and UB.


Even though it's not the real machine, it's still a lot closer to it than what you get from the higher-level languages the parent commenter mentioned.


Once you have the concept of user-defined aggregate types that are meaningful to the semantics of your VM (which C and C++ definitely do), then you're pretty far away from the machine. The difference between the C VM and the .NET VM is not nearly as wide as the difference between the native instruction set and the C VM.


That's the point: C++ allows you to go pretty high on the abstraction side, but it gives you all the tools to see how the high-level abstractions are implemented on the low-level side.


While that is true (and I am a big C++ fan, given it was my path after Turbo Pascal), precisely because I am a fan of Wirth and Xerox PARC languages I also know there are better paths.

An example is the work being done on .NET Native, C#'s roadmap up to 8 picking up Midori's work, D, Swift, Java 10, and so on.

C++ got this position because the others stopped caring about those issues.

Now that they have finally started caring about them, let's see what the future holds.


Oh, like I said, I have a love-hate thing with C++. Nowadays I mostly do C# and Python, with a bit of Java. I only rely on C++ when I have to do brutal per-pixel computer vision algorithms.

In most cases you can rely on libraries to do the heavy lifting (and then I stick to Python or C#), but sometimes you have to take the big C++ hammer and try not to hit your fingers with it.


I see, that is similar to me.

I spend most of my time nowadays between Java and .NET languages, with some JavaScript when doing Web stuff.

Diving into C++ means I am doing OS integration work, like Android's NDK, COM, UWP APIs exposed only to C++, CLR and JVM native APIs.

So I guess I kind of share the same love-hate thing as you.


I once watched a CppCon talk where the presenter showed different bits of code and asked the audience whether each was a zero-cost abstraction or not. Even at CppCon, there wasn't a consensus.

Sure, it's possible to do high-level things while knowing what the machine will end up doing, but if you have to be a guru to be able to do so... then can you really say that?

I'd say "C++ strives to be a language where expert C++ compiler writers can feel what the machine will actually implement while they code high-level abstractions (provided they're using the compiler they wrote themselves)".


There is little value in knowing whether a given line of code translates into machine instruction X or Y or into none at all. Simply write readable code and trust the compiler until you have a reason to do otherwise. In aggregate your code will be fast (if not the fastest).

Once you measure a performance issue, you can focus in on one part or algorithm and map abstractions onto hardware, which can't be done in many languages. Because hardware differs, this process is often different; even between two members of the x86 family there are different ways things can be optimized, and the compiler puts in a bunch of reasonable defaults. Once you know the sane defaults are not good enough, you can trade effort for efficiency as much as you want and keep getting gains for a long time.

Trying to go down to the machine code for each C++ build is like trying to understand the JVM bytecode for each Java build. I don't know why so many care, because you only need to do this on occasion. I suspect the culture of performance causes people giving talks to focus on this, and then people consuming talks place disproportionate value on it.

This is clearly happening with constexpr and template metaprogramming now. Few people really need it, but there is huge focus because a few smart people needed it to solve their problems, and those people are also prolific speakers, bloggers, and authors. Where are all the talks on best practices around smart pointers, thread synchronization tools, class composition, and other common issues that affect correctness, design, or other aspects of creating software? I want to see more on hazard pointers and other pooled smart pointers; these are a real replacement for general-purpose garbage collection, but the community prefers to talk about compile-time HTML parsing (not that these are mutually exclusive, but there is only so much attention).


I am very disappointed with constexpr in C++14 and C++17.

The work shown here is impressive, but the point is that it should not be. D has shown that variable initialization at compile time can be painless.

The main limitations I find are: a) Lack of support in the standard library (e.g., you have to code your own sort, vector class, etc.; see the sketch after this list)

b) Terrible Terrible Terrible compilation times.

c) Lack of dynamic memory allocation within constexpr.
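
For (a), a hedged sketch of the kind of thing you currently end up hand-rolling in C++14, since std::sort is not constexpr there (CArray and sorted() are hypothetical helpers, not from the post):

    #include <cstddef>

    // C++14 constexpr allows loops and local mutation, so a hand-rolled
    // compile-time sort is possible, just not provided by the standard library.
    template <typename T, std::size_t N>
    struct CArray { T data[N]; };

    template <typename T, std::size_t N>
    constexpr CArray<T, N> sorted(CArray<T, N> a) {
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = i + 1; j < N; ++j)
                if (a.data[j] < a.data[i]) {
                    T tmp = a.data[i];
                    a.data[i] = a.data[j];
                    a.data[j] = tmp;
                }
        return a;
    }

    constexpr auto s = sorted(CArray<int, 4>{{3, 1, 4, 1}});
    static_assert(s.data[0] == 1 && s.data[3] == 4, "sorted at compile time");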

constexpr functions are not evaluated while parsing (i.e., the code is not compiled and then executed). My use case, which was to avoid a one-second preprocessing step when loading a library, took more than 20 minutes to compile and used more than 60 GB of RAM (on GCC 7; Clang did not even manage to compile it).

Stuff like initializing a bitset can easily eat all your RAM (e.g. this fails to compile):

    #include <bitset>
    int main() { std::bitset<1024*1024*1024> bs; }

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63728


It's coming soon, as always.

A lot of the STL is already being constexpr-ized (as Stephan T. Lavavej has mentioned).

This code I wrote takes about 5 seconds on GCC and 1.5 seconds on Clang to compile for a 1000-node HTML template (about 28 KB). Definitely slow, but not really, really slow.

I have a feeling recursion slows down the compile time; I will attempt an iteration-based version eventually and see.


For reference: in D, you could just take an HTML parser and run it at compile time without writing any new code.


Yes - D is what C++ should have been. By C++23, we will be about 80% there :|

Considering that one of the greatest C++ programmers of all time (Andrei) is now a D person, that's proof enough.

But for some reason, this sort of C++ shenanigan is intensely satisfying to me.

I guess it's a narrow complexity fetish.


Can D be used without GC?

(Note, by "used" here I mean utilizing most of the language, and the full ecosystem of libraries, or at least most of it; not a mode that is supported in theory but useless in practice.)


Yes, if you're willing to spend some effort on it. A simpler option is to just avoid most GC allocations, so you don't run into performance issues with the GC. http://dlang.org/blog/the-gc-series/

Mind sharing why a GC seems to be unacceptable for you?


It's not that it's unacceptable in general (although there are certainly scenarios - real-time, embedded etc - where it is). It's that a language that effectively mandates GC cannot be a true C++ replacement. So I can't in good conscience accept a "better C++" or "what C++ should have been" label.

This isn't to say that D is not a great language. It is, but I see it more as competing against Go (which is also not "a better C"), and to some extent Java, C#, Kotlin, and Swift.

The only true contender to the "better C++" label that I know of is Rust.


You can already write C++ in D; all the features are there. On top of this, you get features that C++ should have had years ago for free: static reflection, UFCS (less boilerplate), metaprogramming that isn't a hack, and lots of static guarantees (purity, GC usage, etc.).

The standard library could have been implemented completely without the GC, but the GC makes programming simpler, so it's not worth it. Practically, this may be a problem, but it's a little unfair to use this against D (the language alone), given that Phobos is not standardized in the same way the C++ standard library is.

In a very specific way (which may or may not be a good thing in your opinion), I think D is closer to C++ than Rust is: D maintains the basic idea of C++, i.e. syntax and semantics are still C-based, and the type system is less complex than that of Rust.

However, Rust has better static safety guarantees than D. D's system is still being developed, although it does already offer much more than C++.


The problem is the stop the world pause. Imagine the following scenario: Thread 1 is a realtime thread and allocates memory up front and even uses @nogc for the rest of the code. Thread 2 continuously allocates memory. Thread 2 starts the garbage collector and must stop Thread 1 even though it doesn't allocate anything. Basically the only solution to the stop the world pause with D is to run two instances of your program and then use IPC even though the language provides a way to disable the gc. D's solution is to generate less garbage but it doesn't solve the real problem with current garbage collector implementations.


> Basically the only solution to the stop the world pause with D is to run two instances of your program and then use IPC even though the language provides a way to disable the gc.

There is a solution to your stated problem: deregister the realtime thread (Thread 1) from the runtime. No registration -> no stop-the-world for that thread. The deregistered thread will not be able to "own" GC objects or allocate with the GC, but it will be perfectly able to use them. I did this in a video game for the audio thread.


Writing in languages with a GC is quite cumbersome.

Unless you are writing a proof of concept without a care in the world for your code, you still need to reason about memory just the same as in a language without a GC. The only problem is that you constantly have to fight the GC and you don't get any help from the language. And at the same time you lose valuable constructs such as RAII - for no good reason.

Similar to how static languages are gaining ground, I think we are going to realize that garbage collection has been a huge mistake. It is such a relief that we now have Rust, and the lost opportunity that is Go is nothing but tragic.


> Writing in languages with a GC is quite cumbersome.

On the contrary, the GC simplifies programming for the vast majority of use cases. If it becomes a problem, then don't use it.

> And at the same time you lose valuable constructs such as RAII.

Forgive me for being argumentative, but have you looked at the D website? D has RAII, as advertised on the website homepage.


In the same way that dynamic languages simplify programming for the vast majority of use cases. Or not; it's just short-sighted and introduces unnecessary complexity.

When it becomes a problem it is already too late, because you can't just change your mind - "oh, I guess I'll think about memory now". That is something you should always do, regardless of whether a GC is present or not. And if you haven't, well, prepare to spend the remaining lifetime of that project battling with it. And if you did, why bother with the GC in the first place?

It is not difficult, and forces you to write better software.

I was talking about GC in general terms, not specifically about D. That RAII example seems a bit convoluted, though. Maybe D is a great language even when ignoring the GC, but using a language in a way that isn't officially intended (and I bet most users do make use of and expect the GC to be available) is usually a bad idea.


D's GC is designed in such a way that it's very easy to avoid. It only operates when allocating; if you want it to not collect from a thread, just disconnect that thread from the runtime.

Using a GC does not constitute not worrying about memory!

There is no official way to use D; the only real problem with not using the GC now is safety (Exceptions pass pointers and can therefore leak across library boundaries).


With Vibe.d you can use Pug / Jade templates and they will be parsed at compile time so you don't have to interpret them at run time.


That negates the whole point of templates and constexpr -- they're purely functional subsets and explicitly designed to be so.

(You can argue that purely functional code is useless in real life and that C++ shouldn't strive to be purely functional, but that's another argument.)


C++11 constexpr was purely functional; C++14 adds (controlled) mutation.
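
A minimal sketch of that difference (illustrative example, not from the thread):

    // C++11 constexpr is limited to a single return expression, so you recurse;
    // C++14 allows local variables, loops, and mutation inside constexpr functions.
    constexpr int factorial_cxx11(int n) {      // C++11 style: purely functional
        return n <= 1 ? 1 : n * factorial_cxx11(n - 1);
    }

    constexpr int factorial_cxx14(int n) {      // C++14 style: controlled mutation
        int result = 1;
        for (int i = 2; i <= n; ++i)
            result *= i;
        return result;
    }

    static_assert(factorial_cxx11(5) == factorial_cxx14(5), "both run at compile time");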


Likewise in Common Lisp, arguably much more easily.


Well, Lisp has no distinction between metaprogramming and programming, so as long as you like parens, you can do anything.


No macro is needed here. You just write the code, then execute it at compile time. This is designed properly; it's not anywhere near as hacky as C++'s constexpr.


No need for macros, eval-when is a special operator designed for this:

    (eval-when (:compile-toplevel)
       ...)
You can also have read-time macros or load-time values, depending on what you need.


how ... macros?


Macros, or (eval-when (:compile-toplevel) [your code here]). Caveats apply.


Vanilla macros are one option.

  (defmacro html (string)
    (some-html-parser-library:parse string))
Or you could use a function and a compiler macro to make it more of an optimization -- if the argument is known at compile-time just parse it then, otherwise wait til runtime:

  (defun html (string)
    (some-html-parser-library:parse string))

  (define-compiler-macro html (&whole form string)
    (if (constantp string)
      `',(some-html-parser-library:parse string)
      form))
Actual example, using `string-upcase` to stand in for an HTML parsing library:

  (defun html (string)
    (string-upcase string))
  
  (define-compiler-macro html (&whole form string)
    (if (constantp string)
      `',(string-upcase string)
      form))
  
  (defun foo (arg)
    (print (html arg))
    (print (html "compile-time"))
    nil)

  (foo "run-time")
  ; "RUN-TIME"
  ; "COMPILE-TIME"
  
  (disassemble 'foo)

  ; disassembly for FOO
  ; Size: 128 bytes. Origin: #x10106A207D
  ; 7D:       498B4C2460       MOV RCX, [R12+96]                ; thread.binding-stack-pointer
                                                                ; no-arg-parsing entry point
  ; 82:       48894DF8         MOV [RBP-8], RCX
  ; 86:       488D5C24F0       LEA RBX, [RSP-16]
  ; 8B:       4883EC18         SUB RSP, 24
  ; 8F:       488B55F0         MOV RDX, [RBP-16]
  ; 93:       488B0576FFFFFF   MOV RAX, [RIP-138]               ; #<SB-KERNEL:FDEFN HTML>
  ; 9A:       B902000000       MOV ECX, 2
  ; 9F:       48892B           MOV [RBX], RBP
  ; A2:       488BEB           MOV RBP, RBX
  ; A5:       FF5009           CALL QWORD PTR [RAX+9]
  ; A8:       480F42E3         CMOVB RSP, RBX
  ; AC:       488D5C24F0       LEA RBX, [RSP-16]
  ; B1:       4883EC18         SUB RSP, 24
  ; B5:       488B055CFFFFFF   MOV RAX, [RIP-164]               ; #<SB-KERNEL:FDEFN COMMON-LISP:PRINT>
  ; BC:       B902000000       MOV ECX, 2
  ; C1:       48892B           MOV [RBX], RBP
  ; C4:       488BEB           MOV RBP, RBX
  ; C7:       FF5009           CALL QWORD PTR [RAX+9]
  ; CA:       488D5C24F0       LEA RBX, [RSP-16]
  ; CF:       4883EC18         SUB RSP, 24
  ; D3:       488B1546FFFFFF   MOV RDX, [RIP-186]               ; "COMPILE-TIME"
  ; DA:       488B0537FFFFFF   MOV RAX, [RIP-201]               ; #<SB-KERNEL:FDEFN COMMON-LISP:PRINT>
  ; E1:       B902000000       MOV ECX, 2
  ; E6:       48892B           MOV [RBX], RBP
  ; E9:       488BEB           MOV RBP, RBX
  ; EC:       FF5009           CALL QWORD PTR [RAX+9]
  ; EF:       BA17001020       MOV EDX, 537919511
  ; F4:       488BE5           MOV RSP, RBP
  ; F7:       F8               CLC
  ; F8:       5D               POP RBP
  ; F9:       C3               RET
  ; FA:       0F0B10           BREAK 16                         ; Invalid argument count trap
`html` is only called once, to handle the call that isn't known at compile time.


Yes. We parse JSON at compile-time with std.json (in D standard library). It probably wasn't even designed with that goal. From a D point of view, compile-time parsers are almost boring since anyone can write them.


Out of curiosity, is it possible to debug that at compile time?


Yes: pragmas, static asserts.


I mean through an actual debugger, not printf debugging.


Sadly, the thing that stood out the most to me is the sentence "We attempt to make the compiler generate the most sensible error message", followed by an incomprehensible error message completely unrelated to the problem.

C++ should really get proper metaprogramming, with support for user-defined error messages. All this template stuff always seemed like an awful hack to me: Fighting the compiler instead of being the compiler.


It is a hack, but what else is there eh?

Going from templates to constexpr is like moving from a war zone to a forest. It's still scary and dark, but at least you know what hit you. static_printf is coming soon.

The error message looks incomprehensible, but the last couple of lines give you enough information - the error string and the line number. It also looks much better on the console, since it's colorized.
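
As a hedged sketch of the general technique (not necessarily this library's exact mechanism): a C++14 constexpr parser can surface an error string by reaching a throw expression during constant evaluation, and the compiler's diagnostic then shows the message and the line of the failing evaluation.

    #include <stdexcept>

    // Reaching a throw during constant evaluation is ill-formed, so the compiler
    // reports the message string and the line of the failing constexpr evaluation.
    constexpr int require_digit(char c) {
        return (c >= '0' && c <= '9')
            ? c - '0'
            : throw std::logic_error("expected a digit");
    }

    constexpr int ok = require_digit('7');     // compiles
    // constexpr int bad = require_digit('x'); // error: not a constant expression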


> It is a hack, but what else is there eh?

The non-hacky way to do this sort of thing in both C and C++ is to just use a less clever two-phase compilation as part of your build process.

I.e. you'd have your template compiler extract the HTML from your source (or, more easily, from dedicated template files), then parse, validate, and compile them. The end result would then be embedded in your binary somehow.

The gettext set of tools is a good example of how this works in another, related domain. You extract strings from your program; they and the contents of the corresponding .po files are validated, then compiled into efficient binary .mo files for runtime use.


Why rely on an external tool?

All someone has to do here is #include my header and write their templates with a small suffix and prefix.

No need to install, configure, script and so on.

Your argument is similar to saying "Why should an IDE parse your code and underline errors? We can run make every time."

Another advantage of my approach is that you have access to the parsed template as a data structure that you can compose or modify as you wish.

The only other way to do this is to parse at runtime, which is definitely slower.


Most people using gettext get it via their package system. Your library is also an external tool they need to similarly fetch & install.

And no, my argument is not similar to your IDE comparison. You'd get the exact same thing, parsing / compiling when you compile your project. You'd just offload slightly more work to make & the linker.

You can also get access to the parsed version as a datastructure. This is what the gettext library does with its compiled *.mo files.

Anyway, I don't think your thing is a useless approach. It's very hacky in C++ but this sort of thing is the best way to do something like this in many other languages.

I was just pointing out that there are decades of precedent for achieving the same results in C, i.e. parsing some custom language out of the project at compile time and shipping it with the binary. Which you (with your "what else is there" question) seemed to be unaware of.


I have nothing to add other than that's both cool and scary.


Nothing is as scary as boost :)

This is as scary as a bunny painted in camouflage.


I don't know what's scarier: that I used the Boost preprocessor library a few months ago; or that the file format I had to deal with sucked the stars out of the sky in such a manner that a reasonable person could have agreed that BOOST_PP was the best possible choice under the circumstances.


Wow.

How does one debug complex constexpr code? I assume there's no printf.

It'd be really cool if it supported some kind of interpolation, like Jinja templates, so it could generate dynamic page templates at compile time.


Often times, the easiest way to debug complex template code is to intentionally fail the compilation, with a minimal context, so you can read the compiler's output on what the types and values are. Here's a simple example of a function I used to use:

    template <typename T, typename ...Ts>
    void show_types()
    { static_assert((T*)nullptr, "Type log"); }

    int main()
    {
      show_types<int, float, bool>();
    }
When trying to compile this, using something like `g++ show-type.cpp`, you'll get this output:

    show-type.cpp: In instantiation of ‘void show_types() [with T = int; Ts = {float, bool}]’:
    show-type.cpp:7:32:   required from here
    show-type.cpp:3:3: error: static assertion failed: Type log
     { static_assert((T*)nullptr, "Type log"); }
The same can be done for non-type template args, of course.


I prefer this:

T::asdfasdf();

Although I might start using yours.


constexpr is a subset of what you can do at runtime. So my way of debugging is

#define constexpr

Which makes all the code plain code that executes at runtime; then you can put in your couts and use your debuggers.

It's orders of magnitude simpler than dealing with template metaprogramming.
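
A minimal sketch of the trick (my example; DEBUG_CONSTEXPR is a made-up macro name):

    #include <cstdio>

    // Defining the keyword away (after the standard headers are included, so they
    // are unaffected) demotes compile-time code to ordinary runtime code that you
    // can printf-debug or step through. A hack, but handy for a local debug build.
    #ifdef DEBUG_CONSTEXPR
    #define constexpr
    #endif

    constexpr int parse_digit(char c) { return c - '0'; }

    int main() {
        std::printf("%d\n", parse_digit('7'));  // evaluated at runtime when debugging
        return 0;
    }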

constexpr_printf is the feature I want most of all. There is actually a patch for gcc that implements this.

I don't know how Jinja works, but the idea here is that you do templating at runtime, while your templates get compiled down to a tree-like data structure at compile time.


Jinja compiles to Python which is then compiled to CPython bytecode and executed by the CPython VM.


I see - here the compile-time data structure is converted into a tree at runtime and rendered.

I decided to use a runtime data structure to allow things like splicing and looping templates, etc.

It's also possible to make the runtime render the compile-time data structure directly, which should be even faster if your requirements are simple enough.


In D, where compile-time function execution is the norm, you use the equivalent of what would be a "pragma printf" -- i.e. you embed compiler warnings into your compile-time branching to debug.


One can also just static assert with a ctfe-able message.


https://github.com/saarraz/static-print (not mine) is a compiler patch that adds the kind of print you're looking for


> The program will fail to compile if the HTML is malformed

By malformed, are we talking about incorrectly closed tags, or actually invalid HTML? HTML doesn't require that all tags be closed...


The HTML that browsers accept is loosely defined

The HTML specification has a grammar and defines what's allowed - a subset/derivation of XHTML/SGML and their ilk.

There are a bunch of test template files in the test/ folder that demonstrate what kinds of errors are caught.


> The HTML that browsers accept is loosely defined

This has changed since the advent of the HTML5 specification, the primary purpose of which was to retroactively describe existing browser HTML parsing behaviours and to document and specify them comprehensively in all of their complexity.

The wisdom of this abominably complex approach may be questionable, but it's certainly no longer "loosely" defined.

I'd be curious as to whether this implements HTML5...


I remember the promise of XHTML and the sad reality that, by and large, nobody actually cared about well-formed documents.

The specification allowing for "implicit" tags is something that just doesn't make any bloody sense to me. It feels like someone looked at all warnings in a compiler and said, I'll just explicitly define exactly how I want each warning to behave, so that it is no longer dangerous behavior.


HTML always had tag omission and "self-closing" elements - these are features from its SGML roots. A formal SGML grammar for modern W3C HTML 5.1 can be found at [1] (my project).

[1]: http://sgmljs.net/blog/blog1701.html


I don't think the gp was saying this is a new thing, just an old thing that never made sense to her/him.

That looks like a very nice resource though, thank you.


Indeed.

And to be fair, I'm actually somewhat OK with self-closing tags. Though I can't easily see why I'd want tags that can't self-close.


To be clear, HTML's "self-closing elements" are img, meta, and others which never have content or end-element tags. These are based on SGML empty elements, and HTML merely tolerates the XML-style form

    <img ... />
but does not allow

    <img ...></img>
though parser recovery might be able to deal with it, and WebSGML can formally reject or accept it.

Apart from empty elements, HTML's tag inference can also be formalized using SGML. Tag inference is what makes this

    <title>Title</title>
    <p>Body Text
a valid HTML document and be treated as though

    <html>
      <head>
        <title>Title</title>
      </head>
      <body>
        <p>Body Text</p>
      </body>
    </html>
had been written.


Right, the "self-closing elements" don't actually bother me. I do slightly prefer the <foo /> form, but in general I don't care. What used to baffle me was that I couldn't do a "<div />", as that wasn't allowed. And I couldn't think of a good reason to care about that. (It seemed you had to go out of your way to disallow that, for no apparent reason.)

The tag inference, I just don't get. I can /almost/ understand the example you used, but the examples include crap like:

    <table>
      <thead>
        <tr>
        <tr>
Where the second "<tr>" is actually part of the inferred "<tbody>". Just, why?

Edit: And to be perfectly clear, I do expect most of this to be handled for me by whatever framework I'm using nowadays. And I don't actually generate documents directly that much. For docs, I typically go with LaTeX or friends. (Honestly, probably org-mode more so, but even that is light nowadays.)


Well, in my paper, like you, I'm criticizing (tag omission in) HTML5's table content models, and discourage aggressive use of it ([1]), so probably I'm not the one to defend it ;)

Even the HTML specification text itself got its tables wrong ([2]; also explained in [1]).

[1]: <http://sgmljs.net/docs/html5.html#start--and-end-element-tag...

[2]: <https://github.com/whatwg/html/commit/6e305c457e42276bf275b8...


Apologies, I did not mean to put you on the defensive. Just adding to the point. If anyone skipped your link, they shouldn't have. Thanks for sharing!


> It feels like someone looked at all warnings in a compiler and said, I'll just explicitly define exactly how I want each warning to behave, so that it is no longer dangerous behavior.

I get the impression you're saying this in a (rightfully) despairing tone, as if it only seems that way and no one in their right mind would actually do that in reality. But this is actually the stated mission of the HTML editor. This was literally his intent with HTML5. That is what it is.


Sorry, I knew that was the intent of HTML5. What I don't get is how it was a sane goal. I was likening it to someone deciding to just redefine all of the various ways that people used things incorrectly to be correct. To my knowledge, nobody is trying to do that.

Look at the tortuous ways they deal with the various missing tags of a table. My favorite is how they work to make it permissible to have a THEAD with an implicit TBODY that follows it. Just, why!? I think this is important for browser vendors, to an extent. I also think we can try to do better.


> My favorite is how they work to make it permissible to have a THEAD with an implicit TBODY that follows it. Just, why!?

Are you asking why the HTML4 spec initially allowed this? There were likely several reasons (I wasn't involved in the working group at the time). Some of the reasons:

1) Tables without <tbody> or <thead> at all, with <tr> directly in the <table>, were already all over the place before HTML4 appeared. They needed to keep those allowed, both as a practical matter and to make authoring less verbose in cases where there is no thead/tfoot.

2) Their syntax definition method (SGML) allowed for this by making <tbody> start and end tags optional, which they therefore did.

The outcome is that <tbody> has optional start/end tags in HTML 4, and <table><thead></thead><tr></tr></table> ends up with an implicit <tbody>.

Now we come to HTML5. We're not using SGML anymore, so we _could_ disallow missing tbody when there's a thead/tfoot, but still allow it if there are no headers/footers. But in the intervening 10 years, there's a ton of content that was created that relies on the HTML4 behavior, and browsers all implement the HTML4 behavior. What is the argument for changing that behavior?

> I think this is important for browser vendors, to an extent.

What was important to browser vendors, with HTML5, was having a standard that actually specified the behaviors needed for de-facto compat with existing content, so they could stop (buggily) reverse-engineering each other to figure out how to handle the corner cases. This was the stated intent of the spec. Is this the goal you consider "not sane"?

Note that some of the authoring behaviors involved are still considered incorrect in HTML5 (though leaving out <tbody> is not one of them): misnesting your tags will cause your HTML to not be valid HTML5, and a validator will flag that. It's just that HTML5 specifies what browsers should do even in the face of incorrect behavior like misnesting tags, because it turned out that people were doing that even though it was invalid and depending on the resulting behavior of browsers.

> I also think we can try to do better.

What do you think should be the goal of a spec for "HTML"?


The specific example that blows my mind is that:

  <table>
    <thead>
      <tr>..
      <tr>...
  </table>
has an implicit tbody. Sure, there are some sane reasons to have implicit values. And in some cases I think it is actually obvious what those tags would be. This case, however, does not appear to be obvious to me. It is just as likely that this was a table that has a header, but no body.

I don't fully understand why "existing documents" are relevant at all. Since you basically have to "opt in" to the new version by declaring the doctype, we could have had much cleaner semantics on a new doctype. This seemed to be the goal of the XHTML push a few years prior. I am not privy to all of the history of why that failed.

To directly answer, my goal for the spec of HTML would have been a spec with fewer special cases. Preferably, one that made for less of a surprise between people who know XML and HTML.


> has an implicit tbody.

Uh.... That case has no implicit tbody in HTML5. See http://software.hixie.ch/utilities/js/live-dom-viewer/?saved... in any modern browser.

In HTML4 it does because the DTD has "TBODY+" instead of "TBODY*", and yeah, I have no idea why someone thought that was a good idea, apart from the theoretical purity of "a table with no body makes no sense".

> It is just as likely that this was a table that has a header, but no body.

That's exactly what it has.

> Since you basically have to "opt in" to the new version by declaring the doctype,

Er.... you don't. The "new version" is the only version. The doctype affects a very small number of quirks but that's it, and that part way predates HTML5.

> I am not privy to all of the history of why that failed.

There were a few reasons. First, it turned out that neither authors nor users wanted the hard-fail behavior of an XML parser. Users, because it would mean they couldn't read the page they wanted to read. Authors, because they did not sufficiently control all the markup ending up on the page (multiple people authoring snippets, CMS templates, random bits of markup pulled from databases provided by other companies, etc).

Second, because there was no sane migration path. Suppose an author wanted to switch some page over to XHTML. But not all browsers support XHTML (and in particular the browser with 95%+ market share does not), so they need to provide an HTML version too. The normal answer to that was to make use of XHTML 1.0 Appendix C to provide a document that could be parsed as either HTML or XHTML, and to use HTTP content negotiation to send either the text/html or application/xhtml+xml MIME type. But then the problem was a tendency to only test the text/html case and have the application/xhtml+xml case not end up as well-formed XML. There were tons of documents all over the place that had an XHTML doctype and were attempting to comply with Appendix C, but were not actually well-formed; luckily most of them were only served up as text/html. All of this was a strong disincentive for browsers to advertise application/xhtml+xml support, because they would get broken pages. Even the browsers that had started off advertising such support ended up removing it in the face of user complaints; see first reason above.

Note that all this would have been _much_ worse if the switching had been on doctype, not MIME type; as I noted above, there were tons of documents around that had the XHTML doctype but were not well-formed.

I should note that the actual semantics of XHTML1 were not that different from HTML4; apart from parsing there were no significant differences. And the parsing semantics turned out to be something no one wanted in practice, per above.

As for XHTML2, which did attempt new semantics of various sorts, it suffered from several problems as well. Most glaring, again, was complete lack of migration path. Unlike XHTML1 there was no way to create a document that would work with a UA that didn't implement XHTML2 _and_ one that did. The XML parsing semantics were still not wanted in the market. The new semantics XHTML2 introduced were not that wanted either, because the working group decided to not talk to any actual authors or browser vendors or anyone else who would be involved in creating or consuming XHTML2, pretty much. The result was a spec that was solving problems people didn't have, not solving problems they did have, and with no clear way to deploy it in the market.

All of the above is why when WHATWG started working on an evolution of HTML the priority of constituencies (now captured at https://www.w3.org/TR/html-design-principles/#priority-of-co... ) was users, authors, implementors, specifiers, theoretical purity. Because the approach of putting theoretical purity first had been tried and failed spectacularly...

Note that a large part of the failure was in fact due to the "existing documents" problem, because the lack of a migration path was one of the most significant barriers to XHTML adoption. Of course the lack of strong reasons to adopt it didn't help either.

> my goal for the spec of HTML would have been a spec with fewer special cases.

This is not an unreasonable goal, sure. I should note that in terms of priority of constituencies this is a "theoretical purity" goal. Getting rid of specific special cases that are confusing people could be a goal in terms of the "authors" or "implementors" or "specifiers" constituency, of course.

Note that HTML5 did in fact remove various special cases HTML4 had that were due to its SGML heritage, most of which had never actually gotten widely implemented in browsers. For example, comment parsing was simplified significantly, such that "<!-- Reader -- take note! -->" is actually a closed comment (which it's not in HTML 4, and wasn't in Firefox, which actually implemented the HTML 4 semantics for comments, until the switch to the HTML5 parser). The special cases that remained were the ones that were needed to actually render existing web pages correctly.


Hmm, I have to confess I was cribbing this example from a link above. I'll dive further on it and see where I got lost.

I am a bit befuddled by the claim that HTML5 was determined not to be an opt-in schema. I'm probably colored by the fact that most of my docs, back when I actually cared about this, were using the XHTML doctype. So, for me it definitely was a sort of "opt-in" and a migration. Which, frankly, is logical and makes the most sense.

So, I grant that the "existing documents" problem presented a ton of not-well-formed documents. But a large chunk of existing code presents with excessive warnings; the solution there is not to just give up, but to come up with better tools and guide people to the higher-quality paths.

In the end, I fully accept this as something I will just have to agree to disagree on. My assertion is that the contortions made to avoid raising the bar on the creation of documents did little to advance the state of the web. I do not have a clear path on how to test this assertion, and I have since moved on from web development.


Syntactically, I believe I have implemented HTML5.

It parses out a pretty large example HTML file without any trouble


Browsers seem to be happy with <br/>, for example, which is actually deprecated and an HTML 4 thing.


<br/> isn't an HTML 4 thing (it's an SGML thing, and hence goes back to the IIIR HTML draft and the original HTML (2!) standard): it has a very different meaning than most people think; it's the NET syntax (of the SHORTTAG feature of SGML), and is a br tag followed by a literal ">". The feature even gets mentioned in the HTML 4 spec: https://www.w3.org/TR/html4/appendix/notes.html#h-B.3.7

Of course, browsers never actually implemented this, which led to the ability to write "polyglot" markup which was both (de-facto) HTML and (de-jure) XHTML simultaneously.


Nit: it's a WebSGML thing, i.e. an XML feature factored into the revised SGML spec to accommodate XML-style empty elements, which was necessary so that XML could be DTD-less yet still have a compact idiom for empty elements.

Classic SGML only had a shortform syntax such as

     <elmt/content of elmt/
and WebSGML introduced NETC and EMPTYRM to extend this into a (rather idiosyncratic) way of achieving support for XML-style empty elements within the SGML framework.


The current HTML specification defines parsing precisely. It doesn't use a grammar; it defines a state machine directly instead.
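
For flavor, a toy sketch of what "a state machine instead of a grammar" looks like (mine; the real HTML5 tokenizer has dozens of states):

    #include <iostream>
    #include <string>
    #include <vector>

    // Toy tokenizer with three HTML5-style states: "data", "tag open",
    // and "tag name". It just collects tag names, nothing more.
    enum class State { Data, TagOpen, TagName };

    std::vector<std::string> tag_names(const std::string& html) {
        std::vector<std::string> tags;
        State state = State::Data;
        std::string current;
        for (char c : html) {
            if (state == State::Data) {
                if (c == '<') state = State::TagOpen;
            } else if (state == State::TagOpen) {
                current.assign(1, c);
                state = State::TagName;
            } else {  // State::TagName
                if (c == '>' || c == ' ') {
                    tags.push_back(current);
                    state = State::Data;
                } else {
                    current += c;
                }
            }
        }
        return tags;
    }

    int main() {
        for (const auto& t : tag_names("<p>Hello <b>world</b></p>"))
            std::cout << t << '\n';   // prints: p b /b /p
    }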



