The ups and downs of the HTTP header (keithcirkel.co.uk)
58 points by Keithamus on Dec 6, 2013 | 33 comments



A number of errors in this article make me wary:

1. The "request" line in HTTP is not a header - it is the request, which can have associated headers. The headers are all “about” the request. The request itself is not a header, and does not follow the header syntax. (The historical reason for this is that the request line was defined in HTTP 0.9, which did not have headers.)

2. ISO-8859-1 is not “a crappy Windows character set”. It is an international standard specifically different from what Microsoft was using at the time (code page 437 was standard for MS-DOS in the US). Later, Windows switched to code page 1252, which is a copy of ISO-8859-1 except for some extra glyphs in the byte range the ISO standard defined as control characters.
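
To make the ISO-8859-1 versus code page 1252 point concrete, here is a minimal Python sketch that decodes every byte value with both codecs and collects the ones that differ. The function name `differing_bytes` is made up for illustration, and the handling of unassigned bytes reflects how Python's `cp1252` codec behaves, which may differ from other implementations.

```python
def differing_bytes():
    """Map each byte value to (latin-1 char, cp1252 char) where the two disagree."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("iso-8859-1")  # always succeeds: one byte, one code point
        try:
            cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252 = None  # byte left unassigned by Python's cp1252 codec
        if latin1 != cp1252:
            diffs[value] = (latin1, cp1252)
    return diffs


if __name__ == "__main__":
    for value, (latin1, cp1252) in sorted(differing_bytes().items()):
        print(f"0x{value:02X}: latin-1={latin1!r} cp1252={cp1252!r}")
```

With Python's codecs, the differences should all fall in the 0x80 to 0x9F range, which ISO-8859-1 reserves for control characters.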


Thanks for the clarification about the request line, I'll edit the article to point that out!

I mostly referred to it as a "crappy Windows character set" because A) it has a limited set of characters, mostly Western European, and B) it's pretty much only used by Windows these days. While the term "crappy Windows character set" is not perhaps entirely accurate, it is a short, tongue in cheek summary of ISO-8859-1.


Unicode also has a limited set of characters, mostly those that the Unicode Consortium has agreed on including in the standard.


That's splitting hairs - Unicode (which UTF-8 encodes) allows for over a million code points, enough to cover pretty much every written language, and then some (including swathes of emoji characters). ISO-8859-1 has 256 code points, barely enough to cover Europe and America.
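
The size gap can be checked directly. This is a small Python sketch, with the helper name `fits_latin1` invented for illustration; it relies on `sys.maxunicode`, which is the highest code point Python supports.

```python
import sys

# Code points are numbered 0 .. sys.maxunicode (0x10FFFF), so the total is one more.
UNICODE_CODE_POINTS = sys.maxunicode + 1
LATIN1_CODE_POINTS = 256  # one byte per character


def fits_latin1(text):
    """Return True if every character in text can be encoded as ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(UNICODE_CODE_POINTS, LATIN1_CODE_POINTS)
    print(fits_latin1("café"), fits_latin1("€"))
```

Note that the euro sign does not fit in ISO-8859-1 at all, while UTF-8 encodes it in three bytes.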


> Thanks for the clarification about the request line, I'll edit the article to point that out!

(Apparently you weren’t thankful enough to upvote. EDIT: never mind, I must have been mistaken.)

A more accurate description of ISO-8859-1 would be “a crappy 8-bit character set mostly only still relevant for Windows which uses its own embraced and extended version, CP1252.”


I'm afraid you're mistaken, I dutifully upvoted you right after I commented.

I've changed the wording to be slightly less ambiguous. Thanks again :)


I saw your comment and still saw only 1 point on my post; I guess I must have received a downvote too during that time. Oh well, sorry for being huffy.


For compatibility reasons browsers don't use ISO-8859-1, they interpret it as Windows 1252 instead (that de-facto requirement has been codified in the HTML standard now <http://encoding.spec.whatwg.org/>).


To quibble further, the request line typically won't have a "host" section. It's almost always a URI path/stem, and the 1.1 client sends an additional Host header. The request line must also have the protocol and version, HTTP/1.0.


To quibble further still: the request line may have the protocol and version if the client is HTTP/1.0 or newer. HTTP/1.0 servers must "recognize the format of the Request-Line for HTTP/0.9 and HTTP/1.0 requests" (RFC 1945).
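
A rough Python sketch of a request-line parser that accepts both shapes discussed here: the HTTP/0.9 style with no version, and the HTTP/1.0-and-later style with one. The function name `parse_request_line` and the choice to report the missing version as "HTTP/0.9" are assumptions for illustration, not anything mandated by the RFC text quoted above.

```python
def parse_request_line(line):
    """Split a request line into (method, target, version).

    Two tokens are treated as an HTTP/0.9 simple request, which RFC 1945
    defines only for GET. Three tokens are treated as a full request line.
    """
    parts = line.rstrip("\r\n").split(" ")
    if len(parts) == 2:
        method, target = parts
        if method != "GET":
            raise ValueError("simple requests without a version must use GET")
        return method, target, "HTTP/0.9"
    if len(parts) == 3:
        method, target, version = parts
        if not version.startswith("HTTP/"):
            raise ValueError("unrecognised protocol version: %r" % version)
        return method, target, version
    raise ValueError("malformed request line: %r" % line)


if __name__ == "__main__":
    print(parse_request_line("GET /index.html\r\n"))
    print(parse_request_line("GET /index.html HTTP/1.1\r\n"))
```

Note that the request target here is a path, not a host, which matches the quibble above: the host travels separately in the Host header.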


Although no one will give a fuck if you don't handle HTTP/0.9.


Indeed. The claim "Deflate sucks compared to Gzip" jumped out at me. A more thorough discussion here would be helpful, something along the lines of "While deflate would be the superior choice (though narrowly), it has historically been poorly implemented in servers and user-agents and should therefore be avoided for compatibility".


It jumped out at me as well... because I'm under the impression that there is little difference between the two and they both use the same compression algorithm.


The gzip format uses the deflate algorithm and adds a header and footer. The only advantage over raw deflate is that it includes a CRC, the uncompressed size, and optionally the original file name, none of which are necessary for HTTP. I guess there is an advantage in that already gzipped files can be served for Accept-Encoding.
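
The framing overhead can be seen with Python's `zlib` module, which selects the wrapper through the `wbits` argument (negative for a raw deflate stream, 16 + 15 for gzip). The helper names below are made up for this sketch, and the 18-byte figure assumes a gzip header with no optional fields such as a file name.

```python
import struct
import zlib


def deflate_raw(data, level=6):
    """Compress to a bare deflate stream with no header or trailer."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def deflate_gzip(data, level=6):
    """Compress to a gzip member: header, deflate stream, CRC32 and size trailer."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, 31)
    return compressor.compress(data) + compressor.flush()


def gzip_trailer(blob):
    """Return (crc32, uncompressed_size) read from the last 8 bytes of a gzip member."""
    return struct.unpack("<II", blob[-8:])


if __name__ == "__main__":
    payload = b"hello header " * 50
    raw, gz = deflate_raw(payload), deflate_gzip(payload)
    print(len(raw), len(gz), gzip_trailer(gz))
```

The difference in length is the fixed gzip header plus the 8-byte trailer, which is small next to any realistically sized response body.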


The difference between the two is that Gzip uses CRC32 while Deflate uses Adler32, which is slightly more performant. The problem, though, is that many browsers and servers (incorrectly) send or expect deflate without the headers, so "deflate" interoperability is a trainwreck.
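
Given that interoperability problem, a receiver that wants to be tolerant can try the zlib-wrapped form first and fall back to a raw stream. This is only a sketch: `decode_http_deflate` is a hypothetical helper, and the fallback is a heuristic that could in principle misjudge a raw stream that happens to look like a valid zlib header.

```python
import zlib


def decode_http_deflate(body):
    """Decode a 'deflate' body, accepting both zlib-wrapped and raw streams."""
    try:
        # Expected form: 2-byte zlib header, deflate data, Adler-32 trailer.
        return zlib.decompress(body)
    except zlib.error:
        # Some peers send the bare deflate stream instead; negative wbits
        # tells zlib not to expect a header or trailer.
        return zlib.decompress(body, -15)


def zlib_trailer_checksum(wrapped):
    """Return the Adler-32 value stored in the last 4 bytes of a zlib stream."""
    return int.from_bytes(wrapped[-4:], "big")


if __name__ == "__main__":
    payload = b"interoperability " * 40
    wrapped = zlib.compress(payload)
    print(decode_http_deflate(wrapped) == payload)
    print(zlib_trailer_checksum(wrapped) == zlib.adler32(payload))
```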


Why is the UA header so screwed up, aside from the historical issues with it? Isn't it time that we replace it with something a bit more sane and structured? It seems the idea of detecting the browser vs detecting browser features goes back and forth. Sure, on the client side, where you have access to the DOM and the JavaScript runtime, it's great to know whether you can use the placeholder attribute in a text input, but server-side you need to decide which video file to serve to the client, and this gets tricky.

Instead, why don't we have something like this?:

    OS: Windows
    OS-Version: 8.1
    Browser: Chrome
    Browser-Version: 18.5
(Not suggesting the format, just the type of data.)

That way we can ditch the stupid stuff such as "like Gecko", which means nothing, and focus on actual useful things.


Web developers have historically tended to write shitty UA detection logic en masse, which has in turn incentivized browser makers to carefully craft UAs to break as few of them as possible. Basically, avoiding the all-too-common "this site requires IE6 or higher" message when you visit a site in IE11. The same situation would likely develop with the proposal above, which is essentially no different from the original intent for UA strings. The most viable option would be for all browser makers to just simultaneously disable them. Like a band-aid; right off!


Given your scenario of serving the right video to browsers, you shouldn't need to do UA sniffing because the browser should have the right accept headers so you can do proper content negotiation.

UAs these days should only be good for one thing, which is analytics. The browser should provide all the necessary information for other stuff through its other headers, such as Accept. Of course, I emphasise should because there's a wee bit of fantasy in that statement.
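
For what it's worth, server-side negotiation on Accept could look roughly like the Python sketch below. The function names are invented for illustration, the parsing is simplified (it ignores parameters other than q), and, as noted above, it only helps if the browser actually sends a meaningful Accept header for the resource.

```python
def parse_accept(header):
    """Return a list of (media_range, q) pairs from an Accept header value."""
    ranges = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        ranges.append((parts[0].lower(), q))
    return ranges


def matches(media_range, media_type):
    """True if media_type falls under media_range (supports */* and type/*)."""
    if media_range == "*/*":
        return True
    r_type, _, r_sub = media_range.partition("/")
    m_type, _, m_sub = media_type.partition("/")
    return r_type == m_type and r_sub in ("*", m_sub)


def specificity(media_range):
    """Rank ranges so that type/subtype beats type/*, which beats */*."""
    if media_range == "*/*":
        return 0
    return 1 if media_range.endswith("/*") else 2


def pick_media_type(accept_header, available):
    """Pick the available type with the highest q; ties go to the earlier entry."""
    ranges = parse_accept(accept_header)
    best, best_q = None, 0.0
    for media_type in available:
        candidates = [(specificity(r), q) for r, q in ranges if matches(r, media_type)]
        if not candidates:
            continue
        _, q = max(candidates)  # the most specific matching range decides q
        if q > best_q:
            best, best_q = media_type, q
    return best


if __name__ == "__main__":
    print(pick_media_type("video/webm, video/mp4;q=0.8, */*;q=0.1",
                          ["video/mp4", "video/webm"]))
```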


Your suggestion will work for a few years perhaps, and then it'll deteriorate to the current state over time.


Well as part of a rant, I'll point out two bizarro-world features of HTTP headers: Line folding and comments.

You can add arbitrary CRLFs to any header, so long as you start the next line with whitespace. Proper implementations need to treat every continuation line as part of the single header. Very annoying to implement (and implementations of other similar protocols do not all agree!), and no benefit. Unless you're composing HTTP headers to read on an 80-column layout. And that kind of thing has no place in a computer protocol.
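
To show what unfolding involves, here is a minimal Python sketch of a header parser that joins continuation lines onto the previous header. The name `parse_headers` is hypothetical, and replacing the fold with a single space is one common reading of the rule rather than the only behaviour seen in practice.

```python
def parse_headers(raw):
    """Parse a CRLF-separated header block into a list of (name, value) pairs.

    A line that starts with a space or tab is treated as a continuation of
    the previous header and is joined to it with a single space.
    """
    headers = []
    for line in raw.split("\r\n"):
        if not line:
            break  # an empty line ends the header block
        if line[0] in " \t" and headers:
            name, value = headers[-1]
            headers[-1] = (name, value + " " + line.strip())
        else:
            name, _, value = line.partition(":")
            headers.append((name.strip(), value.strip()))
    return headers


if __name__ == "__main__":
    block = "X-Long: first part,\r\n  second part\r\nHost: example.com\r\n\r\n"
    print(parse_headers(block))
```

A parser that skips the continuation check would instead see `second part` as a separate, malformed header, which is exactly the kind of disagreement between implementations described above.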

Comments. Seriously read this from the spec:

  Comments can be included in some HTTP header fields by surrounding
  the comment text with parentheses. Comments are only allowed in
  fields containing "comment" as part of their field value definition.
  In all other fields, parentheses are considered part of the field
  value.
That's even more bizarre. It also means the parser needs to know which header it is operating on. It just adds the possibility of mis-implementation and security issues (confused deputy), and hurts performance. It's only useful if you're writing HTTP headers by hand and feel the need to comment them for ... I can't think of a legit case.
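
A sketch of what that per-header knowledge looks like in Python. The set `COMMENT_FIELDS` is my assumption about which fields allow comments (User-Agent, Server and Via, as I read the field definitions), and `strip_comments` is a made-up helper that handles nesting and backslash escapes inside comments but ignores quoted strings.

```python
# Assumed list of fields whose definitions allow comments; check the spec
# before relying on it.
COMMENT_FIELDS = {"user-agent", "server", "via"}


def strip_comments(name, value):
    """Remove parenthesised comments, but only for fields that allow them."""
    if name.lower() not in COMMENT_FIELDS:
        return value  # parentheses are ordinary characters here
    kept = []
    depth = 0
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and depth > 0 and i + 1 < len(value):
            i += 2  # skip an escaped character inside a comment
            continue
        if ch == "(":
            depth += 1  # comments may nest
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)
        i += 1
    # Collapse the whitespace left behind where comments were removed.
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    print(strip_comments("User-Agent", "Mozilla/5.0 (X11; Linux (x86_64)) Gecko/20100101"))
    print(strip_comments("X-Custom", "value (not a comment)"))
```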

"Human readable" computer protocols are debatable (parsing rules always seem to become more difficult, which is very bad), but "human writable" is just silly.


This tripped me up to no end when I had to implement a web proxy in one of my intro to CS classes. I couldn't find this mentioned in the standard anywhere and different browsers treated it differently.


I've discovered exploitable holes "in-the-wild" due to SIP using the same inane parsing rules. Proxy A asserts security and billing. Server B processes the message but instead of reading Proxy A's assertions, it reads "cutely formed" data directly from the client.

Fixing is a royal pain, because some systems require the behaviour to be one way or another.

"Fortunately" security in VoIP is such a joke that tricks like this aren't the biggest issue and so far, I've not seen any such attempts in any attacks.


A bit of trivia on why Opera claims to be 9.80: They used 10.00 in the beta of Opera 10 and found out that many sites' sniffers couldn't process a 2-digit version number. So with the final release (and after that until the death of the browser) they used Opera/9.80 and put the actual version elsewhere in the string.
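
The kind of sniffer that breaks on a two-digit version is easy to reproduce. In this Python sketch both functions and both UA strings are illustrative (the strings are modelled on the Opera format described above, not copied from a real browser).

```python
import re


def naive_opera_major(user_agent):
    """Buggy sniffer: captures a single digit, so 10 is read as 1."""
    match = re.search(r"Opera/(\d)", user_agent)
    return int(match.group(1)) if match else None


def opera_version(user_agent):
    """Prefer the Version/ token, falling back to the Opera/ token."""
    match = (re.search(r"Version/(\d+)\.(\d+)", user_agent)
             or re.search(r"Opera/(\d+)\.(\d+)", user_agent))
    return (int(match.group(1)), int(match.group(2))) if match else None


if __name__ == "__main__":
    beta = "Opera/10.00 (Windows NT 6.1; U; en) Presto/2.2.0"
    final = "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.2.15 Version/10.00"
    print(naive_opera_major(beta), naive_opera_major(final), opera_version(final))
```

With the frozen Opera/9.80 prefix, the naive sniffer sees 9 rather than 1, which is presumably why the workaround kept old sites working.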

That being said, people who sniff UA string to serve different content (or even block the user) should end up in hell. I'd start with Google.


> That being said, people who sniff UA string to serve different content (or even block the user) should end up in hell.

Goodness me how I could rant endlessly on this subject.

I operate an automated web frontend testing service and much of that centres around retrieving an HTML document and running some tests against it.

I have tried very hard to be nice and fair and to set appropriate UA strings, such as featuring only the product name and relevant version numbers. Unfortunately for reasons relating to how responses are altered in relation to the UA string this is not possible.

My product features the word 'test' in the name. Some server-side services return a 404 or a 500 if the UA string contains 'test' in any form. Due to this I can't include the full product name in the UA string and expect all tests for all end users to work in cases where they really should. Some others respond similarly if the UA string is only 'agent'.

The number of services that respond in a different manner to a blank UA string is significant. Likewise for cases where the UA string is not somewhat similar to that of common browsers.

On a related subject, I'd love it if everyone supported the simple HEAD method consistently.

Some services respond as expected and return only the response headers. Some services respond fairly with either a '405 Method Not Allowed' or '501 Not Implemented', giving me the option to try again with an equivalent GET request. Some services send a 404 or 500 in response to a HEAD in cases where the equivalent GET request works just fine.

And lastly, https://myspace.com/ responds with nothing when making a HEAD request and you have to wait for the request to time out in cases where an equivalent GET works just fine.
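
A defensive client for the HEAD behaviours listed above might look like the Python sketch below. The names `http_fetch` and `headers_for` are hypothetical, the 10-second timeout is arbitrary, and falling back to GET on any error status is one possible policy rather than a recommendation.

```python
import urllib.error
import urllib.request


def http_fetch(method, url, timeout=10.0):
    """Send one request and return (status, headers) without reading the body."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, dict(response.headers)
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions but still carry headers.
        return error.code, dict(error.headers)


def headers_for(url, fetch=http_fetch):
    """Try HEAD first; retry with GET if HEAD fails, errors out, or times out."""
    try:
        status, headers = fetch("HEAD", url)
    except OSError:
        # Covers timeouts and connection failures (URLError is an OSError).
        status, headers = None, {}
    if status is None or status >= 400:
        status, headers = fetch("GET", url)
    return status, headers


if __name__ == "__main__":
    print(headers_for("https://example.com/"))
```

The `fetch` parameter exists so the fallback logic can be exercised with a stub instead of real network calls.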


Interesting article, but for the part about the User-Agent header, I really liked the history lesson by Aaron Andersen [1] from 2008.

[1] http://webaim.org/blog/user-agent-string-history/


Can't say I like the design of the page, but a good read nonetheless. Though after all those warnings, I expected it to be much longer. Is it really that long an article?


> Opera 12 then just gets weird on us. It says "Generic English please, or U.S English, if not then uh... Arabic! If not then perhaps Catalan? If not then Danish, or if not that then Dutch. Ok perhaps Greek? Finnish?... Go home Opera, you're drunk.

Most amusing part. Seriously, I can't imagine why Opera sends all these languages in its request. Bizarre.


Not only that, but ... prioritized!
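
The prioritisation comes from q-values on each language tag. Here is a short Python sketch of how a server might order them; the header string is invented for illustration (it is not Opera's actual header) and `language_preferences` is a made-up name.

```python
def language_preferences(header):
    """Return language tags from an Accept-Language value, most preferred first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        q = 1.0  # a tag without a q parameter has full weight
        name, _, value = params.strip().partition("=")
        if name.strip().lower() == "q":
            q = float(value)
        if tag and q > 0:
            # Sort by descending q, then by original position for ties.
            prefs.append((-q, position, tag.strip()))
    return [tag for _, _, tag in sorted(prefs)]


if __name__ == "__main__":
    print(language_preferences("en,en-US;q=0.9,ar;q=0.8,ca;q=0.7,da;q=0.6"))
```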


Slightly off topic, but this is the first post I've read on a Ghost-powered blog – I think it looks great.


Plain black text on a white background is great now?


Yes, but it's #3A4145 on white.

/pedantry


Yes.


There's more to Web content than the colour of its fonts.



