The Case of the 50ms request (wizardzines.com)
284 points by stochastimus on May 6, 2021 | 79 comments



Pretty fun puzzle, but this kind of debugging is alien to me. Without going into too much detail, it's pretty easy to see something suspect is going on, and this hunch can be easily tested. This debugging tour takes you to strace, wireshark, all sorts of other low level debugging techniques, when really all you had to do was simulate the client with curl -d and the problem would have been pretty obvious.

And in this case the more complicated debugging tools didn't even explain anything. As the last page says, the answer is kind of a leap. If you already knew about this problem curl would have solved it immediately, and if you didn't you'd still be baffled even after knowing exactly why the stack traces are the way they are.


I had to read this remark a couple of times before I realised it's suggesting a curl debugging trace is somehow more definitive and less complicated than tcpdump, despite it being a TCP issue.

For those of us who grew up reading W. R. Stevens, that might seem absolutely back-to-front. Not only that, curl ain't gonna help you diagnose the next one, which is a Path MTU Discovery issue, or the one after that, which is an undervoltage on the DRAM refresh controller. It might, however, be helpful in nailing that HTTP/2 prefetch request connection upgrade bug.

Mechanical sympathy means feeling every part of the machine, right down in your bones.


To be fair, it was much simpler to develop and retain fluency in tcpdump when your interfaces were physical and your protocols weren't encrypted.


It is a lot nicer to tcpdump/wireshark when you can see everything, but most of the patterns of bad behavior are visible with encrypted data too.

Of course it's not as easy as it used to be; there are a lot more hurdles.

Encryption as you mentioned; although if you control the client or the server, you can often log keys and decrypt with wireshark, but it's a lot of steps, and you don't get helpful feedback to find mistakes.

NIC offloading means the OS doesn't necessarily see packets as they are on the wire. Segmentation offload means you may see larger packets than are on the wire, and checksum offload often makes sent packets show as errors but they're fine on the wire. If the NIC mutilates the packet, that's hard to debug.

It's not easy to run packet captures on mobile devices. If you can't get tcpdump on the device, you can't get captures from the client side of your cellular data. A lot of people don't have a network router they can run tcpdump on either, and if not, they can't get the client side of wifi data either.


That's the first thing I wanted to do after seeing the JS; since it was a simple request / response, check to see if it's a problem with the client vs the server by replacing the client with a known good piece of code. This effectively bisects the search space without first diving into low-level details.


Exactly. If anyone has ever seen The Price Is Right, this is the optimal strategy for the “higher / lower” game: cut the options in half each time.


Also known as: Binary Search


True, but not as catchy


How would using curl -d make the problem obvious?


It wouldn't have split the request over separate sends.

That's what flushHeaders() is doing; it does a separate send for headers vs the body of the request.

https://nodejs.org/api/http.html#http_request_flushheaders


flushHeaders() is there to have a simple bug to find. In real world cases, tools like tcpdump and strace are essential, and I think this is a nice way to teach about how these tools can be used.


Ok, I think that would have helped me learn that the client can somehow trigger the bug, but I don't think it would have allowed me to narrow it down to flushHeaders.


There is no such thing as flushing a TCP socket, which is why that function jumped out at me. curl would have confirmed the server isn't misbehaving. Googling flushHeaders shows it's a Node-specific thing, and the docs explain that it has to do with the way Node buffers HTTP headers. Which makes sense because you don't want to call socket.send() for every HTTP header you add to your request. But you also don't want to send() in between your headers and the body, because then you violate TCP packet size expectations.
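
As a rough illustration of the point about not calling send() between the headers and the body, here is a sketch in Python rather than Node. The helper names `build_request` and `post_in_one_send` are made up for this example, and nothing here is taken from the game's code; the idea is only that headers and body are joined into one buffer and handed to the socket in a single call.

```python
import socket


def build_request(host, path, body):
    """Return the headers and body of an HTTP/1.1 POST as one bytes buffer."""
    headers = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return headers + body


def post_in_one_send(host, port, path, body):
    """Send headers and body with one sendall() call and return the raw reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        # One write: the kernel is free to put the headers and a small body
        # into the same segment instead of two separate small ones.
        sock.sendall(build_request(host, path, body))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

The split variant being criticised would amount to two `sendall()` calls, one for the headers and one for the body, which is the pattern that can run into the Nagle / delayed ACK interaction.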


Really nice game/tutorial.

The best job interview I ever had was framed like this. The interviewer told me there was a bug in the system and had a stack of pages he'd printed out that would provide successive clues as to what caused it. I could ask them questions, in effect using the interviewer as a search engine/debugger.

It was the closest an interview has ever come to simulating the day-to-day of a web developer.


I've said it before and will say it again: One code review or debugging session gives an interviewer more actionable information than an effectively unbounded number of logic puzzles.


January 3, 2032.

For several years now, programming has been a required course for all grade-school students.

The tech field is flooded with talent, and it is virtually impossible to get a job.

The technical interview has grown at an unbounded rate.

An aspiring engineer walks into the interview room, freshly shaven. He sits down, cracks his knuckles, takes a long sip from his branded thermos.

Waits.

And then it begins. The interview.

10, 100, 1000 logic puzzles. Faster than seems possible, he recites answers memorized from algoexpert.io.

Cut to a wide shot. Speed up time. Stubble, then a beard, appear on his face. The sun rises and sets, and yet he dare not sleep.

Eventually, he forgets language, reason, civilization, coffee. The touch of his baby son's skin, although he's not such a baby anymore.

He grunts, diagrams, and whiteboards. That's all his life is now.

"Ok, well thank you very much! You'll hear from us soon," says the interviewer. The sounds are foreign to the engineer, but he is led out of the office and onto the bright street. His car sits there, rusted.

The interviewer motions the next candidate inside.

He doesn't hear from them.



Text link: http://www.lightspeedmagazine.com/wp-content/uploads/2014/06...

(Haven't had a chance to read it yet)


Author here. I wrote a post about the design of this game on my blog: https://jvns.ca/blog/2021/04/16/notes-on-debugging-puzzles/


Knowledge of delayed ack and nagle's algorithm wasn't necessary to solve.

The explanation didn't mention the flushHeaders call, which is apparently the fix. I didn't run any tests, just looked at the JS and figured that sending 2 packets is worse than sending 1 w.r.t. latency.

It's also a pretty strong intuition that the client side tends to have issues, since the server is usually well-tested and standardized. Also, very often people are measuring wrong, so checking the JS to be sure the time recorded is accurate is also important.

https://nodejs.org/api/http.html#http_request_flushheaders


There's a huge difference between solving an instance of a problem and understanding the problem. Enough commits of "this seems to fix the problem" and the codebase becomes really hard to work with.


It's not reasonable to expect that developers understand how everything works. There are billions of lines of code deployed that work for reasons that the developers either don't know or only think they know.

In this case, in my judgement, you only need to know about the delayed-ack/Nagling concepts if there's an engineering requirement to keep the flushHeaders call, and the 50ms latency is a dealbreaker (i.e. not merely a curiosity). Even then, it's not clear how to fix unless you can disable one of those features.


Agree, this knowledge is unnecessary.

The first thing I did was use curl to check if it's a client or server problem.

Then I looked at the code and immediately noticed the unnecessary flush call, the unnecessary loading of the whole file in memory and the unnecessary fs.stats.


As requested by the commenter below (now I want spoiler tags):

SPOILER FOR THE GAME

I’ve seen TCP_NODELAY all over the place before, but never known why. This was a fun way to find out.


Rot13 the spoiling text!


echo "rnfl fcbvyre gntf!" | tr a-z n-za-m

:)


hackernews feature request: spoiler tags. :)


With auto rot-13


I got it straight away, but I have a telecoms engineering and networking background. This just goes to show how poorly networking is taught in most courses (as are databases), and especially in boot camps. Self-taught programmers are also very unlikely to be exposed to this kind of topic.

One thing that was not offered as an option was to use a packet analyzer like tcpdump or Wireshark, even though that is the most reliable and systematic way to get to the bottom of many performance problems. You'd think the popularity of the network tab in Chrome's dev tools would make this less scary.


If you're talking about the game, then there is the option to use tcpdump directly on the client.


Oh, OK, I didn't see it and I went straight to "Guess the problem" or whatever it was called.


I was afraid of using tcpdump until I learned that IP and TCP packets have a very specific structure. I had to use tcpdump+wireshark for debugging recently and it felt like having a mini-superpower.


This is great. Julia Evans has some great content in general, she's really good at explaining complex topics in an engaging and accessible way.


The interesting thing for me is that I would suspect most devs including myself would assume that if the request takes 50ms, that's how long it takes (because networks!).

I wonder how many of us are able to judge how long something should take? Not me, except anecdotally.


I used to work in an overseas office and quickly learned the ping times to the other branches worldwide so got a “feel” for how long things should take. 10/10 would recommend


I ran into an interesting variation of this where we shouldn't have had any problems with small packets, but it turned out we had jumbo frames enabled in AWS (which seems to be a default now). Together with gzip, you can actually have a bit of trouble filling up a packet, which will then be delayed by the commonly mentioned interaction with delayed ACKs.


Error: <<you-said>>: error within widget contents (Error: cannot find a closing tag for HTML <pid>)

This hints at a possible XSS and/or code injection (not completely quoted input). Input was "strace -s128 -f -p <pid>", as an answer to "how do you strace server process"


Also fun: try to tunnel TCP over TCP.


A comment from the inventor of Nagle’s algorithm: https://news.ycombinator.com/item?id=9050645

(tl;dr Try turning off delayed ACK first, especially if you can’t update the code.)


Thanks for pointing out this comment! Setting TCP_NODELAY seems to be the go-to / default solution for some reason (cargo-culting? not understanding the interaction with Delayed ACKs?) when there are times when Nagle’s algorithm could probably help.
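
For readers who have not seen the option under discussion, a minimal Python sketch of setting it follows. `disable_nagle` is a name invented here, not part of any library. Whether to set it at all, or to adjust delayed ACKs instead, is the judgment call this subthread is about.

```python
import socket


def disable_nagle(sock):
    """Set TCP_NODELAY on a TCP socket and report whether it is now enabled."""
    # With TCP_NODELAY set, small writes are sent right away instead of being
    # held back while earlier data is still unacknowledged (Nagle's algorithm).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

The option is per socket, so it has to be applied to each connection that needs it.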


SPOILER ALERT

I answered "req.flushHeaders()" but surprisingly it doesn't accept that as a cause, even though the headers would be sent with the initial packet and should improve the latency.


In the solution section, one of the answers is about preventing the message being split into two packets.


"Also, the Linux kernel doesn't always enable delayed ACKs -- to reproduce this I actually had to write a Python program that explicitly turns them on. I haven't been able to find a clear explanation of exactly when delayed ACKs are used."

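Delayed ACKs can be controlled per route on Linux kernels 3.11+ with ip(8). Setting quickack to 1 asks for immediate ACKs on that route, i.e. it turns delayed ACKs off.

  ip route change ROUTE quickack 1

On macOS and Windows, delayed ACKs can be configured through sysctl and the registry, respectively.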

Delayed ACKs may be used in response to congestion.

For example, in bulk, i.e., non-interactive, transfers with large packets, delayed ACKs can be useful.

This is covered in Chapters 15 (15.3) and 16 of Stevens' TCP/IP Illustrated Vol. 1.

This draft suggests delayed ACKs are useful during TLS handshake.

https://tools.ietf.org/id/draft-stenberg-httpbis-tcp-03.html

Also, socat allows for setting TCP options via setsockopt. No need to write a new program.
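
The quoted passage mentions writing a Python program to turn delayed ACKs on. The thread does not show that program, so the following is only a guess at the general shape, assuming Linux: clearing the `TCP_QUICKACK` socket option asks the kernel to leave quick-ACK mode. `set_quickack` is a hypothetical helper name.

```python
import socket


def set_quickack(sock, enabled):
    """Try to toggle TCP_QUICKACK; return False where it is unavailable."""
    opt = getattr(socket, "TCP_QUICKACK", None)  # Linux-specific constant
    if opt is None:
        return False
    try:
        # 0 asks the kernel to allow delayed ACKs, 1 asks for immediate ACKs.
        # On Linux the setting is not permanent: the kernel may switch modes
        # again later, so programs that rely on it tend to re-apply it.
        sock.setsockopt(socket.IPPROTO_TCP, opt, 1 if enabled else 0)
    except OSError:
        return False
    return True
```

On a connected socket, `set_quickack(conn, False)` would be the call corresponding to "turning delayed ACKs on" in this sketch.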


   ip route change ROUTE quickack 1 dev STRING


Twine and SugarCube! Interesting to see that pop up here:

https://www.motoslave.net/sugarcube/2/


That was exceptionally fun. I thought I had the answer but I was completely wrong. I shouldn't have stopped the debugging and rushed to the solution. Unfortunately it is a game, and it allowed me to do it.

To me this seems pretty obscure, and you debug pretty deep into and outside of your application. One part of me thinks of this as Somebody Else's Problem, but this definitely makes me rethink that: it's something devs should know about, especially in time-critical/real-time systems.


Based on the description, guessed 'nagle' without any debugging. Not sure if it implies I was right or wrong, but it explained Nagle's algorithm.

Asked what to do about it and typed 'tcpnodelay'. It replied it wasn't smart and asked me to click a button.

Feels pretty basic to anyone who's ever really touched TCP in code.


A few months ago there were multiple articles about this behaviour, but I really don't remember the details anymore. Does anyone know a writeup with a detailed explanation of how it happens and tests to see whether your systems are affected?


There are 5 links once you finish the puzzle


Loved it! Although it didn’t give a comprehensive answer on how to preventively solve the problem on any platform. Is there one? Or should software developers just stick to one way or another like some kind of an unspoken rule?


This was fun and a lesson to look at the code first before jumping into tcpdump/tracing


A couple bugs in this game, I wanted to SSH onto the client, took me to the server instead.


It does ssh you to the client -- the text is wrong.


Nice, but not perfect:

``` You said: "strace -p $(ps aux| grep server.py| grep -v "grep"| awk -F ' ' '{print $2}')".

To strace the server, first you need to find its PID. You know that the program is called server.py. ```


It doesn't actually check your answer. You're supposed to check your answer against the game yourself


grr.. I felt stupid going through the puzzle.. but I was looking at that flushHeaders call and thought that might be a problem - simply because I never call that.


[flagged]


Congratulations, you have free will. Of course so does the creator of the site, and they have chosen (a choice I would make also, FYI) to construct the game this way, so you have to part ways.


I browse with JS switched off. I don't mind a web site using JS, if it's needed. Clearly, here it is needed.

The main beef I have with this site is that when I go to it with JS switched off, all I get is a message telling me to switch it on. There's no information about why I should switch it on - what will I get in return for switching it on, and what will the site take as a result of me switching it on. What am I missing by having JS switched off.

So, instead of the site just saying "JavaScript is required. Please enable it to continue." instead it should say something like "This web page contains an interactive game that requires JavaScript to function. Please enable it to continue." Then I know that there's actually something interesting being done with the JS, and not just a lazy webmaster who can't be bothered to display a blog post correctly.


That is a fair point.

Also, for sites where JS is not required but is used to enhance interactivity, a note to say what the difference is with/without JS would be useful ("without JS this site will work, but be slower due to more round-trips to the server" or "without JS this site will be less pretty"); then you can choose if you care enough about the difference.


Why is a game "News"?


This site covers things that are not news most of the time.


It is a new little toy that appeals to hackers. That makes it valid "hacker news" IMO.

Not earth-shattering by any account, but not everything has to be.

As a side note: the question isn't relevant to the comment you replied to, your reply would have been better posted at top level where the people seeing it is not filtered by taking out those who have decided to skip the thread about JS complaint complaints.


> Of course so does the creator of the site,

I find it very debatable that putting information on the internet does not come with associated mandatory responsibilities


I agree. There are things websites shouldn’t do - like non-consensually collecting and selling user data. Asking users to send private details over unencrypted http. Being inaccessible to screen readers. That sort of thing.

Having content that requires javascript? Eeehhhhhh. When static content like blog posts and news articles require JS it’s annoying. But for a little game like this the dev’s choice makes sense. Losing a few % of their audience is a reasonable trade in exchange for not needing to implement the whole thing twice. It’s really hard to get worked up about server side rendering for toys like this.

I suggest you pick your battles differently. This is a silly hill to die on.


What responsibility are you wanting the site to be held to here?

Accessibility could be a valid concern, though many accessibility tools cope with common JS patterns so if your concern really is accessibility (rather than "jcelerier is entitled to have all content without running JS") then please show what you think the problem is there.


This is a free, indie interactive fiction about network troubleshooting.

I mean, it could have been made as a server-side app, but going SPA makes sense here from a practical standpoint.


It’s an interactive page, not a blog post


It's a webpage with buttons that take you to another webpage.


Given the technical nature of the puzzles/adventures, I guess that if there's something we could all agree on is that the message about why javascript is required could be a bit more nuanced. Beyond that, sure, it could be done without javascript, but it's easy to see how it helps using javascript. It's a pretty fair use-case.


It’s stateful. You can visit the same location with different states.


That's what cookies are for.


What's with all the griping going on regarding javascript? Twine is made for these "choose your own adventure" type games, and it's faster to use it than handcode everything while serializing all your app state to a cookie.

What's next, a bunch of complaining that electron apps are slow? That people are writing video games in managed languages?


I also think that Twine/Client-Side-JS isn't the worst choice here. It takes unnecessary load from the server and is easy to use, at least for the one creating the game.

It's just so frustrating to watch how having loads of Javascript on websites that absolutely don't need it has been normalized. I don't think people should be blamed for using Javascript (or Electron, .NET, whatever), especially for hobby projects. But I also think users shouldn't be silent about it, especially since there are young developers who actually don't know software could be better. It's just such a massive waste of resources and breaks accessibility way too often.


I very much understand your frustration. I hate JS-heavy web pages as well, but for this kind of game with multipage logic an SPA is better. Just wish I can only give authorization to certain JS api.


> Just wish I can only give authorization to certain JS api.

This is already implemented in any browser I know, but for some reason it is available only to some functionality (microphone, camera etc.) but not for eg. Ajax, Cross-Origin Things, Websockets, Canvas2D, WebGL etc.

Also before executing any Javascript browsers should ask if you want this website to execute (potentially malicious) code on your computer!


JS is pretty well sandboxed — are there any examples of websites doing anything outside of their own JS context?


Remote-code-execution as a feature just isn't a good idea. Sandboxes can (and Murphy's Law says they will) be broken out of.

There are many documented browser-exploits and basically all newer ones that are actually a danger to users involve Javascript. It's also not reasonable to assume that people only visit trusted sites.


All due respect, this is a neat advertisement for the "storytelling" Javascript library she is using, but I learn much more by reading W R Stevens' books. There is more to TCP/IP than what one can do through Berkeley sockets. Plus reading Stevens' books does not require Javascript.


Her pronouns are she/her.



