Hey guys -- the most resource-intensive part of the infrastructure (the game master boxes) is currently overwhelmed by your enthusiasm. We're spinning up more boxes for you, but it will take a few minutes.
Thomas and I have a running gag about this. Rails is holding up admirably. You can still actually see the website. Our problem is presently that one player implies 1+ ruby (not Rails) processes running bots and doing orchestration for their levels, and that has pegged the CPU on all of our GM boxes and also, possibly, saturated their network links as they hit the venue with... I don't actually know how many orders have been placed already but I'm going to wager a guess that it's "a lot."
I so want to read a postmortem of the launch issues. It's one thing when a corporate project with crazy requirements goes down (almost expected these days :( ). Completely different when a service started/run by known tech people with their own schedule breaks on launch. I'd be really interested in what testing was done and why it didn't match reality. (not a jab/criticism, just something to actually learn from)
Meh? It's a play-money stock exchange which has processed over 200k orders as of a few seconds ago. We're in the process of finding the parts of the plumbing that weren't sized right for hundreds of simultaneous players and tens of thousands of simultaneous actors (each bot is approximately as taxing on the infra as a human -- more, at the moment, since many folks can't read the instructions to start hitting us with orders and the bots already know how to do that).
I confess that I would love to see Patrick's usual writing style and ability applied to almost anything. I'm not sure why, but he just seems to be really ... genuine? Personable? It's like I'm listening to a friend talk -- a friend who happens to either be an expert, or have really interesting things to say.
I want to say that it's bizarre, but I think it's mainly that he writes very well.
I wondered if Patrick wrote "The StarFighter Handshake", the friendly guidelines that take the place of Terms And Conditions when you sign up. It made the usual legalese (i.e. don't DDOS our site, don't hack into personal user data) feel like the shared values of a club you were about to join. That had a hugely positive effect on me when I saw & read it, and I'm not even entirely sure why I was so impressed. [Kudos, whoever wrote it.]
We have one Rails box w/ ~2 gigs of RAM which has been sitting pretty since the golang process also on the box stopped hogging 100% of the available file descriptors. Our scaling issue is with ruby processes, of which we have several thousand active at the moment. Each process runs one wee little universe, filled with wee little traders trading wee little stocks, for the benefit of a single player playing a single level.
These aren't in a JVM language (or Golang, for that matter) out of expediency: I had to write the economic simulation and the trading algorithms and get it done in about 2 weeks, so I wrote them in my best language, Ruby.
I'm a fan of the JVM, but it doesn't remove the need to be prepared to scale up by adding more machines. That said, it's great at concurrency, so it's easier to harness the full capability of a machine than with other platforms.
Even if I didn't need multiple machines for throughput I'd probably still want them for redundancy (redundancy of machine hardware as well as redundancy across availability zones). (Note: didn't vote on your post)
Modestly less trivial if you happen to put your chat server on the box with Rails, on the theory that Rails is certainly not going to fall over, and your chat server then attempts to open several thousand websockets on a box with a not-too-generous limit on open file descriptors, it appears.
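For anyone who wants to check this on their own box: a minimal sketch, assuming a Unix-like host and Python's standard `resource` module. The helper name `fd_headroom` and the `reserve` figure are made up for illustration and have nothing to do with the actual Stockfighter setup. It reads the per-process limit on open file descriptors and estimates how much room is left after a given number of websocket connections.

```python
import resource


def fd_headroom(expected_connections, reserve=64):
    """Estimate descriptors left under the soft limit.

    `reserve` is a hypothetical allowance for log files, database
    sockets and so on. Returns None when the soft limit is unlimited.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft - reserve - expected_connections


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
print(f"headroom for 5000 websockets: {fd_headroom(5000)}")
```

A negative headroom means the process would run out of descriptors before accepting that many connections, and anything else sharing the box can then fail to open sockets or files as well.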
As I understand, HTTP pipelining is basically a no-go in the real world. Broken implementations, plus performance implications since responses must go in order.
I feel like a chat server is something best served by something not in your stack (especially Rails). I'm surprised you chose to do it in-house. How about using https://pusher.com/ or https://www.firebase.com/ ?
Why in this day and age of cheap and easy servers would you ever have more than one service type running on a server in production? It's trivial to break things up nowadays.
I hope you get this fixed soon. It can be quite stressful having your exposure period potentially damaged because your software or infrastructure can't handle the sudden influx.
I'm pretty calm at the moment, particularly now that the servers seem to be cooling down a bit.
Startup culture says Launch Days Are Really Important. This is not our first rodeo. We know 0.01% of the people who are going to play this game are even aware it exists at the moment. We're pleased as punch that y'all are playing and some of you are progressing in it, but long-term, a few teething problems on launch day is a no-op.
This is part of the game. To beat this level you sweet talk the founders into letting you help them fix their scalability problems. If you get the site running in 3 hours, you get a bonus score of ∞.
I can't get the first level to play, seems things are still overloaded. Really looking forward to this though, hope you will post or email again once things settle to remind us about it.
edit: this is the message I'm getting when trying to play First Steps, if it helps. The dashboard says "Trades.Exec() (Status: playable!)" (3:08 UTC)
> We couldn't start the level: This level has been locked by a different server. (Sorry if this happened accidentally as a result of a server crash -- it will clear up in an hour.) Refresh page to check if server is down?
I'm going to try to build a self-help tool for unlocking level instances.
Briefly: you get round-robined to a (healthy, importantly) GM machine the first time you try to start a level and that association needs to be fairly sticky. Once you've started the level, a piece of that box's memory is all your own until the level ends. To prevent a single player from getting memory on all the boxes we "lock" a level to a particular box. While the level is running, no other box will open the same level for the same player.
The problem is when the GM box running your level becomes unhealthy. Nginx will put you in touch with a different GM, which could possibly mean your level is still running on a box you can't contact. At that point, you have to wait for either that box to hurry up and finish dying, for me to do a redeploy, or for one of the other GMs to figure "OK, that level appears locked but it's a very old lock so it's unlikely it is legit. I'll unlock it for you and grab the lock."
Levels generally die ~40 minutes after starting (but who knows with a failing GM). The lock ages open after ~1 hour.
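The locking scheme described above could be sketched roughly like this. It is an illustration under stated assumptions, not the actual GM code (which is Ruby and not shown in this thread): the class name `LevelLocks`, its methods, and the in-memory dict are all hypothetical, and a real deployment would need a store shared by every GM box. Only the ~1 hour expiry comes from the description.

```python
import time

LOCK_TTL_SECONDS = 60 * 60  # locks "age open" after roughly an hour


class LevelLocks:
    """Hypothetical per-(player, level) lock table with stale-lock takeover."""

    def __init__(self, ttl=LOCK_TTL_SECONDS, clock=time.time):
        self._ttl = ttl
        self._clock = clock      # injectable so expiry can be tested
        self._locks = {}         # (player, level) -> (box, acquired_at)

    def acquire(self, player, level, box):
        """Return True if `box` now holds the lock for this player's level."""
        key = (player, level)
        now = self._clock()
        holder = self._locks.get(key)
        if holder is not None:
            holder_box, acquired_at = holder
            # A fresh lock held by another box is respected.
            if holder_box != box and now - acquired_at < self._ttl:
                return False
        # No lock, our own lock, or a very old lock: take (or refresh) it.
        self._locks[key] = (box, now)
        return True

    def release(self, player, level, box):
        """Drop the lock when the level ends, but only if `box` owns it."""
        key = (player, level)
        holder = self._locks.get(key)
        if holder is not None and holder[0] == box:
            del self._locks[key]
```

With this shape, a player routed to a second box is refused while the first box's lock is fresh, and gets through once the lock is older than the TTL, which matches the "it will clear up in an hour" message.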
Distributed locking with liveness checks is exactly what zookeeper excels at.
I wrote a one-click deploy (via cloudformation) self-healing zookeeper cluster on AWS (wrapping netflix exhibitor for the pretty GUI and config reloads) that has been running since the beginning of 2014 on ~$100/mo in production without issue. It's roughly 200-300 lines of ops config, because exhibitor makes everything easy.
Consider biting the bullet! If you want to discuss further, my email is in my HN profile.
Perhaps it could be as simple as keeping a hash of levels tied to a box. When your monitoring alerts that your box is unhealthy, run a process to remove all locks associated with the dead box allowing the player to refresh and restart the session?
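A rough sketch of that suggestion, with hypothetical names throughout (`BoxLockIndex`, `lock`, `release_box`) and an in-memory dict standing in for whatever shared store the real system would use: keep a reverse index from box to the levels it has locked, so a monitoring alert can free everything a dead box was holding in one call.

```python
from collections import defaultdict


class BoxLockIndex:
    """Hypothetical lock table that can release all locks held by one box."""

    def __init__(self):
        self._owner = {}                 # (player, level) -> box
        self._by_box = defaultdict(set)  # box -> {(player, level), ...}

    def lock(self, player, level, box):
        """Return True if `box` holds the lock after this call."""
        key = (player, level)
        if key in self._owner:
            return self._owner[key] == box
        self._owner[key] = box
        self._by_box[box].add(key)
        return True

    def release_box(self, box):
        """Called from the health alert: free every lock the box owned."""
        freed = self._by_box.pop(box, set())
        for key in freed:
            del self._owner[key]
        return freed


index = BoxLockIndex()
index.lock("player1", "first_steps", "gm1")
print(index.release_box("gm1"))  # the locks that were just freed
```

The catch is that the health check has to be trustworthy: if a box is only slow rather than dead, releasing its locks lets the same level start on two boxes at once.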
Also, maybe a new error is getting mixed in. Response from /gm/levels:
{"ok":false,"error":"exit status 1"}
I was thinking you could treat it as a distributed hash ring with key shuffling, healing and the like, à la Riak. Or a virtual actor cluster à la .NET's Orleans.
You could use container migration to move people off dying nodes, or to redistribute the cluster.
Ultimately though, it's probably a pretty lean startup and there is only so much experience and effort you can apply to each area.
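To make the hash ring idea concrete, here is a minimal consistent-hashing sketch. It is not how Riak or Orleans implement it and not anything the Stockfighter team has said they use; the class `HashRing`, the `vnodes` count, and the key format are assumptions for illustration. Each box gets many points on a ring, a level is served by the first point at or after its hash, and removing a box only moves the levels that were on that box.

```python
import bisect
import hashlib


class HashRing:
    """Hypothetical consistent hash ring mapping level keys to GM boxes."""

    def __init__(self, nodes=(), vnodes=64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        # Several virtual points per node smooth out the distribution.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def node_for(self, key):
        if not self._ring:
            raise LookupError("no nodes in ring")
        # First point at or after the key's hash, wrapping around the ring.
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["gm1", "gm2", "gm3"])
print(ring.node_for("player1:first_steps"))
```

This only settles which box should own a level. Moving a running level's in-memory state off a dying box is the hard part, and the sketch does nothing about it.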
Managed to get through level 2 eventually, lot of retries while boxes were being rebooted etc. The calls to /gm must allocate an instance or something of that sort as once you manage to get lucky on that step the game will work it seems.
Raah, that was the part I was looking for, I wanted to use it to learn a bit about assembly and code generation - I couldn't care less about HFT (sorry Patrick).
Things are looking good, but we're decorating the family Christmas tree tonight, and so I took a break about 30 minutes ago. My current guess is that our part will be up tomorrow early.
Just for anyone interested:
We're pushing a "trainer" tomorrow, which is essentially "one level" of the CTF, so that people who have never looked at low-level embedded systems code can get the hang of it, and so we can, you know, not blow up when the actual CTF runs.
The trainer can do some things that the real CTF UI won't be able to do (because they would make levels too easy to clear), so I'm just fine with that. It's fun in a different way from the CTF.
The full CTF we'll post in a week or two. Ironically: it was pruning all the real CTF levels out that screwed me this week. You go in thinking "all we're doing is removing functionality, this will be easy", and by the end of the week you're setting your computer on fire.
Additionally, if you're especially anxious to be introduced to an assembly-themed CTF, you can also go look at Microcorruption. I would wager that Thomas and Erin's current project is "better", but Microcorruption still seems rather good.
It seems the launch was a bit premature, there are tons of errors. I think this is a symptom of a single developer writing everything from scratch. I respect both Thomas and Patrick for their opinions, but I think HN falsely equates that to engineering expertise.
I'm getting this error when trying to play the first level:
"We couldn't start the level: couldn't connect to the GM server to start the level. Refresh page to check if server is down?".
I assume it's just too many people hitting the server at the same time, but fyi.
We had a bug in email validation (worked on staging grumble grumble). Separately, our deploy script blows away memcached. Guess where email validation tokens were kept.
I temporarily turned off bust-cache-on-deploy and the rapidly rising number of validated emails suggests things are working now. Now back to more DevOps craziness.
I wanted to give you a status update of what new users currently experience. Or at least what I'm experiencing.
After creating an account, I was redirected to the front page with no notification about whether it succeeded or failed. No emails have arrived. When I try to create it again, I get "username already taken" and "email already taken", so something worked. But when I try to log in, I'm redirected to the front page with no notifications. In other words, login is failing.
To clarify, this isn't a complaint. Congratulations on launching, and apologies if you're already aware of these issues.
Edit: This is similar to https://news.ycombinator.com/item?id=10724699 but I'm unable to log in after clearing cookies. I get "Couldn't sign in. Check username/password" for a correct password.
Seems to be pretty stressed out at the moment, tons of 500s and timeouts. 2-3 refreshes per page, got to the Level controls dialog, nothing doing.
We couldn't start the level: This level has been locked by a different server. (Sorry if this happened accidentally as a result of a server crash -- it will clear up in an hour.) Refresh page to check if server is down?
Usernames must start with a letter? That is an odd requirement. I usually pad my username with zeroes when sites don't let me choose dfc. Out of curiosity, why the constraint?
Databases are usually sharded by the first letter of the username. Devs like to start their usernames with a number, which causes problems.
Also, usernames are injected into their code. Therefore they must be valid identifiers. Don't name yourself exit() or ret, or you'll crash their servers. Naming yourself nop will give you a distinct advantage relative to other players, however.
Update 5 min later: Maybe not. I logged in and while playing First Steps is listed in "Stuff You Can Do", clicking a level from Level Controls is instead throwing errors in the console.
I'm assuming this is not part of the game.
--
It looks like email invites went out about the same time as the HN post, but it seems to be good now.
+1 exactly the same problem. Maybe it is because of the high number of requests. I'm astonished that the site is still up despite being #1 on the front page. EDIT: Yup, clearing cookies solves this
From the welcome email: "You can play with any tech stack which speaks HTTP, or try doing everything by hand (or curl) if you like playing life on hard mode."
"We decided to found a company to do nothing other than create and operate the best programming challenges in the world."
"You can play whenever you want, at any pace which works for you. Some players will race to complete these. Some will play leisurely. Either is perfectly fine."
Congratulations on having the same sort of initial launch problems that other very successful games have!
Seriously. There's nothing worse than preparing for scale and then discovering you don't need it because there's not enough interest. Whereas I'm sure you'll get these initial load problems sorted soon enough!
While you are correct that it's not good to over-engineer without knowing what is needed, reading this morning's update shows that while you shouldn't over-prepare for scale and have it unused, this isn't mega-scale:
It seems like your nginx proxy is having trouble... I keep getting the same ip from DNS and nginx error messages. Maybe you should bring up some more and at the very least use DNS to send requests round-robin to the proxies.
We only have one Nginx box at the moment. You might have caught it during one of the Rails reboots I've been making in the last few minutes during redeploys.
How does the "finding jobs" part work?
Are you aiming to help people get a foot in the door at companies which would be looking to interview them, or trying to skip the "technical interview" part completely?
(generate interview opportunities vs job offers)
1. You play stockfighter and get the highest score this year.
2. patio11 emails you and says "hey, you're awesome. are you looking for a job?"
3. You say "yes!"
4. patio11 emails pc (CEO of Stripe) and says "hey 2oi4j3 is awesome, here is proof (i.e. your stockfighter score)"
5. pc emails you and says "hey let's talk about what job you would like at Stripe, and let's bypass most or all of that technical interview crap because we already know you're awesome"
6. you and pc decide that you would be best on the X team, you start next week.
I think skipping #5 is unwise. What's to stop someone from making a Starfighter account, scoring well, and then selling their account to some middling programmer?
I think the idea behind Starfighter is great. But if I was using their service, I'd take a "Trust, but verify" approach. I would not skip a technical interview. But I also wouldn't bombard them with inane whiteboard challenges or multi-hour take-home challenges.
However, I would want them to walk me through their code. And I would want to probe into why they made certain decisions. And I'd like to give them at least one challenge and witness their problem solving process. So #5 would most certainly not be skipped.
But if the expectation of this service is there's absolutely no technical vetting process outside of this game then I guess it wouldn't be a service I'd ever use to hire someone. I don't believe that's the case though.
They are contingent recruiters much like any other. They generate leads to companies for their hiring pipeline. If that lead makes it through the interview process they get paid (likely priced as a percentage of first year salary).
They differentiate themselves in that they claim their game will generate better leads than email spam, trolling LinkedIn & scuzzy phone sales.
I believe them, but I'd still interview anyone they sent me.
We couldn't start the level: This level has been locked by a different server. (Sorry if this happened accidentally as a result of a server crash -- it will clear up in an hour.) Refresh page to check if server is down?
So much reading about what it does. Don't assume people already know how to play a programming game. Most of us don't. Is there an intro video that at least we can watch while you address the server issues?
> Our servers are under continuing heavy load due to having launched recently and the site being on the front page of HN. This is causing poor performance and undefined behavior.
How is that possible? As I understand, undefined behaviour is either there when you compile, or it isn't (it is a function of the source code). So if the undefined behaviour is a function of traffic intensity, I assume you compile a different C code base when you have high load. I've never heard of something like this before.
It could be the high load causing errors that are in turn causing unexpected flow-on effects. For example, an earlier thread talked about how you're locked to a specific server in the pool and if your server goes down, you can't continue because no other server in the pool will talk to you until the lock expires. Having said that, I'm getting around 4 different errors randomly, so it could be a number of things playing out in undefined ways.
Yes and no. According to Amazon, "EC2 is a web service that provides resizable compute capacity in the cloud." Really, an EC2 instance is just a virtual computer. You can, with the click of a mouse, create a new EC2 instance and load a pre-defined system image onto it. This allows you to, rather quickly, manually increase or decrease the number of EC2 instances based on load. Automatic scaling of EC2 instances can also be done, although it would be provided at a higher layer.
However, a more critical factor is whether the architects designed their systems and software to scale effectively when the number of EC2 instances is adjusted. If there is a bottleneck that is not properly addressed by adding EC2 instances, then adding EC2 instances will be pointless.
Is this meant for people interested in security? I ask because the main site says you guys are making CTFs to replace interviews, but I'm not sure if that's interviews for security roles or all developers.
I'd be interested in either, but for the moment I'd rather not add yet another thing for me to learn.