Stockfighter is live (stockfighter.io)
549 points by jzig on Dec 12, 2015 | 138 comments



Hey guys -- the most resource-intensive part of the infrastructure (the game master boxes) is currently overwhelmed by your enthusiasm. We're spinning up more boxes for you, but it will take a few minutes.


I BLAME RAILS.


Thomas and I have a running gag about this. Rails is holding up admirably. You can still actually see the website. Our problem is presently that one player implies 1+ ruby (not Rails) processes running bots and doing orchestration for their levels, and that has pegged the CPU on all of our GM boxes and also, possibly, saturated their network links as they hit the venue with... I don't actually know how many orders have been placed already but I'm going to wager a guess that it's "a lot."


I so want to read a postmortem of the launch issues. It's one thing when a corporate project with crazy requirements goes down (almost expected these days :( ). Completely different when a service started/run by known tech people with their own schedule breaks on launch. I'd be really interested in what testing was done and why it didn't match reality. (not a jab/criticism, just something to actually learn from)


Meh? It's a play-money stock exchange which had processed over 200k orders as of a few seconds ago. We're in the process of finding the parts of the plumbing that weren't sized right for hundreds of simultaneous players and tens of thousands of simultaneous actors (each bot is approximately as taxing on the infra as a human -- more, at the moment, since many folks can't read the instructions to start hitting us with orders and the bots already know how to do that).


Don't underestimate how valuable "meh" post mortems are. I for one, would love to see your usual writing ability applied to one.

Weeks from now when this is all a funny anecdote of course...


I confess that I would love to see Patrick's usual writing style and ability applied to almost anything. I'm not sure why, but he just seems to be really ... genuine? Personable? It's like I'm listening to a friend talk -- a friend who happens to either be an expert, or have really interesting things to say.

I want to say that it's bizarre, but I think it's mainly that he writes very well.


I wondered if Patrick wrote "The StarFighter Handshake", the friendly guidelines that take the place of Terms And Conditions when you sign up. It made the usual legalese (i.e. don't DDOS our site, don't hack into personal user data) feel like shared values of a club you were about to join. That had a hugely positive effect on me when I saw & read it, and I'm not even entirely sure why I was so impressed. [Kudos, whoever wrote it.]


Did you guys consider using Elixir/Phoenix? Seems like it would have been a good fit (in terms of scalability) for what you guys have built.


> Rails is holding up admirably. You can still actually see the website.

Site is down now... (at least in Europe)


Site has been mostly down for me in the US over the last ten minutes.


Give it time, for me it thought about it a few minutes, and then rendered :)


> Rails is holding up admirably.

... except that you have to spin up more boxes.

I think you have a different definition of "admirably" than most of us :-)

Follow Twitter's lead and switch to a JVM-based solution when you get a chance, you'll be shocked.


We have one Rails box w/ ~2 gigs of RAM which has been sitting pretty since the golang process also on the box stopped hogging 100% of the available file descriptors. Our scaling issue is with ruby processes, of which we have several thousand active at the moment. Each process runs one wee little universe, filled with wee little traders trading wee little stocks, for the benefit of a single player playing a single level.

These aren't in a JVM language (or Golang, for that matter) out of expediency: I had to write the economic simulation and the trading algorithms and get it done in about 2 weeks, so I wrote them in my best language, Ruby.


Curious--is this something with which JRuby could help?


I'm a fan of the JVM, but it doesn't remove the need to be prepared to scale up by adding more machines. That said, it's great at concurrency, so it's easier to harness the full capability of a machine than with other platforms.

Even if I didn't need multiple machines for throughput I'd probably still want them for redundancy (redundancy of machine hardware as well as redundancy across availability zones). (Note: didn't vote on your post)


I thought your stack was mostly go?


It currently appears to be mostly stop.


It's mostly Go except where it isn't.


I suspect that was irony...


In a blog post some months ago [1], Patrick said that some components do use Rails.

[1] http://www.kalzumeus.com/2015/08/20/designing-and-building-s...


I know using Rails isn't as popular here as it once was, but in my experience, if you cache correctly (like you should), it's trivial to scale.


Modestly less trivial if you happen to put your chat server on the box with Rails because Rails is certainly not going to fall over and your chat server attempts to open several thousand websockets on a box with a not-too-generous number of open file descriptors, it appears.


"Make sure you've turned up the file descriptors" is the single-process socket server equivalent of "Turn off HTTP KeepAlive" for Apache sites. :)

(Yeah, I've had an outage because of this too.)
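
For readers who have not hit this before: the limit in question is the per-process cap on open file descriptors (RLIMIT_NOFILE on Unix-like systems), and every websocket counts against it. Below is a minimal sketch of checking and raising the soft limit from inside a process. It is in Python rather than the Go and Ruby discussed in the thread, and the helper name `raise_fd_limit` and the target value are illustrative assumptions, not anything from the Stockfighter codebase.

```python
import resource  # Unix-only standard library module


def raise_fd_limit(target: int) -> int:
    """Raise the soft RLIMIT_NOFILE toward `target`, capped by the hard limit.

    Returns the soft limit in effect afterwards. Hypothetical helper for
    illustration; an unprivileged process cannot exceed the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # never ask for more than the hard cap
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft fd limit now:", raise_fd_limit(65536))
```

A socket server would call something like this once at startup. Going past the hard limit needs a change outside the process, for example in the shell's `ulimit` settings or the service manager's configuration.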


Surely you meant "Turn on HTTP pipelining"? Or is there something wrong with Apache's pipelining implementation?


As I understand, HTTP pipelining is basically a no-go in the real world. Broken implementations, plus performance implications since responses must go in order.

Edit: Here's a FF discussion on it: https://bugzilla.mozilla.org/show_bug.cgi?id=264354


I misremembered a bit; it was Turn Off Apache Keepalive: https://hn.algolia.com/?query=patio11%20apache%20keepalive&s...


I feel like a chat server is something best served by something not in your stack (especially Rails). I'm surprised you chose to do it in house. How about using https://pusher.com/ or https://www.firebase.com/ ?


Perhaps those lack useful vulnerabilities?


Why in this day and age of cheap and easy servers would you ever have more than one service type running on a server in production? It's trivial to break things up nowadays


Looks like it's actually Go based on one of the JSON responses I found in the web page:

    {
      ok: false,
      error: "strconv.ParseInt: parsing "NaN": invalid syntax"
    }


Congrats on having the problems you want to have. :)


These are definitely good problems to have!


I hope you get this fixed soon. It can be quite stressful having your exposure period potentially damaged because your software or infrastructure can't handle the sudden influx.

Stick to deep breaths and think it through :)


I'm pretty calm at the moment, particularly now that the servers seem to be cooling down a bit.

Startup culture says Launch Days Are Really Important. This is not our first rodeo. We know 0.01% of the people who are going to play this game are even aware it exists at the moment. We're pleased as punch that y'all are playing and some of you are progressing in it, but long-term, a few teething problems on launch day are a no-op.


This is part of the game. To beat this level you sweet talk the founders into letting you help them fix their scalability problems. If you get the site running in 3 hours, you get a bonus score of ∞.


∞ score doesn't sound bad. I should start up an LLC and give away 99% of my score to that LLC, for charity purposes.


I can't get the first level to play, seems things are still overloaded. Really looking forward to this though, hope you will post or email again once things settle to remind us about it.

edit: this is the message I'm getting when trying to play First Steps, if it helps. The dashboard says "Trades.Exec() (Status: playable!)" (3:08 UTC)

> We couldn't start the level: This level has been locked by a different server. (Sorry if this happened accidentally as a result of a server crash -- it will clear up in an hour.) Refresh page to check if server is down?


I'm going to try to build a self-help tool for unlocking level instances.

Briefly: you get round-robined to a (healthy, importantly) GM machine the first time you try to start a level and that association needs to be fairly sticky. Once you've started the level, a piece of that box's memory is all your own until the level ends. To prevent a single player from getting memory on all the boxes we "lock" a level to a particular box. While the level is running, no other box will open the same level for the same player.

The problem is when the GM box running your level becomes unhealthy. Nginx will put you in touch with a different GM, which could possibly mean your level is still running on a box you can't contact. At that point, you have to wait for either that box to hurry up and finish dying, for me to do a redeploy, or for one of the other GMs to figure "OK, that level appears locked but it's a very old lock so it's unlikely it is legit. I'll unlock it for you and grab the lock."

Levels generally die ~40 minutes after starting (but who knows with a failing GM). The lock ages open after ~1 hour.
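
The locking scheme described above (one lock per player and level, owned by a single GM box, and treated as stale after about an hour) can be sketched roughly as follows. This is a single-process, in-memory illustration under stated assumptions: the class name `LevelLocks`, its methods, and the injectable clock are all hypothetical, and the real system would need the lock table in some store shared by every GM box.

```python
import time
from typing import Callable, Dict, Tuple

LOCK_TTL_SECONDS = 60 * 60  # the comment above says locks age open after ~1 hour


class LevelLocks:
    """In-memory sketch of per-(player, level) locks with stale-lock takeover."""

    def __init__(self, ttl: float = LOCK_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock  # injectable so the expiry logic can be tested
        self._locks: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def acquire(self, player: str, level: str, box: str) -> bool:
        """Return True if `box` now holds the lock for this player's level."""
        key = (player, level)
        now = self._clock()
        held = self._locks.get(key)
        if held is not None:
            owner, taken_at = held
            if owner != box and now - taken_at < self._ttl:
                return False  # a different box holds a fresh lock
        # No lock, our own lock, or a stale lock: take (or refresh) it.
        self._locks[key] = (box, now)
        return True

    def release(self, player: str, level: str, box: str) -> None:
        """Drop the lock, but only if `box` is the current owner."""
        key = (player, level)
        held = self._locks.get(key)
        if held is not None and held[0] == box:
            del self._locks[key]
```

With a fake clock, a second box is refused while the lock is fresh and succeeds once the lock is older than the TTL, which matches the "it will clear up in an hour" message players were seeing.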


Distributed locking with liveness checks is exactly what zookeeper excels at.

I wrote a one-click deploy (via cloudformation) self-healing zookeeper cluster on AWS (wrapping netflix exhibitor for the pretty GUI and config reloads) that has been running since the beginning of 2014 on ~$100/mo in production without issue. It's roughly 200-300 lines of ops config, because exhibitor makes everything easy.

Consider biting the bullet! If you want to discuss further, my email is in my HN profile.


Perhaps it could be as simple as keeping a hash of levels tied to a box. When your monitoring alerts that your box is unhealthy, run a process to remove all locks associated with the dead box allowing the player to refresh and restart the session?

Also, maybe a new error is getting mixed in. The response from /gm/levels is {"ok":false,"error":"exit status 1"}
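
The lock-cleanup idea in this comment (index locks by box so they can all be dropped when monitoring declares a box dead) might look something like the sketch below. The class `BoxLockIndex` and its method names are made up for illustration, and it assumes a single shared lock table. One caveat: if a box is flagged unhealthy but is still actually running levels, releasing its locks could let a second instance of the same level start elsewhere.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

LevelKey = Tuple[str, str]  # (player, level)


class BoxLockIndex:
    """Sketch: level locks indexed by owning box for bulk release."""

    def __init__(self) -> None:
        self._owner: Dict[LevelKey, str] = {}
        self._by_box: Dict[str, Set[LevelKey]] = defaultdict(set)

    def lock(self, player: str, level: str, box: str) -> bool:
        """Lock the level to `box`; False if another box already owns it."""
        key = (player, level)
        if key in self._owner:
            return self._owner[key] == box
        self._owner[key] = box
        self._by_box[box].add(key)
        return True

    def release_box(self, box: str) -> int:
        """Drop every lock held by `box`; returns how many were released."""
        keys = self._by_box.pop(box, set())
        for key in keys:
            del self._owner[key]
        return len(keys)
```

A monitoring hook would call `release_box("gm-3")` (box name hypothetical) when that box fails its health check, after which affected players could refresh and restart on a healthy box.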


I was thinking you could treat it as a distributed hash ring with key shuffling, healing and the like à la Riak. Or a virtual actor cluster à la .NET's Orleans.

You could use container migration to move people off dying nodes, or to redistribute the cluster.

Ultimately though, it's probably a pretty lean startup and there is only so much experience and effort you can apply to each area.


Managed to get through level 2 eventually, lot of retries while boxes were being rebooted etc. The calls to /gm must allocate an instance or something of that sort as once you manage to get lucky on that step the game will work it seems.


Well, sort of. Erin and I are scrambling on last-minute emulator/compiler bugs, but Patrick is letting people play with his trading exchange levels.


Raah, that was the part I was looking for, I wanted to use it to learn a bit about assembly and code generation - I couldn't care less about HFT (sorry Patrick).

Good luck!


Things are looking good, but we're decorating the family Christmas tree tonight, and so I took a break about 30 minutes ago. My current guess is that our part will be up tomorrow early.

Just for anyone interested:

We're pushing a "trainer" tomorrow, which is essentially "one level" of the CTF, so that people who have never looked at low-level embedded systems code can get the hang of it, and so we can, you know, not blow up when the actual CTF runs.

The trainer can do some things that the real CTF UI won't be able to do (because they would make levels too easy to clear), so I'm just fine with that. It's fun in a different way from the CTF.

The full CTF we'll post in a week or two. Ironically: it was pruning all the real CTF levels out that screwed me this week. You go into thinking "all we're doing is removing functionality, this will be easy", and by the end of the week you're setting your computer on fire.


Additionally, if you're especially anxious to be introduced to an assembly-themed CTF, you can also go look at Microcorruption. I would wager that Thomas and Erin's current project is "better", but Microcorruption still seems rather good.


It seems the site is extremely broken. I can't navigate anywhere, and clicking links that have "#" hrefs doesn't do anything.

    ["handleListAllowedLevels", Object]
    blotter-bundle-dd01d8c9.js:1549 No level specified in URL.
    blotter-bundle-dd01d8c9.js:1605 ["enterLevel does not think it can reenter level: ", "first_steps", NaN, Object]
    blotter-bundle-dd01d8c9.js:2602 ["handleEnterLevel -> startLevel", Object]
    blotter-bundle-dd01d8c9.js:2556 ["Error starting level:", Object]


It seems the launch was a bit premature, there are tons of errors. I think this is a symptom of a single developer writing everything from scratch. I respect both Thomas and Patrick for their opinions, but I think HN falsely equates that to engineering expertise.


Congrats on launching!

I'm getting this error when trying to play the first level: "We couldn't start the level: couldn't connect to the GM server to start the level. Refresh page to check if server is down?".

I assume it's just too many people hitting the server at the same time, but fyi.


Just out of curiosity, could you post stats about the load on your site and what kinds of issues you faced? I would appreciate it.


Definitely something we'll do; later.


Unable to validate my email. Earlier it wasn't working at all, now it tells me I need to be logged in (but I am, I think).

Edit: Fixed for me now.


We had a bug in email validation (worked on staging grumble grumble). Separately, our deploy script blows away memcached. Guess where email validation tokens were kept.

I temporarily turned off bust-cache-on-deploy and the rapidly rising number of validated emails suggests things are working now. Now back to more DevOps craziness.


I wanted to give you a status update of what new users currently experience. Or at least what I'm experiencing.

After creating an account, I was redirected to the front page with no notification about whether it succeeded or failed. No emails have arrived. When I try to create it again, I get "username already taken" and "email already taken", so something worked. But when I try to log in, I'm redirected to the front page with no notifications. In other words, login is failing.

To clarify, this isn't a complaint. Congratulations on launching, and apologies if you're already aware of these issues.

Edit: This is similar to https://news.ycombinator.com/item?id=10724699 but I'm unable to log in after clearing cookies. I get "Couldn't sign in. Check username/password" for a correct password.


For email validation you can also just use a signed JSON token (JWT), so that the only thing you need to store on your servers is the secret key.
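
To make the stateless-token suggestion concrete, here is a rough sketch using only the Python standard library. It is a JWT-like HMAC-signed token, not a spec-compliant JWT (there is no header segment), and the function names, payload fields, and default lifetime are assumptions for illustration. A real deployment would more likely use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    # URL-safe base64 without padding, so the token fits in a link.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, payload: str) -> str:
    return _b64(hmac.new(secret, payload.encode("ascii"), hashlib.sha256).digest())


def make_token(secret: bytes, email: str, ttl: int = 86400,
               now: Optional[float] = None) -> str:
    """Build 'payload.signature' carrying the email and an expiry time."""
    issued = int(time.time() if now is None else now)
    claims = {"email": email, "exp": issued + ttl}
    payload = _b64(json.dumps(claims).encode("utf-8"))
    return payload + "." + _sign(secret, payload)


def verify_token(secret: bytes, token: str,
                 now: Optional[float] = None) -> Optional[str]:
    """Return the email if the signature is valid and unexpired, else None."""
    try:
        payload, sig = token.split(".")
        expected = _sign(secret, payload)
        if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("ascii")):
            return None
        claims = json.loads(_unb64(payload))
    except ValueError:  # wrong shape, bad base64, bad JSON, non-ASCII input
        return None
    current = time.time() if now is None else now
    if current >= claims["exp"]:
        return None
    return claims["email"]
```

Because nothing is stored server-side, a deploy that clears the cache would not invalidate outstanding validation links. The trade-off is that a token cannot be revoked before it expires without keeping some state after all.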


I'm in the same loop. Click link, verify email, tells me to login, already logged in, re log in, verify url, still says verify.

With that. Congrats to all three of you. It's an exciting day

edit: Woot. Fixed.


Blow away cookies for the site and try again.


Try again. This was happening to me as well. I eventually generated a new email and it worked.


Seems to be pretty stressed out at the moment, tons of 500s and timeouts. 2-3 refreshes per page, got to the Level controls dialog, nothing doing.

We couldn't start the level: This level has been locked by a different server. (Sorry if this happened accidentally as a result of a server crash -- it will clear up in an hour.) Refresh page to check if server is down?



Thanks; fixed.


Usernames must start with a letter? That is an odd requirement. I usually pad my username with zeroes when sites don't let me choose dfc. Out of curiosity, why the constraint?


Databases are usually sharded by the first letter of the username. Devs like to start their usernames with a number, which causes problems.

Also, usernames are injected into their code. Therefore they must be valid identifiers. Don't name yourself exit() or ret, or you'll crash their servers. Naming yourself nop will give you a distinct advantage relative to other players, however.

I don't know.


It took me too long to realize you weren't serious


UPDATE: Can log in now. Cleared browser cookies, but have no idea if that's a coincidence or not.

I seem to have hit some kind of bug signing up:

1. Filled out signup form, hit enter.

  > Page reloaded, sign-up form empty. No other feedback.
2. Filled out signup form again, hit enter.

  > Told username and email address taken.
3. Check email.

  > no new email.
4. Filled out login form using info entered in step 1.

  > Page reloaded, not logged in.


Update 5 min later: Maybe not. I logged in and while playing First Steps is listed in "Stuff You Can Do", clicking a level from Level Controls is instead throwing errors in the console.

I'm assuming this is not part of the game.

--

It looks like email invites went out about the same time as the HN post, but it seems to be good now.


+1 exactly the same problem. Maybe it is because of the high number of requests. I'm astonished that the site is still up despite being #1 on the front page. EDIT: Yup, clearing cookies solves this


Try clearing cookies for the site and try logging back in.


From the welcome email: "You can play with any tech stack which speaks HTTP, or try doing everything by hand (or curl) if you like playing life on hard mode."

Love this.


Trying to start the first level:

"Couldn't resume the level because Couldn't connect to the GM server to resume the level."


I'm on the instructions for level 2 and it points to the API docs (http://starfighters.readme.io/) but that URL is giving me a 404.


There's an extra 's', try https://starfighter.readme.io/


Yup, I got it now from the API docs link in the level. Seems like a typo in the level 2 instructions.


Really awesome to see it launch!

It looks like the GM server is down at the moment, so I have yet to start the first level, but I'm more than happy to mash the F5 key till it works.


If we all mash F5, maybe it will start up faster!


I know what will help, writing a script to mash F5 faster and notify me when it works.


Is there any deadline or is this a long-running project? I'd love to play with this but I don't have any time at the moment.


We intend this CTF to be the flagship product of our company for the next several years. If you can't play it today, no worries.


Thx for the info and all the best for your new company :)


From their launch announcement:

"We decided to found a company to do nothing other than create and operate the best programming challenges in the world."

"You can play whenever you want, at any pace which works for you. Some players will race to complete these. Some will play leisurely. Either is perfectly fine."

https://discuss.starfighters.io/t/welcome-to-stockfighter/14...

So come on over when you have time and enjoy yourself!


It looks like it's a long running project. From their announcement post[1]:

"This first release is Chapter 1. We intend to release a new chapter approximately every 6 to 8 weeks."

[1]https://discuss.starfighters.io/t/welcome-to-stockfighter/14...


Congratulations on having the same sort of initial launch problems that other very successful games have!

Seriously. There's nothing worse than preparing for scale and then discovering you don't need it because there's not enough interest. Whereas I'm sure you'll get these initial load problems sorted soon enough!


While you are correct that it's not good to over-engineer without knowing what is needed, this morning's update shows that this isn't mega-scale:

https://discuss.starfighters.io/t/state-of-the-game-last-upd...

tldr: It took ~3000 game instances from < 5000 sign-ups (and therefore I assume << 5000 concurrent players) to bring the game down.


You didn't do any load testing before releasing this on HN?


The handshake agreement presented here makes me very happy.

Were many revisions done on it, or was it pretty clear to write?


I wrote it in one take 15 minutes before hitting Go.


It seems like your nginx proxy is having trouble... I keep getting the same ip from DNS and nginx error messages. Maybe you should bring up some more and at the very least use DNS to send requests round-robin to the proxies.


We only have one Nginx box at the moment. You might have caught it during one of the Rails reboots I've been making in the last few minutes during redeploys.


Congrats on launching!

If you're looking to dive into the problems more immediately, there's a wealth of API clients and other resources available in the forums: https://discuss.starfighters.io/t/helpful-external-tools/136

Don't be afraid to try the game if you don't have much programming experience. It's seriously a great way to pick up some valuable skills.


Confusing, still pretty buggy. Maybe it's a bit early to release?

I'll try to compile some constructive feedback in a bit.


How does the "finding jobs" part work? Are you aiming to help people get a foot in the door at companies which would be looking to interview them, or trying to skip the "technical interview" part completely? (generate interview opportunities vs job offers)


http://starfighters.io/ (click "how?" in the last paragraph)


hmm yeah I get that but still, what does an "introduction" mean? :)


Sequence of events:

1. You play stockfighter and get the highest score this year.

2. patio11 emails you and says "hey, you're awesome. are you looking for a job?"

3. You say "yes!"

4. patio11 emails pc (CEO of Stripe) and says "hey 2oi4j3 is awesome, here is proof (i.e. your stockfighter score)"

5. pc emails you and says "hey let's talk about what job you would like at Stripe, and let's bypass most or all of that technical interview crap because we already know you're awesome"

6. you and pc decide that you would be best on the X team, you start next week.


7. pc sends stockfighter a bunch of money


Yes, quite right. I forgot that part.


Really hoping they can also find non-full-time placements.


I think skipping #5 is unwise. What's to stop someone from making a Starfighter account, scoring well, and then selling their account to some middling programmer?


What's to stop someone from diving head first down the stairs to speed up their descent?


I think the idea behind Starfighter is great. But if I were using their service, I'd take a "Trust, but verify" approach. I would not skip a technical interview. But I also wouldn't bombard them with inane whiteboard challenges or multi-hour take-home challenges.

However, I would want them to walk me through their code. And I would want to probe into why they made certain decisions. And I'd like to give them at least one challenge and witness their problem solving process. So #5 would most certainly not be skipped.

But if the expectation of this service is there's absolutely no technical vetting process outside of this game then I guess it wouldn't be a service I'd ever use to hire someone. I don't believe that's the case though.


that business model... if you've nothing nice to say, say nothing, i'll keep to that


"say nothing" ...hm... looks like you failed. Better luck next time!


They are contingent recruiters much like any other. They generate leads to companies for their hiring pipeline. If that lead makes it through the interview process they get paid (likely priced as a percentage of first year salary).

They differentiate themselves in that they claim their game will generate better leads than email spam, trolling LinkedIn & scuzzy phone sales.

I believe them, but I'd still interview anyone they sent me.


I wasn't able to access the levels yet unfortunately. Is there still some testing going on?


Does anyone know what tool is used to create the stockfighter API doc [1]? Looks really nice.

[1] https://starfighter.readme.io/v1.0/docs


It looks like they chose the three column option from https://readme.io/ .



We couldn't start the level: This level has been locked by a different server. (Sorry if this happened accidentally as a result of a server crash -- it will clear up in an hour.) Refresh page to check if server is down?


So much reading about what it does. Don't assume people already know how to play a programming game. Most of us don't. Is there an intro video that at least we can watch while you address the server issues?


https://starfighter.readme.io/ top navigation "Stockfighter" link links to staging.


Well done on getting it out there. Also well done on ensuring you get no time off during the holiday season :D

Will be jamming it in the morning. Hope you guys get some sleep at least.


Wow, I don't think I've ever had a confirmation email arrive so fast.

FWIW, I'm also hitting the rate limit exceeded error when trying to access the "first_steps" level.


same and same


Can't log in with my migrated account. Here's what I get on pw recovery:

"The change you wanted was rejected.

Maybe you tried to change something you didn't have access to."


Try clearing your cookies. They flipped on the secure flag, and it borked a few things.


Woohoo, Congrats on going live. I am excited to play around!


Are you aware that the API docs have the subdomain "starfighter" on readme.io rather than "stockfighter"? Seems... strange.


That's hilarious. Now they have "stockfighter" too ;)

https://stockfighter.readme.io/docs/getting-started


> Our servers are under continuing heavy load due to having launched recently and the site being on the front page of HN. This is causing poor performance and undefined behavior.

How is that possible? As I understand, undefined behaviour is either there when you compile, or it isn't (it is a function of the source code). So if the undefined behaviour is a function of traffic intensity, I assume you compile different C code base when you have high load. I never heard about something like this before.


It could be that the high load is causing errors with unexpected flow-on effects. For example, an earlier thread talked about how you're locked to a specific server in the pool and if your server goes down, you can't continue because no other server in the pool will talk to you until the lock expires. Having said that, I'm getting around 4 different errors randomly, so it could be a number of things playing out in undefined ways.


I think he means unexpected behaviour, due to bugs, probably race conditions and not handling errors well enough.


Awesome, I've been looking forward to this! Signup went smoothly for me and I'll jump into the first level as soon as I can.


Cool! Sounds like there's some teething problems right now, so I'll put this on the check it out later list.


I'm getting "Level Instance is not running" on the first level.


Getting a lot of 500 NGINX errors trying to confirm email or login.


Congratulations!! Really happy for you Patrick


Level 1 complete!


Nice try, HFT company that wants free algorithms.


Except the trading interface is an API. You can keep all your precious algorithms on your own computer.


I'm not a devops person, but I'm seriously curious: wouldn't the scalability problem they are experiencing already have been solved by using Amazon EC2 and the like?



not sure if you are joking, since they are using ec2 afaik


Okay. So isn't the purpose of that to auto-scale based on system load? Am I missing something?


Yes and no. According to Amazon, "EC2 is a web service that provides resizable compute capacity in the cloud." Really, an EC2 instance is just a virtual computer. You can, with the click of a mouse, create a new EC2 instance and load a pre-defined system image onto it. This allows you to, rather quickly, manually increase or decrease the number of EC2 instances based on load. Automatic scaling of EC2 instances can also be done, although it would be provided at a higher layer.

However, a more critical factor is whether the architects designed their systems and software to scale effectively when the number of EC2 instances is adjusted. If there is a bottleneck that is not properly addressed by adding EC2 instances, then adding EC2 instances will be pointless.


Auto scaling has a cost effect too which is not small if you're doing it from your own pocket.


Suggestion: deploy your GM server as Docker containers on kubernetes and auto-scale while you enjoy some fine wine in a casual atmosphere. :).

Seriously, good luck. Success hurts sometimes.


Is this meant for people interested in security? I ask because the main site says you guys are making CTFs to replace interviews, but I'm not sure if that's interviews for security roles or all developers.

I'd be interested in either, but for the moment rather not add yet another thing for me to learn



