Lichess: Post-Mortem of Our Longest Downtime (lichess.org)
176 points by jpablo 3 months ago | 53 comments



The main lichess engine (lila, open source) is a single monolith program that's deployed on a single server. It serves ~5 million games per day. But there are several other pieces too. They discuss the architecture here https://www.youtube.com/watch?v=crKNBSpO2_I

BTW consider donating if you use lichess.


Wow. ~US$40k/mo running costs, with about US$5k/mo for server hosting:

https://lichess.org/costs

It looks like the servers are individually managed via OVH or similar, rather than running their own gear in co-location or similar. Wonder why?


Easy: If something is wrong with the physical gear it's OVH's problem rather than theirs. It also means no one has to ever go to the data center which is probably important for a geographically distributed team (I assume they are). Cheap, no-frills cloud is extremely underrated, IMO.


Underrated? The flip side is that hardware failures are still your problem, like they would be with rolling your own hosting. I think they’re correctly rated for the position on the scale of tradeoffs that they provide.


Surprising numbers, and really goes to show how cheap the hardware/software side is for this sort of thing if you do it right.

I wonder what the "Misc dev salaries" is for - only curious because it's a flat $5k


Heh heh heh.

To me those numbers seem on the high side as I'm (personally) used to (for cheap projects) scavenging together stuff from Ebay before deploying to a data centre. ;)


lichess is hardly a "cheap project" though :P It's one of the most popular chess platforms


Sure, but they seem to be extremely budget constrained. ;)


no surprise there tbh

Here is a comparison of free and their premium accounts:

https://lichess.org/features


Looks like they're fulfilling their mission?


It's also crazy how much cheaper it is than AWS. The primary DB is around $500/month with 32 CPU, 256 GB of RAM, and 7 TB of storage. An AWS RDS db.m6gd.8xlarge, which has 32 CPU and 128 GB of RAM, costs $2150/month before paying for storage as well.
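
A quick back-of-the-envelope check of that comparison, using only the figures quoted above. The variable names are illustrative, and the RDS price excludes storage, so the real gap would be wider:

```python
# Monthly prices and specs as quoted in the comment; names are illustrative.
dedicated_monthly_usd = 500   # 32 CPU, 256 GB RAM, 7 TB storage included
rds_monthly_usd = 2150        # db.m6gd.8xlarge: 32 CPU, 128 GB RAM, storage extra

# How many times more the RDS instance costs per month.
price_ratio = rds_monthly_usd / dedicated_monthly_usd

# Price per GB of RAM, since the two machines differ in memory.
usd_per_gb_ram_dedicated = dedicated_monthly_usd / 256
usd_per_gb_ram_rds = rds_monthly_usd / 128

print(f"RDS costs {price_ratio:.1f}x the dedicated server per month")
print(f"USD per GB of RAM: {usd_per_gb_ram_dedicated:.2f} vs {usd_per_gb_ram_rds:.2f}")
```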


Yeah, but you get what you pay for. That m6gd.8xlarge would never be subject to such a long network outage as once the hardware fault was detected, it would be moved to another machine


Yup, and you also get to make AWS deal with OS upgrades, DB upgrades, backups, etc.


You have to pay 2x for multi-AZ or you get downtime for upgrades. And DB major version upgrades require manual effort unless you want to roll the dice on their new blue-green feature, which can take hours to fail or finish cutting over.


> You have to pay 2x for multi-AZ or you get downtime for upgrades.

Worse. In Single AZ deployments you get (short, but not that short or strongly bound) downtime for daily backups and when doing snapshots. Source:

- https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_...: "During the automatic backup window, storage I/O might be suspended briefly while the backup process initializes (typically under a few seconds). [...] For MariaDB, MySQL, Oracle, and PostgreSQL, I/O activity isn't suspended on your primary during backup for Multi-AZ deployments because the backup is taken from the standby. ",

- https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_...: "Amazon RDS creates a storage volume snapshot of your DB instance, backing up the entire DB instance and not just individual databases. Creating this DB snapshot on a Single-AZ DB instance results in a brief I/O suspension that can last from a few seconds to a few minutes, depending on the size and class of your DB instance.".

Not to mention that multi-AZ deployments incur extra transfer cost between zones - not between DB instances (this one is free, last time I checked), but between your compute deployments and DB instances, if your compute does not automatically follow the zone of the db host it talks to.


Given that the full AWS setup that would replace that one server would cost closer to $6000-8000 / month, they could just use that money to buy a bunch of extra hard drives, a backup server, and hire a junior dev/sysadmin whose only job is to watch over it, still coming out ahead of AWS.


I’m seeing more like $5k/mo with reservations but even at those figures … how many skilled DBAs are you getting for $72-96k? Don’t forget that rolling it yourself means you have to build and test all of the hardware, maintenance processes, backups, multi-data center HA, etc. yourself. That’s not junior trainee level work and some of it is ongoing (e.g. every OS and hardware change) or at intervals not of your choosing (say you discover a kernel or firmware issue when it’s crashing randomly - how many months of savings will be canceled out by pulling senior people off of whatever they’re working on?).

You can beat AWS on pricing but not like this. You need to be finding areas where you have a lot of baseline demand – enough to amortize the cost of all of the lower level work – and can cut some of the things they do which you don’t need. For example, if you can afford more downtime in a disaster scenario or can rely on an external rebuild process if the database backups turn out to be unusable.
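
For reference, the $72-96k figure is the $6000-8000/month estimate annualized. A small sketch of that arithmetic follows. Subtracting the ~$500/month dedicated server quoted earlier in the thread, and treating that one server as the whole alternative, is an assumption made here for illustration:

```python
# Monthly estimates quoted in the thread; names are illustrative.
aws_monthly_low_usd, aws_monthly_high_usd = 6000, 8000
dedicated_monthly_usd = 500  # assumption: the single server is the whole alternative


def annual(monthly_usd: int) -> int:
    """Convert a monthly cost to a yearly cost."""
    return monthly_usd * 12


# Yearly AWS spend: the "$72-96k" range.
aws_annual_low = annual(aws_monthly_low_usd)
aws_annual_high = annual(aws_monthly_high_usd)

# What is left for staff and spare hardware after paying for the dedicated server.
budget_low = annual(aws_monthly_low_usd - dedicated_monthly_usd)
budget_high = annual(aws_monthly_high_usd - dedicated_monthly_usd)

print(f"AWS per year: ${aws_annual_low:,} to ${aws_annual_high:,}")
print(f"Left over per year: ${budget_low:,} to ${budget_high:,}")
```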



I'm a patron!

I really appreciate the benefits package for patrons. Thibault is zee best.


There's also a nice architectural diagram on their GitHub page: https://github.com/lichess-org/lila


I guess some of my questions are addressed in the latter half of the post, but I'm still puzzled why a prominent service didn't have a plan for what looked like a run of the mill hardware outage. It's hard to know exactly what happened as I'm having trouble parsing some of the post (what is a 'network connector'? is it a cable? nic?). What were some of the 'increasingly outlandish' workarounds? Are they actually standing up production hosts manually, and was that the cause of a delay or unwillingness to get new hardware going? I think it would be important to have all of that set down either in documentation or code seeing as most of their technical staff are either volunteers, who may come and go, or part timers. Maybe they did, it's not clear.

It's also weird seeing that they are still waiting on their provider to tell them exactly what was done to the hardware to get it going again, that's usually one of the first things a tech mentions: "ok, we replaced the optics in port 1" or "I replaced that cable after seeing increased error rates", something like that.


You are not wrong that this is puzzling, especially when viewed through the lens of a professional with a background in these areas (10 years).

There are many red flags which beg questions.

That said, I stopped taking them at their word years ago, this isn't the first time they've had dubious announcements following entirely preventable failures. In my mind, they really don't have any professional credibility.

People in the business of System Administration would follow basic standard practices that eliminate most of these risks.

The linked post isn't a valid post-mortem, if it were it would contain unambiguous details of the timetables and specifics, both of the failure domains and resolutions.

As you say, a network connector could mean any number of things. It's ambiguous, and ambiguity in technical material is most often used to hide or mislead, which is why professionals detailing a post-mortem would remove any possible ambiguity they could.

It is common professional practice to have a recovery playbook, and a plan for disaster recovery for business continuity which is tested at least every 6 months, usually quarterly. This is true of both charities and businesses.

Based on their post, they don't have one and they don't follow this well known industry practice. You really cannot call yourself a System Administrator if you don't follow the basics of the profession.

TPOSNA covers these basics for those not in the profession. It's roughly two decades old now, it is well established, and ignorance of the practices isn't a valid excuse.

Professional budgets also always have a fund for emergencies based on these BC/DR plans. Additionally, using resilient design is common practice; single points of failures are not excusable in production failure domains especially when zero-downtime must be achieved.

Automated Deployment is a standard practice as well factoring into RTO and capacity planning improvements. Cattle not Pets.

Also, you don't ever wait on a vendor to take action. You make changes, and revert when the issue gets resolved.

First thing I would have done is set the domain DNS TTL to 5 minutes upon alerted failures (as a precaution), and then if needed point the DNS to a viable alternative server (either deployed temporarily or running in parallel).
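
One caveat on timing, sketched under the assumption that resolvers honor TTLs: a record cached before the TTL was lowered stays cached for the old TTL, so lowering it at alert time only pays off once the old TTL has run out. The function name and the 3600-second starting TTL below are hypothetical, chosen only to illustrate the arithmetic:

```python
def worst_case_stale_seconds(old_ttl: int, new_ttl: int, seconds_since_lowered: int) -> int:
    """Longest time a resolver may keep serving the old address after a DNS switch.

    A resolver that cached the record just before the TTL was lowered holds it
    for the remainder of old_ttl; one that cached it afterwards holds it for
    at most new_ttl.
    """
    remaining_old = old_ttl - seconds_since_lowered
    return max(new_ttl, remaining_old)


# TTL lowered from a hypothetical 1 hour to 5 minutes, record switched immediately:
print(worst_case_stale_seconds(3600, 300, 0))

# Same change, but the record is switched an hour after the TTL was lowered:
print(worst_case_stale_seconds(3600, 300, 3600))
```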

Failures inevitably happen which is why you risk manage this using a topology with load balancers/servers set up in HA groups, eliminating any single provider as a single point of failure.

This is so basic that any junior admin knows these things.

Outlandish workarounds only happen when you do not have a plan and you are dredging the bottom of the barrel.


I've worked with Thibault before he could self-sustain on lichess donations, he's a professional software developer and sysadmin and one of the best I've worked with.

The people behind lichess are very much professionals, have worked in companies before, and know about everything you're writing. However instead of building a business they decided to run a completely free and ad-free non profit living off donations.

You don't get the same budget doing that compared to a subscription-based / ad-supported service. That's true for the number of people maintaining it as well as the cloud cost you can afford.

If you look at their track record, uptime has been pretty good. Shit happens, but if you ask me it's worth it to have a service like Lichess that can exist completely on donations.


There are many problems with what you've written here as well as bot-like behavior in the responses that have telltale signs of vote manipulation and propaganda similar to Chinese state-run campaigns.

We will have to disagree. You have clearly contradicted yourself in at least one way, and attempt to mislead readers in a number of other ways which I won't go into here.

From these, I have to come to the conclusion that you don't have credibility.

The downtime would not have happened if they had followed professional practices. Even a qualified Administrator coming into the outage fresh would have had a fix within 30 minutes if they were working at a professional level.

Yes shit happens, but professionals have processes in place so that common shit does not just happen. This was preventable.


What kind of Tom Clancy novel do you live in that intelligence agencies are astroturfing for free chess sites?


I'm going to assume that your question is genuine and sincere, and not meant sarcastically.

If you read the following books by established experts, you should be able to rationally answer the question for yourself as to the why and the how. The subject matter involves torture for thought reform, real not fantasy. This differs from SERE training which is geared towards resisting information extraction.

China by their own words (internal leaked documents), seeks the destruction of the national will, of their enemies. This involves an identity based approach to torture/thought reform, which falls under the military strategy, Divide and Conquer. Digital attacks are cost effective when weighed against other options.

Anything you believe, love, or common experiences that you share with other people is fair game for inducement and then destructive interference to promote nihilism, while segmenting individuals into two groups, disassociative responses (apathetic/non-response), and psychotic break responses.

The items targeted include chess, along with many other things. Inducement of struggle sessions to break people.

If you spend the time to review the material I mentioned, you'll likely find out that a core belief of yours is untrue, that belief being that something like this is fantasy and impossible. This has a way of breaking the weak-willed, often in a psychological reversal/delusion.

I hope you are a strong person, we need more rational people if we are to survive as a species.

Robert Lifton (Thought Reform and The Psychology of Totalism) Joost Meerloo (Rape of the Mind) Robert Cialdini (Influence) USMC Press (Political Warfare)(Free ebook at their website)


You need to seek professional help.


No I don't, but you certainly do after trying to gaslight like that.

Can't tell if it's pathological or malevolent... probably both. Pray that we never meet.

Thankfully, it is not such a simple thing to discredit when world renowned experts all agree and say a thing, and the longer they have been established the more one should listen.

Saying I need to seek professional help for repeating what's been documented by experts, yeah that is rich.


Log off.


This is so far out of line I wonder what the background is for this issue. Lichess is not emergency dispatch software running in a 911 call center; if they have an outage, the cost is that users can't play online chess until it is fixed. Additionally, the founder of this open source project is objectively good at what he does. Exhibit A: he built and hosts a top 2 online chess platform that competes well against the biggest commercial sites. How does that not lend some professional credibility?


We will have to disagree Kenneth.

Your idea of "so far out of line", would include any communication you disagree with, and is absent rational principles or social norms/mores basis, it is absurd.

I stuck to the objective issues in my previous post, you should too before making baseless claims.

Do some due diligence on the business entities involved, peruse their github history (the deleted parts). Get a real picture about what's going on there. You'll find many contradictions if you dig.

The question on any critical IT professional's mind is how you can run the service given the resources claimed. Yes, he runs the top traffic site for chess, and it's done on a bespoke monolith.

You napkin math/sketch it out by required component services, and it quickly becomes clear that nothing adds up. When nothing is consistent, or supported, you examine your premises for contradictions and lies, which goes again back to credibility.

(Hint: https://trufflesecurity.com/blog/anyone-can-access-deleted-a...)


How do you have my first name?


Why put so much effort when at worst you have a few hours of downtime


As they say, each 9 of uptime increases costs by an order of magnitude. For a non profit service, a few hours of downtime seems a fine trade off vs engineering all of the “right” redundancies. All of which have their own operational costs.
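
To put the "nines" in concrete terms, here is a small sketch of how much downtime per year each level allows. It only shows that the downtime budget shrinks tenfold per nine; the claim about cost growing by an order of magnitude is a rule of thumb and is not modeled here. The function name is illustrative:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(nines: int) -> float:
    """Yearly downtime budget for an availability with `nines` nines (2 -> 99%)."""
    unavailability = 10 ** -nines
    return HOURS_PER_YEAR * unavailability


for n in range(1, 6):
    availability_pct = 100 * (1 - 10 ** -n)
    print(f"{availability_pct:.3f}% -> {allowed_downtime_hours(n):.3f} hours/year")
```

Under this definition, a few hours of downtime in a year still fits inside a three-nines budget.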


This isn't a billion dollar company trading on the NYSE. It's a free website to play chess.


This response and post-mortem is superior to most commercial services I have seen in recent years.


That's basically every aspect of their service. The founder Thibault Duplessis is criminally undercompensated (his choice) for running a site that is better designed, faster, and more popular than 99% of commercial websites out there.


I worked with him once on a job -- incredibly nice guy and obviously talented developer who used to work for the French agency responsible for the Scala Play Framework. https://github.com/lichess-org/lila and https://github.com/lichess-org/scalachess are great resources for anyone ever curious to see a production quality Scala3 web application using Cats and all the properly functional properties of the language.


Would you recommend it as a deep-dive to observe Scala in production?


I haven't looked at the code in ages, but it's probably the only scaled consumer web application written in Scala and moreover running on Scala 3 that you can see the end-to-end source for. You have all the Twitter open source Scala projects, of course, but that's just infrastructure for running a web application, rather than an actual production quality app -- and my sense is that in 2024 there aren't many product teams outside of Twitter using their application tooling (as opposed to some of their data infrastructure, certainly the area where Scala sees the most use today with Spark etc).

TLDR if you want to see production-quality Scala code that this very second is serving 40k chess games -- and mostly bullet/blitz where ms latency is of course crucial -- definitely take a look.

Not as much hype for the language at the moment over Rust or Kotlin, say, but it remains my language of choice for web backends by far.


Exact same thought went through my head. Also note in the first few paragraphs they acknowledge the worst impacts to users. That's very selfless - often corporate postmortems downplay the impact, which frustrates users more. Incidentally, a critical service I use (Postmark) had an outage this week and I didn't even hear from them (I found out via a random twitter post). Shows the difference.


Presumably because Lichess is free thus doesn't have contractual obligations and SLAs that they'll be sued for breaching.


> so you, as our beneficiaries and stakeholders, who support us and encourage us — deserve to get clarification on what happened

Is it that complicated for big tech to reply politely with the above statement when they suddenly disable your account for no obvious reason!


It may not be complicated, but it does require caring about what you do and your customers as opposed to going through basic minimum requirements to appear that you are doing something.

It is much more difficult for corporate cogs to have that level of care compared to someone who does their things with passion.


The post-mortem is honest, but the infrastructure is well below what I'd expect from commercial services.

If a commercial provider told me they're dependent on a single physical server, with no real path or plans to fail over to another server if they need to, I would consider it extremely negligent.

It's fine to not use big cloud providers, but frankly it's pretty incompetent to not have the ability to quickly deploy to a new server.


Poe's law. Lichess is 14 years old and their longest outage is less than 12 hours. Google and AWS have both had ~6 hour outages and that's with billions of dollars depending on them and thousands of engineers. Simpler is working just fine.
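
As a rough illustration of what those two numbers imply, here is the availability over 14 years if that single 12-hour outage were the only downtime. That is an assumption made for the arithmetic, not a claim about Lichess's actual uptime record, which would include shorter outages too:

```python
HOURS_PER_YEAR = 365.25 * 24  # average year length, including leap years

service_age_years = 14        # from the comment
longest_outage_hours = 12     # upper bound from the comment

total_hours = service_age_years * HOURS_PER_YEAR

# Assumption: only the single longest outage is counted as downtime.
availability = 1 - longest_outage_hours / total_hours

print(f"Total hours in service: {total_hours:.0f}")
print(f"Availability counting only that outage: {availability:.4%}")
```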


We're an understaffed charity.


As a general thought, any idea if people have looked at something like (for example) using Proxmox on the physical hardware so the services can be put on VMs which can be migrated between hosts if there are problems?


Yeah I'm not criticizing it as a charity, just pointing out this definitely isn't "superior to most commercial services."

That being said, removing dependence on single hardware nodes isn't something you need a big team for. I've done failover at 1-person startups.


And yet even Meta recently had a multi-hour downtime, despite a budget thousands if not millions of times higher. Would you call them negligent too?

By increasing the complexity you multiply the failure points and increase ongoing maintenance, which is the bottleneck (even more than money) for volunteer-driven projects.


To be clear, you don't need to make it more complex / failure-prone. I didn't say failover needs to be automated.

Kubernetes or complex cloud services are not required to have some basic deployment automation.

You can do it with a simple bash script if you need to. It's just pretty surprising to see the reaction to a hardware failure being to wait around for it to be repaired instead of simply spinning up a new host.
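
A minimal sketch of what such non-automated tooling could look like, written here in Python rather than bash. The host names, the port, and the `pick_deploy_target` helper are all hypothetical, and the actual provisioning and DNS steps are left out because they depend entirely on the setup:

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def pick_deploy_target(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = tcp_reachable,
) -> Optional[str]:
    """Return the first host that answers the probe, or None if none do."""
    for host in hosts:
        if probe(host):
            return host
    return None


# Usage with a stand-in probe so the example runs without network access.
# A human operator would run this with the real probe, then deploy by hand.
fake_status = {"primary.example.org": False, "standby.example.org": True}
target = pick_deploy_target(fake_status, probe=lambda host: fake_status[host])
print(target or "no reachable host, provision a new one")
```

The point of keeping the probe injectable is that the decision logic can be tested without touching the network; the operator still decides when to cut over.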


Once the private link was reestablished, could they not have tunneled out to the internet via another server acting as a sort of gateway?

Disclaimer: I'm not a network engineer so I may be misunderstanding the practicality and complexity of such a workaround.


summary for the lazy: OVH



