
This response and post-mortem is superior to most commercial services I have seen in recent years.



That's basically every aspect of their service. The founder Thibault Duplessis is criminally undercompensated (his choice) for running a site that is better designed, faster, and more popular than 99% of commercial websites out there.


I worked with him once on a job -- incredibly nice guy and obviously talented developer who used to work for the French agency responsible for the Scala Play Framework. https://github.com/lichess-org/lila and https://github.com/lichess-org/scalachess are great resources for anyone curious to see a production-quality Scala 3 web application using Cats and all the properly functional properties of the language.


Would you recommend it as a deep-dive to observe Scala in production?


I haven't looked at the code in ages, but it's probably the only scaled consumer web application written in Scala and moreover running on Scala 3 that you can see the end-to-end source for. You have all the Twitter open source Scala projects, of course, but that's just infrastructure for running a web application, rather than an actual production quality app -- and my sense is that in 2024 there aren't many product teams outside of Twitter using their application tooling (as opposed to some of their data infrastructure, certainly the area where Scala sees the most use today with Spark etc).

TLDR if you want to see production-quality Scala code that this very second is serving 40k chess games -- and mostly bullet/blitz where ms latency is of course crucial -- definitely take a look.

Not as much hype for the language at the moment over Rust or Kotlin, say, but it remains my language of choice for web backends by far.


Exact same thought went through my head. Also note in the first few paragraphs they acknowledge the worst impacts to users. That's very selfless - often corporate postmortems downplay the impact, which frustrates users more. Incidentally, a critical service I use (Postmark) had an outage this week and I didn't even hear from them (I found out via a random twitter post). Shows the difference.


Presumably because Lichess is free and thus doesn't have contractual obligations and SLAs that they could be sued for breaching.


> so you, as our beneficiaries and stakeholders, who support us and encourage us — deserve to get clarification on what happened

Is it that complicated for big tech to reply politely with the above statement when they suddenly disable your account for no obvious reason?


It may not be complicated, but it does require caring about what you do and about your customers, as opposed to doing the bare minimum to appear that you are doing something.

It is much more difficult for corporate cogs to have that level of care compared to someone who does their work with passion.


The post-mortem is honest, but the infrastructure is well below what I'd expect from commercial services.

If a commercial provider told me they're dependent on a single physical server, with no real path or plans to fail over to another server if they need to, I would consider it extremely negligent.

It's fine to not use big cloud providers, but frankly it's pretty incompetent to not have the ability to quickly deploy to a new server.


Poe's law. Lichess is 14 years old and their longest outage is less than 12 hours. Google and AWS have both had ~6 hour outages and that's with billions of dollars depending on them and thousands of engineers. Simpler is working just fine.


We're an understaffed charity.


As a general thought, any idea if people have looked at something like (for example) using Proxmox on the physical hardware so the services can be put on VMs which can be migrated between hosts if there are problems?


Yeah I'm not criticizing it as a charity, just pointing out this definitely isn't "superior to most commercial services."

That being said, removing dependence on single hardware nodes isn't something you need a big team for. I've done failover at 1-person startups.


And yet even Meta recently had multiple hours of downtime, despite a budget thousands if not millions of times higher. Would you call them negligent too?

By increasing the complexity you multiply the failure points and increase ongoing maintenance, which is the bottleneck (even more than money) for volunteer-driven projects.


To be clear, you don't need to make it more complex / failure-prone. I didn't say failover needs to be automated.

Kubernetes or complex cloud services are not required to have some basic deployment automation.

You can do it with a simple bash script if you need to. It's just pretty surprising to see the reaction to a hardware failure being to wait around for it to be repaired instead of simply spinning up a new host.



