I used to love Operating Systems during my undergrad; Modern Operating Systems by Tanenbaum is to date the only academic book I've read in its entirety. I recently read an article by Werner Vogels about how Amazon built Aurora, and I was captivated by it. I want to start reading about Distributed Systems. What would be a good start/road map?
I posted this recently, but MIT's 6.824: Distributed Systems (taught by Robert Morris, of both Morris worm and Viaweb/Y Combinator fame) is completely open and available online, and it includes video lectures, notes, readings, and programming assignments from as recently as Spring 2020 (including half of the lectures recorded from home as the pandemic struck). The assignments even include auto-graded testing scripts, so you can verify your solutions.
I posted this as well in the last thread about this class, but since we're discussing it again there is an active study group doing the labs in Clojure on reddit:
Several of the labs have been ported in full (map-reduce, first part of the Raft lab), including test scripts. Please join if this is interesting to you; the more the merrier!
The book “Designing Data-Intensive Applications” by Martin Kleppmann is a fantastic read with such a concise train of thought. It builds up from basics, adds another thing, and another thing.
I kept asking myself what would happen if I were to extend the feature presented in the chapter I was reading, only to find my answers in the next chapter.
Problem with this book is: what to read after. The book is really good but leaves you with the feeling "there are so many topics I don't know yet... but all the other books out there suck". Any recommendation for someone that has already read DDIA?
Here is their git repository of all online references, by chapter. Seems it doesn't include papers that don't have a clear, public link. If you have the book, though, you can get the names and search for them.
I'm currently reading Distributed Systems by Tanenbaum [1].
It goes into more detail and it's more extensive. It's more outdated but the fundamentals are there.
This was truly one of the greatest, if not the greatest, books on software that I have ever read. I have read it twice at this point and fully intend to read it many more times. It is packed full of incredibly interesting information and written in a way that keeps you interested.
Not only is the main content great, but the references are numerous and open up entirely new sets of material as you progress.
* This System Design Primer [1] on GitHub is a decent overview of how large-scale apps are designed, with jumping-off points into many different subjects.
* The Morning Paper blog's distributed systems tag [2] has a lot of good summaries of research on distributed systems, both from academia and industry.
* I maintain a list of assorted resources on distributed system design and operations on GitHub. [3]
* Also, as mentioned, Designing Data-Intensive Applications is a good starting place.
While I don't see it as a starting point (I think the topics require more context), I'm a big fan of the articles Amazon has published recently as the "Builder's Library"
From a previous question re: "Ask HN: CS papers for software architecture and design?" (https://news.ycombinator.com/item?id=15778396) and distributed systems we eventually realize were needed in the first place:
> A [Byzantine fault] is a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the "Byzantine Generals Problem",[2] developed to describe a situation in which, in order to avoid catastrophic failure of the system, the system's actors must agree on a concerted strategy, but some of these actors are unreliable.
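To make the allegory a bit more concrete, here is a toy sketch of one round of message relaying among generals, loosely in the spirit of the classic "oral messages" approach. Everything in it is an assumption for illustration: the names `run_round`, `majority`, and `flip` are made up, traitors are modeled very simply (a traitorous lieutenant relays the opposite of what it received, a traitorous commander sends conflicting orders), and ties fall back to "retreat". It is not a real protocol implementation.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"


def flip(order):
    """Return the opposite order (how a traitor lies in this toy model)."""
    return RETREAT if order == ATTACK else ATTACK


def majority(votes, default=RETREAT):
    """Most common vote; fall back to `default` when the top two are tied."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]


def run_round(commander_order, traitors, n=4):
    """General 0 is the commander, generals 1..n-1 are lieutenants.

    Returns the decision reached by each loyal lieutenant.
    """
    lieutenants = range(1, n)

    # Step 1: the commander sends an order to every lieutenant.
    # Assumption: a traitorous commander sends conflicting orders.
    received = {}
    for i in lieutenants:
        if 0 in traitors:
            received[i] = ATTACK if i % 2 == 0 else RETREAT
        else:
            received[i] = commander_order

    # Step 2: every lieutenant relays what it received to the others,
    # then each loyal lieutenant takes the majority of what it has seen.
    decisions = {}
    for i in lieutenants:
        if i in traitors:
            continue
        votes = [received[i]]
        for j in lieutenants:
            if j != i:
                # Assumption: a traitorous lieutenant relays the opposite.
                votes.append(flip(received[j]) if j in traitors else received[j])
        decisions[i] = majority(votes)
    return decisions


if __name__ == "__main__":
    print(run_round(ATTACK, traitors={2}))        # four generals, one traitor
    print(run_round(ATTACK, traitors={0}))        # the commander is the traitor
    print(run_round(ATTACK, traitors={2}, n=3))   # three generals, one traitor
```

With four generals and one traitor, the loyal lieutenants end up agreeing with each other in this model, whether the traitor is a lieutenant or the commander. With only three generals, the single loyal lieutenant sees one "attack" and one "retreat" and cannot tell who is lying, which is the intuition behind needing more than three times as many generals as traitors.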
Practically, dask.distributed (joblib -> SLURM), Dask-ML, dask-labextension (a JupyterLab extension for Dask), and the RAPIDS tools (e.g. cuDF) scale from one to many nodes.
https://pdos.csail.mit.edu/6.824/