I'd also highly recommend the Red Book (Readings in Database Systems), edited by several luminaries in the database world, including Peter Bailis, Joe Hellerstein, and Michael Stonebraker. It has also been updated in recent years to keep up with the latest database research - http://www.redbook.io/
IBM has a division which produces technical books based on internal staff, business partner and client contributions. They've been doing this for quite a long time (I participated in one almost 20 years ago). Historically they had red covers and so to this day they are called the Redbooks. http://www.redbooks.ibm.com
In civil engineering, construction contracts are usually based on FIDIC's Red Book. This way, everybody is used to a standard contract, from lawyers to judges, in whatever jurisdiction, nationally or internationally...
My Ph.D. advisor gave me his copy when I was just getting started with my thesis, almost twenty years ago. I remember spending 6-7 very intense weeks with it. It gave me most of the mathematical/logical background specific to databases I needed to finish what was primarily a DB focused dissertation. I kept coming back to it all through my studies and later.
If you ever need to design a logically sound query processor/optimizer and/or data storage layer, this is a book you should study.
It still sits on my bookshelf, waiting for the student whose interest in databases is deeper than dujourDB, so that I can pass it down to them.
Mostly bioinformatics these days. I’m in academia, and have been virtually all my career, so I can’t really speak to whether a Ph.D. is worth it in industry.
Of all the database theory books I have seen, I found the C.J. Date book (An Introduction to Database Systems) to be the most elegant, though the edition I used employed Tutorial D rather than SQL. What I liked is that it brought out the beauty of relational algebra to the reader with great clarity.
I read it when I started looking at RDBMSs after having worked in NoSQL, and I clearly recall how excited I was to learn that there is a full-fledged system I could use that would get rid of 90% of my code.
I know this term is usually not used when it comes to technical books, but I found it to be quite a page turner!!
I used to think this too. However, after diving really deep into OLAP for business reporting over the last 6 months, I find NULL to be the single most powerful idea in SQL. Here are some use cases for NULL that I find super useful:
COUNT(case when <complex condition> then 1 end) as foo
Counts occurrences of a complex condition over a row set. Works because COUNT ignores NULLs.
NVL(1/NULLIF(foo, 0), 0)
Protects against division by zero. The expression 1/X is a simple example, but this generalizes well. In fact, I've come to think that NULL turns any SQL value into a monadic option type, like those used in functional languages and Rust.
Note that I’m not saying the above cannot be done without NULL, just that NULL is a really useful tool in my experience.
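Putting both patterns together, here is a minimal sketch (the orders table and its column names are made up; COALESCE is the standard spelling of Oracle's NVL):

  -- Hypothetical table: orders(region, amount, discount)
  SELECT
    region,
    -- CASE yields NULL for non-matching rows and COUNT ignores NULLs,
    -- so this counts only the rows matching the complex condition.
    COUNT(CASE WHEN amount > 100 AND discount > 0 THEN 1 END) AS big_discounted_orders,
    -- NULLIF turns a zero divisor into NULL, the division then yields NULL,
    -- and COALESCE maps that back to a default instead of raising an error.
    COALESCE(SUM(discount) / NULLIF(SUM(amount), 0), 0) AS discount_ratio
  FROM orders
  GROUP BY region;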
I've done a fair bit of BI and OLAP over the last decade, and I still feel NULL causes more issues than it solves.
For the counting case, sum(case when ... then 1 else 0 end) generally works just as well, at little to no cost.
Having nullability as a concept in the query language isn't very problematic though; I might even consider it helpful on average, as long as a view or table can never contain nulls.
Nullable columns in tables and queries are the problem. It boils down to how each nullable column effectively describes a schema equivalent to one table for the nullable column and one for the non-nullable columns, where the relationship between these two tables is quite under-constrained (see the sketch below).
Thus, to ensure data quality and correctness, you have to query all nullable columns of interest to make sure that the various combinations of nulls in those columns don't interfere with what you are trying to report. You also have to track that no component of the system suddenly starts reporting nulls in columns where it didn't use to, as that is equivalent to a schema change.
This wouldn't be much of an issue if there were never more than one nullable column per table, or at the very least if only the absolute minimum of columns were nullable. That is, however, only very rarely the case; usually there are far too many nullable columns. To build reliable and stable reporting, one then has to solve the under-specification of the source system in the analysis and reporting pipeline, for a class of issues that should for the most part be the responsibility of the source systems.
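To make the implicit-table-per-nullable-column point concrete, here is a rough sketch (the table and column names are made up):

  -- A nullable column...
  CREATE TABLE customer (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    churn_date DATE            -- NULL unless the customer has churned
  );

  -- ...carries roughly the same information as this decomposition, except
  -- that nothing constrains how the optional part relates to the core:
  CREATE TABLE customer_core (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
  );
  CREATE TABLE customer_churn (
    id         INTEGER PRIMARY KEY REFERENCES customer_core(id),
    churn_date DATE NOT NULL
  );

Each additional nullable column adds another such implicit optional table, and the combinations across them are exactly the under-constrained part that reporting has to untangle.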
The best use of nulls I know of is using them to lower the cost of table scans in Google BigQuery.
Just be careful not to use *: COUNT does not always ignore NULLs. If you specify a column, it ignores NULLs; if you use *, it counts the row regardless (this is on SQL Server, btw... might be different on other platforms).
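A quick way to see the difference (the inline VALUES table is just for illustration; this syntax works on SQL Server and PostgreSQL):

  SELECT COUNT(*) AS all_rows,        -- 3: counts every row
         COUNT(x) AS non_null_rows    -- 2: skips the row where x IS NULL
  FROM (VALUES (1), (NULL), (3)) AS t(x);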
It's exactly unlike a monadic option type. It's much more like null in Java, C#, etc.: every type gets a NULL value here, whereas in ML/Haskell/etc. the use of Option/Maybe/etc. is explicit.
Chapter 3, 'The Structure of the Relational Model', has a helpful background on the origins of table-oriented relational algebra. Unfortunately, it misses the importance of graphs, which, afaict, transcend tables' bias toward normalized topologies of pathologically regular 2D metric spaces, and it even reinforces the relevance of tables by saying, "Intuitively, the data is represented in tables...". Whereas,
> Graphs allow us to build complex, intuitive relationships between data points. Unlike many traditional methods of structuring data, which focus on the format of relationships, graphs focus on the topology of relationships in the dataset.
When it comes to making database concepts more intuitive, I find visually rich representations very helpful, for example using geometric space-encoded logical relationships to aid diagrammatic reasoning.
It goes along well with many other people's efforts to improve mathematical representations and interfaces through intuitive visualizations, e.g.:
If you are curious about how databases are implemented, here is a simple implementation with clear abstractions: https://github.com/komu/ahwen
It's a good idea to familiarise yourself with the code layer by layer, bottom-up.
Readme says this:
> The implementation takes heavy inspiration from Edward Sciore's SimpleDB, augmented by implementations of various exercises in his textbook Database Design and Implementation.
I see that this is now #3 on the homepage with 112 points but no comments yet. Are people already very familiar with this work? Can someone expound on its merits?
I think that a common occurrence on HN, which I'm definitely a perpetrator of, is upvoting things you wish you had a better grasp on. Theory-heavy & math posts seem especially prone to this.
Perhaps people find it interesting but are too shy to leave the first comment. Like being the one to ask a question during a lecture, sometimes it's really hard to put yourself out there.
In the preface of the book (omitted from the ebook), the authors detail several learning paths:
The Classic: a basic database theory course covering the classical material would center around Parts B and C. Chapter 10 and parts of Chapter 9 of Part C are somewhat more advanced and could be skipped. If time allows, some of Chapter 12 and a selection of Part F might be covered.
Feast of Query Languages: a course on the theory of query languages would start with a quick review of the basic material on classical languages (Part B), and continue with Parts D and E. If time allows, some material on languages for complex objects and object-oriented databases (Part F) could be covered.
Gourmet Sampling of Database Theory: a course for people with theoretical appetites that emphasizes the specificity of database theory. Logicians wishing to explore the connection between finite-model theory and databases will be interested in Parts C and E. Those interested in descriptive complexity will find Part E closest to their hearts. Researchers in logic programming will prefer Part D, particularly Chapters 12, 13, and 15. People with a background in theoretical artificial intelligence will find Parts D and F of particular interest. Rule-based systems are related to Chapter 14 (see also parts of Chapter 22). Programming language people will be interested in much of the material on query languages, including Chapters 20 and 21 in Part F.
Fast-Food Database Theory: a course for applied database students that is meant to provide informal exposure to some of the basic results of database theory. This would include assorted topics, with an informal presentation emphasizing the examples, results, and intuition provided in the text, rather than the proofs. A possible syllabus would include Part B; parts of Chapters 8, 9, and 11 in Part C; Chapter 12 and parts of Chapter 15 in Part D; and selected chapters of Part F.
Numerous exercises have been included, and they are essentially of three categories. Routine exercises (unmarked) require easy manipulation of the basic concepts and results. Harder exercises are marked with a (*). Another category of exercises is meant to complement the material in the text and often contains results from related research articles. These exercises are usually on the hard side and some may constitute term projects. These exercises are marked with a ( ).
With databases or anything in general, I believe in doing first and then learning the principles.
Play around with a small RDBMS like SQLite. Do a project around it. Maybe look at the source, and then learn about the principles. Foundations of databases are based on discrete mathematics (that's where the 'relational' in RDBMS comes from), so a background in discrete math helps.
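For instance, a tiny SQLite session (tables made up) is enough to see the relational-algebra basics, selection, projection and join, in action:

  CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
  CREATE TABLE book   (id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                       author_id INTEGER REFERENCES author(id));

  INSERT INTO author VALUES (1, 'Abiteboul'), (2, 'Date');
  INSERT INTO book   VALUES (1, 'Foundations of Databases', 1),
                            (2, 'An Introduction to Database Systems', 2);

  -- Selection (WHERE), projection (the SELECT list) and a join:
  SELECT a.name, b.title
  FROM author AS a
  JOIN book   AS b ON b.author_id = a.id
  WHERE a.name = 'Abiteboul';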
For enterprise-level stuff (clustering, replication, distribution, etc.), it's more practical to learn on the job, where you have access to the hardware, servers, etc.
This is really interesting. Now it's ranked #20 with more than 500 points, and it now has 36 comments. But none is actually about the link. People are giving advice on database books they read, and arguing about which one is the best, but it looks like nobody read this particular one. Yet it's massively upvoted.
I took Serge's class when he was on sabbatical (?) teaching at UC Berkeley back in the days of steam-powered web browsers. Was a great course but it felt like the theory wasn't _quite_ there yet to be the next big thing in databases.
Looks like a great textbook. I'm really looking for a book on relational database implementation itself, but I understand this is a topic so large that a single book probably could never cover it all.
Obviously more than one HN user has heard of C.J. Date's work on databases.
We banned you last year for egregiously violating the site guidelines, and told you why. When banned users create new accounts and show new signs of abusive behavior, we ban the new account too. The reason is obvious, but I'll explain anyway: when a banned user gives us reason to believe they've had a change of heart and will use HN as intended in the future, we're happy to unban them; but when they give us reason to believe the opposite, we do the opposite. Otherwise bannage would be a bit of a joke on a site where people can just create a new account in a few seconds (and many do).
In that case I was referring to https://news.ycombinator.com/item?id=17922994. When a new account posts like that and we have evidence linking it to a previously banned account, we usually take it as a sign that the user does not want to use HN as intended. In such cases we extend the previous ban to the new account, for what I'd hope would be obvious reasons.