I really don't see how a fixed schema is seen as such a bad thing by many NoSQL advocates. In most databases, altering a schema is an operation that's over quickly, and in many databases it's easily reversible by just rolling back a transaction.
The advantages of a fixed schema are similar to the advantages of a static type system in programming: You get a lot of error checking for free, so you can be sure that whenever you read the data, it'll be in a valid and immediately useful format.
Just because your database is bad at altering the schema (for example by taking an exclusive lock on a table) or validating input data (like turning invalid dates into '0000-00-00' and then later returning that) doesn't mean you need to abolish the schema to solve it.
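As a minimal sketch of the error checking you get for free (assuming PostgreSQL; the table and values are made up), the invalid date is rejected at write time instead of being stored as garbage:

    CREATE TABLE events (
        id   serial PRIMARY KEY,
        name text NOT NULL,
        day  date NOT NULL
    );

    INSERT INTO events (name, day) VALUES ('launch', '2013-02-31');
    -- ERROR:  date/time field value out of range: "2013-02-31"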
> In most databases, altering a schema is an operation that's over quickly
I think many people were burned by MySQL InnoDB, which modifies (or modified; I'm still on quite an old version) schemas by making a new table and copying the data, and don't realise that all databases don't do that.
InnoDB has online, non-copying DDL (for most operations) as of MySQL 5.6. Also, TokuDB has had online, non-copying DDL (column add and drop, add index) since TokuDB 5.0.
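For reference, in 5.6 you can also assert the online behaviour explicitly, so the statement fails loudly instead of silently falling back to a table copy (sketch; the table is hypothetical):

    ALTER TABLE orders
        ADD COLUMN shipped_at DATETIME NULL,
        ALGORITHM=INPLACE, LOCK=NONE;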
Data migrations are rarely just about adding or removing a few columns; they also involve data manipulation, and a schema doesn't help you with that. So at some point you have to write data migration code, in which case (and if you are using an ORM) why not do the schema change at the same time?
And sorry, but it is complete nonsense to suggest that data migrations are almost impossible on schemaless systems. I've done countless myself, and people are doing them every day.
Relational models are well-described and can be changed transactionally, both in structure and content, using a well-known and ubiquitous language to describe the migration. It doesn't get any easier than that.
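For example, a structure-plus-content migration in a database with transactional DDL (PostgreSQL here; the schema is hypothetical) is one atomic unit:

    BEGIN;
    ALTER TABLE users ADD COLUMN full_name text;
    UPDATE users SET full_name = first_name || ' ' || last_name;
    ALTER TABLE users ALTER COLUMN full_name SET NOT NULL;
    COMMIT;  -- or ROLLBACK, and none of it ever happened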
> Data migrations are rarely just about adding or removing a few columns.
This has not been my experience. Sure, you occasionally need code to do a migration. But at least 90% of the time, you are adding to or refactoring your existing data model. In which case, pure SQL is entirely sufficient, unless you've been doing untoward things with your schema.
Well, I think the static typing analogy pretty much cuts to the heart of it. I wouldn't be shocked if static typing advocates tend to like relational databases, and dynamic typing advocates tend to favor NoSQL -- I can't prove it, but I wouldn't bet against it.
I don't think so. I'm a fan of Python and Postgres. Rationale: there's nothing special about type errors in the code. They're just ordinary errors; you fix the code and move on. But if bad data gets into your database, it's too late to just fix the code. By the time you discover the problem, you're already set up for a very bad day, or long sequence thereof. The database has to be treated as sacred.
Nope. MongoDB is very popular in enterprises, which are largely dominated by Java, i.e. strongly typed languages. MongoDB actually suits strongly typed languages, since you enforce your domain in code rather than in the database.
And since Morphia (https://github.com/mongodb/morphia) is similar to JPA, it is trivial to port your existing code to MongoDB. That leaves you with the same experience as an SQL database, except with a lot of benefits (single binary, easy to query/administer/scale, geospatial features, etc.).
I'd not count Java as a sign that the people at a place like strongly typed systems: Java is a default. Its type system has so few features that it pales in comparison with the alternatives. Those on the JVM who really care about a type system might be using Scala instead.
I agree, and I'm pretty sure there is a correlation between those two axes, although there are a lot of subtleties that can make someone prefer dynamic+SQL or static+NoSQL. Speaking for myself, I'm completely in the static+SQL camp.
I suspect that static+NoSQL combinations would be somewhat harder to find than dynamic+SQL ones. It's easy to use a dynamic language over an SQL database: types in SQL don't match 1:1 those in languages anyway, and result rows are traversed by column index in both static and dynamic languages, so SQL's "static typing" won't affect your code. It's harder to use a static language over a NoSQL database: suddenly you start to lose guarantees about what goes in every field, and checks must be performed.
I agree that the philosophies seem to match up, but I believe most practical people try to look for machine guarantees from somewhere to contain the complexity.
Someone who is comfortable with strong guarantees in the database may be more willing to use a dynamically-typed language for the application, and (less commonly) vice-versa.
I had the same thought about NoSQL / SQL being like dynamic / static languages. But I prefer dynamic languages. I think I like the strictness that a relational database gives me, so that I can worry less at the app code level (which will be in a dynamic language).
A good relational database that implements the normal checks and constraints can serve as an additional check on your dynamically typed code trying to store garbage in the database.
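A sketch of the kind of check that catches that garbage at the door (PostgreSQL syntax; the table and columns are invented):

    CREATE TABLE payments (
        id       serial        PRIMARY KEY,
        amount   numeric(10,2) NOT NULL CHECK (amount > 0),
        currency char(3)       NOT NULL CHECK (currency ~ '^[A-Z]{3}$')
    );

    -- A buggy dynamic code path fails here, not six months later:
    INSERT INTO payments (amount, currency) VALUES (-5, 'USD');
    -- ERROR:  new row for relation "payments" violates check constraint "payments_amount_check"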
I don't agree with you and I'd like some clarification from anyone who knows for sure.
If the database is small, then sure, schema changes are harmless. But if a database is in production with data shuttling in and out of it and has many records, then ALTERs are non-trivial.
And I don't think an ALTER can survive in a transaction, i.e. it is not something that can be rolled back. It can be undone via another ALTER that is its opposite, but it cannot be rolled back the way an INSERT or UPDATE can.
> And I don't think an ALTER can survive in a transaction, i.e. it is not something that can be rolled back. It can be undone via another ALTER that is its opposite, but it cannot be rolled back the way an INSERT or UPDATE can.
It can in databases that use transactional DDL. And, really, supporting this is part of the relational model, since part of that model is that the metadata describing the structure of data is itself stored in relations and -- whatever the syntactic sugar layered over top of it -- can itself be manipulated with the same tools that regular data is manipulated with.
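Both halves of that are easy to see in, say, PostgreSQL (sketch; hypothetical table):

    BEGIN;
    ALTER TABLE users DROP COLUMN last_name;
    ROLLBACK;  -- the column, and its data, are back as if nothing happened

    -- and the schema itself is just rows in relations:
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'users';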
Having the DDL be described relationally is only incidental. I would imagine rollback support to be highly dependent on how the storage engine works. It seems non-trivial to me, short of copying the entire table across the ALTER. And that doesn't sound scalable.
I agree. I wonder how many NoSQL advocates should really be called NoMySQL advocates. I have always been a NoMySQL advocate, but not necessarily a NoSQL advocate.
Without a fixed schema, the schema must be stored with each item. Every "row" not only has the data but the metadata as well, including field names.
If you have a lot of repetitive data, this can consume very significant amounts of space (regularly on the order of 4x when comparing PG rows to BSON docs). This isn't just a disk space issue: it means less of your data fits in cache and more pages to scan when doing any operation. Compression only improves the disk space issue, while adding CPU overhead.
Relational DBs like PostgreSQL, having a fixed schema, need only store it one time; each row has minimal overhead (about 23 bytes in PG) and stores only the unique data.
Perhaps Mongo could optimize behind the scenes, detecting a repeating schema and building a fixed-schema data structure for it, similar to how the V8 JavaScript engine creates hidden classes for JavaScript objects so that not every property access is a performance-limiting key-value lookup. I haven't seen any NoSQL db attempt that; I don't think any even attempt simple string interning, as suggested in Mongo's issue 863.
PG gives me a choice: I can go nearly schemaless with the JSON type, or I can create a fixed schema when that makes sense, with many optimized data types to choose from. PG's array support gives you a nice middle ground between the two as well.
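A sketch of that mix (PostgreSQL 9.3+ for the JSON operator; all names invented):

    CREATE TABLE products (
        id    serial  PRIMARY KEY,
        name  text    NOT NULL,   -- fixed and type-checked
        price numeric NOT NULL,   -- fixed and type-checked
        tags  text[],             -- arrays: the middle ground
        attrs json                -- nearly schemaless where it's worth it
    );

    SELECT name, attrs->>'colour' AS colour
    FROM products
    WHERE 'sale' = ANY (tags);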
It's because a fixed schema is unnecessary in most cases. Most developers using a database are doing so via an ORM, in which case they already have a data model in their code. It is often fixed and, depending on the language, strongly typed. It is also centralised, richer, commented, has far better validation, and is in source control.
So again, why do I need a second data model that does nothing but get in my way?
Because querying and aggregating your data with arbitrary conditions becomes much easier. Using SQL, you describe what result you want and let the database figure out how to retrieve it.
In many document stores, you have to write manual map/reduce jobs telling the system how to get the data you want.
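Compare the declarative version (standard SQL; the orders table is hypothetical):

    -- You state the "what"; the planner picks indexes, join order, etc.
    SELECT customer_id,
           count(*)   AS orders,
           sum(total) AS revenue
    FROM orders
    WHERE created_at >= '2013-01-01'
    GROUP BY customer_id
    HAVING sum(total) > 1000
    ORDER BY revenue DESC;

In a map/reduce-only store, you hand-write the grouping and the sum, and then rewrite them for every new condition.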
Most software development projects tend to fall into two camps these days. Either they are simple and just need basic filtering/grouping, in which case you don't need map/reduce; or they are doing more advanced querying of data, in which case Hadoop/Mahout is a better option.
See my other response about the advantages of schema-less design. I agree that fixed schemas can often be useful - the point is to think about this in advance for every project, before deciding on SQL vs. NoSQL. I'd say a fixed schema helps during initial development, but will lead to more and more expensive data migrations later on in a system's life cycle, except in the case of systems that are very well defined before anyone writes a line of code. And sometimes it's plain not possible, if you want a flexible data model that isn't fixed at programming time. It's simply - as always - a question of choosing the right tool for the job.
> I'd say a fixed schema helps during initial development, but will lead to more and more expensive data migrations later on in a system's life cycle
Yes. The migrations might become expensive (though that varies between databases - some take exclusive locks for a long time, others don't), but you can save a lot of time and money when querying the data.
In a reasonably normalized schema, there's next to nothing that you can't extract with the help of SQL; whatever you throw at it, databases have become very good at planning the optimal way to retrieve the data you've requested.
If all your data is in huge blobs, querying for arbitrary conditions and aggregating parts of those blobs is complicated and requires manual planning of every query you may come up with.
If your data is well normalized, whatever SQL you throw at it, the database will know how to answer. No specific retrieval code needed.
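For instance, an ad-hoc question nobody planned for at design time (a hypothetical normalized schema):

    -- Revenue per country for one product category, straight off the tables
    SELECT c.country,
           sum(oi.quantity * oi.unit_price) AS revenue
    FROM customers c
    JOIN orders o       ON o.customer_id = c.id
    JOIN order_items oi ON oi.order_id   = o.id
    JOIN products p     ON p.id          = oi.product_id
    WHERE p.category = 'books'
    GROUP BY c.country;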
As such I would argue that a strict schema and a relational database will save you a lot of time as the requirements of your application change.
1.) Yes, ad-hoc querying is usually easier to do and more powerful on relational databases (although the trend in NoSQL databases seems to be going in that direction as well, ArangoDB being one example).
2.) The document-based approach doesn't say anything about the degree of data normalization - you can have your data as normalized or denormalized as you want, just as in relational databases. Each row can correspond to one document, and the containing table usually corresponds to a 'type' property on the document. Following this approach, building views/queries is about as powerful as it can be in an RDBMS; the trade-off is rather between strictness in forbidding/allowing slow algorithms vs. ease of implementing a first working version.
> The advantages of a fixed schema are similar to the advantages of a static type system in programming
Yes, but I already have one of those in the form of a static type system. I dislike database schema for the same reason I dislike XML schema: it tends to enforce only the most trivial constraints, and means you duplicate part of your validation (because you still want validation in code for the things you can't check in the schema). It's just another moving part that can be a potential source of failures.
Are you sure about that? You can validate an awful lot in a PostgreSQL constraint, and I often don't bother validating in the app as a result. Validation in the app can also be extremely difficult to get right in a concurrent environment. There is a high likelihood that many such checks have race conditions that you aren't aware of, if you haven't gone over your application with a dedicated DBA.
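For example, "no overlapping bookings for the same room" is a classic check that is racy as SELECT-then-INSERT in app code but airtight as a constraint (PostgreSQL 9.2+ with the btree_gist extension; the schema is made up):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE bookings (
        room_id int     NOT NULL,
        during  tsrange NOT NULL,
        -- Two concurrent transactions inserting overlapping ranges:
        -- one of them fails, with no race window.
        EXCLUDE USING gist (room_id WITH =, during WITH &&)
    );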
You're right that a lot is possible, particularly if you're willing to write a trigger, at which point you have a programming language at your disposal. But it's not a great environment for writing code with logic, in terms of things like tool support, debugging support, and library ecosystem.
I do not disagree with your primary PoV that schemas can be a good thing, but:
a. Schema changes are considered DDL statements and hence can't be simply rolled back via transaction in most databases (as far as I know).
b. If you have millions or more rows and you need to introduce a schema change, it can get painful to do and can result in app downtime. It is a well-documented problem with some well-documented workarounds.
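The classic workaround is the copy-and-swap pattern that tools like Percona's pt-online-schema-change automate (a very rough sketch in MySQL syntax; a real tool also installs triggers to capture writes that land during the copy):

    CREATE TABLE users_new LIKE users;
    ALTER TABLE users_new ADD COLUMN middle_name VARCHAR(100) NULL;

    -- copy in id-range batches so no single statement locks everything
    INSERT INTO users_new SELECT u.*, NULL FROM users u WHERE u.id BETWEEN 1 AND 10000;
    -- ... repeat for the remaining ranges ...

    RENAME TABLE users TO users_old, users_new TO users;  -- atomic swap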
On the contrary, coming from the world of mature SQL systems (like SQL Server), I was horrified to learn that MySQL supports neither transactional DDL nor online schema updates. These are taken for granted in premium SQL databases, and to a lesser extent in good free ones like PostgreSQL. (Raise your hand if you've seen the polite suggestion from the Django (South?) team to use a "real" database when your migrate command fails...)
Put simply: MySQL was written by and for people whose time isn't valuable. I.e. the assumption is that it's cheaper to throw low-grade people and brainpower at workarounds (OpEx) than to pay to have the problems avoided in the first place (CapEx). There's a reason people shell out the money for big-iron databases: the math can be easy when you factor in the costs saved (both the direct costs and the opportunity costs).
Luckily, just as not all NoSQL databases are the same, nor are all SQL databases. If NoSQL is an option, standards are obviously less relevant, so you can rely on the behaviour of one SQL database engine of your choice. Postgres is one of a few SQL database engines which implement transactional schema changes.
Companies such as Percona also provide open-source tools which allow for online, non-locking schema changes for MySQL, and I assume commercial database engines have options as well.
Just pick a better database.