I really don't see how a fixed schema is seen as such a bad thing by many NoSQL advocates. In most databases, altering a schema is an operation that's over quickly, and in many databases it's easily reversible by just rolling back a transaction.
The advantages of a fixed schema are similar to the advantages of a static type system in programming: You get a lot of error checking for free, so you can be sure that whenever you read the data, it'll be in a valid and immediately useful format.
Just because your database is bad at altering the schema (for example by taking an exclusive lock on a table) or validating input data (like turning invalid dates into '0000-00-00' and then later returning that) doesn't mean you need to abolish the schema to solve it.
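As a minimal sketch of the error checking you get for free (assuming PostgreSQL; the table and values are made up), the invalid date is rejected at write time instead of being stored as garbage:

    CREATE TABLE events (
        id   serial PRIMARY KEY,
        name text NOT NULL,
        day  date NOT NULL
    );

    INSERT INTO events (name, day) VALUES ('launch', '2013-02-31');
    -- ERROR:  date/time field value out of range: "2013-02-31"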
> In most databases, altering a schema is an operation that's over quickly
I think many people were burned by MySQL InnoDB, which modifies (or modified; I'm still on quite an old version) schemas by making a new table and copying the data, and don't realise that all databases don't do that.
InnoDB has online, non-copying DDL (for most operations) as of MySQL 5.6. Also, TokuDB has had online, non-copying DDL (column add and drop, add index) since TokuDB 5.0.
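For reference, in 5.6 you can also assert the online behaviour explicitly, so the statement fails loudly instead of silently falling back to a table copy (sketch; the table is hypothetical):

    ALTER TABLE orders
        ADD COLUMN shipped_at DATETIME NULL,
        ALGORITHM=INPLACE, LOCK=NONE;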
Data migrations are rarely just about adding or removing a few columns; they also involve data manipulation, and a schema doesn't help you with that. So at some point you have to write data migration code, in which case (and if you are using an ORM) why not do the schema change at the same time?
And sorry, but it is complete nonsense to suggest that data migrations are almost impossible on schemaless systems. I've done countless myself, and people are doing them every day.
Relational models are well-described and can be changed transactionally, both in structure and content, using a well-known and ubiquitous language to describe the migration. It doesn't get any easier than that.
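For example, a structure-plus-content migration in a database with transactional DDL (PostgreSQL here; the schema is hypothetical) is one atomic unit:

    BEGIN;
    ALTER TABLE users ADD COLUMN full_name text;
    UPDATE users SET full_name = first_name || ' ' || last_name;
    ALTER TABLE users ALTER COLUMN full_name SET NOT NULL;
    COMMIT;  -- or ROLLBACK, and none of it ever happened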
> Data migrations are rarely just about adding or removing a few columns.
This has not been my experience. Sure, you occasionally need code to do a migration. But at least 90% of the time, you are adding to or refactoring your existing data model. In which case, pure SQL is entirely sufficient, unless you've been doing untoward things with your schema.
Well, I think the static typing analogy pretty much cuts to the heart of it. I wouldn't be shocked if static typing advocates tend to like relational databases, and dynamic typing advocates tend to favor NoSQL -- I can't prove it, but I wouldn't bet against it.
I don't think so. I'm a fan of Python and Postgres. Rationale: there's nothing special about type errors in the code. They're just ordinary errors; you fix the code and move on. But if bad data gets into your database, it's too late to just fix the code. By the time you discover the problem, you're already set up for a very bad day, or long sequence thereof. The database has to be treated as sacred.
Nope. MongoDB is very popular in enterprises, which are largely dominated by Java, i.e. strongly typed languages. MongoDB actually suits strongly typed languages, since you enforce your domain in code rather than in the database.
And since Morphia (https://github.com/mongodb/morphia) is similar to JPA, it is trivial to port your existing code to MongoDB. That leaves you with the same experience as an SQL database, except with a lot of benefits (single binary, easy to query/administer/scale, geospatial features, etc.).
I'd not count Java as a sign that the people at a place like strongly typed systems: Java is a default. Its type system has so few features that it pales in comparison with the alternatives. Those on the JVM who really care about a type system might be using Scala instead.
I agree, and I'm pretty sure there is a correlation between those two axes, although there are a lot of subtleties that can make someone prefer dynamic+SQL or static+NoSQL. Speaking for myself, I'm completely in the static+SQL camp.
I suspect that static+NoSQL combinations would be somewhat harder to find than dynamic+SQL ones. It's easy to use a dynamic language over an SQL database: types in SQL don't match 1:1 those in languages anyway, and result rows are traversed by column index in both static and dynamic languages, so SQL's "static typing" won't affect your code. It's harder to use a static language over a NoSQL database: suddenly you start to lose guarantees about what goes in every field, and checks must be performed.
I agree that the philosophies seem to match up, but I believe most practical people try to look for machine guarantees from somewhere to contain the complexity.
Someone who is comfortable with strong guarantees in the database may be more willing to use a dynamically-typed language for the application, and (less commonly) vice-versa.
I had the same thought about NoSQL / SQL being like dynamic / static languages. But I prefer dynamic languages. I think I like the strictness that a relational database gives me, so that I can worry less at the app code level (which will be in a dynamic language).
A good relational database that implements the normal checks and constraints can serve as an additional check on your dynamically typed code trying to store garbage in the database.
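A sketch of the kind of check that catches that garbage at the door (PostgreSQL syntax; the table and columns are invented):

    CREATE TABLE payments (
        id       serial        PRIMARY KEY,
        amount   numeric(10,2) NOT NULL CHECK (amount > 0),
        currency char(3)       NOT NULL CHECK (currency ~ '^[A-Z]{3}$')
    );

    -- A buggy dynamic code path fails here, not six months later:
    INSERT INTO payments (amount, currency) VALUES (-5, 'USD');
    -- ERROR:  new row for relation "payments" violates check constraint "payments_amount_check"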
I don't agree with you and I'd like some clarification from anyone who knows for sure.
If the database is small, then sure, schema changes are harmless. But if a database is in production with data shuttling in and out of it and has many records, then ALTERs are non-trivial.
And I don't think an ALTER can survive in a transaction, i.e. it is not something that can be rolled back. It can be undone via another ALTER that is its opposite, but it cannot be rolled back the way an INSERT or UPDATE can.
> And I don't think an ALTER can survive in a transaction, i.e. it is not something that can be rolled back. It can be undone via another ALTER that is its opposite, but it cannot be rolled back the way an INSERT or UPDATE can.
It can in databases that use transactional DDL. And, really, supporting this is part of the relational model, since part of that model is that the metadata describing the structure of data is itself stored in relations and -- whatever the syntactic sugar layered over top of it -- can itself be manipulated with the same tools that regular data is manipulated with.
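Both halves of that are easy to see in, say, PostgreSQL (sketch; hypothetical table):

    BEGIN;
    ALTER TABLE users DROP COLUMN last_name;
    ROLLBACK;  -- the column, and its data, are back as if nothing happened

    -- and the schema itself is just rows in relations:
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'users';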
Having the DDL be described relationally is only incidental. I would imagine rollback support to be highly dependent on how the storage engine works. It seems non-trivial to me, short of copying the entire table across the ALTER. And that doesn't sound scalable.
I agree. I wonder how many NoSQL advocates should really be called NoMySQL advocates. I have always been a NoMySQL advocate, but not necessarily a NoSQL advocate.
Without a fixed schema, the schema must be stored with each item. Every "row" not only has the data but the metadata as well, including field names.
If you have a lot of repetitive data, this can consume very significant amounts of space (regularly on the order of 4x when comparing PG rows to BSON docs). This isn't just a disk space issue: it means less of your data fits in cache and more pages to scan when doing any operation. Compression only improves the disk space issue, while adding CPU overhead.
Relational DBs like PostgreSQL, having a fixed schema, need only store it one time; each row has minimal overhead (about 23 bytes in PG) and stores only the unique data.
Perhaps Mongo could optimize behind the scenes, detecting a repeating schema and building a fixed-schema data structure for it, similar to how the V8 JavaScript engine creates hidden classes for JavaScript objects so that not every property access is a performance-limiting key-value lookup. I haven't seen any NoSQL db attempt that; I don't think any even attempt simple string interning, as suggested in Mongo's issue 863.
PG gives me a choice: I can go nearly schemaless with the JSON type, or I can create a fixed schema when that makes sense, with many optimized data types to choose from. PG's array support gives you a nice middle ground between the two as well.
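A sketch of that mix (PostgreSQL 9.3+ for the JSON operator; all names invented):

    CREATE TABLE products (
        id    serial  PRIMARY KEY,
        name  text    NOT NULL,   -- fixed and type-checked
        price numeric NOT NULL,   -- fixed and type-checked
        tags  text[],             -- arrays: the middle ground
        attrs json                -- nearly schemaless where it's worth it
    );

    SELECT name, attrs->>'colour' AS colour
    FROM products
    WHERE 'sale' = ANY (tags);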
It's because a fixed schema is unnecessary in most cases. Most developers using a database are doing so via an ORM, in which case they already have a data model in their code. It is often fixed and, depending on the language, strongly typed. It is also centralised, richer, commented, has far better validation, and is in source control.
So again, why do I need a second data model that does nothing but get in my way?
Because querying and aggregating your data with arbitrary conditions becomes much easier. Using SQL, you describe what result you want and let the database figure out how to retrieve it.
In many document stores, you have to write manual map/reduce jobs telling the system how to get the data you want.
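Compare the declarative version (standard SQL; the orders table is hypothetical):

    -- You state the "what"; the planner picks indexes, join order, etc.
    SELECT customer_id,
           count(*)   AS orders,
           sum(total) AS revenue
    FROM orders
    WHERE created_at >= '2013-01-01'
    GROUP BY customer_id
    HAVING sum(total) > 1000
    ORDER BY revenue DESC;

In a map/reduce-only store, you hand-write the grouping and the sum, and then rewrite them for every new condition.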
Most software development projects tend to fall into two camps these days. Either they are simple and just need basic filtering/grouping, in which case you don't need map/reduce; or they are doing more advanced querying of data, in which case Hadoop/Mahout is a better option.
See my other response about the advantages of schema-less design. I agree that fixed schemas can often be useful - the point is to think about this in advance for every project, before deciding on SQL vs. NoSQL. I'd say a fixed schema helps during initial development, but will lead to more and more expensive data migrations later on in a system's life cycle, except in the case of systems that are very well defined before anyone writes a line of code. And sometimes it's plain not possible, if you want a flexible data model that isn't fixed at programming time. It's simply - as always - a question of choosing the right tool for the job.
> I'd say a fixed schema helps during initial development, but will lead to more and more expensive data migrations later on in a system's life cycle
Yes. The migrations might become expensive (though that varies between databases - some take exclusive locks for a long time, others don't), but you can save a lot of time and money when querying the data.
In a reasonably normalized schema, there's next to nothing that you can't extract with the help of SQL; whatever you throw at it, databases have become very good at planning the optimal way to retrieve the data you've requested.
If all your data is in huge blobs, querying for arbitrary conditions and aggregating parts of those blobs is complicated and requires manual planning of every query you may come up with.
If your data is well normalized, whatever SQL you throw at it, the database will know how to answer. No specific retrieval code needed.
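For instance, an ad-hoc question nobody planned for at design time (a hypothetical normalized schema):

    -- Revenue per country for one product category, straight off the tables
    SELECT c.country,
           sum(oi.quantity * oi.unit_price) AS revenue
    FROM customers c
    JOIN orders o       ON o.customer_id = c.id
    JOIN order_items oi ON oi.order_id   = o.id
    JOIN products p     ON p.id          = oi.product_id
    WHERE p.category = 'books'
    GROUP BY c.country;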
As such I would argue that a strict schema and a relational database will save you a lot of time as the requirements of your application change.
1.) Yes, ad-hoc querying is usually easier to do and more powerful on relational databases (although the trend in NoSQL databases seems to be going in that direction as well, ArangoDB being one example).
2.) The document-based approach doesn't say anything about the degree of data normalization - you can have your data as normalized or denormalized as you want, just as in relational databases. Each row can correspond to one document, and the containing table usually corresponds to a 'type' property on the document. Following this approach, building views/queries is about as powerful as it can be in an RDBMS; the trade-off is rather between strictness in forbidding/allowing slow algorithms vs. ease of implementing a first working version.
> The advantages of a fixed schema are similar to the advantages of a static type system in programming
Yes, but I already have one of those in the form of a static type system. I dislike database schema for the same reason I dislike XML schema: it tends to enforce only the most trivial constraints, and means you duplicate part of your validation (because you still want validation in code for the things you can't check in the schema). It's just another moving part that can be a potential source of failures.
Are you sure about that? You can validate an awful lot in a PostgreSQL constraint, and I often don't bother validating in the app as a result. Validation in the app can also be extremely difficult to get right in a concurrent environment. There is a high likelihood that many such checks have race conditions that you aren't aware of, if you haven't gone over your application with a dedicated DBA.
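For example, "no overlapping bookings for the same room" is a classic check that is racy as SELECT-then-INSERT in app code but airtight as a constraint (PostgreSQL 9.2+ with the btree_gist extension; the schema is made up):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE bookings (
        room_id int     NOT NULL,
        during  tsrange NOT NULL,
        -- Two concurrent transactions inserting overlapping ranges:
        -- one of them fails, with no race window.
        EXCLUDE USING gist (room_id WITH =, during WITH &&)
    );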
You're right that a lot is possible, particularly if you're willing to write a trigger, at which point you have a programming language at your disposal. But it's not a great environment for writing code with logic, in terms of things like tool support, debugging support, and library ecosystem.
I do not disagree with your primary PoV that schemas can be a good thing, but:
a. Schema changes are considered DDL statements and hence can't be simply rolled back via transaction in most databases (as far as I know).
b. If you have millions or more rows and you need to introduce a schema change, it can get painful to do and can result in app downtime. It is a well-documented problem with some well-documented workarounds.
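The classic workaround is the copy-and-swap pattern that tools like Percona's pt-online-schema-change automate (a very rough sketch in MySQL syntax; a real tool also installs triggers to capture writes that land during the copy):

    CREATE TABLE users_new LIKE users;
    ALTER TABLE users_new ADD COLUMN middle_name VARCHAR(100) NULL;

    -- copy in id-range batches so no single statement locks everything
    INSERT INTO users_new SELECT u.*, NULL FROM users u WHERE u.id BETWEEN 1 AND 10000;
    -- ... repeat for the remaining ranges ...

    RENAME TABLE users TO users_old, users_new TO users;  -- atomic swap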
On the contrary, coming from the world of mature SQL systems (like SQL Server), I was horrified to learn that MySQL supports neither transactional DDL nor online schema updates. These are taken for granted in premium SQL databases, and to a lesser extent in good free ones like PostgreSQL. (Raise your hand if you've seen the polite suggestion from the Django (South?) team to use a "real" database when your migrate command fails...)
Put simply: MySQL was written by and for people whose time isn't valuable. I.e. the assumption is that it's cheaper to throw low-grade people and brainpower at workarounds (OpEx) than to pay to have the problems avoided in the first place (CapEx). There's a reason people shell out the money for big-iron databases: the math can be easy when you factor in the costs saved (both the direct costs and the opportunity costs).
Luckily, just as not all NoSQL databases are the same, nor are all SQL databases. If NoSQL is an option, standards are obviously less relevant, so you can rely on the behaviour of one SQL database engine of your choice. Postgres is one of a few SQL database engines which implement transactional schema changes.
Companies such as Percona also provide open-source tools which allow for online, non-locking schema changes for MySQL, and I assume commercial database engines have options as well.
Just pick a better database.