Hacker News

I completely do not understand this sort of thinking. It's like saying, "All you people driving motor vehicles: trains are better". It's a meaningless comparison. NoSQL systems are really great for certain things. RDBMS systems are really great for certain things. What I would really like to see is all the effort spent writing these kinds of articles going into well-thought-out pieces that discuss very specific cases where NoSQL backends provide an advantage, and why. Use cases that are not a blog, Digg, or Twitter clone would be most helpful. Some of us have to work outside the Bay Area for companies like banks, insurance companies, hospitals, etc. Can I use Cassandra as a data warehouse for electronic medical records? Hell if I know, without actually having to learn it and implement it to see if it works.



If you look carefully, you'll find that a whole heck of a lot of those banks, insurance companies, and so forth are already using something that falls into the NoSQL realm -- Lotus Notes/Domino. The use case for a document-based, schemaless database is already well understood, if not always by the developers using it (or by the developers who grew up on a strict diet of relational databases).

It boils down to this: do you need to access data that would not be well served by a structured, tabular layout? Relational databases are (or should be) extremely efficient at accessing data that can be arranged neatly into tables and read using the equivalent of pointer math. They suck, though, when the data is stored outside of the table (as large text fields, blobs, and so forth would be). The more heterogeneous and variable the data, the worse performance gets. A schemaless database gives up efficiency with tabular data, often precluding efficient runtime joins and, consequently, multiple orthogonal data access paths (data normalization), in exchange for more efficient generalized access.

As a thought experiment, imagine a data quagmire where, in order to make the data fit a SQL table, every cell in every row in the database would need to be varchar(17767) and may contain multiple character-delimited values (or not -- each record can be unique), and every row can have an arbitrary number of cells. That's what schemaless data can, and often does, look like -- and something like Notes or CouchDB can work with it comfortably and with an efficiency that cannot be matched by a relational representation of the same data.
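A minimal sketch of that thought experiment, with plain Python dicts standing in for documents (the record names and fields here are purely illustrative, not taken from any real system): every record carries only the fields it actually has, and a document store just indexes into each one directly.

```python
# Hypothetical "quagmire" records: no two share the same shape, yet no
# giant varchar table or shared schema is needed to store them.
records = [
    {"_id": "doc1", "name": "Acme", "phones": ["555-0100", "555-0199"]},
    {"_id": "doc2", "title": "Q3 memo", "body": "...", "tags": ["finance"]},
    {"_id": "doc3", "name": "Beta", "founded": 1997},  # arbitrary extra field
]

# A document store addresses each record by key; missing fields are simply absent.
by_id = {r["_id"]: r for r in records}
print(by_id["doc3"].get("founded"))  # 1997
print(by_id["doc1"].get("founded"))  # None -- field never existed on doc1
```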


Bioinformatics is another application area where document-oriented databases have been used long before it was fashionable. For example, much of the data at the NCBI is stored in blobs as ASN.1 (which predated JSON by about two decades...)

One thing I don't yet understand is the big revolutionary deal about the NoSQL systems vs. storing one's hierarchical documents in blobs in an SQL database or BDB. At the end of the day, I'm not sure that there is any magic -- B-trees are B-trees, joins are joins, and disk seeks are slow.
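The blobs-in-SQL approach the comment describes can be sketched in a few lines with sqlite3 (the table and document shape here are invented for illustration): a hierarchical document is serialized as a blob, keyed by an ordinary B-tree primary key.

```python
import json
import sqlite3

# Sketch: hierarchical documents stored as blobs in a plain SQL table,
# looked up through the same B-tree index as any other row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body BLOB)")

doc = {"accession": "NC_000001", "features": [{"type": "gene", "span": [100, 900]}]}
conn.execute("INSERT INTO docs VALUES (?, ?)", ("NC_000001", json.dumps(doc)))

row = conn.execute("SELECT body FROM docs WHERE id = ?", ("NC_000001",)).fetchone()
print(json.loads(row[0])["features"][0]["type"])  # gene
```

The trade-off is the one the comment implies: the database can find the blob fast, but it cannot index or query anything *inside* it without deserializing.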


Upvoted for using the term "data quagmire." Many examples instantly shot through my head and I began laughing hysterically.


Here's the issue with this entire topic:

1. It's not just about which can do reads/writes better -- it's not apples to apples, it's apples to oranges. You're right.

2. If you look at the type of reads Cassandra can do and the type of reads MySQL can do, MySQL blows Cassandra out of the water in terms of flexibility (at a cost, obviously).

MySQL simply has the best flexibility. Cassandra has a ways to go.

3. In Cassandra you essentially have to hack your data model to fit the data structure to make things work, and if you decide one day to read things differently it's not always easy. You have to massively denormalize. (But hey, disks are cheap, as they say.)
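The "denormalize per query path" pattern can be sketched with plain dicts standing in for column families (this is not the real Cassandra API, just the shape of the idea): every read path gets its own pre-built structure, written at insert time.

```python
# Illustrative stand-ins for two column families, one per query path.
posts = {}            # query path 1: look up a post by id
posts_by_author = {}  # query path 2: list post ids for an author

def insert_post(post_id, author, title):
    # Each read path gets its own denormalized copy at write time.
    posts[post_id] = {"author": author, "title": title}
    posts_by_author.setdefault(author, []).append(post_id)

insert_post("post1", "alice", "Hello")
insert_post("post2", "alice", "Second post")
print(posts_by_author["alice"])  # ['post1', 'post2']
```

The pain point in the comment follows directly: adding a *new* query path later means backfilling a new structure from all existing data, not just adding an index.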

In a nutshell, use the best tool for the job. Cassandra happens to just fit lots of use cases so it's worth looking at. I don't think it would be best for a company to port everything over, MySQL is still very good at many things and the flexibility in reads is worth the cost.

You shouldn't wait for someone to tell you it's going to work for your use case. What really matters is which queries you end up asking your data store; Cassandra really shines for simple ones and can be hacked for complex ones. You need to just dive into the docs, and http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-mode... is a great resource.

FYI: I love what Cassandra does and I think it's the best of the NoSQL options out there.


I love what Cassandra does and I think it's the best of the NoSQL options out there.

I'd say that the differences between types of NoSQL databases preclude any attempt to choose a winner. It makes no more sense than trying to pick the better of NoSQL and RDBMS.

I think this illustrates much of the problem with the NoSQL moniker - it seems like an attempt to group a bunch of completely different data storage systems by arbitrarily excluding a single one.


"it's apples to oranges"

Exactly, and I didn't feel the OP was saying anything but that. What I thought he was trying to get across was that there are situations where NoSQL is the no-brainer superior choice, and by the sound of it, this was a response to seasoned SQL DBAs criticizing the NoSQL option.

I work at your typical large corporation writing typical applications on top of SQL databases all the time. SQL databases are incredibly flexible and perform excellently, and I'd need a VERY good reason not to use one on a project. For me, 49 times out of 50 they are more than sufficient.

But, at least in my industry, there are cases where it's just the wrong technology for certain special requirements. I was AMAZED how poorly Oracle performed in one particular use case (real-time aggregation of a relatively large hierarchical database), despite MANY hours devoted to indexing and tuning, with every table pinned in memory. So our solution was to write a kind of in-memory database, which outperformed Oracle by a factor of 1000 or more. Of course this is not surprising.
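The core of that "in-memory database" idea can be sketched in a few lines (the tree shape and values are invented for illustration): hold the hierarchy in native structures and aggregate with a recursive walk, rather than issuing repeated SQL queries against a table-shaped encoding of the tree.

```python
# A toy hierarchy held entirely in memory.
tree = {
    "value": 10,
    "children": [
        {"value": 5, "children": []},
        {"value": 7, "children": [{"value": 3, "children": []}]},
    ],
}

def rollup(node):
    """Real-time aggregation: sum a node's value with all its descendants."""
    return node["value"] + sum(rollup(c) for c in node["children"])

print(rollup(tree))  # 25
```

Against an RDBMS, the same rollup typically means recursive joins or repeated round trips, which is exactly where the comment says Oracle fell over.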

What is surprising, and I think a large motivation for the OP, is how ignorant some of even the most skilled SQL DBAs are of what's happening in the real world. About a year ago I was talking with one of the most senior database people at our company, and the topic of high-performance databases came up. I made some reference to how, of course, if you are dealing with huge performance demands, you wouldn't use something like Oracle. (I think maybe we were actually talking about Google.) And he made some reference to Google using Oracle, and I said, basically, no they don't, and then he said he knows they do because ~he knows someone that works there as an Oracle DBA~. Of course, Google almost undoubtedly runs lots of Oracle internally, but he was under the impression that Google is run on Oracle. And this is a guy that is smart (always relative, of course); he really knows what he's talking about when it comes to Oracle. The thing is, though, there is this whole other world that he is entirely oblivious to.

And then there are your standard DBA "professionals", who are just covering their asses. I personally consider myself to be largely uninformed when it comes to SQL Server and Oracle, but it's rare that I come across an xSQL DBA who knows more than I do about the internal system tables, or how to script actions (SQL Server DBAs mostly), or write triggers, or do anything even remotely advanced. In the large corporations I work in, I'd estimate at least 50% of DBAs don't know anything about system tables at all beyond what they learned in their certification course. They've likely never used them in their work. A lot of them don't even know they exist. Their profession is largely a mystery to them.

tl;dr: SQL DBAs aren't very well informed about what's happening outside their bubble.


> I was AMAZED how poorly Oracle performed in one particular use case (real-time aggregation of a relatively large hierarchical database)

Out of interest, what kind of hierarchical database was this? Something like this?

http://explainextended.com/2009/09/28/adjacency-list-vs-nest...


Except of course, TFA is not that, but instead, "Train riders, we're not complete idiots for driving cars."


So my thinking on this is that the way these NoSQL systems will make their way into healthcare is through the great backdoor of analytics. Virtually all available EHR/EMR systems today allow various mechanisms for data retrieval based on primary indices. Unfortunately that avenue does not lend itself to secondary data reuse, which will become more and more valuable as institutions realize their data is valuable not only for academic research but for operational efficiencies, i.e. money in the kitty.

Even more unfortunate still is that virtually all these installed systems have neither the performance capacity nor the advanced searchability to adequately mine this growing hoard of data. Administrators, under capex constraints, do not allocate resources for secondary systems that would duplicate data for mining purposes while alleviating strain on the principal production system. The no-money budget problem will lead in-house programmers to build these research systems on top of open source NoSQL solutions. There, the technology will prove itself.

Additionally, "NoSQL" comes in different flavors. Generally, all of them forgo the Consistency in CAP for Availability and Partition tolerance, which is fine for many use cases -- just not primary medical data acquisition use cases. As the field matures, programmers and system designers will learn how to make this work better, to the point where one day NoSQL systems may be used as the primary data repository for medical data. However, that day has not come. For instance, Riak allows you to tweak knobs to favor certain aspects of the CAP theorem at different times while in production (specifically the w and dw parameters, http://blog.basho.com/2010/03/19/schema-design-in-riak---int...). But having just started working with Riak in the last month or two, I would still only use it as an analytics tool exposing my medical record data to m/r jobs at this point. And before jbellis smacks me: I think Cassandra is awesome and I'm looking forward to spending some time with it, but I'm still not putting my med app data in Cassandra just yet as a primary data store.
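As a sketch of those per-request knobs: Riak's classic HTTP interface accepts w (write quorum) and dw (durable-write quorum) as query parameters on a PUT. The snippet below only builds the URL (the host, bucket, and key are made up, and no request is actually sent); it shows where the consistency dial lives.

```python
from urllib.parse import urlencode

def riak_put_url(bucket, key, w, dw, host="localhost:8098"):
    """Build a Riak HTTP PUT URL; w = write quorum, dw = durable-write quorum."""
    qs = urlencode({"w": w, "dw": dw})
    return f"http://{host}/riak/{bucket}/{key}?{qs}"

# Favor durability for medical writes: require every replica, durably.
print(riak_put_url("records", "patient42", w="all", dw="all"))
```

The same store can then be hit with w=1 for throwaway analytics writes, which is the "tweak at different times while in production" point above.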

/Disclaimer. I work for a major University Medical Center and write business web applications in this area./


Secondary data reuse/analytics requires more like "MapReduce" functionality rather than "NoSQL" stuff.


As a developer of a small, in-house specialized data warehouse for medical research purposes (built with 100% open source), I would be curious to hear more thoughts on this. The flexibility angle is certainly a huge win. My issue is that I can't quite get my head around how one runs ad-hoc queries like "give me everyone with a bmi >x and age between a and b". In the key/value view of the world, that doesn't quite fit.
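One way that query does fit the key/value world, sketched with invented patient documents: instead of a WHERE clause over indexed columns, a map step visits every document and emits the matches, which is what a CouchDB view or an m/r job does for you at scale.

```python
# Toy key/value store of patient documents (all values are made up).
patients = {
    "p1": {"bmi": 31.2, "age": 54},
    "p2": {"bmi": 24.0, "age": 61},
    "p3": {"bmi": 33.5, "age": 47},
}

def query(docs, bmi_min, age_lo, age_hi):
    # The "map" step: scan every document, keep those matching the predicate.
    return sorted(
        k for k, d in docs.items()
        if d["bmi"] > bmi_min and age_lo <= d["age"] <= age_hi
    )

print(query(patients, 30, 40, 60))  # ['p1', 'p3']
```

The catch, of course, is that without a precomputed view this is a full scan; the ad-hoc flexibility comes from being able to write a new map function, not from an index that already exists.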


Take a look at the revolutionary illuminate Correlation database - ultimate flexibility for "ad-hoc" queries

The Correlation database is a NoSQL database management system (DBMS) that is data model independent and designed to efficiently handle unplanned, ad hoc queries in an analytical system environment. Unlike relational database management systems or column-oriented databases, a correlation database uses a value-based storage (VBS) architecture in which each unique data value is stored only once and an auto-generated indexing system maintains the context for all values. Queries are performed using natural language instead of SQL.

Learn more at: www.datainnovationsgroup.com


How do NoSQL databases allow you to do a better job at "secondary data reuse" than a traditional RDBMS? Is it just a matter of (1) Denormalization (2) performance gains from not needing to worry about transactions?


The main issue from my perspective -- beyond performance -- is the flexibility of the NoSQL model, specifically the document store concept. The main analytics player currently is the data warehouse, which has served very well over the years but comes with a certain rigidity, mainly in programming and cost. Of course, not to detract from its success: OLAP data warehousing is very mature, with a rich toolbase.

The flexibility of the NoSQL model coupled with the exposure of your data to the m/r paradigm is the big win from my vantage point. Almost every NoSQL solution will expose your data via bindings in virtually every programming language, allowing almost any programmer to leverage NoSQL. Right now you kinda have to have experience with data warehousing. Cost-wise, what you would spend on licensing can be invested in hardware, or more specifically in talent, instead.

Secondary data reuse as a concept is highly unstructured. Sure, your primary systems capture data in specific formats, but your analysis can take you in all kinds of directions. Being able to slice and dice without having to pre-define everything will be huge.


I'd really like to see a set of use cases for each type of system too. I recently compared NoSQL systems in terms of the CAP theorem and the underlying data model (discussion here http://news.ycombinator.com/item?id=1190772). I haven't seen enough of these systems in production, but it would be great to see examples of each combination of CA, AP, or CP systems with relational, key-value, column-oriented, document-oriented, and graph (not featured in my post) data models.


yeah no but. if you need to move a lot of shit, trains are better.

flame on? maybe you don't understand because your startup doesn't have a lot of traffic?

you even admit that you don't really understand how the nosql dbs even work!


Can I use Cassandra as a data warehouse for electronic medical records?

I don't think people building those systems are supposed to talk about them...


Actually, you are allowed to talk about your setup; you're just not allowed to release your actual data. But that's true for a lot of systems.


That depends on the business. Some businesses consider things like that proprietary information. Some go so far as to consider just about every detail of their system including hardware and even OS sensitive.

Others are less secretive, but still want to protect the details of their schema. Properly and carefully developing a schema for a major system can be a big project that reveals a lot about your business, and it can give an upstart competitor an advantage.


The point in question is whether HIPAA and other health privacy standards allow a provider to describe their architecture. They do (see: http://en.wikipedia.org/wiki/Health_Insurance_Portability_an...).

Whether a business chooses to talk about their setup depends on a variety of factors, including their business model. However, keeping such things private is often security through obscurity (http://en.wikipedia.org/wiki/Security_through_obscurity).


I think we had different understandings of the grandparent:

I don't think people building those systems are supposed to talk about them...

In a former job I worked on systems similar to what was described and my former employer would be very unhappy with me if I revealed so much as their hardware setup much less schema. I am not supposed to talk about such things, and there is an NDA that says so....

Whether it is legal for an authorized person from the company to discuss those matters is a separate matter.

Also, I must point out that this is not security through obscurity. Security through obscurity cannot be relied on, I agree. But in this case, it is a matter of preventing your competitors from knowing what you are doing.

You know your competitors can develop the same thing you did in time, but you want to make sure they have to spend that time rather than being "inspired by" reading over your source code or even stealing it entirely. In some competitive environments, even just knowing what your competitors are or are not capable of at that moment can be a huge advantage.


I wasn't referring exclusively to HIPAA. I also assumed that medical software companies would tend to be the type that don't like their employees blogging. It's not even just security, but general trade secret stuff.


Why would these kind of details be better kept in secret?


They wouldn't, but often times certain regulations may require it.


But in this case, the regulations don't require it.


[deleted]


Or PHP vs Every Other Language Out There. The NoSQL crowd is mainly saying that an RDBMS isn't for everyone, and that other storage methods need a bit of time in the limelight.


In each of your other examples, you are listing two things that are direct (or almost direct) competitors. It makes sense to compare them and then pick one over the other.

NoSQL and RDBMS are further apart. There are cases where it makes sense to use both for different parts of your system. Sure, you can force one to do the job of the other, but that shows you don't know when to pick one over the other (or were otherwise forced to do that...), not that they really target the same use cases.





