I don't know Riak, other than that it's a distributed NoSQL key-value data store.
Time series data has been prevalent in fintech, quantitative finance, and other disciplines for decades. I read a book in the early 1990s on music as time series data, financial tickers, and so on.
How is Riak different, or more suited to use than Kdb + q, J with JDB (free), Jd (a commercial J database like Kdb/q)[2], or the new Kerf lang/db being developed by Kevin Lawler[3]?
Kevin also wrote kona, an open-source version of the "K programming language"[4].
Kdb is very fast at time series analysis on large datasets, and has many years of proven value in the financial industry.
Minor quibble: the article is about RiakTS, their time-series enhanced version of riak_core.
riak_core's main strength is that it does key-value in a distributed/resilient manner, spreading values in multiple copies (at least 3) all over a cluster of servers. Kill one server, no problem. Need more capacity, add servers and it will rebalance itself.
The TS part is just an optimization built on top of that, so that values which are near each other in time are also near each other in storage, for faster range-based retrieval.
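To sketch the idea in toy Python (this is an illustration, not Riak's actual code; the 15-minute quantum and 64-partition ring are made-up parameters, though Riak TS does let you pick a quantum per table):

    import hashlib

    QUANTUM_MS = 15 * 60 * 1000  # hypothetical 15-minute quantum

    def quantized_key(family: str, series: str, timestamp_ms: int) -> bytes:
        # Round the timestamp down to its quantum so every point in the
        # same window produces the same partition key.
        window = timestamp_ms - (timestamp_ms % QUANTUM_MS)
        return f"{family}/{series}/{window}".encode()

    def owning_partition(key: bytes, num_partitions: int = 64) -> int:
        # Hash the quantized key onto the ring: all points in one window
        # land on the same partition, so a time-range scan touches a
        # handful of vnodes instead of the whole cluster.
        digest = int.from_bytes(hashlib.sha1(key).digest(), "big")
        return digest // (2**160 // num_partitions)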
Most TS (tick) DBs are columnar, memory-mapped, fast and light systems.
Are there any benchmarks similar to STAC-M3, which runs a year's worth of NYSE data on different hardware configurations to gauge kdb+'s effectiveness [1]? It's a great way to gauge performance and TCO.
Does it do both memory (streaming data) and disk-based (historical) storage for big data set analytics in realtime?
I'd be interested to see numbers there.
A lot of people think kdb+ is only for finance. There is a conference coming up in May that will have talks on q (the language for the kdb+ database) about natural language processing and machine learning in q to name a few. Another is about using it at a power plant to most efficiently route power based upon realtime data [2].
I only got into kdb+ and q with the free, non-commercial 32-bit version. I usually use J and sometimes APL, which have had MapReduce-style operations since at least the '80s. Check out this post from 2009 [3]. I guess the 'new shiny' bit threw me in your chosen title.
I was inquiring about benchmarks for RiakTS, but your link was perfect. I am a J/APL dabbler, and quite recently learning kdb+/q (I prefer k).
As much as I step away from these languages, I always find my way back to them in strange ways. I was studying music, and there was a great J article in Vector magazine from August 2006 [1] that walks through scales and other musical concepts in J.
A Forth-based piece of music software called Sporth [2] has a kona ugen in it, so you can generate scales or other musical items in kona and then use them in the stack-based Sporth audio language.
My interests in kdb+/q, k, J and APL are in applying them to mathematical investigations of music, visuals, doing data analysis, and then just code golfing, or toying around. They're so much fun!
I need more time with large streaming datasets (time series data) than with large disk-based datasets to really test latencies. I am building a box much better suited for this than my current machine. The goal is to stay in RAM as much as possible.
I had stumbled upon John's work before.
I am currently dabbling with a stack-based audio language called Sporth [1], and messing with the idea of somehow mashing it up with John's ike project.
See, vector/array languages aren't just for FinTech or Time Series!
As far as I know, this and KDB have very different use cases. KDB is a single box timeseries database - the closest open source analogue would probably be using Pandas + a big folder of CSV files. (KDB performs a lot better than this, however.)
The use case is storing tick data + economic data for all the symbols. I.e., one timeseries per publicly traded company. The primary use case is loading a significant chunk of that data and running some statistical analysis on it.
Also KDB pricing starts at $100k or something like that.
In contrast, Riak's timeseries product is distributed. It could store millions of timeseries if you throw enough boxes at it. You probably won't be loading all the data for analysis, probably you'll be processing some of the data one series at a time, and the rest is just for reference.
The main use case here is sensor networks (aka "internet of things") more than financial data. I.e., one timeseries per rotor on a drone, or per sensor on your phone, something like that.
It’s not limited to a single box. “Single box” is the easiest setup, but by no means the common use case. Kdb+ can be, and is, used across multiple servers and multiple storage devices.
> the closest open source analogue would probably be using Pandas + a big folder of CSV files
I’m not familiar with Pandas, but I know what “a big folder of CSV files” looks like and that’s pretty far from kdb+. Historical/disk data is typically organized as “splayed/parted” tables, which means it’s column oriented, sorted, grouped/hashed by key columns, can be distributed across multiple devices[1], and has built-in compression. So yes, operating over binary data, with a well-organized physical layout, and some optimized data structures is going to be much faster, not to mention the implied storage/server cost savings.
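If the splayed layout is unfamiliar, here is a toy NumPy illustration of the column-per-file idea (kdb+'s actual on-disk format, with enumerated symbols and parted attributes, is more involved):

    import numpy as np
    from pathlib import Path

    def splay(table: dict, root: str) -> None:
        # One binary file per column: a query touching two of twenty
        # columns only ever reads two files.
        path = Path(root)
        path.mkdir(parents=True, exist_ok=True)
        for name, column in table.items():
            np.asarray(column).tofile(path / name)

    def read_column(root: str, name: str, dtype) -> np.ndarray:
        # Memory-map the column so only the pages a query actually
        # touches are faulted in from disk.
        return np.memmap(Path(root) / name, dtype=dtype, mode="r")

    trades = {"price": np.array([101.2, 101.3]), "size": np.array([200.0, 50.0])}
    splay(trades, "db/2016.04.01/trade")
    prices = read_column("db/2016.04.01/trade", "price", np.float64)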
> Also KDB pricing starts at $100k or something like that.
I wish Kx were more open about this, but this is not very accurate. You pay per CPU core, or for query throughput. The size of your data or the number of users doesn’t matter, so you can start out for 1/10th of this cost. Of course your need for query throughput will correlate with things like the number of users/clients and the size of your data. But the licensing model is pretty simple, and seems to fit well with how user access scales.
I know Kx makes the argument that the licensing costs are more than offset by the reduced server footprint. Personally, I’d love to see an actual case study that shows costs for an open-source database solution (with some performance metrics/benchmarks) alongside a performance-equivalent kdb+ solution that costs less (and is projected to scale for less). I know that’s a complicated comparison, where you’d really need to account for things like developer/admin time/salaries, not just equipment/bandwidth/storage/licensing costs. But it would be great to at least see a cost comparison on the latter dimensions.
So the file format is a lot better than CSV files, but in principle it's basically just a bunch of files. Maybe a better analogy would have been a big folder of feather/hdf5/etc files.
(Incidentally, I'm a big fan of the folder/s3 bucket/etc full of CSV/binary files and use it whenever possible.)
I agree - it's absolutely better than that, but it's a lot closer to that model than to the Riak model of querying a distributed system to send you the data.
I stand corrected on single box and pricing - it's been a while since I've used it.
kdb+/q/k are used for IoT applications [1], not just fintech. After all, it is all time series data.
The benchmark given in a response above by srpeck [2] shows Spark/Shark to be 230 times slower than a k4 query, while using 50GB of RAM vs. 0.2GB for k4. If RiakTS is relying on Spark/Shark as the in-memory database engine, it is already at a big disadvantage compared to k in terms of speed, and all the RAM that is going to be required on those distributed servers.
I will have to look at the DDL/math functions available in RiakTS too, since that is how you get your work done regardless of speed of access.
Very cool, I stand corrected. I hope one day I have another opportunity to play with KDB.
As for the speed advantage, you'll have a similar speed advantage with python/pandas/big folder of CSV files. For all of Spark's claims on "speed", it's really just reducing the speed penalty of Hadoop from 500x to 50x. (Here 500x and 50x refer to the performance of loading flat files from a disk.)
Do you really mean flat CSV text files? I get the simplicity of that, but it seems really expensive (speed and size). I'm used to tables with more than a dozen columns, and with kdb+ you only pull in the columns of interest, and the rows of interest (due to on-disk sorting and grouping), which is a smaller subset, often much smaller.
By count, my data sets are usually CSV. I could probably get some additional advantage via HDF5, but a gzipped CSV is usually good enough and simpler. By volume (i.e. my 2 or 3 biggest data sets), I'll probably end up mostly on HDF5. I haven't tried feather yet, but it looks pretty nice.
KDB would probably be better, but don't underestimate what you can do with just a bunch of files.
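A minimal pandas sketch of the pattern, for the curious (the folder name and the 'timestamp'/'price' columns are hypothetical):

    from pathlib import Path
    import pandas as pd

    def load_folder(folder: str) -> pd.DataFrame:
        # Read every CSV (gzipped or not) in a folder into one frame;
        # pandas infers compression from the .gz suffix.
        frames = [
            pd.read_csv(p, parse_dates=["timestamp"])
            for p in sorted(Path(folder).glob("*.csv*"))
        ]
        return pd.concat(frames, ignore_index=True)

    ticks = load_folder("ticks/")  # hypothetical folder of per-day files
    daily = ticks.set_index("timestamp")["price"].resample("1D").ohlc()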
KDB is not distributed, and K (APL) is not particularly pleasant to work with. While it has a proven track record in fintech, no one I know of is particularly fond of working with this technology. Not to mention the cost of K developers ($200K+). Riak is simply offering a free alternative to these systems that is very palatable.
I’m not sure what you mean by "not distributed". With kdb+ you have a lot of flexibility in how to set up the database. You can organize the data to be stored in a distributed fashion (across multiple devices, multiple servers), you can set up query load balancers to distribute workloads, and you can replicate to multiple servers/devices. You don’t have to use K; you code in q, which most people find far easier to read/write. There’s a wealth of information to help with all this[1], and a very responsive user group.
But yes, you do need kdb+ expertise to get full use out of the tool. And yes, feelings seem to run strong towards kdb+, in both directions, love/hate it. And correct again, it’s not free and the licensing cost is definitely a hurdle to wider adoption.
>I’m not sure what you mean by "not distributed". With kdb+ you have a lot of flexibility in how to set up the database. You can organize the data to be stored in a distributed fashion (across multiple devices, multiple servers).
So I was under the impression (feel free to correct me) that kdb horizontal scaling was something akin to Oracle RAC, i.e. horizontal in name only: the data is only ever available from one physical instance at a time.
Having worked with both Oracle and KDB, the thing to understand is that KDB (or K or Q) is a fully fledged language... you can do whatever you like really. If you want something distributed you just design something yourself and build it. It's not like Oracle where it is all opaque and buried in query plans and the like.
> How is Riak different, or more suited to use than Kdb + q, J with JDB (free), Jd (a commercial J database like Kdb/q)[2], or the new Kerf lang/db being developed by Kevin Lawler[3]?
I would really be interested to see an informed answer to this question!
> Riak uses the SHA hash as its distribution mechanism and divides the output range of the SHA hash evenly amongst participating nodes in the cluster.
Wait, Riak uses SHA as distribution hash?
Why use a cryptographic hash for distribution and not something like Murmur3, if you're talking about high performance[0]?
The hash function in this case is used purely for generating an integer from a small binary blob (bucket/key pair) and that integer is the deterministic artifact that tells Riak Core which machines and hash partitions are supposed to own that data.
The performance impact of that is massively dwarfed (by probably 3+ orders of magnitude) by everything else that's going on in the critical sequential read/write path.
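If you want to put numbers on that, a quick Python back-of-the-envelope (crc32 standing in for a fast non-cryptographic hash, since murmur isn't in the stdlib) shows the per-key hashing cost is a fraction of a microsecond either way:

    import hashlib
    import timeit
    import zlib

    key = b"bucket/some-key-00000001"

    sha = timeit.timeit(lambda: hashlib.sha1(key).digest(), number=1_000_000)
    crc = timeit.timeit(lambda: zlib.crc32(key), number=1_000_000)
    print(f"sha1:  {sha:.3f}s per million keys")
    print(f"crc32: {crc:.3f}s per million keys")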
I guess it's worth investigating; murmur has been in use by some big names (Cassandra, Elasticsearch, Hadoop, etc.) for a while in similar contexts.
More time is likely spent synchronizing data across the network than hashing. (And the variance for the network time is likely high enough to account for the hash time)
Conjecture: not sure if this is the reason why SHA is used, but a useful side effect is that it may make users of Riak less vulnerable to certain types of denial of service attacks. Not 100% sure since I know little about how Riak works, but a more predictable hashing algorithm could make it easy for attackers to overload a given bucket with data, and slow down the db to a crawl.
I've always found it quite curious that human-computer interfaces have always focused on the noun/verb proposition of describing data, and not the time/place. Time is the only true constant in the universe, and yet computers are set up to track and control it, seemingly, as an afterthought.
Imagine if instead of having files/folders to (teach,confuse) Grandma, we simply had a time-based system of references. If Time were a principal unit of information that a user was required to understand as an abstract concept, I feel that it would result in far better user interfaces.
We can see this in the music-making world, where Time is the most significant domain over which a musician exerts control. A DAW-like interface for managing events seems to me so intuitive for so many other non-musical applications that it's almost extraordinary that someone hasn't built an email system, accounting system, or graphic-design application oriented around this aspect. (Of course, they are out there, but it seems that Time management makes the dividing line between "professional" and "dilettante" users rather thick...)
> Imagine if instead of having files/folders to (teach,confuse) Grandma, we simply had a time-based system of references. If Time were a principal unit of information that a user was required to understand as an abstract concept, I feel that it would result in far better user interfaces.
How would it be easier when I, let alone my Grandma, can't remember if I did something last week or the week before last? It seems it puts a higher cognitive load on the user.
Yeah, I actually don't think the human brain accounts very well for time. Even though a decade has passed, I don't feel as if time has moved very much for me. Despite the fact that people age, recognizing your own age does not seem to be an inbuilt thing. It just seems to be something that happens to your body.
Thinking a bit more about it, I don't think the brain accounts for time in long-term memory; it does a better job in short-term memory. That's why a musician can use time, and we can't use it very well for stuff we stored a week ago.
You think it's better to have a situation where you can't find where something is, nor even approximate it, but you can generally approximate when something is, and grope around it? Because I don't think that's the case. If you can't remember where you put it, it's not helpful to grope around the location of where you think you might've put it. Or at least, let's say this: it might be more helpful to grope around time than space when dealing with abstract ideas.
A user interface can be designed to help you do that groping with Time. User interfaces have been designed to help you grope around form or place (cf. Spotlight), too, but I wager Time-groping is more effective. But that could just be the musician in me...
Have you ever seen Time suddenly stop? It doesn't. It just keeps on ticking. (okay, the theory of relativity may have other things to say about how those ticks propagate through space, however: we're talking about computers and user interfaces here..)
We're on the lookout for suitable remote storage for prometheus.io, and would want to know the hardware that'd be required to handle 1M samples/s and how many bytes a sample takes up.
It doesn't support full float64, which we need, but we could work around that by putting it into a 64-bit unsigned number.
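One way to do that round trip, sketched in Python (whether this matches the intended workaround is my assumption):

    import struct

    def float_to_u64(x: float) -> int:
        # Reinterpret the float64 bit pattern as an unsigned 64-bit int.
        # Note: this preserves the bits, not numeric sort order.
        return struct.unpack(">Q", struct.pack(">d", x))[0]

    def u64_to_float(n: int) -> float:
        return struct.unpack(">d", struct.pack(">Q", n))[0]

    assert u64_to_float(float_to_u64(3.14159)) == 3.14159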
Might be worth looking into dalmatiner.io (DalmatinerDB) as an alternative to this. It's also built on riak_core to manage cluster membership and the top-level framework for dealing with routing and rebalancing.
Waited for a long time for Riak TS to come out. Tried KairosDB & Cyanite, but the operational overhead of Cassandra wasn't something I wanted to buy into for such a narrow use case (infrastructure metrics store), and then suddenly out of nowhere DalmatinerDB was released. The code is clean, the architecture is solid, and the ops story is simple.
I don't have any affiliation of any kind with the Dataloop folks. I am however a happy end-user. We do currently use Riak KV due to its CRDT support though.
DalmatinerDB has been around for a while now. I started working on it during the EUC in 2014, and it was released as part of ProjectFiFo the same year. I just really suck at marketing ;).
But I'm kind of curious, are you using Dalmatiner directly? And if so what is your use case?
Current deployment is Graphite (Carbon/Whisper) proper due to a lack of time to change it out, but eval'd Dalmatiner and should be migrating to it in the next 6 months (before the year is out). Made it through testing like a champ.
The use case isn't particularly interesting though. Just multi-site machine and application metrics aggregation. Needed something that could work with Graphite tooling, would be highly-available, and be comparatively trivial to operate/maintain.
I see SQL support; that is interesting. Isn't Riak the premier NoSQL database? I guess it is a NoNoSQL db now ;-)
The implementation of the SQL part is so neat. Great work, whoever did that. It uses the yecc and leex that come with Erlang, and rebar even knows how to compile those. Very cool!
All non-trivial NoSQL databases support (or will support) SQL - otherwise programmers can insert data, but end-users can't consume it. No reporting, no revenue.
Cassandra has actually deprecated their original data access API in favor of CQL (the Cassandra Query Language).
> So what’s the big deal? People have been recording temporally oriented data since we could chisel on tablets.
The article never answers it, but instead explains how Riak handles large time series. Certainly interesting, but I would like an answer to this question, as I don't understand the big deal.
Immediately following the part you quote (my emphasis):
> Well, as it turns out, thanks to the software-eating-the-world thing and the Internet of Things we happen to be amassing *vast quantities* of all sorts of data [...] The demand for systems that are capable of storing and retrieving temporal data on an *ever increasing scale* necessitates systems that are specifically designed for this purpose.
Strongly implying the big deal is the need to scale?
That's an obstacle that needs to be overcome because you want to get somewhere. A challenging obstacle that has spawned an entire industry, but nevertheless not the goal. It is implied we can get somewhere new and exciting with these vastly larger timeseries. My question is: where?
I think Datomic, which has been doing time series data for years now, answers the "what’s the big deal" question for me:
"Given a value of the database, one can obtain another value of the database as-of, or since, a point in time in the past, or both (creating a windowed view of activity). These database values can be queried ordinarily, without having to make special queries parameterized by time." http://www.datomic.com/rationale.html
So in other words you get to see exactly how the whole database looked at any given moment through history (certain parallels with the blockchain I suppose).
Until you reach 10B datoms, and then you're off for a fun ride sharding across multiple databases. Also, in an IoT context or more general timeseries usage, you very likely don't care about tx, but very much do care about write throughput and not having a SPOF, which Datomic isn't good at, at all.
Datomic is an ok choice in some contexts, but the one detailed here is not one of them, you're likely to reach Datomic limits quickly and be in a world of hurt when you do.
> I think Datomic, which has been doing time series data for years now, answers the "what’s the big deal" question for me:
This is known as "time travel queries". Postgres supported this back in the original version from the 1980s but then they took it out in 1997 because it takes up too much disk space.
It's trivial to do in an MVCC system. You just turn off garbage collection (vacuuming in PSQL parlance).
I think the big deal is allegedly "we now have so much data that we have a use for distributed databases with high performance", which allows them to go on and say "and our big deal is that we've built a high-performance distributed database tuned especially for time series".
I only noticed it because I was shocked that there was an actual comprehensible value proposition in a blogpost about a NoSQL product.
However, as the sysadmin of a PostgreSQL database managing a lot of time series, I was just like "meh, use timestamp and a good index, what's the deal?". Also, for performance's sake, just cluster the data using the most-used index...
But yeah, I guess if you really need NoSQL this should be nice. However, sharding based on time will probably be one of the easiest approaches for horizontal PostgreSQL deployment.
One of the most compelling features of Riak, both KV and TS, is that it is masterless and replicated. When nodes fail for whatever reason, the cluster handles it transparently.
As someone who deals with sensor data, the tricky part is really not the write-rate, but rather dealing with messy data. There's a lot of parallelism in sensor network streams, and for many domains you never look at the sensors from one device against the sensors of another device, so you can put them in entirely different databases and it doesn't matter. (It's not true in every case, of course, but if you're doing time series/streaming, ask yourself if it's true for you before picking a system)
The real pain is handling data that arrives out of order or otherwise very late, or handling data that never arrives at all, or handling data that's clearly wrong. Worse, you may have streams that are defined/calculated from other streams for some algebra on series, e.g. series C is series A plus series B - so handling new data on A means you need to recalculate/update the view for C.
Oh, and you'd like this all to be mostly declarative so you have some way to migrate between systems if you need to switch for whatever reason.
Apache Beam/Google Dataflow gets a lot of this stuff right: it's not quite as declarative as I'd like but it gets the windowing flexibility right and handles restatements at a data model level.
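To make the restatement problem concrete, here is a deliberately naive pandas sketch of the C = A + B case above (a real system would recompute only the affected windows, and treating a missing side as 0 is just an assumption):

    import pandas as pd

    def restate_c(a: pd.Series, b: pd.Series) -> pd.Series:
        # Deduplicate (keeping the latest value per timestamp), sort,
        # and recompute the whole derived view whenever a late point
        # arrives on either input.
        a = a[~a.index.duplicated(keep="last")].sort_index()
        b = b[~b.index.duplicated(keep="last")].sort_index()
        return a.add(b, fill_value=0.0)  # assumption: missing side counts as 0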
>The real pain is handling data that arrives out of order or otherwise very late
Riak TS uses LevelDB under the hood. LevelDB is natively sorted, and in Riak TS that includes the bucket, so the sort order is basically bucket/PK, where PK is your composite PK as defined in your CREATE TABLE statement. See Local Key [0].
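A toy illustration of why that sort order makes range queries cheap (this is the general trick, not Riak TS's actual key encoding):

    import struct

    def ts_key(bucket: bytes, series: bytes, timestamp_ms: int) -> bytes:
        # A big-endian fixed-width timestamp makes lexicographic byte
        # order match chronological order within a series.
        return bucket + b"\x00" + series + b"\x00" + struct.pack(">Q", timestamp_ms)

    # Keys for one series sort by time, so a time-range query becomes a
    # single contiguous scan over the sorted store.
    assert ts_key(b"metrics", b"cpu", 1000) < ts_key(b"metrics", b"cpu", 2000)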
Fantastic db, but maybe not quite ready for production usage. We are using Influx for a small portion of our ingestion engine, as well as for storing server metrics. The updates/improvements have been pretty astounding over the last year, but also hard to keep up with. I had to fork the Node.js library just to update it from 0.9 to 0.12 [0] because there were a LOT of breaking changes. Pre-0.9 there were many issues we ran into when it came to disk space & performance, but they have all been resolved as of 0.9.
Another thing to note (after speaking with them on a few occasions) is that they are only providing cluster support in their enterprise offering, which won't be available till this summer. They do offer Relay, which is their high-availability tool for the open source version.[1] You really won't need clustering unless you're doing an insane amount of writes: single-server performance is insane right now. We average 10k writes/sec with bursts up to 5x that, and it doesn't break a sweat (on a cheap 2 CPU/7GB RAM instance, with SSD block storage).
I am not an expert, but from using InfluxDB I think Influx supports more features in terms of aggregation/rollups/gap filling/retention/etc. But I suspect Riak TS would win on certain scalability use cases because it's based on the Dynamo architecture.
Time series does not necessarily have to be about 'huge' data either, just a much greater level of historical precision. Example:
ISP sells a circuit with 95th percentile billing to a customer.
If you poll SNMP data from a router interface on 60 second intervals and store it in an RRA file, you will lose a great deal of precision over time (because RRAs are highly compressed over time). You'll have no ability to go back and pull a query like "We want to see traffic stats for the DDoS this customer took at 9am on February 26th of last year".
With time series statistics you can then feed the data into tools such as Grafana for visualization.
An implementation such as OpenTSDB to grab the traffic stats for a particular SNMP OID and store them will allow you to keep all traffic data forever and retrieve it as needed later on. The amount of data written per 60-second interval is minuscule; a server with a few hundred GB of SSD storage will be sufficient to store all traffic stats for relevant interfaces on core/agg routers for a fairly large ISP for several years.
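A toy Python sketch of the billing math with made-up numbers: a short burst falls inside the discarded top 5% of samples, which is exactly why you still want full-precision history around to show what actually happened:

    import random

    # hypothetical month of 60-second samples, in Mbps
    samples = [random.uniform(30, 60) for _ in range(43_200)]
    samples[100:115] = [950.0] * 15  # a 15-minute DDoS spike

    ranked = sorted(samples)
    p95 = ranked[int(len(ranked) * 0.95)]  # drop the top 5% of samples
    print(f"billable rate: {p95:.1f} Mbps")  # the spike is billed away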
I can't speak for the engineers, but a lot of folks at Basho are constantly building and tearing down Riak single-instance or multi-instance clusters on Red Hat/Ubuntu virtual machines. I do that, and I also have different versions of Riak sitting in their own folders on my Mac HD. Mac OS X support +1.
Looks really nice, although I am a bit sad to see that it requires a structured schema. I have been on the lookout for a metric collection system (like InfluxDB), and this would fit very well, except for the schema part.
You can kinda fudge it by making one of the columns a varchar. It will store whatever you put in it, like stringified JSON, but won't compute over it (arithmetic, filters, aggregations).
[1] https://kx.com/
[2] http://www.jsoftware.com/jdhelp/overview.html
[3] https://github.com/kevinlawler/kerf
[4] https://github.com/kevinlawler/kona