Call me maybe: Elasticsearch 1.5.0 (aphyr.com)
276 points by tylertreat on May 4, 2015 | 66 comments



> How often the translog is fsynced to disk. Defaults to 5s. [...] In this test we kill random nodes and restart them. [...] In Elasticsearch, write acknowledgement takes place before the transaction is flushed to disk, which means you can lose up to five seconds of writes by default. In this particular run, ES lost about 10% of acknowledged writes.

Something bothers me about this: if the bug was merely a failure to call fsync() before acknowledging an operation, then killing processes shouldn't be enough to cause data loss. Once you write to a file and the syscall returns, the written data goes into the OS's buffers, and even if the process is killed it won't be lost. The only time fsync matters is if the entire machine dies (because of power loss or a kernel panic, for instance) before those buffers can be flushed.

So is the data actually not even making it to the OS before being acked to the client? Or is Jepsen doing something more sophisticated, like running each node in a VM with its own block device instead of sharing the host's filesystem?


Looks like there are two separate relevant configs:

index.translog.interval - How often to check if a flush is needed, randomized between the interval value and 2x the interval value. Defaults to 5s.

index.gateway.local.sync - How often the translog is fsynced to disk. Defaults to 5s.

http://www.elastic.co/guide/en/elasticsearch/reference/curre...

It's probably the flush interval, which also defaults to 5 seconds, that is responsible for losing data from a simple kill -9 of the process. Before that flush, the data is in the process's memory but not in the OS buffers.


I believe Jepsen is killing (VM) systems, not just processes. It is meant to model the failure modes in a data center, which would include machines bursting into flame, etc.


Hmm. Upon looking at the documentation [1], it looks like the recommended configuration is to run nodes in LXC containers. Maybe I'm mistaken, but I thought "LXC" basically just meant some automation around kernel features like chroots, user namespaces, etc. In which case, shouldn't the containers share the host's VFS and pagecache?

[1] https://github.com/aphyr/jepsen/blob/master/jepsen/README.md


Ah, I didn't know it used LXC, as opposed to Xen or the like.


After the syscall returns successfully, the data should make it to disk eventually, barring any other storage-subsystem errors.

I believe this page provides the details necessary to understand and control when that happens: http://www.westnet.com/~gsmith/content/linux-pdflush.htm


pdflush was replaced with per-BDI writeback in 2.6.32: http://kernelnewbies.org/Linux_2_6_32#head-72c3f91947738f1ea...


So it has. From what I can find, though, it appears the tunables and the behaviour around them are pretty much the same between flush and pdflush.


Write acknowledgement in this case means committed to memory, i.e. the transaction was recorded to a translog buffer (which is fsync'ed to disk every 5 seconds), and changes were applied to the database in memory. The translog (aka write ahead log) has a separate thread which will do fsync every 5 seconds.

Unless ES blocks on the translog fsync, you can have lost transactions in the 5 second window.


The only thing that fsync() does is take data which is already in the operating system buffers associated with a file descriptor, and write it physically to the disk. If the thread that runs every 5 seconds is really just fsync'ing, then when the process gets killed, any un-synced data is still in those buffers and will be available to read when it restarts, exactly the same as if fsync() had been called. The OS is responsible for maintaining a consistent view of the filesystem such that, unless the machine totally fails (or you circumvent the FS and actually inspect the live block device), fsync() appears to be a no-op.

It sounds like what you're saying is that the documentation is wrong, and that the translog thread is actually pulling data from Elasticsearch's internal buffers and writing it to files. In that case, the documentation which refers to that operation as "fsync" is very badly misleading because it disguises what failure modes it's actually protecting against.


The point of a return from fsync is that you are guaranteed the file has been written to disk[1]. If you don't block on fsync, you can't guarantee the file was written to disk, because the server may have died in any number of ways.

[1] This guarantee occasionally fails too; If you have a battery-backed NVRAM RAID controller, the guarantee is that the write has hit the NVRAM controller with the expectation that it will hit a disk before the battery dies. Throw in a 72 hour power outage, a controller failure, or a massive disk failure, and you can't even guarantee that.


No, I understand that. Maybe I'm not explaining my point properly, so I'll try again:

If you issue a write() syscall from a process, and the syscall succeeds, then the data that was written is present in the OS's cached view of the filesystem, even if the process dies a nanosecond later. That view is shared consistently by all processes on the system. It's true that the changes may not actually be stored persistently on disk, but that difference is unobservable unless something happens to make the kernel lose its cached data.

So from the test suite's point of view, unless part of the test involves actually killing VMs and not processes, it should not be possible for the results to depend on whether or not fsync() was called.
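
A quick way to convince yourself of this (a rough, hypothetical sketch; the path and timings are arbitrary):

    # Write without fsync, kill -9 the writer, then read the file back.
    import os
    import signal
    import time

    path = "/tmp/pagecache-demo.txt"

    pid = os.fork()
    if pid == 0:
        # Child: write() returns once the data is in the kernel page cache.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        os.write(fd, b"acknowledged but never fsynced\n")
        # Note: no os.fsync(fd) here.
        time.sleep(60)
        os._exit(0)
    else:
        time.sleep(1)                      # give the child time to write
        os.kill(pid, signal.SIGKILL)       # simulate kill -9
        os.waitpid(pid, 0)
        with open(path, "rb") as f:
            print(f.read())                # the data is still there

Only a machine-level failure (power loss, kernel panic) before writeback would lose it.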


Jepsen is just doing a kill -9 on the java process.

I posted a comment on the blog: https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-...

First I made sure that read() goes through the page cache. (It does, as long as O_DIRECT isn't used.) Then I went and checked the write-ahead log in ES.

It turns out, from my reading, that ES considers a write durable once it has been put into a userspace buffer.

https://github.com/elastic/elasticsearch/blob/master/src/mai...

Data is pushed to kernel space whenever the buffer gets full. Then it is fsync'd on the timer.
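
In other words, the translog write path behaves roughly like the toy sketch below (purely illustrative, not Elasticsearch's actual code; names and sizes are made up): a userspace buffer that is written to the kernel only when full, plus a timer thread that fsyncs. A kill -9 loses whatever is still in the userspace buffer; a power loss additionally loses anything written but not yet fsync'd.

    import os
    import threading
    import time

    class BufferedLog:
        """Toy write-ahead log: userspace buffer, pushed to the OS when full,
        fsync'd on a timer. Illustrative only."""

        def __init__(self, path, buf_size=16384, sync_interval=5.0):
            self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
            self.buf = bytearray()
            self.buf_size = buf_size
            self.lock = threading.Lock()
            threading.Thread(target=self._sync_loop, args=(sync_interval,),
                             daemon=True).start()

        def append(self, record):
            with self.lock:
                self.buf += record
                if len(self.buf) >= self.buf_size:
                    os.write(self.fd, bytes(self.buf))  # reaches the page cache: survives kill -9
                    self.buf.clear()
            # The caller is "acknowledged" here, even though the record may still
            # be sitting only in self.buf, which is lost if the process is killed.

        def _sync_loop(self, interval):
            while True:
                time.sleep(interval)
                with self.lock:
                    if self.buf:
                        os.write(self.fd, bytes(self.buf))
                        self.buf.clear()
                    os.fsync(self.fd)  # now the data also survives power loss / kernel panic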


Nice research.

In case anyone else is wondering why that Github link is broken, the file in question was renamed a few hours ago. Here's a working permalink: https://github.com/elastic/elasticsearch/blob/fafd67e1aef091...


The 'Recap' section has good advice to address the issue of data loss:

My recommendations for Elasticsearch users are unchanged: store your data in a database with better safety guarantees, and continuously upsert every document from that database into Elasticsearch. If your search engine is missing a few documents for a day, it’s not a big deal; they’ll be reinserted on the next run and appear in subsequent searches. Not using Elasticsearch as a system of record also insulates you from having to worry about ES downtime during elections.
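
In practice that continuous upsert can be a small cron job. A rough sketch, assuming psycopg2 and the official Python Elasticsearch client, with a hypothetical documents table and index:

    import psycopg2
    from elasticsearch import Elasticsearch, helpers

    pg = psycopg2.connect("dbname=app")           # the system of record
    es = Elasticsearch()

    def sync_all():
        """Re-upsert every document; anything ES lost gets repaired on the next run."""
        with pg.cursor(name="sync") as cur:       # server-side cursor streams rows
            cur.execute("SELECT id, title, body FROM documents")
            actions = (
                {
                    "_op_type": "index",          # index-by-id acts as an upsert
                    "_index": "documents",
                    "_type": "document",
                    "_id": row[0],
                    "_source": {"title": row[1], "body": row[2]},
                }
                for row in cur
            )
            helpers.bulk(es, actions)

    if __name__ == "__main__":
        sync_all()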


I do something similar, though my data is not as critical as what most of you work on. I have a very high rate of data coming in from users for curation purposes in a bioinformatics lab, so I load the data into a MongoDB cluster and run elasticsearch-river-mongodb [1] to keep Elasticsearch in sync with MongoDB. I don't know whether it solves this problem, but I'm happy that I have the data safe in MongoDB. Any input on my approach would be appreciated.

[1]https://github.com/richardwilly98/elasticsearch-river-mongod...


"Data safe in Mongo" seems... less reassuring than I think you meant it to be:

https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-read...


Watch out, rivers are being deprecated: https://www.elastic.co/blog/deprecating_rivers


> I am happy that I have data safe in mongodb

Oh no.

So you are protecting your data from unsafe Elasticsearch by storing it in an even more unsafe MongoDB[1]? You should probably put a database that knows how to store data somewhere in your stack and learn to use that.

[1]: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-read...


Elasticsearch has its flaws, but are there really any alternatives that do what it does? In my case, I need to do fulltext search across millions of documents -- nearly a TB of text in all, and still growing. Many database engines can do basic fulltext, but what else can scale like Elasticsearch does while offering powerful fulltext features like fuzzy matching and "more like this?"


There is Solr of course.

But I think the recommendations in the post are entirely reasonable:

My recommendations for Elasticsearch users are unchanged: store your data in a database with better safety guarantees, and continuously upsert every document from that database into Elasticsearch. If your search engine is missing a few documents for a day, it’s not a big deal; they’ll be reinserted on the next run and appear in subsequent searches. Not using Elasticsearch as a system of record also insulates you from having to worry about ES downtime during elections.


This makes me feel good because that's the same conclusion I came to with my ES implementation almost 2 years ago.


I think the main point is that Elasticsearch should not be used as a primary store. It seems obvious (or maybe not?), but some people do use it as such. Just like Redis should only be used as a cache.


Exactly... ElasticSearch doesn't market itself as the best fault tolerant DB. If you're using ES as your main data persistence point, you're missing the point of it.


I haven't used it in ~5 years, but Sphinx (http://sphinxsearch.com) might be a reasonable competitor. Back then it was pretty pleasant to use, especially its indexing abilities, which were significantly faster than ES. I can't speak to its distributed ability since our index was never that large, but the documentation purports to have that capability.

Sphinx has some curious features, like being able to run as a MySQL storage engine (no doubt the result of the author being ex-MySQL), but it's also got most of the features that you come to expect in a search engine, so I think it's fair to call it a competitor. That said, we still use ES.


Sphinx does indeed do a very good job and is continuously improved. It is very fast and has a small memory footprint.

We also built https://techusearch.com which uses Sphinx indexes (think of a SaaS version of Elasticsearch backed by Sphinx).


Riak search works well. The indexing is just Solr, but the cluster and data itself is managed differently.


Riak Search 2.0, or Yokozuna, is what you're thinking of.

DISCLAIMER: I work for Basho.


There are a few options, e.g. Microsoft FAST, or you can roll your own. However, Lucene-based systems can leverage a wide variety of add-ons and are battle-tested, so it's hard to recommend against them.

Just remember the golden rule of architecture: always assume every component will catastrophically fail at some point and design accordingly. In the case of Elasticsearch/Solr that means don't use it as a primary data store, back up regularly, and have a mechanism to insert deltas.


Rolling your own is surely the worst possible option - there's very little chance you'll build something better than ElasticSearch on your own, and a very high chance that you'll build something much worse.


In order to use what remains of FAST you would need a SharePoint installation, which is nowhere near free or open source. I think DB + ES/Solr is the best alternative. Possibly taking frequent snapshots from ES could eliminate the need for a DB?


It is commercially licensed, but MarkLogic scales horizontally and is capable of full-text search and structured queries (via XQuery).

https://developer.marklogic.com/learn/2009-07-search-api-wal...

They also have alert queries, which are similar to Elasticsearch's percolation feature.

MarkLogic is also an ACID-compliant store with configurable HA, as well as async replication to a remote site. It could potentially serve, if correctly configured, as the primary store.

---

ArangoDB might have these capabilities as well. And again, it is an ACID-compliant database, so the committed data, the indices, and the resulting searches should respect transactional boundaries at any moment in time. I have not looked at ArangoDB deeply enough to understand what limitations it might have, though. It does not have the similarity search or hierarchical data capabilities that Elasticsearch has: https://docs.arangodb.com/2.3/IndexHandling/Fulltext.html

---

I checked RethinkDB, and it appears to suggest integrating with Elasticsearch, meaning Elasticsearch will hold a copy of the data from Rethink (and therefore be subject to Elasticsearch's commit protocol).

---

Somebody already mentioned Riak with search capability. I think this one is much more similar to MarkLogic's capability (and it is free). The advantage of these types of systems is that your primary store and text indexes sit on top of a single copy of the data (or one replicated invisibly within the infrastructure): http://docs.basho.com/riak/latest/dev/using/search/


ES is probably the easiest to cluster. There's Solr, which is a much more mature technology but wasn't designed with clustering in mind; I think it uses ZooKeeper.

Another Lucene-based tech is RavenDB, but IIRC it's for the Microsoft stack.

What else offers something like ES? Any Lucene-based NoSQL store; ES is Lucene-based anyway. You can also roll your own using the Lucene library.


TL;DR: All previous data-loss scenarios still exist; however, the window during which data can be lost is much smaller.


And all known defects are now documented. http://www.elastic.co/guide/en/elasticsearch/resiliency/curr...


I bet the people who work for these database companies have nightmares about getting bug reports from Kyle.

But seriously, his work is super impressive. Kudos to Stripe for funding this.


Isn't a bug report always a good thing? I mean the real nightmare is your users sending outraged emails about corrupted data and you have no idea what's going on. With the bug report (and especially from Kyle) at least you got something to chew on. As people say, a reported bug is a bug half-fixed.


Waaaay back in the mid-2000s, when search engines were large, complex, expensive pieces of enterprise software not unlike a commercial database, it was always assumed (and the customer told) that the search engine cuts corners for the sake of speed, in both indexing and searching, and that it should never be counted on as a database, or any kind of authoritative data store. It does one thing: search staggering amounts of information quickly at acceptable levels of performance and accuracy.

Looks like expectations have moved beyond that, which is good, but it's not an easy problem to solve, especially with billions or trillions of documents in the index. Leaving out all the important stuff that made a database slow is what made a search engine fast.


Key takeaway for me:

"crash-restart durable storage is a prerequisite for most consensus algorithms, I have a hunch we’ll see stronger guarantees as Elasticsearch moves towards consensus on writes."

It seems like Elasticsearch is inching towards a proper consensus algorithm, at least optionally, which makes me wonder yet again: why not just implement Raft? While I won't speculate, I will point out that the answer for other systems appears to have been related to ego (i.e., distributed systems are easy).


The funny thing is there is a community plugin that fixes almost all of this by using Zookeeper for discovery and master election.

The not so funny thing is Elastic employed the primary developer of said plugin and basically shut it down.


Actually, it doesn't fix the mentioned bug. I'm only posting this here to make sure people won't go and install it thinking it does...

There is a difference between cluster level master election and replication semantics in Elasticsearch (and other similar systems). Even if you use Zookeeper for cluster level master election, one still needs to handle point to point replication (which doesn't go through zk).


I should have been more specific. The ZK plugin fixes the issues related to network split-brains and other nasty partition conditions.

The newly discovered translog issue is also a problem and is not solved by ZK but is also a lot less scary.


I'm pretty pleased with Elasticsearch's progress on durability. The snapshot/restore feature has been pretty nice to work with!

That said, having a "single source of truth" and regularly refreshing Elasticsearch is a huge pain in the butt. I currently maintain a ~500 line syncing script that takes ~15 minutes to run.

Adding a new field in a doc_type means:

- Adding a column in Postgres

- Adding a field in the doc_type mapping (I think being explicit with the field is better practice)

- Adding code to the syncer to update the field.

Ouch.

Also, the syncing script had a steep learning curve. At first, I was upserting everything and the syncing script took forever. To solve this, I made the syncer fetch all the documents in Postgres and all the documents in Elasticsearch, then update only the specific changes (thankfully, the dataset is small enough that it easily fits in memory).
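
The diffing step looks roughly like this (a sketch under the same small-dataset assumption; table, index and field names are hypothetical):

    import psycopg2
    from elasticsearch import Elasticsearch, helpers

    pg = psycopg2.connect("dbname=app")
    es = Elasticsearch()

    def incremental_sync():
        # Source of truth: id -> document
        with pg.cursor() as cur:
            cur.execute("SELECT id, title, body FROM documents")
            source = {str(r[0]): {"title": r[1], "body": r[2]} for r in cur}

        # Current state of the index: id -> document
        indexed = {
            hit["_id"]: hit["_source"]
            for hit in helpers.scan(es, index="documents",
                                    query={"query": {"match_all": {}}})
        }

        actions = []
        # Upsert anything new or changed...
        for doc_id, doc in source.items():
            if indexed.get(doc_id) != doc:
                actions.append({"_op_type": "index", "_index": "documents",
                                "_type": "document", "_id": doc_id, "_source": doc})
        # ...and delete anything that no longer exists in the source.
        for doc_id in indexed.keys() - source.keys():
            actions.append({"_op_type": "delete", "_index": "documents",
                            "_type": "document", "_id": doc_id})

        helpers.bulk(es, actions)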

I'd really love to scrap this portion of my infrastructure...


Just today I successfully finished an experiment: An existing web-app with Postgres as the source of truth pushing update events to a Kafka queue. From there, a screenful of Go forwards those events into Elasticsearch.

It doesn't solve every problem and it might be a lot of new parts if you're not going to use Kafka for anything else. I will be using it for other things like caching and push events. This kind of syncing problem seems to crop up in a lot of places.

A nice thing is that I don't have to care about Elasticsearch durability much, because I can simply rerun the ingester from the beginning of the log, as long as Kafka doesn't lose data.
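
The forwarder itself can be tiny. A rough Python equivalent of that screenful of Go (assuming kafka-python and the Python Elasticsearch client, and a hypothetical topic of JSON update events):

    import json
    from kafka import KafkaConsumer
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    consumer = KafkaConsumer(
        "doc-updates",                               # hypothetical topic name
        bootstrap_servers=["localhost:9092"],
        group_id="es-ingester",
        auto_offset_reset="earliest",                # replay from the start if needed
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for msg in consumer:
        event = msg.value                            # e.g. {"id": "42", "doc": {...}}
        es.index(index="documents", doc_type="document",
                 id=event["id"], body=event["doc"])
        # If Elasticsearch ever loses data, reset the consumer group's offsets
        # and re-run this loop from the beginning of the log.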


I think putting a stateful, persistent, transactional store in the middle of two stateful, persistent, transactional stores is a bad idea.

The idea is to sync secondary store B to continuously be a perfect replica of A, so: A —> B. What a lot of people do, you included, is add a third store as an intermediary: A —> Q —> B. Now you have three complex pieces of software rather than two.

The thing is, you already have the state that Q covers: It's A.

For the record, we made the same mistake. We put RabbitMQ in the middle. Now we had many problems:

* What do we do if the queue loses messages? (Manually reindex from N days ago.)

* How do you know if the queue has lost messages, due to bugs, or network downtime or similar? (Well, you don't really know. If unsure, manual complete reindex.)

* What do you do when the PostgreSQL transaction completed, but it's for some reason unable to reach RabbitMQ to post a queue message? (Manually scan logs, figure out how far back to backfill, reindex from there.) Fixing this "correctly" would require some kind of two-phase commit, which queues don't support.

Lots of manual intervention required to run a system flawlessly. It's not just about running a consistent system; it's about knowing when you're consistent or not, and how to repair.

Also, there are logistical issues:

- What do you do when people run batch jobs producing millions of updates, and users want to update documents (and see their changes) at the same time? You have no recourse but to create traffic lanes — queues and more workers.

- How do you run multiple queue consumers? You have to use version constraints (update only if newVersion > oldVersion), because you will end up processing updates out of order, which only works if the original source has a version field (ES does support "external" version numbers). A queue actually makes these checks matter more often, because you often get multiple adjacent updates for the same objects. Kafka can de-dupe, fortunately, but RabbitMQ can't.

- How do you update multiple target stores (let's say, both ES and InfluxDB)? You have to create separate queues for each of them, so that you get true fanout. Now you have added more "middlemen" on top of the first one, more stuff that can go out of sync.

My conclusion after struggling with this for a while is that the database is the truth, but it's also the only one that's coherently transactional. So one should keep the change log close to the truth, meaning in the database. You can use a PostgreSQL "unlogged" table for performance.

Secondly, you will want to use time-based polling that invokes a simple state machine that can travel back in time. Basically, you run a worker that keeps a "cursor" pointing into the database transaction log. Let it grab big batches every N seconds. The cursor doesn't need to be transactional as long as it's reasonably persistent; if you lose the cursor, your worst case is a full reindex, but if you're merely unable to update the latest cursor, the worst case is just a small amount of unnecessary reindexing. This worker can live side by side with the queue-based processor, and if you shard it, you can run multiple such workers concurrently.
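
A minimal sketch of such a worker, assuming a hypothetical changelog table (seq, doc_id, payload) maintained by triggers in the source database, with the cursor persisted to a plain file:

    import json
    import time
    import psycopg2
    from elasticsearch import Elasticsearch, helpers

    pg = psycopg2.connect("dbname=app")
    pg.autocommit = True                          # read-only polling
    es = Elasticsearch()
    CURSOR_FILE = "/var/lib/es-sync/cursor"

    def load_cursor():
        try:
            with open(CURSOR_FILE) as f:          # "reasonably persistent" is enough
                return int(f.read())
        except (IOError, ValueError):
            return 0                              # lost cursor => worst case, full reindex

    def save_cursor(seq):
        with open(CURSOR_FILE, "w") as f:
            f.write(str(seq))

    def poll_once(cursor, batch_size=1000):
        with pg.cursor() as cur:
            cur.execute("SELECT seq, doc_id, payload FROM changelog "
                        "WHERE seq > %s ORDER BY seq LIMIT %s",
                        (cursor, batch_size))
            rows = cur.fetchall()
        if rows:
            helpers.bulk(es, (
                {"_op_type": "index", "_index": "documents", "_type": "document",
                 "_id": doc_id, "_source": json.loads(payload)}
                for _seq, doc_id, payload in rows))
            cursor = rows[-1][0]
            save_cursor(cursor)                   # losing this update only costs some re-indexing
        return cursor

    if __name__ == "__main__":
        cursor = load_cursor()                    # 0 means "beginning of time": full-reindex mode
        while True:
            cursor = poll_once(cursor)
            time.sleep(5)                         # grab big batches every N seconds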

Everything else comes out of this logic. For example: If ElasticSearch is empty, it can detect this, and set the cursor to the beginning of time. It knows how far back the cursor is, so it knows whether it's in "full reindex" mode or "incremental mode", something it can export as a metric to a dashboard. It can also backfill, by moving the cursor back a little bit. And by using a state machine you can also put it in "incremental repair" mode, where it can use a smart algorithm (Merkle trees were mentioned by someone else recently) to detect holes in the ES index that need to be filled.

Things like building a new index now become trivial, because you just start a worker instance that points to a new index, but reads from the same truth data store; being smart about state, it will start pulling the entire source dataset into the new index. The old worker can continue indexing the old index. Once the new instance is done, you can swap the new index for the old one, then delete the old worker.

Finally, to solve the problem of real-time vs. batch updates: in addition to the above worker, you run a separate worker that listens to a queue. Whenever a non-batch update happens (a "batch" flag needs to be indicated in all APIs and internal processes), push the ID of the affected object onto the queue, but not the object itself; rather, let the worker pull the original from the store. This way, your queue (which requires RAM/disk) stays super lean and fast. Give the worker a small time-based buffer (like 1s) so that it can coalesce multiple updates if they're happening rapidly, and use an efficient query to get multiple objects at the same time. And use versioning to avoid clobbering newer data.
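
A sketch of that real-time worker, using a plain in-process queue as a stand-in for the real message queue, and Elasticsearch's external versioning (table, index and column names are hypothetical):

    import queue
    import time
    import psycopg2
    from elasticsearch import Elasticsearch
    from elasticsearch.exceptions import ConflictError

    pg = psycopg2.connect("dbname=app")
    pg.autocommit = True
    es = Elasticsearch()
    id_queue = queue.Queue()      # stand-in for RabbitMQ/Kafka; producers push only document IDs

    def drain_for(seconds=1.0):
        """Coalesce IDs for a short window so rapid updates to one doc become one index call."""
        ids = set()
        deadline = time.time() + seconds
        while True:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                ids.add(id_queue.get(timeout=remaining))
            except queue.Empty:
                break
        return ids

    while True:
        ids = drain_for(1.0)
        if not ids:
            continue
        with pg.cursor() as cur:                  # one efficient query for the whole batch
            cur.execute("SELECT id, version, title, body FROM documents WHERE id = ANY(%s)",
                        (list(ids),))
            for doc_id, version, title, body in cur:
                try:
                    # External versioning: ES applies the write only if `version` is
                    # higher than what it already holds, so out-of-order updates
                    # can't clobber newer data.
                    es.index(index="documents", doc_type="document", id=doc_id,
                             body={"title": title, "body": body},
                             version=version, version_type="external")
                except ConflictError:
                    pass                          # a newer version is already indexed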

Of course, the system I've outlined is probably not workable for Google or Facebook, but it will scale well and will keep things in sync better than something queue-based.


You make some good points, but there is a big architectural difference between RabbitMQ and Kafka. The solutions you point out work just as well with Kafka as they do with an unlogged PostgreSQL table, but of course there are tradeoffs for each of them. The unlogged table has transaction guarantees, but I'm not sure I want to hit my production database with huge read loads on every reindex of a secondary data store.

I haven't really looked into it, but Bottled Water [0] looks like it can combine the good log properties of Kafka with guaranteed delivery from Postgres.

[0] http://blog.confluent.io/2015/04/23/bottled-water-real-time-...


Kafka is certainly better than RabbitMQ in some respects. (In others, it's disappointing: It's practically useless if you're not running on the JVM, as clients for languages such as Go, Ruby and Node aren't up to date with the "smart" Java client. It's also clearly more low-level and designed for large installations, and less friendly to small ones.)

The problem with storing indexing state outside the database — using a queue, for example — is transactionally protecting the gap between the database and the queue. Bottled Water is cool in that it can actually bridge that gap safely, as I understand it, since PostgreSQL will keep the decoded stream until you've been able to propagate it to Kafka. On the other hand, if you have the stream, do you need Kafka? Can't you just push it directly to ElasticSearch?

For us, this issue — this and standardizing on an elegant cross-language RPC — is probably the main architectural challenge we're facing right now in our microservice development. We have tons of microservices with private data stores that need good search and also internal synchronization between microservices, which is coincidentally the exact same problem space: You have service A with its complex data model, and then a service B that wants to subscribe to updates so that it can correlate its data with that of A. It's a complicated problem that requires a simple solution.

> I am not sure if I want to hit my production database with huge read loads on every reindex of a secondary data store.

Hopefully a full reindex shouldn't happen that often, though. And a full reindex would require a full scan of your production database (not the transaction log) anyway, since you don't want to keep the entire change log around forever (and can't, since the log only starts at the point when you started running this system).


> On the other hand, if you have the stream, do you need Kafka? Can't you just push it directly to ElasticSearch?

I think the separation is something very nice here. We have something like an Apache Storm topology (though we use our own Mesos-based framework here) for every datastore we want to populate. If we want to add a new datastore, we just have to find a library for it and can whip up a new topology. That is much more convenient than having to build support for each datastore into something central like Bottled Water, and it can be tweaked nicely to the specialities of each datastore.

> since you don't want to keep the entire change log around forever (and can't, since the log only starts at the point when you started running this system).

If we initialize a new Kafka topic, we push the relevant data into it once from the production database, and after that Kafka dedupe will keep it from growing too large.


The separation is nice, although I would counter that if your only primary data store is Postgres, and you want to go the logical decoding route, Postgres already has the queue: The decoded transaction log. There's no need for a queue on top of a queue. All you need now is a client that can process the log sequentially and emit each change to the appropriate data store.

Things like Kafka would be more appropriate if you have multiple producers that aren't all Postgres.


Some good points here too.

IMHO it's a huge waste having to always be re-inserting your stuff constantly.

I believe batching with acks is the best option for reliably syncing data between two systems with good performance. I know a few people who work at various companies doing events and IoT, where they can't drop events, and they almost always end up doing batches (or micro-batches, whatever you want to call them). It would be nice if ES offered something along these lines.


There is a Bulk API for ES, which we are using for filling it, or do you mean something else?


Thank you for your detailed thoughts. You obviously have much more practical experience with this kind of system.

Many of the problems you mentioned I am aware of, and also have no workable solution for yet (detecting lost messages being the biggest - Merkle trees sound like a very interesting approach, maybe even applied at the log level?).

As mentioned in another reply, Kafka does support the kind of "pointer-to-log" setup you mention. Also Kafka is designed for lots of consumers, each with different characteristics. In principle, I should be able to sync something like memcache with the same information I need to sync Elasticsearch. The same holds for a websocket-server that reads from this stream and forwards new events to web-app clients. So I don't see the need for more than one "queue" yet, maybe that will show up in practice.

Also your setup would require a lot more coordination to handle updates from multiple postgres instances, if I understood correctly.

That being said, I'm still in the experimental phase with all of this, I will publish a writeup once I gain a bit more experience.


Kafka does indeed have a good design. But it doesn't solve the potential transaction gap between your store and the queue.

For example, if you commit a transaction but you're unable to reach the Kafka queue (because you crash, you're SIGTERMed, or there's heavy load causing a network blip, or any other number of reasons), you'll lose updates. You can't very well write to Kafka before you commit, because it's not visible yet outside the transaction.

The only way is to use a transaction log in the same database, in a way that lets the log be read after the commit is done. Logical streaming would let you do this (Bottled Water [1], as someone else here mentioned, does this with Kafka) in a safe way. It's conceptually identical to storing a transaction log table, but wouldn't require as much custom code, and you'd get incremental updates for free.

[1] http://blog.confluent.io/2015/04/23/bottled-water-real-time-...


Yes, I fully recognize the problem with double-writing. I will definitely try out Bottled Water. I was also thinking about replacing Kafka with a much simpler, lower-throughput system (because we are lightyears from LinkedIn's requirements).

Two reasons why I can't just use Postgres (I'd love to): 1) Kafka (or whatever queue we settle on) will be used for logs and metrics as well, data that doesn't flow through Postgres.

2.) Postgres stores the data-model of my business-domain, at the lowest, normalized level. But derived data-stores are inherently denormalized and I want to be able to use them without talking back to my source-of-truth all the time. So currently I'm passing DTOs to Kafka, just like I would to any API request. This data is not easily available at the postgres-level.

I'm not yet sure on the right abstraction level for events. It seems very natural to have them contain information that I would send to clients directly.


So what's your "source of truth"?

We have an application that might be similar. It receives analytics events from frontends. It uses (currently) RabbitMQ to distribute it to multiple "sinks", including InfluxDB, ElasticSearch and websockets; the main sink is one that stores the events as flat files (one JSON hash per line) in S3. That's what we consider our master data.


For all application-data events I consider Postgres to be the ground truth. That is somewhat unfortunate, because one can't easily place a queue in front of the database. For metrics and logs, the Kafka topic itself (which is persisted similarly to your flat files) would become the master. The use case is pretty similar.

Might it be feasible to have something like postgres work with an external WAL? That would solve the problem I guess, as well as leave us with a single "persistent" system.


This really is a great explanation and the strategy I have taken with several similar types of systems (not ES, but various specialized indexes that are "slaved" to a master db). In fact, it's very similar to the replication model of MySQL.

One thing you didn't mention but should be pointed out is that you have to manage clock skew carefully. If you have a single database instance, you can rely on timestamps inserted by the db server ("select now()") but if your master database is clustered, even this is not trustworthy. And of course same goes for timestamps generated by your application. Clocks are the bane of distributed systems.

My usual strategy (aside from ensuring everything is ntp synced) is to pick a plausible time period that represents an amount of clock skew that the system should never experience and subtract that from the cursor at every poll. It means you'll always reindex the last few seconds (or whatever you pick for N) worth of data, but you won't miss out on updates.
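
In sketch form (assuming an updated_at column maintained by the source database, and that every write to the secondary store is an idempotent upsert):

    import datetime
    import psycopg2

    # A clock-skew bound the system should never exceed in practice.
    MAX_SKEW = datetime.timedelta(seconds=30)

    pg = psycopg2.connect("dbname=app")
    pg.autocommit = True

    def poll(last_seen):
        """Re-reads the last MAX_SKEW worth of rows on every poll; harmless,
        because each row is upserted into the secondary store."""
        since = last_seen - MAX_SKEW
        with pg.cursor() as cur:
            cur.execute("SELECT id, updated_at, title, body FROM documents "
                        "WHERE updated_at >= %s ORDER BY updated_at", (since,))
            for doc_id, updated_at, title, body in cur:
                last_seen = max(last_seen, updated_at)
                # upsert (doc_id -> {"title": title, "body": body}) into ES here
        return last_seen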


Yeah, you should generate an always-increasing sequence number in the source database.

Timestamps cannot be used to "select ... where T > ?", since at the time of polling another record may come in with the same timestamp as the newest one, and this will skip records; "where T >= ?" won't work either unless you can de-duplicate, and that requires local state.


Unfortunately you don't get monotonic sequence numbers in distributed databases.

Timestamps actually work very well for most applications, at the cost of extra reindexing proportional to how closely you can manage clock skew. Keep in mind that you are always subtracting a reasonable max skew value and getting extra records. All operations on the secondary store must be upserts.


A lot of very good experience here, thanks for sharing. Luckily for us (ok, not just luck), we are using ES to index Hadoop (both HDFS and YARN), and we are using MySQL Cluster as our source of truth. For HDFS, we create logging tables that store mutations to inodes, and we have a program that periodically and transactionally pulls mutation batches and applies them to ES. Failures mean ES falls behind, but we have no data loss. For YARN, however, we just want to index recent stats: what containers are being allocated right now, and to which apps. So we use the streaming event API for MySQL Cluster (native C++ API) to push into ES. Failures mean lost data, but we get near-instantaneous updates in ES, which is crucial as it is used as part of an interactive service (monitoring YARN application progress).


Could you use something like the notify[0] and listen commands to help with this?

http://www.postgresql.org/docs/9.0/static/sql-notify.html
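
In psycopg2 it looks roughly like this (channel name hypothetical). One caveat: NOTIFY messages aren't persisted, so anything sent while the listener is down is simply gone, which means you'd still want a polling pass as a safety net:

    import select
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=app")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor()
    cur.execute("LISTEN documents_changed;")  # e.g. a trigger does: NOTIFY documents_changed, '42'

    while True:
        if select.select([conn], [], [], 5) == ([], [], []):
            continue                          # timeout, nothing to do
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            doc_id = notify.payload           # the ID sent with NOTIFY
            # fetch that row from Postgres and upsert it into Elasticsearch here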


Maybe?

It would be interesting to see what a Postgres - Elasticsearch foreign data wrapper would look like...


I keep a few fields in my SQL tables (migrateBatch, migrateStart, migrateEnd). migrateBatch is a UUID field that's set along with migrateStart; from there a regular process pushes to a queue, where the data is then fetched from the DB and updated in ES/Mongo/whatever. My sproc checks for updated records against the migrate/export date, and if I ever want to re-submit something, I just clear those fields in the table.

It's not the absolute fastest way to go, but it's very solid. I can usually have 500k records migrated in well under half an hour; the actual lookups for denormalized data are a bit heavy.
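
For the curious, the pattern can be sketched roughly like this (written here against Postgres with psycopg2; the column names come from the comment above, the table name, updated_at column and queue push are hypothetical):

    import uuid
    import psycopg2

    pg = psycopg2.connect("dbname=app")

    def claim_batch(limit=1000):
        """Stamp a batch of un-exported (or re-updated) rows with a batch UUID and start time."""
        batch = str(uuid.uuid4())
        with pg, pg.cursor() as cur:          # connection as context manager commits the txn
            cur.execute(
                "UPDATE documents SET migrateBatch = %s, migrateStart = now() "
                "WHERE id IN (SELECT id FROM documents "
                "             WHERE migrateBatch IS NULL OR updated_at > migrateEnd "
                "             LIMIT %s)",
                (batch, limit))
        return batch

    def finish_batch(batch):
        """After the rows have been pushed to ES/Mongo/whatever, mark them done."""
        with pg, pg.cursor() as cur:
            cur.execute("UPDATE documents SET migrateEnd = now() WHERE migrateBatch = %s",
                        (batch,))
    # To re-submit something, just clear migrateBatch/migrateStart/migrateEnd on those rows.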


One more thing to add to the Elastic reliability:

> We recommend installing the Java 8 update 20 or later, or Java 7 update 55 or later. Previous versions of Java 7 are known to have bugs that can cause index corruption and data loss. Elasticsearch will refuse to start if a known-bad version of Java is used.

http://www.elastic.co/guide/en/elasticsearch/reference/1.5/s...


Kind of OT, but losing data in these cases is not even what I'm most concerned about with ES: I've had to recreate the ES cluster so many times now that I'm really glad PostgreSQL keeps on running, or just restarts without bricking its data files, even after OOM or out-of-disk errors.




