Hacker News
Storing 50M events per second in Elasticsearch (datadome.co)
142 points by Benfromparis on Oct 22, 2019 | 44 comments



Three years ago, I made a simple calendar app in Django, and I wanted to use Elasticsearch so users could search and find an event, and to populate an upcoming events list. There are only about 10,000 events in the database.

I quickly realized what a pain it is to use Elasticsearch, for a simple app like mine.

Pain points:

1) You have to set up and recreate part of your database in Elasticsearch. So you essentially end up with two databases, which you now have to keep in sync.

2) I was getting unpredictable query results from Elasticsearch, which, after a few days and much head scratching, turned out to be because I was running out of memory.

3) When a user added a new event, it was not added to the Elasticsearch index automatically, and I could not figure out how to do this reliably. It only worked reliably after a resync of the entire Elasticsearch index, but since I only wanted to sync the index once a day, that made it next to useless for the upcoming events list: users were confused about why their event was not showing up. I gave up and just implemented the upcoming events list directly from my database in Python.

4) Elasticsearch came with some security settings not set by default, and after a few months it was hacked. I had to download a new version and wasted more time.

I still use Elasticsearch, but only for search, and not the upcoming event list. And I don't think it was worth the complexity that it added to my project.
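For what it's worth, pain point 3 is usually handled by indexing each event at write time instead of resyncing nightly. A minimal sketch, with a hypothetical stand-in client and helper (the names are illustrative, not the real elasticsearch-py or Django API; in Django you would wire this to the post_save signal):

```python
# Hypothetical sketch: push each event into the search index as it is
# saved, instead of resyncing the whole index once a day.

class InMemorySearchClient:
    """Stand-in for a real Elasticsearch client (illustrative only)."""
    def __init__(self):
        self.docs = {}

    def index(self, index, id, document):
        self.docs[(index, id)] = document


def index_event(client, event):
    """Index a single event right after it is saved.

    In Django this would run from a post_save signal handler, so the
    search index stays in sync without a nightly full resync.
    """
    client.index(index="events", id=event["id"],
                 document={"title": event["title"], "date": event["date"]})


client = InMemorySearchClient()
index_event(client, {"id": 1, "title": "Meetup", "date": "2019-10-22"})
print(client.docs[("events", 1)]["title"])  # Meetup
```

The real failure mode is the save committing while the index call fails, which is exactly the dual-write problem discussed further down in the thread.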


This is a mistake many people make. Elasticsearch is probably overkill for your particular use case.

It’s similar to bringing an F1 car to a go-kart race and then being surprised you aren’t able to finish the race because you don’t have a pit crew able to maintain your vehicle.

I’ve built and owned large Elasticsearch clusters at Fortune 50 companies for providing log search as well as document search. Like anything, administering an ES cluster requires planning, engineering, and process/documentation.

I wouldn’t consider using ES without a dedicated ops team to help administer it, unless perhaps I were using a managed service like the one AWS provides.

It’s a very powerful tool; it was a mistake to think you could just casually throw it into your stack without fully understanding its complexity.


Elasticsearch isn't a database; it's a search engine built on top of Lucene. Although some may use it as a document store, and their own marketing claims it's OK, it never ends up being a good choice: https://www.quora.com/Why-shouldnt-I-use-ElasticSearch-as-my...


I have been put off by Elasticsearch's complexity a number of times. Can I ask why, for searching a limited amount of text, you didn't just use Postgres' full-text search?


I'm not the person you're replying to, but does Postgres nowadays have a straightforward way to do tf-idf or BM25-style information retrieval?


Not OP, but not that I know of.

I commented below: I highly recommend Xapian for small projects to test the waters. It’s the SQLite of search.


Or you could use the FTS extension of SQLite. https://sqlite.org/fts3.html
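To make the comparison concrete, here's a minimal, self-contained sketch of SQLite full-text search via FTS5, assuming your sqlite3 build ships with FTS5 compiled in (most do):

```python
# Full-text search with SQLite's built-in FTS5 extension -- no separate
# search server to keep in sync. Uses only the standard library.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE events USING fts5(title, description)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("Django meetup", "Talks about the ORM and templates"),
     ("Jazz night", "Live music downtown")],
)
# MATCH runs a full-text query; bm25() ranks results by relevance
# (lower scores first, i.e. best matches first).
rows = con.execute(
    "SELECT title FROM events WHERE events MATCH ? ORDER BY bm25(events)",
    ("django",),
).fetchall()
print(rows)  # [('Django meetup',)]
```

For a ~10k-row calendar app this runs in the same process as the application, so pain points 1 and 3 above disappear entirely.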


I was new to Python, Django and Postgres. I was looking up how to do search and stumbled on articles about using Elasticsearch in Django, so I went with it.


Should you find yourself there again I recommend Django Haystack and Xapian.


10k? Hell, any vanilla RDBMS can handle this (including SQLite).


We use ZomboDB https://www.zombodb.com/ to keep ES and PostgreSQL in sync. It has its flaws, but when it works, it works perfectly. That at least helps with 1 and 3, which are indeed major annoyances.


Kafka streams can solve this use case fairly well - though setting up & managing infra may be a bit more than what you'd want to deal with for a hobby project


+1, or use any other log-based replication mechanism (e.g. Logstash). The point is that instead of having two independent systems that can easily go out of sync (if not using distributed transactions) and become permanently inconsistent with each other, you'll now have the database as the primary source of data (commonly referred to as the system of record) and Elasticsearch as a secondary, eventually consistent search index. This approach sacrifices read-your-writes consistency, though for a search index that can be tolerated.


If the database is the primary source of data, how do you get the data from there into the log-based replication method? I assumed the OP meant you'd write to Kafka, and the messages would be processed twice: once to write to the DB, and once to ElasticSearch.

Not wanting to do that for a small project, but wanting a better architecture than I've got, I'm curious about your proposed approach.


Two possibilities: either the app writes both to the database and to Kafka (ideally using an atomic commit), or CDC is set up in Kafka to read the database's transaction log (this is faster).

> you'd write to Kafka, and the messages would be processed twice: once to write to the DB, and once to ElasticSearch

This would be equivalent to using a message queue, which (in contrast to log-ordered replication) does not ensure the same consistency guarantees (in this case, (1) RYW for database writes and (2) the database always being at least as up-to-date as the search index).
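A toy model of why the ordering matters: both replicas apply one shared, ordered log, database first, so the database is never behind the index. Everything here is illustrative Python, not a real Kafka API:

```python
# Toy model of log-ordered replication: one ordered log of writes is
# applied to the database first, then to the search index, so the DB
# is always at least as up-to-date as the index.
log = [("put", 1, "old title"),
       ("put", 1, "new title"),
       ("put", 2, "other event")]

db, search_index = {}, {}
for op, key, value in log:
    db[key] = value            # system of record sees the write first...
    search_index[key] = value  # ...then the eventually consistent index

# Both replicas converge to the same state because they applied the
# same operations in the same order -- unlike two independent consumers
# of an unordered queue, which may interleave updates differently.
print(db == search_index)  # True
print(db[1])               # new title
```

With two independent queue consumers, consumer A could apply the "new title" update before "old title" while consumer B applies them in the opposite order, leaving the two stores permanently disagreeing.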


It requires a lot of pampering, but I quite enjoy using it and discovering its possibilities. I am using it for a startup project I work on, which has a search feature very similar to Instagram's. Do you have any other suggestions for a search engine that is flexible enough? I don't want to couple things too tightly by using ZomboDB and similar tools.


If you really need search, I think it's still the clear winner. I don't think it's terribly hard to manage/operate; you just need to do some initial planning, otherwise it will balloon out of control.

Dynamic mappings can mess things up really easily, so it's best to disable them in favor of a pre-defined static mapping for the type of documents you will be ingesting. What I've encountered in the past that usually causes things to break is when 90% of your documents contain a field called "Date" with an ANSI date value, but the other 10% contain "Null" (a string instead of an ANSI date). Since those documents don't match the dynamically generated mapping, they fail to be indexed.
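A sketch of what such a static mapping could look like (the index and field names are illustrative; `"dynamic": "strict"` and `ignore_malformed` are real Elasticsearch mapping options):

```python
import json

# Static mapping with dynamic mapping disabled, so a stray string like
# "Null" in a date field is dropped (or the document rejected) instead
# of poisoning a dynamically guessed mapping.
mapping = {
    "mappings": {
        "dynamic": "strict",  # reject documents with unmapped fields
        "properties": {
            "Date": {
                "type": "date",
                # skip bad values like "Null" instead of failing the doc
                "ignore_malformed": True,
            },
            "message": {"type": "text"},
        },
    }
}

# This JSON body would be sent when creating the index (PUT /my-index).
print(json.dumps(mapping["mappings"]["dynamic"]))  # "strict"
```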

Shard management is also critical and this largely depends on the type of data you are indexing. If the data in ES is unique (not just a copy of a database you already have), you will want to have some sort of cross-region/DC replication strategy as well as a backup strategy.

Fortunately both of these are pretty easy. ES lets you tag nodes with attributes (regions, data centers, really whatever you want), and shards can be routed based on rules defined over these tags.

A setup I've used in the past is to have 5 nodes in the LAX DC and 5 nodes in the LAS DC; any data ingested into LAX is replicated into shards in LAS, and vice versa.
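A rough sketch of the settings involved, assuming a node attribute named `dc` (the attribute and DC names are illustrative; shard allocation awareness itself is a real ES feature):

```python
# Shard allocation awareness across two data centers, roughly as ES
# supports it: tag each node with an attribute, then tell the cluster
# to spread shard copies across the values of that attribute.

# In each node's elasticsearch.yml:
node_config_lax = {"node.attr.dc": "LAX"}
node_config_las = {"node.attr.dc": "LAS"}

# Cluster-wide settings (sent via PUT _cluster/settings) forcing
# primary and replica copies onto different "dc" values:
cluster_settings = {
    "persistent": {
        "cluster.routing.allocation.awareness.attributes": "dc",
        "cluster.routing.allocation.awareness.force.dc.values": "LAX,LAS",
    }
}

awareness = cluster_settings["persistent"]
print(awareness["cluster.routing.allocation.awareness.attributes"])  # dc
```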

Backup to S3 is rather trivial now thanks to the built-in export options in the newer versions of ES.

With a little bit of planning ES can be a great addition to your stack, just be sure you do the initial engineering so you can avoid a big headache in the future.


Thanks, very insightful.


That's why whenever I need good full-text search I pick https://ravendb.net/ . Close to zero maintenance required, and it's very feature-rich.


Here it is in glorious action, if anyone is curious: https://www.pincalendar.com/search/events/?q=


Summarizing the article:

* DataDome is a security company that sees clients' web traffic in near real-time; a lot of traffic in some cases, with very specific numbers given, like daily peak loads.

* DataDome only retains records for 30 days, and the most attention is given to the most recent traffic, to detect attacks

* an Elasticsearch deployment records all of the traffic, downstream from Apache Flink; a new feature added to ES this year improves the management of ES indexing, and that solved problems DataDome was having. Things are better, so write an engineering blog post!

* re-indexing is done nightly, implemented in a cloud environment that can handle the (heavy) work of rebuilding the indexes.

These numbers are impressive. Earlier criticisms of ES are being addressed, and ES is stable and a cornerstone of the architecture. A company called DataDome is providing real services in near real-time. Congratulations to the team and an interesting read.


I was curious about their numbers

> Storing 50 million of events per second

> A few numbers: our cluster stores [...], 15 trillion of events

> We provide up to 30 days of data retention to our customers. The first seven days of data were stored in the hot layer, and the rest in the warm layer.

15e12 events / 50M events per second is about 3.5 days.

I guess 50M/s is the peak ingest rate.


I don't think writing a clickbaity title like this is fair. You just write 200k large documents per second, period. Good for you, but to be blatantly honest, it's actually not a lot.

I'm not saying you shouldn't have written this post, but rather suggest you be fair to your readers (and yourself). Otherwise you could just make up random titles like "Writing 1 trillion log lines per second" (by storing 1,000,000 1-byte, newline-separated log lines per document).


This part left me scratching my head:

> We have set “replica 0” in our indexes settings

> Now let’s assume that node 3 goes down:

> As expected, all shards from node 3 are moved to node 1 and node 2

No: there are no shards that can be moved, since the number of replicas was set to zero and one node went down. Not sure what they're trying to explain here.

> In order to resolve this issue, we introduced a job which runs each day in order to update the mapping template and create the index for the day of tomorrow, with the right number of shards according to the number of hits our customer received the previous day.

This is a very common use case (e.g. logging), but it's surprising that Elastic has nothing to automate this.


> This is a very common use case (e.g. logging), but it's surprising that Elastic has nothing to automate this.

You can set an index template to be used on new indices that match a pattern, which is a very common thing to do. It sounds like what they did was modify the template daily, which is less common IME. It's not clear why they had to manually create the index, though. That should happen automatically.


> You can set an index template to be used on new indices that match a pattern, which is a very common thing to do

It is, but how can you tell your template that you want to keep shard sizes under 50GB? You can't.

The best thing you can do (as they did) is, based on historical data, update the template, so that the new index will have shards that (hopefully) are under 50GB.
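The core of such a job could be as small as this sketch (the 50GB target and function name are assumptions for illustration, not from the article):

```python
import math

# Sketch of the nightly job's core calculation: choose tomorrow's
# primary shard count from yesterday's observed index volume, so each
# shard stays under a ~50GB target.
TARGET_SHARD_BYTES = 50 * 1024**3

def shards_for(expected_index_bytes):
    """Smallest shard count keeping each shard under the target size."""
    return max(1, math.ceil(expected_index_bytes / TARGET_SHARD_BYTES))

# If the customer's index grew to ~340GB yesterday, tomorrow's index
# template gets 7 primary shards (340 / 50 rounded up).
print(shards_for(340 * 1024**3))  # 7
```

The job would then write this number into the index template (and, per the author's later comment, pre-create the next day's index) before midnight.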


Indices are composed of one or more primary shards, and each primary shard can have zero or more replicas. Here: three nodes, each with one primary shard of that single index, no replicas in play at all.


> Indices are composed of one or more primary shards, and each primary shard can have zero or more replicas. Here: three nodes, each with one primary shard of that single index, no replicas in play at all.

Ok: 3 nodes, each with one primary shard, no replicas. One node goes down, and one shard is no longer found in the cluster, because it was on the missing node. That particular index, and in fact the whole cluster, are now RED.

Unless you discard that shard (force reroute, with accept_data_loss), nothing is going to be recovered and the missing shards will not be allocated anywhere.


Oh, yes, sorry, I see your perspective now. You will get data loss in this example. My understanding was that the example shows how one node can end up with all the write operations; I wasn't under the impression that it was a "real" cluster.


author here:

> replica 0

It's an example for the article; the intention was to remove the complexity of primary/replica shards. Let's say a "shard" is a unit, no matter whether primary or replica. In fact, with replicas set to 1 the behaviour would be the same, but the diagram would have twice as many shards. What we wanted to show is: IF one node goes down and comes back up after some time, AND a rollover occurs just after, THEN this node will host all the new shards and so handle all the writes. That's the spread-write issue.

> introduction of a job which runs each day

This job has several purposes: 1) Because of rollover, our indexes are now suffixed -000001, then -000002, etc. Our applications, regardless of rollover, post and get documents via the alias in front of these suffixed indexes. So with a daily index design, if you don't create the index for the next day in advance, your application will create a new index with the alias name at 00:00:01, and it won't be in rollover "mode". 2) Because we use the ILM feature, we need to define the ILM rollover alias in the template, and it changes each day because our index names are "index_$date". 3) Performance: we have a lot of traffic, and if we don't create the next day's index in advance, we'd expect a lot of unassigned shards, a yellow cluster, etc.

In fact, yes, it's a common use case (daily-based indexes), but maybe not with rollover + ILM.


> replica 0 It's an example for the article and the intention was to remove the complexity of primary/replica shards.

Ah, got it. So maybe it would be best said as "for the following example, ignore any replicas".

> in fact, yes, it's a common use-case (daily based index)

But it is not automated by the Elastic folks. Do you have any intention of open-sourcing a portion of this job?


>> Each day, during peak charge, our Elasticsearch cluster writes more than 200 000 documents per second

What is this 50M in the title?


They state each document has 250 events: 200,000 documents/sec × 250 events/document = 50M events/sec.


"Each document stores 250 events in a separate field."

Curiouser and curiouser.


Good catch! We will add it to the article.


It's misleading for sure, but they're writing 250 'events' per document.


200k documents per second is a lot less impressive, no?


I don't know. Have you tried it? I worked on a Kafka Streams service written in Java that processed "changelog" messages (it involved one query to CosmosDB per message, and logging the result to Kafka for downstream processing by other systems). We had a rather limited number of workers (4 or 8? I don't remember), but getting to 100k messages per second was rather challenging.


200k documents per second really isn't that much for Elasticsearch. The single-instance setup we have in my small company (around 25 people total) has been sent in excess of 40k/s at times, and even then it doesn't slow down noticeably.


and less catchy.


"200k documents per second" would be less catchy, you are right, but it also doesn't reflect the complexity of handling that throughput. Most of the benchmarks you see around, like "1M documents per second", use small documents in a POC environment. In our setup, each of the 250 fields is stored and indexed in ES, making it CPU and I/O intensive, in a production environment.


Reading this reminds me of the pains of running an ElasticSearch cluster. We just moved to Elassandra. No more red status.


Hadn't seen Elassandra before. Thanks for the tip.


I've seen Elasticsearch clusters like this have consistency problems. Turns out an off-by-one error is a real problem in a security setting.



