
These are their application logs. They need to search them in a comfortable manner. They went with a search engine, Elasticsearch at first and Quickwit after that, because even after restricting the search to a tag and a time window, "grepping" was not a viable option.



This position has always confused me. IME, log search tools (ELK and their SaaS ilk) are always far too restrictive and uncomfortable compared to Hadoop/Spark. I'd much rather have unfettered access to the data and have to wait a couple of seconds for my query to return than be pigeonholed into some horrible DSL built around an indexing scheme. I couldn't care less about my log queries returning in sub-second time; it's just not a requirement. The fact that people index logs is baffling.


If you can limit your search to GBs of logs, I kind of agree with you. It's ok if a log search request takes 2s instead of 100ms, and the "grep" approach is more flexible.

Usually, our users search through more than 1 TB.

Let's imagine you have to search through 10TB (even after time/tag pruning). Distributing the work over 10k cores to answer in 2 seconds is not practical and does not always make economic sense.
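
Back of the envelope, assuming roughly 500 MB/s of brute-force scan throughput per core (an optimistic figure I'm picking for illustration):

    # Cores needed to brute-force scan 10 TB in 2 seconds, assuming
    # ~500 MB/s of scan throughput per core (an optimistic assumption).
    data_bytes = 10 * 10**12
    per_core_bytes_per_s = 500 * 10**6
    target_s = 2
    print(data_bytes / (per_core_bytes_per_s * target_s))  # -> 10000.0 cores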


The question is why someone would need to search through TBs of data.

If you are not Google Cloud and just have your workers ready to stream all the data in parallel across x workers, I would enforce useful limitations, and for broad searches I would add a background system.

Start your query, come back later or get streaming results.
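
A rough sketch of what I mean by a background system, with made-up names (submit a broad search, poll for it later):

    # Rough sketch, names made up: broad searches run in the background and
    # the caller polls with a job id instead of blocking on the request.
    import uuid
    from concurrent.futures import ThreadPoolExecutor

    _pool = ThreadPoolExecutor(max_workers=4)
    _jobs = {}

    def _scan(pattern, paths):
        hits = []
        for path in paths:
            with open(path, errors="replace") as f:
                hits.extend(line for line in f if pattern in line)
        return hits

    def submit_search(pattern, paths):
        job_id = str(uuid.uuid4())
        _jobs[job_id] = _pool.submit(_scan, pattern, paths)
        return job_id

    def poll(job_id):
        fut = _jobs[job_id]
        return fut.result() if fut.done() else None  # None = still running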

On the other hand, if not too many people are constantly searching in parallel and you go with storage pods like Backblaze's, just add a little more CPU and memory and use the CPUs of the data pods for parallelization. That should still be much cheaper than putting it on S3 / the cloud.


I guess I was a little too prescriptive with "a couple seconds". What I really meant was that a timescale of seconds to minutes is fine; five minutes is probably too long.

> Let's imagine you have to search through 10TB (even after time/tag pruning).

I'd love to know more about this. How frequently do users need to scan 10TB of data? Assuming it's all on one machine, on a disk that supports a conservative 250MB/s sequential throughput (and your grep can also run at 250MB/s), that's about 11 hours, so you could get it down to about 4 minutes on a cluster with 150 disks.
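
(Quick check of that arithmetic, taking 1 TB as 10^12 bytes:)

    # 10 TB at 250 MB/s sequential scan throughput.
    seconds = 10 * 10**12 / (250 * 10**6)  # 40,000 s
    print(seconds / 3600)                  # ~11.1 hours on one disk
    print(seconds / 150 / 60)              # ~4.4 minutes across 150 disks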

But I still have trouble believing they actually need to scan 10TB each time. I guess a real world example would help.

EDIT: To be clear, I really like Quickwit, and what they've done here is technically impressive! I don't mean to disparage this effort on its technical merits; I just have trouble understanding where the impulse to index everything comes from when applied specifically to the problem of logging and log analysis. It seems like a poor fit.


It sounds like you are doing ETL on your logs. Most people want to search them when something goes wrong, which means indexing.


No, what I'm doing is analysis on logs. That could be as simple as "find me the first N occurrences of this pattern" (which you might call search) but includes things like "compute the distribution of request latencies for requests affected by a certain bug" or "find all the tenants impacted by a certain bug, whose signature may be complex and span multiple services across a long timescale".

Good luck doing that in a timely manner with Kibana. Indexed search is completely useless in this case, and it solves a problem (retrieval latency) I don't (and, I claim, you don't) have.
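
For instance, the latency-distribution example above is just a scan plus an aggregation over the raw files; something along these lines (paths and field names like "bug_tag" and "latency_ms" are made up):

    # Rough sketch over raw newline-delimited JSON logs; in practice this
    # would be the same logic as a Spark/MapReduce job, just distributed.
    import glob, gzip, json, statistics

    latencies = []
    for path in glob.glob("logs/2024-06-*/*.json.gz"):
        with gzip.open(path, "rt") as f:
            for line in f:
                rec = json.loads(line)
                if rec.get("bug_tag") == "BUG-1234":
                    latencies.append(rec["latency_ms"])

    # 95th percentile of request latency for affected requests.
    print(statistics.quantiles(latencies, n=100)[94])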

EDIT: another way to look at this: at the companies I've worked at where I was able to actually do detailed analysis on the logs (they were stored sensibly, such that I could run MapReduce jobs over them), I never reached a point where a problem was unsolvable. These days, when we're often stuck with a restrictive "log search solution as a service", I often run into situations where the answer simply isn't obtainable. Which situation is better for customers? I guess cynically you could say that being unable to get to the bottom of an issue keeps me timeboxed and focused on feature development instead of fixing bugs... I don't think anyone but the most craven get-rich-quick money grubber would actually believe that's better, though.


Would be curious what they are searching exactly.

At this size and cost, aligning what you log should save a lot of money.


The data is just Binance's application logs for observability. Typically what a smaller business would simply send to Datadog.

This log search infra is handled by two engineers who do that for the entire company.

They have a standardized log format that all teams are required to follow, but they have little control over how much data is logged by each service.

(I'm Quickwit's CTO, by the way.)


Do they understand the difference between logs and metrics?

Feels like they just log everything instead of keeping a separation between logs and metrics.


Financial institutions have to log a lot just to comply with regulations, including every user activity and every money flow. On an exchange that does billions of operations per second, often with bots, that's a lot.


Yes, but audit requirements don't mean you need to be able to search everything very fast.

Binance might not have a constant 24/7 load; there might be plenty of time to compact and write audit data away during lower-load periods while leveraging existing infrastructure.

Or extracting audit logs into a binary format like Protobuf and writing them out in a highly optimized way.
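
Just to illustrate the idea (using a plain struct-packed record instead of Protobuf, with made-up fields): a fixed binary layout is cheap to write and trivial to scan later.

    # Toy illustration, made-up fields: pack each audit record into a fixed
    # binary layout (uint64 timestamp ns, uint64 account id, int64 cents).
    import struct, time

    RECORD = struct.Struct("<QQq")

    with open("audit.bin", "ab") as out:
        out.write(RECORD.pack(time.time_ns(), 42, -1250))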


    > Financial institutions have to log a lot just to comply with regulations
Where is Binance regulated? Wiki says: "currently has no official company headquarters".

    > On an exchange that does billions of operations per second
Does Binance have this problem?


Binance's boss just came out of prison, so he is not above the law.

And yes, they likely have this problem, because crypto is less regulated and full of geeks, so they have way more automation than traditional finance, at least in proportion to its size. You have bots on top of bots (literally, like Telegram bots that send orders to other bots copy-trading other bots).

In fact, there is even an old altcoin dedicated to automation: Kryll (https://www.kryll.io/). They have a full no-code UI for creating trading bots with complex strategies that is pretty well done, from a purely technical perspective. They plug into many exchanges, including Binance.

Also, because it's less regulated, copy trading/referrals/staking is the Wild West, and they generate a lot of operations and fees.




