Ubiq: A Scalable and Fault-Tolerant Log Processing Infrastructure (2016) (ai.google)
84 points by SriniK on June 14, 2018 | 20 comments



Let me take this opportunity to say I'm appalled all research at Google has been rolled into "AI." Ubiq has nothing to do with machine learning, but you can't access it without wading through a swath of Google-branded AI marketing material. The domain research.google.com now redirects to ai.google. If you want to search for any research publications by Google employees, you will be searching on a domain that first and foremost highlights the AI research teams. In a particularly egregious example, all quantum computing research has been filed under "Quantum AI." The "Recent Publications" section exclusively highlights machine learning research despite there being more recent research by other teams. I can go on.

In my opinion, this move threatens the scientific integrity of Google's research. It's clear that researchers in distributed systems, fundamental theory, security and privacy, networking, language development, etc. are second class citizens. I wouldn't be surprised to learn that there's organizational pressure to inject machine learning into as many publications as possible, even if it dilutes the overall diversity of research.


It's not just research. Android improves battery life by 30%; no one bats an eye. But if 1/3 of that improvement is thanks to machine learning, off to Google I/O it goes!

(The 30% and 1/3 aren't the actual numbers, just an illustration of the point.)


> It also guarantees exactly-once semantics for application pipelines to process logs in the form of event bundles.

Well, it's got my attention.


FWIW Kafka 0.11+ supports exactly-once also.
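For reference, Kafka's exactly-once support (introduced in 0.11 via KIP-98) is opt-in, enabled through a handful of configs. A minimal sketch using the config names from Kafka's documentation (the `transactional.id` value here is a made-up example):

```properties
# Producer: broker de-duplicates retried sends
enable.idempotence=true
# Producer: enables transactions for atomic writes across partitions
transactional.id=my-pipeline-txn-1
# Consumer: skip records from aborted transactions
isolation.level=read_committed
```

Note this gives exactly-once within a Kafka-to-Kafka pipeline; sinks outside Kafka still need idempotent or transactional writes on their end.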


Huh... I interviewed with Google about 4-5 years ago, and one of the interviewers spent a bunch of time asking me questions about something just like this, but from a "theoretical" perspective.

Pretty interesting to see this actually happen and see where Google ended up with it.


I wonder if the name is a nod to Ubik by Philip K. Dick.


That's what I thought of too. Though in the book the product is of dubious quality and constantly changing what it promises, so it's hard to believe someone thought it a good idea to associate their product with it.


I think you may have misinterpreted what Ubik is in the book. It's not a product; it seems to be the manifestation of some kind of creative force that negates entropy. The commercials quoted in each chapter are not real commercials in the book.


I hope it comes from Ubiquitous, because I could never quite understand Ubik :)


Published in 2016


Thanks! Updated.


The most scalable, fastest reaction time and simplest log processing tool is awk.


Ok, I'll bite. Most scalable, you say. I have 250TB of Django logs; how do you recommend I use awk to process them to determine the 99th-percentile response time faster than SparkSQL can?


Off-topic, but do you actually run ad-hoc SparkSQL queries on the whole dataset sometimes? Are the logs actually stored as text files on disk? How long does such a query take and/or how many racks of machines do you need for that? Should be more than 1 million drive-seconds just to read the data from disk, right?
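The million-drive-seconds figure is easy to sanity-check. Assuming a generous ~200 MB/s sustained sequential read per drive (an assumption; real mixed workloads are slower):

```shell
# 250 TB = 250,000,000 MB; divide by an assumed 200 MB/s per drive
echo $(( 250000000 / 200 ))  # prints 1250000, i.e. ~1.25M drive-seconds
```

So even with thousands of drives reading in parallel, a full scan is minutes to hours of wall-clock time.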


SparkSQL doesn’t support gz. Are your logs splittable on a file-by-file basis or are they in gz format?

Where are your logs stored? Is that distributed storage? Will SparkSQL not eat all of the bandwidth of its ethernet interfaces?

Yeah; sure.


It's not so much that SparkSQL doesn't support gz as that gz is slow, because you can't parallelize the reads. Regardless: parquet format in HDFS, so YARN can allocate containers local to the chunk to be processed. Scales nicely.


Yeah, but you're making an assumption that I've got these logs in HDFS format, which takes time to set up.


Not really, you said SparkSQL doesn't support gz, which is incorrect and the thrust of my comment. The anecdote about parquet is orthogonal to gz support.

pedantic sidebar: hdfs isn't a file format, it's a distributed file system layered over a traditional on-disk filesystem. For example you might have: json logs, in a gz-formatted file, tracked in the hdfs filesystem, stored on disk in an ext4-formatted filesystem.


> SparkSQL doesn’t support gz

Yes it does. Source: I use Spark SQL routinely. You're right that lots of small gzipped files are not an ideal input source, as it'll create a bunch of small tasks, but Spark definitely does support gz.


You don't. You process the logs right away, piping them to awk and storing the statistics in a file.
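For what it's worth, that approach can be sketched as a one-liner. This assumes, hypothetically, that the response time is the last whitespace-separated field of each log line, and uses the nearest-rank method for the percentile:

```shell
# Extract response times, sort them, then take the nearest-rank p99.
printf '%s\n' 'GET /a 120' 'GET /b 340' 'GET /c 80' 'GET /d 500' \
  | awk '{print $NF}' \
  | sort -n \
  | awk '{a[NR]=$1} END {i=int(NR*0.99); if (i < NR*0.99) i++; print a[i]}'
# prints 500 (the p99 of the four sample values)
```

In practice a streaming setup would keep a running histogram rather than buffering and sorting every value; `sort` over 250TB on one box is exactly the scaling problem the parent is pointing at.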



