Ubiq: A Scalable and Fault-Tolerant Log Processing Infrastructure (2016) (ai.google)
84 points by SriniK on June 14, 2018 | 20 comments



Let me take this opportunity to say I'm appalled all research at Google has been rolled into "AI." Ubiq has nothing to do with machine learning, but you can't access it without wading through a swath of Google-branded AI marketing material. The domain research.google.com now redirects to ai.google. If you want to search for any research publications by Google employees, you will be searching on a domain that first and foremost highlights the AI research teams. In a particularly egregious example, all quantum computing research has been filed under "Quantum AI." The "Recent Publications" section exclusively highlights machine learning research despite there being more recent research by other teams. I can go on.

In my opinion, this move threatens the scientific integrity of Google's research. It's clear that researchers in distributed systems, fundamental theory, security and privacy, networking, language development, etc. are second class citizens. I wouldn't be surprised to learn that there's organizational pressure to inject machine learning into as many publications as possible, even if it dilutes the overall diversity of research.


It's not just research. Android improves battery life by 30%; no one bats an eye. But if 1/3 of that improvement is thanks to machine learning, off to Google I/O it goes!

(The 30% and 1/3 aren't the actual numbers, just an illustration of the point.)


> It also guarantees exactly-once semantics for application pipelines to process logs in the form of event bundles.

Well, it's got my attention.


FWIW Kafka 0.11+ supports exactly-once also.
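For reference, Kafka's exactly-once support (introduced in 0.11 via KIP-98) is opt-in, enabled through a handful of configs. A minimal sketch using the config names from Kafka's documentation (the `transactional.id` value here is a made-up example):

```properties
# Producer: broker de-duplicates retried sends
enable.idempotence=true
# Producer: enables transactions for atomic writes across partitions
transactional.id=my-pipeline-txn-1
# Consumer: skip records from aborted transactions
isolation.level=read_committed
```

Note this gives exactly-once within a Kafka-to-Kafka pipeline; sinks outside Kafka still need idempotent or transactional writes on their end.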


Huh... I interviewed with Google about 4-5 years ago, and one of the interviewers spent a bunch of time asking me questions about something just like this, but from a "theoretical" perspective.

Pretty interesting to see this actually happen and see where Google ended up with it.


I wonder if the name is a nod to Ubik by Philip K. Dick.


That's what I thought of too. Though in the book the product is of dubious quality and constantly changing what it promises, so it's hard to believe someone thought it a good idea to associate their product with it.


I think you may have misinterpreted what Ubik is in the book. It's not a product; it seems to be the manifestation of some kind of creative force that negates entropy. The commercials quoted in each chapter are not real commercials in the book.


I hope it comes from Ubiquitous, because I could never quite understand Ubik :)


Published in 2016


Thanks! Updated.


The most scalable, fastest reaction time and simplest log processing tool is awk.


Ok, I'll bite. Most scalable, you say. I have 250TB of Django logs; how do you recommend I use awk to process them to determine the 99th-percentile response time faster than SparkSQL can?


Off-topic, but do you actually run ad-hoc SparkSQL queries on the whole dataset sometimes? Are the logs actually stored as text files on disk? How long does such a query take and/or how many racks of machines do you need for that? Should be more than 1 million drive-seconds just to read the data from disk, right?
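The million-drive-seconds figure is easy to sanity-check. Assuming a generous ~200 MB/s sustained sequential read per drive (an assumption; real mixed workloads are slower):

```shell
# 250 TB = 250,000,000 MB; divide by an assumed 200 MB/s per drive
echo $(( 250000000 / 200 ))  # prints 1250000, i.e. ~1.25M drive-seconds
```

So even with thousands of drives reading in parallel, a full scan is minutes to hours of wall-clock time.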


SparkSQL doesn’t support gz. Are your logs splittable on a file-by-file basis or are they in gz format?

Where are your logs stored? Is that distributed storage? Will SparkSQL not eat all of the bandwidth of its ethernet interfaces?

Yeah; sure.


It's not so much that SparkSQL doesn't support gz as that gz is slow, because you can't parallelize the reads. Regardless: parquet format in HDFS, so YARN can allocate containers local to the chunk to be processed. Scales nicely.


Yeah, but you're making an assumption that I've got these logs in HDFS format, which takes time to set up.


Not really, you said SparkSQL doesn't support gz, which is incorrect and the thrust of my comment. The anecdote about parquet is orthogonal to gz support.

pedantic sidebar: hdfs isn't a file format, it's a distributed file system layered over a traditional on-disk filesystem. For example you might have: json logs, in a gz-formatted file, tracked in the hdfs filesystem, stored on disk in an ext4-formatted filesystem.


> SparkSQL doesn’t support gz

Yes it does. Source: I use Spark SQL routinely. You're right that lots of small gzipped files are not an ideal input source, as it'll create a bunch of small tasks, but Spark definitely does support gz.


You don't. You process the logs right away, piping them to awk and storing the statistics in a file.
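For what it's worth, that approach can be sketched as a one-liner. This assumes, hypothetically, that the response time is the last whitespace-separated field of each log line, and uses the nearest-rank method for the percentile:

```shell
# Extract response times, sort them, then take the nearest-rank p99.
printf '%s\n' 'GET /a 120' 'GET /b 340' 'GET /c 80' 'GET /d 500' \
  | awk '{print $NF}' \
  | sort -n \
  | awk '{a[NR]=$1} END {i=int(NR*0.99); if (i < NR*0.99) i++; print a[i]}'
# prints 500 (the p99 of the four sample values)
```

In practice a streaming setup would keep a running histogram rather than buffering and sorting every value; `sort` over 250TB on one box is exactly the scaling problem the parent is pointing at.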



