The Architecture of a Large-Scale Web Search Engine, Circa 2019

wpietri · on Dec 14, 2019

I hadn't heard of it, but apparently Cliqz is a search engine [1] and browser [2] built by a German media company [3].

[1] https://0x65.dev/blog/2019-12-01/the-world-needs-cliqz-the-w...

[3] https://en.wikipedia.org/wiki/Hubert_Burda_Media

prox · on Dec 14, 2019

How does a new engine find webpages at the start? Does it work from the DNS system and index every domain name? At a certain point it will follow links, I presume, but how does it start?

ssubu · on Dec 14, 2019

[Disclaimer: I work at Cliqz] We do not crawl the web in the traditional sense; our search was bootstrapped on query logs. That is the very reason we could succeed in building a search engine with minimal resources compared to our competitors. We have written about this in a lot more detail here:

How we collect data: https://www.0x65.dev/blog/2019-12-03/human-web-collecting-da...

How we build the search using this data: https://www.0x65.dev/blog/2019-12-06/building-a-search-engin...

Feel free to peruse these posts and ask questions!

petra · on Dec 14, 2019

What about tools for power searchers? Google has abandoned us.

Are you planning to create strong tools in that area?

For example: custom search engines, the NEAR operator, limiting search to sites that don't update that often or aren't linked to very strong sites (against SEO), etc.

ThePhysicist · on Dec 14, 2019

Does all of your data really come from the Human Web project, or do you also buy clickstream data from data brokers?

ssubu · on Dec 14, 2019

We speak about this in much more detail in this post (https://0x65.dev/blog/2019-12-05/a-new-search-engine.html), but in short, we initially prototyped our search with data we purchased from data brokers. Since the concept was proven and HumanWeb was deployed (2015/2016), we have relied only on our own data.

leeoniya · on Dec 14, 2019

last time Cliqz came up on here was not in the best of contexts...

https://old.reddit.com/r/firefox/comments/74yo19/cliqz_and_m...

pythux · on Dec 14, 2019

The "last time Cliqz came up" can be found here: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu.... We have been posting multiple articles on our tech blog, explaining what we do and how we do it in great detail. Your link points to a thread that is more than two years old.

There were more recent discussions about Cliqz as late as this month, in particular here: https://news.ycombinator.com/item?id=21676252

[disclaimer: I work at Cliqz]

leeoniya · on Dec 14, 2019

yeah sorry, should have said "last i remember".

> We have been posting multiple articles on our tech blog, explaining what we do and how we do it in great detail.

it's possible to have both great tech and loose morals - the two are not mutually exclusive, and one does not absolve the other (e.g. facebook's social experiments)

has there been a followup to any of the points brought up in the reddit thread?

solso · on Dec 14, 2019

[Disclaimer: I do work at Cliqz]

There is plenty of documentation on the data collected (see the first posts regarding Human Web on the tech blog), how anonymization works, why record-linkability on the data collected is prevented (and forbidden), etc. Furthermore, the source code can be inspected, as well as the traffic, in case the documentation is not enough. I believe that is a better proxy to assess "morality" than random accusations on reddit or opinions formed solely on half-baked press releases.

Do we need to refute all misconceptions and FUD that might arise due to the fact that 1) we collect data to build our services (search) and 2) we are funded by a media company (VCs seem to be more pure for some unknown reason)?

The answer is no. I cannot recall who said that it takes much more effort to refute BS than to generate it. (That does not go for your comment in particular, which is why we replied, but for many of the comments and some of the content of the subreddit that you mention.)

neiman · on Dec 14, 2019

> There is plenty of documentation on the data collected (see the first posts regarding Human Web on the tech blog), how anonymization works, why record-linkability on the data collected is prevented (and forbidden), etc. Furthermore, the source code can be inspected, as well as the traffic, in case the documentation is not enough.

Question is, is it opt-in data collection or do you make the choice for me? If it's opt-in, great. Otherwise, I don't want to read your "plenty of documentation" and so on and so forth.

rewq4321 · on Dec 15, 2019

You can apparently opt-out. On mobile but I think it was in one of their blog posts.

neiman · on Dec 15, 2019

I'd really rather opt in :-)

ksec · on Dec 14, 2019

If I remember correctly, Yahoo open-sourced their current and next-generation search engine, Vespa [1]. Why wasn't that used instead of starting from scratch?

[1] https://vespa.ai

netankit · on Dec 14, 2019

Vespa is a very interesting project, but it came quite late for us (Sept. 2017 [1]).

Work on Cliqz Search started way earlier, around 2013. Our work on Kubernetes and on modernizing our architecture also started around 2016.

[1] https://www.verizonmedia.com/press/open-sourcing-vespa-yahoo...

lowdose · on Dec 15, 2019

Did you guys choose Go over Java?

markpapadakis · on Dec 15, 2019

I like how the blog post title is, likely, based on Page and Brin's seminal Google paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine". :)

This blog post, and others in the series, mention that RocksDB is used for the index, but it's not explicitly described how and to what end. I'd love to know the details.

fessguid · on Dec 17, 2019

Thank you for the great article. I've been writing my own search engine for the last 3 years, and it's funny how similar my setup is to yours, with K8S/Kafka/Streams/Go/RocksDB. Actually, about RocksDB: are you using it from Go (via gorocksdb)? Now and then I have a hard time optimising RocksDB, and I still have only a very loose understanding of how much RAM it will consume.

streetcat1 · on Dec 14, 2019

Just be careful. Kubeflow is heading toward GCP-only features (for example, they are dropping cert-manager), while you are betting on AWS.

yowlingcat · on Dec 14, 2019

Would it be fair to say you're saying that Kubeflow may eventually become incompatible with EKS?

streetcat1 · on Dec 14, 2019

Yes. Or, more likely, there will be a version of Kubeflow for EKS. Alas, Amazon is pushing SageMaker.

Note that Kubeflow R&D is all done by Google Cloud.

bkyan · on Dec 14, 2019

I wonder if they have any plans for an API...?