To Break Google’s Monopoly on Search, Make Its Index Public (bloomberg.com)
859 points by JumpCrisscross on July 15, 2019 | hide | past | favorite | 597 comments



Ex-Google-Search engineer here, having also done some projects since leaving that involve data-mining publicly-available web documents.

This proposal won't do very much. Indexing is the (relatively) easy part of building a search engine. CommonCrawl already indexes the top 3B+ pages on the web and makes it freely available on AWS. It costs about $50 to grep over it, $800 or so to run a moderately complex Hadoop job.

(For comparison, when I was at Google nearly all research & new features were done on the top 4B pages, and the remaining 150B+ pages were only consulted if no results in the top 4B turned up. Running a MapReduce over that corpus was actually a little harder than running a Hadoop job over CommonCrawl, because there's less documentation available.)
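
The "grep over it" workflow is just streaming decompression plus pattern matching. A stdlib-only toy version below builds a fake local shard and scans it; the file name and contents are invented stand-ins for the real gzipped WARC files on S3:

```python
import gzip
import os
import re
import tempfile

def grep_shards(paths, pattern):
    """Stream each gzipped shard line by line and yield matching lines.
    Toy stand-in for grepping CommonCrawl's gzipped WARC files."""
    rx = re.compile(pattern)
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if rx.search(line):
                    yield path, line.rstrip("\n")

# Build a tiny fake shard to demonstrate (real shards live on S3).
tmpdir = tempfile.mkdtemp()
shard = os.path.join(tmpdir, "shard-00000.warc.gz")
with gzip.open(shard, "wt", encoding="utf-8") as f:
    f.write("WARC-Target-URI: https://example.com/page\n")
    f.write("totally unrelated line\n")

hits = list(grep_shards([shard], r"example\.com"))
print(hits[0][1])  # -> WARC-Target-URI: https://example.com/page
```

The $50 figure comes from parallelizing exactly this loop across machines close to the data.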

The comments here that PageRank is Google's secret sauce also aren't really true - Google hasn't used PageRank since 2006. The ones about the search & clickthrough data being important are closer, but I suspect that if you made those public you still wouldn't have an effective Google competitor.

The real reason Google's still on top is that consumer habits are hard to change, and once people have 20 years of practice solving a problem one way, most of them are not going to switch unless the alternative isn't just better, it's way, way better. Same reason I still buy Quilted Northern toilet paper despite knowing that it supports the Koch brothers and their abhorrent political views, or drink Coca-Cola despite knowing how unhealthy it is.

If you really want to open the search-engine space to competition, you'd have to break Google up and then forbid any of the baby-Googles from using the Google brand or google.com domain name. (Needless to say, you'd also need to get rid of Chrome & Toolbar integration.) Same with all the other monopolies that plague the American business landscape. Once you get to a certain age, the majority of the business value is in the brand, and so the only way to keep the monopoly from dominating its industry again is to take away the brand and distribute the productive capacity to successor companies on relatively even footing.


I think it is possible to make a way, way better search engine, because Google Search is no longer as good as it used to be, at least for me.

I can no longer find anything of remotely good quality; I discover new and quality stuff from social media like Twitter and HN instead.

The search results seem too general and too mainstream. Nothing new to discover, just a shortcut to a few websites: Reddit and StackOverflow for more techie things, Wikipedia and a few mainstream news websites for the rest.

I usually end up searching HN, Reddit, or StackOverflow directly, as the result quality is better because I can easily get specific. Getting specific is harder on Google because it quite often omits or misinterprets my search query keywords.


The reason for that is that Google is building for a mainstream audience, because the mainstream (by definition) is much bigger than any niche. They increase aggregate happiness (though not your specific happiness) a lot more by doing so.

It's probably possible to build a search engine for a specific vertical that's better than Google. However, you face a few really big problems that make this not worthwhile:

1) Speaking from experience, it's very difficult to define what "better" means when you don't have exemplars of what queries are likely and what the results should be. The reason search engines are a product is that they let us find things we didn't know existed before; if we don't know they exist, how can we tweak the search engine to return them?

2) People go to a search engine because it has the answers for their question, no matter what their question is. If you had a specific search engine for games, and another for celebrities, and another for flights, and another for hotels, and another for books, and another for power tools, and another for current events, and another for technical documentation, and another for punditry, and another for history, and another to settle arguments on the Internet, then pretty soon you'd need a search engine to find the appropriate search engine. We call this "Google", and as a consumer, it's really convenient if they just give us the answer directly rather than directing us to another search engine where we need to refine our query again.

3) Google makes basically 80% of their revenue from searches for commercial products or services (insurance, lawyers, therapists, SaaS, flowers, etc.) The remainder is split between AdSense, Cloud, Android, Google Play, GFiber, YouTube, DoubleClick, etc. (may be a bit higher now). Many queries don't even run ads at all - when was the last time you saw an ad on a technical programming query, or a navigational query like [facebook login]? All of these are cross-subsidized by the commercial queries, because there's a benefit to Google from it being the one place you go to look for answers. If you build a niche site just to give good answers to programming queries or celebrity searches or current events, there's no business model there.


> It's probably possible to build a search engine for a specific vertical that's better than Google.

Funny, I don't disagree with this, but my perception has been that Google seems to detect when I've switched roles from one type of programmer to another. I don't know if that's organic from the topics I'm looking up or not, but if I'm looking up a generic string search, it seems to return whatever language I've been searching for recently. (very recently in fact)

My point is, it seems like the search engine intuitively understands my "vertical" already. Maybe it's just because developer searches are probably pretty optimized.


I think it's totally possible; two examples already:

Google Ads (used to?) lets you target by "behaviour" vs "in-market". They can tell the difference between someone who is passionate about beds, maybe involved in the bed business (behaviour), and the people who are making the once-in-a-decade purchase of a bed (in-market).

Google can tell devices apart on the same Google account and keep separate search threads. I might be programming on my desktop making engineering searches while at the same time I'm googling memes on my phone, both logged into the same account.


> Speaking from experience, it's very difficult to define what "better" means when you don't have exemplars of what queries are likely and what the results should be.

Better is a search engine that takes your queries more literally. This is what everybody means when they say Google used to be better: the query keywords and no second-guessing.

When you insist on Google using verbatim mode or something like it, you often get no results. Which is bullshit, because I remember that 10 years ago queries like these had me plowing through the results, so much that you actually had to refine the query. You can't do that in Google any more; at least it's not refining, it's more like re-wording and re-rolling the dice. It all feels very random, and you don't get a feel for what's out there.

I mean sure there is a place for a search engine like this, if it works well. And in its own way, Google works well.

I sometimes do want my query to be loosely interpreted like I'm an idiot, and I head straight for the Google. Ever since I saw the "that guy wot gone painted them melty clocks"-meme, for certain types of queries I have indeed found that if I formulate my question like I got brain damage, I get superior results. Because that is the kind of audience Google wants you to be.

But sometimes you don't feel like the lowest common denominator and you don't want to be treated as such. And there should be a place for that, too.


Very interesting perspective. I completely understand your point. It used to be a tool; now it is more like a system with a mind of its own. I might need both.


Why do you say there is no business model in a search niche? StackOverflow and plenty of listing sites (Tripadvisor, Yelp, Zillow, Capterra to name a few) have been successfully built on this exact premise, and the user experience of searching for restaurants, real estate, or software on these sites is usually much better than searching directly on Google, due to the availability of custom filters and the amount of domain-specific metadata that the global search engines cannot read. While it's true that most of these sites heavily rely on SEO to drive inbound traffic from the big G, there is no doubt that they are perfectly viable businesses.


StackOverflow and those other sites aren't search engines. They may have search engines in them but not many people use them (the only time I reach StackOverflow, booking.com etc is via search engine referral). They're user content hosting and curation sites.


Technically you are correct, in the sense that they do not crawl the web like Google or Bing do. But from a user perspective, they do provide a very useful service of aggregation, discovery and comparison of structured data that is way more effective than using Google search queries, if you know the type of information you are looking for.


It's the corpus that matters, mostly. The StackExchange sites are Q&A formatted, and with a semantic knowledge graph (SKG, such as in Solr) you can do topic extraction on questions OR answers, which then leads to being able to match other answers (with links) to other questions, among other things. With related topics, many other things come to life.


Sure, they have a business reason to do exactly what they do, but I think as people grow up they specialize, and the general stuff that fits everybody becomes useless. Google tries to personalize search results, but so far that has yielded echo chambers, not personalized discoveries.

I can't get better products by searching Google; I can only get the best-spammed or most-promoted products.

The fact that I am getting low-quality service while Google is printing money means that there is a place for a good service, and if that service cannot emerge due to Google's practices, it probably means the regulators need to take action.

Or maybe the search is dead, long live social media.

The gist is, I am not happy with a service, but the company that makes that product makes a lot of money. I can't tell if I am an anomaly or if other people feel the same way, because Google is a monopoly; maybe the regulators should make it possible to compete with Google and see if there's space for a better service.

Yes yes, I am the product but I am the product only if I am happy with the stuff I'm getting in return.


... in return for you being the product? Haha. I don't think Google sees their end of that "transaction" being an actual transaction. You're an individual, and Google doesn't deal with those.


How would google's practices stop me from creating a search engine?

Keep in mind that when Google started, Yahoo! was the big player, and Google overtook them by simply being better.


Everything turns into an echo chamber eventually.


> navigational query like [facebook login]

Definitely have seen malicious ads for "facebook login", though that was probably 2016 or 2017.


I see comments like this all the time. Am I alone in finding that search results, for me, have gotten significantly _better_ over the past couple of years?

I can't help but think it's partially due to people using tools _specifically designed_ to make Google's job harder (FF SandBoxes, uBlock, etc) and not understanding the implications of using them... and then blaming Google for returning "bad" results.


I get a lot more seo spam than I used to, but the results are still quite good. I think we should give google some credit for that at least.

Like, a lot more seo spam though.


People have gotten really good (i.e. it's their full-time job) at "gaming" Google. That's not to say Google is infallible - every search engine is game-able depending on its algorithm - it's just that these people are _very_ clever.


They don't even need to be clever so much as persistent, because of the selection effects.


> specifically designed to make Google's job harder

"Better search" doesn't necessarily mean "more personalized search."


Unless you're a very average person, I'd argue it does.


Google has metrics on how much better personalisation makes search - at least when I was there, it made quality a lot better, but not, say, double the quality. I think in the early days of the company they thought personalisation would be a much bigger win than it turned out to be - it was big enough that it didn't make sense to turn it off or anything like that, and you can see it in action when people say their results are customised to the programming language they are most recently using. But most of the time it's not doing all that much - and the biggest component of it was basic stuff like location and language.


> Getting specific is harder on Google because it just omits or misinterprets my search query keywords quite often.

I have this problem too. Google often thinks that I made a typo and presents me with results for things I didn't search for or care about, and I have no way to force it to search for what I really want.


This is exactly when I switch to brain damage mode querying. You like fixing typos Google? Have some typos. You like figuring out what I really mean Google? Here, I'll formulate my query like a deranged toddler on PCP, best of luck!

Maybe it just feels more successful because it lowers my expectations. But at least you get to mash the keyboard like a maniac, do no corrections, press return and watch it just work.

It's kind of like watching Google do a "customer is always right" squirm.


If there were viable alternatives, people would shift over time.

If I type in “<name> Pentagon” on Google, the first link is LinkedIn. DuckDuckGo doesn’t even list it at all. There are countless examples where DuckDuckGo just can’t find basic information. DDG is just unreliable, beyond its silly name.


I'm always confused by this. I have ddg as the default on my home computer and Google is the default on my work. So I'm constantly using both. There aren't really any apparent differences to me in results. I'm not sure what everyone else is searching, but I search everything from how to spell a word that I should definitely know all the way to niche topics in physics.

Maybe it's because I don't have tracking enabled in Google (I'm not logged into my account when at work) and opt out of tracking where I can. Maybe this is the difference between the lack of difference I see and the huge difference so many others see. But I still don't see it as an issue because I generally find what I'm looking for with one search. Might be the third item, but that's not an issue to me.

I hear this so often that I assume something has to be different. I'm curious if others have ideas as to what it might be, or if I correctly identified them.


I use DDG as my default everywhere, and when I don't find something, I'll !g it as a bit of a last resort.

I'd estimate I'm doing that maybe 5% of the time. It seems to be even odds that I find a satisfying match, though obviously those are all the hard queries.

The hardest queries are trying to dig up details about stories in the news.


I try to use and like DDG, but the results just aren't as good. For example, it seems to be completely unaware of Docker Hub. Like, pages from that entire subdomain never show up. I can search "Docker hub" and it doesn't even show up.


For that specifically, use !dhub or !dockerhub to search the site directly. Really, the magic of DDG is bang queries.

(Search for bang queries with, not surprisingly, "!bang".)


usually I just do !g and that solves the problem ;-)

But also, thank you. I didn't realise there were so many bangs.


I agree, unfortunately the search is really really sub-par and like others said, frequently doesn’t find basic things no matter how specific the keywords I use are.

I feel it might even have been better at one stage?


Unless you're searching in Russian, DDG is mostly a skin for Bing search results anyways. The major players in the search engine space are Google, Bing/MSN, Yandex, and Baidu - with the latter two being mostly language-specific.


I find DDG has pretty acceptable or even good results most of the time.

The real power is in the "bangs", though; you can use the `!` to immediately jump to the first search result without seeing a search page, or use `!g` to switch to Google for this particular query, among others. It enables a sort of power-user usage that one wouldn't get with Google.
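
A bang is just a prefix lookup that rewrites the query into another engine's URL. A toy dispatcher below, with a few made-up table entries (DDG's real table has thousands):

```python
from urllib.parse import quote_plus

# A few example bangs; DDG's real table has thousands of entries.
BANGS = {
    "g":   "https://www.google.com/search?q={}",
    "w":   "https://en.wikipedia.org/wiki/Special:Search?search={}",
    "mdn": "https://developer.mozilla.org/en-US/search?q={}",
}

def dispatch(query, default="https://duckduckgo.com/?q={}"):
    """If the query starts with a known !bang, route it to that engine;
    otherwise fall back to the default search URL."""
    if query.startswith("!"):
        bang, _, rest = query[1:].partition(" ")
        if bang in BANGS and rest:
            return BANGS[bang].format(quote_plus(rest))
    return default.format(quote_plus(query))

print(dispatch("!w duck"))  # -> Wikipedia search URL for "duck"
```

The `!` with no engine name ("feeling lucky") works the same way, except the rewrite target is the first organic result instead of a fixed URL.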


I don’t really get the logic; just use a good search engine in the first place?


I'm saying that DDG can be "good enough", and that not having to click around on a results page can save you time if you know what you're doing.

I understand that for some people that's not enough of a time savings to make a difference, but I know DDG well enough to be able to `!` things and almost always immediately get to a successful result. I treat it as an extension of my brain at this point.


The logic is when you've made DDG your default search for the address bar. Then it becomes the zero-stop jump-off point for all the other search engines they have !bang syntax for (of which there are thousands, I think).

I used to configure those as search keywords in Firefox (and before, Opera), which do roughly the same without the exclamation point. But on a new browser, even just configuring your favourite top 5 searches is a lot of hassle compared to just setting DDG as the default search and using their bangs.


It's for when the good search engine is the site's own page.

If I'm working on python and numpy and I want to look up `argsort`, I know I want to search the numpy page, so !numpy argsort takes me right there.

Any kind of web dev is !mdn whatever and I don't have to scroll through a dozen BS tutorials, I just get the specs.


The !bang feature I use the most is !w for wikipedia, however I don't use wikipedia enough to justify making it my default search engine on the nav bar.


Your browser can assign keywords to custom search engines so you could just type "wiki blah" to see Wikipedia or "jira 123" to load a specific ticket.


What does a viable alternative look like?

I've been using Bing for the past few months; it's not great or terrible but is it "viable" enough for people to shift to over time? Or is it not viable because it's backed by a major corporation?

I'm sure there are search quirks with each engine but I've seen issues with Google too and yet it's the "devil we know" ... so people unconsciously work around them.


I've used Bing for years now. The only time I go back to Google is if I'm searching for something super specific (normally programming related). Bing takes care of most of my search needs.


I wonder if this is due to Google possibly ignoring the robots.txt while Bing (which powers DDG) honors LinkedIn's request? https://www.linkedin.com/robots.txt


I've been using DDG almost exclusively and find its results to be better than Google's, with the exception of local businesses & maps. Google still has an advantage there.


Neither DDG nor Google returns any LinkedIn results for me unless I also add LinkedIn to the search, in which case I get the same results from both search engines.

Google knows what you want before you even ask. You might find that convenient, I find it unsettling.

I guess it’s not as bad as Facebook; at least Google doesn’t spoon feed you.


This ^ times a 1000.

Google simply has the best search product. They invest in it like crazy.

I’ve tried Bing multiple times. It’s slow, and it spams MSN ads in your face on the homepage. Microsoft just doesn’t get the value of a clean UX.

DuckDuckGo results were pretty irrelevant the last time I tried them. Nothing comes close to Google's usability. To make the switch, an alternative has to be much, much better than Google. Chances are that if something is, Google will buy them.


One thing to keep in mind when comparing DuckDuckGo to Google is that people do not use Google with an alternative backup in mind. When you DDG something and it fails, you can always switch to google.

But what about when Google fails? Unlike DDG, there is no culture of switching between search engines when googling. Typically, you'll just rewrite the query for google. And as rewriting the query is an entrenched part of googling, you are less likely to notice this as a failure. It is this training that's the core advantage nostrademons points out.


This right here is why I don't understand people who complain about DDG's search results. If you simply make the commitment to not use Google, for whatever reason that may be, then using DDG becomes exactly the same process of rewriting search queries until you get the thing you're looking for.

I've been using DDG exclusively since I was a contractor at Google years ago and have never had a problem finding things with it...


I don't necessarily agree. The hard part of search is building the index and differentiating _real_ promotion from the _fake_. There's a lot of SEO manipulation that Google does a good job avoiding.


Webspam is a really big problem, yes. It's very unlikely that you'd be able to catch up or keep up in that regard without Google's resources.

Building the index itself is relatively easy. There are some subtleties that most people don't think about (eg. dupe detection and redirects are surprisingly complicated, and CJK segmentation is a pre-req for tokenizing), but things like tokenizing, building posting lists, and finding backlinks are trivial - a competent programmer could get basic English-only implementations of all three running in a day.
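
To make that concrete, here is a toy English-only pipeline: tokenizing plus posting lists, with AND-semantics lookup. The corpus and queries are invented, and everything that makes real indexing hard (dupes, redirects, CJK segmentation, ranking) is deliberately absent:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Naive English-only tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to a sorted posting list of doc ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {t: sorted(ids) for t, ids in postings.items()}

def search(index, query):
    """AND semantics: return docs containing every query term."""
    lists = [set(index.get(t, ())) for t in tokenize(query)]
    return sorted(set.intersection(*lists)) if lists else []

docs = {1: "The quick brown fox", 2: "Quick search engines", 3: "Brown bears"}
index = build_index(docs)
print(search(index, "quick brown"))  # -> [1]
```

Backlink extraction is the same shape: treat each outgoing link as a "term" and the posting list becomes the set of pages linking to a target.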


I am not even that good of a programmer, and I agree with you that the index is relatively trivial. Other major issues, besides fighting spam:

- Hardware infrastructure and data center presence for extremely fast search from anywhere in the world.

- Near real-time search suggestions.

- Personalized search results based on past searches plus geolocation.

- Instant results without having to go to a website.

Just to name a few. Google Search is the gold standard of a search engine, not because it's Google, or because they have been around for a long time and the brand name sticks (I am sure that helps too), but for the simple fact that no search engine is even remotely close to being as good as Google. I have tried them all, more or less, and given them a shot. They are just not good at all.

I also don't understand the hate towards Google being in charge of so many products so many people use, i.e. Mail, Maps, Chrome, Android, Docs (to name a few). It's simply because they are damn good at it. If it's a crime to make a product so good that people continue to use it, then I don't know what else people are supposed to do. It's as if we are asking Google to make shit products; I just don't understand the reasoning.


It has nothing to do with the number of products, it’s what they do with their influence over the market. See AMP and incompatibilities between Gmail & IMAP, for example.


You're concentrating on the literal interpretation of the phrase "give access to the index". This is a non-technical article which didn't go into details; just read it as "give access to the index & ranking".


> Google simply has the best search product.

The best available doesn't necessarily mean the best possible. And Google is far from it, and it's getting worse, not better.


I've definitely noticed a decline in quality of Google results over the past few years in particular. I don't know if that's because SEO has gotten control of the results or if Google's algo is shoving lower quality up higher for revenue, but it's become difficult.

Using a bit of Google-fu I'm usually able to find what I need quickly but it's still more of a hassle than it used to be.


There's exponentially more background noise than there used to be

It's easier to return the most relevant 10 results when there's only 10 thousand options than when there's 10 trillion options with 10 thousand new ones created every day.


I work at Google but not on Search.

My guess is that it's because Google Search now also has to cater to queries from Assistant. Being required to handle web, mobile, and assistant probably necessitated tradeoffs in quality of one over another.

More generally I feel like as the company gets bigger it just gets much harder to handle all the complexity and keep things focused.


I don't know why you're getting downvoted, because the quality has 100% tanked over the last few years. I agree that there may be some selection bias between us, but it's at least got some of my normie non-technical friends commenting about it, so it's not completely without merit. I have a couple of theories, one of them is also a warning.

First, I think search results at Google have gotten worse because people are not actually good at finding the best example of what they're looking for. People go with whatever query result exceeds some minimum threshold. This means when Google looks at what people "land on" (e.g. something like the last link of 5 they clicked from the search page, and then which they spend the most time on according to that page's Google Analytics or whatever), they aren't optimizing for what's best, they're optimizing for what is the minimum acceptable result. And so what's happening is years and years of cumulative "Well, I suppose that's good enough" culminating in a perceptible drop in search result quality overall.

Second, Google has clearly been giving greater weight to results that are more recent. You'd think this would improve the quality of the results which "survive the test of time" but again, Google isn't optimizing for "best" results, they're optimizing for "the result which sucks the least among the top 3-5 actual non-ad results people might manage to look at before they are satisfied". So this has the effect of crowding out older results which are actually better, but which don't get shown as much because newer results have temporal weight.

My warning is this, too, which you've surely noticed: Google search has created a "consciousness" of the internet. In the 90s, digitizing something was kind of like "it'll be here forever", and for some reason people still today think putting something online gives it some kind of temporal longevity, which it absolutely does not have.

I did a big research project at the end of the last decade, and I was looking for links specifically from the turn of the century. Even in 2009 they were incredibly hard to find, and suffered immensely from bitrot, with links not working, and leaning heavily on archive.org.

Google has been and is amplifying this tremendously, by twiddling the knob to give more recent results a positive weight in search. Google makes a shitload of money from mass media content companies (e.g. Buzzfeed) and whatever other sources meet the minimum-acceptable-threshold for some query, versus linking to some old university or personal blog site which has no ads whatsoever. So the span of accessible knowledge has greatly shrunk over the last few years. Not only has the playing field of mass media and social media companies shrunk, but the older stuff isn't even accessible anymore. So we're being forced once more into a "television" kind of attention span, by Google, because of ads.


I find the single hardest thing to search for these days is anything more than a few months old on YouTube... They hate older videos, it feels like. Beyond that, I keep seeing suggestions on new content from years ago... it's just weird.

I know it's not google proper, but I'd guess a significant number of their searches are specific to youtube.


I believe they put newer content first in order to distribute views more fairly. If you order results by popularity on YT, you will see that it uses just an "order by view count desc" (no relationship with like/dislike ratio), which is bad because it keeps some not-so-good-quality videos from YouTube's first years popular.


Worse still, IMHO, is that it may not be a popular video I'm looking for. I really wish they'd factor an "I have viewed this" signal into results.


I disagree. It works great for me. Maybe once every few days I will use !g when I can't find something, but I rarely end up finding it on Google either.

I read somewhere that someone used a skin to make ddg look identical to Google. After doing that, they never even thought about using Google again.


Microsoft thinks what they have is Clean UX.....


Microsoft just needs to get their head out of their ass! With the amount of money they have spent, and what little they have to show for it, they should just can the entire Bing team (or whatever they call their search engine team today). Not only have they sucked; if they just folded, they'd let the monopoly argument against Google ride somewhat.


Sure, it costs $50 to grep it, but how much does it cost to host an in-memory index with all the data?

This is not a proposal to just share the crawl data, but the actual searchable index, presumably at arm's-length cost both internally & externally.

The same ideas could be extended to the Knowledge Graph, etc.

IMO the goal here should not be to kill Google, but to keep Google on their toes by removing barriers to competition.


The data was about 55TB of compressed HTML last I looked, so that's about 70 r5a.24xlarge instances, each going for $5.424/hour, so about $380/hour or $277K/month. That's not cheap, and definitely not something you'd put on your personal credit card, but it's well within the range of a seed-funded startup. Sizes may vary a bit depending upon the exact index format, but that should be a rough ballpark. With batch jobs being so cheap, you could experiment a bit with your own finances and then seek funding once you can demonstrate a few queries where your results are better than Google. If you actually have a credible threat to Google, you'll have investors breathing down your neck, because it's a $130B market.
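
The back-of-envelope above, written out (instance count, RAM size, and price are the assumptions stated in this comment, not current AWS figures):

```python
# r5a.24xlarge has 768 GiB RAM, so 70 instances hold ~52.5 TiB,
# enough for roughly 55 TB of compressed HTML in memory.
instances = 70
price_per_hour = 5.424        # assumed on-demand $/hour per instance

hourly = instances * price_per_hour
monthly = hourly * 730        # ~730 hours in a month

print(f"${hourly:.0f}/hour, ${monthly:,.0f}/month")  # -> $380/hour, $277,166/month
```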

API access to either the unranked or ranked index in memory wouldn't do anything useful, BTW. To have a viable startup you need something a lot better than Google, which means that you need algorithms that do something fundamentally different from Google, which means you need to be able to touch memory yourself and not go through an API for every document you might need to examine. Remember, search touches (nearly) every indexed document on every query - if you throw in 200ms request latency for 4B documents your request will take roughly 25 years to complete.

Knowledge Graph is already public - it was an open dataset before it was bought by Google, and a snapshot of its state at the point Google closed it to further additions is still hosted by Google:

https://developers.google.com/freebase/

(It's only 22G gzipped, too - you can download that onto a personal laptop.)


"Remember, search touches (nearly) every indexed document on every query" - wait, why does that happen?

Doesn't it only touch ones with at least one of the search terms in, or stemmed/varied words relating to some of the terms? And does that via an index?


I struggled with how to word that in a way that's true, understandable, and doesn't give away any proprietary information. Added "indexed" to clarify, but I didn't fix up the numbers, so they're likely an overestimate.

Basically, yes, it uses an index and touches only documents that appear in one of the relevant posting lists. However, after stemming, spell-correcting, synonyms, and a number of other expansions I'm not at liberty to discuss, there can be a lot of query terms that it needs to look through, covering a significant portion of the index. Each one of these needs to be scored (well, sorta - there are various tricks you can use to avoid scoring some docs, which again I'm not at liberty to discuss), and it's usually beneficial to merge the scores only after they have been computed for all query terms, because you have more information about context available then.

There's a reason Google uses an in-memory index: it gives you a lot more flexibility about what information you can use to score documents at query time, which in turn lets you use more of the query as context. With an on-disk index you basically have to precompute scores for each term and can only merge them with simple arithmetic formulas.
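A toy sketch of the posting-list scoring being described - score each document per term, then merge across terms at the end, so the merge step can see the whole query as context. All data, the expansion weighting, and the scoring formula here are made up for illustration; real systems use far more signals:

```python
from collections import defaultdict

# Toy posting lists: term -> {doc_id: term_frequency}. Illustrative data only.
postings = {
    "search":  {1: 3, 2: 1, 4: 2},
    "engine":  {1: 2, 3: 5},
    "engines": {2: 2, 3: 1},   # a stemmed/expanded variant of "engine"
}

def score(query_terms, expansions):
    """Score each doc per term; merge across terms only at the end,
    when scores for all query terms are available as context."""
    per_term = defaultdict(dict)                      # doc -> {term: score}
    for term in query_terms + expansions:
        weight = 1.0 if term in query_terms else 0.5  # downweight expansions
        for doc, tf in postings.get(term, {}).items():
            per_term[doc][term] = weight * tf
    # Merge step: a plain sum here, but since every per-term score is still
    # visible, any context-aware combination could be used instead.
    return sorted(((sum(s.values()), doc) for doc, s in per_term.items()),
                  reverse=True)

print(score(["search", "engine"], ["engines"]))
```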


> Basically, yes, it uses an index and touches only documents that appear in one of the relevant posting lists. However, after stemming, spell-correcting, synonyms, and a number of other expansions I'm not at liberty to discuss, there can be a lot of query terms that it needs to look through, covering a significant portion of the index.

But, reading through the other comments, leaving out this part would make it better than Google.

Maybe stemming. I remember when Google added stemming (somewhere in the early 2000s). I was conflicted about it because I didn't want a search engine to second-guess my query (can you imagine??), but I also saw the use because I was already in the habit of trying multiple variations.

Auto spelling correct is a no-no. Just say "did you mean X?" and let people click it if they misspelled X. No sense in querying for both the "typo" and "corrected" keywords, because the "typo" would rank much lower, right?

Similar for synonyms. Either it should be an operator like ~, or maybe it should just offer a list (like the "did you mean" question) of synonyms to help the user think/select similar words to help their query.


> Each one of these needs to be scored (well, sorta - there are various tricks you can use to avoid scoring some docs, which again I'm not at liberty to discuss)

You mean like WAND or BMW (Block-Max WAND)?


> Knowledge Graph is already public > https://developers.google.com/freebase/

That dump is outdated, not supported, and very incomplete compared to what Google has now.


Perhaps move the Google index and the Facebook graph to "utility" companies, with Google/Facebook being frontends/consumers for those companies. Tiered access costs based on query/access volumes could fund the utility and allow smaller companies to have access with costs based on their scale; if they can't monetise as they scale up to cover the costs, then they should not be in business.


>The comments here that PageRank is Google's secret sauce also aren't really true - Google hasn't used PageRank since 2006.

That's quite a claim considering they were reporting PageRank in their toolbar until 2016, and toolbar PageRank was visible in Google Directory until 2011.

Are you talking about PageRank from the original patent?


It is a seemingly incorrect claim. Google has semi-recently, publicly said they still use PageRank as one of their signals.

https://searchengineland.com/google-has-confirmed-they-are-r...

https://twitter.com/methode/status/829755916895535104


They replaced it in 2006 with an algorithm that gives approximately-similar results but is significantly faster to compute. The replacement algorithm is the number that's been reported in the toolbar, and what Google claims as PageRank (it even has a similar name, and so Google's claim isn't technically incorrect). Both algorithms are O(N log N) but the replacement has a much smaller constant on the log N factor, because it does away with the need to iterate until the algorithm converges. That's fairly important as the web grew from ~1-10M pages to 150B+.
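For context, here's the textbook 1998 PageRank the comment says was replaced - a power iteration that must loop until the ranks converge, which is exactly the per-iteration cost a faster replacement would want to avoid. The graph below is a made-up three-page example:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Classic power-iteration PageRank over a dict of page -> outlinks.
    Iterating until convergence is the expensive part at web scale."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Base probability of a random jump to each page.
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank

# Tiny illustrative graph: b and c both link to a, so a should rank highest.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```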


> That's fairly important as the web grew from ~1-10M pages to 150B+.

This is the weird thing -- it feels smaller. Back in the early 2000s it really felt like I was navigating an ocean of knowledge. But these days it just feels like a couple of lakes.

(also, I'm pretty sure it was billions already quite early on?)


So what's the name of the new algorithm?


>The real reason Google's still on top is that consumer habits are hard to change, and once people have 20 years of practice solving a problem one way, most of them are not going to switch unless the alternative isn't just better, it's way, way better.

I agree about consumers' habits, but not about quality - I mean, the Google of today is a worse search engine than the Google of 5 years ago.

Now Google tries to guess, badly, what you meant, instead of giving you what you asked for. The pleasure of dealing with IT systems is that they give you what you ask for, not what they think you meant - guessing introduces extra error, and worse, error that cannot be fixed by the user.

I can rephrase my query, and Google will still reinterpret it - leading to the same batch of useless results.


I can also comment here. I built and still run a petabyte-scale web crawler:

https://www.datastreamer.io/

Common Crawl and other sources do in fact have a ton of data that can be used which is very affordable.

The DATA itself probably stopped being a real competitive advantage around 2008-2010.

Google's major advantage now is its algorithms and the fact that it has proven they work and are reliable.

Most importantly, it's the brand. Google MEANS search in the US and that won't change anytime soon.

PS,... if you need tons of web and social data Datastreamer can hook you up too :)


>"Indexing is the (relatively) easy part of building a search engine. CommonCrawl already indexes the top 3B+ pages on the web and makes it freely available on AWS."

Interesting - I would have thought that crawling at this scale and finishing in a reasonable amount of time would still be somewhat challenging. Might you have any suggested reading for how this is done in practice?

>"It costs about $50 to grep over it, $800 or so to run a moderately complex Hadoop job."

Curious what type of Hadoop job you might be referring to here. Would this be building smaller, more specific indexes or simply sharding a master index?

>"Google hasn't used PageRank since 2006."

Wow, that's a long time now. What did they replace it with? Might you have any links regarding this?


Crawling is tricky but it's been commoditized. CommonCrawl does it for free for you. If you need pages that aren't in the index then you need to deal with all the crawling issues, but its index is about as big as the one most Google research was done on when I was there.

$50 gets you basically a Hadoop job that can run a regular expression over the plain text in a reasonably-efficient programming language (I tested with both Kotlin and Rust and they were in that ballpark). $800 was for a custom MapReduce I wrote that did something moderately complex - it would look at an arbitrary website, determine if it was a forum page, and then develop a strategy for extracting parsed & dated posts from the page and crawling it in the future.

A straight inverted index (where you tokenize the plaintext and store a posting list of documents for each term) would likely be more towards the $50 end of the spectrum - this is a classic information retrieval exercise that's both pretty easy to program (you can do it in a half day or so) and not very computationally intensive. It's also pretty useless for a real consumer search engine - there's a reason Google replaced all the keyword-based search engines we used in the '90s. There's also no reason you would do it today, when you have open-source products like ElasticSearch that'd do it for you and have a lot more linguistic smarts built in. (Straight ElasticSearch with no ranking tweaks is also nowhere near as good as Google.)
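The half-day exercise described above, in miniature - a straight inverted index with unranked AND-queries. The documents are toy strings; a real build would tokenize the plaintext of CommonCrawl WARC records instead:

```python
import re
from collections import defaultdict

def build_index(docs):
    """Build a straight inverted index: term -> sorted posting list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """AND-query: intersect posting lists. No ranking at all."""
    lists = [set(index.get(t, [])) for t in query.lower().split()]
    return sorted(set.intersection(*lists)) if lists else []

docs = {
    1: "Google replaced the keyword search engines",
    2: "a keyword index is easy to build",
    3: "building a search engine index",
}
idx = build_index(docs)
print(search(idx, "search index"))  # -> [3]
```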


Thanks for the detailed response. I appreciate it. I will look into CommonCrawl. Cheers.


IMHO a simpler and probably the only viable way to force competition is to legally forbid Google from responding to any query during certain periodic time windows.

For instance, if you were to forbid Google from operating on every odd-numbered day, then 50% of the search engine market and revenues would immediately be distributed among competitors and furthermore users would be forced to test multiple engines and they could find a better one to use even when Google is allowed to operate.

Obviously this has a short-term economic cost if other search engines aren't good enough, as well as imposing an arbitrary restriction on business, so it's debatable whether this would be a reasonable course of action.


Banning anything is often not a good policy since it usually creates secondary markets.

Depending on how you count the date, this could create markets where people in different countries sell Google search results to each other. New VPN providers pop up with the promise of 24h Google coverage. Software startups switch to a system where you work 16 hours straight, then get the next 32 hours off, and repeat. "Breaking news" gets a new Oxford definition, since newspapers plan to publish news 5 minutes before Google opens for search. Electricity prices increase for the first 2 hours of the odd-numbered day to combat the spike in demand. Comcast introduces a new fast lane at only $199 a month with no slowed-down access to Google. University groups lobby for a new exemption in the law allowing unrestricted weekday access. Political parties lobby for also blocking Google on the day of debates, regardless of whether it's an odd-numbered day. It's kinda fun to keep going.


Any search engine that was unavailable for 50% of the time would soon have 0% of the market, not 50%.


This can be solved, in the odd-days example, by making either the second most popular or all other search engines operate only on the even days (as well as making the restriction apply to the most popular engine instead of Google in particular).

This has other drawbacks of course.


Actually the omnibox made it really easy to switch to ddg. With an occasional fallback to google.

I have no problem with advertising etc. but the tracking and selling of data is such an idiotic thing. We as consumers should have a global internet-law, and be reimbursed for data leaks or usage outside the scope of the application.

By "no problem with ads" I mean the original Google ads: very clearly marked as ads and not intermingled with the results. Scrolling down past ads to reach the results is nuts. I will click ads if they're relevant, regardless of whether they're on the right or in the results. So please stop supporting this fraud against advertisers.


I think that the "fallback to Google" might actually tend to diminish consumer confidence in DDG. Every time you use it, you basically say to yourself "$newRiskyStrategy fails sometimes, we still need $oldReliableStrategy".

Instead, what might help DDG is a plugin that detects when you go past the first or second page of Google search results, and suggests that you might get better results on DDG. It's a little intrusive, but the mental nudge becomes "$oldReliableStrategy has flaws, try $newRiskyStrategy". You get a positive emotional interaction with DDG rather than "forcing" yourself to use it all of the time and "failing back" to Google.


> when I was at Google nearly all research & new features were done on the top 4B pages, and the remaining 150B+ pages were only consulted if no results in the top 4B turned up

This may help to explain the poor quality of some of the results on queries I run on Google lately, which return content obviously written for SEO ranking but with very little value.

I have 2 questions:

- What makes "the top 3B+" the top ones?

- How can I "force" a search on the other 150B+ pages?


I find it odd that you claim to be a former Google search engineer and in the end boil down the success of Google search to brand recognition / loyalty. You kinda glossed over the insane complexity of building and maintaining a high-quality search engine; really weird comment, to be honest.


> you'd have to break Google up and then forbid any of the baby-Googles from using the Google brand or google.com domain name.

Just let "google" become the generic term for search, as it's already well on its way.


PageRank is a synonym for link juice. So when you say Google hasn't used PageRank since 2006, can you confirm that you are talking about link juice as opposed to the old toolbar representation of PageRank? And assuming you do mean link juice, why do links still work so well for SEO?


Okay, this is a relatively serious proposal to require Google to allow API access to its search index, with the premise that it would democratize the search engine ecosystem. There are some issues with the regulations he proposes (you have to allow throttling to prevent DDoS attacks, and you can't let anyone with API access add content to prevent garbage results), but it's roughly feasible.

The main problem is, I think the author is wrong about what Google's "crown jewel" is. Yes, Google has a huge index, but most queries aren't in the long tail. Indexing the top billion pages or so won't take as long as people think.

The things that Google has that are truly unique are 1) a record of searches and user clicks for the past 20 years and 2) 20 years of experience fighting SEO spam. 1 is especially hard to beat, because that's presumably the data Google uses to optimize the parameters of its search algorithm. 2 seems doable, but would take a giant up-front investment for a new search engine to achieve. Bing had the money and persistence to make that investment, but how many others will?


> 1) a record of searches and user clicks for the past 20 years

From what I can tell, Google cares a lot more about recency.

When I switch over to a new framework or language, search results are pretty bad for the first week, horrible actually as Google thinks I am still using /other language/. I have to keep appending the language / framework name to my queries.

After a week or so? The results are pure magic. I can search for something sort of describing what I want and Google returns the correct answer. If I search for 'array length' Google is going to tell me how to find the length of an array in whatever language I am currently immersed in!

As much as I try to use Duck Duck Go, Google is just too magic.

But I don't think it is because they have my complete search history.

Also people forget that the creepy stuff Google does is super useful.

For example, whatever framework I am using, Google will start pushing news updates to my Google Now (or whatever it is called on my phone) about new releases to that framework. I get a constant stream of learning resources, valuable blog posts, and best practices delivered to me every morning!

It really is impressive.


> Also people forget that the creepy stuff Google does is super useful.

For the same reasons you’re exalting them, I have non-technical friends who asked me how Google knows so much about them (and suggestions on how to avoid it) because they found it too creepy.

I don’t think people forget Google’s results are useful; some just think they’re more creepy than valuable. You seem to have picked your side in that (im)balance, and other people prefer the other side.

There’s also the relevant consideration that no matter how useful they may be, they should have no right to impose themselves on you. By this I mean that one should be free to refuse their creepiness, understanding the price is their usefulness. Yet, Google is the subject of privacy violations all the time, and they are caught time and again lying about what they collect on users.


> I don’t think people forget Google’s results are useful; some just think they’re more creepy than valuable. You seem to have picked your side in that (im)balance, and other people prefer the other side.

Just as a general observation without taking either side:

People routinely fail to recognize both sides of a particular thing. It's why we have sayings like "You don't know what you've got til it's gone."


I wish interfaces were more straight up about their intentions and made it easier to implement account level partitions. For work I love Google's magic tracking effects, but at 1 am, hell no.


You can have multiple identities in chrome[0], even guest identities.

[0] https://support.google.com/chrome/answer/2364824


Right. I'm just saying it should be clearer. Ex: I want to have a list of accounts, Netflix style, that I'm presented with on an empty Chrome window. If in fact multiple identities don't merge data implicitly in any way, then this is just a UI issue.

But I have a hard time believing google truly partitions everything in a multi account setup.


It would be immensely useful if Google understood that normal people have multiple facades that they use in different contexts. Probably several professional (which project / component was I working on again), private but family friendly (planning gifts for relatives, etc), and private but clearly out there (stuff you don't want to shock 60 year old parents / young kids / etc with) profiles.

Also, for incognito stuff, it'd be nice to have read-only profiles based on stock profiles related to various activities or people.


It is actually possible to operate without relying on Google or any other big tech firm. Who is forcing you into these privacy dilemmas? All of their services are a choice you are making. You don't need to accept any of it if you don't want to.


> You don't need to accept any of it if you don't want to.

Tell that to the people who had their privacy violated by Street View[1]. And the people who specifically disabled location services on their Android devices but were still tracked[2]. Or all the people who have no idea what Google Analytics is and never consented to it, but are profiled by it everyday.

> All of their services are a choice you are making.

I do my best to avoid privacy invading companies, and as a technical user I find it tiring and know I deal with consequences (e.g. broken websites). It perplexes me that comments like yours still pop up. We’re not the only segment of the population that exists; non-technical users are the majority, and they have the same right to privacy as we do, with a modicum of transparency. If even technical people are regularly tripped by privacy invasions we didn’t know about, what chances do non-technical users have?

[1]: https://www.nytimes.com/2013/03/13/technology/google-pays-fi...

[2]: https://qz.com/1131515/google-collects-android-users-locatio...


Street View is debatably invasive. I understand this might seem hand-wavy to someone really concerned about privacy issues, but:

1. Generally speaking, I would think VERY few people care about an image of their property being on Street View.

2. It's not really illegal to take pictures, so even from a legal standpoint it seems like a gray area.

3. I understand there can be individual reasons for not wanting this, but it seems to be a very large net positive. And I would apply that statement to most other tracking and data policies they have.

If they are lying about how their services track people, that is definitely grounds for concern. The transparency can definitely be improved, but still, these are people with Android phones and people using Google Analytics. No one is forced to use these things; they are free to use any other service or create their own.

And my attitude is out of pragmatism and how I think privacy issues should be handled. I don't have any problem with the way Google uses my data so I don't care to fix a non problem. And I don't see it as their responsibility to change a way of business when anyone is free to use any other service or create their own, since I don't find it offensive.


The first sentence of the linked New York Times story:

> Google on Tuesday acknowledged to state officials that it had violated people’s privacy during its Street View mapping project when it casually scooped up passwords, e-mail and other personal information from unsuspecting computer users.

That answers your first three paragraphs. There’s no “if” to their lying and privacy invasions. They’ve been caught and admitted their actions time and again.

> No one is forced to use these things they are free to use any other service or create their own.

It is here I will respectfully give up on continuing the conversation with you. You’re either ignoring my main point or truly don’t care for the majority of users. Most people don’t understand the ramifications of these choices and for good reason; they are hard to understand. By suggesting non-technical users create their own services and devices, I’m now wondering it you’re trolling me.

> And my attitude is out of pragmatism (…) I don't have any problem with the way Google uses my data

Which is valid, but irrelevant. I’ve already mentioned in the top post different people make different choices. I presented another side and used facts to justify it. If you’re going to answer with mere opinion, you’re not adding to the points made by the original poster.


That snippet of the NYT story omits critical context: The data they captured were random wifi packets (probably for use in Skyhook-type location fixes by way of mapping out where APs are). Sounds like they were doing the equivalent of a wardrive and captured more than the AP advertisement message.

This is information that Google doesn't have any need for (noise) and didn't want in the first place.

They also self-reported the failure, where they could have just nuked it and we wouldn't be having this conversation.


What? You seem to be misunderstanding my statements.

My first points were about the streetview product. Scooping up passwords is obviously not the intent of that product, maybe that was an error or they changed the core product at some point? I can't read the paywalled article.

I'm not suggesting non-technical users create products... you're reading so far out of context. Just because user X can't create a new product does not mean that we should place sanctions on company Y. I'm glad you used facts somewhere else because in this post you just illogically connect a bunch of dots.

Yes, some of it is my opinion and a lot of this is yours. But a fact is still that no one is forcing you to use these products; then you went off about stolen passwords and trolling and resigned from the argument. That sounds like the rationality of a completely one-sided, biased individual in itself, respectfully.

Yes everyone agrees transparency is good and lying is bad. Google is not Evil Or Benevolent. They're just people...

"And I don’t use them. I hoped that by continuing to mention non-technical users you’d get it, but this was never about me. You keep bringing up that argument, but read what you replied to in the first post — I recounted the experience of non-technical people I know, not my experience. Stop telling me I have a choice; the point is not us, it’s non-technical users who don’t have the knowledge to make informed choices!"

Haha you are so ridiculous. This was your first post:

> There’s also the relevant consideration that no matter how useful they may be, they should have no right to impose themselves on you.

Then you say you don't know why I bring up that you don't need to use Google's services... C'mon man, get real. That's why the point about using alternatives or creating new ones is very relevant, and this entire thread is about sanctions. Don't start a convo you can't participate in and then just claim you won and leave; that's childish behavior.


> Just because user X can't create a new product does not mean that we should place sanctions on company Y. (…) in this post you just illogically connect a bunch of dots.

That is an insane extrapolation, and the reason I don’t want to continue the conversation with you: you’re answering points I’m not making. I haven’t even hinted at sanctions; I have no idea where you’re getting that from.

> But a fact is still no one is forcing you to use these products

And I don’t use them. I hoped that by continuing to mention non-technical users you’d get it, but this was never about me. You keep bringing up that argument, but read what you replied to in the first post — I recounted the experience of non-technical people I know, not my experience. Stop telling me I have a choice; the point is not us, it’s non-technical users who don’t have the knowledge to make informed choices!

> That sounds like a rationality of a completely one-sided biased individual in itself, respectfully.

Believe what you want. I just don’t want to keep wasting my night arguing with someone that started a discussion but refuses to address the points originally made. Why reply, then?

Maybe I’m not explaining myself well enough, or in the correct way for you to understand, or maybe you’re the one not grasping what I mean. It doesn’t really matter where the problem lies, just that it’s clearly not working.

Maybe if we ever meet in person we can resume this conversation, but tonight it’s not being productive, so I genuinely wish you a good week and sign out here.


> It is possible to operate without relying on any big tech firm.

Some writer from Gizmodo tried that last February. Let's just say that you are technically correct that you don't need any of the big tech firms.

https://gizmodo.com/i-cut-the-big-five-tech-giants-from-my-l...


I was curious about this, as I work in multiple languages every day. I almost never use Google though except as last resort if other engines can't find anything. So the result I got for array length was for Javascript. Which is quite high on the hype cycle now, but I only very rarely use it and search anything about it even less frequently.

So I wonder how much of the magic you perceive might just be your interests matching those of most other people using Google - that is, it's not Google magically guessing you're into Javascript (for example), but Javascript being popular, which causes both Google returning matches for it and you starting to use it. Did you ever do a clean experiment - e.g. try to learn APL or some other relatively obscure language and see whether Google returns all results about APL and none about Javascript?


Going back to OP's point: Google is really good at associating search queries with search results. Every time you search and click on something, Google learns that association.

So it could very well be that, as more users adopted the new language/framework in the first couple of weeks, they taught Google those associations.

Google isn’t a search company. They are a distributed machine learning company that make most of their money from learning what people want and showing relevant ads to them.


They have ads to show first; telling people what they want comes after that, and knowing what people wanted only matters insofar as it makes the second easier and serves the first.

Really good or really bad only exists if there is something else to compare it to.


I always see posts like this here, and then I try it, and I get a page full of "array length" results for Javascript, while everything in the last year that I've searched for has been Java or Kotlin...

Same when I owned a Pixel after hearing about Google Now and their ML magic there. Nothing more magical than an iPhone in terms of suggestions. The camera was amazing, but not all this supposed contextual stuff.


Wild guess: in a surge of privacy consciousness you told Google to stay the heck away from your data. These checkboxes stick forever, and a couple of years down the line some magic feature won't be able to learn from your data. E.g. despite working there, I still haven't figured out how to let Photos recognize people in my pictures, something that definitely is on by default.


Question: have you ever visited this webpage?

https://myactivity.google.com/

For many people it is enough to be totally creeped out about Google.

Also, that Google remembers context can be handy but it is not essential. Without context, I am sure you would be equally capable of finding what you are looking for, although it might take a little more typing since you'll have to supply the context yourself. Imho, convenience is not a good argument for giving away your personal information.


Yeah, I must echo your sentiments wrt their Google Now product; it is great. Not only does it provide relevant content, but some of it is very new and/or obscure, which I really appreciate. I have linked people to videos I pulled off my Google Now feed and they are amazed that I know about a video on our very specific shared interest that is less than a couple hours old and has only a few hundred views.


The flip side of this is that it makes it harder for you to stumble upon something related, but new, outside of the filter bubble Google is making for you.

There's no arguing what you're describing is useful, but it's nice to keep in mind that there are downsides even if you ignore the privacy argument (which, IMO, shouldn't be ignored).


Your results may be bad for the first week, but the better results you get later on have everything to do with Google’s long-term user-base.


> Yes, Google has a huge index, but most queries aren't in the long tail.

I'm not quite sure about that. 15% of Google searches per day are unique, as in, Google has never seen them before. [1]. That's quite an insane number.

[1] https://searchengineland.com/google-reaffirms-15-searches-ne...


Sharing for anyone who didn't know: there is a very good dataset you can use now. If you don't have an NVMe SSD in your computer, I highly recommend getting one for fast I/O.

http://commoncrawl.org/

http://commoncrawl.org/the-data/

http://index.commoncrawl.org/
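If you want to pull individual pages out of those crawls, the index server above speaks a CDX-style HTTP API that returns one JSON record per line, with each record pointing at a byte range inside a public WARC file. A sketch - note the crawl name is just an example (pick a current one from index.commoncrawl.org) and the sample record is made up for illustration:

```python
import json
from urllib.parse import urlencode

def index_query_url(url, crawl="CC-MAIN-2019-30"):
    """Build a CommonCrawl CDX index query URL (crawl name is an example)."""
    return f"https://index.commoncrawl.org/{crawl}-index?" + urlencode(
        {"url": url, "output": "json"})

# Each result line says which WARC file holds the capture and at what byte
# offset/length, so you can Range-request just that slice from the public
# bucket. This sample record is made up for illustration.
sample = ('{"url": "https://example.com/", '
          '"filename": "crawl-data/CC-MAIN-2019-30/segments/.../x.warc.gz", '
          '"offset": "1234", "length": "5678"}')
rec = json.loads(sample)
start = int(rec["offset"])
byte_range = f"bytes={start}-{start + int(rec['length']) - 1}"
print(index_query_url("example.com"))
print(byte_range)  # -> bytes=1234-6911
```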

Related: Mark's blog is amazing and worth more than any data science degree, imho.

https://tech.marksblogg.com/petabytes-of-website-data-spark-...

https://tech.marksblogg.com


wow, thanks.

[edit] in my experience yacy works really well. You have it crawl the sites you frequently visit and their external links and it quickly accumulates to something more accurate than google.



Wow, 15% unique searches is indeed quite an interesting figure. With that said, what OP said is definitely not disproved: just because 15% of searches are unique, that doesn't mean the most relevant result is buried in the tail end. I mean, I can think of loads of my own searches that are probably unique or rare but lead to the same popular results because of typos, improper wording, etc.

Without some clear numbers on that from a major search engine, I think this might be very difficult to infer.


Especially with voice searches. People are searching entire sentences rather than specific keywords which are much more likely to be unique.


Do people do this?

Or do you mean queries forwarded by home assistants trying to parse inputs?


> Do people do this?

The calling card of the developer realising that real users never act like you expect :)


Real users will use your product in ways you never imagined.


Heh, yes, they do. Which is a reminder that devs are not "typical" users.

As a developer, I search using keywords; for example, if I was looking for property for sale in Inverness, I might search for "property Inverness", whereas I've seen and heard "typical" users use something like "find me a 2 bedroom house with a garden for sale in the North of Inverness" - much more verbose, and containing stop words and phrases unlikely to help (I think!).


I do the same as you, but was just thinking that if most users search using full sentences then Google will spend most effort optimizing for that, so maybe we're the ones getting the worse results?


No, the optimization they do for the low-quality query is more than balanced out by the higher clarity and relevance of a well-phrased query. There are often extraneous words that aren't simple stop words, and they're not 100% successful at removing these extraneous ones.


I almost always search keywords while my girlfriend uses sentences and we often get quite different results. If I'm having trouble finding a good result there's a pretty good chance she will find something quickly. Surprisingly this holds true even for programming questions on topics that I know well and she's never heard of before.


> As a developer, I search using keywords;

So did I.

Around the time I left Google behind I had started to search like my wife did, using full sentences. It sometimes worked better, I think.


With voice I use sentences: it's far more reliable because of the Markov model (or whatever predictive model they are using).


What does it matter whether it came from an assistant or not?

Natural language is likely the preferred search input method for kids under a certain age, who cannot yet type fluently. My kids formulate very long, complex queries verbally. The other day my son asked Alexa why the machine gun is such a deadly weapon. She replied with a snippet from Wikipedia that was surprisingly relevant.


I often do full sentences and then start deleting words from it if it doesn't work.


Can confirm. I search full sentences even from the keyboard.


I search full sentences (questions) from the keyboard. I figure I'm not the only to have had the question before, so I ask. Also, I find that blog posts, etc. tend to match well for full sentences.


Yes, sorry - that's me. Copying and pasting SharePoint error messages.


Those searches are unlikely to be unique.


Hmmm - Error: System.InvalidOperationException: The workflow with id=15f08b34-33f5-4063-8dea-d4ca6212c0d6 is no longer available.

is not atypical.


Does that actually work? I must be old school, I always delete such IDs before searching, but then again I used Google back when it actually did what you told it instead of misinterpreting everything for you.
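For what it's worth, stripping such IDs mechanically is a one-liner; a small sketch (the regex and function name are my own, purely illustrative):

```python
import re

# Matches standard 8-4-4-4-12 hex GUIDs like 15f08b34-33f5-4063-8dea-d4ca6212c0d6.
GUID_RE = re.compile(r'\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b')

def strip_guids(message: str) -> str:
    """Drop GUID-like tokens from an error message before searching for it."""
    return ' '.join(GUID_RE.sub('', message).split())

print(strip_guids(
    "Error: System.InvalidOperationException: The workflow with "
    "id=15f08b34-33f5-4063-8dea-d4ca6212c0d6 is no longer available."
))
# -> Error: System.InvalidOperationException: The workflow with id= is no longer available.
```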


It doesn't seem to have any particular effect on the results that come up. I always used to delete them, and still do sometimes but Google seems to pretty much ignore them in practice.


Which is a wonderful behavior except for all the times that the error numbers are not actually GUIDs but rather identify general errors.


If only :(


Could this be explained by supposing that people are just searching for current events, sometimes national, sometimes international, sometimes very local? If so, you really wouldn't need much indexed to handle those queries. I imagine many queries are also just overly verbose and sentence-length, which artificially inflates the number of unique queries which are actually seeking roughly the same pages.


Good point, and 15% is indeed a lot, but the question is what "unique" means. If it means that the exact same character sequence appeared for the first time, that doesn't mean the user searched for a term that has never been searched for.

I mean, with the newest advances like machine learning it's more and more possible to _semantically_ link queries. If that's the case, those 15% could become 5% truly unique searches or even less.

"how dumb is trump" and "how dumb is donald trump" are two different searches but they semantically belong together because they mean the same.
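As a crude illustration of how textually distinct queries can collapse together (this is nothing like the ML approach Google would actually use, and the stop-word list is invented):

```python
STOP_WORDS = {"the", "a", "an", "is", "how", "why", "what"}  # tiny invented list

def normalize(query: str) -> frozenset:
    """Reduce a query to its bag of non-stop-word terms."""
    return frozenset(w for w in query.lower().split() if w not in STOP_WORDS)

# "how dumb is trump" reduces to a subset of "how dumb is donald trump":
print(normalize("how dumb is trump") <= normalize("how dumb is donald trump"))  # True
```

Even this naive collapsing shrinks the count of "unique" strings; real semantic matching would merge far more.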


How many of those are confirmed to be of human origin?


Probably quite a few. New things happen. Politics, wars, famous folks, movies, music, diseases, scientific studies, products, brands, model numbers for products, fads and slang. I'm guessing there are other things as well.

Some of the new things are probably variation as well - as others have mentioned, sentences and voice commands can give lots of new stuff.


Now I feel bad for putting gibberish like jsjsjdkktkwoapaoalf in my address bar and searching Google to test if my internet is working..


I just type "test", hopefully they do that too and it is ignored.


I do that all the time, I wonder how common that is?


I would think it’s pretty common. For a lot of people Google is the internet. Or at least the reference. If Google isn't working, it’s almost certainly a problem on your end. I don’t think anyone else has that reputation for availability amongst the general public.


I think they mean that the results are still from the top pages of the internet. They mean long tail of visited pages, not long tail of searches.

A unique search query could still land you on Wikipedia.


> 15% of Google searches per day are unique, as in, Google has never seen them before.

That is impossible, and therefore wrong (I'm wrong, please see below). To know if a search is unique, as in Google has never seen it before, Google must be able to decide whether a query it receives was seen before or not. Even if we assume Google needed only one bit for each query it has ever seen, and assuming it saw only 15% new queries each day since its creation more than 20 years ago, it would need to store more than 2^1471 bits.

What could be true is that each day 15% of all searches are unique on that day.

Edit: I'm wrong. The 15% of completely unique queries per day is relative to that day's queries, not to all queries ever seen, so exponential growth doesn't apply. To see that, assume Google received just one search query each day for 20 years, but each was unique random gibberish; then Google could easily store all of that even though 100% of each day's queries are unique.


This is a somewhat faulty analysis. One could easily use a high-accuracy Bloom filter to test whether a search has definitely not been seen before; since the filter's only errors are false positives, counting those definite misses gives a lower-bound estimate of the number of truly new queries.
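For anyone curious, the whole idea fits in a few lines; a toy sketch using hashlib (a real deployment would size the filter per the usual bits-per-item rule and shard it):

```python
import hashlib

class BloomFilter:
    """Answers 'definitely new' vs 'possibly seen before' with no false negatives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive independent-ish positions by salting one strong hash.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str) -> bool:
        """Insert item; return True iff it was definitely never added before."""
        definitely_new = False
        for pos in self._positions(item):
            byte_idx, bit_idx = divmod(pos, 8)
            if not (self.bits[byte_idx] >> bit_idx) & 1:
                definitely_new = True
            self.bits[byte_idx] |= 1 << bit_idx
        return definitely_new

bf = BloomFilter(num_bits=10_000, num_hashes=7)
print(bf.add("how to tie a tie"))  # True: definitely never seen
print(bf.add("how to tie a tie"))  # False: (possibly) seen before
```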


Yup. This was actually an interview question I got from a former Google search engineer.


Where are you getting these numbers? Google says they get ~2 trillion searches per year. 40 trillion searches over 20 years (way too many) would be about 2^45 searches. https://searchengineland.com/google-now-handles-2-999-trilli...

(And they don’t even need to store all searches for all time for this, thanks to Bloom filters.)


The whole point was that 2^1471 is wrong.


It is roughly 1.15^(365*20). That it is wrong was clear from its size; I wanted to use its falseness to show that the assumptions are incorrect. Which they are, just not in the way I initially understood.
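For anyone checking the arithmetic: compounding 15% daily for 20 years is 1.15^7300, which does work out to roughly 2^1471:

```python
import math

days = 365 * 20                    # 7300 days
exponent = days * math.log2(1.15)  # rewrite 1.15^days as 2^exponent
print(math.floor(exponent))        # 1471
```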


How are you computing that number? It's definitely wrong.

Assume Google receives 1 trillion queries per year, and has been around for 20 years. Using a bloom filter you can achieve a 1% error rate with ~10 bits per item. So a 200 terabyte bloom filter would be more than sufficient to estimate the number of unique queries.
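The ~10 bits per item figure comes from the standard optimal-sizing formula for a Bloom filter, m/n = -ln(p) / (ln 2)^2; a quick check of the numbers above:

```python
import math

def bits_per_item(p: float) -> float:
    """Optimal Bloom filter bits per stored item for false-positive rate p."""
    return -math.log(p) / math.log(2) ** 2

n = 20 * 10**12  # ~20 trillion queries over 20 years
terabytes = bits_per_item(0.01) * n / 8 / 10**12

print(round(bits_per_item(0.01), 1))  # 9.6 bits per item at 1% error
print(round(terabytes))               # ~24 TB, comfortably under 200 TB
```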


A Bloom filter is just way overkill.

If you have a list of 20 trillion query strings, and each query string is on average < 100 bytes, you're looking at a three line MapReduce and < 1 PiB of disk to create a table which has the frequency of every query ever issued. Add a counter to your final reduce to count how often the # times seen is 1.
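The shape of that job in plain Python over a toy log (the query strings are invented; a real run would shard the grouping across thousands of workers):

```python
from collections import Counter

# Toy stand-in for the query log: one element per issued query.
query_log = [
    "weather today", "weather today", "python bloom filter",
    "weather today", "sharepoint workflow error",
]

freq = Counter(query_log)                      # the final "reduce": count per string
singletons = sum(1 for n in freq.values() if n == 1)
print(singletons / len(query_log))             # 0.4: fraction of traffic seen exactly once
```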


uh, is this sarcasm?

A bloom filter is the most appropriate data structure for this use-case. How is it overkill when it uses less space and is faster to query?


Actually the bloom filter was just an approachable example. There are much more clever and space efficient solutions to this problem, such as HyperLogLog [1] (speculating purely based on the numbers in that article, it looks like a few megabytes of space would be far more than sufficient). See the Wikipedia page on the "Count-distinct problem" [2].

1: https://en.wikipedia.org/wiki/HyperLogLog 2: https://en.wikipedia.org/wiki/Count-distinct_problem
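A bare-bones HyperLogLog really is tiny; this sketch omits the small- and large-range corrections from the paper, so treat it as illustrative only:

```python
import hashlib
import math

class HyperLogLog:
    """Approximate distinct count with 2^b registers; std error ~ 1.04 / sqrt(2^b)."""

    def __init__(self, b: int = 10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, item: str):
        x = int(hashlib.sha256(item.encode()).hexdigest(), 16)
        j = x & (self.m - 1)                        # low b bits select a register
        w = x >> self.b                             # remaining 256-b bits
        rank = (256 - self.b) - w.bit_length() + 1  # leading zeros + 1
        self.registers[j] = max(self.registers[j], rank)

    def estimate(self) -> float:
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z

hll = HyperLogLog(b=10)  # 1024 registers, a few KB, ~3% standard error
for i in range(10_000):
    hll.add(f"query-{i}")
print(hll.estimate())    # close to 10,000
```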


My initial approach was also technically wrong; it tells you the fraction of queries which happen once.

To find the fraction of queries each day which are new, you would want to add a second field to your aggregation (or just change the count), the first date the query was seen. After you get the first date each query was seen, sum up the total number of queries first seen on each date, compare it to the traffic for each date.

You could still hand the problem to a new hire (with the appropriate logs access), expect them to code up the MapReduce before lunch (or after if they need to read all the documentation), farm out the job to a few thousand workers, and expect to have the answer when you come back from lunch.
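That second aggregation, sketched over a toy in-memory log (dates and queries invented for illustration):

```python
from collections import Counter

# Toy (date, query) log in time order.
log = [
    ("2019-07-14", "weather today"),
    ("2019-07-14", "tour de france standings"),
    ("2019-07-15", "weather today"),          # repeat: not new on the 15th
    ("2019-07-15", "wimbledon final score"),  # first seen on the 15th
]

first_seen = {}                             # query -> date it first appeared
for date, query in log:
    first_seen.setdefault(query, date)

traffic = Counter(date for date, _ in log)  # total queries per day
new = Counter(first_seen.values())          # queries first seen on each day
for date in sorted(traffic):
    print(date, new[date] / traffic[date])  # 2019-07-14 1.0, then 2019-07-15 0.5
```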


I don't think it's necessarily impossible to calculate. Using probabilistic data structures arranged in a clever way, it's likely possible to calculate with some degree of accuracy.

I haven't thought this through, but take all the queries as they're made and create a bloom filter for every hour of searches. Depending when this process was started, an analytics group could then take a day of unique searches, and run them against this probabilistic history, and get a reasonable estimation with low error. Although the people who work on this sort of thing probably know it far better than I.

The real question, though, assuming the 15% is right: do we care about those 15%? Are they typos that don't merge, are they semantically different, are they bots searching for dates or hashes, etc.?


I believe that they're unique in a sense that nobody has typed in that exact query previously.

Of course, Google knows better than to treat every search query literally. Slight deviations and synonyms work for the majority of people, even if we techies highly oppose them and look for alternative solutions (like DDG) that still treat our searches quite literally.


>2) 20 years of experience fighting SEO spam.

Tangential - but does anyone else feel that google results are useless a lot of the time? If you search for something, you will get 100% SEO-optimized, shitty, ad-ridden blog/commercial pages giving surface-level info about what you searched for. I find that for programming/IT topics it's pretty good, but for other topics it is horrible. Unless you are very specific with your searches, "good" resources don't really percolate to the top. There isn't nearly enough filtering of "trash".


Yes, I feel like Google search results have very gradually become more irrelevant and spammy over the past decade or so.

There are 2 issues, I think.

Firstly, the SE-optimised spam, which has become very good at masquerading as genuine content.

Secondly, Google has dumbed search syntax down a bit, and often seems to outright ignore double quoted phrases, presumably thinking it knows better than I what I want.

As a dev, I do accept I may be an outlier though - with the incredible wealth of search history and location data that Google holds, it seems likely things have actually improved for typical users.


Is there a way to turn this "-ignoring thing off? It drives me nuts.


Seeing as google has my search history for the past 14 years, they should be able to KNOW that I'm a slightly more technical user and can take advantage of power user features instead of treating me like an idiot


Google signed an armistice in the Great Spamsite War some time around '08 or '09, to the effect that spam can have all the search results aside from those pointing at a few top, trusted sites, so long as they provide any content at all. Bad content is fine. Farmed content is fine. Content that was probably machine-generated is fine. Just content. Play the game, make sure your markov chain article generator or mechanical turks post every day, throw some Google ads on your page, and G will happily put your spamsite garbage at result #3.


There’s a reason for this: click-through rate on ads is higher on pages that don’t achieve the user’s goal.

I suspect that the AI models powering the search results develop a sort of symbiotic relationship with the spam - if the user actually finds what they are looking for by clicking through an ad on an otherwise spammy page, everyone “wins”: the user found what they were looking for with minimum effort, Google got their ad revenue, and the spammy page got a little cut for generating content that best approximates the local minimum linking the user's keywords to actual intent...


“Farmed content is fine”. I thought that was one of the major (intentional) victims of the Panda update. https://moz.com/learn/seo/google-panda


There are a few widespread scaled publishing operations like IAC which seems to be doing well with the split up of About.com & relaunching it as vertically focused branded sites, but the content farm business model died with the Panda update.

Some of the sites that were hit like Suite101.com went offline. eHow is still off well over 90%. ArticlesBase sold on Flippa for like $10k or some such. One of the few wins hiding in all the rubble was HubPages, but even they had to rebrand and split out sites & merged into a company with a market cap of about $26 million ... and the CEO of Hubpages is brilliant.

Even with IAC on some sites they are suggesting ad revenues won't be enough http://www.tearsheet.co/culture-and-talent/investopedia-laun... "As Investopedia charts its course as a media brand, it’s coming up against the roadblock all publishers eventually hit — the reality that display revenue alone won’t be enough. ... Siegel said he expects course revenue to exceed what’s generated from the site’s free content. While he wouldn’t say what the company’s annual revenue was, Siegel said it grew an average of around 30 percent for each of the last three years."

There are also other factors which parallel the Panda update and further diminish the quick-n-thin rehash publishing business model:

- Google's featured snippets & knowledge graph pull content into the SERPs, so there is no outbound click on many searches

- programmatic advertising redirects advertiser ad spend away from content targeting toward retargeting & other forms of behavioral targeting (an advertiser can use a URL as a custom audience for AdWords ad targeting even if that site does not carry any Google ads on it)

- mobile search results have a smaller screen space, where if there is any commercial intent whatsoever the ads push the organic results below the fold


I agree with this. Most searches give me almost a whole page of ads and stuff up top before the things I’m interested in start showing up way down at the bottom of the page, and even then the results are often spam.

I’ve been using DuckDuckGo and have found I have this problem less. I don’t always find what I mean on DDG, as of now I’d say Google is still better if you’re not sure exactly what you’re looking for is called, but if you know the keywords you need DDG is often better.


Someone linked to an interesting site talking about how to make homemade hot sauce here on HN. I partly read it and thought it was a great, clean site and something I wanted to try. Later, going back to find it again, I literally spent hours searching, even though I'm pretty sure I remembered some of the exact phrases. For some reason recipe-related search results are really, really terrible on both Google and Bing.


Could you not find it again via the HN site search? https://hn.algolia.com/?query=%22hot%20sauce%22%20recipe&sor...


This is awesome and helped me find it again! Thank you!


Sometimes sites get dropped from the results because they are malware hosts. It’s more likely to happen to small independent sites. They are also more likely to just pack it up and shut down their sites.


Yeah, this is why I still use and like myactivity.google.com, as creepy as it is. It's helped me re-find so many interesting half-remembered sites and videos and songs I'd previously come across.


Why would you rely on google spying instead of your own browser history?


cross platform support, maybe?


100% agree. For technical queries, as long as a StackExchange comes up, Google is still okay.

But for increasingly more basic searches about a product I'm interested in or a medication or anything else non-complicated that would have gotten me a clean list of decent, non-paid results even 5 years ago, I'm now getting half a page of sponsored BS and then another half a page of 'created content' written by a bot or shyster explicitly for gaming Google's SEO.

Not only has Google lost almost all their good will (i.e. Don't be evil), but their products aren't even that good anymore, at least not so much better than alternatives where the negatives of using Google outweigh the difference in quality.


Yes, at least half the time I search about a particular topic, it seems the first few pages are written by some contractor in the Philippines probably getting paid $2 / hr who just spent the prior 30 minutes researching the topic.


I am not sure that this take is accurate.

I would agree that programming search results tend to be quite good, but I think this is likely in large part because the average person attracted to programming both has a high IQ and has experience building some part of the web stack. Thus the sites that are quite manipulative in nature would have a hard time trying to fake it until they make it in such a vertical where people are hard to monetize and are very good at distinguishing real from fake. And even if a fake site started to rank for a bit it would quickly fall off as discerning users gave it negative engagement signals.

This is also perhaps part of the reason sites like Stack Overflow monetize indirectly with employment related ads targeted to high value candidates versus say a set of contextually targeted ads on a typical forum page or teeth whitening gizmo ads on the Facebook ad feed.

The lack of filtering of "trash" probably comes from a bunch of different areas

- I think there was a quote that people are most alike in their base instincts and most refined in areas where they are unique. some of the most common queries are related to celebrity gossip & such. There are also flaws in human nature where inferior experiences win based on those flaws. For example, try to buy flowers online and see how many layers of junk fees are pushed on top of the advertised upfront low price. shipping, handling, care, weekend delivery, holiday delivery, etc etc etc

- some efforts to filter trash based on folding in end user data may promote low quality stuff that people believe in. a neutral & objective political report is less appealing than one which confirms a person's political biases. and in many areas people are less likely to share or consider paying for something neutral versus something slanted toward their worldview.

- as the barrier to entry on the web has increased, some of the companies that grew confident they had a dominant position in a market may have decided to buy out other smaller players in the vertical & then degrade the user experience as real competition faded. There was a Facebook exec email mentioning they were buying Instagram to eliminate a competitor. Facebook's ad load is now much higher than it was when they were smaller. But the same sort of behavior is true in other verticals too. Expedia & Booking own most of the top travel portals.

There has also been a ton of collateral damage in filtering all the trash. So many quirky niche blogs & tiny ecommerce businesses were essentially scrubbed from the web between Panda, Penguin & other related algo updates.


> does anyone else feel that google results are useless a lot of the time?

Google doesn't make money from you finding what you're looking for. Google makes money from you searching for what you're looking for.


It has gotten better over the years in some ways, even if it feels like it also got worse. I recall pages of "ads and useful-looking search result keywords" being more common in the past.


w3schools still outranks MDN a lot.


You're not alone. From my perspective, the value of google search results has been dropping for years. And the quality of their search results seems to be dropping in a way I suspect is profitable for google. Most of the results I get back from google these days are trying to sell me something I have no interest in buying.

For example, suppose I do a google image search for "pear", because I want images of pears obviously. The first result is indeed a pear, good job google! Except the first search result just happens to come from Amazon, and also happens to be a pretty shitty thumbnail quality photograph (355x336). It's a pear alright, but why is this particular image of a pear first? Google didn't try to give me the best image of a pear, they tried to give me the pear image they thought most likely to induce a financial transaction. Or alternatively, google let itself get cheaply manipulated by Amazon's SEO. Neither is a good look.

A much better pear image, 3758x3336 from Wikipedia, is further down the search results. So it's not like Google was unable to find good pictures of pears. And a non-image search for "pear" returns the Wikipedia page first, so it's not like Google failed to notice the relevancy of the Wikipedia article about pears. Yet the shitty Amazon thumbnail of a pear shows up higher in the image search results than a high-resolution photograph of a pear from Wikipedia.


I would assess Google's (& FB's) "crown jewel" as, ultimately, their market share, which is related to your points... and causation runs both ways.

The user data helps/ed Google create the superior UX, as you say. The reach is what makes Google & FB valuable to advertisers. A search engine with 0.1% of Google's user volume cannot charge advertisers enough to earn 0.1% of Google's ad revenue. Returns to scale/reach/market-share are very substantial in online advertising.

I'm glad we're talking though. Those tech giants are too powerful.

Ultimately, the old antitrust toolkit is near useless today for dealing with tech monopolies. It's not obvious what "break up Google" even means. There are strong network effects and other returns to scale. It's a zero-marginal-cost business, which was rare enough in the past that economists largely ignored it.

We need fresh thinking, a new vocabulary, new tools, but we do need to deal with it.


'Break Google up' would mean you'd have:

* an Office suite / enterprise company (Google Cloud + Docs + Gmail + Business)

* a phone company (Android)

* a search company (Google Search + Advertisement)

* and a media company (Google Play Movies, Music, Books and YouTube)

The names would probably become different in time, but you get the gist.

Amazon and Microsoft could be broken up much the same way, in neat categorical 'silos'. Facebook should be trisected into Facebook, WhatsApp and Instagram again. I have no idea how you would break Apple up without utterly destroying their core principle, vertical integration. There is no way to do what Apple does with MacBooks or iPhones if they don't control the entire stack. I'm not saying they shouldn't be, I just see no way.


So... I think there are two issues with this.

(1) This doesn't actually reduce market share, since each of these are basically different market categories.

(2) Almost all the revenue is from search. That company is the revenue generating arm for the other ones.


(2) is one of the most important points. We have to stop Google from cross-financing new products from other revenue streams so they can no longer undercut or buy all competitors. Google Maps is a good example: they ran it super cheap for a long time to drive out competitors and have now jacked up the prices.

In contrast to most people here, I think breaking up Amazon is far more important than breaking up Facebook, Microsoft, Apple and many other tech companies. Only Google is as bad.


But you have to acknowledge that without the cross-financing those "markets" wouldn't even exist.

Before Google Maps we had a few online map services and they were terrible. Google Maps redefined what it means to have free access to web-based interactive global maps; it changed how people find things, and it was all paid for by the ad business. Later on some monetizing efforts were made for it and competitors started to appear, mostly trying to catch up and copy what Google Maps did, but without the huge cash infusion of the ad business none of this would have happened.

A decade later, people take these things for granted and just want to split services up. I guess it makes sense from their point of view, but to me it's not clear what should happen while still allowing for the type of creativity and speed of development that let things like Google Maps appear. I'm afraid "the next big thing" that could redefine our lives (and improve them) would be slowed down or simply made non-feasible.


> "Before Google Maps we had a few online map services and they were terrible. Google Maps redefined what it means to have free access to web based interactive global maps"

This is not true. MapQuest revolutionized things almost 10 years earlier than Google Maps. Google search is what allowed Google Maps to overtake MapQuest. Also, Android providing real-time traffic data of all their users gave them the winning formula.


Free scrolling was pretty revolutionary.

As was, like, an app that could reroute live instead of relying on pre-printed paper instructions.

Gmaps had both before MapQuest.


You are right that traffic was revolutionary, and that's why Google Maps became the de facto standard. However, in the context of the original post, this is exactly why it's unfair. Google has Android, which gives them user location data, which they then use as a competitive advantage in another space to eliminate all competition.

If Android were one business and Google Maps another, then companies like MapQuest could also negotiate deals with Android to get user data, and then it's a matter of who has the best platform. That's what is best for the consumer as well. In the current structure, there is no way that a small business like MapQuest could build a smartphone to obtain user data, nor should they have to. They should only have to build the best map application to succeed in the online mapping space. Having to also succeed in location-data aggregation eliminates competition. It's designed so the giants can eat the small guys at will without them being able to fight back.


Worth mentioning a couple factors related to this. You couldn't turn location data on for any service external to Google without also having it turned on for Google & even when you had location services turned off for Google sometimes they still had it turned on anyhow.


I'm not talking about using traffic data. Simply rerouting if you, for example, miss a turn, which instructions on paper can't do.


If you stopped cross-financing YouTube then it would stop existing. YT has never made a profit, and hosting user-generated content in the YT style is impossible to do profitably.


Google takes 45% revshare on YouTube. Some videos that were demonetized still show ads, so on those Google is taking 100%.

I've seen mid-roll ads on songs on YouTube.

Hosting costs & delivery costs (per byte) drop every year. Every year their compression gets better. Every year their ad revenues goes up. YouTube ad revenues have been growing at something like 30% a year for many years.

I think one reason Google doesn't break out YouTube profitability is because as soon as they show they are profitable they end up getting some of their biggest partners (like music labels) using those profits to readjust revshares.

Also if Google claims YouTube is not profitable they can be painted as the victim for extremist content or hate content they host, whereas if they show they were making a couple billion year a year in profits these narratives would be significantly less effective.

Core search ad prices haven't really been falling, yet Google's blended click prices keep falling about 20% a year while ad click volume is growing about 60% a year. This is driven primarily by increasing watch time on YouTube and increasing ad load on YouTube. Google blends video ad views in with their "clicks" count.


Yes, my thought was that by breaking everything off from everything else, these silo'd services would suddenly have to compete with the rest of their market on fair terms, instead of being propped up massively by other divisions, and thus would lose marketshare to a multitude of fresh and established competitors.

You are right though, it doesn't deal with the dominance of the search directly. My hope is a complementary effect to the above also happens: Google no longer gets gobs of personal data from its other services, allowing other search engines to approach its efficacy.

As is clear I'm not really a fan of direct intervention in a single market, I see it as more of a problem when these giants muscle their way and control more and more markets, creating a vicious feedback loop.


> Yes, my thought was that by breaking everything off from everything else, these silo'd services would suddenly have to compete with the rest of their market at fair terms

I think it's instructive to look at the rest of the market. How is Mozilla funded? Basically a single gigantic contract with Google. Even Apple accepts payment from Google to become the default, and it's not cheap: https://fortune.com/2018/09/29/google-apple-safari-search-en... The same logic applies to pretty much anything Alphabet spins off - there's little difference between ownership and those contracts.

About the only competition this setup produces is the ability for Mozilla to walk away to a competitor bid, which they did for like a year before bailing out at the first opportunity. There's a huge incumbency bias in these contracts. The first parallel that comes to mind is employer provided health insurance. Everyone gets to bid, but the incumbent knows the claims history far better than the competition and we'd only expect them to lose bids to companies overly optimistic about that history. Google knows how valuable various traffic sources are, but their competitors have to guess, and only when their guess is higher than Google's does it pay off. Does anyone think Yahoo winning Firefox was a good deal? I haven't seen any analysis to support that.

> My hope is a complimentary effect to the above also happens: Google no longer gets gobs of personal data from its other services, allowing other search engines to approach its efficacy.

Wouldn't the most profitable thing for these broken-up companies be to sell their slice of the personal-data pie to as many parties as possible? This seems like a net loss for privacy. How much extra would it be worth to set up an exclusive arrangement?


>You are right though, it doesn't deal with the dominance of the search directly. My hope is a complementary effect to the above also happens: Google no longer gets gobs of personal data from its other services, allowing other search engines to approach its efficacy.

I'm still not sure how this would work on Apple though, since their main differentiator is their design sensibilities and integration rather than their platform monopolies.

I guess iMessage and the App Store do rely on monopoly rents, but I can't think of any way to sever those links without making the iOS platform less secure.


I'm not sure how much an impact breaking Google up would have, and I say this as someone who has built a product that competes with Google's G-Suite. I want there to be a more level playing field, sure. But each of these siloed businesses would still be a monopoly in its own right.


For Google, you missed the part that makes most money.


Most of those products don't make money by themselves, they exist to keep people in the ecosystem, providing more data for the real moneymaker.

The biggest blow to Google wouldn't be to break it up into lots of small companies, you just need to separate the advertising business from everything else and you've effectively neutered the monopoly. Google's genius isn't in hiring the best engineers to providing a ton of services, it's in convincing people that they're not an advertising company, and that is where Facebook has been falling out of favor recently (I'm guessing that's why they bought Instagram, and why Google bought YouTube).


“... provide more data for the real moneymaker.”

This is a supposition - while perhaps it seems to makes sense, seems true, “must be true”, it doesn’t mean it is true!

Unless you worked on search quality at google you really aren’t in a position to know if, say, google cloud, or android provides useful signals to search (outside of the signals they’d collect anyways if they were different companies).

One thing people are obscuring is just how crazily effective AdWords is. It works for the advertisers, and it earns Google 70+% of its revenue - confirmed via SEC filings, which do break that out. Go play with creating an AdWords campaign and try to infer just how much data Google really needs to deliver those ads - it’s less than you’d think.

In short: this overall move is more wishful thinking than solidly reasoned. Survey the field of streaming video: given the amount of studio-driven consolidation, are there really tons of competitors being held down who will spring up? I am skeptical.


That's an interesting thought. I agree with you that most of those products are loss leaders for data mining and thus advertisement.

But my thinking was that if you simply cut off advertising all the products still have massive marketshares and could lean on each other, as long as some succeed. Not to mention investors probably willing to prop up such a massive aggregate marketshare (one only has to look at Uber).

If you 'silo' them, success of one division of previously-Google won't lead to all of them dominating.


Thanks, I completely glossed over that since to me their advertisement division is inextricably linked to their search division. Added!


Almost all of those businesses tie back into their main business which is advertising.

Who is going to sell those ‘neat silos’ cheap advertising to survive with no other business model? Google?


Search advertising is different from web display advertising is different from streaming video advertising.

The cloud services (Enterprise apps, hosting, etc) don't need it.

I think we'd probably see a worse Gmail (worse ads or aggressive upsells).


I'd rather cleave them all vertically anyway, rather than be left with a bunch of mini horizontal monopolies.

Granted most of your examples wouldn't be, except for search, but it still seems more interesting to me to just have a bunch of mini googles made from cleaving teams. Certainly that would make for some crazier competition.


Breaking up companies like Google, Amazon, and Microsoft is just not gonna happen in 2019, when huge, global mega-corps are the only way to compete outside of small local markets.

Even though a lot of these corporations build offices, hire non-Americans, and pay tons of foreign taxes in countries in which they do business, the main executives and talent still live in the US, the IP is developed here, and the majority of profits end up back in the home country.

It's better for everyone who actually matters - shareholders, intel agencies, government officials, associated businesses, etc - that these companies remain large and globally dominant, even if it screws over US citizens by having to pay the monopoly taxes and suffer the privacy invasions. We're an insignificant sacrifice in the decision-makers' minds.


Apple's already on its descent, and at most you'd break off their cloud services, which would immediately die without the support line from the hardware.


> it's roughly feasible

What do folks even mean by "Google's index"?? Google results combine tons of signals, including personal histories for each user. Sharing metadata for the top billion URLs wouldn't cover half the functionality, or make a competitive engine. And on the other hand, there may not be a single other organization in the world prepared to manage a replica of the entire data plane that impacts search. The proposal is somewhere between underspecified and nonsense.


Thanks, this is mainly what I came here to say. And I just don't see even the vaguely defined "index" as the crown jewel. If anything, it's "relevant results", which is something quite different.


> Bing had the money and persistence to make that investment, but how many others will?

I once hypothesized to an ex-Microsoft higher-up that it probably took $10B to launch Bing. He said I was almost exactly on the nose.

Also this is a ridiculous thing to ask for. How much money do you think Google pays for the bandwidth to crawl the web? How much do you think it costs to run the machines that create indexes out of that? How do you value the IP involved in the process?

Google should give away the fruits of that labor for free, plus invest in a reasonable API to download that index? Plus the bandwidth of sharing that index with third parties? It’s probably not even feasible aside from putting disks or tapes on multiple semis to send to clients. The index is 100 petabytes according to [0]. With dual fiber lines, and no latency for mind bending numbers of API calls, that would take 12.6 YEARS to download a single snapshot.

[0] https://www.google.com/search/howsearchworks/crawling-indexi...
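The back-of-the-envelope arithmetic behind that figure (assuming two fully saturated 1 Gbps lines, which appears to be what the parent means by "dual fiber lines"):

```python
# Time to transfer a 100 PB index snapshot over dual 1 Gbps fiber lines.
index_bytes = 100e15              # 100 petabytes, per Google's own page
link_bits_per_s = 2 * 1e9         # two 1 Gbps lines, fully saturated

seconds = index_bytes * 8 / link_bits_per_s
years = seconds / (365.25 * 24 * 3600)
print(round(years, 1))            # ~12.7 years for a single snapshot
```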


[flagged]


The hacker news guidelines specifically advise against this kind of comment.

'Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."'

'Be kind. Don't be snarky. Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.'

https://news.ycombinator.com/newsguidelines.html


> Indexing the top billion pages or so won't take as long as people think.

This is what makes me wonder why we don't have a LOT of competing search engines. Perhaps I'm vastly underestimating the technology and difficulty (I could well be - it's not my domain), but surely it can't be THAT hard to spawn Google-like weighted crawl-based search results?

It's a long-since solved problem - heck, PageRank's first iteration recently came out of patent protection - it could just be copy-pasted. Why aren't all the big companies Doing Search?
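The original algorithm really is short enough to copy-paste; a minimal power-iteration sketch (damping factor 0.85, toy three-page graph invented for illustration):

```python
# Minimal PageRank by power iteration over an adjacency list.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = ranks[page] / len(outlinks)
                for target in outlinks:
                    new[target] += damping * share
            else:  # dangling page: spread its rank evenly everywhere
                for p in pages:
                    new[p] += damping * ranks[page] / n
        ranks = new
    return ranks

# Toy graph: A and B link to each other, C links only to A.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
```

A ends up ranked above C, since A receives links while C receives none - which is the whole idea, and also exactly why link farms became the first generation of SEO spam.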


SEO spam, and poor quality content I would guess. Google has bolted on a ton of ML over the last ten years to fight it.


And yet most Google results that don't point at one of a handful of major sites are SEO spam :-/

The spammers won. Google gave up and settled for "we like the right kind of spam—the kind that took a little effort, and makes us money".


I did a search earlier today on Google for "north face glacier" - turns out that the company North Face has a Glacier product so as far as I can tell that's all the search results contain.

Searching for "north face glaciation" did help as the first page of search results did have one entry on the topic I was actually searching on!

Maybe they should have a "I'm not buying anything" flag!


This has been the problem with results for the past few years. E-commerce gets priority in all things and you have to wade through pages of useless links if you want actual content about what you are searching for.


Big brands have the ad budget to advertise. That drives awareness. If they have offline stores, those can be thought of as both destinations AND interactive billboards which drive further brand awareness and demand for branded searches.

Many of the top search queries are navigational searches for brands.

And so if tons of people are searching for your brand then if there is a potentially related query that contains the brand term & some other stuff then they'll likely return at least a result or two from the core brand just in case it was what you were looking for.


It's not just ML, but the people that provide the labeling for the ML.

Google pays some large number of people to do search and grade the various results they get to see if the answers are good, which then helps feed back ML.

Heck, according to this article[0], google has been paying people to evaluate their search results since 2004.

[0] https://searchengineland.com/interview-google-search-quality...


It doesn't feed back into the ML directly, according to Google. Instead they use it to evaluate changes to search algorithms. If they get an increase in thumbs up back from the Quality Raters then their changes were positive. If not, they figure out why.


The original 2012 FTC investigation of Google anti-trust activity showed how they might have abused this process. Interesting read, no matter which side you take: http://graphics.wsj.com/google-ftc-report/


I feel for certain topics, especially anything to do with tutorials or coding, even Google falls foul to SEO content. Just Google ‘android custom ROM <phone model>’ for instance. There’s stock pages for all of them, identical save for the phone model, and clearly not applicable.


PageRank was an innovation at the time, but modern search engines require training models on lots of query logs to get good performance. It's expensive to make a really good search engine.


It is because people just stick with their best usually instead of using a variety of search engines. It becomes rather winner takes all.

Google for general search. DuckDuckGo for general search if you want something a bit more private but aren't extreme enough to run your own spiders. Bing mostly for porn search - not being snarky, some people do consider it to have better results.


And searx.me if you want to be even more private, and you can run that yourself if you so choose.


Querying an index isn't a solved problem, building it is.

It's easy to gather the necessary data, but it's hard to know which parts of that data are the most relevant for finding good content and avoiding bad content. Is it more relevant if keywords show up in links or titles than in the body of the text? If so, SEO spam sites will include a bunch of keywords in links and titles. Is it more relevant if keywords show up in the first 200 visible words of the page? If so, spam pages will make tons of pages with relevant keywords at the top.
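The trade-off is visible in even a toy ranking function. A hypothetical field-weighted scorer (the weights and documents are invented for illustration) shows exactly which fields a spammer would stuff:

```python
# Toy field-weighted relevance score: whatever fields carry the highest
# weights are exactly the fields spammers will stuff with keywords.
FIELD_WEIGHTS = {"title": 3.0, "anchor_text": 2.0, "body": 1.0}

def score(query_terms, document):
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = document.get(field, "").lower().split()
        total += weight * sum(words.count(t) for t in query_terms)
    return total

honest = {"title": "glacier travel basics",
          "body": "a long practical guide to crossing a glacier safely"}
spam = {"title": "glacier glacier glacier best glacier deals",
        "anchor_text": "glacier glacier", "body": "buy now"}

# The keyword-stuffed page outscores the genuinely useful one.
print(score(["glacier"], spam) > score(["glacier"], honest))  # True
```

Any fixed, guessable weighting gets gamed this way, which is why the weights have to keep moving.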

The hard part about building a search engine isn't indexing the internet, it's adapting to spam. Spammers are continually adapting to changes in the algorithm, so the algorithm needs to adapt as well. And the more popular your search engine is, the more money you make and the more able you are to adapt to spam (and the more spammers focus on your engine).

So, the problem isn't that Google has a better index (though I'm sure it does), the problem is that nobody else has the will to spend the money necessary to tune the search algorithm to stay on top of spammers. When Google started, companies didn't care as much about improving their index and instead focused on building their other content (Yahoo, MSN, etc). Google saw the value of search and got a lead on everyone else in terms of curating results, and now they have the momentum to stay in front and have shifted to building content to improve monetization. Nobody else has the monetization network for search that Google has, so they'll continue having the problem that other companies had (Microsoft wants to point you to their other services, DuckDuckGo is limited by their commitment to privacy, etc).

In short, Google wins because:

- it was better when it mattered

- it makes money directly from search

- its other services improve its ability to understand what users want, which improves search quality and ad relevance

You can't make a better algorithm by being clever, you make a better algorithm by having better data, and that's hard to come by these days. The only way I can think of a competitor stepping in is if they target an underserved demographic and focus data collection and monetization there, and DuckDuckGo is close by targeting privacy conscious power users.


> The only way I can think of a competitor stepping in is if they target an underserved demographic and focus data collection and monetization there, and DuckDuckGo is close by targeting privacy conscious power users.

The irony there is that DuckDuckGo can't collect much of that data precisely because of their privacy focus.


> The hard part about building a search engine isn't indexing the internet, it's adapting to spam. Spammers are continually adapting to changes in the algorithm, so the algorithm needs to adapt as well.

Adaptive crawlers?


> Querying an index isn't a solved problem, building it is...

You didn't just hit the nail on the head; you drove it all the way in with a single blow. Bravo.


Most likely answer: lack of diversity in revenue models.

Outside of ad revenue, search has always been seen as something of a "charity" effort for the internet. It's "boring" infrastructure work that can be critically useful but doesn't really make money directly on its own. No one wants to pay a "search toll" and there's no government agency in the world that the internet would trust as a neutral index to run it as actual tax-basis infrastructure.


Which begs the question, if adblock makes advertising based models go the way of the dodo, what happens to search?


"indexing" is only part of the problem, it's a batch job. I find being able to respond to searches across a huge data set in the order of milliseconds (while having planet scale fail over) be a lot more challenging to implement.
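A minimal inverted index sketches the shape of the serving problem, even though the genuinely hard part is doing this sharded and replicated across datacenters with tail-latency guarantees (toy documents invented for illustration):

```python
from collections import defaultdict

# Minimal in-memory inverted index: the easy, single-machine version of
# what has to run sharded and replicated at planet scale.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND query: intersect posting lists, smallest list first for speed.
    postings = sorted((index.get(t, set()) for t in query.lower().split()),
                      key=len)
    result = set(postings[0]) if postings else set()
    for p in postings[1:]:
        result &= p
    return result

docs = {1: "open web index", 2: "public search index", 3: "web search engine"}
index = build_index(docs)
print(search(index, "web search"))  # {3}
```

Building this is a batch job; answering millions of such intersections per second, over billions of documents, with ranking on top, is the part that takes a decade of infrastructure work.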


It's not the 'raw' search itself. It's the billions (trillions) of queries they've captured: Person X searches for query Y and clicks on result Z.

This is far more valuable than the general page rank algorithms that were initially developed and have already been duplicated many times in academia and business.
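A toy illustration of why those logs matter: even a crude click-count boost (data invented for illustration) reorders results in a way a log-less competitor can't replicate:

```python
from collections import Counter

# Hypothetical click log: (query, clicked_result) pairs accumulated over time.
click_log = [("python", "python.org"),
             ("python", "python.org"),
             ("python", "wikipedia.org/wiki/Python"),
             ("python", "python.org")]

def rerank(query, candidates, log):
    clicks = Counter(result for q, result in log if q == query)
    # Sort by observed clicks, descending; unclicked results keep base order.
    return sorted(candidates, key=lambda r: -clicks[r])

results = rerank("python", ["wikipedia.org/wiki/Python", "python.org"],
                 click_log)
print(results[0])  # python.org
```

Real click models are far subtler (position bias, dwell time, abandonment), but the asymmetry is the same: whoever has the log can learn what users actually wanted.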


It's so weird: about 1/3 of the time on DuckDuckGo, I add a !g in frustration. Half the time I still get nothing and end up posting on Stack Overflow, but half the time I get a little more useful information.

Google custom tailors results for each and every machine. Even if you're not signed in, Google uses your browser fingerprint, the OS it's reporting and location/IP data to custom fit results. There is no "stock" google result.

This is something DuckDuckGo et. al. can't do if they want to focus on a privacy model. DDG does offer location specific searches, which can be helpful.


Aside from the quality issues that others have already mentioned, I think that simply gaining traction for a new search engine is incredibly difficult - people typically use whatever is the default in their browser, or/and Google/Baidu/Yandex (which are surely the best known in their respective regions).

Consider DuckDuckGo, which sells itself on privacy, but after more than a decade has only 0.18% market share. Without the power to make it the default in an OS or browser, you'd have to have a really strong value proposition to convince people to switch.


I don't think this is correct. For years, the #3 search query on Bing in the US was "Google", and globally it used to be a double-digit percentage of all Bing queries. That suggests to me that people with a default Bing search engine had learned in droves to click their way to the preferred engine regardless of what the default was, and did so without being technically skilled enough to change the default once and for all. I don't know how large a group the latter is, but it seems hard to argue that the two together are small.


> Why aren't all the big companies Doing Search?

They are.


> 1) a record of searches and user clicks for the past 20 years

If a government was serious about getting more players in the search industry, they would force Google (and all other players) to make this data public.

Simply say "All user-behaviour data used to improve the service must be freely published".

Make the law apply to any web service with more than 20 million users globally so small businesses aren't burdened.

If the data cannot be published for privacy reasons, the private parts must be separated and not used by Google or its competitors.


Imagine the amount of bureaucratic burden these proposals would impose (even for small businesses, because it is not obvious how to count users, etc.).

> the private parts must be seperated

This means literally making legal interpretation of all documents on the net, to determine whether each of them is private or not.


> If the data cannot be published for privacy reasons, the private parts must be separated and not used by Google or its competitors.

As a user that notices the impact of this data: please no, thanks though.

Have you ever visited youtube's home page in incognito mode? It's... bad. Really bad. Not allowing any company to use this (obviously very private) information in ranking would simply make their products suck, horribly, compared to today.


>Have you ever visited youtube's home page in incognito mode?

Do you like the personalized recommendations because of channel subscriptions?

I always get the "anonymous default" home page with YouTube and don't care. The home page is just a wasted load before I can start typing in the search bar. As a bonus, staying incognito means all the videos on the right-side panel are related to the current video. Not related to a music video I have playing in another tab.


Pretty much, and the potential for criminal activity is astronomical if you give them access to an open index. Things like every website on the web hit with the same zero day on the same day for maximum profit. Build your own best kiddie pron site evah! with direct access to the index and your own ranking system. What, your admin pushed a config that left the admin pages open? Go time!

As someone who was operationally responsible for a search index (formerly VP Ops at Blekko), the kinds of things crooks tried to do were pretty instructive about how they use search to advance their efforts.


>20 years of experience fighting SEO spam

I think we've reached an equilibrium state on this that has significantly degraded the educational quality of search engine results.

The total garbage SEO spam we used to get is gone, which is nice, but what it's been replaced with is technically relevant but mostly manipulative advertising. Product searches will basically give you a bunch of no-name blogs who are almost definitely paid off by one vendor or another.

Even actual inquiries are inundated with search results that do answer the question, but do so in an extremely cursory and incomplete way. Or, in the case of recipes, Google seems to prioritize results that give you long, meandering narratives before they actually talk about their recipes. It has some very weird ideas about what people actually want when they search.

One of the most annoying things is how impossible it is to actually find the website of a local business, especially a restaurant, by Googling. Your hits are always Google's own cobbled-together dossier on the restaurant first, then some combination of Yelp, Grubhub, Postmates, AllMenus, etc. pages. If the restaurant has a website you can't tell, and it's probably way at the bottom or on a second page of results.

In the past it was a handful of very decent results amidst a sea of total garbage SEO spam. Now it's a sea of mediocre content-farm stuff, and it ranges from difficult to impossible to actually dig into detail on things anymore. The old spam we could at least dismiss as crap within a fraction of a second of seeing it. The new spam you have to actually read before you realize it doesn't have what you're looking for.


Via API access you'd be effectively getting access to the index _plus_ the derivative search quality improvements _based on_ user data, even if you're not getting user data itself. That would certainly open the door to competition, especially on a niche basis e.g. you want to build a platform dedicated to drones - you can combine drone reviews and news with videos plus e-commerce results. The result could be awesome in sparking all kinds of small business building on Google's API.

> 2) 20 years of experience fighting SEO spam.

That's probably a key issue here though. Providing an API potentially makes it easier for spammers to identify ways to boost their content in a well automated manner.


> That's probably a key issue here though. Providing an API potentially makes it easier for spammers to identify ways to boost their content in a well automated manner.

How so? Unless you give reasoning for the scores, or provide live updates etc, just putting an API on search wouldn't change much - you can APIfy search now, there are multiple services offering it as a service. Granted, at some point it's getting expensive, but for SEO research, you're probably not running a million queries.


Totally agree. Google's golden egg is not the index but the datasets of searches done by users (together with location data from Android and Maps, and speech data from Assistant).

As far as I remember, Google is actually shrinking its index in terms of number of indexed websites, because 90% of the internet is irrelevant to the majority of searches. Basically "quality over quantity", if you can say that.


> Basically "quality over quantity" if you can say that.

This is even more depressing. Google was such a wonderful tool for us nerds because we could finally find those usenet posts, personal blogs, tech mail lists, etc. of all the esoteric subjects that had been hard to find previously. Before Google, you'd use lists of curated links (e.g. Yahoo) for a given topic that had been traded back and forth between various sites and other interested netizens.

It's apparent that Google is becoming worse and worse for these types of searches, while it concentrates on more popular queries like "When is the next <my show> on" or "What is the current sports-ball score" or "How big are Kim Kardashian's boobs".

Just like Craig of Craigslist recently came out with an article saying the internet has actually made the news media worse, not better for informing citizens - something he did not predict correctly - it's apparent that Google is pushing us in the same negative direction in the ability to find quality information on non-consumer knowledge.

* https://www.theguardian.com/technology/2019/jul/14/craigslis...


> most queries aren't in the long tail

But that's where differentiation occurs. Every search engine will get short tail results correct. We go back to Google because it also performs with the weird queries.

I agree that algorithmic superiority will probably perpetuate Google's dominance. But making its index public is (a) legally precedented, (b) conceptually simple and (c) a small step in the right direction.


Gotta say my experience is very varying with long-tail type queries, I usually try DuckDuckGo and if that fails I search Google. They find very different things, DDG tends to be less filtered in terms of spam sites and fake news, but it also finds results of dubious copyright nature, for example.


I've had the same experience with DDG, which I use as my primary search engine. If I'm looking for a specific e.g. scientific paper or a recent news article, it doesn't have it. I run the search through Google. That's purely an indexing problem.

On the other hand, if I have a health-related search, I run it through Google. DDG has the proper content. It's just that it prioritizes the blog spam. That's an algorithm problem.

Relieving the former, as the author's proposal would do, makes DDG more competitive. As a second-order effect, it would also let DDG prioritize resources towards the second problem, making them more competitive still.


From my experience, for long-tail queries DDG also returns a lot more NSFW results than Google.

Bing does have the reputation of being better for NSFW searches than Google, so I guess that it's normal to have more NSFW false positives as well.


Indexing them is not hard; ranking them to yield a useful first page is.


I'd wager any startup that tries to crawl a few sites like Amazon, Yelp, Linkedin, etc will be blocked. Google, however gets a pass because they're Google. So yes, I believe their huge index, and ability to crawl any site at will is a huge, huge advantage for them.


I built a search engine that was able to crawl Amazon and Yelp. The toughest sites were reddit and facebook.


At scale? Millions of pages a week? And now? I wrote a crawler that could crawl Amazon as recently as a year ago too, but now it doesn't work.


And google sucks at those too.


Amazon lets anyone crawl them, Yelp has a whitelist and no you can't get on it, Linkedin has a whitelist and no you can't get on it, Facebook has a whitelist and no you can't get on it.


The long tail is important, even if it's a small percentage of searches (which it isn't anyway).

Same reason people won't buy electric cars with 100 mile ranges, even if they very rarely travel more than 100 miles.


Storage and bandwidth are cheaper than ever before, people scrape a billion pages for much more mundane purposes these days, even for academic papers.

Having a full text index on that is more involved but hardly impossible. You're completely right that it's not at all Google's secret sauce. Bing has clearly indexed much more than that, plus invested a ton in actually returning good results from their index. And still nearly nobody cares. It's just not easy to make a better Google, and the people most likely to figure out how to do that already work there.


The Common Crawl corpus is already available and stored on S3 - so analyzing billions of web pages is literally already available with an AWS account and a simple map reduce job.
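A sketch of the kind of job that becomes trivial once the corpus sits on S3. This toy version runs the map and reduce steps locally over a few fake WET-style records (the record tuples here are a simplification invented for illustration; real WET files carry full WARC headers):

```python
from collections import Counter
from urllib.parse import urlparse

# Fake WET-style records: (target URI, extracted plain text).
records = [
    ("https://example.com/a", "open search index public data"),
    ("https://example.com/b", "search engines and crawling"),
    ("https://news.example.org/1", "search antitrust debate"),
]

# Map: emit (domain, 1) per record; Reduce: sum counts per domain.
mapped = ((urlparse(uri).netloc, 1) for uri, _text in records)
domain_counts = Counter()
for domain, n in mapped:
    domain_counts[domain] += n

print(domain_counts["example.com"])  # 2
```

The same map/reduce shape, pointed at the real multi-petabyte corpus on AWS, is the "$800 Hadoop job" described upthread.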

I'd actually advocate for making public an anonymized list of actual search queries.

Domain specific search engines could evolved based on the demand of what has already been searched for.


Anonymizing search queries is extremely hard, if not impossible. See https://en.wikipedia.org/wiki/AOL_search_data_leak for example.
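The AOL leak is the canonical example: replacing usernames with opaque numeric IDs still leaves every user's queries linkable to each other, and the queries themselves identify the person. A toy illustration, using the real documented case of AOL user 4417749:

```python
# "Anonymized" log: usernames replaced by numeric IDs, query text intact.
log = [
    (4417749, "landscapers in lilburn ga"),
    (4417749, "homes sold in shadow lake subdivision gwinnett county"),
    (4417749, "numb fingers"),
    (1234567, "weather boston"),
]

# Linking all queries that share one pseudonym rebuilds a profile; this is
# how reporters identified user 4417749 as a real Georgia resident in 2006.
profile = [query for uid, query in log if uid == 4417749]
print(len(profile))  # 3
```

Pseudonymization is not anonymization: the identifying signal is in the data itself, not just the name column.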


> It's just not easy to make a better Google

It depends which sense of "better" you mean. It's nearly trivial to make an ethically superior search engine by just not building the spyware bits of Google.

It's difficult to make a search engine that's "better" along the dimensions of speed, profitability, etc.


That exists - it's called DuckDuckGo - and even fewer people care about it than about Bing. For the most part, people don't actually care about Google collecting their entire search history and combining it with all the other data it holds on them. We may live to regret that in a hypothetical future where the government turns more authoritarian and requisitions that data for evil.


I made three statements. They're all true as far as I can see. Would the downvoters care to speak up?


Devil's advocate:

Some argue (not necessarily me) that Google isn't necessarily purely optimizing for quality using that 20 year click-and-search log, that they're accepting some inefficiency by biasing for political (left-leaning) gain or "censorship by obscurity". If competitors could more easily build alternatives, which, say, didn't have those biases, then arguably that'd put more competitive pressure on Google to not use their monopoly for bad stuff.


More importantly, Google's core competency is PageRank. Sharing the index != sharing PageRank. As time goes on, others will use inferior algorithms, and become worse. This scheme will not accomplish what it intends to do. Also, you can't just force people to give away their property.


It is the crown jewel because people choose Google precisely because they are understood to have the largest index. It's comparable to Verizon marketing 'the largest network,' but with many more benefits accrued to the company who is believed to have the largest search index.


Since the author compares the proposed API to what startpage.com does, I'm guessing he's not talking about "index" as in "raw documents", but basically Search as an API with all the sorting and ranking done.


Well, considering the complaints about Google's search quality going down that I read on HN all the time, I have a theory that highly technical users are adversely affected by the search "improvements", so an improved search engine targeting that group would essentially be one that searches on what you actually typed.

I also happen to think that is the search engine I would prefer. I think I could build that pretty quick if I had the api access.


What Google has is "I'll google that".


In particular, they have the google.com domain. That is literally their most valuable asset.


No, most queries are in the long tail


For #1, I'd prefer that Google didn't share my search history with anyone. That would also go against GDPR in Europe, right?


Caveat: The author is not a technologist.

Robert Epstein (born June 19, 1953) is an American psychologist, professor, author, and journalist. He earned his Ph.D. in psychology at Harvard University in 1981 and was editor in chief of Psychology Today.

He has also made some questionable claims about google manipulating search results to favor Hillary Clinton.

https://en.wikipedia.org/wiki/Robert_Epstein#cite_note-15

His research is based entirely on his own experience:

“It is somewhat difficult to get the Google search bar to suggest negative searches related to Mrs. Clinton or to make any Clinton-related suggestions when one types a negative search term,” writes Dr. Robert Epstein, Senior Research Psychologist at the American Institute for Behavioral Research and Technology.


The comments he made are not just questionable, they're outright wrong (and a great example of the problem with cherry-picking data):

https://www.vox.com/2016/6/10/11903028/hillary-clinton-googl...

https://www.politifact.com/punditfact/statements/2016/jun/23...

(Disclosure: I work at Google, but this opinion is my own)


Google's claim that the algorithm is generic is demonstrably false. Type in "hillary clinton e" and there is no suggestion for "email", type "donald trump e" and email is the first suggestion. Given the news content that we know is out there, that can only be the result of adjusting the results for clinton specifically (if anything, we would not expect "email" to be autocompleted for trump). This is not research that tells us what exactly Google is doing, but you cannot deny the example.


This is not "research" period. Using one arbitrary search comparison to draw conclusions about the nature of a system that processes billions of queries a day is pretty weak. Additionally, I don't get the same results you do. "hillary clinton e" does not bring up emails, nor does "donald trump e" bring up emails (the first results I see are election, education, england visit, ex wife).

I'm not ruling out the possibility that google actually is manipulating search results, but this is not proof of that.


Try "hillary clinton emai". From a fresh chrome session in NYC I get nothing, not a single autocomplete result. On the other hand "donald trump emai" gets:

* donald trump email

* donald trump email address

* donald trump email list

* donald trump email newsletter

* donald trump email list signup

And just to drive the point home I tried "root_axis emai" and got "root_axis email". Try anyone else and you get similar results, 'barack obama emai', 'george bush emai', etc etc. So yes, this is proof that the results are scrubbed for Clinton email.


I got curious and tried the names of a bunch of public figures. Some of "<first and last name> e" yielded "email" as the suggestion. But these did not: elizabeth holmes, tom jones, tom cruise, brad pitt, gwyneth paltrow, roger federer, will smith, jimmy carter.

Since Hillary Clinton is not unique, then it's not proof that her results are treated differently.


Why would you expect “brad Pitt email” to be something that auto completes? You would, on the other hand expect “Hillary Clinton email” to auto complete because there was a huge controversy about it.

I’m not saying google is manipulating auto complete intentionally (though they might be), I’m just saying your counter examples are irrelevant.

It would be like “Donald trump Russia” NOT auto completing then someone saying “but neither does Taylor swift Russia, so we’re good.”


The poster claimed Hillary Clinton was unique, meaning the only person that applied to. For me, she was not. Since she's not unique, then her being unique can't be used as evidence.

Claiming that she's unusual, since you expect it to work for her based on stories written about her is a different claim.


You're misrepresenting my claim.

1. I said "only for Clinton" in the context of Trump vs Clinton. Then I compared to other U.S. politicians, where the example still holds. The intended meaning is perfectly clear.

2. Obviously the fact that Clinton was the topic of a scandal involving email is the assumed context here. That's why I said "if anything, we would not expect "email" to be autocompleted for trump" (implied: but for Clinton, email is a more relevant search term based on published news, etc.).


They all work fine for me, e.g., 'tom jones emai' -> 'tom jones email', only Clinton didn't.


None of those other people were involved in major stories with email in the headline. If you look at the actual results, you'll see that it doesn't make any sense to not get suggestions for "Hillary Clinton email".

Compare these results:

* https://www.google.com/search?hl=en&q=hillary%20clinton%20em...

* https://www.google.com/search?hl=en&q=will%20smith%20email


Seems to go both ways though as I'm not getting any auto completion results for "donald trump stormy daniels" either. I'm guessing they scrub things that are highly sensationalized in the news.


"stormy daniels" doesn't get any autocomplete results for me even without trump, my guess is this is more about adult search terms getting blacklisted rather than political. For example "donald trump e j" gets "donald trump e jean carroll", E Jean Carroll recently accused Trump of a serious sexual assault and this is autocompleting.


When I type "Donald Trump R" (or Ru or Russ etc) no autocomplete results contain the word Russia despite plenty of news coverage. When I type "Donald Trump Epstein" no autocomplete results. Must be a conspiracy by google to protect the president... or drawing conclusions based on individually cherry-picked autocomplete results is like drawing conclusions from numerology.


The same thing happens with Russia for "barack obama rus", "hillary clinton rus", "george bush rus", etc. - none autocomplete for me despite the fact that there were lots of news items relating those politicians and Russia during their careers. However, Clinton is the only one that doesn't appear to autocomplete for email, suggesting that it is specific to her.

When I type "donald trump epstei" I get "theo epstein donald trump" (for some reason the word order is flipped), which would suggest to me that epstein is not blocked in the way you suggest, and it's just that not enough people are searching that term, or that the autocomplete algorithm hasn't caught up to the latest news on Epstein and Trump yet. However, "donald trump e j" does autocomplete to "donald trump e jean carroll", which relates to a serious scandal for the president.

This isn't cherry picked I'm afraid, it really does look like intentional blocking of "hillary clinton email" from autocomplete.


None of those people had a gigantic Russia scandal though, Trump did, so you must still account for this unexplained aberration. If anything, the Trump/Russia scandal had more coverage than the Hillary e-mail scandal, so it's an even more difficult aberration to explain.

Also, if I type "hillary e" I get "Hillary Emails PDF" as an autocomplete suggestion. If I type "clinton e" I get several email suggestions "clinton email PDF" ,"clinton email film", "clinton email FOIA", "clinton email download".

Please include these confounding results in your analysis.


I can believe Google is attempting to de-emphasize scandals on both sides. As an aside, I do get "russia investigation" as autosuggestion for trump.

I think the bigger problem is that these adjustments appear to be done manually. We know that's what they're doing to avoid racist autocomplete phrases (and reasonably so, I think most people would agree). Having such judgments made on specific political topics/events/scandals will inevitably result in political bias, especially in an organization whose workforce is so politically skewed to one side.


My only point is that autocomplete results for arbitrary one-off queries aren't instructive, especially because many people see different autocomplete suggestions. As I've already pointed out, it's possible to find strange results for anything if you look hard enough.


If you type "Donald Trump Ru" into an incognito search you will not get the Russian Investigation. To GP's point, it's numerology to keep focusing on this and drawing conclusions of political bias.


Actually, both Obama and Clinton had major stories relating to Russia, including the failed 'reset', the annexation of Crimea during the Obama administration, Obama's "red line" on the use of chemical weapons in Syria which was circumvented by Russia, and Obama telling Mitt Romney that Russia was not a threat in the 2012 presidential debates. So yes, they have huge stories relating to Russia. And your autocomplete results just bolster the evidence that 'hillary clinton email' has been made a special case and is not organic!


Those stories barely saw a fraction of media coverage compared to Trump/Russia. The Obama/redline thing was primarily reported as a story about Syria, not Russia. If you type "obama red line" you get plenty of Syria suggestions which makes sense.

This is the problem with search query anecdotes, it ultimately produces a subjective and pointless debate about how one should interpret search suggestions for arbitrarily selected one-off queries. There is no methodology here, and we don't even know how widespread any particular suggestion results are. Any person with an agenda will be able to cherry-pick search queries that confirm their narrative.


For me "donald trump ru" autosuggests "russia investigation" (4th place). Compared with absence of "email" even for "hillary clinton emai".


Try news.google.com. It works there.


That's the point: this is not research, but whatever is going on at Google, the explanation has to account for examples like these. It's simply one observation that you cannot discount.

I just tried searching again a few times with new private windows, and "email" alternates between first and fourth suggestion for trump. But the more important point is the absence of the suggestion for clinton: we know it's been in the news extensively, we know people searched for this phrase a lot, and now "email" has been removed from the suggestions only for Clinton. I tried searching a few more U.S. politicians, and for all of them "e" autosuggests "email" somewhere between first and fourth place. So the complete absence for Clinton does not look like a generic algorithm change.


Right, and further observations show that it's not unique to Hillary Clinton. This means you can discount the claim since it uses cherry picked data.


But we still haven't answered the actual question: Why doesn't "email" come up in autocomplete for Hillary Clinton, when it clearly should?


The question doesn't need answering any more than any other arbitrarily selected individual query needs an answer. Why does "trump helsinki" have zero suggestions? Why does "bill oreilly sex" have zero suggestions? Why does "alex jones sandyhook" have zero suggestions? I used right-wing examples because I presume any celebrity or left-wing examples will be considered evidence in favor of your position, but there are plenty of examples all over the place.


Yes we have. Further observation shows Google removes autocomplete for controversial items, like "Russian Investigation" for Donald Trump. If that example doesn't answer your question then you have confirmation bias.


When I type "Donald Trump R" I don't see any autocomplete for "Donald Trump Russia" despite plenty of news coverage on this topic. So what? This isn't proof of anything. I can indeed discount "one observation" because it is literally a single search query used to draw a conclusion about an insanely complex system that processes billions of queries a day. I am open to the possibility that google is manipulating search results, but to demonstrate this you need to account for many other possible search queries that produce seemingly unexpected results. Dissecting one politically charged query and claiming it is proof of google's malfeasance doesn't make sense. Anyone can string together a couple strange query results to support their own subjective narrative about what should appear and the supposed sinister machinations behind the query results.


"hillary clinton emails" is a topic that is widely published around the web. Autocomplete should pick up on this and recommend it. I even went to the trouble of typing "hillary clinton emai" and no autocomplete suggestions were brought up.

It is a rather suspicious result. Suspicious enough that it is hard not to imagine a deliberate act is behind it. I admit that I don't have any proof.


How is autocomplete a sign of bias... Take an average voter... not very plugged in... he types in "hillary clinton e" and gets no autocomplete suggestions... he thinks Hillary Clinton didn't do anything wrong with her e-mail server? Do you seriously believe this?

Curious....Did you get any results for "hillary clinton e-mails"?


This is not the most scientific test, since previous searches are generally taken into account. Was this test conducted from a system that mostly searches for / clicks on pro-trump or anti-trump content?


Well, that's kind of the point: it's not scientific, but it's relevant. I believe this was also the example that was recently used in a Project Veritas video, with the same results.

I searched from a Firefox private window over a VPN from the Netherlands. But since the results are the same (regarding presence/absence of "email" as an autocomplete term) I don't think it matters much.


Speaking of cherry picking, I wouldn't use Vox as a reliable source for bipartisan data.


Do you have an example of Vox being dishonest?



I'm not sure this answers the question. Clearly Vox has a progressive (maybe originally neo-liberal?) editorial bias, but that may or may not mean that they have dishonestly distorted facts. There is a big difference between editorial bias and dishonest reporting!


>Overall, we rate Vox Left Biased due to wording and story selection that favors the left and High for factual reporting based on only one failed fact check and appropriately issuing a correction to a second. (5/15/2016) Updated (M. Huitsing 5/30/2019)


speaking of cherry picking, I wouldn't use autocomplete suggestions as a source for bias. Has anyone on this thread claimed bias in the search results?


Or politifact...


I've seen plenty of politifact and snopes fact-checks that go against the conspiracies that you guys seem to think are underway in those types of organizations.


tomweingarten can't see past the tip of his ideological nose. It's gonna be such a shocker to him when his megacorp gets shattered into a million little pieces.



Here is the trend data, where Google autocompletes two results while not autocompleting the other two. The trend data groups the results pretty clearly, however the autocomplete engine does not agree.

If the trend data is not being used for the autocomplete results, what is? And why?

https://trends.google.com/trends/explore?geo=US&q=hillary%20...


do an image search for "european people art"

what's up with that


Just FYI the completion results in the omnibox have little to do with the search engine results. Clearly the search engine produces millions of hits for “Hillary Clinton emails”. The completions are a completely separate system based on what people type in the box, not what’s in the index, and it’s laser-focused on producing interactive results.
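To illustrate the separation (a toy sketch, not Google's actual system - the function name, blocklist, and thresholds are all invented here): a completion table built purely from a query log might look like this, and a prefix can come up empty either because a query was filtered or because it never crossed a frequency threshold, which looks identical from the outside.

```python
from collections import Counter, defaultdict

def build_completions(query_log, blocklist=frozenset(), min_count=2, top_k=5):
    """Map each typed prefix to the most frequent full queries in the log.

    Completions come from what people typed, not from the document index.
    A suggestion can be missing either because the query is on the
    blocklist or because it never crossed min_count - and from the
    outside those two cases look exactly the same.
    """
    counts = Counter(q for q in query_log if q not in blocklist)
    by_prefix = defaultdict(Counter)
    for query, n in counts.items():
        if n < min_count:
            continue
        for i in range(1, len(query) + 1):
            by_prefix[query[:i]][query] = n
    return {p: [q for q, _ in c.most_common(top_k)] for p, c in by_prefix.items()}

log = ["donald trump email"] * 5 + ["hillary clinton email"] * 5 + ["hillary clinton age"] * 3
table = build_completions(log, blocklist={"hillary clinton email"})
# "donald trump emai" gets a suggestion; "hillary clinton emai" has no
# entry at all, even though it was typed just as often.
```

The point being: nothing in this pipeline ever consults the search index, so millions of matching documents are perfectly compatible with zero suggestions.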


> It is somewhat difficult to get the Google search bar to suggest negative searches related to Mrs. Clinton

Would be nice to know whether it was because of the search history bubble he lived in or not.


Someone mentioned you can't get any suggestion for Hillary Clinton Email.

If I type "Hillary Clinton Emai" I don't even get a suggestion for email :/


He does raise some good points in his findings:

https://sputniknews.com/us/201609121045214398-google-clinton...

I know Sputnik isn't a good source, but according to him, they were the only ones who would publish the findings without edits.


That's a pretty strong indication that his claims are BS. Sputnik is a literal propaganda arm.


It's certainly grounds for more scrutiny, but not outright dismissal before even looking at it.


Sputnik has twisted my actual research from grad school into garbage propaganda. I've seen first hand how "accurate" their reporting is. I'm likely to believe that something adjacent to their claims is true but I believe that slapping their name on something makes it less likely to be true.


I usually find expert consensus to be pretty good grounds.


they were the only ones who would publish the findings without edits.

They only let him do that because it fits with their agenda. The real test is whether they would let him publish an article on Putin's corruption... without edits.


> He has also made some questionable claims about google manipulating search results to favor Hillary Clinton.

Despite it being off topic, can we define why those claims are questionable? Is there data proving those claims wrong? Because with all the Google political controversies over the past few years, and given the political donation history of Google employees, it’s highly plausible that search results are manipulated to favor certain politics over others.

If the “questionable claims” have been disproven or are inaccurate, then it would seem that you’d provide some proof. Essentially, if you are to claim the search engine was not biased towards Clinton, certainly there would be some proof of that? It’s more reasonable to suspect Google of manipulating search results than not, given the political environment at Google.

The real “questionable claim” is that Google is neutral in any way — which is kind of the entire premise of the article. If Google were completely neutral, then why would their monopoly on search need to be broken?


Disclaimer: Googler.

I'm going to retread a comment I made when I first learned about Epstein a few months ago: https://news.ycombinator.com/item?id=19167084

In short, and ignoring my ad hominem attack on his motivations, I encourage you to read/skim his two "studies" [1][2] and see how absurd they are. You might dismiss my claims and summaries as biased, but I think I was pretty open-minded towards his conclusions until I read them.

[1] https://www.pnas.org/content/pnas/112/33/E4512.full.pdf?with...

[2] https://aibrt.org/downloads/EPSTEIN_&_ROBERTSON_2017-A_Metho...


>Despite it being off topic, can we define why those claims are questionable?

the claims are questionable because his methodology is questionable. If he claims Google is biased, he should have a good peer-reviewable study that proves this... not "Google is biased because it didn't auto-prompt me with 'created AIDS' when I typed in Hillary Clinton"...

And he's the one making the claim that google is biased...The burden of proof is on HIM.

This is a forum for people in the tech world, right? Shouldn't we question N=1 "studies"?


What about Project Veritas? People claim the statements by Google employees were taken out of context, but I've gone back, listened to them, looked at the videos, and it's hard to think in what context anything they said is acceptable.

Even if the specific engineers and managers in the video clips don't have the level of authority to make the changes they're talking about, it's still chilling that their attitude could be common at Google, and that they see wielding their great power toward political ends as some kind of great responsibility, instead of respecting the idea of equal/diversity of opinion.


Project Veritas has historically operated by guiding people into saying something ridiculous, either by themselves acting ridiculous (and convincing the person they're talking to that they're crazy) or just driving the conversation to ridiculous areas.

Maybe this Google 'expose' is the first time they're not guilty of that, but anyone who still finds them credible after their last several blatant mischaracterizations is far more forgiving than I would be.


Jen Gennai had this to say in her article https://medium.com/@gennai.jen/this-is-not-how-i-expected-mo... :

" ... Project Veritas has edited the video to make it seem that I am a powerful executive who was confirming that Google is working to alter the 2020 election. On both counts, this is absolute, unadulterated nonsense, of course. In a casual restaurant setting, I was explaining how Google’s Trust and Safety team (a team I used to work on) is working to help prevent the types of online foreign interference that happened in 2016. Google has been very public about the work that our teams have done since 2016 on this, so it’s hardly a revelation. The video then goes on to stitch together a series of debunked conspiracy theories about our search results, and our other products. ... "


The jester's job is to criticise the tyrant in his own court. Just enjoy the show.


*instead of respecting the idea of equal/diversity of opinion.*

Going to go out on a limb and say I don't respect "Hitler was a bad man" and "Hitler did nothing wrong" equally. Individual employees are allowed to have opinions...even opinions I don't agree with.


The "Manipulating instant search results in favor of Hillary Clinton" claim has been independently debunked and anyone still standing behind it are only signalling their technical illiteracy and/or political agenda for playing a victim card. [1][2][3]

That's not really off-topic - The fact that the author still supports the claim calls into question their ability to make further claims about the subject.

[1] https://www.snopes.com/fact-check/google-manipulate-hillary-...

[2] https://www.vox.com/2016/6/10/11903028/hillary-clinton-googl...

[3] https://mashable.com/2016/06/10/clinton-google-search/


None of those blog posts debunk the idea that Google manipulates search results to favor particular political parties. Mashable has a statement from Google (the other two don't) saying that they don't, but why would they admit it if they were?

None of these debunk the claims made by Google employees in the Project Veritas videos.


Except you're wrong, in that you can't logically prove something doesn't exist. You can't prove that pink unicorns don't exist, just as you can't prove that political bias in the search results doesn't exist.

All you can do is disprove the claims of their existence. Someone claims there's a pink unicorn in the garage, and you can check the garage and say that it is pink-unicorn-free. Someone claims political bias exists for Hillary Clinton in instant search results, and these articles disprove that claim as resting on cherry-picked evidence.

Now, if you suspect that the instant search results are politically biased, then the burden of proof is on you to provide evidence of that existence - preferably without cherry-picking evidence to fit an agenda yourself. Otherwise it's just hand waving and click bait.


> Now, if you suspect that the instant search results are politically biased, then the burden of proof is on you to provide evidence of that existence

The proof is a senior Google employee admitting to bias and manipulating results in the Project Veritas video. There's also plenty of anecdotal evidence you can see for yourself as a user. In addition to that I know many people who work at Google and the vast majority of them have extreme political bias.


This is precisely the hand-waving I was talking about. An employee rambling in a bar does not constitute evidence of search result manipulation, especially when the person recording it is known for stretching information, inciting people to commit voter fraud, and crossing the U.S.-Mexico border dressed as the deceased Osama bin Laden to prove some point.

Likewise, the handful of people you know having political views, in an organization of 85,000 people, is not evidence that that organization's search results are biased.

If there is so much anecdotal evidence then it should be easy for you to prove, right? Or are you afraid of being disproven?


You asked for proof and I gave it to you. I'm sorry you don't like the source, but unless you want to address what Jen Gennai said, I don't think you have much of a point.


Jen Gennai herself points out how Veritas is lying about what she said: https://medium.com/@gennai.jen/this-is-not-how-i-expected-mo...

James O'Keefe is publicly known for selectively editing video to make false representations, and this video is no different.

You specifically claimed that there's plenty of anecdotal evidence. Where is it?


The "proof" is extremely shaky. Veritas has a record of creating these types of videos where they draw conclusions out of thin air. Here's what the person in particular has to say:

https://medium.com/@gennai.jen/this-is-not-how-i-expected-mo...


> There's also plenty of anecdotal evidence you can see for yourself as a user. In addition to that I know

> anecdotal evidence

> I know

So much hard data here...real hacker news material...


You conveniently leave off:

"The proof is a senior Google employee admitting to bias and manipulating results in the Project Veritas video."


https://medium.com/@gennai.jen/this-is-not-how-i-expected-mo...

>Project Veritas has edited the video to make it seem that I am a powerful executive who was confirming that Google is working to alter the 2020 election. On both counts, this is absolute, unadulterated nonsense, of course. In a casual restaurant setting, I was explaining how Google’s Trust and Safety team (a team I used to work on) is working to help prevent the types of online foreign interference that happened in 2016. Google has been very public about the work that our teams have done since 2016 on this, so it’s hardly a revelation.


From the article:

"But what about those nasty filter bubbles that trap people in narrow worlds of information? Making Google’s index public doesn’t solve that problem, but it shrinks it to nonthreatening proportions. At the moment, it’s entirely up to Google to determine which bubble you’re in, which search suggestions you receive, and which search results appear at the top of the list; that’s the stuff of worldwide mind control. But with thousands of search platforms vying for your attention, the power is back in your hands. You pick your platform or platforms and shift to others when they draw your attention, as they will all be trying to do continuously."

But this is a huge problem. I'd rather have 10 independent search providers instead of 10 companies proxying the results of Google. It's worse if I don't even know which index the results come from. I guess many people don't know that Startpage shows you Google results.

I don't want Google results! I want different web crawlers ordering the results according to my taste without tracking each and every page impression of me. Give me that and I'll switch in a heartbeat.


> I don't want Google results! I want different web crawlers ordering the results according to my taste

Okay, great, we'll see how you browse and use that to determine...

> without tracking each and every page impression of me. Give me that and I'll switch in a heartbeat.

Uh? How exactly do you want them to do that?


The search engine basically owns an index of all websites. I assume that crawling websites and storing an index is a solved problem. Not trivial on that scale, and expensive. But conceptually solved.

What google does really well are two things:

1) It ranks those pages according to the well-known PageRank algorithm, although the details are kept secret. They also learned how to punish sites, which try to mislead the algorithm (Duplicate content, excessive SEO, link farms, etc)

2) What they also do well is returning search results in a very short time, based on the relevancy of your search term combined with the precalculated ranking. It also looks at your history in this step, and at the data it tracks about you.

My preferred search engine would open up both bullet points, giving you the option how pages are ranked and how relevancy is determined. For example: Rank personal blog posts higher or rank news higher or only use an inbound link count. Or you could get fancy and crowd source ranking (similar to hacker news). You could also configure search term relevancy. For example: did you ever search google for something specific and it automatically adjusted your term to something similar, more popular?

You could open up all those decisions and give the user options. Google on the other hand keeps everything secret and uses your personal data in unknown ways. I would love to have a search engine which is configurable. Yeah, not an easy thing to do. But it would be awesome.
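The "open up the ranking" idea can be sketched in a few lines. Purely hypothetical signal names and weights here - this is just to show what user-configurable ranking over a shared index could look like, not how any real engine works:

```python
# Per-document signals would be precomputed at index time; the weights are
# chosen by the user (or a preset) instead of a fixed secret formula.
def score(doc, relevance, weights):
    return (weights.get("relevance", 1.0) * relevance
            + weights.get("inbound_links", 0.0) * doc["inbound_links"]
            + weights.get("is_blog", 0.0) * doc["is_blog"]
            + weights.get("is_news", 0.0) * doc["is_news"])

def rank(results, weights):
    # results: (doc_signals, query_relevance) pairs from the shared index
    return sorted(results, key=lambda r: score(r[0], r[1], weights), reverse=True)

docs = [
    ({"inbound_links": 0.9, "is_blog": 0, "is_news": 1}, 0.5),  # a news article
    ({"inbound_links": 0.2, "is_blog": 1, "is_news": 0}, 0.6),  # a personal blog post
]
prefer_blogs = rank(docs, {"relevance": 1.0, "is_blog": 2.0})
prefer_news = rank(docs, {"relevance": 1.0, "is_news": 2.0})
# The same index yields different orderings depending on the user's weights.
```

Presets ("rank blogs higher", "inbound links only") would just be named weight dictionaries, so the engine stays simple while the user keeps control.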


While you may be able to open up some switches, I don't think you can build a reasonable user experience on something like that, since anything that allows you significant control over how the results are chosen (beyond the simple controls you've mentioned, like preferring certain types of results) will be both too complex for anything other than distributing common sets of options (like a plugin that comes with presets) and/or have so many knobs as to make tuning for good results infeasible.

I would argue Google already supports some of what you want, like allowing you to search just for news, although additional filters based on something like the niche you're looking for or the type of content don't hurt.


I guess at least in principle, the search engines could return many more results than actually needed, along with some feature vector per result. Then a model could be trained on the user's machine to sort the results according to their preferences.
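Something like this could in principle run entirely client-side. A toy perceptron-style sketch (feature names and the update rule are invented for illustration) of reranking results from clicks without sending anything back:

```python
# Toy perceptron-style reranker: nudge a local weight vector toward the
# feature vectors of results the user clicked, away from ones they skipped.
def update(weights, features, clicked, lr=0.1):
    sign = 1.0 if clicked else -1.0
    return [w + lr * sign * f for w, f in zip(weights, features)]

def rerank(results, weights):
    # results: (url, feature_vector) pairs as returned by the engine
    dot = lambda fv: sum(w * f for w, f in zip(weights, fv))
    return sorted(results, key=lambda r: dot(r[1]), reverse=True)

weights = [0.0, 0.0]                              # e.g. (is_blog, is_news)
weights = update(weights, [1, 0], clicked=True)   # user clicked a blog result
weights = update(weights, [0, 1], clicked=False)  # user skipped a news result
order = rerank([("news.example", [0, 1]), ("blog.example", [1, 0])], weights)
# The learned weights and the click history never leave the user's machine.
```

The engine only has to ship feature vectors alongside results; all personalization state stays local, which is exactly the "no tracking" property the parent comment asked for.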


The other problem with this is that it still can't change human nature. Ok, so this plan is implemented and any site can serve google results and order them as they want with an API. People are still going to go to their favorite far right or far left outlets, which can now access google results and show only the articles that they know their users want to see. The "filter bubble" problem could even be worse than it currently is in this scenario.


That was my first thought. So this is a thing, and now we have dedicated search engines to showing you the 'true' search results from either the left or right. And it's that much easier to stay completely enveloped in the echo chamber.


Maybe this article should be made public too.


The irony of this being behind a paywall. An expensive paywall at that.


The effect of this on Alphabet's revenue would be nil.

The majority of Google's revenue comes from Google, YouTube, Gmail and Play. They make so much money because they have the biggest advertiser-eyeball network effect in the world, along with Facebook. That. Is. Unbreakable. Even more than a social network's network effect, because the friction of switching budgets and people in a company is higher than a guy telling his best friends to download an app.

And then, YT is a network effect. And then, Play/Android is also a network effect. And then there's the branding. But presumably every big company has the latter. Still, what a brand. Everyone knows what Google or Android is. Every. Single. Human.

Finally, because they make all this money, they can pay to be the default on the other half of the devices, Apple's devices, to use Google as default. Last time I checked, $5B a year.

Hence, this article is so bad.

I don't even care about Google, just saying.

edit: did I mention Chrome? They've got chrome too, with the googleverse as default.


Also - with just the number of people who would game the system afterwards, all the search results would be utterly useless.


Yeah, and anyway you could return 10x worse results than Google's current results and still become the new dominant search engine if you've got infinite $B a year to outbid Google to be the default on browsers and operating systems, and a big salesforce to onboard the advertisers. Man, this is the exact playbook they used to become the search engine in the first place. They literally were Yahoo's search bar at some point two decades ago.


Microsoft has infinite money, pays users with rewards directly, and has barely gotten any marketshare with Bing.

In order to compete, you would need good results and good performance for your organic results and your ad program, in addition to a huge budget for traffic acquisition and the patience and ability to execute on strategy consistently over multiple years.

Also, keep in mind that people will judge based on perceived quality, not objective quality. Simply being shown as a Google result increases the perceived quality for most results -- in order to be seen as equal, the competitor will need to have consistently better results.


As you can clearly see for yourself, Microsoft is not as interested in acquiring market share for Bing. Not enough to outbid Google's contracts with Firefox and Apple to be the default. Bing is clearly a second thought for Microsoft, even though it's big. That's probably because they wouldn't get the same bang for their buck anyway. They have far fewer advertisers on board. Anyway, according to Microsoft, they own 33% of the US market. That is quite the market share, especially once you consider it has mostly been acquired through sneaky toolbar installations and IE.


Microsoft is solving their biggest problem by switching IE to Chromium. Right now everyone uses IE to download Firefox or Chrome. But now that IE will be Chrome, they are hoping that enough people will not switch their browser and will start using the defaults (Bing, Outlook, etc). It's genius. In the past, IE was the dominant browser, so it is not impossible that they can claw back some of that market share.


So although everyone likes to believe Google is a monopoly, it's far from it. You have choices - Bing, Baidu, Yandex, DuckDuckGo... There is also nothing about Google's search position that prevents you from building a competitor. What we do have is Peter Thiel backing an administration that's anti-Google, and Russia and China, which are anti-Google. Why? It's a source of truth that challenges their lies. We also have an emergent anti-ad, cult-like backlash against personalized ads. Combine all of these factors and you get a lot of pressure and misinformation telling you Google is evil. Additionally, karma: Google led the charge against Microsoft with its "do no evil" position - and Microsoft did have an OEM monopoly preventing others from competing. Anyway, that is how I see it... So is Google near to being a monopoly? No, I think they would need to be doing a lot worse things, and there is room to compete, and people should.


ah yes, the only reason you could ever have for being anti google, et al, is that you want to lie...

You started with a decent premise, that Google isn't an actual monopoly because it has "competitors" (and yes, those quotes are intentional). But then you went off the deep end and completely lost me, to the point that I didn't even finish the entirety of your post.

There are many many reasons that someone could be against google's absolute market dominance.


Where do you see their absolute market dominance? I checked here, for example: http://gs.statcounter.com/ - and for browsers globally it's 63%; that's not a monopoly. I checked here for search engines: https://www.statista.com/statistics/216573/worldwide-market-... - and 90% is pretty close, but it's not a monopoly. Contrast that with MS in the late 90s - according to https://www.forbes.com/sites/timworstall/2012/12/13/microsof... you have 97%. That's a monopoly if it's abused, which it was proven they did: https://en.wikipedia.org/wiki/United_States_v._Microsoft_Cor... Again, I find it interesting that it's hard to talk about Google being a monopoly without mentioning MS...


Google doesn't have market dominance in the same way that Verizon doesn't have market dominance in a specific region because you can also get dialup.

Technically correct, but not truthful.


> or browsers globally it's 63% that's not a monopoly...

You should look at markets, not globally. Google doesn't have a global monopoly (and likely never will, because China and Russia won't allow their web to be controlled by a US company), but it has a monopoly in lots of markets (read: countries).


Google isn't a monopoly, or monopsony for that matter. Competitors are a click away.

Having a popular product != having control over the supply of a market. Market dominance through having a better product than the competition is how the free market works.


Google is a profit-seeking corporation that needs to expand into China to generate more money and bends the knee as much as it can to make inroads there. Far from being a source of truth they follow all Chinese censorship laws (why, because they would be blocked from operating there and unable to make money off of the Chinese). China likes everything about Google other than the fact that it is not a domestic corporation.

A company does not need to be a monopoly to face Anti-trust measures made in the public interest. In the 90s you could have run Linux on a PC, installed Netscape on a PC, or bought a Mac, but Microsoft was still treated as a monopolist. The same with IBM in the 70s. There were at least five to ten other major suppliers of mainframes, but IBM was so dominant in the market that they were aggressively regulated for anti-trust.

Google is not the victim of misinformation. Everyone is just starting to understand how important they are as the de facto gateway to information (regardless of choice) and mulling over the implications of them dominating ad revenue, search, etc.


I saw Peter Thiel on one of the networks this morning pushing this message. The problem with his message (and yours) is that Google doesn't work with the Chinese Government. Dragonfly was shut down, and for the most part, Google is not even available in China.

Thiel's criticism was that because Google wasn't working with the U.S. Government, and was operating in China, that Google isn't patriotic.

Not only is this rhetoric a false equivalence, but it's dangerously wrong.


>Google is not the victim of misinformation.

>Google is a profit-seeking corporation that needs to expand into China to generate more money and bends the knee as much as it can to make inroads there. Far from being a source of truth they follow all Chinese censorship laws (why, because they would be blocked from operating there and unable to make money off of the Chinese). China likes everything about Google other than the fact that it is not a domestic corporation.

Whew the irony


You are of course right. While I stand by my assertion that they have tried to kowtow to China in order to get into the market, their services are blocked and China has resisted letting them back in. Their efforts are likely dead or stalled with the death of Dragonfly. In any case, major memory malfunction on my part, apologies.


Being a monopoly isn't actually the issue. Having a dominant market position is. This is where European competition law works well.

Google definitely has a dominant market position.


EU antitrust law doesn't even require dominant market position to break the law.


Peter Thiel was recently speaking at a conservative whatever in DC, spreading nonsense and creating a bogeyman out of Google (but not FB, obviously): that they're working on some evil AI. Anyone actually working in the field knows how blown out of proportion and cringy this AI doomsday narrative is, but people like Thiel don't waste a chance. I'd like someone to tell his Republican audience (pardon the stereotyping) that he is homosexual and recently obtained New Zealand citizenship just in case. Then watch how his audience reacts.


> What we do have is peter thiel backing an administration that’s anti google, Russia, China that are anti google.

Can you provide some context? Article does not mention Thiel, Russia or China.


Foreign propaganda usually doesn't label itself as such. Nor does this article have to be from such a source. Merely inspired by the noise being made


Are you suggesting that we should discard criticism of google because Russia and China is anti google? China does not like Trump either, so should we not criticize trump because of China?

And I don't even get why you say China and Russia is anti Google. They are anti-information more than anti-google. If google allows them to control what information people get they will have no problem with Google. They are further anti-west and are concerned that Google will play ball with western governments to give them access to Chinese and Russian data. Maybe I am missing what you mean by them being anti Google.


No - but I do think we should be sure we read deeper and question who is motivating a certain position. Ask whether there is a source of truth that can confirm someone's position. For example, I can see Google's search business controls about 90% of search traffic. I see browser usage around 63%. For advertising (Google's main source of income) I see they don't dominate; they share the market with Facebook, and more recently there is a lot of talk about Amazon being in a strong position to take more of that market.

See: https://www.statista.com/chart/17109/us-digital-advertising-...

https://www.cnbc.com/2019/02/20/amazon-advertising-business-...

That said, maybe my sources are not good? But it's better to ask whether someone is being truthful or trying to manipulate me with a specific position...


I'm not suggesting that. Nor am I saying China & Russia are anti Google. I was only responding to why your statement "article doesn't mention X,Y,Z" was missing the mark on taf2's argument


The carousel requiring (or severely favoring?) AMP, and requiring exclusively Google JavaScript/ads, is definitely Google using its dominant user base in search to favor its ad business.

They have already been condemned for their Google Shopping service for monopoly abuse on search.


Condemned only in the EU, and the solution that was created did NOT help the consumer.


This is dumb. You mean if I work two decades to develop tech that nobody else can copy, I have to open-source it because my competitors are dumbasses?

This is not how it's supposed to work.


I think the bigger issue is how it affects people and the power this tech gives to control various things about socio-political landscape which a single corporation shouldn't be trusted with.


If we're going to start nationalizing the resources of private companies in the name of public good, I think internet tech giants that offer up their services for free should be last on the list. On the top of the list should be corporations that dominate resources like life-saving drugs and residential property.


Not saying I agree or disagree with you, but the idea itself isn't anything radical. The US has had anti-trust since the late 1800s.


I don’t see any mention in this article of what seems like the most obvious way to split up Google, separating their search and ad businesses. (Edit to add: although maybe the effect would end up being similar, if API users serve their own ads but without access to Google’s ad infrastructure.)

That obviously wouldn’t be a simple job, of course, and maybe there are some interesting reasons why it wouldn’t work well.


> I don’t see any mention in this article of what seems like the most obvious way to split up Google, separating their search and ad businesses.

What would be the revenue model for their search then?


The ad business would pay the search business for ad placement. The search business would also be free to auction ad space to other providers.


> The search business would also be free to auction ad space to other providers.

Doesn't that make them an ad business?


I agree, to me it seems like that really does not change anything except it mandates one more middle man.


Yep, it just makes what likely already operates as an "internal" customer an "external" one. Not really sure how it helps with competition?


That’s exactly the point, to make that internal customer external, and make any special access or APIs they have available to competitors on an equal footing.


It makes them like the New York Times, which sells ad real estate. It's a much different business.


Is that not what google's ad business does already? Not sure where the difference is actually. Basically the request seems to be:

Pre Split:

  - Company A
    - provides: search
    - sells: ad space
Post Split:

  - Company A
    - provides: search
    - sells: ad space
  - Company B
    - (re-)sells: ad space
Seems like this solves nothing and can be repeated ad infinitum. There are many companies which already do what Company B does - what exactly will this fix?


Here’s a question that may clarify: who runs the ad auction?

If post-split that’s company B, I’d argue that is a genuinely new structure with some different properties.

Company A would still have an ads team to manage the integration, but they would explicitly outsource ad selection to (potentially) multiple partner companies.


>The ad business would pay the search business for ad placement. The search business would also be free to auction ad space to other providers.

What you're describing is exactly what Google ads are today.


> What would be the revenue model for their search then?

I read OP's proposal as separating Google search business (search + search ads) from its third-party ad business (e.g. AdSense).


> what seems like the most obvious way to split up Google, separating their search and ad businesses

Given the complexity of (a) Google's search and ad integrations and (b) the adtech landscape as a whole, this would be difficult to do legislatively. That leaves settlement with the DoJ, a costly and time-consuming path.

The author's suggestion is not exclusive against a break-up of Google. Its moderation and basis in precedent, however, make it something multiple agencies--not just the DoJ--could implement. Including Congress.


It is actually very easy to do legislatively (laws tell the companies what to do, not necessarily how to do it). The technical challenges fall on Google.


> It is actually very easy to do legislatively

It would be easy to write a law, not to pass one. Generally speaking, when a law is unpredictably disruptive it becomes (a) difficult to pass and (b) time-consuming and costly to defend in court.


The difference in this case is that the usual defenders of private property rights and libertarianism feel as though Google is suppressing them, potentially removing the usual roadblock to this kind of reform.

This is a rare instance where anti-corporate leftists and dejected right wingers could actually do something substantive together.

In most cases I would agree with you the defenders of private industry are pretty fierce in the US, but all that ideology sort of dissipates when you feel like you are being oppressed (whether or not, or to what degree it is true).


> the usual defenders of private property rights and libertarianism feel as though Google is suppressing them, potentially removing the usual roadblock to this kind reform

Potentially. It's still more difficult than the author's index proposal. No reason they can't be pursued in parallel.

Regarding a search-ad break-up, I'd guess there would be lots of wrangling over (and lobbying around) defining search and advertising. For example, is Amazon's product search tied to an advertising business, given it sells third parties' products?


Microsoft said the same thing about Internet Explorer and its operating system.


> Microsoft said the same thing about Internet Explorer and its operating system

And breaking Microsoft's attempted browser monopoly required the DoJ.

To be clear, I believe we will eventually have to break apart--at the very least--Google and Facebook. But there are advantages to the author's proposed tactic of making the index public. It's a cleaner, cheaper, and quicker solution and doesn't harm the odds of a future break-up.


And when it all came down to it, it didn’t matter. Chrome didn’t become dominant on the desktop because of government intervention and Safari didn’t become important because of DOJ either.

Apple and Google competed. Government intervention is rarely the right answer. A bunch of people who are both beholden to lobbyists and ignorant to technology won’t produce the outcome people think it will.

Besides, the last thing anyone should want is more government power. Given the choice between trusting the government - that has the power to take away my liberty and my money - with more power or trusting private corporations, I have much more to fear about government.


That which the market does not naturally provide... does one just morally expect?


Two points.

The slap on the wrist from the DOJ didn't lead to decreasing dominance by Microsoft. Google/Facebook/Apple competing did.

A search engine is not life or death. Google doesn’t stop anyone from creating a website and reaching consumers other ways.


>A search engine is not life or death. Google doesn’t stop anyone from creating a website and reaching consumers other ways.

As someone outside of the tech bubble, this statement always confuses me. I see it once a week or so on this site. Someone will say 'well, company x is doing y, what's to stop you from just disrupting them and working around?"

Because people get their information in consistent, and predictable ways. And one of those main ways is to search for it using a popular website, so popular it's literally called 'googling' something.

If you are an unknown, a search engine is literally life and death for your company, from what I can tell.

Again, I'm outside the tech sphere, so maybe I'm just ignorant. But I don't think so.


It's not about "disrupting" anyone. There are other forms of advertising than Google: Facebook, Amazon, commercials on TV, guerrilla marketing, getting "1000 true fans" and letting your product spread organically, etc. If you are completely dependent on Google, you are just one algorithm change away from becoming invisible, and you may need to rethink your business plan. It's kind of silly to start a business whose only method of customer acquisition depends on Google.

I’ve worked for B2B companies that actually had a sales team.


Same results different entities.

Outlaw ad tracking in some way.


But it minimizes conflict of interest.


Why not both?


A list of websites and their content is really not useful at all. Anyone can get this themselves with some really simple programming.

The actual hard part is when it comes to ranking and sorting the data in any useful way, and doing it within like 100ms. Plus various other issues like spam protection etc. This is where Google excels (at least in my opinion).
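To make the parent's point concrete, here is a toy TF-IDF ranker over an inverted index - a minimal sketch of the "ranking and sorting" step, not anything resembling Google's proprietary signals:

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def search(index, query, n_docs, top_k=10):
    """Score docs by a simple TF-IDF sum over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

docs = {
    "a": "google search engine ranking",
    "b": "recipe for pancakes",
    "c": "search engine spam detection",
}
index = build_index(docs)
print(search(index, "search ranking", len(docs)))  # doc "a" ranks first
```

Writing this takes an afternoon; making it resistant to spam and fast at web scale is where the decades of engineering go.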


I wouldn't say that anyone can create an index with really simple programming. There are quite a few technical obstacles that "really simple programming" probably couldn't solve. That being said, I agree that any legitimate company would be able to create an index easily enough. The hard part is ranking and spam detection.


This seems a bit silly - it's not like Google's search results are that much better than Bing or DuckDuckGo (or better at all most of the time). Google has Google Colab, Chrome, and Android, and the whole Docs ecosystem, and they've integrated those things together pretty well while still being loosely coupled enough to switch out any part of the ecosystem with a competitor's product.

Is there a reason to break up Google other than that they are doing well? Other search engines seem to have no problem establishing their own niche and doing well.


Colab? I don't think that google's position is held in power by that.


I wanted to give a small example to represent the zillion other little small examples that I didn't list


I'm of two minds here: Google's whole reason for ascending to where they are is the PageRank algorithm, which Google was created in order to monetize. I see this in a similar vein to Apple and iOS: would we support calls for Apple to be forced to allow iOS to be installed on non-Apple hardware? If not, then why would we insist on Google giving up its reason for being, the reason a lot of us use it to find relevant information?

Then again, the concentration of power in a handful of operators likely threatens the open internet.
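For reference, the core PageRank idea itself (now off-patent, as discussed below) is simple enough to sketch in a few lines. This is a toy power-iteration version over a made-up three-page web, not anything like the production system:

```python
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:  # split rank equally among outgoing links
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Tiny made-up web: everyone links to "hub", so it ranks highest.
web = {"a": ["hub"], "b": ["hub"], "hub": ["a"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # "hub"
```

The value was never the handful of lines above; it was the crawl, the infrastructure, and twenty years of refinement layered on top.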


Not a lawyer, but this seems to be conflating patents with copyrights?

i.e. iOS (especially new versions) would fall under copyright protection [0].

PageRank is a patented technique for search. The patent apparently ran out about 6 weeks ago [1].

While both copyright and patents are intended to protect creators for a certain period of time, copyright protects a specific work and patents protect an idea. Patents should generally expire much more quickly since they cover a much broader topic.

I also realize both systems are completely crippled at the moment, but I'm trying to stick to what they're at least intended to be.

[0]: I'm sure they have tons of patents centered around iOS but this is about protecting the OS itself.

[1]: https://pulse2.com/googles-pagerank-patent-expired/


Makes sense actually. Yeah I think I am conflating them.

The patent is expired but could you argue the index Google is accumulating and aggregating could be copyrighted work in the same way that iOS is? It has taken considerable effort, expense, and expertise to cultivate a useful index.

Full disclosure I’ve no idea what I’m talking about just thinking I can’t see what basis people have to force Google to open up their key money maker just because. Other people/companies can create their own indexes with the now expired patent, no?


I don’t want to break Google’s monopoly on search. Google’s search is fantastic. It’s their advertising business knowing too much about me I care about.


> Google’s search is fantastic

The author's proposal, making Google's index publicly accessible, would leave Google search intact. Google's algorithm would presumably remain proprietary.

For people who like Google today, nothing would change. For people for whom Google falls short, there would be new options. Looks like a clear win-win for consumers.


>Google's algorithm would presumably remain proprietary.

This is not a good thing. Basic keyword searching on an index isn't viable as competition; you'd go back to the days when keyword spammers were at the top. Google should be forced to distribute all their search technologies to the public, with the threat of the Alphabet corporation being dissolved if they do not comply. That is the only way to get us out of this mess and break Google's stranglehold on the internet.


If you started a business would you be okay with giving up all of the IP you invested in creating?


Like you said, the value of Google is the algo. The index is worth little.

The limits of Bing, Qwant, DDG, etc. are not the number of the indexed pages.

It's that, given the same pages and the same search terms, they don't return results that are as good.


> the value of Google is the algo. The index is worth little.

But it's worth something. Giving DuckDuckGo direct access to Google's index, including the ability to train models on said index, would improve the competitive landscape.


Or, you know, they could innovate by themselves instead of relying on Google to do all the heavy lifting?


Yeah it seems like this would spawn a lot of weak competitors who only exist because of the protections and would die off the second they're cut off from the API.


Maybe the practical indexing ability is worth more than the index itself. Most site owners happily let the Googlebot into their websites, while a competitor is likely to be caught in some sweeping anti-bot/Captcha measures.


If you want Google search without tracking, you should try StartPage

https://www.startpage.com/


It also seems to work much better. It finds exactly what I type, and not random things that google thinks might fit...


How do they make money?


Non-personalized ads.


And really, that seems like a good cutpoint to me. Search might be a natural monopoly. But advertising sure isn't. A lot of their ad stuff came via acquisition; what was bought can be spun off or sold again.


Why would search be a natural monopoly? Compute time and network bandwidth is less expensive now than it's ever been.


But the profit one can gain from crawling sure grows non-linearly with the number of pages consumed.

Thus, small businesses are less likely to reach the index size required to make their profit ratio comparable with that of their bigger competitors.


I'm not so sure about search being fantastic anymore. I'm frustrated with how most of the time Google will completely ignore the exact query terms and return very clickbaity/popular links from Medium/Quora/etc. It's been happening for a while now, and for any serious search I'll compare results with DuckDuckGo just to be sure.


I kinda disagree. 15 years ago you could generally type anything and expect the first search result to be the legitimate one. Now you have to dig through the list of SEO-optimized scams (or at least dubious websites), though it has improved somewhat over the last couple of years.

Example: if you type "broadway shows phantom of the opera", the first non-ad link redirects to www.broadway.com. But www.broadway.com doesn't do anything: they just buy tickets from the legitimate website (www.telecharge.com in that case) with an extra ~$30 fee per ticket tacked on top. I won't go as far as to claim that Google is OK with it because broadway.com must spend tons of money on ads, and not telecharge.com... but I'm suspicious.

Counter-example: "green card lottery" now points to the official US gov websites, while a few years ago it was scam after scam all the way down.


I think the better move is to find a way to kill the advertising business model that google relies on. If there's no advertising there's not as much need for all this data collection.


>the better move is to find a way to kill the advertising business model that google relies on. If there's no advertising there's not as much need for all this data collection.

What would be the alternative ad-free business model for Google to pay for its datacenters? Paid subscriptions?

I've been performing Google searches for 20 years. In an alternate past universe, if I were paying $9.99/month for paywall access to their ad-free search engine, Google Inc would have more sensitive data collection about me -- not less. Google would have decades of my (sometimes embarrassing) search history specifically tied to my paid account.

Ads can reduce privacy but they can also increase privacy by making explicit logins optional. (In other words, the anonymous google searches I did at the office during work are uncorrelated with the searches I do at home or at school. Without explicit Google account logins, the searches I do on my smartphone at Starbucks can be uncorrelated with the searches I do on the desktop at home.)

I don't like ads but the alternative ad-free revenue model is worse: Google Inc having my credit-card payment info (which means my real identity) and all my private search queries tied to it.


The advertising model Google relies on is just a reflection of how the attention economy and the internet work. I just don't see how this makes sense practically. People aren't going to seamlessly and happily go back to paying $$ for internet services.


Google's search isn't even that great, to be honest. I use DDG for search, and use the !g operator when the DDG results aren't satisfying. But my experience with that is that the Google results are never good for those searches, either.

The thing the article misses is that search isn't really Google's crown jewel anymore. It's their position at the top of the adtech ecosystem, and while they bootstrapped that with search, I kind of doubt that search is the main thing driving their ad views today.


but a big reason search is fantastic is because of the data the system has on us. I don't think you can separate the two.


Google treads into censorship where it concerns political discourse. I'm not referring to things beyond the pale. I mean they will censor seemingly small things, like burying a politician's peccadilloes or surfacing something that stains an opponent. That's dangerous.

One obvious search query string is "reddit + candidate"; try a few candidates and you'll see some surfaced and some less so.


I know it's hard to get definite proof, but what really makes you think Google employees (only a subset of whom have production access) would go out of their way to censor websites or results? If I search for [any US candidate] I always get both a news carousel and their official website. Maybe this happens in other countries where Google can throw their weight around without anyone noticing?


I lean pretty far left, but this proposal makes me really uneasy. It seems very coercive and ham-fisted. I think instead, Bing should consider opening its index up. It'd be a massive PR coup, they wouldn't lose anything of much value, and anyone who put the index to better use than them would help them by breaking Google's stranglehold over search.


So... this article is a good example of how messy it gets once you move from "We gotta do something about these tech monopolies" into the "what should we do?" phase.

How exactly *do* you break up a Google or a FB so that (only one possible reason, but the one cited here) they don't control too much media/mind share?^

Laws usually want to be general, and my suggestion doesn't necessarily lend to that, but I'll suggest it anyway.

Facebook doesn't need to be broken up into several companies. It can just be shut down.

I don't mean that it should (justice-wise) be shut down. I just mean that we won't lack for social media. We will have social media alternatives the day after FB shuts down. There's a chance we'll get something more open instead. There's a good chance we'll get several small replacements instead. There is 0 chance that we'll lack for ways to share posts and post pictures. This isn't Bell, where we need to keep the phones working. The phones will work fine with or without Facebook.

There is no need for a company to generate $70bn in revenue, in order for us to have social media. That's a key difference from all antitrust cases of the past.

YouTube is another sort of example. If it shuts down, alternatives will pop up immediately... maybe open ones.

There are justice questions (is it fair to shareholders/employees/Zuck?). There are legal validity questions (why FB and not Apple?). But for the practical questions... the problem is an easy one.

^fwiw, I also think this is the most worrying part. These companies have a tremendous control about how and what people think. They make Murdoch media look quaint.


I have zero faith that any new social media ecosystem that emerges in the absence of Facebook would be more secure, or more private -- especially if the new ecosystem consists of multiple independent services sharing data (and if it doesn't turn out that way, that means a new monopoly emerged).

A single large company is also easier to regulate than a bunch of smaller ones.

If you are concerned about monopoly power over social media, go ahead and break up Facebook. If you are concerned about privacy and the security of social media data, you are probably better off regulating Facebook and raising high barriers to anyone who wants access to Facebook's data.


> How exactly "do* you break up a Google or a FB so that (only one possible reason, but the one cited here) they don't control too much media/mind share?^

The problem isn't so much that google has a monopoly on search as that they abuse the monopoly to force their way into new businesses. Force google to spin search off as a separate entity from the rest of the company.


Isn't it? I'm not even sure that "monopoly" is the right frame. Google's "price" in their high market share areas (eg search) is free anyway.

To me, the problem is power, not necessarily market/pricing power or even exclusion power. The problem is that the "public square" is essentially a handful of private monopolies. What Google considers "inappropriate" is banned or disadvantaged on YouTube. What FB (or Twitter) considers distasteful can become absent from the public or political debate.

It's no joke at this point. Things "woke up" when it seemed that shady stuff happened during Trump's election. But several revolutions had already started on social media, as well as coups, counter-coups, civil wars. Tens of thousands of public officials around the world have been elected on FB.

That is a lot of power in very few hands, under a veil of secrecy, often with bad incentives. It was cheap printed pamphlets and union halls that made fascist-vs-communist street fights a thing in the 30s... mediums. These days, an algorithm notices that telling (proverbial) fascists about communists is good for engagement, and the algorithm optimizes. I'm not accusing them of deliberately doing this, but the feed algorithms optimize this way for a reason: incentive.


I don't understand why we are talking about this. Google is far from the first "monopoly" that I would like to see broken up.

My guess is that Google's not lobbying effectively.


Because their even more monopolistic competitors are pissed, journalists are pissed because the world changed on them and they have to compete, combined with radicals with a persecution complex and a "no, the children are wrong" attitude toward their opinions being unpopular. The tech break-up push is one big hypocritical circle jerk from every party.


This just sounds like pissed off rant. Can you draw out your statements a little and explain them?


Fun timing to read this, this last weekend I was playing around with making my own search engine to understand better how ElasticSearch and Lucene work.

It occurred to me that the two most powerful things Google has to work with are records of clicks, and the time users spent on the webpages Google returned. I've argued against Google monopoly before because I can throw together a web crawler and search engine in a weekend, so it's not like it's a hard market to enter.

> According to W3Techs, Google Analytics is being used by 52.9 percent of all websites on the internet

This is the real problem though. When a search engine sees a new query, it uses everything it's got to infer which pages the user wants, but with Google Analytics, Google can test those inferences constantly to see if a user actually wanted that web page. Then your future queries can be compared against previous queries that were validated by a user spending several minutes of active time on the returned page.

I'm sure Google's algorithm is great and all, but I really think this is what sets them apart.
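The validation loop described above might look something like the sketch below. To be clear, the function name, log format, and dwell-time threshold are all invented for illustration; this is a crude proxy for "was the user satisfied?", not Google's actual pipeline:

```python
from collections import defaultdict

def validate_ranking(query_log, dwell_threshold=120):
    """For each query, estimate how often users clicked the top
    result AND stayed on it a while (a rough satisfaction proxy).
    query_log: list of dicts with query, clicked_rank, dwell_seconds.
    """
    stats = defaultdict(lambda: [0, 0])  # query -> [satisfied, total]
    for event in query_log:
        satisfied = (event["clicked_rank"] == 1
                     and event["dwell_seconds"] >= dwell_threshold)
        stats[event["query"]][0] += int(satisfied)
        stats[event["query"]][1] += 1
    return {q: sat / total for q, (sat, total) in stats.items()}

log = [
    {"query": "python sort", "clicked_rank": 1, "dwell_seconds": 300},
    {"query": "python sort", "clicked_rank": 3, "dwell_seconds": 200},
    {"query": "cheap flights", "clicked_rank": 1, "dwell_seconds": 5},
]
print(validate_ranking(log))
# {'python sort': 0.5, 'cheap flights': 0.0}
```

A new entrant can write this code too; what they can't get is the firehose of real events to feed it, which is the parent's point.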


> but with Google Analytics...

No: "Search does not use Google Analytics for ranking" https://www.youtube.com/watch?v=LLmO1GE4GvI


I think you misunderstand what I'm saying. I'm not saying Google analytics will get you ranked higher or lower. Just that Google can use the data from analytics to tell if their results were what the user wanted.


It doesn't even have to link the results to a specific user, linking them to the search terms is enough.

It's possible to achieve something similar by measuring bounce rate, but in any case, the sheer number of searches is what matters. For new entrants, this kind of expertise will be very hard to get.


> tell if their results were what the user wanted

That's ranking.


That's validating the rank. Ranking is done in another step.

If they use the data for ranking, they'll lose the huge amount of value it has for validating.


Good idea! We should make Bloomberg’s stock data public too!


I wonder if it's even in America's interests to create a weaker Google/Facebook/Amazon/Microsoft. These companies are dominating globally (excluding China) and bring back so much money, jobs, and influence to America. Weakening them might allow real foreign competition to flourish.


"DuckDuckGo, which aggregates information obtained from 400 other non-Google sources, including its own modest crawler.)"

I looked into it, and it seems DDG is using the Bing and Yahoo search APIs and lots of other sources. I looked into the pricing of Yahoo's and Bing's search APIs; it ranges from $0.80 per 1,000 queries to several dollars per thousand.

It seems too expensive to be economically viable with ads. What am I missing?


I see the duck crawler on my websites a fair bit now, so chances are they're eliminating this expense. But even in 2015 the CEO said they were profitable (although this wasn't at the same scale as now): https://fortune.com/2015/10/09/duckduckgo-profitable/


They've negotiated a lower rate?


Running a search engine is pretty expensive. $0.80 per 1000 queries doesn't seem orders of magnitude off the actual running costs.

For comparison, that works out to ~$15 per year for the average user's search volume.

If you can't make $15 per user per year on search, you have bigger problems...
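The ~$15/year figure works out if one assumes roughly 50 searches a day (my assumption; I'm just back-solving the arithmetic):

```python
price_per_query = 0.80 / 1000   # $0.80 per 1,000 API queries
searches_per_day = 50           # assumed "average user" volume

annual_cost = price_per_query * searches_per_day * 365
print(f"${annual_cost:.2f} per user per year")  # $14.60
```

At lighter usage, say 10 searches a day, the same API pricing is closer to $3 per user per year.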


If DuckDuckGo can change their stupid name they will gain market share overnight.


I agree. I've given up recommending DuckDuckGo to non-technical friends/family because they always get distracted by the irritating name. I wish they would change it to "Duck.com".


"duck it" sounds a lot like another phrase.


Because “Google” was a sensible name?


No, but Google is two syllables, and DDG is three. Also, Google doesn't have a duck in the name.


Googly eyes


True but DuckDuckGo doesn’t roll off the tongue as easily. If they named it “Poodle” it’d probably get more traction as silly as it seems.


Or even just Duck[.com] - they own the domain..


Or AltaVista


Lycos


Many just call them duck or ddg now. They bought duck.com


It might have been for free, it's unclear they bought it

https://www.theverge.com/2018/12/12/18137369/duckduckgo-duck...


They added an intermediate page around July 2018 https://web.archive.org/web/20180721002315/https://twitter.c... and probably saw that 99% of people clicked on the DDG link.


The verb form "to duck" would be sublime. For instance "I ducked this guy I met".


Did they buy it recently? I recall Google owning it for a bit through an acquisition. Yeah, it certainly would be nice if DDG rebranded and pushed Duck as its name.


Google gave it to them. The sceptic in me saw it as future evidence against being a search monopoly "we even helped our rival you guys"


That's a cynic, not a skeptic. If Google kept duck.com you would say they were protecting their monopoly. When your model of blameworthiness says that every possible choice is wrong, your model is wrong.


Also, duck.com is easier to type than duckduckgo.com in the address bar of a browser.

There is also ddg.gg which I find even easier to type.


bad engine though (last time i checked was ~3 years ago)


I don't mind google search. Let them have search, they're obviously the best at it. What I have a problem with is literally every other service they provide. Google shouldn't own the most popular browser, the most popular video streaming platform, the most popular map service, the most popular operating system and (one of the?) most popular email services. Why is a single company allowed to have control over essentially half the internet? I don't think there's a precedent to that and it's about time it's tackled.


Why is it people care so much about Google being in many different markets but I never see anyone calling for Samsung or Berkshire Hathaway to be broken up?

Just goes to show how easy populism is to push. Google and Facebook are in the news a lot and used by the average consumer, making them the perfect scapegoats.

Meanwhile, the average consumer just sees Samsung as the TV people (actual industries: apparel, automotive, chemicals, consumer electronics, electronic components, medical equipment, semiconductors, solid state drives, DRAM, ships, telecommunications equipment, home appliances) or Berkshire Hathaway as that old rich guy who drives a cheap car (actual industries: property & casualty insurance, utilities, restaurants, food processing, aerospace, toys, media, automotive, sporting goods, consumer products, internet, real estate), so not a peep is made about them or any other horizontally distributed company in the world.


> Meanwhile, the average consumer just sees Samsung as the TV people (actual industries: apparel, automotive, chemicals, consumer electronics, electronic components, medical equipment, semiconductors, solid state drives, DRAM, ships, telecommunications equipment, home appliances)

Samsung was also building military jets and artillery pieces. Imagine the media response if Facebook went into weapons production :)


I get your point, but Samsung is a manufacturer, it sells products. Google and FB can have much more impact leveraging the data they have on everyone, which is why we need as a society to be cautious and keep them in check.


> What I have a problem with is literally every other service they provide. Google shouldn't own the most popular browser, the most popular video streaming platform, the most popular map service, the most popular operating system and (one of the?) most popular email services.

So you're saying they're giving away too much good free stuff and need to be stopped?


I mean, obviously it’s not “free” to society.


> I mean, obviously it’s not “free” to society.

My mistake. I had forgotten that Google is a taxpayer funded organization.


>Google shouldn't own the most popular browser, the most popular video streaming platform, the most popular map service, the most popular operating system and (one of the?) most popular email services.

Why the hell not? Another company is free to make a better streaming platform, better map service, better browser, better email service... The government has no business telling companies which (legal) products they can or cannot create and own.


> I don't mind google search. Let them have search, they're obviously the best at it.

I don't think that's a good idea, to let one company "have" search. Basically, you're letting a single company regulate discovery and information for a large chunk of the culture as a whole.

> Why is a single company allowed to have control over essentially half the internet?

If that company has control of most of the discovery and monopolizes the information produced, they have one level upstream control of most of the Internet and the culture. It's like they don't just own the reservoir, but own a majority share in the river and most of the dams upstream.


What's stopping you from using Bing Search, Open streetmaps, hotmail, Firefox etc.? Google doesn't "control" anything you don't let it.


I’m using Firefox and DuckDuckGo and it’s okay. But, for example, I naively made gmail my work mail so I’ll be stuck with that for a while. But a few idealists on the Internet going through the effort of de-google-ing their lives doesn’t change their grip on society.

It’s rather obvious and there’s a lot of google employees on hacker news so I don’t really see the point of arguing further. You’re having a giant part of the defining tech of the 21st century controlled by one company. How’s that healthy?


Very few people seem to host their own email infrastructure, even when using their own domain, and I find myself unable to contact businesses and individuals because of gmail's rules that almost never let my messages through.

This isn't even touching on that Amp email garbage that they're pushing.


In a lot of ways I want better search than Google's. It's increasingly built for the masses; power users are increasingly unhappy.


I think a distinction needs to be made between monopoly and 'best product'. Search is winner take all in many respects. As a user, I don't particularly care about the second best (because I don't pay).

The best way to fix the situation is to get users to pay. And this could be done. Users can pay negative dollars (i.e., get paid). Why would a search engine want to pay its users? Why would a retailer want to pay me to look at their ad? Why wouldn't they?

Example: My wife and I buy a not-insignificant amount of stuff from Lululemon. So here's a deal for Lululemon: pay DuckDuckGo half what you pay Google to show me your stuff. And DuckDuckGo, pay me half of what they pay you. Right now DuckDuckGo gets zero. Google probably makes $20+ a month off me. In this math, DuckDuckGo nets $5!
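Back-of-the-envelope, using only the numbers assumed in the comment above:

```python
google_revenue_per_month = 20.0  # what Google supposedly makes off this user

advertiser_pays_ddg = google_revenue_per_month / 2  # advertiser pays DDG half
ddg_pays_user = advertiser_pays_ddg / 2             # DDG passes half to the user
ddg_net = advertiser_pays_ddg - ddg_pays_user

print(f"user gets ${ddg_pays_user}/mo, DDG nets ${ddg_net}/mo")
```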

There is a natural equilibrium here. Advertisers quickly start paying zero to people who don't buy stuff, so users can't farm ads. But relevant customers get discounts.

So now I actually care about making the trade-off between second best but effectively cheaper, or best but more expensive. In other words, a functioning market.

To do this though, you have to start tracking users in some way. But Google already does this, so the situation doesn't change. Some form of surveillance is required to improve relevance. The difference here is, I can choose who surveils me and to what extent, rather than being stuck with the winner-take-all.


I disagree. To break the monopoly, a new and innovative way to search needs to arise, not just a clone or "Google wannabe" like Bing/DuckDuckGo etc.

So, somebody gotta do what Google did 20 years ago.


I don't necessarily think re-inventing the wheel every so often is the way to go on some things. When it comes to search, you want to type in your query, and search. It doesn't really get much simpler than having a search box, and I am having a hard time understanding how that can be innovated upon. I am however not saying it CAN'T be done, I just don't see why or how it would be.


Unfortunately, people like him and consultants who will be interested in doing business with Google will be the ones that politicians listen to.

I'd personally do this to solve the problem:

1. Separate Google from Alphabet and separate Google Search from everything else they do.

2. Nationalise the Google brand and search engine along with Bing. Everyone else can stay as-is for now.

3. Allow applications for companies to become search result endpoints. Ownership would need to be completely transparent. Prevent incumbents from being involved - Amazon, FB, Google, Apple, Microsoft, Samsung etc.

4. All Google and Bing result listings would then be evenly distributed across the search result endpoint providers.

5. The propositions for each of these companies can now be customised knowing that they will get a maximum number of likely searches per year.

6. Prevent any mergers or foreign ownership of these new companies.

Sure the search would develop at a slower pace but its power would devolve into utility again rather than as a way to control what people see.


None of the rules he sets out for access to the index hold even for ICs within Google trying to run jobs over just the top/smallest tier of pages in the index.


This idea is a pipe dream. It would be nationalizing property, it's never going to happen in the US.

If you want to enforce anti-trust actions against google's app store policies, I'm all for that. But forcing them to give away property is insane.


> This idea is a pipe dream. It would be nationalizing property

The author references precedent in the 1956 consent decree with Bell Labs [1]. If anything, that was more extreme. AT&T developed those patents.

The author's proposal doesn't involve Google surrendering its algorithm. Just the index it compiled from public resources using a publicly-subsidized Internet infrastructure. All without content owners' permission.

[1] https://economics.yale.edu/sites/default/files/how_antitrust...


What about the computing and infrastructure resources that Google dedicated to build the index, and continues to dedicate to keep the index up to date?

This is a failed model. There is no incentive for Google to continue to update the public version, and it would quickly fall out of date while Google focuses on their own internal copy. It’s a solution suggested by lawmakers who fundamentally don’t understand how computers work.


> What about the computing and infrastructure resources that Google dedicated to build the index

How is this different from the 1956 consent decree? AT&T spent money developing its patents. But on the basis of longstanding law around public interest, it was forced to license them to third parties. (Note: not give them away.)

> and continues to dedicate to keep the index up to date?

The author explicitly contemplates, again within the context of the 1956 consent decree and many subsequent and preceding actions by the U.S. government (mostly around pipelines, et cetera), use fees paid to Google.

> There is no incentive for Google to continue to update the public version, and it would quickly fall out of date while Google focuses on their own internal copy

Google wouldn't be permitted to maintain dual states. Its algorithms would have to use the public index.


Conflating AT&T to Google is incredibly misleading. AT&T had become a defacto monopoly approx. 20 years prior to that consent decree, not to mention being subjected to an anti-trust lawsuit ten years prior to the decree.

The article makes the conceit of equating an online data store to the telephone infrastructure. Google is not preventing other companies from indexing the internet in any way, shape or form. There is no barrier to another company building a similar data store from scratch using the existing internet infrastructure. On the other hand, laying out your own wires has physical limitations, especially when another company has claimed the most optimal route. In such a scenario, your costs will always exceed theirs (more infrastructure to reach the same consumers).

Regarding forcing Google to use the same index: how does the DoJ plan to enforce this? How would the DoJ ever be sure that Google is in compliance? I once again assert that the law magically expects new technology to solve old-technology problems without fundamentally understanding how computers work. Computers are copy-on-write by design: deletion always requires an extra step, and verifying that only one copy of something exists is impossible in the digital domain.


> Google is not preventing other companies from indexing the internet in any way

AT&T wasn't blocking anyone from laying interstate copper. It's just prohibitively expensive to do so. The analogy, down to the network effects, is quite apt.


Did you read the article?


Well, the author is an idiot. A search index (as far as the compiled listings of things, not the tech that runs it) would obviously not be covered under this ruling.

Google can't have a monopoly on data gathered from the publicly available internet, that's absurd.


The difference in this case is that the usual defenders of private property rights and libertarianism feel as though Google is suppressing them, removing the usual roadblock to this kind of reform.

This is a rare instance where anti-corporate leftists and dejected right wingers could actually do something substantive together.

In most cases I would agree with you the defenders of private industry are pretty fierce in the US, but all that ideology sort of dissipates when you feel like you are being oppressed (whether or not, or to what degree it is true).


Some people are arguing that tailoring search results, or what content is allowed on a platform, to a specific set of political views makes those platforms publishers rather than mere platforms. Apparently this also has implications for some political campaign laws I don't really understand.

I think we're heading towards political ideologies having the same protection as religious institutions (which IMO, are exactly the same in all practical matters).


I agree that political ideologies are similar to religions and that protections may be extended to them (although they should already be covered by what is in the constitution). What I don't follow is why it matters that platforms are becoming publishers.

The government has never had the right to meddle in what publishers decide to publish (with exceptions for regulations on pornography and classified information). I can't imagine the government forcing Mother Jones or The Nation to print conservative viewpoints. If the platforms become "publishers" their power to select what information is available increases. They become liable for more of the content on their sites, but they also gain full control over content.

The only way I see to protect freedom of access to information would be to declare certain spaces (Facebook, Reddit, YouTube, etc.) as "privately-owned public forums" where suppression based on ideology would be heavily restricted.


I think it matters in the context of liability.

For instance, if you're a newspaper, and you publish "Politician X is a rapist" and you don't have a way to back up that allegation in court you're liable. As a platform, you're not liable for the content your users post.

If you are allowing people to say "Politician X is a rapist" but not allowing people to say "Politician Y is a rapist", you've crossed into the land of publisher in some people's minds.

Simply being a platform on the web is 'doing business' if you're getting revenue from that operation. Can a platform specifically reject a religious group from their site? Can a platform demonetize a religious group?

I personally don't believe in protected groups, but if we're going to have protected groups, we should apply the same protection to all groups, equally, even if the group size is 1.


I am personally all for freedom of information and also would like the concept of protected groups to go away, however...

Libel is incredibly hard to prove in court, and cases are usually brought in civil court, not criminal court.

Can a platform ban specific groups? I see where your argument is coming from. Commerce is allowed by the Constitution for the common good (to serve everyone). Could Walmart, for example, say that conservatives are not allowed to shop there? Definitely not.

However, it has recently been established, through a series of court cases culminating in the Gay Wedding Cake case, that you cannot compel speech from an individual that they don't believe in. Corporations in the US are often given rights similar to individuals. I can see this going to the Supreme Court under a similar argument with Google, etc. (you can't compel us to disseminate what we deem as immoral, anymore than you can make a Christian baker bake a gay wedding cake).

Ironic because leftists were livid with that decision, but it may be the one that allows private companies to censor you on their networks.


I think specifically in the case of platform vs publisher, there is federal law that gives some sort of extra legal protection to platforms. In order to get these protections, they have to behave a certain way. I'm not sure about this, but it's an argument I've heard.

> Libel is incredibly hard to prove in court, and cases are usually brought in civil court, not criminal court.

Going to court is expensive, even if you win. Thus, the protections for platforms allow an avenue for easy dismissal of cases, preventing them from going very far in the court system in the first place.

> Google, etc. (you can't compel us to disseminate what we deem as immoral, anymore than you can make a Christian baker bake a gay wedding cake).

Sure, but I think then they would qualify as a publisher and not a platform according to some. Compare the results from Google to Bing for the query 'Trump is a'. Did Google hand-tune their algorithm for their results? Did they tune their algorithm to weight more heavily organizations that share their political bias? EG, Prefer CNN over Fox News (or vice versa). Do some sites have insider knowledge to better game the system?

I can remember a time when there were no news results in a google search. It was a better internet.


>In order to get these protections, they have to behave a certain way. I'm not sure about this, but it's an argument I've heard.

This is commonly repeated but not true. In fact, the opposite is true.

https://www.law.cornell.edu/uscode/text/47/230

Scroll down to section C.

    1. No provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider.

    2. No provider or user of an interactive computer service shall be held liable on account of—
      2.a any action voluntarily taken in good faith to restrict access to or availability of material that the provider or user considers to be obscene, lewd, lascivious, filthy, excessively violent, harassing, or otherwise objectionable, whether or not such material is constitutionally protected; or
      2.b any action taken to enable or make available to information content providers or others the technical means to restrict access to material described in paragraph (1).[1]


I don't see how 2.a would stand in a courtroom.

What is 'good faith' and 'otherwise objectionable' in this context? Is removing all content from a religious group afforded in this definition, as long as the provider considers it objectionable?

'whether or not such material is constitutionally protected' has almost no meaning because almost all speech is protected.

Another interesting point is, while it's allowable by the law to restrict content you deem objectionable, is it allowable to promote content you prefer over other content? Or de-promote the content but not remove it?

Is it acceptable to remove religious content with no explanation as to why? Is it acceptable to ban or demonetize a religious group without explanation?


I don't think it makes sense for us to debate "what would stand in a courtroom" since this has been the law of the land for more than two decades and has been able to "stand" just fine so far. We could debate whether or not the law should be repealed or amended, but the consequences of the law as is are pretty clear: online platforms are not responsible for content posted by a 3rd party and are free to curate content on their platform at their own discretion.

> is it allowable to promote content you prefer over other content? Or de-promote the content but not remove it?

Legally it is allowable since the law makes no distinction between "promotion" and "curation".


Google search is by far the company's most profitable service or product. Compare the return on putting a dollar into search over a year vs. putting a dollar into Android or YouTube. You can legally force a company to work against itself, but we may have to say goodbye to all the side projects.

Also, this didn't sit well with me:

"Access. There might have to be limits on who can access the API. We might not want every high school hacker to be able to build his or her own search platform. On the other hand, imagine thousands of Mark Zuckerbergs battling each other to find better ways of organizing the world’s information."


I don't think this would help our social fabric, which the article says Google is "tearing apart." Why do we need to break Google's monopoly on search again? Which they don't actually have.


Well, look at what they mean: whenever people complain about tearing the social fabric, it usually means "other people aren't conforming to how I think they should; this is new and I don't like it, so therefore it's all its fault and we're doomed if we don't get rid of it".

Given that the previous culprit of that exact same charge was gay marriage, it is clearly a meaningless "family values"-style euphemism to try to make their complaints seem valid.


Making the index available would be very valuable for scientific research. When I was a Googler, we used the index (and Google Scholar's index) to do all sorts of interesting science projects (DNA search, gene search, etc.), but we couldn't publish the results for various reasons. If the index (or a fragment of it) were available sitting in parquet files (or possibly a better format for indices), you could easily sit around running Spark jobs to extract all sorts of interesting web data.
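As a toy stand-in for that kind of job (the record schema here is made up; a real dump would be read with Spark from parquet rather than from an in-memory list):

```python
import re

# Hypothetical index records; in practice these would come from parquet files.
records = [
    {"url": "bio.example/seq1", "body": "sample reads: ATCGGCTAAT and GATTACA"},
    {"url": "news.example/a", "body": "quarterly earnings beat expectations"},
]

# Crude DNA-sequence detector: uninterrupted runs of 7+ nucleotide letters.
DNA = re.compile(r"\b[ACGT]{7,}\b")

# The "map" step of a DNA-search job: scan every record for matches.
hits = [(r["url"], m) for r in records for m in DNA.findall(r["body"])]
print(hits)
```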


There are tons of projects with a serious, sizable open-source index, for example http://commoncrawl.org/. Even crawling 15B pages on AWS is not super expensive on your own using a well-established OSS tool chain. Bing already provides APIs if you don't want to do crawling.

The web index, while important, is by no means the most important part of a search engine (that, I'd say, is relevance). The OP article is written by someone with little clue that a search engine involves a massive amount of technology beyond the index and even relevance - everything from spell correction to recommendations to query rewriting to answers to segment-specific searches like images/video/maps/local/product/news/entertainment/events, on and on. This is above and beyond the gigantic infrastructure needed to run all of it at scale, with speed, and cost-effectively. All of these pieces need to work harmoniously with each other and be designed with the weaknesses and strengths of each component in mind. For example, the index for news must be refreshed almost in real time, and its relevance needs a higher emphasis on location.

Search is not free, and no one has figured out whether one party could provide just the index while another provides the relevance, with everything working as efficiently as before (data point: each query consumes 0.3 watt-hours of energy at Google).
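To give a flavor of just one of those components, here is a bare-bones edit-distance-one spell corrector in the style of Norvig's well-known essay, with a toy vocabulary standing in for real query-log frequencies:

```python
def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

# Toy frequencies; a production system would use query logs and web text.
vocab = {"google": 1000, "search": 800, "engine": 500}

def correct(word):
    """Return the known word, or the most frequent candidate one edit away."""
    if word in vocab:
        return word
    candidates = edits1(word) & vocab.keys()
    return max(candidates, key=vocab.get) if candidates else word

print(correct("serch"))   # search
print(correct("googel"))  # google
```

A real spell-correction system layers on phonetics, keyboard-distance models, and language context, which is part of the parent's point about how much machinery sits beyond the index.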


The word "index" suddenly gave me an idea: the entire internet transformed into tabular data, searchable and open to the public.

Even if we got Google's data, there's a whole lot of scraping needed to transform irregular and disparate sources. Typically you have to google a keyword, look through the search results, visit different websites (research mode), and then consolidate separate sources of truth to build your own understanding. Imagine instead building a scraper for each website in the search results and displaying the website's complete data as a table. Why click through hundreds of pages of profiles or data when you can have it all in one view? Why bother with HTML, when all a data-centric individual desires is the data? Getting to the data is tedious and a long journey: scrape the website, clean the data, make it available for consumption, schedule and consolidate updates. For instance, a hedge fund might scrape a certain group of websites to execute market orders for an automated trading system.

A tabular-data-focused search engine would return tabular data; there would be no HTML medium, just straight-up raw data. For instance, instead of seeing these comments rendered in a normal browser, imagine a table listing every username, post time, and comment, minus the hierarchy.

To build a focused crawler quickly, I came up with Web Scraping Language (https://scrapeit.netlify.com), and essentially what I want to do is hire people to write WSL to scrape the web and then sell a subscription to data-centric customers.

What do ya say HN?
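As a stdlib-only sketch of the "comments minus the hierarchy" idea (the markup and field names below are made up; WSL itself presumably compiles down to something similar):

```python
from html.parser import HTMLParser

# Hypothetical comment markup; a real site's structure would differ.
PAGE = """
<div class="comment" data-user="alice" data-time="12:01">Nice writeup</div>
<div class="comment" data-user="bob" data-time="12:05">Agreed</div>
"""

class CommentScraper(HTMLParser):
    """Flatten comment divs into (user, time, text) rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._current = [], None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "div" and a.get("class") == "comment":
            self._current = [a.get("data-user"), a.get("data-time")]

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.rows.append((*self._current, data.strip()))
            self._current = None

scraper = CommentScraper()
scraper.feed(PAGE)
print(scraper.rows)
```

The hard part, as the parent notes, is writing and maintaining one of these per site; that's the labor a scraping DSL is trying to compress.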


> Fortunately, there is a simple way to end the company’s monopoly without breaking up its search engine, and that is to turn its “index”—the mammoth and ever-growing database it maintains of internet content—into a kind of public commons.

Partially bull__it - even setting aside the $ that anybody would have to pay Google for the I/O traffic to scan its index (on which Google would put its fair % of earnings, for at least the I/O generated plus maybe the "discovery" service for new URLs), any competing search engine would only have access to what Google decides is worth indexing => OK from the point of view of a competitive market (more search engines could rely on the same source data and potentially extrapolate different results), but in many cases not OK for end-users, as every search engine using such a service could only use the data Google had decided was worth indexing (therefore a "closed bubble" set by Google).

If I were Google this proposal would probably be OK (more $ earned to cover the fixed costs of indexing by selling I/O to outsiders and at the same time it would lower the pressure/attention from regulators).

On the other hand, non-indexed information - like everything that comes from the Android OS (e.g. users' positions, which apps are used by whom and when, etc.), usage and actions on web pages tracked by Google Analytics, DNS access (e.g. 8.8.8.8), and so on - would be excluded. These are nowadays probably the most lucrative sources of info about "what's happening now" and "what are people doing now, and where", which in turn generate the best answers to what most users want to know, which is mostly about "now" and their regional area.


> Google is especially worrisome because it has maintained an unopposed monopoly on search worldwide for nearly a decade. It controls 92 percent of search, with the next largest competitor, Microsoft’s Bing, drawing only 2.5%.

1. The definition of the word "monopoly" is not applicable to this situation.

2. Property expropriation reduces the competition within the society, while also being unethical.


I don't think this would do anything. Indexing the web is comparatively easy next to querying that index to find results relevant to the user's query.

On top of this, Google search is as good as it is thanks in no small part to business practices that make Google as a company unsavory. Tracking your browsing habits, reading your emails, and watching your geographical location all feed the search algorithm, providing more seemingly prescient search results. For people who are OK with this, I don't see how an alternative can show up and compete with the years of data Google has already collected on a given user.

Toppling the Google monopoly is going to be more about a cultural shift than just giving competitors the data and algorithms they need to become viable. Users need to be made more aware of how Google uses their data, then alternatives can pop up marketing themselves on how they do (or don't) leverage user data. I'm skeptical that enough users will actually care to make this difference, at least in the short term.


It might be true that Google is in such a dominant position that greater society needs to think about regulation, transparency, etc., but I think we need to develop terminology that clearly differentiates this form of dominance from an anti-competitive monopoly. This is not a monopoly in the standard sense, and since anyone can crawl the web and build their own index, Google isn't hindering anyone from competing.

Through hard work and good strategy they have achieved a dominant position, and I think that as a society we should both honor that and recognize that they have been pretty good players in the market. At the same time, if Google were instead run by the Koch brothers we could be in dire straits, and so we should also realize that we can't safely leave such power in the hands of private citizens without demanding accountability, transparency, openness to scrutiny, open dialogue, etc.


> This is not a monopoly in the standard sense, and since anyone can crawl the web and build their own index Google isn't hindering anyone from competing.

It's close. It's as close to a monopoly as MS was with IE built into Windows 98 - while you can find and use alternatives, it's on you to do this, and Google is the default all over the place.

Remember, too, how much power this gives Google - the top five search results for a keyword are intensely powerful, for everything from directing commerce to a particular site to controlling news narratives, and more. Companies pay millions for SEO "professionals" and software like Moz / Stat to ensure they're in the top rankings. The latter in turn need to defeat Google's anti-bot measures to get search results, so they're in an arms race of their own...

The whole ecosystem is dirty and gross and gives too much power to too few people with too much money. It'd be fine if the average consumer was conscious of this and could reasonably make the choice to use a different engine, but this is highly controlled as well - think of the Google donations to Mozilla that they may not be able to live without.


I don't personally think Google's 'index' becoming public would make a single bit of difference on its position as number one search engine (and therefore advertiser) on the internet, unless you expand that word to include all the algorithms and infrastructure around it.


Ex-Googler here. This proposal makes no sense to me.

Anybody can write a web-crawler and get a searchable indexed snapshot of the web. What Google is good at, is ranking your search results based on relevance and quality so the best results are at the top and you never see the terrible results.
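To illustrate why the crawling side really is the (relatively) easy part: extracting the outgoing links from a fetched page takes only the standard library. A minimal sketch, not production crawling code:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href> tags on one page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

page = '<a href="/about">About</a> <a href="https://example.org/x">X</a>'
parser = LinkExtractor("https://example.com/")
parser.feed(page)
# parser.links -> ['https://example.com/about', 'https://example.org/x']
```

The hard part is everything this sketch leaves out: politeness, deduplication, spam detection, and above all ranking what you fetched.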


I am no expert on Google search, but it seems to me that "accessing the index" and "retrieving ranked results" must be one and the same. That is, the very structure of the index is designed around a particular ranking scheme in order to make accessing it computationally tractable. In light of that, how would one expose an API to offer access to the index without having something about Google's curation or ranking built in. A query API is clearly out because query responses include some results and leave out others, but even a dump runs into the problem that the data was collected, parsed, and structured with a particular scheme in mind.
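To make the entanglement concrete, here is a hypothetical toy sketch (the "score" values are invented for illustration): even a tiny inverted index typically stores its posting lists pre-sorted by some ranking signal, so any dump of the structure necessarily leaks the ranking scheme that shaped it.

```python
from collections import defaultdict

# Toy corpus: doc_id -> (text, precomputed ranking score).
# The scores are made up; in a real engine this ordering bakes in
# whatever ranking signals the operator chose.
docs = {
    1: ("cheap flights to rome", 0.9),
    2: ("rome travel guide", 0.7),
    3: ("history of rome", 0.4),
}

index = defaultdict(list)
for doc_id, (text, score) in docs.items():
    for term in set(text.split()):
        index[term].append((score, doc_id))

for postings in index.values():
    postings.sort(reverse=True)  # best-scoring docs first

# A query just reads the head of the posting list: the ranking
# is already embedded in the index layout itself.
top = [doc_id for _, doc_id in index["rome"][:2]]
# top -> [1, 2]
```

So "just the index, no ranking" may not be a coherent thing to demand.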


The article mentions a precedent: "in the 1956 consent decree in the U.S. in which AT&T agreed to share all its patents with other companies free of charge."

Today the picture is completely different. Google, Amazon, Facebook and Apple are global - their infrastructure is virtual - not physical lines and cables. If a government wants to cause harm to one of those companies, they can decide to "move" somewhere else.

In addition, GAFA captures dollars from everywhere in the world and brings them to the US. It's in the US government's interest to tread softly here. No matter how the media spins things, GAFA is great for the US economy.


Uh no. Google has hundreds of billions of dollars invested in real, non-virtual assets, the majority of which are in the USA.


Sure, however they are not directly tied to their bread and butter - pay per click. They could over time divest and move data centers and jobs elsewhere without directly affecting revenues. Would be painful and risky but feasible.


Can someone explain why it's fair or acceptable to force Google to make its index public - the one it built with its own time and money - instead of having some open-source initiative build another index from scratch?


It's posted a few other places in this thread, but it already exists: https://commoncrawl.org/


There are problems with this article.

First, Google's competitive edge never came from its index. It came from its ability to return good matches to queries.

Second, forcing Google to share information like this might be considered a government taking of Google's property, which would require a massive amount of compensation.

Third, it comes with the condition that Google isn't allowed to remove anything from its index. This flies in the face of the right to be forgotten, copyright law, child-exploitation-material law, etc.

Fourth, it says Google must allow people to add to the index, which is just an invitation to spam it with garbage.


Bloomberg talks about 'making the index public' and proposes doing it with an API. But an API behaves like Google's website; it does not make the underlying index of web content available. Google's new AI-based algorithm is trained by humans with a strong bias. An example of this is the disappearance of websites of respectable medical doctors who use non-mainstream methods. It is one thing to agree or disagree with someone, but completely dropping a website from search results is effectively censorship - and who is Google to determine who sees what?


"Facebook is controlling people's minds and ripping apart the social fabric. Also, you'll notice some 'share' buttons to the right of the article. Remember to 'like' us!"


"High-volume users (think: Microsoft Corp.’s Bing) should pay Google nominal fees set by regulators. That gives Google another incentive for maintaining a superior index."

How do these nominal fees hold a candle to the advertising dollars they rake in? Their dominance in advertising fuels the R&D that keeps their algorithms superior, and they have to maintain that superiority to keep the advertising dollars flowing. If suddenly anyone can offer Google results, a lot of that ad money dries up. Then the R&D money dries up.


Similarly, I've been thinking lately that the only realistic way to break Facebook's monopoly is to target its network effect, by forcing it to interoperate over some standard protocol, like ActivityPub.

So, say, people could use Google+ (if it was sticking around) to see Facebook content without giving data to Facebook.

Not sure how that would work with Twitter, I guess they'd just show media and clips of text posts.

Honestly I think I'd love it - just choose your favourite social network and only use one.

Up to each network to implement an interface that makes sense.
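For a sense of what that interop could look like: ActivityPub already specifies a JSON activity format, so a cross-network post is just a document any compliant server can consume. A hand-written example - the actor and object URLs are made up:

```python
import json

# A minimal ActivityStreams "Create" activity of the kind an
# ActivityPub server federates to followers' inboxes.
# All URLs here are hypothetical.
activity = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Create",
    "id": "https://social.example/users/alice/activities/1",
    "actor": "https://social.example/users/alice",
    "to": ["https://www.w3.org/ns/activitystreams#Public"],
    "object": {
        "type": "Note",
        "id": "https://social.example/users/alice/notes/1",
        "content": "Hello from another network!",
    },
}

wire = json.dumps(activity)
# Any federated peer - Mastodon, or a hypothetical Facebook bridge -
# could parse this without ever touching the origin server's UI.
parsed = json.loads(wire)
```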


Just "Google it" means to find an answer, if that answer is being changed by a megacorp for unknown and sometimes nefarious reasons, I'd say we have a problem.

Would you want to use a library that filters its books to push an ideology? Would you want to listen to the radio if they only pushed top 40? Would you want to have a doctoral thesis that needs corporate approval? Research books that exclude facts based on corporate profits?


Is Google's Search quality that much better than Duck Duck Go or Bing? I get about the same quality of content when using any one of the three.

All three are terrible at serving me up websites to buy crap when I typically want to learn about something. I would love a filter that says "Don't try to sell me anything"!

I believe Google's biggest advantage is in their marketing & that the word "Google" now is a synonym for search.


I don't know the answer to this - you could be right, but asserting that Google's search quality is no better than DDG's or Bing's based on a single data point is pretty foolhardy. Your second statement reinforces this point - I've run a lot of Google ads, and they're quite effective at selling things to people.

In the future, when coming up with an opinion about something, I'd encourage you to look at statistics and combine that with your own personal experience. You'll often learn something you didn't know, and come up with a more grounded opinion.


I was looking for personal opinions to gather data from, hence the question. I was not trying to cite stats or persuade with anything more than my personal experience. It would have been very easy to link to articles such as this Yale study - https://www.networkworld.com/article/2225489/research-buries... - if I wanted to do that. FYI, it argues the two are fairly close.

Also, asking for personal opinions will give me much different results than looking up blind-test research, since brand loyalty plays a big part with engines like DuckDuckGo.


Why does everyone assume Google is automated? A single website containing a small amount of knowledge or even a leaked archive of Diebold emails will show lower on Google search results than a web forum or some listicle.

The real web crawlers are people who surf the web; Google likely takes information from analytics and outbound-link tracking to determine the most popular (not the most informative) websites.


Forcing them to give anyone access to their index at a fair price wouldn't be unprecedented. In Europe that's a common method to deal with infrastructure: anyone can lay a phone line, but you have to sell access to competitors at a fair price.

Alternatively we could just publicly fund a crawler. A project like Common Crawl, just bigger and better, with some simple indexes prebuilt.
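Common Crawl in fact already ships prebuilt URL indexes queryable over HTTP (the CDX index API at index.commoncrawl.org). A sketch of building such a query and parsing one line of its newline-delimited JSON output - the crawl label and the sample record here are illustrative, not live data:

```python
import json
from urllib.parse import urlencode

# Build a CDX index query against one Common Crawl monthly crawl.
# The crawl label below is an example; real labels are listed at
# https://index.commoncrawl.org/
crawl = "CC-MAIN-2019-30"
params = {"url": "example.com/*", "output": "json"}
query_url = f"https://index.commoncrawl.org/{crawl}-index?{urlencode(params)}"

# The API returns one JSON object per line; a typical (abridged,
# made-up) record looks like this:
sample_line = json.dumps({
    "urlkey": "com,example)/",
    "timestamp": "20190716120000",
    "url": "https://example.com/",
    "filename": "crawl-data/CC-MAIN-2019-30/segments/...warc.gz",
    "offset": "1234",
    "length": "5678",
})

record = json.loads(sample_line)
# offset/length point into a WARC archive on S3, so you can
# range-request just the bytes for this one capture.
byte_range = (int(record["offset"]),
              int(record["offset"]) + int(record["length"]) - 1)
```

So a "bigger and better" public crawler is mostly a question of funding and scale, not of inventing new machinery.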


There is already Common Crawl[1]. Nobody has managed to build a useful search engine with it; apparently it takes much more to produce meaningful search results. Making Google's index public won't solve anything. What we really need is a viable competitor to Google.

[1]: https://commoncrawl.org/


Wouldn't help.

A classic antitrust breakup might.

- Google Search - no account required or offered.

- Doubleclick - ads on third party sites, not including Google sites.

- Alphabet Services - mail, docs, etc. - anything that requires a login or pay.

- Alphabet Heavy Industries - cloud, large scale business services, etc.


Making google's index public may not be as simple as sharing a file.

We don't exactly know how google indexes or ranks pages. It might be a completely different way of looking at indexing, it might be optimized for google's use cases such that we don't even have the software or the hardware to instantiate and search through it.


There is no barrier to entry in indexing the web, and forcing Google to make its index public through regulation would be unethical. Google has its problems, and we should attack them, but not like this. It is not a monopoly; it is just the largest player because it attends to what the public wants.

Let other indexes such as DuckDuckGo win through the market.


The data that Google has hoarded that I want more than anything else is trends, broken down by results-page number. As interesting as data on how often people search for a term is, I find it far more interesting when they can't find the thing they're searching for. That's a hole in the market.


Article on bloomberg wants to make something public and wants money from me simultaneously.


this guy doesn't understand search, google or reasonable policy. thanks for the lawls


All my results on the first screen on mobile and web are just ads... The good old days of discovering new websites and new ideas went away after the top results ended up being just ADS or super branded sites that we already know about...


Google's dominance comes less from its search index, or anything else technical, than from having amassed a large userbase in the early days of the internet, when no other search engine came close.


It would be even more useful if they made all the Google Maps data available: locations, points of interest, etc. I think that right now you need a credit card attached to your account just to get a slice of that data for free.


This sounds eerily like what happened with the railroads, and I suspect it would have some unintended and unsavory side effects. At the very least, it would open up opportunities for the Wesley Mouches of the world.


The fundamental tension of technology monopolies is convenience v. competition. In the same breath that Google and Amazon are criticized for their dominance critics bemoan the fragmentation of online streaming.


I'd like to break Microsoft's monopoly on the desktop OS. And GitHub. And Skype. And LinkedIn. And Azure.

Search is virtually the only industry Microsoft operates in where they aren't the monopoly or the biggest player.


If only replacing Google would be as simple as building another search engine. It’s like trying to beat WhatsApp by launching another messaging app. Any kid could do that. The tech is not the hard part.


The key to Google's monopoly is not only its indexing and ranking capability, but also the brand, the Chrome browser, Android, and the integration of all its other products.


The index is the vast majority of their infrastructure. Serving queries from external infrastructure simply doesn't make sense; the bandwidth isn't there.


And then, next thing, people will want an API into this index because it gets updated frequently... tadaaaa, GOOGLE SEARCH 2.0.


An API is not the same as making the index public. I don't think the author has a grasp on all the technical details.


It would definitely make Siri and Alexa more relevant if they had access to Google's knowledge graph.


Another relevant option would be to ban targetted advertising, both online and offline.


“Information wants to be free!”


If you take away the moat of any business it helps their competition.


The solution to technopolies is decentralized protocols.


Yes please, a new Golden Age for Black Hat SEO!!!


I don't see how this could reduce Google's "Monopoly".

Searching via an API is the same as searching from the browser; what matters is how the results are sorted. Garbage will be garbage, and should be demoted in both the API and the HTML version. And of course, downloading the whole index is out of the question.

I think that a better way to improve competition would be reducing cloud computing costs (bandwidth in particular). But of course, regulators in USA would say that it is communism or something... Sigh.


Is DuckDuckGo's index public?


Boo


add Bing to the Index


This suggestion is about as intelligible as saying back in 2006: "To break Microsoft's monopoly in the OS market, make its kernel source code public." (i.e., absolutely useless)


First, Google doesn't have a monopoly on 'Search'. Competitors provide entirely comparable products and are not that far off from what Google provides. Other industries? Maybe. Search? Not in the slightest.

Second, that is a terrible idea. It would be a fucking crisis if Google's entire index was made public. I don't think the author understands the scope and nature of what Google indexes, as well as the protections Google has implemented.


It takes billions of dollars to get a new search engine to the same level as Google.

Candidates with billions to spare (!) are Apple, Facebook, and Microsoft. The first two never started one, and Bing is quite poor so far. My default search engine is Bing. I wish it were Apple :)


Bloomberg Opinion outside of Matt Levine doesn't have credibility to me.


[flagged]


Bloomberg are obviously rather pro free-enterprise

Though I'd say totally free enterprise alone doesn't actually work. It naturally leads to monopolies (as is pretty much the case with Bloomberg terminals, for example), and to the law of the jungle in general. If that's the kind of society you want, head to many third-world countries.


> Bloomberg seems owned by some behind-the-scenes communist group

Bloomberg, LP is owned and controlled by Michael Bloomberg [1]. I can't believe I have to write this, but Michael Bloomberg is not a communist. (Nor is the author's proposal remotely communist.)

[1] https://en.wikipedia.org/wiki/Bloomberg_L.P.#cite_note-6


Yes, Bloomberg is run by commies. Bloomberg.


No sure if relevant but I've hated google's search for a unique reason: I find it horribly slow, especially compared to what it used to be. It's so slow, I've designed + partially implemented an alternative for my own use: https://github.com/Jeffrey-P-McAteer/dindex

I've only tested with 1000 records, but the query times are all <200ms.


Maybe it's your part of the world, but my TTFB to google search pages is less than 150ms and the "generated in x seconds" is usually less than 1 second. That's pretty good for an index searching effectively every public internet page.


I measure from when I hit enter to when my screen is full of results, and just _rendering_ google.com takes a full second on my MacBook with 8GB of RAM and an i5 processor. It's so bad I have a shell script which forces my processor into the 3200MHz range when I'm on my browser workspace. When I'm not looking at my browser (+no downloading files, no audio) the same script sends a SIGSTOP to it so it isn't eating CPU cycles while I'm writing code.


How does your performance scale? If it scales linearly in records, you're going to have trouble if you're only below 200ms on 1k records. If it scales logarithmically, there are still a couple orders of magnitude between a comprehensive index like Google's and your 1k record index, which is still going to give you trouble.

In short, 200ms is slow, not fast.

EDIT: Just to clarify, I'm not saying it's necessarily easy to be faster than 200ms (I'd need to look closer into fuzzy text searching algorithms to be able to make an educated guess here); I'm just saying it's not fast enough.
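A back-of-envelope comparison of the two growth curves described above - purely illustrative numbers, assuming the parent's 200ms on 1k records:

```python
import math

base_n, base_ms = 1_000, 200.0
target_n = 1_000_000_000  # roughly "comprehensive index" territory

# If query time grows linearly with record count:
linear_ms = base_ms * (target_n / base_n)

# If it grows logarithmically (tree- or binary-search-like access):
log_ms = base_ms * (math.log2(target_n) / math.log2(base_n))

# linear_ms works out to ~200,000 seconds per query;
# log_ms stays around 600ms - still slow, but survivable.
```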


Thanks for the insight; indexing the entire web is not a goal of dindex. Instead, it's designed to work like DNS where you can host your own server + have it federate queries to other servers if it doesn't have records which match a query.

Basically it moves control to the client; if the client wants to query servers across the globe there's nothing stopping them, but by default you only query servers in your config file. Organizations like the new york times, universities, and governments would host servers for the same reason they host their own email infrastructure.

It also moves a ton of control to publishers; dindex doesn't crawl the web unless a client explicitly wants to.


But yeah, while performance is a huge goal of mine, at the moment I don't even cache compiled regular expressions. It's definitely an alpha-stage project.
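For what it's worth, memoizing compiled patterns is a one-liner in most languages. A Python sketch of the idea (dindex itself is Rust, so this shows the shape of the fix, not its actual code):

```python
import re
from functools import lru_cache

@lru_cache(maxsize=1024)
def compiled(pattern: str) -> re.Pattern:
    """Compile each distinct pattern once and reuse it across queries."""
    return re.compile(pattern)

# Repeated queries with the same pattern now skip recompilation;
# compiled() returns the identical pattern object each time.
hit = compiled(r"\bfoo\w*").search("some foobar text")
```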


It's easy to achieve good performance on small datasets. But we are talking about Internet scale! And not only that - Google serves responses to millions of people at once.

The index takes a lot of disk, memory, computing power, bandwidth... And then you have to handle attacks, spam, and dead websites, all without going bankrupt in the process.

I don't know about you, but my best toy web server only manages 100k reqs/second on localhost when serving "pong"...



