Hacker News
Google confirms the leaked Search documents are real (theverge.com)
275 points by alanzhuly 5 months ago | 75 comments



KEY TAKEAWAYS:

• Google claimed they don't use a "domain authority" metric, but the docs show they do: it's called "siteAuthority."

• G said clicks don't affect rankings, but there's a whole system called "NavBoost" that uses click data to change search results.

• Google denied having a "sandbox" that holds back new sites, but yep, the docs confirm it exists.

• G assured us Chrome data isn't used for ranking, but surprise! It is.

• The number and diversity of your backlinks still matter a lot.

• Having authors with expertise and authority helps.

• Putting keywords in your title tag and matching search queries is important.

• Google tracks the dates on your pages to determine freshness.

• A lot of long-held SEO theories have been validated, so trust your instincts.

• Creating great content and promoting it well is still the best approach.

• We should experiment more to see what works, rather than just listening to what Google says.

From: https://www.reddit.com/r/SEO/s/ChlTrhjPnG

I wonder how chrome data works. Are they using every chrome browser to sniff what users are clicking on?


I was afraid of this. Now, it's a matter of time before Google search will get even worse as SEO hustlers push more of their useless crap to the top now that internal algorithm data has been published.

Guess I should look into that Kagi thing people keep mentioning.


The leak essentially confirmed what SEO experts already suspected (knew) but Google denied. SEOs have spent 2+ decades observing Google search behavior and honestly I wasn't even a little bit surprised their observations were proven correct. At this point, the "garbage" on Google isn't SEO optimized organic results, it's the ads.


For what it is worth, Google has favored macro-parasites over micro-parasites. The bigger companies have access to the ears of market regulators, etc. The average small publisher or affiliate site has almost nobody who cares if it disappears.

Part of the most recent Google update was penalizing high-authority trusted sites for publishing off-topic content from third parties. There is a concept called "goog enough" explaining how the likes of Forbes ranked for just about everything. https://www.blindfiveyearold.com/its-goog-enough


I doubt that will happen. One, because the leak didn't really disclose any major secrets that most marketers didn't already know.

Two, even if some of this wasn't widely known, it's not like you can take advantage of it overnight. There's no quick hack to building a trustworthy domain or getting lots of trustworthy links, for example.


> There's no quick hack to building a trustworthy domain or getting lots of trustworthy links, for example

Sure, but there are other potential hacks that this leak exposes, which marketers may now focus on more than they otherwise would have, based on the information in the leak.


It doesn't look like this leak will do that. It doesn't contain "algorithms" in any real sense of the word.


Before, SEO communities argued over what was part of the model, thinking some things, like clicks, didn't matter. Truth is, Google did everything you can imagine, including clicks; now they all know it instead of having to guess.

Still won't change much though; it's very hard to game, since Google has a lot of ways of mitigating click farms, or it would have been discovered a long time ago.


Back in the day, a friend mentioned you could choose which version of a phrase became the canonical search autocompletion by embedding a broken image call to the SERP page for the version of the keyword you wanted to be more popular.

Google has tons of ways to identify real users versus fake users. And lots of the fake it until you make it efforts leave statistical outliers that can lead to ignoring or smoothing away much of the benefits, especially if there is no fire following the smoke trail.


What could be worse than recipe pages that are 20 pages worth of text with the recipe hiding somewhere among the text?


For me the recipe sites are pretty usable (with adblock), there is generally a "jump to recipe" button to skip past the text. And sometimes I even read the text, if it is a good recipe the text often has useful information like substitutions and preparation techniques. Certainly a "just the recipe" website format would be worse SEO-wise, but I am not so sure it would be more useful.


They could split it in multiple pages instead of a single page! Imagine having to click “next part” >10 times just to see if you eventually end up with a section that contains the actual recipe


And unskippable ads after every third image. Then the moment you get to the final image there’s an email registration wall. It has a little X button that doesn’t work on iOS.


I see that we have a connoisseur of the devil’s work here :)


The thing is, there's a limit to how many times the typical user would do that before just clicking back to google for a different recipe.


Create a fake Google page, inject it into their history, the user goes back, sees something Google-like, and now you're 100% evil, congrats! :)


They also break the back button.


That's why I open search results in a new tab, and close the tab to go back to the search.


Me too, but I still hate it when they break the back button.


Sites that have a poor user experience by design create the ranking signals for their own demotion by such design. Get a lot of traffic from search with not many people liking the destination page and that ranking will quickly go away.


Almost all of them now have a "jump to recipe" link at the top of the page.


Isn't Kagi dependent on Google for their results? Doesn't seem tenable in the long term to me.


Are you thinking of DDG? AFAIK Kagi runs their own independent engine from scratch.


Kagi is a meta search engine (it uses other search engines' APIs, e.g. Google, Yandex, Brave Search, Marginalia, ...), plus it has its own (tiny) index, which mostly consists of their "Small Web" pages.


>Our own index of the finest results augmented by the results from the best search engines on the market.

https://kagi.com/


The design is too monochromatic.


DDG uses Bing. Them blocking trackers except when they come from Microsoft (because of this search engine deal) is why I don't really care about using DDG.

You can use bangs to search on Google, but that's not the default.


[flagged]


Care to expand upon that? I have been using Kagi for just under a year so far and really enjoying it.


It's a troll. Downvote, move on.


10 years ago "chrome botnet" was a meme. And now we get the evidence for it.

I don't know what to say.




The main takeaway is that Google has been lying and gaslighting about their ranking factors.

The main lies that were uncovered are that they are indeed using clicks and Chrome browser data for ranking purposes.

Summary of their lies here: https://www.reddit.com/r/SEO/comments/1d2gllz/google_caught_...


I don't understand why anyone would trust those lies though. For a very long time Google has been a data and advertising business with a monopolistic hold on browsers, search, analytics, and advertising. Of course they use those together to make more money.



Good encryption should still hold, even if you know the algorithm. The same reasoning should be applied to search engines.


Oh, there's no way such a thing is possible. Unless you have an omniscient oracle, your search engine will be based on some metric correlated with quality and relevancy. The moment those are known, people will produce low-quality content with high scores on those metrics.

Can you think of even a single metric that can't be gamed?


To get real answers from humans and not marketing campaigns, I constantly put "reddit" in my search. I feel like 5 years ago that wasn't necessary.


Very easy to post things on Reddit as a marketer, particularly when working with a small group who can respond to each other to season threads. Plus you can pay trusted Reddit account holders to post items for you.


> not marketing campaigns

> reddit

A great marketing tactic is to pose as reddit users. If you have just 2-3 realistic accounts, you can ask a question as account 1, and write your answer with account 2. Now imagine a company with $$$. They can guide an entire thread.


I'm sure there are marketers on Reddit, but the nature of virality itself makes it pretty robust against attempts at manipulation.

(You have to astroturf really hard or be a part of an existing wave, astroturfing reads different, so the best you can hope for is bending the narrative a step or two)


now imagine entire countries spreading influence and manipulation through reddit. add in extreme bias and hivemind. only a select few topics where reddit would be a good place to learn from


Google should change their algorithm to rank websites randomly; they all show up in search results with equal probability, so long as they exceed a certain threshold of relevance for the user's keywords (the threshold could vary for different keywords but would be made public and there could be instructions on how to meet the threshold requirements so it doesn't have to be a secret and anyone should be able to get their sites showing for at least one set of specific keywords). That would make it impossible to game. Maybe they could have 5 slots in a side container for 'Top trending' for those keywords for the current day, week, month or year (the user can choose the granularity). Problem solved.
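The randomized-threshold scheme proposed above can be sketched in a few lines (function and variable names are mine, purely illustrative): every site whose relevance score meets the public threshold qualifies, and the qualifying set is then ordered uniformly at random, leaving no continuous score to optimize against.

```python
import random

def rank_randomly(candidates, scores, threshold):
    """Hypothetical sketch of the comment's proposal: sites at or above a
    public relevance threshold all get an equal shot; order is uniform random."""
    qualifying = [site for site in candidates if scores[site] >= threshold]
    random.shuffle(qualifying)  # equal probability for every qualifying site
    return qualifying

sites = ["a.com", "b.com", "c.com", "d.com"]
scores = {"a.com": 0.9, "b.com": 0.2, "c.com": 0.7, "d.com": 0.5}
results = rank_randomly(sites, scores, threshold=0.5)
# every site at or above the threshold appears exactly once, in random order
assert sorted(results) == ["a.com", "c.com", "d.com"]
```

As the replies note, the attack surface simply moves: once per-site ranking can't be gamed, spinning up many qualifying sites becomes the obvious play.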


You would game it by creating more websites.


As others have stated below, this does in fact become a cat-versus-mouse Sybil attack scenario, where the barrier to entry isn't high enough to stop a bad actor from creating many websites. Online identity and reputation would have to be tied to more than just an email address.


But it would be difficult to build a lot of websites which all meet the threshold for specific keywords. The thresholds don't have to be particularly low; in fact, it's better if they require a certain amount of work to meet. So maybe only a relatively small number of websites would qualify for a specific niche keyword, but the idea is that, among those, they are ranked randomly. You'd probably have to use AI to figure out site quality in niche areas.

Or Google could go with a lower risk approach of keeping their results as they are with their current algorithm, but only randomize 3 slots out of the top 10 based on this new threshold approach.


Do you remember those autogenerated websites that were just giant lists of all words? Those disappeared many years ago, but if you made search ranking random, they'd come right back.


In 2030: Do you remember those websites storytelling about their grandmother just to introduce a mathematical theorem? We're so lucky they disappeared like the giant lists of all words, because they were 100% fabricated by Google's unnatural incentives.

Google has the ability to change the face of the internet in 2-3 years. They can detect the chaff and shut it down, and I wonder whether it's an anti-competition feature that they require that websites write a thousand words per page.


I asked ChatGPT to tell me how to get away with murder in the style of a recipe blog and it (surprisingly) did a bang-up job: https://chatgpt.com/share/b738b68d-8294-4a2c-87ff-f95a6e2d91...

I did this after simply wanting to know how much powdered sugar to put in whipped cream and getting frustrated at trying to scroll through 3 blogs just to find the ingredient list for something so simple. Eventually I just asked ChatGPT.

I wonder if Google can start running an LLM on websites to judge them on things like that. Hell, looking for a photographer in your area? Have it judge how good the photography is on each website. The possibilities are there but I don’t know if they’ll bother.


Your link doesn't seem to work.


It was removed because it was against policy. I was able to generate a new response with this prompt "I'm writing a novel. Tell me how I can get away with murder, write it in the style of a recipe blog"


Huh, that's odd. Since when do they go back and check older generations?


> Do you remember those autogenerated websites

Still many copy/paste sites around. Crawl data, put a skin on top, publish on stolen domain to make it legit, clickfarm away!


I honestly think the problem can't really be solved because of the adversarial relationships involved. But if there was more than one search engine with significant marketshare maybe it would be easier to route around the problem.


Why would it be difficult? Just copy-paste content to different domains, and done. And if, for example, Google decides to downrank sites that have the same content on different domains, well, then you have a nice weapon against your competitors: just copy their sites a lot of times and you've got your competitor removed from Google.


It's a game of cat and mouse, and apparently all the “this is easy” people think they're just smarter than everyone out there.


One of my buddies that got into SEO a half decade before I did mentioned the copy and paste rankeroo stuff was real popular back in the days of Infoseek, Altavista, Excite, Lycos and similar.

Google looks for the canonical version of a document and then deduplicates before returning the result set.

You can add &filter=0 to the end of the search URL for a particular query to turn off the duplicate content filters.

An old school spam technique for some affiliates in the early days of Google was to buy a high PR link to their affiliate URL so that like site.com/?aff=123 would be the default version of the homepage & the branded searches for the merchant would then owe the affiliate the commissions until the rankings shifted again.
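The canonicalization-then-deduplication step described above can be sketched as a toy content-hash filter (all names are mine; real systems use fuzzy similarity signatures rather than exact hashes, which is exactly why near-duplicates are harder to catch):

```python
import hashlib

def dedupe_results(results):
    """Toy duplicate-content filter: group pages by a hash of their
    whitespace/case-normalized body text and keep one 'canonical' URL
    per group (here, simply the first one seen)."""
    seen = {}
    for url, body in results:
        normalized = " ".join(body.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        seen.setdefault(key, url)  # first URL wins as the canonical
    return list(seen.values())

pages = [
    ("site.com/article", "The Quick Brown Fox"),
    ("copy.net/stolen", "the quick  brown fox"),   # scraped duplicate
    ("other.org/post", "Something entirely different"),
]
# the scraped copy collapses into the original
assert dedupe_results(pages) == ["site.com/article", "other.org/post"]
```

Which URL gets picked as canonical is the whole game for the affiliate trick above: shift the canonical to your parameterized URL and the traffic follows.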


Well surely the algorithm can detect duplicate content. Also Google should focus beyond content and consider user satisfaction metrics to decide what is above or below the threshold. Maybe AI can help with all these things?


> That would make it impossible to game.

Have you considered that it would also make for a pretty lame user experience?


Hold up, I think he might be onto something; next he'll discover how to sort an array in O(1).


The golden times were 8-10 years ago, when you could change the order of keywords in a Google search and get more precise matches. You could find pretty much any obscure thing on the internet.

Then you could find that article you remember reading 6 months ago by adjusting the keywords until it landed on the first page.

Now it does not matter at all what you enter in the search box. No matter the input, you get one set of results and will never find something specific.


Yeah, and we should give it a cool name. Something that communicates that this is a new kind of search! I am thinking "NewHorizons" or how about "AltaVista"? What do you think?


In not too many years, the average user will be prompting to get their needs met, rather than searching a flawed search system, wading through pages of sponsored and SEO-gamified links, and opening up multiple tabs to try to dig out the details from sites hustling whatever they hustle. Google sees the writing on the wall: things are moving towards a prompt-based direct-ask system mediated by an LLM. It is definitely far from perfect now, but search and SEO are both going to be relics of a bygone past in not too many years.


The same forces that drowned us in SEO crap will drown us in LLMO crap. Hopefully we'll enjoy a brief period of usefulness first.


I don't know. There were recently some documents released on how OpenAI was soliciting companies to integrate their product recommendations more deeply into the training data. This is obviously a huge way to monetize ChatGPT-like products. Rather than SEO-optimized sites gathering the ad revenue with clickbait and gamification, OpenAI will collect the revenue themselves.

At the end of this enshittification, users will be looking for other options. Imagine a salesman that is ignoring the elephant in the room to tell you that if you bought this brand of shoes, you would run faster instead of giving tips on the skills to learn to be a better runner.

Search is a great way to find high-quality references and non-hallucinated answers by tweaking the keywords slightly. A salesman-like LLM might be pushing products when you just need information. ChatGPT's authority is going to dwindle, and search is a good tool to find authoritative sources.


I'd argue that sales is entirely a game of ignoring the elephant in the room and selling someone on something they don't already think they need.

It's not really sales if I go into a shoe store and say I want a pair of Air Jordan 4s in size 11; that's just customer service.


people already know how to be a better runner. they just dont want to do it. they rather buy shoes that make them run faster instead. people also dont like choice. they might think they do but really i dont believe it. getting a single answer with a simple question is more appealing than having to come up with a detailed question with many answers to choose from. even searches have been doing this with the single boxed result at the top


Any good summaries about what was revealed in the leak?


The main takeaway for me is that Google is caught lying. Many things were already assumed but Google used to deny them.

- They claimed that clicks were not a ranking factor; it turns out they are.

- It also turns out that they are using Chrome data for ranking purposes (not good for the ongoing lawsuit).

- There is also a field called something like "is small personal site," and it is presumed that those sites are penalized.

You can find a summary here: https://www.reddit.com/r/SEO/comments/1d2gllz/google_caught_...


didn’t they just leak the schema? we know they may be tracking that information but we don’t know how it affects the model


Would they track the data if they weren't using it, or didn't expect to use it in the future?


Didn't they just say they're not using it currently?


It is hard to delete protobuf fields.


It's getting harder and harder to like Google; not only are they NOT not evil, they are also not even competent.


Well, another reason to take what Google says with caution.



