Hacker News
Google won’t comment on a leak of its search algorithm documentation (theverge.com)
161 points by janandonly on May 28, 2024 | 47 comments


The Rand Fishkin article that this refers to is a much better source

https://sparktoro.com/blog/an-anonymous-source-shared-thousa...


HN discussion of that article: https://news.ycombinator.com/item?id=40496967


Quick, have an LLM read it, as that magically removes copyright.


It should be enough to have it online, soon AIs would pick it up.


It was leaked under an open source license.


I don't believe that was an authorized publishing.


For all we know, it was made by a person/bot/process authorized to publish to an official Google repository[1].

It may have been accidental, sure[2]. But is there a basis to claim unauthorized action?

[1]: https://news.ycombinator.com/item?id=40505708

[2]: Reminds me of “Night of the Living Dead”, which accidentally became public domain; source: https://en.m.wikipedia.org/wiki/Night_of_the_Living_Dead


Perhaps surprisingly to the HN community, a lot of what is coming out confirms what decent SEOs have been saying for years (and appears to contradict a lot of what Google have said publicly via their various talking heads)


for someone far removed from SEO, what are the true (non-sensationalist headline) effects of this leak?


Long time dev/designer and part-time SEO person here.

The major implication is that a lot of what Google was telling SEO practitioners was either a lie or a cleverly disguised red herring wrapped up in semantics.

The article that was posted earlier today went through a lot of the stuff they were lying about, with examples of Matt Cutts and Gary Illyes statements that are now directly contradicted by the recent release of this code.

Here is the article: https://ipullrank.com/google-algo-leak

The premise:

What I’ll do here is contextualize some of the most interesting ranking systems and features (at least, those I was able to find in the first few hours of reviewing this massive leak) based on my extensive research and things that Google has told/lied to us about over the years.

“Lied” is harsh, but it’s the only accurate word to use here. While I don’t necessarily fault Google’s public representatives for protecting their proprietary information, I do take issue with their efforts to actively discredit people in the marketing, tech, and journalism worlds who have presented reproducible discoveries.


Makes sense. I used to have a YouTube channel that got a few age-restricted videos (for absolutely silly reasons). Reps from YouTube would swear up and down that age restrictions wouldn't affect your video's ranking in the algorithms, but it was well known in most creator circles how untrue that was.

Thank you for that link, I'll have to check it out!


Normally I’m not here to defend the likes of Google, but when it comes to their communications to SEO practitioners, I’m OK with some intentional misdirection and outright lies.


The "lies" there are such bullshit. They basically consist of "Google claimed they didn't do X, now it looks like they probably do X, and therefore Google lied."

They are somewhat circumspect about the fact that, in each example, Google made a claim 8+ years ago about what they did at the time, and now there's evidence that what they are currently doing doesn't match that claim. That's not a lie.

For example, they link to this tweet from 2016 as such a claim: https://ipullrank.com/wp-content/uploads/2024/05/image21.png

Or a later example where they link to this 2016 talk: https://www.youtube.com/watch?v=iJPu4vHETXw

Later on, they use claims from 2012: https://www.seroundtable.com/google-chrome-search-usage-1561...


A lot more spam than usual in your search results.

A leak like this (if it’s actually as substantial as people say it is), combined with Google’s ham-fisted way of inserting LLM results into its search, could really mean that Google search quality will crater in the coming months, potentially opening up space for competitors or reducing usage in a real way.


It has really gotten so bad. I get so much reddit spam lately. In a lot of cases reddit can be a great... but man, the astroturfing has gotten insane.


Crater worse than it already is?


Since this seems to be most relevant to SEO practitioners I might as well ask a loaded question. Given how prevalent spam sites are in search results now and how bad Google seems to be at filtering them out, what's the argument for how SEO is a good thing in aggregate?


> how SEO is a good thing

Are you asking whether people want discoverability for their product/companies ?

Or if they're willing to give up agency and leave it to a third party for-profit corporation to dictate what is surfaced in people's generic searches ?


I'm asking if optimizing for clicks in search results rather than optimizing for quality of content and letting search engines attempt to determine what's worth surfacing was overall a good thing for users.


That's a valid question, but there is no answer to what "optimizing for quality of content" means, and every entity involved has a different view of it.

We'd have to decide what is quality and what is content first, and I'd expect the heat death of the universe before we ever come to a useful definition that matches even half of people's own definitions.


Depends on the context. I'm a long time dev/designer and part-time SEO person.

Back in the day (early to mid aughts) as an SEO person you really had to work to get your site noticed. You had to develop inbound links, create really good content, set your site up well for Google to index, and use other resources like blogs and social media (when it was still in the early stages) to boost your positioning. In short, it was a full-time, month-to-month job. For the sites I worked on, it was a constant battle to get a decent ranking and to maintain it.

But you are 100% right. Somewhere in the last 8-10 years, everything has gone from search engines really protecting who they put on the front page, to loading the page with tons of ads (half of the results above the fold are now all ads, followed by more ads below the 10th position) and making it insanely easy to manipulate Google and others to get your site on the front page.

Great example: In 2019, I had a medical device startup as a client. Out of the blue they called me because of my prior relationship with one of the founders' parents and the site I built for her. They needed a site designed, built, optimized and 15 pages of content written in less than three months. I gave them some insane price and they didn't even balk, and said if I got done sooner, they would send me a bonus.

I grabbed a template somewhere online, revamped the home page and the internal content pages and then proceeded to copy/paste content from other well known and not so well known sites in their industry. I did everything you should never do from an SEO standpoint. I figured it was a long shot, but worth the payoff. I got their site released on time, but the majority of the content was copied from other sites. Very little of it was original.

I figured it would get buried in the first month. Nope. #3 on the first page for various searches I targeted. It was outranking the sites I copied the content from, it was crazy. To this day, the site remains in the top three places on the first page of Google and other search engines for dozens of searches I targeted. It was outranking huge medical supply companies in the same industry - all as a startup.

That experience in 2019 was a huge wakeup call for me. It was blatantly obvious how easy it was to manipulate Google and other search engines. As of today, I have no idea if SEO is really a worthwhile pursuit any more considering how easy it is to do this. If I can do it, then I just assume everybody else is already doing it. If not, then they're missing out.


> Though the documents aren’t exactly a smoking gun

Smoking gun of what? The article never seems to actually say.

The only accusation I see is that SEO grifters were lied to by Google Search docs, but if you look at the article[1], it's over things like "Google always said Domain Authority isn't a thing, but we found a variable site_authority. We don't know what it's for or what it does, but clearly we were right". Which, even if true, isn't a thing I care about at all...

[1] https://ipullrank.com/google-algo-leak


I believe that for a while Google didn’t consider Domain Authority much.

For a while, I would create pages like “Bank of XYZ Phone Numbers” or “Phone Numbers for $ThatTelecomYouHate” and often rank #1.

Which made sense, because my page was just a very parseable page of their actual phone numbers, but there was obviously a risk of harm there.

Phone numbers on the actual orgs’ websites were hard to find because, well, they wanted you to do anything but call them because calling cost them money. And the big corp websites were always an SEO mess that I’m not surprised a crawler could not comprehend.

Of course the ads were all for competitors, so I enjoyed making money and costing them customers at the same time.

After some years, Google would start returning you 10 different results from the orgs’ websites, playing into their hand of having you do anything but call your bank/telco.


Publicly lying about how a core product works is notable, and would be a big deal for any other company. Reminds me of Apple's "we treat all developers equally" statements before it was made clear Spotify didn't pay a dime on their in-app subs.

You're right that the real impact is limited though; people won't be moving to another search engine anyway.


SEO discussion is notoriously high noise and low signal. It's mainly conjecture, assumption, and superstition. Like a lot of things, if they knew what worked, they wouldn't be sharing it.


I’m not sure calling folks like Rand Fishkin a grifter is all that accurate, to put it politely.



I think the repo was also cloned here... <https://github.com/yoshi-code-bot/elixir-google-api/commit/d...> and considering the original repo is apache 2... not sure they can legally force them to be removed.


What am I looking at?


You're looking at a commit that contains all of the leaked files/documentation, conveniently listed in the directory view on the side of the GitHub page.


That tells me literally nothing. This is all machine generated boilerplate code. This isn't executive emails or documents. Since when are comments written by nameless workers an authoritative newsworthy leak on corporate policy? It probably tells us something about what sort of things workers at Google have researched, e.g. csam, scams, racism. If I grep for SEO then I see some database fields for their pagespeed service that mentions accessibility audits and progressive web apps. Looks pretty consistent with what Google has always said publicly. Where's the specific code that's generating all the shock and outrage?


There are plenty of breakdowns of what exactly the technical documents contain, including the stuff I just said. In great detail! Google, ironically, is your friend here. The leak confirms what SEO practitioners have been saying for years and what Google has actively been discrediting with flat-out falsehoods. They said they don’t use domain authority anywhere in their ranking (the documentation describes fields named domainAuthority), they track clicks in great detail (which they have vigorously denied), and they use Chrome usage data as a factor in their rankings. Hope that helps.


Licenses don't work like that. Accidentally releasing something under a license doesn't mean it will forever carry that license.


Regarding trade secrets, inadvertent or accidental disclosure means it is no longer a trade secret - no protection. And regarding copyright, there is Oracle vs. Google that copying APIs is fair use (Google's argument). So even if the copyright license grant in the leak is held to be invalid (the license states it is perpetual, if it is valid it does forever carry that license), it would be fair use anyway. I think the only IP protection is patents, Google has patents on various parts of the system so if you did actually want to build a Google clone by reverse-engineering their documentation you would run into trouble.


I think serving the docs themselves is likely illegal, though. You don't actually have a right to distribute copies just because someone pushed a wrong button and they were made visible in a repo that has a top-level Apache 2 license.

Sure, they can't stop people who received a copy from studying it and using or discussing the information in there. But I think they can almost certainly stop them from (legally) distributing that information to others.


Apache 2 is irrevocable, and you can redistribute it however you wish. I think it's safe to assume that as soon as it was pushed public to the original Google repo it was covered under Apache 2. That entire repo was distributed by Google as Apache 2. So if you pull that commit, as far as I would be concerned everything in it is Apache 2 since that's the license it's distributed with. We have no idea why that code was pushed, and it doesn't matter from a public perspective.


None of that matters if the company never intended to publish it as Apache 2, and can show they took immediate steps to correct their mistake. The law is not code, and intent matters.


Source or relevant case? Because if someone pulls before there's another commit that removes it... you would have absolutely zero idea why they removed it, or you wouldn't even know it was removed. Or even if you pull that commit, which as of yesterday, after over a month, was still in the Google repo, there's no info that it shouldn't be there... And I think pushing code to a public open source repo shows intent to publish it...


I wonder if anyone's ever tried to figure out if APIs are copyrightable...

https://arstechnica.com/tech-policy/2021/04/how-the-supreme-...



I didn't know Google used Elixir internally


I don't think this leak shows that they use Elixir. It's open-source code that Google customers are supposed to be able to use to interact with Google APIs. It's natural to provide support for your customers to use the languages they want to use, rather than the ones you've settled on internally; this is important for anyone, let alone for an internal environment as restrictive as Google's is supposed to be. (https://aws.amazon.com/sdk-for-php/ exists, but Amazon famously doesn't allow services to be developed in PHP.) Google is famous for allowing only a few languages for production code.

The README also says:

> Disclaimer

> This is not an officially supported Google product.

which, while it doesn't directly indicate that Elixir isn't used internally to Google, would be a surprising mismatch of support if they did use Elixir. Here are their official clients (including some in languages that I doubt they want to be used internally for service development, like PHP, Node.js, and .NET): https://developers.google.com/api-client-library


That statement about not being a supported Google product is just boilerplate that they make you put on Google open source products. I wouldn’t read much into it. I say this as an ex-Googler who open sourced several projects and had to put that disclaimer on them.


Right, what I meant is that if Google used Elixir internally they'd likely also provide an officially supported Elixir client.


Maybe I'm missing something but all the document seems to imply is that Google collects this data... Nothing about it being used as ranking factors. It doesn't seem unreasonable for Google to collect and correlate extra data that it may want to use as future ranking factors, experiments, quality analysis etc.


Ouch. Some pretty damning stuff there. I can see spokespeople getting something wrong, but multiple points that are seen as highly contentious in SEO land being explicitly denied, and the docs say the opposite? Not much wiggle room there.


Where money can be made, money will be made.

Google lied? Of course they did. And Microsoft, consistently (with their pool of NOBUS).

In articles/stories like that my mind wanders to the ST land and the Rules of Acquisition (https://projectsanctuary.com/the_complete_ferengi_rules_of_a...)

125. A lie isn't a lie until someone else knows the truth



