I finally understand (somewhat) how Google search works


On May 5, Rand Fishkin, CEO of marketing research firm SparkToro and an SEO expert, received an anonymous email whose sender claimed to have access to the API documentation for Google's search algorithm. Given how secretive Google is about how its search mechanisms work, Fishkin was immediately skeptical of so bold a claim. After exchanging a few emails, Fishkin spoke with the sender via video call on May 24. Four days later, the source revealed his identity: Erfan Azimi, founder of a digital marketing agency and an SEO practitioner himself, who shared many mutual friends with Fishkin.

How did the leak happen?

During the call, Azimi showed Fishkin a document containing more than 2,500 pages of API documentation and 14,014 attributes, or API functions. It has not been confirmed who posted it, but the document history shows yoshi-code-bot /elixer-google-api as the origin, suggesting that documentation from Google's internal Content API Warehouse was accidentally published to the repository. The code was published on March 27 and remained publicly accessible until May 7, long enough for outsiders to obtain copies.

While the documents don't reveal exactly what causes the search algorithm to rank an article higher, they do list the factors that Google Search tracks, which is revelatory in itself. Until now, the workings of Google's algorithms have been as much a black box as large language models or the human mind. The company's executives have hidden details about how search rankings work and deliberately lied about what matters when publishing content, misleading the marketers, publishers and content creators whose jobs largely consist of optimizing content for Google Search.

Fishkin shared the documents with Mike King, an SEO veteran and CEO of marketing firm iPullRank. The two then published their own analyses of the leak, giving an industry that has long operated in the dark a set of valuable findings, many of which exposed Googlers' lies.

What did Google lie about?

Google has repeatedly and explicitly said that it does not take domain authority into account. Yet the documents show that Google has a feature called siteAuthority, although it is not clear how this metric is calculated.

And contrary to previous assertions that clicks aren't used to calculate rankings, there is now solid evidence that clicks are indeed a metric. In testimony at the Department of Justice (DOJ) antitrust trial last November, VP of Search Pandu Nayak spoke about the NavBoost and Glue ranking systems, both of which use click-driven methods to raise, lower, or otherwise adjust search rankings. Nayak revealed that Google has employed NavBoost since 2005 and that it has used 18 months of click data. Google representatives have previously said that dwell time is not a feature, but NavBoost does consider longer clicks, which is essentially the same thing.

Another key point is that Google may use Chrome data to determine rankings, something that Google has previously denied. King noted that Chrome appears in multiple modules: one related to Page Quality Score contains site-level view measurements from Chrome, and another, which appears to be related to generating sitelinks, also has Chrome-related attributes.

Not much is known about what exactly Twiddlers are, but King described them as re-ranking functions. How important are they? Ex-Googler Debarghya Das wrote on X that he once disabled a Twiddler without realizing that all of YouTube search relied on it.

Google also stores the names of the authors of articles, and when combined with the detailed mapping of entities and embeddings presented in these documents, it's clear that a comprehensive measurement of authorship is being done, King noted.

The leak is reminiscent of what happened at AOL in 2006, when the web portal's research department accidentally made public a compressed file containing 20 million keyword searches by more than 650,000 users over a three-month period.

While the Google leak isn't as egregious, it's a lesson for journalists and SEO experts never to take the company at its word. More than a day after the leak was reported, Google acknowledged that the data was 100% its own, but warned against making inaccurate assumptions about Search based on out-of-context, out-of-date, or incomplete information. But should we still believe them?




