



In a Google Search Central video, Google's Gary Illyes explains the part of indexing that involves canonical selection, what canonical means to Google, and gives a thumbnail explanation of web page signals. He also mentions the centerpiece of a page and what Google does with duplicates, which suggests a new way of thinking about them.

What is a canonical web page?

There are several ways to think about what canonical means: the publisher and SEO perspectives from our side of the search box, and Google's perspective.

Publishers identify what they feel is the “original” web page. The SEO concept of a canonical, meanwhile, is about selecting the “strongest” version of a web page for ranking purposes.

Canonicalization for Google is something completely different from what publishers and SEOs think it is, so it's good to hear about it from a Googler like Gary Illyes.

Google's official documentation on canonicalization uses the term deduplication to refer to the process of selecting a canonical, and lists five typical reasons why a site may have duplicate pages.

5 reasons why pages are duplicates

“Region variants: For example, content for the US and the UK that is accessible from different URLs but is essentially the same content in the same language.

Device variants: For example, a page with both a mobile and a desktop version.

Protocol variants: For example, the HTTP and HTTPS versions of a site.

Site functions: For example, the results of sorting and filtering functions of a category page.

Accidental variants: For example, the demo version of the site is accidentally left accessible to crawlers.”

Canonicals can be thought of in three different ways, and there are at least five reasons for duplicate pages.

Gary explains another way of thinking about canonicals.

Signals are used to select the canonical version

Illyes shares one more definition of canonical, this time from an indexing perspective, and talks about the signals used to select canonicals.

Gary explains:

“Google determines whether a page is a duplicate of another known page and which version to keep in its index, the canonical version.

But in this context, the canonical version is the page that best represents a group of duplicate pages according to the signals we collect about each version.”

Gary pauses to explain duplicate clustering and, after a moment, returns to signals.

He continued:

“Most of the time, only canonical pages appear in search results. But how do you know which pages are canonical?

So when Google gets the content of a page, specifically the main content or centerpiece of the page, it groups it with one or more pages (if any) that have similar content. This is duplicate clustering.”

Pause here to note that Gary refers to the main content as the “centerpiece” of the page. This is interesting because there's a concept called the “centerpiece annotation” that was mentioned by Google's Martin Splitt. He didn't actually explain what the centerpiece annotation is, but this part Gary shared is helpful.
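To make the idea of duplicate clustering concrete, here is a minimal Python sketch. It is not how Google does it; the video gives no implementation details. The sketch assumes the main content of each page has already been extracted, and it treats two pages as duplicates only when their normalized main content is identical, whereas Illyes speaks of "similar" content. The function names and example URLs are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(main_content: str) -> str:
    """Hash the main content after lowercasing and collapsing whitespace.

    Assumption: an exact-match fingerprint stands in for whatever
    similarity measure a real search engine would use.
    """
    normalized = re.sub(r"\s+", " ", main_content).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cluster_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose main content produces the same fingerprint."""
    clusters = defaultdict(list)
    for url, main_content in pages.items():
        clusters[fingerprint(main_content)].append(url)
    return list(clusters.values())


# Hypothetical pages: the HTTP and HTTPS versions share the same main content.
pages = {
    "http://example.com/shoes": "Red running shoes, sizes 6 to 12.",
    "https://example.com/shoes": "Red  running shoes, sizes 6 to 12.",
    "https://example.com/hats": "Wool hats for winter.",
}
clusters = cluster_duplicates(pages)
for cluster in clusters:
    print(cluster)
```

In this toy example the two shoe URLs land in one duplicate cluster and the hat page sits in a cluster of its own.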

Below is the portion of the video where Illyes explains what a “signal” actually is:

“We then compare several signals already calculated for each page and choose the canonical version.

Signals are information that search engines collect about a page or website and use for further processing.

Some signals are very simple, such as a site owner's annotation in the HTML like rel="canonical", while others are less simple, such as the importance of individual pages on the internet.”
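For readers who want to see what the "very simple" signal looks like in practice, the following Python sketch reads a rel="canonical" annotation out of a page's HTML using only the standard library. The class and function names and the example URL are made up for illustration, and the sketch says nothing about how Google itself parses pages.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values, so split before checking.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the declared canonical URL, or None if the page declares none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page that declares its own canonical URL.
html = """
<html><head>
<link rel="canonical" href="https://example.com/shoes">
</head><body>Red running shoes</body></html>
"""
canonical_url = find_canonical(html)
print(canonical_url)
```

The annotation is something the publisher controls, which is what separates it from signals like page importance.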

Each duplicate cluster has one canonical

Gary then explains that, for each cluster of duplicate pages, one page is selected as the canonical to represent that cluster in the search results.

He continues:

“Each duplicate cluster contains a single version of the content selected as canonical.

This version will represent the content in search results for all the other versions.

Other versions in the cluster will be alternate versions served in different contexts, such as when a user is searching for a very specific page from the cluster.”
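The selection step Illyes describes can be pictured with a toy Python sketch. Everything specific in it is an assumption made for illustration: the two signals, their 0.5 weights, and the scoring formula are invented, since the video does not say how Google weighs its signals. The only point is the shape of the result, which is one canonical per cluster with the rest kept as alternates.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    url: str
    # Hypothetical signal: other pages in the cluster point rel="canonical" here.
    declared_canonical: bool
    # Hypothetical 0..1 stand-in for "the importance of individual pages".
    importance: float


def score(page: PageSignals) -> float:
    """Combine the signals with made-up weights (an assumption, not Google's)."""
    return (0.5 if page.declared_canonical else 0.0) + 0.5 * page.importance


def select_canonical(cluster: list[PageSignals]) -> tuple[PageSignals, list[PageSignals]]:
    """Pick the highest-scoring page as canonical; the rest become alternates."""
    canonical = max(cluster, key=score)
    alternates = [page for page in cluster if page is not canonical]
    return canonical, alternates


# Hypothetical duplicate cluster for one product page.
cluster = [
    PageSignals("http://example.com/shoes", False, 0.4),
    PageSignals("https://example.com/shoes", True, 0.6),
    PageSignals("https://example.com/shoes?color=red", False, 0.2),
]
canonical, alternates = select_canonical(cluster)
print(canonical.url)
print([page.url for page in alternates])
```

Note that the alternates are not discarded in this sketch. They stay attached to the cluster, which mirrors Gary's point that other versions can still be served in different contexts.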

Alternative versions of web pages

The last part is very interesting and important to consider, especially for e-commerce web pages, as it can help you rank for multiple variations of a keyword.

Content management systems (CMS) may create duplicate web pages to account for product variations such as size and color, which can affect the description. These variant pages may be selected to rank in search results when they more closely match the search query.

This is important to consider because you may be tempted to redirect or noindex variant web pages to keep them out of the search index for fear of a (non-existent) keyword cannibalization problem. Adding noindex to a page that is a variant of another page can backfire, because there are scenarios where a variant page is the best one to rank for more nuanced search queries that include colors, sizes, or version numbers that differ from the canonical page.

Important points about canonicals (and others) to remember

Gary's discussion of canonicals is packed with a lot of information, including some side topics related to the main content.

Here are seven points to consider:

The main content is called the centerpiece.

Google calculates a “few signals” for each page it discovers.

Signals are data used for “further processing” after a web page has been discovered.

Some signals, such as hints (and perhaps directives), are controlled by the publisher. The hint Illyes mentioned is the rel=canonical link attribute.

Other signals, such as the importance of a page in the context of the Internet, are outside the publisher's control.

Some duplicate pages may serve as alternate versions.

Alternate versions of a web page may still be ranked and are useful to Google (and publishers) for ranking purposes.

Check out Search Central's episode on indexing.

How Google Search indexes pages

Featured image from Google Video / Modified by author

