Handling duplicate content on websites

Why duplicate content is problematic

Duplicate content on websites describes a situation where different URLs return the same response document. This only concerns HTML documents. It is easier to understand with a quick example:

Imagine you are running an online blog, available at myblog.com. Since some older readers expect to type "www" in front of domains, you also serve it under www.myblog.com. To human visitors it is obvious that this is still the same site, but to search engines it is not. If a search engine indexes a blog post, it might find it on www.myblog.com first and index that for search results, and ignore the version on myblog.com it finds later on. For a different post, it may be the opposite. As a result, blog posts may rank for one domain or the other in the search engine results, and effectively split the domain authority between two domains instead of all contributing to the same one (your blog).

A situation like this will artificially decrease the authority, and thus the visibility in search results, of your pages, even if you have lots of high quality content. It does not necessarily affect real human visitors, but it does affect your position and relevance in search engine results.

Handling multiple domains

The problem of having multiple domains is easily solved by adding a canonical (primary / preferred) URL to each page, using a <link rel="canonical"> HTML element.

If your website is available under example.com and www.example.com, and you want all your content to only rank for example.com, you could add a single line of html to the <head> element:

<link rel="canonical" href="https://example.com/blog/my-post" />

Now even if a search engine crawler finds the blog post under www.example.com/blog/my-post, it will see this tag and know to ignore this page, and attribute the content to the url variant under example.com/blog/my-post instead.

The canonical tag can be added unconditionally to the HTML response, as there is no problem if the canonical document references itself inside the canonical tag. It's not a redirect, just a hint about which URL the content should be attributed to (in other words, which URL is "authoritative" for the content).
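As a minimal sketch of where the tag sits, the fragment below shows a bare document that could be served for both example.com/blog/my-post and www.example.com/blog/my-post. The title, charset and body are placeholders invented for illustration; only the canonical line comes from the example above.

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <!-- Placeholder title, not part of the original example -->
    <title>My post</title>
    <!-- Identical on every host that serves this post;
         on example.com the tag simply references its own URL -->
    <link rel="canonical" href="https://example.com/blog/my-post" />
  </head>
  <body>
    <!-- post content goes here -->
  </body>
</html>
```

The href is written as an absolute URL including the scheme, so there is no ambiguity about which host is meant.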

Unintended duplicate content

It is fairly common to accidentally produce duplicate content through query parameters. Let's say you have a base page example.com/about showing some information about your page.

On the sidebar, you may have a little widget with posts that are sortable, and if a user changes the sort order, the URL would include a query parameter like example.com/about?sort=asc. Or perhaps you added tracking for your latest advertisement campaign using UTM tags. Users clicking on the ad on social media would be directed to a link like example.com/about?utm_campaign=myad&utm_medium=social, whereas ads within search engines would change the utm_medium parameter: example.com/about?utm_campaign=myad&utm_medium=search-engine.

Again, for you and your users, there is no problem here. But for search engines, these are technically entirely different pages, and having them rank may be extremely problematic. If the advertisement url from your social media campaign is indexed by search engines, then users coming from search engine results would now also show up as coming from your social media ad campaign in your analytics, even though they didn't! Or if query parameters including settings like the sort order from the first example are picked up and ranked by search engines, then users will have an unexpected default experience including custom settings they didn't configure.

For these reasons, it is generally a good idea to apply canonical html tags to every page on your website, even if you don't support multiple different domains or subdomains.
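To make this concrete, here is a sketch of the head of the about page from the example above, assuming the site is served over https. The same tag would be emitted no matter which query parameters were part of the request:

```html
<head>
  <!-- Served unchanged for /about, /about?sort=asc and
       /about?utm_campaign=myad&utm_medium=social -->
  <link rel="canonical" href="https://example.com/about" />
</head>
```

Because the href omits the query string, every variant attributes its content to the plain URL.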

Linking logical alternatives

Logical duplicates are a different form of duplicate content, for example when you provide a page in several languages. The pages aren't literally identical, but the information they contain is. Without any tags, search engines may treat all language versions as separate pages and rank them separately. This can lead to situations where the English version ranks higher in a Spanish-speaking region, even though you have a Spanish translation of it.

To fix this, every language version declares itself as the canonical version and lists all translations, including itself, as "alternates". On the English page, this looks like:

<link rel="canonical" href="https://example.com/en/blog/my-post" />
<link rel="alternate" hreflang="en" href="https://example.com/en/blog/my-post" />
<link rel="alternate" hreflang="es" href="https://example.com/es/blog/my-post" />
<link rel="alternate" hreflang="fr" href="https://example.com/fr/blog/my-post" />

The first tag designates the URL example.com/en/blog/my-post as the canonical (primary) version of the page it appears on, here the English one, while the following lines list every language version of the post: English, Spanish and French. The canonical tag should always point to the current page: the French page lists the French URL as its canonical, the Spanish page the Spanish URL, and so on. Note that the current page should always be included among the alternates as well, to help search engines understand its content language.
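For comparison, here is a sketch of what the French page at example.com/fr/blog/my-post would carry under the rule described above. Only the canonical line differs from the English page; the list of alternates stays the same on every translation:

```html
<!-- Canonical points at the page itself, here the French version -->
<link rel="canonical" href="https://example.com/fr/blog/my-post" />
<!-- Same set of alternates as on the English and Spanish pages -->
<link rel="alternate" hreflang="en" href="https://example.com/en/blog/my-post" />
<link rel="alternate" hreflang="es" href="https://example.com/es/blog/my-post" />
<link rel="alternate" hreflang="fr" href="https://example.com/fr/blog/my-post" />
```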

Now search engines that find this page immediately know to soft-link the translated versions to it, and can provide users who prefer Spanish-language results with the Spanish translation of the blog post instead of the English or French one. Using alternates improves the experience of visitors coming from search engines and helps unify the page authority and ranking for content that is logically linked.