How to find and remove duplicate pages on a site: instructions

Why are duplicate pages bad?

Duplicates are pages with the same content available at different URLs – in other words, pages that duplicate each other.

Pages can end up duplicated for different reasons:

  • automatic generation;
  • errors in the site structure;
  • incorrectly splitting one query cluster across two pages, and so on.

Even though duplicates can appear for natural reasons, they are bad for the site. Search robots rank pages whose content differs little from other pages lower, and the more such pages there are, the stronger the signal to search bots that the site does not deserve a place at the top of the results.

What happens to a site that has duplicate pages?

  1. Its relevance decreases. Both pages with the same content are demoted in the search results, losing positions and traffic.
  2. The uniqueness of the text content drops, which lowers the uniqueness of the site as a whole.
  3. The weight of the site’s URLs is diluted. Only one page can appear in the results for a given query, and when several identical pages compete for it, each of them loses weight.
  4. Indexing takes longer. The more pages there are, the longer it takes the bot to index your site. For large sites, indexing issues can severely impact search traffic.
  5. A ban from search engines. In the worst case, the site can drop out of the search results entirely for an indefinite period.

In short, nobody needs duplicates. Let’s figure out how to find and neutralize duplicate pages on a site.

How do I find duplicate pages?

Kirill Buzakov,
SEO specialist at SEO.RU:

“When we take on a website, we check it for duplicate pages that return code 200. Let’s break down what kinds of duplicates these can be.

Possible types of duplicate pages on the site

  1. Duplicate pages with http and https protocols.

    For instance: https://site.ru and http://site.ru

  2. Duplicates with and without www.

    For instance: https://site.ru and https://www.site.ru

  3. Duplicates with and without a slash at the end of the URL.

    For instance: https://site.ru/example/ and https://site.ru/example

  4. Duplicates with multiple slashes in the middle or at the end of the URL.

    For instance: https://site.ru/////////, https://site.ru/////////example/

  5. Duplicates with uppercase and lowercase letters at various nesting levels of the URL.

    For instance: https://site.ru/example/ and https://site.ru/EXAMPLE/

  6. Duplicates with one of the following appended at the end of the URL:

    • index.php;
    • home.php;
    • index.html;
    • home.html;
    • index.htm;
    • home.htm.

    For instance: https://site.ru/example/ and https://site.ru/example/index.html

  7. Duplicates with arbitrary characters added, either as a new nesting level (at the end or in the middle of the URL) or within existing nesting levels.

    For instance: https://site.ru/example/saf3qA/, https://site.ru/saf3qA/example/ and https://site.ru/examplesaf3qA/

  8. Duplicates with arbitrary numbers added to the end of the URL as a new nesting level.

    For instance: https://site.ru/example/ and https://site.ru/example/32425/

  9. Duplicates with an asterisk at the end of the URL.

    For instance: https://site.ru/example/ and https://site.ru/example/*

  10. Duplicates with a hyphen replaced by an underscore or vice versa.

    For instance: https://site.ru/defis-ili-nizhnee-podchyorkivanie/ and https://site.ru/defis_ili_nizhnee_podchyorkivanie/

  11. Duplicates whose nesting levels are in the wrong order.

    For instance: https://site.ru/category/example/ and https://site.ru/example/category/

  12. Duplicates with missing nesting levels.

    For instance: https://site.ru/category/example/ and https://site.ru/example/
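Many of the variants listed above can be checked quickly with a short script even before turning to the tools described below. Here is a minimal sketch in Python with the requests library (the base URL and the set of variants are illustrative assumptions, not taken from the article): it requests typical duplicate variants of one page and prints their response codes, so any variant answering 200 instead of redirecting is a live duplicate.

# A rough sketch: request typical duplicate variants of one page and print
# their response codes. A 200 on a variant means a live duplicate; a 301
# means a redirect to the main version is already in place.
# The base URL and the variant list are assumptions for illustration.
import requests

BASE = "https://site.ru/example/"

variants = [
    BASE,                                      # the main version
    BASE.replace("https://", "http://"),       # http instead of https
    BASE.replace("https://", "https://www."),  # with www
    BASE.rstrip("/"),                          # without the trailing slash
    BASE + "index.html",                       # index.html appended
    "https://site.ru/EXAMPLE/",                # uppercase in the path
]

for url in variants:
    # allow_redirects=False shows each variant's own response,
    # not the page it eventually redirects to
    response = requests.get(url, allow_redirects=False, timeout=10)
    print(response.status_code, url)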

How to detect duplicate pages?

You can search for duplicate pages in different ways. If you want to collect absolutely all duplicates and miss nothing, it is better to use all of the services listed below together. To find just the main ones, one tool is enough; choose whichever is more convenient for you.


  1. Crawling the site with a specialized program

    Screaming Frog SEO Spider works well for finding duplicates. We start a crawl and, once it finishes, check for duplicates under URL → Duplicate.


    In addition, under Protocol → HTTP, check the pages served over the http protocol: are there any among them with a Status Code of 200?


  2. Online services.

    The first service that suits our purposes is ApollonGuru.

    • We select 5–7 typical pages of the site. For example, the set could be: the main page, a category (listing) page, a product card / service page, a blog article, and other important pages depending on the type of site.
    • We enter them in the “Search for duplicate pages” field and click the “Submit” button.


    • We pick out the duplicates with a 200 server response code (see the “Server response code” column).


      In addition, check that the duplicates have direct 301 redirects configured to the main versions of the same pages; a small script for this check is sketched below.
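This check is easy to script as well. A minimal sketch in Python with the requests library (the duplicate-to-main pairs are placeholders, not real data) that verifies each duplicate answers with a single 301 pointing straight at its main version:

# A sketch: verify that each duplicate 301-redirects directly (in one hop)
# to its main version. The duplicate -> main pairs are illustrative examples.
import requests

pairs = {
    "http://site.ru/example/": "https://site.ru/example/",
    "https://www.site.ru/example/": "https://site.ru/example/",
}

for duplicate, main in pairs.items():
    # allow_redirects=False keeps the duplicate's own response instead of following the chain
    response = requests.get(duplicate, allow_redirects=False, timeout=10)
    location = response.headers.get("Location", "")
    direct = response.status_code == 301 and location == main
    print("OK  " if direct else "FAIL", duplicate, "->", response.status_code, location)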

You can also check for duplicates using the Check Your Redirects and Statuscode online service, but it is only suitable when you need to check a single URL.


  3. Yandex and Google Webmaster Panels.

    You can also find duplicate pages using the search engines’ own tools – Yandex.Webmaster and Google Search Console.

    In Yandex.Webmaster, we look at the “Indexing” section, then “Pages in Search”.


    There you can see the current indexing status of the site and pick out the duplicate pages we are looking for.


    In Search Console, we review the “Coverage” report, specifically the pages excluded from the index.


We collect all the duplicates into one table or document and then hand them over to the programmer to fix.



Try to explain the problem to the programmer in as much detail as possible, since there can be a lot of addresses involved.”

How to remove duplicate pages on the site?


Evgeny Kostyrev,
web programmer at SEO.RU:

“There are many ways to deal with duplicate pages. If possible, handle them manually, but that option is not always available, because it takes serious programming skills: at the very least, you need to know the features of your site’s CMS well.

Other methods do not require specialized knowledge and can also give good results. Let’s break them down.

301 redirect

301 redirects are the most reliable way to get rid of duplicates, but also the most demanding of a programmer’s skills.

How it works: if the site runs on the Apache server, the necessary rules are written in the .htaccess file using regular expressions.

The simplest option for a 301 redirect from one page to another:

Redirect 301 /test-1/ http://site.ru/test-2/

Set up a 301 redirect from a page with www to a page without www (the main mirror is a domain without www):

RewriteCond %{HTTP_HOST} ^www\.(.*)$
RewriteRule ^(.*)$ http://%1/$1 [L,R=301]

Let’s set up a redirect from the http protocol to https:

RewriteCond %{HTTPS} !=on
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]

We set up a 301 redirect for index.php, index.html or index.htm (for example, in Joomla), consolidating these duplicates in bulk:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.(php|html|htm)\ HTTP/
RewriteRule ^(.*)index\.(php|html|htm)$ http://site.ru/$1 [R=301,L]
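The list of duplicate types above also mentioned URLs with and without a trailing slash. A minimal .htaccess sketch for that case, assuming the version with the slash is the main one (this rule is an illustration, not part of the original instructions):

# Add a trailing slash to URLs that lack one; real files are excluded
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*[^/])$ /$1/ [L,R=301]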

If the site runs on Nginx, the rules are written in the nginx.conf file. Redirects are also set up with rules, using regular expressions where needed, for example:


location = /index.html {
    return 301 https://site.com/;
}

Instead of index.html, you can specify any other URL on your site that you want to redirect from.
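For the www duplicates listed earlier, Nginx configurations usually use a separate server block rather than a location. A minimal sketch (an assumption, not from the article) that redirects the http://www variant to the main mirror https://site.ru:

# Assumption: the main mirror is https://site.ru without www
server {
    listen 80;
    server_name www.site.ru;
    return 301 https://site.ru$request_uri;
}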

At this stage, it is important to check that the new code is correct: if it contains errors, it is not just the duplicates that will disappear – the whole site can vanish from the Internet.

Creating a canonical page

Using the canonical attribute points the search spider to the single page that is the original and should appear in the search results.

To mark such a page, you add a tag with the address of the original page to every duplicate URL.
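A typical version of this tag, placed inside the <head> of every duplicate page (the URL below is an illustrative example of the original page’s address), looks like this:

<link rel="canonical" href="https://site.ru/example/" />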


Disallow in robots.txt

If there are duplicates on the site, you can forbid the crawler from indexing them with a Disallow directive in robots.txt:

User-agent: *
Disallow: /contacts.php?work=225&s=1

This method requires practically no programming skills, but it is not suitable when there are many duplicates: writing every single duplicate into robots.txt would take a lot of time.”

Choose the method based on your own programming skills and personal preference, and do not give search engines a reason to doubt the relevance and quality of your site.
