Prevent pages from being indexed: robots.txt vs noindex

Being listed on search engines is important, but you also want to control what gets listed. That's where robots.txt and "noindex" come into play, but what's the difference? Let's find out.

Robots.txt is one of the most common ways people try to block web bots, but the "noindex" tag has a similar goal and might even be what you actually want.

"noindex" VS robots.txt


To better understand these two, let's break them down one by one.

robots.txt


All websites should have a robots.txt file, and for bots to be able to access it, the file should always be found at the root of the site, i.e. www.mywebsite.com/robots.txt

[snippet]Sitemap: https://www.bonty.net/sitemap.xml

User-agent: AhrefsBot
Disallow: /

User-agent: *
Crawl-Delay: 20
Disallow: /login
Disallow: /registration
Disallow: /search
Disallow: /search?
Disallow: /tag/*[/snippet]

Inside robots.txt, you can define which pages bots are or aren't allowed to crawl (Disallow), how frequently they may crawl (Crawl-Delay) and which bots a given set of rules applies to (User-agent). When a star (*) is used as the user agent, the rules apply to all bots. robots.txt is also where you define the path(s) to your sitemap, and you can list more than one sitemap.

If you disallow a page, as we have done here with /login, bots that respect robots.txt will not crawl it. Bad bots will most likely do the opposite, which is why you can't use robots.txt to hide areas you don't want listed on the web.

But, and there's always a but: just because, for example, Google's web bot can't crawl your page doesn't mean the page won't be indexed on Google. If a page you've restricted in robots.txt is linked to from another website, Google can still index it, even though it can't crawl it directly.

noindex


The "noindex"-tag only have one goal and that's to prevent that specific page from being listed in search engine results, but it will still allow bots to crawl the page

[snippet]<meta name="robots" content="noindex">[/snippet]

To prevent all robots from indexing specific pages, add the above meta tag to the <head> of every page you don't want indexed.
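The same directive can also be delivered as an HTTP response header, which is handy for files that can't carry a meta tag, such as PDFs. A minimal example of what that header looks like:

[snippet]X-Robots-Tag: noindex[/snippet]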

[snippet]<meta name="googlebot" content="noindex">[/snippet]

You can also target a single bot, as shown above with googlebot, by using that bot's name instead of "robots".
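The content attribute can hold several directives at once. For example, to keep a page out of results and ask bots not to follow its links:

[snippet]<meta name="robots" content="noindex, nofollow">[/snippet]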

For "noindex" to work, it's important the page is not restricted in robots.txt or the bots will never be able to see that the page should not be indexed

Bad bots


When it comes to bad bots, it doesn't matter much whether you use "noindex" or disallow them in robots.txt. Neither method is a hard ban; both are suggestions that only good bots will adhere to.

Still, some bots are merely annoying rather than outright bad, so it's still good practice to use both methods as needed.
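If you truly need to keep a bad bot out, you have to enforce it at the server level instead of asking nicely. A minimal sketch, assuming nginx and a made-up "BadBot" user-agent string:

[snippet]# inside a server { } block in the nginx config:
# refuse any request whose User-Agent contains "BadBot"
if ($http_user_agent ~* "BadBot") {
    return 403;
}[/snippet]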

Best approach


For most websites, a combination of "noindex" and robots.txt is the correct approach.

Login, search and similar pages are ones you typically don't want indexed, but don't mind being crawled. Then there are more sensitive areas, such as administrative pages, that you definitely don't want indexed, and whose URLs you also don't want to reveal in robots.txt.
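As a sketch of how that can look (paths are examples): let robots.txt manage crawling, keep sensitive paths out of it entirely, and rely on "noindex" in the pages themselves:

[snippet]# robots.txt - crawl rules only, no mention of admin URLs
User-agent: *
Crawl-Delay: 20
Disallow: /search?

<!-- on /login, /search and every admin page -->
<meta name="robots" content="noindex">[/snippet]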

Some bots also crawl very frequently, and you can use robots.txt to reduce how often they may crawl, or to ban them completely. Good bots will respect this; bad bots will often ignore it.

Which approach is correct for you will depend on the needs and goals of your website.

For further reading, we recommend looking into canonical tags, which complement noindex well.
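A canonical tag tells search engines which URL is the preferred version of a page, for example (the URL is a placeholder):

[snippet]<link rel="canonical" href="https://www.example.com/preferred-page">[/snippet]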



Tags: #SEO #noindex #robotstxt

