Prevent pages from being indexed: robots.txt vs noindex

Being listed on search engines is important, but you also want to control what gets listed. That's where robots.txt and "noindex" come into play, but what's the difference? Let's find out.

Robots.txt is one of the most common ways people try to block web bots, but the "noindex" tag has a similar goal and might even be what you actually want.

"noindex" VS robots.txt


To better understand these two, let's break them down one by one.

robots.txt


All websites should have a robots.txt file, and for bots to be able to access it, the file should always be found at the root of the site, i.e. www.mywebsite.com/robots.txt

[snippet]Sitemap: https://www.bonty.net/sitemap.xml

User-agent: AhrefsBot
Disallow: /

User-agent: *
Crawl-Delay: 20
Disallow: /login
Disallow: /registration
Disallow: /search
Disallow: /search?
Disallow: /tag/*[/snippet]

Inside robots.txt, you can define which pages bots are or aren't allowed to crawl (Disallow), how frequently they may crawl (Crawl-Delay) and which bots a given set of rules applies to (User-agent). When a star (*) is used as the user agent, the rules apply to all bots. robots.txt is also where you define the path(s) to your sitemap, and you can list more than one sitemap.

If you disallow a page, as we have done here with /login, bots that respect robots.txt will not crawl it. Bad bots will most likely do the opposite, which is why you can't use robots.txt to hide areas you don't want listed on the web.

But, and there's always a but: just because, for example, Google's web bot can't crawl your page doesn't mean the page won't be indexed on Google. If a page you've restricted in robots.txt is linked to from another website, Google can still index it, even though it can't crawl it directly.

noindex


The "noindex"-tag only have one goal and that's to prevent that specific page from being listed in search engine results, but it will still allow bots to crawl the page

[snippet]<meta name="robots" content="noindex">[/snippet]

To prevent all robots from indexing specific pages, add the above meta tag to the <head> of every page you don't want indexed.
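The same directive can also be delivered as an HTTP response header, which is handy for files that can't carry a meta tag, such as PDFs. A minimal example of what that header looks like:

[snippet]X-Robots-Tag: noindex[/snippet]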

[snippet]<meta name="googlebot" content="noindex">[/snippet]

You can also target a single bot, as shown above with googlebot, by using that bot's name instead of "robots".
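The content attribute can hold several directives at once. For example, to keep a page out of results and ask bots not to follow its links:

[snippet]<meta name="robots" content="noindex, nofollow">[/snippet]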

For "noindex" to work, it's important the page is not restricted in robots.txt or the bots will never be able to see that the page should not be indexed

Bad bots


When it comes to bad bots, it doesn't matter much whether you use "noindex" or disallow them in robots.txt. Neither method is a hard ban; both are suggestions that only good bots will adhere to.

Still, some bots are merely annoying rather than outright bad, so it's still good practice to use both methods as needed.
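If you truly need to keep a bad bot out, you have to enforce it at the server level instead of asking nicely. A minimal sketch, assuming nginx and a made-up "BadBot" user-agent string:

[snippet]# inside a server { } block in the nginx config:
# refuse any request whose User-Agent contains "BadBot"
if ($http_user_agent ~* "BadBot") {
    return 403;
}[/snippet]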

Best approach


For most websites, a combination of "noindex" and robots.txt is the correct approach.

Login, search and similar pages are ones you typically don't want indexed, but don't mind being crawled. Then there are more sensitive areas, such as administrative pages, that you definitely don't want indexed, and whose URLs you also don't want to reveal in robots.txt.
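As a sketch of how that can look (paths are examples): let robots.txt manage crawling, keep sensitive paths out of it entirely, and rely on "noindex" in the pages themselves:

[snippet]# robots.txt - crawl rules only, no mention of admin URLs
User-agent: *
Crawl-Delay: 20
Disallow: /search?

<!-- on /login, /search and every admin page -->
<meta name="robots" content="noindex">[/snippet]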

Some bots also crawl very frequently, and you can use robots.txt to reduce how often they may crawl, or to ban them completely. Good bots will respect this; bad bots will often ignore it.

Which approach is correct for you will depend on the needs and goals of your website.

For further reading, we recommend looking into canonical tags, which complement noindex well.
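A canonical tag tells search engines which URL is the preferred version of a page, for example (the URL is a placeholder):

[snippet]<link rel="canonical" href="https://www.example.com/preferred-page">[/snippet]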



Tags: #SEO #noindex #robotstxt

