Prevent pages from being indexed: robots.txt vs noindex
Robots.txt is one of the most common ways people try to block web bots, but the "noindex" tag has a similar goal and might even be what you actually want.
"noindex" VS robots.txt
To better understand what these two are, let's break them down one by one.

robots.txt
All websites should have a "robots.txt" file, and for bots to be able to access it, the file should always be found at www.mywebsite.com/robots.txt

User-agent: AhrefsBot
Disallow: /
User-agent: *
Crawl-Delay: 20
Disallow: /login
Disallow: /registration
Disallow: /search
Disallow: /search?
Disallow: /tag/*
Inside robots.txt you can define which pages bots are and are not allowed to crawl, how frequently they may crawl (Crawl-Delay) and which bots are affected by certain rules (User-agent). When an asterisk (*) is used, it means all bots. Robots.txt is also where you define the path(s) to your sitemap, and you can list more than one sitemap.
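As a small sketch of that last point, sitemaps are declared with Sitemap lines; the URLs below are made-up examples, and your own sitemap paths will differ:

Sitemap: https://www.mywebsite.com/sitemap.xml
Sitemap: https://www.mywebsite.com/sitemap-blog.xml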
If you disallow a page, as we have done here with /login, bots that respect robots.txt will not crawl it. Bad bots will most likely do the opposite, which is why you can't use robots.txt to try to hide areas you don't want to be found on the web.
But, and there's always a but: just because, for example, Google's web bot can't crawl your page, that doesn't mean the page won't be indexed on Google. If a page you've restricted from crawling in robots.txt is linked to from another website, Google will still index it, even if it can't crawl it directly.
noindex
The "noindex"-tag only have one goal and that's to prevent that specific page from being listed in search engine results, but it will still allow bots to crawl the pageTo prevent all robots to index a specific pages, you'll need to add above meta tag to the header of all pages you wish to not be indexed
You can also target a specific robot, by naming the bot you don't allow to index the page instead of "robots".
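For example, to tell only Google's crawler (which uses the name "googlebot") not to index the page:

<meta name="googlebot" content="noindex">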
For "noindex" to work, it's important the page is not restricted in robots.txt or the bots will never be able to see that the page should not be indexed
Bad bots
When it comes to bad bots, it doesn't matter much whether you use "noindex" or disallow them in robots.txt, as neither method is a hard ban, only a suggestion of how you want things handled, and something good bots will adhere to.

Still, some bots are merely annoying rather than outright bad, so it's good practice to use both methods as needed.
Best approach
For most websites, a combination of "noindex" and robots.txt will be the correct approach.

Normally login, search and similar pages are ones you don't want indexed, but at the same time don't mind being crawled. Then you have more sensitive areas, such as administrative areas, that you definitely don't want indexed and whose URLs you also don't want to reveal in robots.txt.
Some bots also crawl very frequently, and you may want to use robots.txt to reduce how often they can crawl, or ban them completely. Good bots will respect this, but bad bots will often ignore it.
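Putting this together, one possible sketch of the combined approach could look like the following; the crawl delay, the choice of AhrefsBot and the paths are only examples. Note that the login, search and administrative pages are deliberately not listed in robots.txt, so crawlers can still see their noindex tags and the admin URLs are not revealed:

User-agent: AhrefsBot
Disallow: /

User-agent: *
Crawl-Delay: 20

Sitemap: https://www.mywebsite.com/sitemap.xml

And in the <head> of the login, search and administrative pages:

<meta name="robots" content="noindex">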
Which approach is correct for you will depend on the needs and goals of your website.
For further reading, we recommend looking into the canonical tag, which complements noindex well.