Robots.txt files are used to tell search engines which pages on your site you don’t want them to crawl. So if you don’t want a section of your website, or even your entire website, to show up when people search on Google, a robots.txt file is one way to do this.
What Are Robots.txt Files?
In the root folder of your website (the folder where you keep your home page) you would include a simple text file called robots.txt.
The file can be created using Notepad (Windows) or TextEdit (Mac). When creating the file you will need to name it robots.txt and upload it to the root folder of your website. Search engines know to look for this file automatically when they arrive on your website to begin scanning your content so it can be stored on their servers.
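As a rough illustration of where a crawler looks for the file, the robots.txt address can be worked out from any page URL by keeping the scheme and domain and replacing everything after them. The sketch below uses only Python's standard library; the helper name robots_txt_url and the example page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the address where a crawler would look for robots.txt.

    Hypothetical helper: keeps the scheme and domain of page_url and
    replaces everything after them with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page URL below is an assumed example, not a real page.
print(robots_txt_url("https://pomland.co.uk/blog/some-article/"))
# https://pomland.co.uk/robots.txt
```

Whichever page a bot starts from, it ends up at the same robots.txt address, which is why the file has to sit in the root folder.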
Meta Robots May Be a Better Solution than Robots.txt files
Even though this article is about robots.txt files, it is important to know that there is an alternative that may be a better option if you want to stop robots from indexing a certain page on your website. That option is to include a rule on the page itself, within the head section of its HTML.
This rule is page specific and only refers to the actual page it is written on, unlike a robots.txt file that may refer to the whole website or an entire section of your website.
An example of a Meta Robots rule written on a specific page could be as follows:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
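This would be written directly into the code for the page and only refers to the page it is written on. A crawler has to be able to fetch the page to see this tag, so the page must not also be blocked in robots.txt.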
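To check whether a page carries a Meta Robots rule, the tag can be read back out of the page's HTML. The following is a minimal sketch using Python's built-in HTML parser; the class name MetaRobotsFinder, the helper meta_robots_directives and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class MetaRobotsFinder(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The parser lower-cases tag and attribute names, but not values.
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives += [
                part.strip().lower() for part in content.split(",") if part.strip()
            ]


def meta_robots_directives(html: str) -> list[str]:
    """Hypothetical helper: return the Meta Robots directives found in html."""
    finder = MetaRobotsFinder()
    finder.feed(html)
    return finder.directives


# Sample page invented for this example.
page = """<html><head>
<meta name="robots" content="noindex, nofollow">
<title>Thank you</title>
</head><body>Thanks for your order.</body></html>"""

print(meta_robots_directives(page))  # ['noindex', 'nofollow']
```

A page without the tag returns an empty list, which means no page-level rule has been set.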
Some Notes About robots.txt Files
- robots.txt files will not instantly remove your page from a search engine’s index, and a blocked page can still appear in results if other sites link to it. If this is your concern you may want to try Google’s Search Console (previously known as Webmaster Tools), which has a remove URL function.
- robots.txt files are placed in the root directory of your website, the same folder you keep your home page in.
- robots.txt files cannot be used to hide webpages from the internet. Anyone with the link to that page will still be able to get to the page.
- robots.txt files will affect child pages on your site. So if you want to de-index your blog home page, this will also de-index all the articles written on your blog. If you are looking to de-index a specific page, use Meta Robots instead, as mentioned above.
What Is a Robot or Bot?
Robots, or bots as some people call them, are small programs that search engines use to automatically browse the web and record the information they find on a website. When we type a search into Google, it shows us a list of the websites that it thinks will be of interest to us based on that search. In order for Google to do this, its bots must already have visited all those websites and taken note of the information they found on those pages.
Why Would You Want To Hide Your Pages From a Search Engine?
If your goal is to attract relevant traffic from people searching for your products and services on Google, then it may seem counterproductive to want to stop your web pages from being shown in the SERPs (Search Engine Results Pages). But part of any good search engine optimisation strategy should be focussed on providing a good user experience to your visitors and sending them to the right page.
Some of the more common reasons you may want to stop pages from being indexed are:
- Building New Website – You’re building a new website and the content hasn’t been optimized yet and therefore you don’t want to let search engines index it just yet
- AdWords Landing Pages – You have a series of landing pages that you use for various AdWords campaigns and all these landing pages mention the same keywords. If all these pages were indexed then search engines would have trouble deciding which of your pages is the best page for someone looking for your products and services (keyword cannibalisation). Landing pages for paid campaigns are usually very different from the type of page you would use to rank on a search engine.
- A Keyword Page Strategy – You’re implementing a keyword page strategy. Similar to the above, a keyword page strategy is focussed on organic search engine optimization and is designed to keep all your similar keywords on the same page so that they don’t compete against each other for the same search terms. You may have a page on your site that uses a keyword similar to the one on your primary search engine optimized page. Instead of deleting the similar page, you could simply de-index it.
- Pages Not Relevant To Your Core Service or Products – You have a page or pages on your site that are not really relevant to your core products and services and you don’t want search engines to get confused about the core topic your website is focussed on.
- Old Products or Services – You’re planning on deleting a page with a product that has been discontinued and you want to begin the process of letting search engines know the page is no longer relevant. If 301 redirects are not ideal and there isn’t a new product replacing the old product then you may want to simply de-index the page.
- No Need to Index Your T’s & C’s – Terms and conditions pages are usually pages that companies choose not to have indexed by search engines.
- Gated Content – You may want to hide content from the search engines so that users will have to fill in a form with their details to get access to the content. This is a common gated content strategy used to gather email addresses so that companies can pass the leads on to their sales team. If the page was indexed by a search engine, then it would be easy for users to find the content without filling in their details.
- Thank You Pages – These pages appear after a user has completed a purchase or filled in their contact information. You wouldn’t want anyone to land on this page straight from a search engine, therefore you wouldn’t want this page to be indexed.
How Do We Write The Rules Inside The Robots.txt File?
We know how to create a file called robots.txt, and we know it needs to go into the root folder of our website so search engines can find it. But what do we put inside the robots.txt file to tell search engines which pages we don’t want them to scan?
Here is an example of the actual text you would put inside the robots.txt file:
User-agent: *
Disallow: /
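To see how a rule-following bot might read the rule "User-agent: *" followed by "Disallow: /", here is a minimal sketch using the robots.txt parser that ships with Python. It only models a well-behaved crawler and is not how any particular search engine is implemented; the helper name is_allowed and the blog URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule discussed in this section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: may this bot fetch url under these rules?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The second URL is an assumed example page on the same site.
for url in ("https://pomland.co.uk/", "https://pomland.co.uk/blog/"):
    print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Both checks come back False, because the rule applies to every bot and every page on the site.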
Let’s break this down to understand what is going on here.
User-agent: The user agent part of the instruction means that this instruction is intended for the automated bots or robots that search engines use to crawl your website.
* The star (asterisk) used in the rule tells the robots that this particular rule applies to all the different types of robots from all the different search engines. This would change if we were only referring to a specific robot from a specific search engine. More on this later.
Disallow: This part of the rule simply tells the bots not to scan the following pages.
/ The forward slash is probably the most complicated part of the rule, but it is essentially telling the bots that they shouldn’t scan any page of this website. Let’s look a little closer at this.
URLs contain forward slashes, for example:
The above URL has 6 forward slashes and the rule we are looking at is referring to the 3rd forward slash found in the above URL.
The third forward slash immediately follows the domain name and is in fact referring to the home page. Another way of referring to the home page in the rule could be as follows:
We don’t have to enter the whole root URL into the robots.txt file because the file is found on our website and the bots know we are referring to the website where the robots.txt file is found. So as a shortcut we simply use the 3rd forward slash.
So according to the rule we made above, every page whose URL begins with the root directory (https://pomland.co.uk/) must not be scanned. In other words, every page on the site, because every page URL starts with the root directory.
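The part of a URL that a Disallow rule is compared against is the path, which begins at the third forward slash. As a simplified sketch (it ignores query strings), the path can be pulled out with Python's standard library; the helper name rule_path and the blog URL are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def rule_path(url: str) -> str:
    """Hypothetical helper: the part of url that robots.txt rules are matched against."""
    path = urlsplit(url).path
    return path or "/"  # a bare domain is treated as the home page


print(rule_path("https://pomland.co.uk/"))           # /
print(rule_path("https://pomland.co.uk/blog/seo/"))  # /blog/seo/ (assumed example URL)
```

Every path starts with "/", which is why a rule that disallows "/" covers the whole site.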
But what if we wanted to only stop a certain section from being indexed by a bot and not the entire website?
We could do this by using the rule as follows:
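A rule that blocks only one section would presumably name that section's path, for example "Disallow: /blog/" (the exact path is an assumption based on the description here). The sketch below uses Python's built-in robots.txt parser as a stand-in for a rule-following bot and lists which of a few made-up URLs would be blocked; the helper name blocked_urls is also hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: block the blog section for every bot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blog/
"""


def blocked_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Hypothetical helper: return the URLs this bot may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Example URLs invented for illustration.
urls = [
    "https://pomland.co.uk/",
    "https://pomland.co.uk/contact/",
    "https://pomland.co.uk/blog/",
    "https://pomland.co.uk/blog/my-first-article/",
]

for url in blocked_urls(ROBOTS_TXT, "*", urls):
    print("blocked:", url)
```

Only the two blog URLs are reported as blocked. The home page and contact page stay open to bots, while the article under /blog/ is caught along with the blog home page.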
This rule is telling all robots to not index our blog pages.
Another way this could be written is as follows:
We have now told all the bots not to index our blog pages. All the child pages of the blog will be excluded as well, for example:
Will also not be indexed as well as
Will not be indexed
Robots.txt Files Affect the Child URLs
A lot of people fail to realise this important fact when using robots.txt files: the rule will affect all child pages of a page as well as the specific page you don’t want to be indexed. As mentioned above, the rule we applied to the blog page will also apply to every article written in my blog section, which would be really bad for my website.
This is why meta robots may be a better solution for you.
How To Target Specific Search Engine Bots in a Robots.txt File
There are many different companies that all have their own bots, which they use to scan your website. You may want to block only a specific search engine from crawling your website.
Another way we could have written the above rule, this time targeting a specific bot, is as follows:
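A rule aimed at one bot replaces the star with that bot's name. Assuming the bot in question is Google's main crawler, which identifies itself as Googlebot, the rule would read "User-agent: Googlebot" followed by "Disallow: /". The sketch below checks it with Python's built-in robots.txt parser; the helper name can_crawl is hypothetical and Bingbot is used only as an example of another bot.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: block only Google's main crawler from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: may this bot fetch url under these rules?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


home_page = "https://pomland.co.uk/"
for agent in ("Googlebot", "Bingbot"):
    print(agent, can_crawl(ROBOTS_TXT, agent, home_page))
```

Googlebot is refused while Bingbot is allowed, since no rule in the file mentions it.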
The above rule will only tell Google’s bot to not index your site. So your site will still appear in other search engines but not Google’s.
AdWords Bots Will Not Be Affected
A major concern about blocking bots is that you will also block Google’s AdWords bots from crawling your paid landing pages, and that you will therefore have a poor landing page quality score because AdWords can’t crawl your landing page. Don’t worry, Google uses a completely separate bot to scan your paid landing pages.
If you don’t tell this specific bot not to scan your landing pages then you should be fine. For peace of mind you could use the following rule to make sure Google’s AdWords bots are not stopped from visiting your website even when all other bots have been:
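A pair of rules matching this description would name the ads bot first with an empty Disallow line, then block everyone else. The sketch below assumes the bot's name is AdsBot-Google and uses Python's built-in robots.txt parser as a stand-in for a rule-following bot; the helper name can_crawl and the landing page URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: the ads bot may crawl everything, all other bots nothing.
ROBOTS_TXT = """\
User-agent: AdsBot-Google
Disallow:

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: may this bot fetch url under these rules?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Landing page URL invented for this example.
landing_page = "https://pomland.co.uk/landing/spring-offer/"
for agent in ("AdsBot-Google", "Googlebot", "Bingbot"):
    print(agent, can_crawl(ROBOTS_TXT, agent, landing_page))
```

Only AdsBot-Google is let through. The blank line between the two groups matters, because it is what separates the first rule from the second.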
The above two rules work together. The first one targets a very specific bot used by AdWords and tells it that it can go ahead and crawl our website, because we have not mentioned any URL in the disallow field.
The second rule is the standard disallow rule that stops all other bots from crawling the website. This means that our website will not appear on Google’s search pages or any other search pages, but the AdWords bot will still be able to crawl the pages to determine our landing page relevancy score.