A robots.txt file is a small text file that lists which pages and files on your site a search engine may or may not crawl. In other words, it gives search engine bots instructions about which parts of your website they are allowed to visit, which in turn influences which pages can appear in search results.
Where should it be?
The robots.txt file must be placed in the root of your domain. You can check this by typing /robots.txt after your domain name, for example https://www.example.com/robots.txt.
Why use a robots.txt?
For many websites the file is superfluous: they want all of their pages to be crawled and indexed. However, there are several good reasons to use a robots.txt file in the context of SEO.
The main goal is to exclude pages that you would rather not have displayed in the search engines. Most of the time, this is how you deal with duplicate content. After all, it makes no sense to have completely identical pages displayed in the search results, and Google will also penalize you for this in terms of SEO.
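As a brief illustration, the snippet below blocks two hypothetical sources of duplicate content: a /filter/ directory and URLs with a ?sort= parameter. These paths are placeholders, so replace them with whatever duplicate URLs exist on your own site (major search engines such as Google support the * wildcard used here):

```
# Block hypothetical duplicate-content URLs for all bots
User-agent: *
Disallow: /filter/
Disallow: /*?sort=
```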
How does it work?
There are a few standard rules and directives for a robots.txt file; a complete example follows the list below.
- User-agent: This specifies which bot(s) the rules apply to. “User-agent: *” means that the rules under that section apply to every bot.
- Disallow: indicates which parts (URL path prefixes) of your website the bot may not crawl.
- Allow: explicitly permits crawling of certain sub-folders or files, even within a section that is otherwise disallowed.
- Sitemap: indicates where your XML sitemap is located. Adding the full URL of your sitemap to your robots.txt is considered best practice.
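Putting these directives together, a minimal robots.txt could look like the sketch below. The /wp-admin/ paths and the sitemap URL are only illustrative; replace them with the paths and domain of your own site:

```
# Rules for all bots
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Location of the XML sitemap
Sitemap: https://www.example.com/sitemap.xml
```

Each User-agent block applies only to the bots named in it, so you can add separate blocks with stricter or looser rules for specific crawlers.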
Please note:
- The file is publicly accessible to everyone. Therefore, never put passwords or other sensitive or personal information in it.
- Bots can be programmed to ignore the file. Malicious bots, such as scrapers and malware crawlers, may therefore still crawl your website regardless of your rules.