While a robots.txt file can be used for any website, it is especially important for dynamically generated websites, such as those based on the WordPress platform. Why? Because dynamic CMS systems like WordPress tend to create potentially dozens, or even thousands, of pages (URLs) that don’t need to be public. Not only can this lead to SEO penalties or even security issues, but in many cases, webmasters are not even aware that it is happening…
“Wait a minute!” some of you might be thinking. “What the heck is a robots.txt file anyway?” Without going into a lengthy explanation, suffice it to say that this file has been agreed upon by all major “web players”, especially search companies like Google and Bing, as the standard way to tell “robots” which pages of your website should NOT be “crawled”. Of course, this comes with two major caveats, which are clearly explained by the following warning:
- robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
- the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don’t want robots to use.
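Both caveats are easy to demonstrate: a compliant crawler simply downloads the file (which anyone can read) and checks each URL against its rules, while nothing actually enforces them. Here is a minimal sketch using Python’s standard `urllib.robotparser` and a made-up example file:

```python
from urllib import robotparser

# A hypothetical robots.txt -- note it is just plain, public text
# that any visitor (or scraper) could read.
robots_txt = """\
User-agent: *
Disallow: /wp-admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved bot honors the rules before fetching a URL...
print(rp.can_fetch("Googlebot", "https://example.com/wp-admin/options.php"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/about/"))                # True
# ...but nothing technically stops a misbehaving bot from requesting
# /wp-admin/ anyway -- the file is a polite request, not a lock.
```

In other words, the file only works because reputable crawlers choose to respect it.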
In other words, webmasters should not rely on robots.txt to hide sensitive content from the public, as hackers (or even a bored college student) can read your robots file anytime they wish. The entire concept of the file is therefore aimed at “well-behaved” bots, such as those from search engines and publicly traded companies.
“But why would I want to prevent Google from indexing certain pages?” Again, the answer depends on your specific website, and these days it is mostly related to SEO. If you are familiar with penalties for duplicate content or thin (“shallow”) content, you know that having tons of poor-quality pages from your domain indexed by search engines can degrade the reputation of your website. So, generally speaking, things like “tag archives” or “author archives” are not good candidates for indexing.
Of course, there is now a huge migration away from heavy robots.txt rules, because many bloggers (and Google) would rather you not block crawl access to too many files and pages, and instead use the noindex meta tag. WordPress itself automatically includes this tag on the login page, registration page, and admin directory.
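For reference, the noindex directive is just a meta tag placed in a page’s `<head>`; the generic form looks like the snippet below (WordPress and SEO plugins emit slight variations of it):

```html
<head>
  <!-- Tells compliant crawlers not to list this page in search results,
       while still allowing them to follow its links -->
  <meta name="robots" content="noindex, follow">
</head>
```

Unlike a robots.txt rule, this tag only works if the crawler is allowed to fetch the page and see it.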
Personally, I don’t like to rely on the noindex meta tag in certain cases, especially since you must install a third-party SEO plugin in WordPress in order to add that tag to “other” URLs that do not include it by default (see above). That being said, it is now almost always better to use the noindex meta tag rather than robots.txt rules whenever possible, as blocking Google from seeing certain resources on your domain (i.e. JS, CSS, etc.) can have devastating effects on your rankings.
In summary: it is best to use the noindex meta tag on any page that you know Google wishes to crawl BUT that you definitely don’t want showing up in search results, and it is best to use robots.txt blocking rules on any page that you know Google doesn’t want or need access to AND that you also don’t want showing up in search results.
A minimalistic robots.txt file is growing more and more popular for WordPress; here is LittleBizzy’s current file:
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
Disallow: /xmlrpc.php

Sitemap: https://www.littlebizzy.com/sitemap.xml
As you can see above, the only thing we now block access to is the increasingly hated XML-RPC file, which is completely worthless. The annoying thing, though, is that with so few rules, there is still a strong chance that Google (or other search engines) will accidentally index certain files from your WordPress plugin directory, wp-includes directory, or beyond, that you don’t want in search results. Unfortunately, there is not much anyone can do about this now, because blocking robot access to those locations will only anger Google’s robot minions, and Google gets what Google wants!
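If you want to sanity-check a minimal file like the one above before deploying it, Python’s standard `urllib.robotparser` can parse it straight from a string. Note this is only a rough check: simple first-match parsers like this one can disagree with Google’s longest-match logic on edge cases such as the bare `Disallow:` line.

```python
from urllib import robotparser

# The minimal robots.txt from above, parsed from a string for testing.
robots_txt = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
Disallow: /xmlrpc.php
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# archive.org's crawler is blocked from the entire site:
print(rp.can_fetch("ia_archiver", "https://www.littlebizzy.com/"))  # False

# Everyone else can still crawl normal pages:
print(rp.can_fetch("Googlebot", "https://www.littlebizzy.com/some-post/"))  # True
```

Treat the result for /xmlrpc.php itself with skepticism, since that is exactly the kind of rule-ordering edge case where parsers differ.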
Note: while it’s true that WordPress now creates its own virtual robots.txt file, it’s still always better to manually create a “real” one and place it in the root directory of your website. This gives you much better control of robot access.
To make sure that Google is “happy” with its access, you can use the free Fetch As Google tool within Google Search Console: plug in a few URLs from your site, click the Fetch & Render button, and check the results.
The far-reaching robots.txt rules below are no longer recommended and can get you severely penalized:
## NOTE: REMOVE ALL COMMENTS AND UNUSED LINES BEFORE USING THIS ROBOTS.TXT FILE FOR BETTER CLEANLINESS

## enable the below 2 lines to allow Google Adsense bot to crawl anything it wants (recommended)
User-agent: Mediapartners-Google
Disallow:

## enable the below 2 lines to block archive.org from archiving your entire site (up to you)
# User-agent: ia_archiver
# Disallow: /

## all the rules below this comment line will apply to ALL remaining unspecified robots
User-agent: *
Disallow:
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /trackback/

## Google has asked webmasters to no longer use below 2 lines (they block access to embedded JS/CSS/images/etc)
# Disallow: /wp-content/
# Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /search/
Disallow: /tag/
Disallow: /category/
Disallow: /uncategorized/
Disallow: /*/comment-page
Disallow: /*/page
Disallow: /*/order/
Disallow: /*/feed/
Disallow: */xmlrpc.php
Disallow: */wp-*.php
Disallow: */trackback/
Disallow: *?wptheme=
Disallow: *?comments=
Disallow: *?replytocom
Disallow: *?s=
Disallow: *?

Sitemap: https://www.littlebizzy.com/sitemap.xml