At the end of this post is my current robots.txt for a standard installation of the Drupal content management system. It borrows heavily from many sources, including this thread on the Drupal website.
That same thread includes a discussion about whether a robots.txt should be distributed with Drupal. I'm not a strong proponent one way or the other. What I want instead, and what will likely have to be a combination of things, is a plaintext module that serves nodes as plain text, without themes or anything fancy. This would allow the authors of independent Drupal sites on the same server to each have a distinct robots.txt, as well as other files. An additional possible use is publishing a comment spam blacklist: the plain-text list would be parsed at a semi-regular interval and any comments containing the listed words deleted.
As the thought has evolved, I now think the way to do it is with a special node type. The "plain-jane" node type would simply regurgitate whatever it is fed, verbatim. With URL aliasing it could then be robots.txt or any other file one prefers.
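To make the idea concrete, here is a rough, untested sketch of how such a module might spit a node's body back out verbatim, assuming a Drupal 4.7/5-era hook API (hook_menu, node_load, drupal_set_header). The module name, path, and callback are hypothetical, and a real module would also need its own node type and proper access checks:

<?php
// plainjane.module (hypothetical name): serve a node body verbatim,
// bypassing the theme layer entirely.

function plainjane_menu($may_cache) {
  $items = array();
  if ($may_cache) {
    $items[] = array(
      'path' => 'plainjane',
      'callback' => 'plainjane_page',
      'access' => user_access('access content'),
      'type' => MENU_CALLBACK,
    );
  }
  return $items;
}

function plainjane_page($nid = NULL) {
  $node = node_load($nid);
  if (!$node) {
    drupal_not_found();
    return;
  }
  // Emit the raw body as text/plain and stop before Drupal themes the page.
  drupal_set_header('Content-Type: text/plain; charset=utf-8');
  print $node->body;
  exit();
}

With a URL alias (or a rewrite rule) mapping robots.txt to plainjane/NID, the body of that node becomes the file crawlers see.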
User-agent: *
Crawl-Delay: 10
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /files
Disallow: /search
Disallow: /book/print
Disallow: /admin
Disallow: /cron.php
Disallow: /xmlrpc.php
Disallow: /database/
Disallow: /includes/
Disallow: /modules/
Disallow: /scripts/
Disallow: /themes/
Disallow: */add/
3 Comments
Another reason for plain text pages
Looking through Google's AdSense alternate ad specs gives yet another reason for yesterday's plain text node idea.
Good one!
Thanks for the great example. I combined it with a list of known nasties from searchengineworld.com to come up with the following:
» robots.txt
Update: instead of listing
Update: instead of listing the bots I wanted to exclude, I updated the above link to only include the bots I want to crawl my site (i.e., exclude all the others).
Seemed more logical.
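For reference, the whitelist approach described above usually looks something like this in robots.txt: an empty Disallow for each bot you welcome, and a blanket Disallow for everyone else (Googlebot here is just an example):

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Well-behaved crawlers obey the most specific matching User-agent group, so the named bots get the run of the site while everything else is shut out; bots that ignore robots.txt are unaffected either way.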