If you want to find out (or know more) about the mysterious robots.txt file and web robots, The Web Robots Pages will definitely be of help. Its author, Martijn Koster, has generously written a huge FAQ and also gathered code examples, thorough information and references.
Sometimes site owners find that an indexing robot has indexed their pages, or that a resource discovery robot has visited parts of a site that, for one reason or another, should not be visited by robots.
In recognition of this problem, many Web Robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms:
The Robots Exclusion Protocol
A Web site administrator can indicate which parts of the site should not be visited by a robot by providing a specially formatted file at the root of the site (example.com/robots.txt). More information on this method is found here:
- Web Server Administrator’s Guide to the Robots Exclusion Protocol
- HTML Author’s Guide to the Robots Exclusion Protocol
- The original 1994 protocol description, as currently deployed
- The revised Internet-Draft specification, which is not yet completed or implemented
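As a sketch of the format the guides above describe, here is a robots.txt that keeps all robots out of two directories (the paths are illustrative, not taken from the guides):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
```

A robot looks for the record whose User-agent line matches its own name (`*` matches any robot) and treats each Disallow value as a URL path prefix it must not visit.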
The Robots META tag
A Web author can indicate whether a page may be indexed, or analysed for links, through the use of a special HTML META tag. Full details on how this tag works are provided in:
- Web Server Administrator’s Guide to the Robots META tag
- HTML Author’s Guide to the Robots META tag
- The original notes from the May 1996 Indexing Workshop
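As a minimal example of the mechanism described above (the exact directives and their combinations are covered in the guides), a META tag asking robots neither to index a page nor to follow its links looks like this:

```html
<meta name="robots" content="noindex, nofollow">
```

It belongs in the page's head section and applies only to that one page, which makes it useful for authors who cannot edit the site-wide robots.txt file.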
If you just want to include a robots.txt file that allows anyone to crawl your site, add the following and upload it to your root directory (example.com/robots.txt):
User-agent: *
Disallow:
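On the consuming side, a well-behaved crawler can check such rules programmatically. Python's standard-library urllib.robotparser module does this; a minimal sketch, using a hypothetical Disallow rule rather than the allow-all file above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration: everything under /private/ is off limits.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)  # parse() accepts an iterable of robots.txt lines

# can_fetch(useragent, url) answers whether this robot may retrieve the URL.
print(rp.can_fetch("MyBot", "https://example.com/private/data.html"))  # False
print(rp.can_fetch("MyBot", "https://example.com/index.html"))         # True
```

In practice a crawler would call `rp.set_url(...)` and `rp.read()` to fetch the live robots.txt instead of parsing an inline list.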
If you happen to know of one or several web robots worth keeping out, please let me know…