Robots

Robots.txt

A robots.txt file is important for the following primary reason:

Web robots, also called search engine spiders, crawlers, or bots, are programs that automatically follow links across the internet. Search engines look in the root of your domain for a file named "robots.txt". The file tells the robot (spider) which files and directories it may not spider (download to index). It can also name specific search engines that you do not want indexing your web site.
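To illustrate where a robot looks for the file, here is a minimal Python sketch that derives the "robots.txt" location from any page address. The function name robots_txt_url and the example.com addresses are placeholders chosen for illustration, not part of any standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that hosts page_url.

    The file always sits at the root of the host, so the original path,
    query string, and fragment are discarded.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address, used only as an example.
print(robots_txt_url("https://www.example.com/dirA/page.html?x=1"))
# prints: https://www.example.com/robots.txt
```

Note that a robots.txt file placed in a subdirectory (for example /dirA/robots.txt) would not be found by this lookup, which is why the file belongs in the root.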

This system is called the Robots Exclusion Standard. Search engines read the "robots.txt" file to learn which of your files they should not index for keywords and content in their search indexes.

The robots.txt file is nothing more than a plain text file. For example:

User-agent: *
Disallow: /dirA/
Disallow: /dirB/index.html

User-agent: Scambot
Disallow: /
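One way to check how a compliant robot would read rules like these is Python's standard urllib.robotparser module. The sketch below is only an illustration: the rules are a copy of the example, and the robot name "ExampleBot" and the example.com addresses are made-up placeholders.

```python
from urllib.robotparser import RobotFileParser

# Copy of the example rules; in practice a robot downloads them from the site.
RULES = """\
User-agent: *
Disallow: /dirA/
Disallow: /dirB/index.html

User-agent: Scambot
Disallow: /
"""


def build_parser(rules_text):
    """Parse robots.txt text and return a parser that can answer queries."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


parser = build_parser(RULES)
site = "http://www.example.com"  # placeholder host

# "ExampleBot" is a hypothetical robot that falls under the "*" rules.
print(parser.can_fetch("ExampleBot", site + "/dirA/page.html"))   # False
print(parser.can_fetch("ExampleBot", site + "/dirB/index.html"))  # False
print(parser.can_fetch("ExampleBot", site + "/dirC/page.html"))   # True

# Scambot is excluded from the whole site.
print(parser.can_fetch("Scambot", site + "/dirC/page.html"))      # False
```

The parser only reports what the rules ask for. Nothing in it stops a robot from requesting an excluded file anyway.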

Well-behaved robots will then not attempt to access the files you have excluded, but this is only an honor system. Robots that do not retrieve "robots.txt", or that ignore your exclusion requests, are referred to as bad bots. These need to be blocked using .htaccess, either by IP address or by User-Agent.