Today is the 20th anniversary of robots.txt, an easy and convenient way for webmasters to block search engines from crawling their pages.
The robots.txt file was created by Martijn Koster in 1994. All major search engines of the time, including WebCrawler, Lycos and AltaVista, quickly adopted it, and 20 years later all major search engines continue to support and obey it.
It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /
The “User-agent: *” means this section applies to all robots. The “Disallow: /” tells the robot that it should not visit any pages on the site.
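To make the check concrete, here is a minimal sketch of how a well-behaved robot might apply those two lines, using Python's standard `urllib.robotparser` module. The helper name `is_allowed` and the agent name `ExampleBot` are made up for illustration, and the rules are parsed from a string rather than fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, as the robot would read them
# from http://www.example.com/robots.txt.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper for illustration; a real crawler would download
    the file first and cache the parsed result per site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/welcome.html"
    if is_allowed(BLOCK_ALL, "ExampleBot", url):
        print("OK to crawl", url)
    else:
        print("Blocked by robots.txt:", url)
```

Because `User-agent: *` matches every robot and `Disallow: /` matches every path, the check comes back negative for any page on the site, so a compliant robot would skip the visit.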
There are two important considerations when using /robots.txt:
Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
The /robots.txt file is publicly available. Anyone can see which sections of your server you don’t want robots to visit.