How Do .Gov Sites Use Robots.txt?


Looking back at some older posts, I revisited the SEOmoz post on government sites’ susceptibility to cross-site scripting. The results of their test, run over two years ago, made me curious about how government sites interact with the search engines today.

So I went to Google and searched: “robots.txt” “disallow:” filetype:txt inurl:.gov

It returned 231 results. Each of these sites disallows some section of the site. Whitehouse.gov disallows 2132 sections of the site, including keeping their internal search from visiting their site map. Politics aside, it is interesting that the White House is so restrictive of search engines.
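A count like the one above can be reproduced with a few lines of Python. This is only a sketch: the function name `count_disallow_rules` and the sample text are made up for illustration, and it assumes that one "section" means one non-empty Disallow line.

```python
def count_disallow_rules(robots_txt: str) -> int:
    """Count non-empty Disallow rules, ignoring comments and blank lines."""
    count = 0
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        # An empty "Disallow:" blocks nothing, so it is not counted.
        if field.strip().lower() == "disallow" and value.strip():
            count += 1
    return count


# Illustrative sample only, not the real whitehouse.gov file.
sample = """\
# sample only
User-agent: *
Disallow: /stateoftheunion2003/2002/photos/text
Disallow: /stateoftheunion2003/2002/text
Disallow:
"""

print(count_disallow_rules(sample))
```

To run it against a live site you would first download that site's robots.txt and pass the text in.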

Here is an excerpt:

Disallow:    /stateoftheunion2003/2002/behindthescenes/print/text
Disallow:    /stateoftheunion2003/2002/behindthescenes/text
Disallow:    /stateoftheunion2003/2002/photos/print/text
Disallow:    /stateoftheunion2003/2002/photos/text
Disallow:    /stateoftheunion2003/2002/print/text
Disallow:    /stateoftheunion2003/2002/text

Notice how they all end in text? The White House maintains text-only versions of every page, so it seems someone is on the ball about duplicate content. Except whoever runs the site did not use a robots meta tag with "noindex" on the text versions, or rel="nofollow" on the links from the main page to the text versions.
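To see what rules like these block, you can feed them to Python's standard `urllib.robotparser`. This is a rough sketch, not how any search engine works. The `User-agent: *` line is an assumption (the excerpt does not show which agent the rules belong to), and `ExampleBot` and `is_blocked` are made-up names.

```python
from urllib.robotparser import RobotFileParser

# Three rules from the excerpt, placed under an assumed "User-agent: *" group.
ROBOTS_EXCERPT = """\
User-agent: *
Disallow: /stateoftheunion2003/2002/photos/print/text
Disallow: /stateoftheunion2003/2002/photos/text
Disallow: /stateoftheunion2003/2002/text
"""

parser = RobotFileParser()
parser.parse(ROBOTS_EXCERPT.splitlines())


def is_blocked(path: str, agent: str = "ExampleBot") -> bool:
    """True if the parsed rules forbid `agent` from fetching `path`."""
    return not parser.can_fetch(agent, path)


for path in (
    "/stateoftheunion2003/2002/text",
    "/stateoftheunion2003/2002/text/index.html",
    "/stateoftheunion2003/2002/",
):
    print(path, "blocked" if is_blocked(path) else "allowed")
```

The parser matches rules by path prefix, so the text-only copies are blocked and the regular page is not. That only stops crawling by bots that honor the file.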

Take Easter Egg Roll 2008 as an example. It has one live PR 8 link pointing to it, but it isn’t indexed. Do you have a page with that kind of link that doesn’t show up? I am unclear how that page achieves invisibility: strong domain, strong link, no noindex (or nofollow) instruction, and still it isn’t indexed. Maybe Google obeys the intent of the White House robots.txt instead of holding them to the literal interpretation the rest of us live by.

Other things you can learn from government robots.txt:

# Rover is a bad dog
User-agent: Roverbot
Disallow: /

# EmailSiphon is a hunter/gatherer which extracts email addresses for spam-mailers to use
User-agent: EmailSiphon
Disallow: /

# Exclude MindSpider since it appears to be ill-behaved
User-agent: MindSpider
Disallow: /

Whoever runs the CDC site has a sense of humor. This section brings up something you should remember: a robots.txt only works on robots that obey the rules. Malicious bots either don’t read your instructions at all, or read them specifically to target the places that are disallowed. If you do track a “bad dog”, you should shut out its IP address or use .htaccess to block its user agent to really deter it, or better yet, randomly redirect it.
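If your site runs on Python rather than behind Apache, the same idea as an .htaccess user-agent block can be sketched as a small WSGI wrapper. This swaps .htaccess for Python and is only an illustration: the names `BLOCKED_AGENTS`, `is_blocked_agent`, and `block_bad_agents` are made up, and the list holds just the three agents from the excerpt above. User-agent strings are easy to fake, so treat this as a speed bump.

```python
# Lowercase substrings to look for in the User-Agent header (from the excerpt).
BLOCKED_AGENTS = ("roverbot", "emailsiphon", "mindspider")


def is_blocked_agent(user_agent: str) -> bool:
    """Case-insensitive substring match against the block list."""
    ua = (user_agent or "").lower()
    return any(name in ua for name in BLOCKED_AGENTS)


def block_bad_agents(app):
    """Wrap a WSGI app so blocked user agents get a 403 response."""
    def middleware(environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT", "")):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else is passed through to the real application.
        return app(environ, start_response)
    return middleware
```

You would wrap your existing application object once, for example `application = block_bad_agents(application)`.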

The same person who runs the CDC site probably runs the Medicare site: both have the same Rover, Siphon, and MindSpider exclusions, and both use FrontPage. Yeah, that is right, the CDC uses FrontPage.

# Ignore FrontPage files
User-agent: *
Disallow: /_borders
Disallow: /_derived
Disallow: /_fpclass
Disallow: /_overlay
Disallow: /_private
Disallow: /_themes
Disallow: /_vti_bin
Disallow: /_vti_cnf
Disallow: /_vti_log
Disallow: /_vti_map
Disallow: /_vti_pvt
Disallow: /_vti_txt

You can get more out of these documents by looking at who they exclude, not what they exclude from the search engines. I wish the US Government would compile and publish a list of misbehaving bots.
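Pulling the "who" out of a robots.txt is easy to script. The sketch below lists every user agent whose group contains a bare "Disallow: /". The function name `fully_excluded_agents` and the shortened sample are made up for illustration, and the parsing is simplified: consecutive User-agent lines share the rules that follow them.

```python
def fully_excluded_agents(robots_txt: str) -> list[str]:
    """Return user agents whose group contains 'Disallow: /'."""
    excluded = []
    current_agents = []  # agents named in the group being read
    in_rules = False     # True once the group's first rule line was seen
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents, in_rules = [], False
            current_agents.append(value)
        elif field in ("disallow", "allow"):
            in_rules = True
            if field == "disallow" and value == "/":
                excluded.extend(a for a in current_agents if a not in excluded)
    return excluded


# Shortened sample based on the excerpts above.
sample = """\
# Rover is a bad dog
User-agent: Roverbot
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: MindSpider
Disallow: /

# Ignore FrontPage files
User-agent: *
Disallow: /_borders
Disallow: /_private
"""

print(fully_excluded_agents(sample))
```

Run over many sites' files, the output would give you a crude tally of which bots webmasters ban most often.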

If you think you are being dragged down by malicious bots, you can get a comprehensive list of bad user agents. Kloth.net has instructions for building a bot trap. Be careful when building bot traps: doing it wrong can damage your site by triggering alerts and capturing innocent search bots and user tools.
