Recently we had an overworked machine, with around 60% of its traffic coming from search engine indexing bots. We needed to stabilise the load, but didn’t want to give up indexing, and with it our SEO, entirely.

This is the solution we came up with.

First off, we decided there were a few bots we could live without, so we added some .htaccess rules to 302-redirect them away:

# looking for majestic bot to kick it away.
RewriteCond %{HTTP_USER_AGENT} ^.*MJ12bot.*$ [NC]
RewriteRule ^(.*)$ http://go.away/ [R=302,L]
# Ditto Yandex
RewriteCond %{HTTP_USER_AGENT} ^.*Yandex.*$ [NC]
RewriteRule ^(.*)$ http://go.away/ [R=302,L]
# Ditto Baidu
RewriteCond %{HTTP_USER_AGENT} ^.*Baiduspider.*$ [NC]
RewriteRule ^(.*)$ http://go.away/ [R=302,L]
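
As a side note, the three rules above all do the same thing, so they could probably be collapsed into one. This is only a sketch of that idea, not what we ran in production; it assumes the same placeholder target (http://go.away/) and that mod_rewrite is already switched on with RewriteEngine On:

```apache
# Sketch: one condition covering all three unwanted bots.
# The pattern is unanchored, so it matches anywhere in the user-agent string.
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|Yandex|Baiduspider) [NC]
RewriteRule ^ http://go.away/ [R=302,L]
```

If the redirect itself isn't important, swapping the target and flags for `- [F]` would answer these bots with a 403 Forbidden instead.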

Then there were the larger, more important bots, which we only wanted to put off until quieter periods. Our solution was to answer their requests with a 503 (Service Unavailable) and a Retry-After header suggesting they come back in 12 hours (43200 seconds):

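# Other bots we will allow, but during peak hours (08:00-19:59 server time) we will ask them to come back later.
RewriteCond %{HTTP_USER_AGENT} bot [NC]
# Group the alternation so the anchors apply to every hour, not just the first and last.
RewriteCond %{TIME_HOUR} ^(08|09|1[0-9])$
# Flag the request (BOT_PEAK is an arbitrary variable name) so the header below is only added to these responses.
RewriteRule .* - [E=BOT_PEAK:1,R=503,L]
# "always" is needed for the header to be sent on an error response such as a 503.
Header always set Retry-After "43200" env=BOT_PEAK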

The above is by no means foolproof, as it relies on the requesting bot having ‘bot’ in its user-agent, but we found it lowered the load significantly as soon as it was put live.
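
One way to catch a few more crawlers would be to widen the user-agent condition. The line below is a hypothetical variation rather than something we tested; the extra keywords are assumptions about what other crawlers call themselves (Yahoo's crawler, for example, identifies itself as Slurp), and a broader pattern also raises the odds of catching something you didn't mean to:

```apache
# Hypothetical broader match, replacing the plain 'bot' condition.
# Also matches agents that describe themselves as spiders, crawlers or Slurp.
RewriteCond %{HTTP_USER_AGENT} (bot|spider|crawl|slurp) [NC]
```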

It did cause one further problem: our server is monitored by Pingdom, which does have ‘bot’ in its user agent, as below:

"GET / HTTP/1.0" 200 35120 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" "www.domain.com"