Blocking bots

An internet bot, or web crawler, traverses websites on the internet to build a massive database or index. It automatically visits all available web pages and records their content, accessibility, structure, and more.

Managing bots is worth the effort. By properly configuring bot access, you can significantly improve your website's speed and security. Bots access your website continually; in some cases, dozens of bots may be active at the same time, requesting hundreds to thousands of pages per second.


Website Performance

Small websites and e-shops usually don't need to worry about bots: a bot visits, crawls the entire site within a few seconds, and the slight increase in server load quickly subsides.
Large websites, however, can be problematic. A bot may start indexing the site, which can take minutes or even hours, and the server load keeps growing until the server becomes unavailable. A common example is an e-shop with more than 100,000 products that does not use caching. A well-built e-shop usually doesn't generate significant load, and bot indexing proceeds smoothly; problems typically arise when multiple bots are active simultaneously.

Fortunately, it is possible to influence the speed of well-behaved bots by adding a directive to /robots.txt:

User-agent: *
Crawl-delay: 10

The value defines the number of seconds a bot should wait between page requests. Note that not every search engine honors Crawl-delay; Google, for example, ignores it and lets you adjust its crawl rate in Search Console instead. Many bots and search engines also let you influence crawling speed directly in their own interface.


Security

Every website has parts you don't want indexed: typically administrative interfaces, partner files, and so on. Blocking access for well-behaved bots is easily done in robots.txt:

User-agent: *
Disallow: /Admin

= All bots are denied access to the /Admin directory. (Paths in robots.txt are case-sensitive, so /admin would need its own rule.)

Alternatively, you can block indexing only for specific bots:

User-agent: googlebot
Disallow: /B2B

= Google is denied access to the B2B directory. Google can still crawl other parts of the site. Other bots can crawl everything.
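The effect of such rules can be verified before deployment. Below is a minimal sketch using Python's standard-library robots.txt parser, with file content mirroring the example above (the path /B2B/pricelist.html is only illustrative):

```python
from urllib.robotparser import RobotFileParser

# robots.txt content mirroring the example above: only Googlebot
# is barred from /B2B; everything else is unrestricted.
robots_txt = """\
User-agent: googlebot
Disallow: /B2B
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("googlebot", "/B2B/pricelist.html"))     # False: blocked
print(rp.can_fetch("googlebot", "/index.html"))             # True: other paths allowed
print(rp.can_fetch("SomeOtherBot", "/B2B/pricelist.html"))  # True: rule targets Googlebot only
```
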


Good vs. Bad Bots

So why does this article keep talking about good bots? A good bot is one that respects the rules set in robots.txt, or whose behavior can be adjusted in its administration interface.

Good Bots:

  • Search engine bots
  • Website and server availability monitoring bots
  • CDN caching bots (MaxCDN, Cloudflare, CDN77, Akamai, ...)

Bad Bots:

  • Marketing agency bots
  • Competitor price monitoring bots
  • SEO tool bots
  • Bots brute-forcing login pages to bypass validation (/wp-admin)
  • Bots attacking websites via SQL injection


How to Block Bad Bots from Your Website?

Bad bots ignore robots.txt, so the only remaining option is to block them at the server level, via .htaccess.

You can block bot IP addresses:

Order Deny,Allow
Deny from 111.111.111.111
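Note that Order and Deny from are Apache 2.2 directives; on Apache 2.4 they work only through the mod_access_compat module. The modern equivalent (a sketch assuming mod_authz_core, the 2.4 default) is:

```apache
# Apache 2.4+: allow everyone except the listed address
<RequireAll>
    Require all granted
    Require not ip 111.111.111.111
</RequireAll>
```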

But bad bots often change IP addresses or access from multiple IP ranges simultaneously.

Therefore, you need to block bots based on their names.

SetEnvIfNoCase User-Agent "SemrushBot" bad_user
SetEnvIfNoCase User-Agent "semrush" bad_user
Deny from env=bad_user

A specific entry blocking one of the most aggressive bots, Semrush. (The second pattern alone would suffice: matching is case-insensitive, so "semrush" also catches "SemrushBot".)
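Similarly, Deny from env= is Apache 2.2 syntax. On Apache 2.4 without mod_access_compat, an equivalent sketch using mod_authz_core is:

```apache
# Apache 2.4+: deny any request whose User-Agent set the bad_user variable
<RequireAll>
    Require all granted
    Require not env bad_user
</RequireAll>
```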


We have prepared a list to block most current bad bots. Simply add it to .htaccess:


SetEnvIfNoCase User-Agent "SemrushBot" bad_user
SetEnvIfNoCase User-Agent "semrush" bad_user
SetEnvIfNoCase User-Agent "BlackWidow" bad_user
SetEnvIfNoCase User-Agent "ChinaClaw" bad_user
SetEnvIfNoCase User-Agent "Custo" bad_user
SetEnvIfNoCase User-Agent "DISCo" bad_user
SetEnvIfNoCase User-Agent "eCatch" bad_user
SetEnvIfNoCase User-Agent "EirGrabber" bad_user
SetEnvIfNoCase User-Agent "EmailSiphon" bad_user
SetEnvIfNoCase User-Agent "EmailWolf" bad_user
SetEnvIfNoCase User-Agent "ExtractorPro" bad_user
SetEnvIfNoCase User-Agent "EyeNetIE" bad_user
SetEnvIfNoCase User-Agent "FlashGet" bad_user
SetEnvIfNoCase User-Agent "GetRight" bad_user
SetEnvIfNoCase User-Agent "GetWeb!" bad_user
SetEnvIfNoCase User-Agent "Go!Zilla" bad_user
SetEnvIfNoCase User-Agent "Go-Ahead-Got-It" bad_user
SetEnvIfNoCase User-Agent "GrabNet" bad_user
SetEnvIfNoCase User-Agent "Grafula" bad_user
SetEnvIfNoCase User-Agent "HMView" bad_user
SetEnvIfNoCase User-Agent "MegaIndex.ru" bad_user
SetEnvIfNoCase User-Agent "HTTrack" bad_user
SetEnvIfNoCase User-Agent "InterGET" bad_user
SetEnvIfNoCase User-Agent "JetCar" bad_user
SetEnvIfNoCase User-Agent "larbin" bad_user
SetEnvIfNoCase User-Agent "LeechFTP" bad_user
SetEnvIfNoCase User-Agent "Navroad" bad_user
SetEnvIfNoCase User-Agent "NearSite" bad_user
SetEnvIfNoCase User-Agent "NetAnts" bad_user
SetEnvIfNoCase User-Agent "NetSpider" bad_user
SetEnvIfNoCase User-Agent "NetZIP" bad_user
SetEnvIfNoCase User-Agent "Octopus" bad_user
SetEnvIfNoCase User-Agent "PageGrabber" bad_user
SetEnvIfNoCase User-Agent "pcBrowser" bad_user
SetEnvIfNoCase User-Agent "RealDownload" bad_user
SetEnvIfNoCase User-Agent "ReGet" bad_user
SetEnvIfNoCase User-Agent "SiteSnagger" bad_user
SetEnvIfNoCase User-Agent "SmartDownload" bad_user
SetEnvIfNoCase User-Agent "SuperBot" bad_user
SetEnvIfNoCase User-Agent "SuperHTTP" bad_user
SetEnvIfNoCase User-Agent "Surfbot" bad_user
SetEnvIfNoCase User-Agent "tAkeOut" bad_user
SetEnvIfNoCase User-Agent "VoidEYE" bad_user
SetEnvIfNoCase User-Agent "WebAuto" bad_user
SetEnvIfNoCase User-Agent "WebCopier" bad_user
SetEnvIfNoCase User-Agent "WebFetch" bad_user
SetEnvIfNoCase User-Agent "WebLeacher" bad_user
SetEnvIfNoCase User-Agent "WebReaper" bad_user
SetEnvIfNoCase User-Agent "WebSauger" bad_user
SetEnvIfNoCase User-Agent "WebStripper" bad_user
SetEnvIfNoCase User-Agent "WebWhacker" bad_user
SetEnvIfNoCase User-Agent "WebZIP" bad_user
SetEnvIfNoCase User-Agent "Widow" bad_user
SetEnvIfNoCase User-Agent "WWWOFFLE" bad_user
SetEnvIfNoCase User-Agent "Zeus" bad_user
SetEnvIfNoCase User-Agent "AhrefsBot" bad_user
SetEnvIfNoCase User-Agent "DotBot" bad_user
SetEnvIfNoCase User-Agent "BaiduSpider" bad_user
SetEnvIfNoCase User-Agent "CCBot" bad_user
SetEnvIfNoCase User-Agent "MJ12bot" bad_user
SetEnvIfNoCase User-Agent "SiteAnalyzerbot" bad_user
SetEnvIfNoCase User-Agent "BLEXBot" bad_user
SetEnvIfNoCase User-Agent "Uptimerobot" bad_user
SetEnvIfNoCase User-Agent "AspiegelBot" bad_user
SetEnvIfNoCase User-Agent "VelenPublicWebCrawler" bad_user
SetEnvIfNoCase User-Agent "Xenu Link Sleuth" bad_user
SetEnvIfNoCase User-Agent "ZoominfoBot" bad_user
SetEnvIfNoCase User-Agent "Nimbostratus-Bot" bad_user
SetEnvIfNoCase User-Agent "SEOkicks" bad_user
SetEnvIfNoCase User-Agent "Seekport Crawler" bad_user
SetEnvIfNoCase User-Agent "Alphabot" bad_user
SetEnvIfNoCase User-Agent "magpie-crawler" bad_user
SetEnvIfNoCase User-Agent "LinkpadBot" bad_user
SetEnvIfNoCase User-Agent "Linguee bot" bad_user
SetEnvIfNoCase User-Agent "Semtix.cz" bad_user
SetEnvIfNoCase User-Agent "Statusoid" bad_user
SetEnvIfNoCase User-Agent "BananaBot" bad_user
SetEnvIfNoCase User-Agent "CFNetwork" bad_user
SetEnvIfNoCase User-Agent "python-request" bad_user
SetEnvIfNoCase User-Agent "FirmoGraph" bad_user
SetEnvIfNoCase User-Agent "PetalBot" bad_user
SetEnvIfNoCase User-Agent "TombaPublicWebCrawler" bad_user
SetEnvIfNoCase User-Agent "barkrowler" bad_user
SetEnvIfNoCase User-Agent "serpstatbot" bad_user
SetEnvIfNoCase User-Agent "Archive Team" bad_user
SetEnvIfNoCase User-Agent "Sogou web spider" bad_user

Deny from env=bad_user
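SetEnvIfNoCase treats each quoted string as a case-insensitive regular expression matched against the request's User-Agent header. The sketch below models that matching logic in Python, which is handy for checking which visitors a pattern list would flag (the helper name and the shortened pattern list are illustrative, not part of the .htaccess syntax):

```python
import re

# A few patterns from the .htaccess list above (shortened for illustration).
BAD_BOT_PATTERNS = ["SemrushBot", "AhrefsBot", "MJ12bot", "HTTrack"]

def is_bad_bot(user_agent: str) -> bool:
    """Mimic SetEnvIfNoCase: case-insensitive regex search on the UA string."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in BAD_BOT_PATTERNS)

print(is_bad_bot("Mozilla/5.0 (compatible; SemrushBot/7~bl)"))  # True
print(is_bad_bot("mozilla/5.0 (compatible; semrushbot/7~bl)"))  # True: case-insensitive
print(is_bad_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```
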


Last updated: 01.01.2021