## Some of this is cribbed from https://www.dcc-servers.net/robots.txt

### Geographically Meh ### {{{
## No need for Chinese searches
User-agent: Baiduspider
Disallow: /

## Czech Republic
User-agent: SeznamBot
Disallow: /

## No need for Russian searches, and it fetches but ignores robots.txt
User-agent: Yandex
Disallow: /
### End Geographically Meh ### }}}

### SEO Dung Bots ### {{{
## "The World's Experts in Search Analytics" is yet another SEO outfit
## that hammers HTTP servers without permission and without benefit to
## at least some HTTP server operators.
User-agent: Searchmetrics
Disallow: /

## Claimed SEO; ignores robots.txt
User-agent: lipperhey
Disallow: /

## Claimed SEO
User-agent: dataprovider.com
Disallow: /

## SEO
## http://www.semrush.com/bot.html suggests its results are for users:
## "Well, the real question is why do you not want the bot visiting
## your page? Most bots are both harmless and quite beneficial. Bots
## like Googlebot discover sites by following links from page to page.
## This bot is crawling your page to help parse the content, so that
## the relevant information contained within your site is easily indexed
## and made more readily available to users searching for the content
## you provide."
User-agent: SemrushBot
Disallow: /

## SEO bs
User-agent: spbot
Disallow: /

## SEO bs
## It wasn't respecting the lowercase 'dotbot' entry below, so it is
## also listed here with its exact capitalization.
User-agent: DotBot
Disallow: /
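## (Per RFC 9309, user-agent matching is case-insensitive, so a
## conforming crawler would honor either spelling; the duplicate entry
## is belt and braces for crawlers that do exact matching.)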
### End SEO Dung Bots ### }}}

### Poorly Implemented Crap Bots ### {{{
## Stupid bot
User-agent: purebot
Disallow: /

## Seems to search only for non-existent pages.
## See ezooms.bot@gmail.com and wowrack.com
User-agent: Ezooms
Disallow: /

## http://www.majestic12.co.uk/bot.php follows many bogus and corrupt links
## and so generates a lot of error log noise.
## It does us no good and is a waste of our bandwidth.
User-agent: MJ12bot
Disallow: /

## There is no need to waste bandwidth on an outfit trying to monetize
## our web pages; $50 for data scraped from the web is too much.
## It never bothers fetching robots.txt.
## See http://www.domaintools.com
User-agent: SurveyBot
Disallow: /
User-agent: DomainTools
Disallow: /

## Too many mangled links and an implausible home page
User-agent: sitebot
Disallow: /

## At best another broken spider that thinks all URLs are at the top level.
## At worst, a malware scanner.
## Never fetches robots.txt, contrary to http://www.warebay.com/bot.html.
## See SolomonoBot/1.02 (http://www.solomono.ru)
User-agent: SolomonoBot
Disallow: /

## Yet another claimed search engine that generates bad links from plain text.
## It fetches and then ignores robots.txt.
## 188.138.48.235 http://www.warebay.com/bot.html
User-agent: WBSearchBot
Disallow: /

## Ignores robots.txt
User-agent: Sosospider
Disallow: /

## Does not handle protocol-relative links. It does not fetch robots.txt.
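## (A protocol-relative link looks like <a href="//example.com/page">;
## the scheme is inherited from the referring page.)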
User-agent: 360Spider
Disallow: /

## Does not handle protocol-relative links.
User-agent: 80legs
Disallow: /

## Does not know the difference between a hyperlink <A HREF="..."></A> and
## anchors that are not links such as <A NAME="..."></A>
User-agent: YamanaLab-Robot
Disallow: /

## Ignores rel="nofollow" in links.
## Parses ...href='asdf' onclick='... (single quotes instead of double)
## as if " onclick=..." were part of the URL.
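## (Illustration, with a made-up link: from <a href='x.html' onclick='go()'>
## it might request x.html' onclick='go() as the URL.)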
## It fetches robots.txt and then ignores it.
User-agent: Aboundex
Disallow: /
User-agent: Aboundexbot
Disallow: /

## Fetches robots.txt for only some domains.
## It searches for non-existent but often abused URLs such as .../contact.cgi
User-agent: yunyun
Disallow: /

## Multiple long crawls a day... and .ru
User-agent: MegaIndex.ru
Disallow: /
### End Poorly Implemented Crap Bots ### }}}

### Waste of Bandwidth ### {{{
## Monetizers of other people's bandwidth.
User-agent: Exabot
Disallow: /

## Monetizers of other people's bandwidth.
User-agent: findlinks
Disallow: /

## Monetizers of other people's bandwidth.
User-agent: aiHitBot
Disallow: /

## Monetizer of other people's bandwidth. It ignores robots.txt.
User-agent: AhrefsBot
Disallow: /

## Yet another monetizer of other people's bandwidth that hits selected
## pages every few seconds from about a dozen HTTP clients around the
## world without let, leave, hindrance, or notice.
## There is no apparent way to ask them to stop. One DinoPing agent at
## support@edis.at responded to a request to stop with "just use iptables"
## on 2012/08/13.
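## (If one does resort to iptables, a minimal sketch, with a made-up
## example address range since their real sources vary:
##   iptables -A INPUT -s 192.0.2.0/24 -j DROP
## )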
## They're blind to the irony that one of their targets is
## <A HREF="that-which-we-dont.html">http://www.rhyolite.com/anti-spam/that-which-we-dont.html</A>
User-agent: DinoPing
Disallow: /

## Waste of bandwidth
User-agent: masscan
Disallow: /

## Waste of bandwidth
User-agent: escan
Disallow: /

## No apparent reason to spend bandwidth or attention on its bad URLs in logs
User-agent: discoverybot
Disallow: /

## Unasked-for tracking. Monetizing.
User-agent: Uptimebot
Disallow: /
### End Waste of Bandwidth ### }}}

### Get Off My Lawn ### {{{
## Cutesy story is years stale and no longer excuses bad crawling
User-agent: dotnetdotcom
Disallow: /

## Cutesy story is years stale and no longer excuses bad crawling
User-agent: dotbot
Disallow: /

## Unprovoked, unasked-for "monitoring" and "checking"
User-agent: panopta.com
Disallow: /

## No "biomedical, biochemical, drug, health and disease related data" here.
## 192.31.21.179 switched from www.integromedb.org/Crawler to "Java/1.6.0_20"
## and "-" after integromedb was added to robots.txt
User-agent: www.integromedb.org/Crawler
Disallow: /

## Ambulance chasers with a stupid spider that hits the bad-bot trap.
User-agent: ip-web-crawler.com
Disallow: /

## Little public information
User-agent: Findxbot
Disallow: /

## Don't know why it crawled me
User-agent: ips-agent
Disallow: /

## Don't know why it crawled me
User-agent: Go-http-client
Disallow: /
### End Get Off My Lawn ### }}}

### Plain Attacks ### {{{
## Evil
User-agent: ZmEu
Disallow: /

## Evil
User-agent: Morfeus
Disallow: /

## Evil
User-agent: Snoopy
Disallow: /
### End Plain Attacks ### }}}

User-agent: bot-pge.chlooe.com
Disallow: /

## Firewall anything that goes to the trap
User-agent: *
Allow: /
Disallow: /badbottrap
Disallow: /.well-known
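## (Under longest-match precedence, the more specific Disallow lines above
## win over "Allow: /" for /badbottrap and /.well-known.)
## How the trap works, as a sketch (the wiring lives outside this file):
## pages carry a link to /badbottrap that humans never see, for example
##   <a href="/badbottrap/"><img src="/1x1.png" alt=""></a>
## Polite crawlers skip it because of the Disallow above; anything that
## fetches it has ignored robots.txt, and a log watcher (fail2ban or
## similar) can then firewall the client address.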