> If it's requesting something which is allowed in robots.txt, it's a legitimate request.
An abusive scraper is pushing over your boxes. It is intentionally circumventing rate limits and, more generally, defeating any accurate attribution of the traffic source. In this example you have deemed that behavior abusive and want to put a stop to it.
Any given request looks pretty much normal. The vast majority are coming from residential IPs (in this example your site serves mostly residential customers to begin with).
So what if 0.001% of requests hit a disallowed resource and you ban those IPs? You've cut roughly 0.001% of the traffic you're currently experiencing. That does nothing to solve the actual problem: the excessive traffic that disrespects rate limits and gums up your service for other well-behaved users.
Why would it be only 0.001% of requests? You can fill your actual pages with links to pages disallowed in robots.txt, hidden from a human user but visible to a bot scraping the markup. Adversarial bots ignoring robots.txt will follow those links everywhere. It could just as easily be 50% of requests, and every time it happens the scraper burns that IP address.
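A minimal sketch of that honeypot approach, assuming a Flask app with an in-memory ban set (the route names, the hidden-link markup, and the ban store are all illustrative, not anyone's production setup; real bans would live at the edge in something like iptables or fail2ban):

```python
# Hypothetical honeypot sketch: serve a robots.txt that disallows /trap,
# hide a /trap link in every page, and ban any IP that requests /trap
# anyway. Only a bot ignoring robots.txt ever ends up there.
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()  # illustrative only; use a real ban store in practice

@app.before_request
def reject_banned():
    # Drop all further requests from IPs that have tripped the trap.
    if request.remote_addr in banned_ips:
        abort(403)

@app.route("/robots.txt")
def robots():
    # Well-behaved crawlers read this and never touch /trap.
    return ("User-agent: *\nDisallow: /trap\n", 200,
            {"Content-Type": "text/plain"})

@app.route("/")
def index():
    # The trap link is invisible to humans but present in the markup,
    # so a scraper following every href will eventually request it.
    return (
        "<html><body><h1>Welcome</h1>"
        '<a href="/trap" style="display:none" aria-hidden="true">x</a>'
        "</body></html>"
    )

@app.route("/trap")
def trap():
    # Reaching this handler means robots.txt was ignored: ban the IP.
    banned_ips.add(request.remote_addr)
    abort(403)
```

The more trap links you scatter through real pages, the larger the fraction of an adversarial crawl that lands on them, and each hit costs the scraper one of its residential IPs.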