Since I posted an article here about using zip bombs [0], I've been flooded with bots. I'm constantly monitoring and tweaking my abuse detector, but this particular bot mentioned in the article seemed to be pointing to an RSS reader, so I whitelisted it at first. Now that I've given it a second look, it's one of the most rampant bots on my blog.
If I had a shady web crawling bot and I implemented a feature for it to avoid zip bombs, I would probably also test it by aggressively crawling a site that is known to protect itself with hand-made zip bombs.
One of the few manual deny-list entries that I have made was not for a Chinese company, but for the ASes of the U.S.A. subsidiary of a Chinese company. It just kept coming back again and again, quite rapidly, for one particular page that returned a 404. Not for any other related pages, mind. Not for the favicon, robots.txt, or even the enclosing pseudo-directory. Just that one page. Over and over.
The directory structure had changed, and the page now sits one level lower in the tree; it has long since been correctly hyperlinked, listed in various sitemaps, and discovered by genuine HTTP clients.
The URL? According to Google, it now exists in only one place on the WWW: it was posted to Hacker News back in 2017.
(My educated guess is that I am suffering from the page-preloading fallout from repeated robotic scraping of old Hacker News stuff by said U.S.A. subsidiary.)
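For concreteness, here is a minimal sketch of what an AS-level deny-list check like the one described above can look like, assuming the AS's announced prefixes have already been looked up out of band. The CIDR ranges below are placeholder TEST-NET blocks, not the actual networks involved:

    # AS-level deny-list sketch: the prefixes are placeholders, not the
    # real networks involved; a real list would be built from the AS's
    # announced routes.
    import ipaddress

    DENIED_PREFIXES = [
        ipaddress.ip_network("192.0.2.0/24"),     # placeholder (TEST-NET-1)
        ipaddress.ip_network("198.51.100.0/24"),  # placeholder (TEST-NET-2)
    ]

    def is_denied(client_ip: str) -> bool:
        # True if the client address falls inside any deny-listed prefix.
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in DENIED_PREFIXES)

    # Called early in request handling, before serving anything.
    assert is_denied("192.0.2.45")
    assert not is_denied("203.0.113.7")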
[0]: https://news.ycombinator.com/item?id=43826798