You have LinkedIn and Twitter examples, where you're very likely violating their TOS as they prohibit any scraping.
I also assume you don't check the robots.txt of websites?
I'm all for automating tedious work, but with all this (mostly AI-related) scraping, things are getting out of hand and creating a lot of headaches for developers maintaining heavily scraped sites.
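For what it's worth, honoring robots.txt takes only a few lines with Python's standard library. A minimal sketch (the policy and user agent string here are made up for illustration; a real client would fetch the site's actual robots.txt with `set_url()` and `read()`):

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt policy, parsed directly so the snippet needs no network
# access. A real scraper would instead do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check a URL against the policy before fetching it.
print(rp.can_fetch("MyScraper/1.0", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/page"))  # False
print(rp.crawl_delay("MyScraper/1.0"))                                    # 10
```

A polite scraper would also sleep for the advertised crawl delay between requests.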
Scraping is semi-controversial, but in this case it's just a user with a Chrome extension visiting the site. LinkedIn has lots and lots of shady patterns around showing different results to Google Bot vs. regular users to encourage logged-in sessions. Many other sites like Pinterest and Twitter/X employ similar annoying patterns.
Imo, users should be allowed to use automation tools to access websites and collect data. Most of these sites thrive off of user-generated content anyway; Reddit, for example, is built on UGC. Why shouldn't people be able to scrape it?
Say I built an extension that lets people scrape things on demand, and the extension also sends that data to my servers, removing PII in the process. Would that be allowed?
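The "removing PII" step is doing a lot of work in that question. Even a basic scrub is non-trivial; here's a deliberately naive sketch (the regexes are illustrative and nowhere near exhaustive — they catch emails and US-style phone numbers but miss names, addresses, handles, etc.):

```python
import re

# Naive, illustrative PII scrubbing: redact email addresses and
# US-style phone numbers before storing scraped text. Real PII removal
# is much harder, and these patterns are NOT exhaustive.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")

def scrub_pii(text: str) -> str:
    text = EMAIL_RE.sub("[email]", text)
    text = PHONE_RE.sub("[phone]", text)
    return text

print(scrub_pii("Contact jane.doe@example.com or 555-123-4567"))
# -> Contact [email] or [phone]
```

Whether that would satisfy a site's TOS (or GDPR-style regulators) is a separate question from whether it works technically.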
Technically it's acting on behalf of a proactive user in Chrome, so IMHO it's non-"robotic". But, to be fair, this was also Perplexity's excuse: they argued they are a legitimate non-robotic user agent (and thus don't need to respect robots.txt) because they only make requests at the time of a user query. We need a new way of understanding what it even means to be a legitimate human user agent. The presence of AIs as client-side catalysts will only grow.
The parent didn't say the scraping was "illegal", but that it violated ToS.
These are entirely different things. The upshot of the proceedings is that while the courts ruled there weren't sufficient grounds for an injunction to stop the scraping, the scraping was nonetheless injurious to the plaintiff and had breached their User Agreement, which allowed LinkedIn to compel hiQ toward a settlement.
From Wikipedia:
The 9th Circuit ruled that hiQ had the right to do web scraping.[1][2][3] However, the Supreme Court, based on its Van Buren v. United States decision,[4] vacated the decision and remanded the case for further review in June 2021. In a second ruling in April 2022 the Ninth Circuit affirmed its decision.[5][6] In November 2022 the U.S. District Court for the Northern District of California ruled that hiQ had breached LinkedIn's User Agreement and a settlement agreement was reached between the two parties.[7]
I see scraping as equivalent to a cherry-tree-shaking machine :-) If you're authorized to pick cherries from a tree, why not use a tree shaker and do the job in seconds? Just make sure you don't kill the tree in the process. Also, the tree owner must have the right to deny you use of the tree shaker.
related:
- "Dear AI Companies, instead of scraping OpenStreetMap, how about a $10k donation?" - https://news.ycombinator.com/item?id=41109926
- "Multiple AI companies bypassing web standard to scrape publisher sites" - https://news.ycombinator.com/item?id=40750182