Hacker News

Good rant!

The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not? Behavior? Volume? Unlikely coincidence?

> random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
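To see why this traffic pattern defeats per-IP analysis, here is a minimal sketch (log format and threshold are illustrative, not from the article): count requests per IP and flag heavy hitters. Distributed scrapers making one request per IP never cross any plausible threshold, which is exactly what a crowd of real one-time visitors looks like too.

```python
from collections import Counter

# Hypothetical access-log lines: "IP UA path" (format is illustrative only)
log_lines = [
    "203.0.113.7  Mozilla/5.0  /page-a",
    "198.51.100.2 Mozilla/5.0  /page-b",
    "192.0.2.55   Mozilla/5.0  /page-a",
]

# Count requests per IP. A distributed scraper shows up as a huge
# number of IPs each making ~1 request -- indistinguishable from
# ordinary end-user traffic by this measure.
hits = Counter(line.split()[0] for line in log_lines)
suspicious = [ip for ip, n in hits.items() if n > 10]  # threshold is arbitrary
print(suspicious)  # empty: no single IP exceeds the threshold
```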



There are commercial services that provide residential proxies, i.e. you get to tunnel your scraper or bot traffic through actual residential connections. (see: Bright Data, oxylabs, etc.)

They accomplish this by providing home users with some app that promises to pay them money for use of their connection. (see: HoneyGain, peer2profit, etc.)

Interestingly, the companies selling the tunnel service to companies and the ones paying home users to run an app are sometimes different, or at least they use different brands to cater to the two sides of the market. It also wouldn't surprise me if they sold capacity to each other.

I suspect some of these LLM companies (or the ones they outsource data capture to) route some of their traffic through these residential proxy services. It's funny because some of these companies already have a foothold inside homes (Google Nest and Amazon Alexa devices, etc.) but for a number of reasons (e.g. legal) they would probably rather go through a third party.


They could be local LLMs doing search, some SETI@home-style distributed work, or something else.

I host my blog at Mataroa, and the stats show 2K hits for some pages on some days. I don't think my small blog and writings are that popular. Asking ChatGPT about these subjects with my name results in some hilarious hallucinations, pointing to poor training due to lack of data.

IOW, my blog is also being scraped and used to train LLMs. Not nice. I really share the sentiment with Drew. I never asked for or consented to this.


> the stats show 2K hits for some pages on some days

This has been happening since long before LLMs. Fifteen years ago, my blog would see 3k visitors a day on server logs, but only 100 on Google Analytics. Bots were always scraping everything.


> I never asked/consented for this.

You put it on the information super highway baby


> You put it on the information super highway baby

...with proper licenses.

Here, FTFY, baby.

Addendum: Just because you don't feel like honoring them doesn't make said licenses moot and toothless. I mean, if you know, you know.


Between Microsoft and Google, my existence AND presence as a community open source developer is being scraped and stolen.

I've been trying to write a body of audio code that sounds better than the stuff we got used to in the DAW era, doing things like dithering the mantissa of floating-point words, just experimental stuff ignoring the rules. Never mind if it works: I can think it does, but my objection holds whether it does or not.
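To make the "dithering the mantissa" idea concrete, here is a minimal illustrative sketch (my own reconstruction of the general technique, not the commenter's actual code): add TPDF noise scaled to the float32 ulp before truncating a double down to single precision, so the quantization error decorrelates from the signal.

```python
import random
import numpy as np

def dither_to_float32(x: float) -> np.float32:
    """Truncate a double to float32, first adding triangular-PDF noise
    scaled to the float32 spacing (ulp) near x.

    Illustrative only -- not the commenter's actual algorithm."""
    ulp = float(np.spacing(np.float32(x)))             # float32 spacing near x
    noise = (random.random() - random.random()) * ulp  # TPDF, +/- 1 ulp
    return np.float32(x + noise)

sample = 0.123456789012345
print(dither_to_float32(sample))
```

The same idea is used when reducing to 16- or 24-bit integer audio; applying it at the float mantissa level is the unconventional part.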

Firstly, if you rip my stuff halfway it's pointless: without the coordinated intention towards specific goals not corresponding with normally practiced DSP, it's useless. LLMs are not going to 'get' the intention behind what I'm doing while also blending it with the very same code I'm a reaction against, the code that greatly outnumbers my own contributions. So even if you ask it to rip me off it tries to produce a synthesis with what I'm actively avoiding, resulting in a fantasy or parody of what I'm trying to make.

Secondly, suppose it became possible to make it hallucinate IN the relevant style, perhaps by training exclusively on my output, so it can spin off variations. That's not so far-fetched: _I_ do that. But where'd the style come from, that you'd spend effort tuning the LLM to? Does it initiate this on its own? Would you let it 'hallucinate' in that direction in the belief that maybe it was on to something? No, it doesn't look like that's a thing.

When I've played with LLMs (I have a Mac Studio set up with enough RAM to do that) I've been trying to explore what the thing might do outside of expectation, and it's hard to get anything interesting that doesn't turn out to be a rip from something I didn't know about, but it was familiar with. Not great to go 'oh hey I made it innovate!' when you're mistakenly ripping off an unknown human's efforts. I've tried to explore what you might call 'native hallucination', stuff more inherent to collective humanity than to an individual, and I'm not seeing much facility with that.

Not that people are even looking for that!

And lastly, as a human trying to explore an unusual position in audio DSP code with many years of practice attempting these things and sharing them with the world around me only to have Microsoft try to reduce me to a nutrient slurry that would add a piquant flavor to 'writing code for people', I turn around and find Google, through YouTube, repeatedly offering to speak FOR me in response to my YouTube commenters. I'm sure other people have seen this: probably depends on how interactive you are with your community. YouTube clearly trains a custom LLM on my comment responses to my viewers, that being text they have access to (doubtless adding my very verbose video footnotes), to the point that they're regularly offering to BE ME and save me the trouble.

Including technical explanations and helpful suggestions of how to use my stuff that are not infrequently lies and bizarro-world interpretations of what's going on, plus encouraging or self-congratulatory remarks that seem partly drawn from known best practices for being an empty hype beast competing to win the algorithm.

I'm not sure whether I prefer this, or the supposed promise of the machines.

If it can't be any better than this, I can keep working as I am, have my intentionality and a recognizable consistent sound and style, and be full of sass and contempt for the machines, and that'll remain impossible for that world to match (whether they want to is another question… but purely in marketing terms, yes they'll want to because it'll be a distinct area to conquer once the normal stuff is all a gray paste)

If it follows the path of the YouTube suggestions, there will simply be more noise out there, driven by people trying to piggyback off the mindshare of an isolated human doing a recognizable and distinct thing for most of his finite lifetime, with greater and greater volume of hollow mimicry of that person INCLUDING mimicry of his communications and interpersonal tone, the better to shunt attention and literal money to, not the LLMs doing the mimicking, but a third party working essentially in marketing, trying to split off a market segment they've identified as not only relevant, but ripe for plucking because the audience self-identifies as eager to consume the output of something that's not usual and normal.

(I should learn French: that rant is structurally identical to an endlessly digressive French expostulation)

Today I'm doing a livestream, coding with a small audience as I try for the fourth straight day to do a particular sort of DSP (decrackling) that's previously best served by some very expensive proprietary software costing over two thousand dollars for a license. Ideally I can get some of the results while also being able to honor my intentions for preserving the aspects of the audio I value (which I think can be compromised by such invasive DSP). That's because my intention will include this preservation, these iconoclastic details I think important, the trade-offs I think are right.

Meanwhile crap is trained on my work so that a guy who wants money can harness rainforests worth of wasted electrical energy to make programs that don't even work, and a pretend scientist guru persona who can't talk coherently but can and will tell you that he is "a real audio hero who's worked for many years to give you amazing free plugins that really defy all the horrible rules that are ruining music"!

Because this stuff can't pay attention, but it can draw all the wrong conclusions from your tone.

And if you question your own work and learn and grow from your missteps to have greater confidence in your learned synthesis of knowledge, it can't do that either but it can simultaneously bluster with your confidence and also add 'but who knows maybe I'm totally wrong lol!'

And both are forms of lies, as it has neither confidence nor self-doubt.

I'm going on for longer than the original article. Sorry.


This sounds pretty interesting. Can you share a link to your work or livestream?


I run a small browser game—roughly 150 unique weekly active users.

Our Wiki periodically gets absolutely hammered by LLM scraper bots, rotating IP addresses like mad to avoid mitigations like fail2ban (which I do have in place). And even when they're not hitting it hard enough to crash the game (through the external data feeds many of the wiki pages rely on), they're still scraping pretty steadily.
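One common countermeasure against this kind of IP rotation (a sketch of the general idea, not a claim about this commenter's setup) is to aggregate offenses per subnet rather than per address, so a scraper cycling through one provider's /24 still accumulates toward a ban threshold:

```python
import ipaddress
from collections import Counter

# Hypothetical offender IPs, three of them rotating within the same /24
ips = ["203.0.113.4", "203.0.113.19", "203.0.113.200", "198.51.100.7"]

# Count offenses per /24 network instead of per IP, so address rotation
# inside one subnet still trips the threshold.
nets = Counter(ipaddress.ip_network(f"{ip}/24", strict=False) for ip in ips)
banned = [str(net) for net, n in nets.items() if n >= 3]
print(banned)  # ['203.0.113.0/24']
```

This helps against datacenter ranges but much less against genuinely residential proxies scattered across unrelated subnets, which is the article's point.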

There is no earthly way that my actual users are able to sustain ~400kbps outbound traffic round the clock.
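For scale, that sustained rate works out to a nontrivial daily volume (simple arithmetic on the figure above):

```python
# 400 kbps sustained, round the clock: daily volume
kbps = 400
bytes_per_day = kbps * 1000 / 8 * 86_400   # 50 kB/s over 86,400 seconds
gb_per_day = bytes_per_day / 1e9
print(round(gb_per_day, 2))  # 4.32 GB/day, roughly 130 GB/month
```

Hard to attribute that to 150 weekly actives browsing a wiki.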


> The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not?

I think you cannot distinguish them. But the issue is so large that Google now serves captchas on legitimate traffic, sometimes after a single search if you narrow the time window to less than 24 hours.

I wonder when real Internet companies will feel the hurt, simply because consumers will stop using the ruined Internet.




