This may come as a surprise to you, but there are organizations, for example German municipalities, that have more than 500 users but can't afford to start pumping tens or hundreds of thousands per year into a file sharing service. Nextcloud themselves recognize this and offer 95%+ discounts to edu, similar to what Adobe, Microsoft, and Git[Hub,Lab] are doing.
Hope you don't mind if I point out a couple of small bugs in babble.c:
1. When read_word() reads the last word in a string, at line 146 it will read past the end (and into uninitialised memory, or the leftovers of previous longer strings), because you have already added 1 to len on line 140 to skip past the character that delimited the word. Undefined behaviour.
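The usual fix is to only step over the delimiter when you haven't already hit the terminating NUL. Something like this (a made-up sketch, not your actual read_word(); I'm using a space as the stand-in delimiter):

#include <stddef.h>

/* Scan the word starting at s[*pos]; advance *pos past the word, and past
   the delimiter only if one is actually there. */
static size_t read_word_sketch(const char *s, size_t *pos, const char **word)
{
    size_t len = 0;
    *word = s + *pos;
    while (s[*pos + len] != '\0' && s[*pos + len] != ' ')
        ++len;
    *pos += len;
    if (s[*pos] != '\0')   /* Last word: nothing to skip, don't step past the end */
        ++*pos;
    return len;
}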
2. grow_chain() doesn't assign to (*chain)->capacity, so it winds up calling realloc() every time, unnecessarily. This probably isn't a big deal in practice, since realloc() typically allocates in larger chunks and takes a fast no-op path when it determines it doesn't need to reallocate and copy.
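In other words, something like this (again a made-up sketch, not your actual struct or function):

#include <stdlib.h>

struct chain {
    void  *links;
    size_t count;
    size_t capacity;
};

/* Grow only when full, and record the new capacity so the next call
   doesn't realloc() again. */
static int grow_chain_sketch(struct chain *c, size_t link_size)
{
    if (c->count < c->capacity)
        return 0;                       /* Still room, nothing to do */
    size_t new_cap = c->capacity ? c->capacity * 2 : 16;
    void *p = realloc(c->links, new_cap * link_size);
    if (!p)
        return -1;
    c->links = p;
    c->capacity = new_cap;              /* The missing assignment */
    return 0;
}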
3. Not a bug, but your index precomputation on lines 184-200 could be much more efficient. Currently it takes O(n^2 * MAX_LEAF) time, but it could be improved to linear time if you (a) did most of this computation once in the original Python extractor and (b) stored things better. Specifically, you could store and work with just the numeric indices, "translating" them to strings only at the last possible moment, before writing the word out. Translating index i to word i can be done very efficiently with 2 data structures:
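A large buffer holding all the words' characters back to back, plus an array of offsets into it (I don't know your exact constants, so these names are made up):

char word_chars[MAX_WORDS * MAX_WORD_LEN]; // All words' characters, stored back to back
unsigned word_start_pos[MAX_WORDS + 1]; // Each element is an offset into word_chars

Word i is then just the bytes from word_start_pos[i] up to word_start_pos[i + 1].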
You can store the variable-length list of possible next words for each word in a similar way, with a large buffer of integers and an array of offsets into it:
unsigned next_words[MAX_WORDS * MAX_LEAF]; // Each element is a word index
unsigned next_words_start_pos[MAX_WORDS + 1]; // Each element is an offset into next_words
Now the indices of all words that could follow word i are enumerated by:
for (j = next_words_start_pos[i]; j < next_words_start_pos[i + 1]; ++j) {
    // Do something with next_words[j]
}
(Note that you don't actually store the "current word" in this data structure at all -- it's the index i into next_words_start_pos, which you already know!)
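Building those offsets is a single counting/prefix-sum pass, which is where the linear time comes from. Sketch, assuming the extractor hands you counts[i] (how many next-words word i has) and the (word, next_word) pairs as index arrays pair_word[]/pair_next[] -- all made-up names:

next_words_start_pos[0] = 0;
for (i = 0; i < num_words; ++i) {
    next_words_start_pos[i + 1] = next_words_start_pos[i] + counts[i];
    cursor[i] = next_words_start_pos[i]; // cursor[] is a scratch array of fill positions
}
for (k = 0; k < num_pairs; ++k)
    next_words[cursor[pair_word[k]]++] = pair_next[k];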
> This next step I described work even with the most skeptic human interrogator possible.
To be a valid test, it still has to be passed by ~every adult human. The harder you make the test (in any direction), the more it fails on this important axis.
> A number of interrogators could be used, and statistics compiled to show how often the right identification was given
Turing's view is that we need enough competent-interrogator passes to establish statistical certainty, not a pass from ~everyone. I tend to agree with him on this.
Please reread that section. You'll discover it has nothing to do with whether humans can pass the test.
If you can find a part of the paper in which Turing really does claim that it is unnecessary for most adult humans to be able to pass the test, by all means quote it. But this would be a surprising thing for him to claim, because it would undermine the entire foundation of his Imitation Game.
My original claim was that the Turing test needs to be passable by ~every adult human. You counterclaimed that Turing himself didn't think so, and provided that quote from the IG paper as evidence. But that quote is in a section about testing digital computers, not humans. Thus it is unconnected to your counterclaim.
I don't know how much simpler I can make it.
Find a quote that actually backs up your claim, or accept that you've learned something about the paper you told me to read.
Maybe you're joking, but assuming you're not: This problem doesn't solve itself at all. If bots get good enough to know what links have garbage behind them, they'll stop scraping those links, and go back to scraping your actual content. Which is the thing we don't want.
That's sort of the point: almost nobody runs a site as large as Reddit. The average website has a relatively small handful of pages. Even a very active blog has few enough pages that it could be fully scraped in a matter of minutes. Where scrapers get hung up is when they're processing links that add things like query parameters, or navigating through something like a git repository and clicking through every file in every commit. If a scraper has enough intelligence to look at what the link is, it _surely_ has enough intelligence to understand what it does and does not need to scrape.
That is literally what my post said, except the scraper has more leverage than is being admitted (it can learn which pages are real and “punish” the site by requesting them more).
My point isn't that I want that to happen (which is probably what the downvotes assume); my point is that this is not going to be the final stage of the war.
I don't follow that at all. The post of yours that I responded to suggested that the scrapers could "just add an LLM" to get around the protection offered by TFA; my post explained why that would probably be too costly to be effective. I didn't downvote your post, but mine has been upvoted a few times, suggesting that this is how most people have interpreted our two posts.
> it can learn which pages are real and “punish” the site by requesting them more
Scrapers have zero reason to waste their own resources doing this.
Yes, this would be fine if you have an SPA or are otherwise already committed to having client-side JS turned on. Probably rot13 "encryption" would be enough.
OTOH, I doubt most scrapers are trying to scrape this kind of content anyway, since in general (a) it's JSON, not the natural language they crave, and (b) to even discover those links, which are usually generated dynamically by client-side JS rather than appearing as plain <a>...</a> HTML links, they would probably need to run a full JS engine, which is considerably harder to get working and considerably more expensive per request.
One way to keep things mostly the same without having to store any of it yourself:
1. Use an RNG seeded from the request URL itself to generate each page. This is already enough for an unchanging static site of finite or infinite size.
2. With each word the generator outputs, generate a random number between, say, 0 and 1000. On day i, replace the about-to-be-output word with a link if this random number is between 0 and i. This way, roughly a further 0.1% of the words turn into links each day, with the rest of the text remaining stable over time.
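Rough sketch of both steps (the hash choice and the pick_*/emit_* hooks are made up; any seedable PRNG will do):

#include <stdint.h>

/* Hypothetical hooks into whatever generator you already have */
extern const char *pick_next_word(uint64_t *rng_state);
extern void emit_word(const char *w);
extern void emit_link(const char *w);

/* FNV-1a: deterministic seed derived from the request URL */
static uint64_t hash_url(const char *url)
{
    uint64_t h = 1469598103934665603ull;
    for (; *url; ++url)
        h = (h ^ (uint8_t)*url) * 1099511628211ull;
    return h;
}

/* xorshift64: tiny deterministic PRNG */
static uint64_t next_rand(uint64_t *s)
{
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
    return *s;
}

/* day = days since some fixed epoch. The same URL always yields the same
   text; on day i, roughly i/1000 of the words come out as links, and a word
   that became a link stays a link on every later day. */
void generate_page(const char *url, unsigned day, int num_words)
{
    uint64_t rng = hash_url(url);
    for (int i = 0; i < num_words; ++i) {
        const char *w = pick_next_word(&rng);
        if (next_rand(&rng) % 1000 < day)
            emit_link(w);
        else
            emit_word(w);
    }
}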
> A browser running openai operator or whatever its called would immediately figure it out though.
But running that costs money, which is a disincentive. (How strong of a disincentive depends on how much it costs vs. the estimated value of a scraped page, but I think it would 100x the per-page cost at least.)
My initial reaction was that running something like this is still a loss, because it probably costs you as much or more than it costs them in terms of both network bytes and CPU. But then I realised two things:
1. If they are using residential IPs, each byte of network bandwidth is probably costing them a lot more than it's costing you. Win.
2. More importantly, if this became a thing that a large fraction of all websites do, the economic incentive for AI scrapers would greatly shrink. (They don't care if 0.02% of their scraping is garbage; they care a lot if 80% is.) And the only move I think they would have in this arms race would be... to use an LLM to decide whether a page is garbage or not! And now the cost of scraping a page is really starting to increase for them, even if they only run a local LLM.
We should encourage number 2. So much of the content that the AI companies are scraping is already garbage, and that's a problem. E.g. LLMs are frequently confidently wrong, but so is Reddit, which produces a large volume of training data. We've seen a study suggesting that you can poison an LLM with very little data. Encouraging the AI companies to care about the quality of the data they are scraping could be beneficial to all.
The cost of being critical of source material might make some AI companies tank, but that seems inevitable.
> it probably costs you as much or more than it costs them in terms of both network bytes and CPU
Network bytes, perhaps (though text is small), but the article points out that each garbage page is served using only microseconds of CPU time, and a little over a megabyte of RAM.
The goal here isn't to get the bots to go away, it's to feed them garbage forever, in a way that's light on your resources. Certainly the bot, plus the offline process that trains on your garbage data, will be using more CPU (and I/O) time than you will to generate it.
Not to mention they have to store the data after they download it. In theory, storing garbage data is costly to them. However, I have a nagging feeling that the attitude of these scrapers is that they get paid the same amount per gigabyte whether it's nonsense or not.
If they even are AI crawlers. They could just as well be exploit scanners searching for endpoints they'd try to exploit. That wouldn't require storing the content, only the links.
If you look at which pages are hit, and how many pages are hit by any one address in a given period of time, it's pretty easy to identify features that are reliable proxies for e.g. exploit scanners, trawlers, and agents. I publish a feed of what's being hit on my servers; contact me for details (you need to be able to make DNS queries to a particular server, directed at a domain which is not reachable from ICANN's root).
How dare they. I just want to share photos and calendar with the 502 people in my immediate family.