I'd like to see this for Gemini Pro 1.5 -- I threw the entirety of Moby Dick at it last week, and at one point all the books Byung-Chul Han has ever published, and in both cases it was able to return the single part of a sentence that mentioned or answered my question verbatim, every single time, without any hallucinations.
A number of people in my lab do research into long context evaluation of LLMs for works of fiction. The likelihood is very high that Moby Dick is in the training data. Instead the people in my lab have explored recently published books to avoid these issues.
I’m not involved in the space, but it seems to me that having a model, in particular a massive model, exposed to a corpus of text like a book in the training data would have very minimal impact. I’m aware that people have been able to return data ‘out of the shadows’ of the training data, but to my mind a model being mildly influenced by the weights between different words in this text hardly constitutes hard recall; if anything, it now ‘knows’ a little of the linguistic style of the author.
It depends on how many times it had seen that text during training. For example, GPT-4 can reproduce ayats from the Quran word for word in both Arabic and English. It can also reproduce the Navy SEAL copypasta complete with all the typos.
But this content is presumably in its training set, no? I'd be interested if you did the same task for a collection of books published more recently than the model's last release.
To test this hypothesis, I just took the complete book "Advances in Green and Sustainable Nanomaterials" [0] and pasted it into the prompt, asking Gemini: "What absorbs thermal radiations and converts it into electrical signals?".
It replied: "The text indicates that graphene sheets present high optical transparency and are able to absorb thermal radiations with high efficacy. They can then convert these radiations into electrical signals efficiently.".
Ask it what material absorbs “infrared light” efficiently.
To me, that’s useful intelligence. I can already search text for verbatim matches, I want the AI to understand that “thermal radiations” and “infrared light” are the same thing.
> Answer the following question using verbatim quotes from the text above: "What material absorbs infrared light efficiently?"
> "Graphene is a promising material that could change the world, with unlimited potential for wide industrial applications in various fields... It is the thinnest known material with zero bandgaps and is incredibly strong, almost 200 times stronger than steel. Moreover, graphene is a good conductor of heat and electricity with very interesting light absorption properties."
Interestingly, the first sentence of the response actually occurs directly after the latter part of the response in the original text.
Edit: asking it "What absorbs infrared light and converts it into electrical signals?" yields "Graphene sheets are highly transparent presenting high optical transparency, which absorbs thermal radiations with high efficacy and converts it into electrical signals efficiently." verbatim.
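For anyone who wants to check "verbatim" claims like these mechanically rather than by eye, here is a minimal Python sketch. It assumes the book is available as plain text and that the model marks elisions with "..." (or "…"); the function names are made up for illustration. It also reports whether the quoted fragments appear in the same order as in the source, which is the kind of reordering noticed above.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs so line wrapping does not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def locate_fragments(quote: str, source: str) -> list[tuple[str, int]]:
    """Split a quote on ellipses and pair each fragment with its offset
    in the whitespace-normalized source (-1 if it does not occur verbatim)."""
    src = normalize(source)
    fragments = [normalize(part) for part in re.split(r"\.{3}|…", quote)]
    return [(frag, src.find(frag)) for frag in fragments if frag]


def is_verbatim(quote: str, source: str) -> bool:
    """True if the quote has at least one fragment and all fragments occur in the source."""
    found = locate_fragments(quote, source)
    return bool(found) and all(offset >= 0 for _, offset in found)


def in_source_order(quote: str, source: str) -> bool:
    """True if the quote is verbatim and its fragments appear in source order."""
    offsets = [offset for _, offset in locate_fragments(quote, source)]
    return is_verbatim(quote, source) and offsets == sorted(offsets)
```

Matching is exact after whitespace normalization, so a response that changes even one word counts as not verbatim. That is deliberately strict; a fuzzy matcher would answer a different question.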
Fair point, but I also think something that's /really/ clear is that LLMs don't understand (and probably cannot). It's doing highly contextual text retrieval based on natural language processing for the query, it's not understanding what the paper means and producing insights.
Gemini works with brand new books too; I've seen multiple demonstrations of it. I'll try hunting one down. Side note: this experiment is still insightful even using model training material. Just compare its performance with the uploaded book(s) to its performance without them.
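A rough Python sketch of that with/without comparison, in case it helps. Everything here is an assumption for illustration: `ask` is a hypothetical callable standing in for whatever model client you use (prompt in, answer text out), and "correct" is crudely defined as the expected string appearing in the answer.

```python
from typing import Callable


def closed_vs_open_book(
    ask: Callable[[str], str],
    book_text: str,
    qa_pairs: list[tuple[str, str]],
) -> dict[str, float]:
    """Fraction of questions answered correctly with and without the book in the prompt.

    `ask` is a hypothetical stand-in for a model call; it is not a real API.
    """
    if not qa_pairs:
        raise ValueError("qa_pairs must not be empty")

    def accuracy(with_book: bool) -> float:
        hits = 0
        for question, expected in qa_pairs:
            prompt = f"{book_text}\n\n{question}" if with_book else question
            # Crude scoring: case-insensitive substring match on the expected answer.
            if expected.casefold() in ask(prompt).casefold():
                hits += 1
        return hits / len(qa_pairs)

    return {"without_book": accuracy(False), "with_book": accuracy(True)}
```

If the "without_book" score is already high, the model probably knows the material from training, and the long-context result tells you less.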
I would hope that Byung-Chul Han would not be in the training set (at least not without his permission), given he's still alive and not only is the legal question still open but it's also definitely rude.
Part of that back-and-forth is the claim "this specific text was copied a lot all over the internet making it show up more in the output", and that means it's not a useful guide to things where one copy was added to The Pile and not removed when training the model.
(Or worse, that Google already had a copy because of Google Books and didn't think "might training on this explode in our face like that thing with the Street View WiFi scanning?")
Just put the 2500 example linked in the article through Gemini 1.5 Flash and it answered correctly ("The tree has diseased leaves and its bark is peeling.") https://aistudio.google.com/
Wow. Cool. I have access to that model and have also seen some impressive context extraction. It also gave a really good summary of a large code base that I dumped in. I saw somebody analyze a huge log file, but we really need something like this needle in a needlestack to help identify when models might be missing something. At the very least, this could give model developers something to analyze their proposed models.
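For reference, the basic shape of such a test is simple enough to sketch in Python. This is not the article's harness, just a minimal illustration under my own assumptions: the filler is a list of passages, one "needle" passage is inserted at a chosen relative depth, and an answer counts as correct if it contains the expected string. For a needlestack-style test, the filler passages would be the same kind of text as the needle.

```python
def build_prompt_body(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` into `filler` at a relative depth between 0.0 (start) and 1.0 (end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])


def contains_answer(response: str, expected: str) -> bool:
    """Crude scoring: case-insensitive substring match."""
    return expected.casefold() in response.casefold()
```

Sweeping `depth` and the number of filler passages, then plotting the hit rate, is what shows where in the context a model starts missing things.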
Funnily enough I ran a 980k token log dump against Gemini Pro 1.5 yesterday to investigate an error scenario and it found a single incident of a 429 error being returned by a third-party API provider while reasoning that "based on the file provided and the information that this log file is aggregated of all instances of the service in question, it seems unlikely that a rate limit would be triggered, and additional investigation may be appropriate", and it turned out the service had implemented a block against AWS IPs, breaking a system that loads press data from said API provider, leaving the customer who was affected by it without press data -- we didn't even notice or investigate that, and Gemini just randomly mentioned it without being prompted for that.
Man, we are like 2-5 years away from being able to feed in an ePub and get an accurate graphic novel version in minutes. I am so ready to look at four thousand paintings of Tolkien trees.
What version of Gemini is built into Google Workspace? (I just got the ability today to ask Gemini anything about emails in my work Gmail account, which seems like something that would require a large context window)
Maybe they needed a German company to receive money from the BND for their user data without the US knowing :-D
But in all seriousness, I’ve been a subscriber ever since they started and I’m an ultimate subscriber still, and I’d be sad if they went bankrupt due to mismanagement of the funds.
I was just curious and visited the LinkedIn profile that's linked to from the ctone.ws website (in KingOfCoders's profile) and was wondering why Wirecard was omitted.
All of the PWAs on my iPhone running 17.4 will now open in Safari instead of in fullscreen, and iOS itself warned me the first time I opened a PWA from the home screen after installing 17.4 that iOS will now open all „linked websites“ in the „configured default browser“.
They’re obviously trying to prevent companies from bypassing their extortion proposal in response to the DMA by simply offering a PWA to users that can work around the „core tech fee“.
MacPaw lists Russian-developed software as a risk because the government can access your data at any time — this is self-hosted open-source software though.
The FSB can’t just access your local server with an arbitrary court order.
Therefore this doesn’t feel like a legitimate concern but more like Russophobia, which I understand but also think is utterly uncalled for, as I know firsthand how much Russian developers are suffering from the stupidity of their government.