
You might be using JSON mode, which doesn't guarantee a schema will be followed, or structured outputs not in strict mode. It is possible to get the property that the response is either a valid instance of the schema or an error (e.g. for a refusal).


How do you activate strict mode when using pydantic schemas? It doesn't look like that is a valid parameter to me.

No, I don't get refusals; I see literally invalid JSON, like: `{"field": ["value...}`
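For reference, a minimal sketch of strict structured outputs from a Pydantic model, assuming the OpenAI Python SDK's parse helper; the model name and the Answer schema are placeholders, and other providers expose strict mode differently:

    # Minimal sketch, assuming the OpenAI Python SDK's structured-outputs helper.
    # The model name and Answer schema are placeholders.
    from pydantic import BaseModel
    from openai import OpenAI

    class Answer(BaseModel):
        field: list[str]

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": "Fill in the schema."}],
        response_format=Answer,  # the SDK converts the Pydantic model into a strict JSON schema
    )
    message = completion.choices[0].message
    if message.refusal:
        print("refused:", message.refusal)   # refusals are surfaced separately
    else:
        print(message.parsed)                # a validated Answer instance, not raw text

With strict mode on, the response should either validate against the schema or come back as an explicit refusal, rather than as truncated JSON.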


Is it luck, or is scaling arithmetic genuinely a useful capability to offer the world?


I assume you know how to program in Python? I would start with just the client libraries of the model providers you want to use. LLMs are conceptually simple when treated as black boxes. String in, string out. You don't necessarily need a framework.
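A minimal sketch of that string-in, string-out pattern, assuming the OpenAI Python SDK (any provider's client looks roughly the same; the model name is a placeholder):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def complete(prompt: str) -> str:
        """String in, string out: that's the whole 'framework'."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    print(complete("Explain attention in one sentence."))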


I’m not as familiar with the Baby Bells, so this is a surprising comparison to me. Bell Labs was famously so productive while it had the monopoly money hose, and not as much came from it after Bell was broken up.

What are the most noteworthy accomplishments of the Baby Bells?


While I lament the decline of Bell Labs and unfettered research in general (unfettered research labs need protection from market forces that only a monopoly, government, academia, or the very wealthy can provide), I also believe that the breakup of the Bell System was overall a good thing for society. For example, there was a time when AT&T customers had to rent their phones; they couldn’t own them (https://memorial.bellsystem.com/bell_system_property.html). Customers were finally allowed to purchase their own phones once the divestiture was underway. In addition, I’m not sure if we’d have a competitive cell phone market in America today had the Bell System remained in place, not to mention how I haven’t heard anything about long-distance calling charges in about 15 years due to how many modern cell phone plans work.


Ironically, we're nearly back at the "renting" phone stage. Sure the companies selling the phones don't use that terminology, and it's a one-time payment for the life of the device, but full control of the device is never transferred to the user. The company holds the keys and will only allow you to do what they want you to do. This certainly describes iPhones and most Android phones to date, and it's getting worse on the Android side as root becomes harder and harder.


I just don't see any positives here, though. Apple will be given 100% free rein to take complete monopolistic control of the smartphone market without Google.


Calling people became cheap. Think about making a cross-country phone call in the pre-breakup AT&T era. It was something like 25¢/min. Now I pay $35/month and can literally call most countries for up to 500 minutes before I get metered (Visible+).


Cross country? You mean a 15-minute drive away. In many places, local calling covered only that town and maybe another town under 5 miles away.


Really limited the range of those free BBS calls


I'll just say one thing: BlueBeep. If you know what that is, nothing else needs to be said. :)


It's hard to compare without a control, but a few key points:
* None of the Baby Bells failed.
* Because they segmented regionally, integration was super important, and you could argue it paved the way for the modern internet.
* Consumer services under Bell were incredibly expensive and tightly controlled.

In hindsight, maybe they should have been split up horizontally, nationalizing the natural-monopoly components/infrastructure (e.g., the physical lines)? It's interesting to see what looks like a reconsolidation of wireless now; I wonder what the future will look like.


My opinion is that Bell Labs created great technology, but had no real incentive to make products and bring them to the public. The Baby Bells needed to compete however, and so they did.


I think even with the firehose of monopoly money, Bell Labs would have eventually succumbed to cuts and general enshittification as the CEOs and shareholders wanted ever-increasing pay and dividends. "Do more with less, guys! You're smart, you can figure it out! The Board really needs this 10,000% pay raise; they have families, you know."


Sounds more like Generation Augmented Retrieval in that case.


It wasn't this GAR post; I remember them calling out legal docs explicitly. I might have seen it on Twitter.

https://blog.luk.sh/rag-vs-gar


Do you happen to have any good references for GAR implementation?


The size of the cached internal state of the network processing the book is much larger than the size of the book. The resource that is preserved with caching is the compute required to recreate that state.


Sure, but a direct forward pass over the book would surely require more compute than simply loading and setting the hidden state?

The second doesn't require any matrix operations; it's just setting some values.


> it's just setting some values

But it may very well be slower than just recomputing it, at least for ordinary MHA and even GQA.

So you need either model-architecture voodoo that significantly reduces KV-cache size (while keeping roughly the same compute cost), or a really careful implementation that moves the KV cache of upcoming requests to devices in the background [0].

[0] My back-of-envelope calc shows that even then it still doesn't make sense for, say, Llama 3 70B on H100s. Time to stare at the TPU spec harder and try to make sense of it, I guess.


It depends on how large the input prompt (previous context) is. Also, if you can keep the cache on the GPU with an LRU mechanism, it's very efficient for certain workloads.

You can also design an API optimized for batch workloads (say, the same core prompt with different data for instruct-style reasoning); that can result in large savings in those scenarios.


If you can pipeline upcoming requests and tie state to a specific request, doesn't that allow you to change how you design physical memory? (at least for inference)

Stupid question, but why wouldn't {extremely large slow-write, fast-read memory} + {smaller, very fast-write memory} be a feasible hardware architecture?

If you know many, many cycles ahead what you'll need to have loaded at a specific time.

Or hell, maybe it's time to go back to memory bank switching.


The throughput of the PCIe link between the CPU and GPU is far less than the aggregate throughput of the internal interconnects between neighbouring tensor cores.

Matrix operations might flow a lot of data around — but that data flow is akin to a bunch of individual people travelling along the individual residential streets they live on. There's a lot of movement there, but also a lot of capacity for movement, because there's no bottleneck of everyone needing to go to the same place or come from the same place.

Persisting the data out of the GPU and then loading it back in is more like all those people commuting to work and then going back home. Big fan-in onto the PCIe "highway" over to the CPU and into RAM; then big fan-out back. Traffic jams for miles.

In the time it takes to restore a 1GB state snapshot from RAM into VRAM, you can probably chew through the equivalent of 1TB or more of intermediate matrix states.
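To put rough numbers on that gap (illustrative figures only; actual PCIe and HBM bandwidths vary by generation, and on-chip SRAM/register traffic is higher again):

    # Illustrative, order-of-magnitude figures, not measurements.
    pcie_gb_per_s = 32      # roughly PCIe 4.0 x16 between host RAM and the GPU
    hbm_gb_per_s = 3000     # roughly HBM bandwidth on a recent accelerator

    snapshot_gb = 1.0
    restore_s = snapshot_gb / pcie_gb_per_s   # ~31 ms just to push 1 GB over PCIe
    onchip_gb = restore_s * hbm_gb_per_s      # ~94 GB of HBM traffic possible in that window
    print(f"restore: {restore_s * 1e3:.0f} ms; ~{onchip_gb:.0f} GB of on-device traffic meanwhile")

Counting data reused inside SRAM and registers, the effective on-chip movement is larger still, which is where the "1TB of intermediate states" intuition comes from.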


I don’t know of any public details on how they implement Context Caching, but that is presumably exactly what they are doing. Just caching the text would be a minimal savings.


"some" is doing a lot of lifting. # of tokens * # of layers * head dimension * # of heads * 2 (K+V vectors) * 4-16bits (depending on quantization)


But isn't the information somehow cached when you start a new chat and build context with, say, GPT-4? If the cache were as large as you say, so many chat sessions in parallel would not be possible.


That's not my understanding. We can't be sure how OpenAI does things internally, but adding messages to a conversation in the API just means rerunning the full history through the prompt every time.


>The size of the cached internal state of the network processing the book is much larger than the size of the book

It's funny that people sometimes consider LLMs to be compression engines, even though a lot of information gets lost in each direction (through the neural net).


Why is that funny? Sometimes compression is lossy, like JPEG and H.265


And the internal state of a JPEG decoder can be an order of magnitude larger than the JPEG file (especially progressive JPEG that can't stream its output).


I don't lose anything with gzip or rar.


You can make any lossy compression scheme into a lossless scheme by appending the diff between the original and the compressed. In many cases, this still results in a size savings over the original.

You can think of this as a more detailed form of "I before E, except after C, except for species and science and..." Or, if you prefer, as continued terms of a Taylor-series expansion. The more terms you add, the more closely you approximate the original.
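A toy sketch of the lossy-plus-residual idea, using bitwise truncation as a stand-in "lossy codec" (the names and the truncation step are made up for illustration):

    import zlib

    def lossy_encode(data: bytes) -> bytes:
        # Toy "lossy" step: throw away the low 4 bits of every byte.
        return bytes(b & 0xF0 for b in data)

    def encode(data: bytes) -> tuple[bytes, bytes]:
        approx = lossy_encode(data)
        residual = bytes(a ^ b for a, b in zip(data, approx))  # what the lossy step discarded
        return approx, zlib.compress(residual)                 # low-entropy residual compresses well

    def decode(approx: bytes, packed_residual: bytes) -> bytes:
        residual = zlib.decompress(packed_residual)
        return bytes(a ^ r for a, r in zip(approx, residual))  # lossy output + diff = exact original

    original = b"some repetitive data " * 100
    approx, packed = encode(original)
    assert decode(approx, packed) == original

Whether the combined size beats a plain lossless compressor depends entirely on how well the lossy stage models the data.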


And just as fast? The issue here is how you do these things both accurately and at reasonable speed.


I’m guessing there was an instrumental reason for this, for instance to check that the model was listening before launching into what they wanted to demo


Yeah, if they had to ask the question twice because it wasn't listening, on social media and in the press it would morph into "how it couldn't understand".


There are many different rules of thumb on this floating around. 1 g per lb of body weight is indeed something a number of well-informed people recommend, but those people generally acknowledge that it's something like an upper bound on the amount that is useful.


Cardio isn’t super important for what? It certainly has longevity benefits over and above those from resistance training.


Sorry, I meant cardio isn't super important for losing weight if you are already resistance training.


If losing weight is the only goal, then even resistance training isn't important.

You can just reduce portion sizes and caloric intake until you reach your goal weight.

You train strength and endurance for health and body composition, and performance if you care about that.


Attention itself was the key idea of that paper and, as you sort of acknowledge, was definitely not just throwing things at the wall. It was the culmination of a long line of work gradually progressing toward fully dynamic routing via attention, and it was motivated, if not by deep theory, at least by deep intuition from linguistics. The other details of transformers are perhaps somewhat arbitrary, but they made sense to everyone at the time. There was no claim that those other details were optimal, just that they were one way of surrounding the attention mechanism with computing machinery that worked.

