IP law, especially defence against submarine patents, makes codec development expensive.
In the early days of MPEG, codec development was difficult, because most computers weren't capable of encoding video, and the field was in its infancy.
However, by the end of the '00s computers were fast enough for anybody to do video encoding R&D, and there was a ton of research to build upon. At that point MPEG's role changed from being a pioneer in the field to being an incumbent with a patent minefield, stopping others from moving the field forward.
That's unnecessarily harsh. Patent pools exist to promote collaboration in a world with aggressive IP legislation, they are an answer to a specific environment and they incentivize participants to share their IP at a reasonable price to third parties. The incentive is that if you don't share, you will be left out of the pool: the other members will work around your patents while not licensing their own patents to you, so your own IP becomes worthless since you can't work around theirs.
As long as IP law continues in the same form, the alternative to that is completely closed agreements among major companies that will push their own proprietary formats and aggressively enforce their patents.
The fair world where everyone is free to create a new thing, improve upon the frontier codecs, and get a fair reward for their efforts is simply a fantasy without patent law reform. In the current geopolitical climate, it's very, very unlikely for the nations where these developments traditionally happened, such as the US and western Europe, to weaken their IP laws.
>> That's unnecessarily harsh. Patent pools exist to promote collaboration in a world with aggressive IP legislation, they are an answer to a specific environment and they incentivize participants to share their IP at a reasonable price to third parties.
You can say that, but this discussion is in response to the guy who started MPEG and later shut it down. I don't think he'd say it's harsh.
They actually messed up the basic concept of a patent pool, and that is the key to their failure.
They didn't get people to agree on terms up front. They made the final codec with interlocking patents from hundreds of parties embedded in it, made no attempt to avoid random outsiders' patents, and then, once it was done, tried to come to a licence agreement when every minor patent holder had an effective veto on the resulting pool. That's how you end up with multiple pools plus people who own patents and aren't members of any of the pools. It's ridiculous.
My minor conspiracy theory is that if you did it right, then you'd basically end up with something close to open source codecs as that's the best overall outcome.
Everyone benefits from only putting in freely available ideas. So if you want to gouge people with your patents you need to mess this up and "accidentally" create a patent mess.
IP law and the need for extremely smart people with a rare set of narrow skills. It's not like codec development magically happens for free if you ignore patents.
The point is, if there had been no incentives to develop codecs, there would have been no MPEG. Other people would have stepped into the void and sometimes did, e.g. RealVideo, but without legal IP protection the codecs would just have been entirely undocumented and heavily obfuscated, and would have moved to tamper-proofed ASICs much faster.
You continue to make the same unsubstantiated claims about codecs being hard and expensive. These same tropes were said about every other field, and even if true, we have tens of thousands of folks who would like to participate, but are locked out due to broken IP law.
The firewall of patents exists precisely because digital video is a way to shake down the route media would have to travel to get to the end user.
Codecs are not "harder than" compilers, yet the field of compilers was blown completely open by GCC. Capital didn't see the market opportunity because there wasn't the same possibility of being a gatekeeper for so much attention and money.
The patents aren't because it is difficult, the patents are there because they can extract money from the revenue streams.
Codecs not harder than compilers? Sounds like an unsubstantiated claim!
Modern video codecs are harder than compilers. You have to have good ASIC development expertise to do them right, for example, which you don't need for compilers. It's totally feasible for a single company to develop a leading edge compiler whereas you don't see that in video codecs, historically they've been collaborations.
(I've worked on both codecs and compilers. You may be underestimating the difficulty of implementing sound optimizers).
Hardware vendors don't benefit from the patent pools. They usually get nothing from them, and are burdened by having to pass per-unit licensing costs on to their customers.
It's true that designing an ASIC-friendly codec needs special considerations, and benefits from close collaboration with hardware vendors, but it's not magic. The general constraints are well-known to codec designers (in open-source too). The commercial incentives for collaboration are already there — HW vendors will profit from selling the chipsets or licensing the HW design.
The patent situation is completely broken. The commercial codecs "invent" coding features of dubious utility, mostly unnecessary tweaks on old stuff, because everyone wants to have their patent in the pool. It ends up being a political game, because the engineering goal is to make the simplest most effective codec, but the financial incentive is to approve everyone's patented add-ons regardless of whether they're worth the complexity or not.
Meanwhile, everything that isn't explicitly covered by a patent needs to be proven to be at least 20 years old, and this limits MPEG too. Otherwise nobody can prove that there won't be any submarine patent that could be used to set up a competing patent pool and extort MPEG's customers.
So our latest-and-greatest codecs are built on 20-year-old ideas, with or without some bells and whistles added. The ASICs often don't use the bells and whistles anyway, because the extra coding features may not even be suitable for ASICs, and usually have diminishing returns (like 3x slower encode for 1% better quality/filesize ratio).
With all due respect, to say that codecs are more difficult to get right than optimizing compilers is absurd.
The only reason I can think of why you would say this is that nowadays we have good compiler infrastructure that works with many hardware architectures and it has become easy to create or modify compilers. But that's only due to the fact that it was so insanely complicated that it had to be redone from scratch to become generalizable, which led to LLVM and the subsequent direct and indirect benefits everywhere. That's the work of thousands of the smartest people over 30 years.
There is no way that a single company could develop a state of the art compiler without using an existing one. Intel had a good independent compiler and gave up because open source had become superior.
For what it's worth, look at the state of FPGA compilers. They are so difficult that every single one of them that exists is utter shit. I wish it were different.
> There is no way that a single company could develop a state of the art compiler without using an existing one. Intel had a good independent compiler and gave up because open source had become superior.
Not only can they do it but some companies have done it several times. Look at Oracle: there's HotSpot's C2 compiler, and the Graal compiler. Both state of the art, both developed by one company.
Not unique. Microsoft and Apple have built many compilers alone over their lifespans.
This whole thing is insanely subjective, but that's why I'm making fun of the "unsubstantiated claim" bit. How exactly are you meant to objectively compare this?
I've searched for some performance comparisons between Graal and GCC on equivalent programs, and it seems like Graal is not quite at the same level - unsurprisingly, it is probably more concerned with avoiding boxing than with optimal use of SIMD. And as much as I love Roslyn, which is/was a Microsoft thing: it has the same issue. It only recently got serious about competing with C, and that's years after it was open sourced.
Well, Graal is designed to compile Java and dynamic scripting languages, not C. Its flexibility means it can also compile C (=LLVM bitcode), but that's more of a tech demo than something they invest into.
I don't quite get your point though. Mine was only that it's common for single companies to develop multiple independent state of the art compilers, whereas after the 1990s video codecs tend to be collaborations between many companies. That's a piece of evidence that codecs are harder. But this is all quite subjective and I don't really care. Maybe compilers are actually harder and the trend toward collaboration in video is just a cultural quirk of that subfield - doesn't really matter. The starting point of the thread was a belief that if MPEG didn't exist, video codecs would all have been 100% free right from day one, and I just don't see any evidence for that. The competition to MPEG in the 90s was mostly Sorenson and RealVideo, if my fading memories aren't too garbled. Although the last version of Sorenson Spark was apparently a tweaked version of H.263, according to Wikipedia.
Software wasn't always covered by copyright, and people wrote it all the same. In fact they even sold it, just built-to-order as opposed to any kind of retail mass market. (Technically, there was no mass market for computers back then so that goes without saying.)
That argument seems to have been proven basically correct, given that a ton of open source development happens only because companies with deep pockets pay for the developers' time. Which makes perfect sense - no matter how altruistic a person is, they have to pay rent and buy food just like everyone else, and a lot of people aren't going to have time/energy to develop software for free after they get home from their 9-5.
Without IP protections that allow copyleft to exist arguably there would be no FOSS. When anything you publish can be leveraged and expropriated by Microsoft et al. without them being obligated to contribute back or even credit you, you are just an unpaid ghost engineer for big tech.
This is still the argument for software copyright. And I think it's still a pretty persuasive argument, despite the success of FLOSS. To this day, there is very little successful consumer software. Outside of browsers, Ubuntu, LibreOffice, and GIMP are more or less it, at least outside certain niches. And even they are pretty tiny compared to Windows/MacOS/iOS/Android, Office/Google Docs, or Photoshop.
The browsers are an interesting case. Neither Chrome nor Edge are really open source, despite Chromium being so, and they are both funded by advertising and marketing money from huge corporations. Safari is of course closed source. And Firefox is an increasingly tiny runner-up. So I don't know if I'd really count Chromium as a FLOSS success story.
Overall, I don't think FLOSS has had the kind of effect that many activists were going for. What has generally happened is that companies building software have realized that there is a lot of value to be found in treating FLOSS software as a kind of barter agreement between companies, where maybe Microsoft helps improve Linux for the benefit of all, but in turn it gets to use, say, Google's efforts on Chromium, and so on. The fact that other companies then get to mooch off of these big collaborations doesn't really matter compared to getting rid of the hassle of actually setting up explicit agreements with so many others.
That's great, but it's not what FLOSS activists hoped and fought for.
It's still almost impossible to have a digital life that doesn't involve significant use of proprietary software, and the vast majority of users do their computing almost exclusively through proprietary software. The fact that this proprietary software is a bit of glue on top of a bunch of FLOSS libraries possibly running on a FLOSS kernel that uses FLOSS libraries to talk to a FLOSS router doesn't really buy much actual freedom for the end users. They're still locked in to the proprietary software vendors just as much as they were in the 90s (perhaps paying with their private data instead of actual money).
If you ignore the proprietary routers, the proprietary search engines, the proprietary browsers that people use out-of-the-box (Edge, Safari and even Chrome), and the fact that Linux is a clone of a proprietary OS.
>> That sounds like the 90s argument against FLOSS
> This is still the argument for software copyright.
And open source licensing is based on and relies on copyright. Patents and copyright are different kinds of intellectual property protection and incentivize different things. Copyright in some sense encourages participation and collaboration because you retain ownership of your code. The way patents are used discourages participation and collaboration.
On my new phone I made sure to install F-Droid first thing, and it's surprising how many basic functions are covered by free software if you just bother to look.
Google has been bombarding Firefox users with "Upgrade to Chrome" notices on their properties. Google kept having "oopses" that blocked browsers based on User-Agent strings, rather than capabilities.
Google also plays "fire and motion" with Web standards. They have a tendency to use non-standard(-yet) features on their websites. This gives them a perfect excuse to make other browsers look technically inferior (when the features are missing or the browser is blocked) or slow (when the features are implemented using inefficient polyfills). The unfairness is the one-sided choice of using whatever cutting-edge or Google-specific feature Chrome has, while they'd never do this in the other direction. If Firefox implemented a new feature first, Google would never tell Chrome users that Chrome sucks and they need to upgrade to Firefox.
There's a grain of truth to it — Apple has learned from Microsoft's history that making the whole browser shitty is too obvious and annoys users. Apple was smart enough to keep user-visible parts of the browser in a good shape, while also dragging their feet on all the Web platform features that could endanger the App Store cash cow.
I don't want web apps on my phone (or, in an ideal world, anywhere else) so that's also a good thing. If they're not viable, it forces developers to make real apps or else just make a web page instead of whatever awful-UX nonsense they were planning.
>I don't want web apps on my phone (or, in an ideal world, anywhere else) so that's also a good thing. If they're not viable, it forces developers to make real apps or else just make a web page instead of whatever awful-UX nonsense they were planning.
Well what you personally want is irrelevant to the law and what regulators judge to be unlawful, so that's the real good thing.
>If they're not viable, it forces developers to make real apps or else just make a web page instead of whatever awful-UX nonsense they were planning.
They are perfectly viable and it has nothing to do with UX, but you have already exposed your bias and made clear that you are arguing in bad faith by spreading misinformation in your other comments.
This is tautological. If you keep instructions dumbed-down enough for AI to work well, it will work well.
The problem is that AI needs to be spoon-fed overly detailed dos and don'ts, and even then the output can't be trusted without carefully checking it. It's easy to reach a point where breaking down the problem into pieces small enough for the AI to understand takes more work than just writing the code.
AI may save time when it generates the right thing on the first try, but that's a gamble. The code may need multiple rounds of fixups, or end up needing a manual rewrite anyway, after wasting time and effort on instructing the AI. The ceiling of AI capabilities is very uneven and unpredictable.
Even worse, the AI can confidently generate code that looks superficially correct, but has subtle bugs/omissions/misinterpretations that end up costing way more time and effort than the AI saved. It has an uncanny ability to write nicely structured, well-commented code that is just wrong.
I made an STT tool (guess who wrote it for me) and have a bluetooth mic. I spend 10 minutes pacing and telling the AI what I need it to build, and how to build it. Then it goes off and builds it, and meanwhile I go to the next Claude Code instance on a different project, and do the same thing there. Then do the same for a third, and maybe by that time the first is ready for more direction. Depending on how good you are with context switching and quickly designing complex systems and communicating those designs, you can get a whole lot done in parallel. The problems you're describing can be solved, if you're careful and detailed.
It's a brave, weird and crazy new world. "The future is now, old man."
Young man, software is often more than 50 lines of code that merely merge basic examples from two libraries. That stuff is useful too, but that's a 0.5x intern, not a 10x developer.
I've told the same Claude to write me unit tests for a very well-known, well-documented API. It was too dumb to deduce what edge cases it should test, so I also had to give it a detailed list of what to test and how. Despite all of that, it still wrote crappy tests that misused the API. It couldn't properly diagnose the failures, and kept adding code for non-existent problems. It was bad at applying fixes even when told exactly what to fix. I've wasted a lot of time cleaning up crappy code and diagnosing AI-made mistakes. It would have been quicker to write it all myself.
I've tried Claude and GPT-4o for a task that required translating imperative code that writes structured data to disk field by field into explicit schema definitions. It was an easy but tedious task (I had many structs to convert). The AI hallucinated a bunch of fields, and got many types wrong, wasting a lot of my time on diagnosing serialization issues. I really wanted it to work, but I've burned over $100 in API credits (not counting subscriptions) trying various editors and approaches. I've wasted time and money managing context for it, to give it enough of the codebase to stop it from hallucinating the missing parts, but also to carefully trim it to avoid distracting it or causing rot. It just couldn't do the work precisely. In the end I had to scrap it all and do it by hand myself.
I've tried GPT-4o and 4-mini-high to write me a specific image processing operation. They could discuss the problem with seemingly great understanding (referencing academic research, advanced data structures). I even got Python that had correct syntax on the first try! But the implementation had a fundamental flaw that caused numeric overflows. The AI couldn't fix it itself (it kept inventing stupid workarounds that didn't work, or even defeated the point of the whole algorithm). When told step by step what to do to fix it, it kept breaking other things in the process.
I've tried to make AI upgrade code using an older version of a dependency to a newer one. I've provided it with relevant quotes from the docs (I knew the new version would be newer than its knowledge cutoff), and even converted parts of the code myself, so it could just follow the pattern. The AI couldn't properly copy-paste code from one function to another. It kept reverting things. When I pointed out the issues, it kept apologising, saying which new APIs it was going to use, and then used the old APIs again!
I've also briefly tried GH copilot, but it acted like level 1 tech support, despite burning tokens of a more capable model.
It turns out that deflate can be much faster when implemented specifically for PNG data, instead of as general-purpose compression (while still remaining 100%-standard-compatible).
Note he also expects worse compression as a tradeoff. I think he implements RLE in terms of zlib:
[...]Deflate compressor which was optimized for simplicity over high ratios. The "parser" only supports RLE matches using a match distance of 3/4 bytes, [...]
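To make that concrete, here's a rough sketch of what such a "parser" could look like (my own toy Python, not the author's actual code): the only match it ever considers is at distance == bytes-per-pixel, which turns LZ77 matching into run-length encoding of identical pixels, while the resulting tokens stay expressible as a standard deflate stream that any inflate can decode. Actually emitting the deflate bitstream is left out.

    def parse_scanline(data: bytes, bpp: int):
        """Yield ('lit', byte) and ('match', length, distance) tokens."""
        i = 0
        while i < len(data):
            # Count upcoming bytes that repeat the bytes `bpp` positions back,
            # i.e. pixels identical to the previous pixel.
            run = 0
            while (i + run < len(data)
                   and i + run >= bpp
                   and data[i + run] == data[i + run - bpp]
                   and run < 258):          # deflate's maximum match length
                run += 1
            if run >= 3:                     # deflate's minimum match length
                yield ('match', run, bpp)
                i += run
            else:
                yield ('lit', data[i])
                i += 1

    # A scanline of 8 identical RGB pixels becomes 3 literals plus one long match at distance 3.
    tokens = list(parse_scanline(b'\x10\x20\x30' * 8, bpp=3))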
Bringing the check immediately is associated with fast food, and overcrowded touristy places that are rushing customers to leave. Places that want to be fancy act like you're there to hang out, not to just eat and leave.
It is sometimes absurd. In the UK there's often an extra step of "oh, you're paying by card? let me go back and bring the card reader". Some places have just one reader shared among all the waiting staff, so you're not going to get it faster unless you tip enough to make the staff wrestle for it.
I like the Japanese style the best — there's a cashier by the exit.
Even with the best intentions, the implementation is going to have bugs and quirks that weren't meant to be the standard.
When there's no second implementation to compare against, then everything "works". The implementation becomes the spec.
This may seem wonderful at first, but in the long run it makes pages accidentally depend on the bugs, and the bugs become a part of the spec.
This is why Microsoft has a dozen different button styles, and sediment layers of control panels all the way back to 1990. Eventually every bug became a feature, and they can't touch old code, only pile up new stuff around it.
When you have multiple independent implementations, it's very unlikely that all of them will have the same exact bug. The spec is the subset that most implementations agree on, and that's much easier to maintain long term, plus you have a proof that the spec can be reimplemented.
Bug-compatibility very often exposes unintended implementation details, and makes it hard even for the same browser to optimize its own code in the future (e.g. if pages rely on order of items you had in some hashmap, now you can't change the hashmap, can't change the hash function, can't store items in a different data structure without at least maintaining the old hashmap at the same time).
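A toy illustration of how that lock-in happens (made-up code, not any real engine): if keys are enumerated in internal hash-table order and pages start relying on that order, the hash function and table size silently become part of the de-facto spec.

    def enumerate_keys(obj, buckets=8):
        # Pretend the engine walks its hash table in bucket order; the order
        # leaks the hash function and the table size to page scripts.
        table = [[] for _ in range(buckets)]
        for k in obj:
            table[hash(k) % buckets].append(k)
        return [k for bucket in table for k in bucket]

    state = {"menu": 1, "ads": 2, "player": 3}
    print(enumerate_keys(state))              # pages start relying on this order...
    print(enumerate_keys(state, buckets=16))  # ...and now resizing the table "breaks the web"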
Is that so bad though? It's essentially what's already the case and as you said the developers already have an incentive to avoid making such bugs. Most developers are only going to target a single browser engine anyways, so bug or not any divergence can cause end users problems.
Regulations are like the code of a program: the business logic of how we want the world to be.
Like all code, it can be buggy, bloated and slow, or it can be well-written and efficiently achieve ambitious things.
If you have crappy unmaintainable code that doesn't work, then deleting it is an obvious improvement.
Like in programming, it takes a lot of skill to write code that achieves its goals in a way that is as simple as possible, but also isn't oversimplified to the point of failing to handle important cases.
The pro-regulation argument isn't for naively piling up more code and more bloat, but for improving and optimizing it.
Motion vectors in video codecs are an equivalent of a 2D projection of 3D motion vectors.
In typical video encoding motion compensation of course isn't derived from real 3D motion vectors, it's merely a heuristic based on optical flow and a bag of tricks, but in principle the actual game's motion vectors could be used to guide video's motion compensation. This is especially true when we're talking about a custom codec, and not reusing the H.264 bitstream format.
Referencing previous frames doesn't add latency, and limiting motion to just displacement of the previous frame would be computationally relatively simple. You'd need some keyframes or gradual refresh to avoid a "datamoshing" look persisting on packet loss.
However, the challenge is in encoding the motion precisely enough to make it useful. If it's not aligned with sub-pixel precision, it may make textures blurrier and make movement look wobbly, almost like PS1 games. It's hard to fix that by encoding the diff, because the diff ends up having high frequencies that don't survive compression. Motion compensation should also be encoded with sharp boundaries between objects, as otherwise it causes shimmering around edges.
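A tiny numeric illustration of the sub-pixel point (my own toy setup with a smooth 1D "texture", nothing taken from any codec):

    import numpy as np

    def texture(t):                      # a smooth, band-limited-ish "texture" row
        return np.sin(0.5 * t) + 0.5 * np.sin(1.3 * t + 1.0)

    x = np.arange(1024, dtype=float)
    frame0 = texture(x)                  # reference frame
    frame1 = texture(x + 0.3)            # next frame: true motion is 0.3 px

    pred_int = frame0                          # best integer motion vector is 0 px
    pred_sub = np.interp(x + 0.3, x, frame0)   # sub-pel prediction via interpolation

    print(np.abs(frame1 - pred_int).mean())    # residual left by integer-pel motion
    print(np.abs(frame1 - pred_sub).mean())    # markedly smaller residual to encode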
> Motion vectors in video codecs are an equivalent of a 2D projection of 3D motion vectors.
3D motion vectors always get projected to 2D anyway. They also aren't used for moving blocks of pixels around; they are floating-point values that get used along with a depth map to re-rasterize an image with motion blur.
They are used for moving pixels around when used in Frame Generation. P-frames in video codecs aim to do exactly the same thing.
Implementation details are quite different, but for reasons unrelated to motion vectors — the video codecs that are established now were designed decades ago, when the use of neural networks was in its infancy, and hardware acceleration for NNs was way outside of the budget of HW video decoders.
Third, optical flow isn't moving blocks of pixels around by an offset and then encoding the difference; it is creating a floating-point vector for every pixel and then re-rasterizing the image into a new one.
You've previously emphasised use of blocks in video codecs, as if it was some special distinguishing characteristic, but I wanted to explain that's an implementation detail, and novel video codecs could have different approaches to encoding P-frames. They don't have to code a literal 2D vector per macroblock that "moves pixels around". There are already more sophisticated implementations than that. It's an open problem of reusing previous frames' data to predict the next frame (as a base to minimize the residual), and it could be approached in very different ways, including use of neural networks that predict the motion. I mention NNs to emphasise how different motion compensation can be than just copying pixels on a 2D canvas.
Motion vectors are still motion vectors regardless of how many dimensions they have. You can have per-pixel 3D floating-point motion vectors in a game engine, or you can have 2D-flattened motion vectors in a video codec. They're still vectors, and they still represent motion (or its approximation).
Optical flow is just one possible technique of getting the motion vectors for coding P-frames. Usually video codecs are fed only pixels, so they have no choice but to deduce the motion from the pixels. However, motion estimated via optical flow can be ambiguous (flat surfaces) or incorrect (repeating patterns), or non-physical (e.g. fade-out of a gradient). Poorly estimated motion can cause visible distortions when the residual isn't transmitted with high-enough quality to cover it up.
3D motion vectors from a game engine can be projected into 2D to get the exact motion information that can be used for motion compensation/P-frames in video encoding. Games already use it for TAA, so this is going to be pretty accurate and authoritative motion information, and it completely replaces the need to estimate the motion from the 2D pixels. Dense optical flow is a hard problem, and game engines can give the flow field basically for free.
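Rough sketch of the projection I mean (standard TAA-style math with made-up matrices and numbers, not any particular engine's code): given a pixel's world-space position, its world-space motion since the previous frame, and the two frames' view-projection matrices, the 2D screen-space motion vector falls out directly, with no optical-flow estimation.

    import numpy as np

    def screen_space_motion(world_pos, world_motion, prev_view_proj, cur_view_proj, width, height):
        def project(p, view_proj):
            clip = view_proj @ np.append(p, 1.0)                    # to homogeneous clip space
            ndc = clip[:2] / clip[3]                                # perspective divide
            return (ndc * 0.5 + 0.5) * np.array([width, height])    # to pixel coordinates

        cur_px = project(world_pos, cur_view_proj)
        prev_px = project(world_pos - world_motion, prev_view_proj)
        return cur_px - prev_px    # 2D motion vector, usable for motion compensation

    # Made-up numbers: identity "camera", object moved 0.1 units along x since last frame.
    vp = np.eye(4)
    print(screen_space_motion(np.array([0.2, 0.0, 0.5]), np.array([0.1, 0.0, 0.0]), vp, vp, 1920, 1080))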
You've misread what I've said about optical flow earlier. You don't need to give me Wikipedia links, I implement codecs for a living.
The big difference is that if you are recreating an entire image, and there isn't going to be any difference information against a reference image, you can't just move pixels around: you have to get fractional values out of optical flow and move pixels by fractional amounts that potentially overlap in some areas and leave gaps in others.
This means rasterization and making a weighted average of moved pixels as points with a kernel with width and height.
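Something like this is what I mean by re-rasterizing (toy single-channel Python with a bilinear "kernel", nothing codec-specific): each source pixel is splatted at its fractional target position, weights are accumulated, overlaps get averaged, and target pixels that receive nothing are holes.

    import numpy as np

    def forward_warp(img, flow):
        h, w = img.shape
        out = np.zeros((h, w))
        wsum = np.zeros((h, w))
        for y in range(h):
            for x in range(w):
                tx, ty = x + flow[y, x, 0], y + flow[y, x, 1]
                x0, y0 = int(np.floor(tx)), int(np.floor(ty))
                fx, fy = tx - x0, ty - y0
                # Splat into the 4 neighbouring target pixels with bilinear weights.
                for dy, dx, wgt in ((0, 0, (1 - fx) * (1 - fy)), (0, 1, fx * (1 - fy)),
                                    (1, 0, (1 - fx) * fy),       (1, 1, fx * fy)):
                    yy, xx = y0 + dy, x0 + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        out[yy, xx] += wgt * img[y, x]
                        wsum[yy, xx] += wgt
        holes = wsum == 0                      # gaps / disocclusions
        out[~holes] /= wsum[~holes]            # weighted average where pixels overlapped
        return out, holes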
Optical flow isn't one technique, it's just a name for getting motion vectors in the first place.
I've started this thread by explaining this very problem, so I don't get why you're trying to lecture me on subpel motion and disocclusion.
What's your point? Your replies seem to be just broadly contrarian and patronizing.
I've continued this discussion assuming that maybe we talk past each other by using the term "motion vectors" in narrower and broader meanings, or maybe you did not believe that the motion vectors that game engines have can be incredibly useful for video encoding.
However, you haven't really communicated your point across. I only see that whenever I describe something in a simplified way, you jump to correct me, while failing to realize that I'm intentionally simplifying for brevity and to avoid unnecessary jargon.
The opposite is true. ASCII and English are pretty good at compressing. I can say "cat" with just 24 bits. Your average LLM token embedding uses on the order of kilobits internally.
You can have "cat" as 1 token, or you can have "c" "a" "t" as 3 tokens.
In either case, the tokens are a necessary part of LLMs. They have to have a differentiable representation for the model to be trained effectively. High-dimensional embeddings are differentiable and are able to usefully represent the "meaning" of a token.
In other words, the representation of "cat" in an LLM must be something that can be gradually nudged towards "kitten", or "print", or "excavator", or other possible meanings. This is doable with the large vector representation, but such operation makes no sense when you try to represent the meaning directly in ASCII.
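As a toy illustration (made-up 3-word vocabulary, tiny 8-dimensional vectors instead of the thousands real models use):

    import numpy as np

    vocab = {"cat": 0, "kitten": 1, "excavator": 2}
    emb = np.random.default_rng(0).standard_normal((len(vocab), 8))   # toy embedding matrix

    v_cat = emb[vocab["cat"]]
    # Training can nudge this vector continuously, e.g. a small step toward "kitten":
    v_cat_nudged = v_cat + 0.01 * (emb[vocab["kitten"]] - v_cat)

    # There is no analogous small differentiable step that turns the bytes b"cat" into b"kitten".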
True, but imagine an input that is ASCII, followed by some layers of NN that result in an embedded representation and from there the usual NN layers of your LLM. The first layers can have shared weights (shared between inputs). Thus, let the LLM solve the embedding problem implicitly. Why wouldn't this work? It is much more elegant because the entire design would consist of neural networks, no extra code or data treatment necessary.
This might be more pure, but there is nothing to be gained. On the contrary, this would lead to very long sequences for which self-attention scales poorly.
No, an LLM really uses __many__ more bits per token.
First, the embedding typically uses thousands of dimensions.
Then, the value along each dimension is represented with a floating point number which will take 16 bits (can be smaller though with higher quantization).
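Back-of-the-envelope (the dimension count is an assumption, not any specific model's):

    ascii_bits = 3 * 8            # "cat" as ASCII
    embedding_bits = 4096 * 16    # e.g. 4096 dimensions at 16 bits each
    print(ascii_bits, embedding_bits)   # 24 vs 65536 bits per token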
But we can feed humans ASCII, whereas LLMs require token inputs. My original question was about that: why can't we just feed an LLM ASCII, and let it figure out how it wants to encode that internally, __implicitly__? I.e., we just design a network and feed it ASCII, as opposed to figuring out an encoding in a separate step and feeding it tokens in that encoding.
> But we can feed humans ASCII, whereas LLMs require token inputs.
To be pedantic, we can't feed humans ASCII directly, we have to convert it to images or sounds first.
> My original question was about that: why can't we just feed an LLM ASCII, and let it figure out how it wants to encode that internally, __implicitly__? I.e., we just design a network and feed it ASCII, as opposed to figuring out an encoding in a separate step and feeding it tokens in that encoding.
That could be done, by having only 256 tokens, one for each possible byte, plus perhaps a few special-use tokens like "end of sequence". But it would be much less efficient.
Because each byte would be an embedding, instead of several bytes (a full word or part of a word) being a single embedding. The amount of time an LLM takes is proportional to the number of embeddings (or tokens, since each token is represented by an embedding) in the input, and the amount of memory used by the internal state of the LLM is also proportional to the number of embeddings in the context window (how far it looks back in the input).
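Rough numbers for a short sentence (the word-level token count is an assumed ballpark, not a specific tokenizer's output):

    text = "The cat sat on the mat."

    byte_tokens = len(text.encode("ascii"))   # one token per byte
    word_tokens = 7                           # assumed ballpark for a word/subword tokenizer

    print(byte_tokens, word_tokens, byte_tokens / word_tokens)   # ~3x more steps and ~3x more state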