more pornel's comments | Hacker News

Hacker Newsnew | past | comments | ask | show | jobs | submit | more pornel's comments

pornel 6 months ago | parent | context | [–] | on: Apple and Meta fined millions for breaching EU law

If Apple is so bad at this that they have to charge 30%, they should have failed in the free market to a competitor that can do the same or better for 3%. However, Apple has prevented that, not by being better or cheaper, but by implementing DRM that locks users out from having a choice (and the market as a whole ended up being a duopoly with cartel-like pricing).

Whether Apple can be cheaper isn't really the point (they should be, digital services are a very high margin business). It's that they're anti-competitive to the point that the market for paid apps and in-app payments became inefficient (in a financial sense).

pornel 6 months ago | parent | context | | [–] | on: Pipelining might be my favorite programming langua...

Rust has such open extensibility through traits. The prime example is Itertools that already adds a bunch of extra pipelining helper methods.

pornel 7 months ago | parent | context | | [–] | on: Gemma 3 QAT Models: Bringing AI to Consumer GPUs

It is due to the risk of a leak.

Laundering of data through training makes it a more complicated case than a simple data theft or copyright infringement.

Leaks could be accidental, e.g. due to an employee logging in to their free-as-in-labor personal account instead of a no-training Enterprise account. It's safer to have a complete ban on providers that may collect data for training.

6510 7 months ago | | [–]

Their entire business model based on taking other peoples stuff. I cant imagine someone would willingly drown with the sinking ship if the entire cargo is filled with lifeboats - just because they promised they would.

vbezhenar 7 months ago | | | [–]

How can you be sure that AWS will not use your data to train their models? They got enormous data, probably most data in the world.

simonw 7 months ago | | | [–]

Being caught doing they would be wildly harmful to their business - billions of dollars harmful, especially given the contracts they sign with their customers. The brand damage would be unimaginably expensive too.

There is no world in which training on customer data without permission would be worth it for AWS.

Your data really isn't that useful anyway.

mdp2021 6 months ago | | | [–]

> Your data really isn't that useful anyway

? One single random document, maybe, but as an aggregate, I understood some parties were trying to scrape indiscriminately - the "big data" way. And if some of that input is sensitive, and is stored somewhere in the NN, it may come out in an output - in theory...

Actually I never researched the details of the potential phenomenon - that anything personal may be stored (not just George III but Random Randy) -, but it seems possible.

simonw 6 months ago | | | [–]

There's a pretty common misconception that training LLMs is about loading in as much data as possible no matter the source.

That might have been true a few years ago but today the top AI labs are all focusing on quality: they're trying to find the best possible sources of high quality tokens, not randomly dumping in anything they can obtain.

Andrej Karpathy said this last year: https://twitter.com/karpathy/status/1797313173449764933

> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.

mdp2021 6 months ago | | | [–]

Obviously the training data should be preferably high quality - but there you have a (pseudo-, I insisted also elsewhere citing the rights to have read whatever is in any public library) problem with "copyright".

If there exists some advantage on quantity though, then achieving high quality imposes questions about tradeoffs and workflows - sources where authors are "free participants" could have odd data sip in.

And the matter of whether such data may be reflected in outputs remains as a question (probably tackled by some I have not read... Ars longa, vita brevis).

pornel 7 months ago | parent | context | | [–] | on: Things Zig comptime won't do

There is a stark contrast in usability of self-contained/owning types vs types that are temporary views bound by a lifetime of the place they are borrowing from. But this is an inherent problem for all non-GC languages that allow saving pointers to data on the stack (Rust doesn't need lifetimes for by-reference heap types). In languages without lifetimes you just don't get any compiler help in finding places that may be affected by dangling pointers.

This is similar to creating a broadly-used data structure and realizing that some field has to be optional. Option<T> will require you to change everything touching it, and virally spread through all the code that wanted to use that field unconditionally. However, that's not the fault of the Option syntax, it's the fault of semantics of optionality. In languages that don't make this "miserable" at compile time, this problem manifests with a whack-a-mole of NullPointerExceptions at run time.

With experience, I don't get this "oh no, now there's a lifetime popping up everywhere" surprise in Rust any more. Whether something is going to be a temporary view or permanent storage can be known ahead of time, and if it can be both, it can be designed with Cow-like types.

I also got a sense for when using a temporary loan is a premature optimization. All data has to be stored somewhere (you can't have a reference to data that hasn't been stored). Designs that try to be ultra-efficient by allowing only temporary references often force data to be stored in a temporary location first, and then borrowed, which doesn't avoid any allocations, only adds dependencies on external storage. Instead, the design can support moving or collecting data into owned (non-temporary) storage directly. It can then keep it for an arbirary lifetime without lifetime annotations, and hand out temporary references to it whenever needed. The run-time cost can be the same, but the semantics are much easier to work with.

pornel 7 months ago | parent | context | | [–] | on: TLS certificate lifetimes will officially reduce t...

Browsers check the identity of the certificates every time. The host name is the identity.

There are lots of issues with trust and social and business identities in general, but for the purpose of encryption, the problem can be simplified to checking of the host name (it's effectively an out of band async check that the destination you're talking to is the same destination that independent checks saw, so you know your connection hasn't been intercepted).

You can't have effective TLS encryption without verifying some identity, because you're encrypting data with a key that you negotiate with the recipient on the other end of the connection. If someone inserts themselves into the connection during key exchange, they will get the decryption key (key exchange is cleverly done that a passive eavesdropper can't get the key, but it can't protect against an active eavesdropper — other than by verifying the active participant is "trusted" in a cryptographic sense, not in a social sense).

pornel 7 months ago | parent | context | | [–] | on: TLS certificate lifetimes will officially reduce t...

I copy the same certbot account settings and private key to all servers and they obtain the certs themselves.

It is a bit funny that LetsEncrypt has non-expiring private keys for their accounts.

pornel 7 months ago | parent | context | | [–] | on: TLS certificate lifetimes will officially reduce t...

DANE is a TLS with too-big-to-fail CAs that are tied to the top-level domains they own, and can't be replaced.

Separation between CAs and domains allows browsers to get rid of incompetent and malicious CAs with minimal user impact.

ryao 7 months ago | | [–]

DANE lets the domain owner manage the certificates issued for the domain.

pornel 7 months ago | | | [–]

This delegation doesn't play the same role as CAs in WebPKI.

Without DNSSEC's guarantees, the DANE TLSA records would be as insecure as self-signed certificates in WebPKI are.

It's not enough to have some certificate from some CA involved. It has to be a part of an unbroken chain of trust anchored to something that the client can verify. So you're dependent on the DNSSEC infrastructure and its authorities for security, and you can't ignore or replace that part in the DANE model.

pornel 7 months ago | parent | context | | [–] | on: A flowing WebGL gradient, deconstructed

Mixing of colors in an "objective" way like blur (lens focus) is a physical phenomenon, and should be done in linear color space.

Subjective things, like color similarity and perception of brightness should be evaluated in perceptual color spaces. This includes sRGB (it's not very good at it, but it's trying).

Gradients are weirdly in the middle. Smoothness and matching of colors are very subjective, but color interpolation is mathematically dubious in most perceptual color spaces, because √(avg(a+b)) ≠ avg(√(a) + √(b))

pornel 7 months ago | parent | context | | [–] | on: Hacktical C: practical hacker's guide to the C pro...

Less controversially, when you write C, you write for a virtual machine described by the C spec, not your actual hardware.

Your C optimizer is emulating that VM when performing symbolic execution, and the compiler backend is cross-compiling from it. It's an abstract hardware that doesn't have signed overflow, has a hidden extra bit for every byte of memory that says whether it's initialized or not, etc.

Assembly-level languages let you write your own calling conventions, arrange the stack how you want, and don't make padding bytes in structs cursed.

bmandale 7 months ago | | [–]

These are all such nonsensical misinterpretations of what people mean when they say C is "low level". You absolutely don't write C for the C abstract machine, because the C spec says nothing about performance, whereas performance is one of the primary reasons people write C.

The existence of undefined behaviour isn't proof that there is a C "virtual machine" that code is being run on. Undefined behaviour is a relaxation of requirements on the compiler. The C abstract machine doesn't not have signed overflow, rather it allows the compiler to do what it likes when signed overflow is encountered. This is originally a concession to portability, since the common saying is not that C is close to assembly, but rather that it is "portable" assembler. It is kept around because it benefits performance, which is again one of the primary reasons people write C.

pjmlp 7 months ago | | | [–]

C performance exists thanks to UB, and the value optimising compilers extract out of it, during the 8 and 16 bit home computers days any average Assembly developer could write better code than C compiler were able to spit out.

codr7 7 months ago | | | [–]

And also because it doesn't get in your way of doing exactly what you want to do.

OCASMv2 7 months ago | | | [–]

If that was true then the optimizers wouldn't need to exist in the first place.

codr7 7 months ago | | | [–]

Compared to the alternatives.

It gets very frustrating to communicate at this level.

pjmlp 7 months ago | | | [–]

The alternatives outside Bell Labs were just as capable.

OCASMv2 7 months ago | | | | [–]

I don't think compilers allowing trash through is a good thing.

codr7 7 months ago | | | [–]

That's an opinion, another one would be that the flexibility allowed by undefined behavior is one of C's strengths at the same time. Strength/weakness are often two sides of the same coin. Which is why these discussions get a bit circular.

greenavocado 7 months ago | | | | [–]

Have you ever seen the error steamroller? https://github.com/mattdiamond/fuckitjs

dwattttt 7 months ago | | | [–]

ON ERROR RESUME NEXT rears its ugly head again

pornel 7 months ago | | | | [–]

I'm not trying to prove a novel concept, just explain how the C spec thinks about C:

> The semantic descriptions in this International Standard describe the behavior of an abstract machine in which issues of optimization are irrelevant.

This belief that C targets the hardware directly makes C devs frustrated that UB seems like an intentional trap added by compilers that refuse to "just" do what the target CPU does.

The reality is that front-end/back-end split in compilers gave us the machine from the C spec as its own optimization target with its own semantics.

Before C got formalised in this form, it wasn't very portable beyond PDP. C was too opinionated and bloated for 8-bit computers. It wouldn't assume 8-bit bytes (because PDP-11 didn't have them), but it did assume linear memory (even though most 16-bit CPUs didn't have it). All those "checking wetness of water... wet" checks in ./configure used to have a purpose!

Originally C didn't count as an assembly any more than asm.js does today. C was too abstract to let programmers choose addressing modes and use flags back when these mattered (e.g. you could mark a variable as `register`, but not specifically as an A register on 68K). C was too high level for tricks like self-modifying code (pretty standard practice where performance mattered until I-cache and OoO killed it).

C is now a portable assembly more because CPUs that didn't fit C's model have died out (VLIW) or remained non-standard specialized targets (SIMT).

aninteger 7 months ago | | | [–]

> Less controversially, when you write C, you write for a virtual machine described by the C spec, not your actual hardware.

Isn't this true for most higher level languages as well? C++ for instance builds on top of C and many languages call into and out of C based libraries. Go might be slightly different as it is interacting with slightly less C code (especially if you avoid CGO).

pornel 7 months ago | parent | context | | [–] | on: Problems with Go channels (2016)

Channels are only problematic if they're the only tool you have in your toolbox, and you end up using them where they don't belong.

BTW, you can create a deadlock equivalent with channels if you write "wait for A, reply with B" and "wait for B, send A" logic somewhere. It's the same problem as ordering of nested locks.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact