
Note that writing 64 bits and reading 32 (or vice versa) is not a way to get around fences on x86. It is explicitly documented as being undefined. In most cases the load will fail to store-forward, which stalls and acts as an implicit fence, but in some cases the CPU can do partial store forwarding, which breaks the trick.

AFAIK this trick does work on SPARC though.
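The portable way to get the store-then-load ordering this trick was after is an explicit fence. Below is a minimal C11 sketch, assuming a Dekker-style handshake between two threads numbered 0 and 1; the names `flag`, `announce_and_check_other` and `withdraw` are hypothetical, not from any library.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flags for a two-thread, Dekker-style handshake (threads 0 and 1).
   Static storage, so both start out false. */
static atomic_bool flag[2];

/* Raise our own flag, then report whether the other thread's flag is raised.
   The seq_cst fence between the store and the load is what rules out
   store->load reordering; relying on a wider store overlapping a narrower
   load is not guaranteed to do that on x86. On x86 the fence is typically
   emitted as MFENCE or a locked instruction. */
bool announce_and_check_other(int self) {
    assert(self == 0 || self == 1);
    atomic_store_explicit(&flag[self], true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag[1 - self], memory_order_relaxed);
}

/* Lower our flag again once we are done. */
void withdraw(int self) {
    assert(self == 0 || self == 1);
    atomic_store_explicit(&flag[self], false, memory_order_release);
}
```

With a seq_cst fence on both sides, C11 forbids the outcome where both threads' calls return false, which is exactly the guarantee the overlapping-width trick was trying to get for free.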




It's not documented as being undefined; it's simply not documented at all.

Intel's latest uarch does partial store forwarding.

There was a paper from a few years ago trying to define semantics for mixed-size accesses (not for x86 though) https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf

I don't think the parent was talking about this, though; they were just talking about using a single large physical location that logically contains multiple smaller values. Accesses to a single location happen in order, so there is indeed no need for fencing between accesses to it. Usually you can get a full 128 bits that way (at least on amd64/aarch64/ppc64; not riscv yet, but I expect they will get there).

That said, mixed-size accesses can be useful despite the lack of semantics (I think Linux uses them in a few places?), so...
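To make the single-location idea concrete, here is a minimal C11 sketch assuming two hypothetical 32-bit fields packed into one 64-bit atomic word. It uses 64 rather than 128 bits only because a lock-free 16-byte atomic depends on compiler and target support; `pair32`, `load_pair` and `cas_pair` are illustrative names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Two hypothetical 32-bit fields that logically live in one 64-bit location. */
typedef struct { uint32_t lo, hi; } pair32;

static_assert(sizeof(uint64_t) == 2 * sizeof(uint32_t),
              "both halves must fit in one 64-bit word");

static uint64_t pack(pair32 p) { return ((uint64_t)p.hi << 32) | p.lo; }

static pair32 unpack(uint64_t w) {
    return (pair32){ .lo = (uint32_t)w, .hi = (uint32_t)(w >> 32) };
}

/* One full-width load: both halves come from the same write, so a reader
   never sees a torn pair. */
pair32 load_pair(_Atomic uint64_t *loc) {
    return unpack(atomic_load_explicit(loc, memory_order_acquire));
}

/* Replace both halves at once, but only if the word still holds `expected`.
   Every access uses the same address and width, so all of them fall into the
   single per-location order; acquire/release here only matters for ordering
   other memory around the word. */
bool cas_pair(_Atomic uint64_t *loc, pair32 expected, pair32 desired) {
    uint64_t exp = pack(expected);
    return atomic_compare_exchange_strong_explicit(
        loc, &exp, pack(desired), memory_order_acq_rel, memory_order_acquire);
}
```

The compare-and-swap is used so that updating one half can never silently overwrite a concurrent update to the other half.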


Ah, right, it was about guaranteed total order on all stores in a single memory location.

Re colocation and x86: IIRC the Intel memory model has wording saying that reads and writes to a memory location have to be of the same size to take advantage of the memory model guarantees.


total order on all accesses to a given location: loads from a single location can't be reordered w.r.t. each other either
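That per-location guarantee holds even for relaxed loads. A small C11 + POSIX threads sketch of what it rules out, assuming one writer that only ever increases a counter; `counter`, `writer` and `reads_are_coherent` are hypothetical names, and a passing run only illustrates the property rather than proving it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared counter: only the writer thread stores to it. */
static _Atomic uint64_t counter;
enum { N = 100000 };

/* Publish 1, 2, ..., N with relaxed stores (no fences at all). */
static void *writer(void *arg) {
    (void)arg;
    for (uint64_t i = 1; i <= N; i++)
        atomic_store_explicit(&counter, i, memory_order_relaxed);
    return NULL;
}

/* Reads the counter with relaxed loads until it reaches N and reports whether
   the observed values ever went backwards. Read-read coherence forbids that,
   because every access hits the same location. Also returns false if the
   writer thread could not be started or joined. */
bool reads_are_coherent(void) {
    pthread_t t;
    atomic_store_explicit(&counter, 0, memory_order_relaxed);
    if (pthread_create(&t, NULL, writer, NULL) != 0)
        return false;
    uint64_t prev = 0;
    bool ok = true;
    for (;;) {
        uint64_t cur = atomic_load_explicit(&counter, memory_order_relaxed);
        assert(cur <= N); /* the writer never goes past N */
        if (cur < prev)
            ok = false;
        prev = cur;
        if (cur == N)
            break;
    }
    return pthread_join(t, NULL) == 0 && ok;
}
```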

I don't remember seeing any wording relating to mixed-size accesses in the Intel manual (notwithstanding that the official models are ... ambiguous, to say the least, compared with what 3rd-party researchers have done)


> I don't remember seeing any wording relating to mixed-size accesses in the Intel manual (notwithstanding that the official models are ... ambiguous, to say the least, compared with what 3rd-party researchers have done)

I was probably misremembering the details. The manual has this to say regarding LOCK-prefixed operations:

"Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access, other processors should not access the semaphore using a byte access"

which is already vague enough, but regarding general atomic loads and stores I couldn't find anything.
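One way to follow that advice in C11 is to give the lock word its own atomic type, so every access necessarily uses the same address and operand length. A minimal sketch, assuming a simple test-and-set spinlock; `spinlock`, `spin_trylock`, `spin_lock` and `spin_unlock` are hypothetical names, not part of any library.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock whose word is only ever accessed through this 32-bit
   atomic member: same address and same operand length for every processor.
   0 = free, 1 = held. */
typedef struct { _Atomic uint32_t word; } spinlock;

/* One attempt: atomically swap in 1 and check whether the lock was free.
   On x86 this is typically emitted as XCHG, which is implicitly locked. */
bool spin_trylock(spinlock *l) {
    return atomic_exchange_explicit(&l->word, 1, memory_order_acquire) == 0;
}

void spin_lock(spinlock *l) {
    while (!spin_trylock(l)) {
        /* Wait with relaxed atomic loads of the same 32-bit word until it looks free. */
        while (atomic_load_explicit(&l->word, memory_order_relaxed) != 0) {
        }
    }
}

void spin_unlock(spinlock *l) {
    assert(atomic_load_explicit(&l->word, memory_order_relaxed) == 1); /* must be held */
    atomic_store_explicit(&l->word, 0, memory_order_release);
}
```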


For a total store order to be meaningful, it of course implies that loads are also not visibly reordered. If a store falls in the forest but nobody is around to load it, was it really ordered? :)


Tbf you could say stores happen in order, and loads can happen out of order unless you fence. Personally I don't understand why we need such strong ordering constraints for weakly ordered reads; istm you can go much weaker and maintain sanity.
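In the "loads are unordered unless you fence" style, the reader pays for ordering only where it needs it. A minimal C11 message-passing sketch, assuming one producer publishing a single message and one consumer; `payload`, `ready`, `publish` and `try_consume` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-shot message: plain data plus an atomic "ready" flag. */
static int payload;
static atomic_bool ready;

/* Producer: write the data, then set the flag with a release store, so a
   consumer that sees the flag and then acquires also sees the data. */
void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: a relaxed load of the flag, then an acquire fence only once we
   know there is something to read. The fence keeps the payload read from
   being satisfied before the flag read. */
bool try_consume(int *out) {
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

On x86 the acquire fence typically compiles to no instruction at all (loads are not reordered with other loads there, though the fence still constrains the compiler), while on weaker architectures such as AArch64 it typically becomes a barrier instruction.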



