What is gained and lost with 63-bit integers? (2014) (janestreet.com)
106 points by ksec on Aug 12, 2023 | 120 comments


> Unfortunately, LEA became very slow in the recent generations of CPUs. Intel doesn’t suggest this will change.

lea is a pretty common instruction even in non-OCaml code so this is annoying. I wonder if this has changed since 2014?

  $ objdump -d /usr/bin/ls | grep '\blea\b' | wc -l
  908
  $ objdump -d /usr/bin/ls | wc -l
  21398
4.2% of all instructions, and that's an underestimate because the raw objdump -d output contains blank lines, nops, etc.

For an OCaml 4.14 program (/usr/bin/virt-v2v), I got 38886 / 432937 which is 9.0%.

For OCaml 5 using the flambda optimizer, 36531 / 480357 = 7.6%


> I wonder if this has changed since 2014?

Good question. Probably not. LLVM reduced the number of LEAs in 2017. https://reviews.llvm.org/D32277

Also, Golang is gradually reducing the number of slow LEAs. https://github.com/golang/go/issues/21735

> grep '\blea\b' | wc -l

This maybe isn't sophisticated enough? The other links seem to explain that two-operand LEA isn't so problematic.


Looks like there have been proposals to eliminate use of 3 operand lea in OCaml code (not accepted sadly):

https://github.com/ocaml/ocaml/pull/8531 https://github.com/ocaml/ocaml/issues/9850


not outright rejected either though, the PRs just both died


Not all leas, just the three-operand version described in the post. The less complex ones are heavily used for address construction and (less commonly) just normal arithmetic since they can help with port usage.


Yes, Zen 2+ can execute 4 two-operand LEAs per cycle (according to Agner Fog), and apparently Sapphire Rapids can execute 5 LEAs of any 32/64-bit kind per cycle (according to Intel).


Common in program text != common in hot code paths. To estimate the actual impact, you'd have to run the code in a profiler.


This is quite intriguing. In Ardour we have a 63 bit time type that uses the high (MS) bit to indicate whether it is a value in the audio time domain or the music time domain.

It did not occur to me to use the low (LS) bit as the tag. I wonder if the ocaml approach is faster or slower. We're not doing garbage collection, so the cool "pointers are free" aspect isn't relevant.

We're also even slower than the ocaml-style tagging for a different reason: we force atomic loads & stores because our values are by definition read/write from any number of threads.

[ EDIT: if anyone wants to tell me that my coding is crap and I could get a 4x speedup by doing <X>, please do: https://github.com/Ardour/ardour/blob/master/libs/pbd/pbd/in... ]


Pure intuition, but if you don’t care about tagged pointers, I’d guess your approach would at worst break even and might even win some microbenchmarks: Flipping bits via a mask is cheaper transistor-wise than a shift, and a very cursory review of the latest Intel pipeline data from the canonical resource[1] suggests that this is reflected in the degree of macro-op fusion and number of execution units available for shift vs. bitlogic. See page 166 in particular.

[1] https://www.agner.org/optimize/microarchitecture.pdf


Why the alignas 16 for an 8-byte type? In the wasted 8 bytes you could easily fit a type discriminant and a version counter/consistency check to avoid torn reads/writes (assuming you can't rely on atomic 128-bit access).


Another excellent question that I regret not having an answer to. I'll have to investigate both the git history and also check the performance results of dropping that.

EDIT: I think it may be related to this (now fixed) gcc bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65147


Why does the flag use different representation depending on whether the value is negative?


I will try to remember. Vague recollection was that it was necessary to get something obvious to work.


you're making a real good case for better documenting one's code right now


Do you think Paul doesn’t want to comment his code perfectly? Is it your thought that he has unlimited time to create complex free software up to your coding standards?

You’re making a really good case for flagging comments, which I almost never do.


Thanks for the defense, but to be honest I think GP has a point.

Why on earth I would have used 2's complement conventions for what is truly just a boolean/bistate flag is far from obvious, and it ought to be (better) commented.

One thing I would say in my own defense is that one aspect of remote-distributed cooperative development is that the conversations you have with others (e.g. on IRC, or mattermost or even discord if that's your thing) can sometimes feel as if you've just "written the documentation", even though it will generally vanish into the ether. I think that is what happened here. I remember discussing it with others on our IRC dev channel, but I don't keep logs.


You’re a champ and I love you for that, but it’s no excuse for that kind of rudeness. There is literally no easier nor shallower critique in programming than one should’ve commented a little better.

You rock, literally!


No matter how free code is, if it can't prove to me what it does and its design decisions, it may as well not exist at all.


mate, you need to relax fr. It's just a friendly jibe


Oops. Now I feel bad—my apologies.


The tag bit can be inverted.

V8 uses 1 for pointers, 0 for integers - you're usually loading from constant offsets from pointers anyway, so that mostly folds away nicely. Then:

x + y is translated to CPU instructions x + y

x * y is translated to CPU instructions (x >> 1) * y

x / y is translated to CPU instructions (((x >> 1) / (y >> 1)) << 1)

x lsl y is translated to CPU instructions (x << (y >> 1))
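
Roughly, in C (a toy sketch of the 0-tag convention described above, not what V8 literally emits; the names are invented, and arithmetic right shifts on signed values are assumed, as on all mainstream compilers):

  #include <stdint.h>

  /* Tagged small integer with tag bit 0: the payload lives in the upper 63 bits. */
  typedef int64_t smi;

  static smi     smi_box(int64_t v) { return (smi)((uint64_t)v << 1); }
  static int64_t smi_unbox(smi x)   { return x >> 1; }

  /* x + y: both tags are 0, so a plain machine add preserves the tag. */
  static smi smi_add(smi x, smi y)  { return x + y; }

  /* x * y: untag one operand so only one factor of 2 remains in the result. */
  static smi smi_mul(smi x, smi y)  { return (x >> 1) * y; }

  /* x / y: untag both, divide, retag. */
  static smi smi_div(smi x, smi y)  { return (smi)((uint64_t)((x >> 1) / (y >> 1)) << 1); }

  /* x lsl y: the left operand keeps its tag; only the shift count is untagged. */
  static smi smi_shl(smi x, smi y)  { return (smi)((uint64_t)x << (y >> 1)); }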


On x86-64 you are free to manipulate the 16 most significant bits of an address however you like while it sits in a register. The only restriction is that, at the point of dereference, those 16 bits must all be 0 or all be 1; otherwise you get a GPF. One way to tag pointers is to mask off the 15 high bits and zero- or sign-extend based on the MSB prior to the memory dereference (e.g. with a MOVZX/MOVSX-style extension). No need for shifts and other magic. I think in this scenario unsigned numbers can retain their plain arithmetic semantics after masking off a single MSB bit, without a need for shifts etc. Alternatively, when loading/operating on an array, its type can be encoded in the pointer to the array by the above scheme while the data is stored in POD format, again with no extra shifts.
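
One rough way to picture it (a C sketch of high-bit tagging in general, not necessarily the exact MOVZX/MOVSX variant above; helper names invented, two's-complement arithmetic shifts assumed):

  #include <stdint.h>

  /* Keep a 16-bit tag in the top bits of a pointer and rebuild a
     canonical address before any dereference. */
  static uint64_t tag_ptr(const void *p, uint16_t tag) {
      return ((uint64_t)(uintptr_t)p & 0x0000FFFFFFFFFFFFull)
             | ((uint64_t)tag << 48);
  }

  static void *untag_ptr(uint64_t tagged) {
      /* shift the tag out, then sign-extend bit 47 back over the top bits */
      return (void *)(uintptr_t)((int64_t)(tagged << 16) >> 16);
  }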


It's still a trade-off; then all your pointer arithmetic has to be adjusted instead. Which do you do more, int arithmetic or pointer chasing?


It is a trade off, but a lot of processors have free offsets from loads, so pointer chasing is almost always free.

On ARM, "ldr x0, [x1]" just becomes "ldur x0, [x1, #-1]" - same size, same performance (at least on the Apple M1). If you have an array in a structure "add x0, x1, #16 ; ldr x0, [x0, x2, lsl #3]" becomes "add x0, x1, #15 ; ldr x0, [x0, x2, lsl #3]".

The only place I can think of penalties are pointers directly to arrays: "ldr x0, [x0, x1, lsl #3]" becomes "sub x0, x0, #1 ; ldr x0, [x0, x1, lsl #3]". Or loads from large offsets - ldur can only reach 256 bytes, whereas the aligned ldr can reach 32760 bytes. In either case the penalty is only one SUB operation.


By the way, at least for pointer arithmetic, this makes a lot of sense: adding two valid addresses never makes sense, so ptr+i always results in a valid pointer (assuming i is a multiple of sizeof(ptr)).


With pointers having the lowest bit set, you'd check if the LSB is set and then do for example (uint32_t *)((char *)p - 1) + 2, which the compiler would optimize to register + 7 instead of register + 8, i.e. the penalty would be exactly zero for any offset besides 0.
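
For example (toy C sketch, invented field layout), both of these should compile down to a single load, just with the displacement adjusted by one:

  #include <stdint.h>

  /* p carries the tag in its low bit, so the object really starts at p - 1.
     Reading the third uint32_t field: */
  static uint32_t read_field_tagged(void *p) {
      return *((uint32_t *)((char *)p - 1) + 2);   /* one load at [p + 7] */
  }

  /* Untagged equivalent, for comparison: */
  static uint32_t read_field_plain(void *p) {
      return *((uint32_t *)p + 2);                 /* one load at [p + 8] */
  }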


Honestly, it's a shame that the appropriate lower bits aren't automatically masked off by the CPU before the actual load/store.


That's by design. Implicit masking used to be the norm in older architectures, but programs would store data there with impunity and it would make it harder to extend the address space, so AMD64 requires the most significant bits of an address to be either all 1 or all 0.


Yeah, but I'm talking about lower bits, not the upper bits.

Off-tangent: virtual pointers on AMD64 are effectively signed numbers and yet most low-level intro-level programming books keep saying that "pointers are unsigned ints because it doesn't make sense for pointers to be negative, bla-bla". Hell, I believe xv6 book has this passage and the passage about pointers' upper-bits on AMD64 being either all 1 or 0 on the same page!


Ah, ok, but now you can't do subword aligned read/writes. That was a costly mistake that Alpha originally made and most architectures have avoided since.


That depends on the ISA's details, honestly. For example, Knuth's MMIX has 64-bit words and has separate instructions to load/store 8-, 16-, 32-, or 64-bit wide chunks of data to/from memory — which are forced to be naturally aligned because 0, 1, 2, or 3 lower bits of the address are masked, depending on the data access's width. So there you go, you have aligned access for all "natural" subword sizes.


Don't most contemporary high performance processors support unaligned memory operations? An eight byte load with non-zero vaddr[2:0] from an all zero one


Ah, I totally thought this was about treating unsigned integers as 63 bits (which would use masked operations so that the top bit is always zero even on overflow), while signed integers would use the full 64 bits with the MSB as sign bit. That way converting from unsigned to signed would always be 'safe' without requiring a cast.

I sometimes use "width-1" integer types in Zig because Zig doesn't allow assigning an unsigned integer to a signed integer of the same width (e.g. assigning a u8 to an i8 is illegal without a cast, but assigning a u7 to an i8 is fine).


I'm still not sure I follow what the gain is here? I get that it makes it possible to differentiate between an int and a pointer, but why is that actually important? The compiler already has type information, right? And would know whether a value is a pointer or an int?


An immediate reason is that some variants can be either ints or boxed. eg:

  type t = One             (* represented as int 0 (0b01) *)
         | Two             (* represented as int 1 (0b11) *)
         | Pair of t * t   (* boxed (ie. pointer) *)
The representation used by OCaml is quite subtle. There's a distinction between the very rich types known at compile time which are almost all erased, and the little bit of information that is needed at runtime mainly by the garbage collector.

Here are some links to how it all works:

https://dev.realworldocaml.org/runtime-memory-layout.html https://rwmj.wordpress.com/2009/08/04/ocaml-internals/ https://web.archive.org/web/20210412024831/http://caml.inria...


Ah yes, I wasn't thinking about the gc. That makes more sense now. Thanks!


The type of a variable can be "x or y or z", so no, the compiler won't always know that.


That's a union type, right? I would expect it to have a separate discriminant field if there are no available bits in the object representation itself.


The catch there is "separate" - if the object representation is one machine word, you really do not want to have to increase that by a factor of two to store one additional bit of information, and implementors will generally try hard to avoid that if at all possible.


Looks like it’s an optimization to avoid boxing integers that can be stored inline. So two runtime representations for the same type.


This episode of the Jane Street podcast (Signals & Threads) discusses a recent optimization to the GC (pointer prefetching) that's had impressive results. Part of why it has worked so well is the pointer tagging/63-bit integers.

Great episode, I highly recommend it to all interested in language runtime design.

https://signals-threads.simplecast.com/episodes/memory-manag...


NewtonScript had 31-bit integers, rather than 32-bit, for the exact same reason. It was a complete mess. Famously the Newton had its own Y2K bug (the 2010 bug) because dates could only be expressed with 31 bits.

If a language needs to rely on pointer tags for GC, avoid that language.


I don’t see how one library for one language having some undesirable feature implies “If a language needs to rely on pointer tags for GC, avoid that language.”, especially not since using tagged pointers does not necessarily imply limiting the range of integers.

An implementation can choose to implement ‘small’ objects as tagged pointers, and use a pointer to an ‘integer’ object for objects it cannot ’hide’ inside a pointer.


Everyone has already realised that 32-ish bit values are too little to represent time with either sufficient range or precision, I wouldn't blame the issue on 31-bit integers specifically (that one bit hardly buys you much time), but on the short-sightedness of whoever chose to use them for that purpose.

A 63-bit integer would give you 145 years of range at nanosecond resolution; if you chose millisecond precision instead, you could accurately timestamp the death of the last mammoth.


This was just one example of where 31-bit ints bit you in NewtonScript. But there were many more, especially when dealing with interoperability with the OS, which was in C++.


Lisp?


Well, it's not part of the language, but some implementations yeah... :-(


I would think that many/most Lisp implementations implement fixnum integers with tags. But many also have automatic overflow into bignums, like in Common Lisp.


I wonder why they set the LSB to indicate an integer and clear it for other types. If they did it the other way, integer operations could be faster as the +1/-1 would not be necessary any longer.


But then a pointer would not be a pointer; the bottom bit would have to be masked before each use. Or I guess you could dealign everything, but that has its own problems.


Most current CPUs support word loads with byte offsets, so instead of masking the implementation could just load with an odd byte offset. Probably you'd want offset 255 (or whatever matches the maximum negative offset that can be encoded directly in a load instruction) because it enables direct access to more fields.


What is also interesting is that negative (-4 or -8) offsets are already used. The pointer actually points to the first byte of the data, but there is a header located before the data which tells you how long the data is and some other runtime info used by the GC.

  +------------+------------+------------+------------+ - - -
  | header     | data[0]    | data[1]    | data[2]    |
  +------------+------------+------------+------------+ - - -
                ^
                pointer
(The real data part can be all kinds of things including stuff opaque to the GC like strings).

I'm still a bit unconvinced that swapping the sense of pointers vs ints is useful for OCaml, as it would be a massive change with no current evidence base. It seems better to me to work on better compiler optimizations so that integers can be kept in registers with their true value more often.


I was about to write that this needs an extra decoding step because an addressing mode that uses a base register, a scaled index register, and a byte offset is a very CISCy thing. But of course with 0-based integers, it's no longer necessary to decode the index, so it's probably a wash.


They explain it in the article: it is to differentiate between an int stored directly and a pointer. Pointers are aligned, so they cannot have their LSB set to 1.


Also less well known (even amongst OCaml programmers) is that the "10" bottom-bits combination is used to indicate a result containing an exception. So all pointers must be aligned to at least 4 bytes, which is of course not a problem since 16-bit arches are not supported.


One can re-align when accessing; this is how SBCL handles tagging, e.g. the first slot would be accessed by MOV RAX, [RBX - 1].


Lisps have had this for decades. The realization that one could do this was one of the things that killed lisp machines, since it enabled lisps to be implemented efficiently on stock microprocessors, which were much cheaper due to economies of scale.

In SBCL, this also allows cons cells to have their own offset, so car and cdr are implemented with one instruction and are safe (that is, throw an exception if applied to something they don't work on) even in code compiled at (safety 0).

Now just waiting for parallel GC in SBCL. ;) Well, also waiting for some other things, I will admit.


> so car and cdr are implemented with one instruction and are safe (that is, throw an exception if applied to something they don't work on) even in code compiled at (safety 0).

It is? By default x86 will happily do unaligned accesses, and I'm pretty sure SBCL doesn't change that. Not sure about other instruction sets.

SBCL also uses double-word alignment and 4-bit tags, so you could CDR (expecting lowtag #b0111) a structure object (lowtag #b1111).


Um, you're right. I have no idea why I thought that was the case.


In your defense, I recall Self did use unaligned traps to deal with dispatching on heap object/immediate, when immediates didn't usually appear at a call site.


That is the way the FLINT fmpz type does it. The benefit is that you can check for pointers once and then use the operands like ordinary ints without the +1/-1 adjustments. For example, to multiply two integer matrices, you can do O(n^2) pointer checks and then do O(n^3) arithmetic operations without overhead. But if the pointer check is done each operation then the representation doesn't make much difference; whether you use the LSB or MSB as a tag, what kills performance is checking and branching based on a tag bit in the first place. What you'd want is a sufficiently smart compiler (TM) that can eliminate as many such checks as possible.
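
Something along these lines (a hand-wavy C sketch of the "check once, then use raw machine ints" idea; this is not FLINT's actual fmpz API or encoding):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Invented encoding: a slot is a small integer if its tag bit is clear
     (value stored shifted left by one), otherwise a pointer to a bignum. */
  static bool all_small(const int64_t *v, size_t n) {
      for (size_t i = 0; i < n; i++)
          if (v[i] & 1) return false;
      return true;
  }

  /* n x n integer matrix multiply: O(n^2) tag checks up front, then the
     O(n^3) inner work is plain machine arithmetic on untagged values.
     Returns false so the caller can take a (not shown) generic slow path. */
  static bool matmul_small(int64_t *c, const int64_t *a, const int64_t *b, size_t n) {
      if (!all_small(a, n * n) || !all_small(b, n * n))
          return false;
      for (size_t i = 0; i < n; i++)
          for (size_t j = 0; j < n; j++) {
              int64_t acc = 0;
              for (size_t k = 0; k < n; k++)
                  acc += (a[i * n + k] >> 1) * (b[k * n + j] >> 1);
              c[i * n + j] = (int64_t)((uint64_t)acc << 1);   /* retag (overflow ignored) */
          }
      return true;
  }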


I wonder how much arithmetic is done with integers that can't fit in a 32-bit int?

Might it be better to have a tagged 64-bit type where, if the type is an int, one half of the value is a 32-bit int? You could then do arithmetic with 32-bit values just with bit-masking, or by using instructions that operate on half of a register, rather than having to shift. If the result of a calculation exceeds 32-bits (checked with carry flag) then you can box that into a 64-bit (or bignum?) int instead?
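
Sketching it in C (assumptions: an invented layout with the int32 payload in the low half, GCC/Clang's __builtin_add_overflow standing in for a hand-rolled carry/overflow-flag check, and box_big as a hypothetical heap-boxing fallback):

  #include <stdbool.h>
  #include <stdint.h>

  /* Invented layout: the low 32 bits hold an int32 payload and the top bit
     says "small int" (anything else would be a pointer to a boxed value). */
  #define SMALL_INT_TAG (1ull << 63)

  extern uint64_t box_big(int64_t v);   /* hypothetical fallback to a boxed/bignum int */

  static uint64_t box_small(int32_t v)    { return SMALL_INT_TAG | (uint32_t)v; }
  static int32_t  unbox_small(uint64_t x) { return (int32_t)(uint32_t)x; }

  /* Add two small ints with 32-bit arithmetic; if the result overflows
     (what the comment above would detect via the carry flag), box it. */
  static uint64_t add_values(uint64_t x, uint64_t y) {
      int32_t a = unbox_small(x), b = unbox_small(y), r;
      if (!__builtin_add_overflow(a, b, &r))
          return box_small(r);
      return box_big((int64_t)a + (int64_t)b);
  }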


This is interesting, but I feel like it only scratches the surface. What about all the types that _aren’t_ an int or a pointer?


Those are the only ways OCaml represents types in memory (besides float, as briefly mentioned in the article)

If you have a record type, or anything else, that'll compile to a pointer and an object in memory, comparable to e.g. Python or Java objects.

Types like booleans are converted to integers by the compiler, similarly to C(++) (though there's no implicit conversion on the language level).


It's a bit more complex than this, but roughly each block must contain either all pointers/ints or no pointers. There's a tag at the beginning of each block indicating what type of block it is (and some other stuff - the GC stores some state in the tag as well). If you have, for example, a record consisting of a pointer, an int, and a float, it'll be tagged as having all pointers/ints, and the float will actually be a pointer to a block tagged as data, that contains the float.

This is the "float boxing" problem - floats cost an extra pointer chase in most cases. The 63-bit int is an optimization that avoids needing the extra indirection for ints. You can still have 64-bit ints, but they aren't the default, and they cost this extra pointer chase.


You can think of the bottom bit here as a one-bit tag. Various lisps will devote additional tag bits to allow for multiple “immediate” types like characters and floats. Every additional tag bit cuts into your primitive integer space, so it’s a tradeoff.


The extra steps required for arithmetic operations like "x / y is translated to CPU instructions (((x >> 1) / (y >> 1)) << 1) + 1" are the type of thing where supercompilation can be very effective.

You have to wonder what fraction of the lost efficiency can be clawed back through such techniques…


How do garbage collected languages like F# or Dlang deal with this? Is everything just boxed?


In modern JS engines on a 64-bit CPU, when the engine cannot deduce types and must use a generic word to represent any kind of value, numbers (double values) are not boxed. Rather, a NaN bit pattern is used to tag everything else: the code checks whether the word matches the NaN pattern; if not, a double is assumed; otherwise the NaN is checked to see whether it is a real NaN or something else masked as a NaN.

This slows down object access, as detecting and stripping the NaN tag requires a few CPU instructions. Plus it assumes that pointers only have 48 bits with the rest being zeros (true for AMD, ARM and Intel) or at least having fixed values (which can be arranged on more exotic CPUs). But it does not require boxing numbers, which greatly reduces GC pressure.
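
The check looks roughly like this (a simplified sketch of one possible NaN-boxing layout; real engines differ in the exact tag bits and in how they canonicalize hardware-produced NaNs):

  #include <stdint.h>
  #include <string.h>

  typedef uint64_t value;

  /* Non-double values live in the (signed) quiet-NaN space: BOX_BASE bits,
     then a small type tag, then a 48-bit payload.  Real NaNs produced by
     arithmetic are canonicalized so they never collide with these tags. */
  #define BOX_BASE   0xfff8000000000000ull
  #define TAG_SHIFT  48
  #define PAYLOAD    0x0000ffffffffffffull

  enum tag { TAG_PTR = 1, TAG_INT = 2 };   /* invented tag assignment */

  static int is_double(value v)      { return (v & BOX_BASE) != BOX_BASE; }

  static value box_double(double d)  { value v; memcpy(&v, &d, 8); return v; }
  static double unbox_double(value v){ double d; memcpy(&d, &v, 8); return d; }

  static value box_ptr(void *p)      { return BOX_BASE | ((value)TAG_PTR << TAG_SHIFT)
                                              | ((uintptr_t)p & PAYLOAD); }
  static void *unbox_ptr(value v)    { return (void *)(uintptr_t)(v & PAYLOAD); }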


> Plus it assumes that pointers only have 48 bits with the rest being zeros (true for AMD, ARM and Intel)

Except when it isn't: https://en.wikipedia.org/wiki/Intel_5-level_paging

That's not something you're likely to run into on consumer hardware, but with JS being used on the server I wonder if/when JS engines will need to deal with that.


That's probably quite some time off. As an application, you'd only get the new level of paging if you asked for it, like with PAE. And you're only going to ask for it if you need more than 256TB of address space, which seems like a rather large space to need.

I guess you could have a lot of files mmaped though?


ARM also has pointer authentication, which may one day become a blocker.


When a pointer needs more bits than the 52 NaN payload bits allow, the trick is to replace pointers with indexes from the start of the JS heap. This is not efficient even with an arrangement like having the heap aligned on, say, a 4GB or even coarser boundary, so that the full pointer can be recovered with a single bit operation rather than an add. But if one wants efficiency, then make sure the types in the code are stable and the JIT will generate type-specific code.



This describes a 32-bit CPU. As opposed to V8, SpiderMonkey, Mozilla's JS engine, uses 64-bit words and NaN boxing, even on a 32-bit CPU, to represent a generic JS thing.


I'm pretty sure even on 64bit v8 doesn't use NaN boxing. Do you have docs or code that show it does?

Yes other engines use NaN boxing, but not all engines do.


No, mixing boxed and unboxed types is very typical for garbage collected languages. As the article shows, you can distinguish with a bit whether something is a boxed type or not, so you do that during GC. For many GCs, you typically have another bit for marking (whether you need it depends on the GC technique -- copying GC may not need it). With modern 64-bit CPUs, usually the three lowest bits are free (because every struct will be 8-byte aligned and thus pointers end in 0b000), which you can use for tagging. This will give you 61-bit ints, which work just as described in the article.
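
Concretely, something like this (a minimal sketch with an invented tag assignment):

  #include <stdint.h>

  /* With 8-byte-aligned allocations, pointers always end in 0b000, so the
     three low bits can carry a type tag; integers get tag 0b001 and the
     payload lives in the upper 61 bits. */
  #define TAG_MASK  0x7ull
  #define TAG_PTR   0x0ull
  #define TAG_INT   0x1ull

  typedef uint64_t value;

  static int     is_ptr(value v)    { return (v & TAG_MASK) == TAG_PTR; }
  static value   box_int(int64_t i) { return ((uint64_t)i << 3) | TAG_INT; }
  static int64_t unbox_int(value v) { return (int64_t)v >> 3; }
  static void   *unbox_ptr(value v) { return (void *)(uintptr_t)v; }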


> No, mixing boxed and unboxed types is very typical for garbage collected languages. As the article shows, you can distinguish with a bit whether something is a boxed type or not, so you do that during GC.

Both languages I mentioned use regular 8/16/32/64 bit size values, which is why I asked.


In F# (.NET) primitives aren't boxed unless they are stored in a variable/field of type System.Object. Generic methods are specialized by the JIT compiler when their generic arguments are primitive types.


Upd: I noticed that someone mentioned upthread that OCaml uses this to fit sum types that are either an int or a pointer into a single 64-bit value.

F# doesn't do that and always stores a separate tag field for sum types. This is marginally less efficient, but doesn't make interop with 99% of languages and data formats in existence awkward.


In my experience with Dlang I can't remember seeing 63 bit ints or boxing. In either the language or the standard lib.


Didn't Ocaml recently completely overhaul their runtime system? Is this still relevant?


The representation of types is still the same (except in a few rare corner cases, notably pointers outside the OCaml heap). The thing that changed was the introduction of domains, groups of threads which can now run in parallel.


Ok, I was under the impression that (to support true parallelism) they changed their garbage collector which could have affected the way they discern pointers and integers. I can understand if they kept int the same for compatibility reasons, but perhaps they now have a 64 bit unboxed type under a different name?


There's a true 64 bit boxed int (int64) and arrays of unboxed 64 bit ints (bigarray), same as before. Pointers vs integers are still distinguished using the bottom bit.


What is also lost: Efficient 64-bit FP arithmetic, because your boxing format can't support 64-bit floats as values. You either stick with 32-bit floats or have additional dereferencing for 64-bit floats.

This is why I prefer the NaN-boxing approach, where you can hold 64-bit float values and 48-bit pointers and integers as values in a 64-bit register. 48-bits is sufficient for pointers because few CPUs support greater than 48-bit addressing.

When you need full 64-bit integers, chances are you need more than one of them. Have one of your NaN-box tags indicate that a memory address is a vector of 64-bit integers and load them into a vector register.


OCaml's floats are unboxed 64-bit double-precision IEEE-754 floating point numbers. So there.


According to https://v2.ocaml.org/manual/intfc.html#s%3Ac-value, a 64-bit word is either an integer or pointer. Doubles are heap allocated with pointers represented by a Double_tag or Double_array_tag.

It should be obvious. A 64-bit word can't represent 63-bit integers, 63-bit pointers and 64-bit floats without overlapping/ambiguous representation.

You can represent 64-bit floats, 48-bit pointers and 48-bit integers (signed and unsigned) in a 64-bit word without any overlap though, and you can avoid the pointer dereference for FP operations, instead only needing to perform a check for NaN. Other types need unboxing which can be done in ~2 cycles, without any dereferencing.


    As an optimization, records whose fields all have static type float are represented as arrays of floating-point numbers, with tag Double_array_tag. (See the section below on arrays.)

    [...]

    Arrays of floating-point numbers (type float array) have a special, unboxed, more efficient representation. These arrays are represented by pointers to blocks with tag Double_array_tag.


So they're unboxed in certain situations, but otherwise boxed.

With NaN-boxing you would do this for Int64s. They would need boxing when you need a single value, but you can have one or more tags for vectors of them, in which case the elements can be stored unboxed.

My argument is you would probably use FP64 more frequently than Int64s. For most common operations involving int, Int48 would suffice. In the cases where you need full Int64 (cryptography, serialization, etc), you commonly need vectors of Int64 anyway.


Some comments seem to be saying JIT JS engines still need this kind of thing?

So why do CPUs not have special instructions to accelerate common OOP GC language operations?

You could have math operations that did the tagged ints in a single cycle.

On the compiler side, in languages that need to support closures anyway, could they just give the data structure for a closure an extra scratchpad for numeric values that are raw and untagged with no extra stuff required? You know they're not a pointer because of where they are, and the GC doesn't need to look at them at all, they go away when the function does.


> So why do CPUs not have special instructions to accelerate common OOP GC language operations?

ARM kind of tried with Jazelle and ThumbEE and it turns out not to be that useful. It's much more economical to make the CPU as a whole faster rather than burning up precious die space for hardware that's only used some of the time, and only has marginal benefits.

There's also the human factor that if you actually want a JIT/interpreter to use your special sauce you will probably need to patch that runtime yourself, because it's not likely that the developers are just going to rip up their code generator unless it's really simple to do or presents massive gains.

> Could they just give the data structure for a closure an extra scratchpad for numeric values that are raw and untagged with no extra stuff required?

This is called "unboxing" and pretty much every optimizing runtime will do it in some way. But saving in memory and looking it up later is a lot more expensive than arithmetic in general.



What a weird post.

Lea became very slow in recent Intel cpus? News to me. Hard to imagine them shipping such a cpu considering x86 compilers pattern match to Lea hella aggressively.

Weird to show the more complex arithmetic codegen as a downside of not allocating int boxes. Having to shift or add (or both) a lot is way cheaper than asking the GC for memory or dereferencing a pointer to a box.


> Lea became very slow in recent Intel cpus? News to me.

Article is from 2014. The "recent" CPUs they are discussing is the Sandy Bridge architecture launched in 2011.

See this discussion about the performance impact of LEA in Sandy Bridge: https://github.com/ocaml/ocaml/issues/6125#issuecomment-4729...


Why does Ocaml need to tag types? (as opposed to statically knowing the type of everything)

Do they not monomorphize generics?


The GC scans the heap and stack and it would be hard for the GC to scan the heap while maintaining perfect knowledge about all the types.

Go has this problem too, but they choose to use a "conservative collector", which just assumes that if a word contains something that happens to be a valid location in memory, that location is considered alive even if it's actually garbage.

The main drawback is that if you're not sure whether something is really a pointer, you cannot relocate objects and compact the heap. Ocaml has a generational collector with a nursery implemented as a semi-space collector. A semi-space collector performs a collection by moving live objects and leaving the garbage behind. This is very efficient because young objects are less likely to survive, and thus the amount of work is proportional to the number of surviving objects and not the garbage.


Go's GC has been precise since 1.4 released about 13 years ago.


Time flies


It is still not clear to me why it is necessary to store the type information inline. I'm absolutely not an expert, but I thought that some GC'd languages had stack layout metadata[1] to figure out the type of each stack slot. Similarly, heap-allocated data should have associated type layout info.

[1] possibly static and indexed by the instruction pointer, not unlike DWARF unwind info.


It's not necessary. Indeed there are many GCs that use type information (as a sibling comment says, Go's GC is now precise). But it's more complicated, so it's a tradeoff. You may pay a small cost in dealing with tagged arithmetic, but you gain a simpler GC codebase, perhaps one that can work with both the AOT and bytecode compilers (just speculating).


No, OCaml compiles polymorphic functions once by using a universal representation of values, which tends to play better with a type system with recursively polymorphic functions, higher-rank polymorphism or existential types (which all imply that concrete types are not always known statically in OCaml).


I wonder what would be the performance characteristics of using the top 2 bits to mark 63-bit integers. So 00xxx are positive integers and 11xxx are negative integers. That way all integer operations stay simple and you just need to check that the result is in the correct range.
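
A quick sketch of what the check could look like (my own illustration of the idea, not a proposal from the article):

  #include <stdbool.h>
  #include <stdint.h>

  /* "Top two bits" scheme: a word is a small integer iff bits 62 and 63
     agree (00... positive, 11... negative), i.e. the value fits in 63 bits;
     the remaining bit patterns would be left for pointers. */
  static bool fits_small_int(int64_t v) {
      uint64_t top = (uint64_t)v >> 62;
      return top == 0 || top == 3;
  }

  /* Addition stays a plain machine add; only the result needs a range
     check (the inputs are already in 63-bit range, so the add can't wrap). */
  static int64_t add_small(int64_t x, int64_t y, bool *needs_boxing) {
      int64_t r = x + y;
      *needs_boxing = !fits_small_int(r);
      return r;
  }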


The real problem with boxing is that you end up with two classes of types, and that there is inherent overhead to using a class rather than a primitive.

For those reasons, I stick to C++. With modern ABIs there is no difference between a primitive of size N and a class of size N.


In some languages e.g. C# you can have unboxed product types. Confusingly for anyone with a C/C++ background, they’re called “Strucks”. Of course, you lose the GC when you do so but it’s nonetheless common to see in the stdlib.


Ocaml doesn't have those. Of course F# does.


Auto-correct fail: “structs”.


Why not right-shift all the pointers and set the int's top bit to one instead? That way arithmetic stays mostly the same and to dereference pointers you just left-shift before.


No need to shift pointers right if you promise not to use the higher addresses of the 64-bit pointer space (which should be sufficient).


What is the performance hit on ARM processors?


From an ISA standpoint there aren't any issues: ARMv8 has all the instructions you need, as well as some nice bit-manipulation instructions. You can do all sorts of fun single-instruction type checks with this too, since it has test-bit-and-branch instructions (tbz, tbnz) as well as bit clear+set flags to do weird branchless type checks.


This is why Python is so slow compared with Lua, which also does this with ints.


In fairness, the developers of CPython don't prioritize making it fast very much. Pypy has comparable performance to LuaJIT though. Do you happen to know how Pypy represents ints internally?


No idea, but packing an int in a single 32/64 bit value without indirections is how V8 JS does it as well, hence the speed. I'm going to guess that pypy does it too.


Aren't python integers basically represented as an array of 32bit integers? What does this have to do with 63 bit integers?


Python integers are represented like any other data type: a pointer to a PyObject. In the specific case of small integers, they are recycled from a pool of known instances, so we save the malloc() call for those. But the penalty still exists for indirections, cache thrashing, CPU prediction, and so on...

The connection with 63-bit integers is that, as in Lua and V8 JS, they are much faster because no malloc() or pointer indirection is required to do 1234+5678. It's just a few CPU instructions. In Python, it's more like 100.


Got it, thanks!


use

    struct value {
      enum { T_INT, T_DOUBLE, T_LONG, T_PTR } type;   /* tag */
      union { int i; double d; long l; void *p; } u;  /* payload */
    };
and you don't need to box anything. No extra allocation for primary data types, no dereferencing of pointers unless the data is larger than the union size. You can pass this struct from function to function by value, just as if it were an int64 or double.


You do, however, double the size of your data type. A struct { enum; u64 } is going to be 128 bits vs struct { is_int: u1; int_value: u63; } at 64 bits. This can make a huge difference in performance-critical code, as you double the rate of cache misses.


Not to mention that now you need 2 registers instead of 1 to hold it, so now stack spilling happens in 80% of functions instead of 5% (numbers invented, but really: 5 live values taking 5 registers vs 10 is a huge difference). And don't forget about argument passing: argument-passing registers almost definitely have to be spilled now, and there are almost never enough of those, so now arguments too have to be spilled on the stack... this is a horrible way to represent data.



