netr0ute · 2025-08-31T22:19:20 1756678760

> G++ picks space over time

By definition, that's zero-overhead because Ultrassembler doesn't care about space.

aidenn0 · 2025-08-31T22:28:07 1756679287

Okay, then a traditional setjmp/longjmp implementation is zero-overhead because I don't care about space or time!

netr0ute · 2025-08-31T20:02:09 1756670529

I thought about hashing, but found that hashing would be enormously slow to compute compared to a perfectly crafted tree.

dafelst · 2025-08-31T20:38:21 1756672701

But did you think about using a perfect hash function and table? Based on my prior research, it seems like they are almost universally faster on small strings than trees and tries due to lower cache miss rates.

dist1ll · 2025-08-31T21:32:49 1756675969

Ditto. Perfect hashing strings smaller than 8 bytes has been the fastest lookup method in my experience.
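To make the perfect-hashing suggestion concrete, here is a minimal sketch for names of at most 8 bytes. It is not taken from Ultrassembler: the mnemonic list, `pack8`, and `PerfectTable` are hypothetical names, and the multiplier is found by brute force at startup rather than by a dedicated tool such as gperf.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>

// Hypothetical mnemonic set, chosen only for illustration.
constexpr std::array<std::string_view, 8> kMnemonics = {
    "add", "addi", "sub", "lui", "jal", "jalr", "beq", "bne"};

// Pack a name of at most 8 bytes into one zero-padded 64-bit word.
inline std::uint64_t pack8(std::string_view s) {
    std::uint64_t w = 0;
    if (!s.empty()) std::memcpy(&w, s.data(), s.size() < 8 ? s.size() : 8);
    return w;
}

class PerfectTable {
public:
    PerfectTable() {
        // Try pseudo-random odd multipliers until no two keys share a slot.
        // With 8 keys and 16 slots this is expected to take only a few tries.
        std::uint64_t seed = 0x9E3779B97F4A7C15ull;
        while (!try_build(seed | 1)) {
            seed = seed * 6364136223846793005ull + 1442695040888963407ull;
        }
    }

    // Returns the position of s in kMnemonics, or nullopt if it is not a key.
    std::optional<std::size_t> find(std::string_view s) const {
        if (s.empty() || s.size() > 8) return std::nullopt;
        const std::uint64_t key = pack8(s);
        const std::size_t slot = slot_of(key, multiplier_);
        if (keys_[slot] != key) return std::nullopt;  // one compare, no chain
        return index_[slot];
    }

private:
    // Multiplicative hash: the top 4 bits of the product select 1 of 16 slots.
    static std::size_t slot_of(std::uint64_t key, std::uint64_t mult) {
        return static_cast<std::size_t>((key * mult) >> 60);
    }

    bool try_build(std::uint64_t mult) {
        keys_.fill(0);  // 0 marks an empty slot; no real key packs to 0
        for (std::size_t i = 0; i < kMnemonics.size(); ++i) {
            const std::uint64_t key = pack8(kMnemonics[i]);
            const std::size_t slot = slot_of(key, mult);
            if (keys_[slot] != 0) return false;  // collision, try another
            keys_[slot] = key;
            index_[slot] = i;
        }
        multiplier_ = mult;
        return true;
    }

    std::uint64_t multiplier_ = 0;
    std::array<std::uint64_t, 16> keys_{};
    std::array<std::size_t, 16> index_{};
};
```

A lookup is one multiply, one shift, and one 64-bit compare, regardless of which key is queried. Whether that beats a hand-crafted tree on real assembler input would have to be measured.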

netr0ute · 2025-08-31T21:35:55 1756676155

Problem is, there are a lot of RISC-V instructions way longer than that (like th.vslide1down.vx) so hashing is going to be slow.

ashdnazg · 2025-08-31T22:23:09 1756678989

You could copy the instruction to a 16 byte sized buffer and hash the one/two int64s. Looking at the code sample in the article, there wasn't a single instruction longer than 5 characters, and I suspect that in general instructions with short names are more common than those with long names.

This last fact might actually support the current model, as it grows linearly-ish in the size of the instruction, instead of being constant like hash.
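A rough sketch of the 16-byte-buffer idea, under assumptions: `Key16`, `load16`, and `hash16` are hypothetical names, and the mixing constants are arbitrary odd numbers rather than tuned values. Note that th.vslide1down.vx is 17 characters, so names like it would still need a fallback path.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

struct Key16 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// True if the whole name fits in the 16-byte buffer without truncation.
inline bool fits16(std::string_view s) { return s.size() <= 16; }

// Copy up to 16 bytes into a zeroed buffer and read it back as two words.
inline Key16 load16(std::string_view s) {
    unsigned char buf[16] = {};
    const std::size_t n = s.size() < 16 ? s.size() : 16;
    if (n != 0) std::memcpy(buf, s.data(), n);
    Key16 k;
    std::memcpy(&k.lo, buf, 8);
    std::memcpy(&k.hi, buf + 8, 8);
    return k;
}

// Fixed-cost mix of the two words; the work does not depend on name length.
inline std::uint64_t hash16(Key16 k) {
    std::uint64_t h = k.lo * 0x9E3779B97F4A7C15ull;
    h ^= h >> 32;
    h += k.hi * 0xC2B2AE3D27D4EB4Full;
    h ^= h >> 29;
    return h;
}
```

For names of 8 bytes or fewer, `hi` is always zero, so only one word carries information. That is the short-name case the comment expects to dominate.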

snvzz · 2025-09-01T00:20:58 1756686058

Note th.vslide1down.vx is a T-Head instruction, a vendor custom extension.

It is not part of RISC-V, nor supported by any CPUs outside of that vendor's own.

Lerc · 2025-09-01T01:31:37 1756690297

Is there a handy list of all RISC-V instructions?

netr0ute · 2025-08-31T18:49:19 1756666159

Hi everyone, I'm the author of this article.

Feel free to ask me any questions to break the radio silence!

benreesman · 2025-08-31T19:03:05 1756666985

Nice work and good writeup. I think most of that is very sound practice.

The codegen switch with the offsets is in everything, first time I saw it was in the Rhino JS bytecode compiler in maybe 2006, written it a dozen times since. Still clever you worked it out from first principles.

There are some modern C++ libraries that do frightening things with SIMD that might give your bytestring stuff a lift on modern stupid-wide high mispredict penalty stuff. Anything by lemire, stringzilla, take a look at zpp_bits for inspiration about theoretical minimum data structure pack/unpack.

But I think you got damn close to what can be done, niiicccee work.

Sesse__ · 2025-08-31T21:53:05 1756677185

FWIW, this is basically an implementation of perfect hashing, and there's a myriad of different strategies. Sometimes “switch on length + well-chosen characters” is good, sometimes you can do better (e.g. just looking up in a table instead of a long if chain).

The “value speculation” thing looks completely weird to me, especially with the “volatile” that doesn't do anything at all (volatile is generally a pointer qualifier in C++). If it works, I'm not really convinced it works for the reason the author thinks it works (especially since it refers to an article talking about a CPU from the relative stone age).
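For readers unfamiliar with the “switch on length + well-chosen characters” strategy, a minimal sketch follows. The `Op` enum and the six mnemonics are illustrative assumptions, not the article's generated code.

```cpp
#include <string_view>

enum class Op { Add, Sub, Lui, Jal, Addi, Jalr, Unknown };

// First switch on the length, then on one character that separates the
// remaining candidates, then confirm with a single full comparison.
inline Op classify(std::string_view s) {
    switch (s.size()) {
    case 3:
        switch (s[0]) {
        case 'a': return s == "add" ? Op::Add : Op::Unknown;
        case 's': return s == "sub" ? Op::Sub : Op::Unknown;
        case 'l': return s == "lui" ? Op::Lui : Op::Unknown;
        case 'j': return s == "jal" ? Op::Jal : Op::Unknown;
        default:  return Op::Unknown;
        }
    case 4:
        switch (s[0]) {
        case 'a': return s == "addi" ? Op::Addi : Op::Unknown;
        case 'j': return s == "jalr" ? Op::Jalr : Op::Unknown;
        default:  return Op::Unknown;
        }
    default:
        return Op::Unknown;
    }
}
```

The length and the selected character together act as the hash. The final comparison is what rejects strings that are not in the set.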

inetknght · 2025-08-31T19:03:02 1756666982

Overall, this is a fantastic dive into some of RISC-V's architecture and how to use it. But I do have some comments:

> However, in Chata's case, it needs to access a RISC-V assembler from within its C++ code. The alternative is to use some ugly C function like system() to run external software as if it were a human or script running a command in a terminal.

Have you tried LLVM's C++ API [0]?

To be fair, I do think there's merit in writing your own assembler with your own API. But you don't necessarily have to.

I'm not likely to go back to assembly unless my employer needs that extra level of optimization. But if/when I do, and the target platform is RISC-V, then I'll definitely consider Ultrassembler.

> It's not clear when exactly exceptions are slow. I had to do some research here.

There are plenty of cppcon presentations [1] about exceptions, performance, caveats, blah blah. There are also other C++ conferences that have similar presentations (or even almost identical presentations, because the presenters go to multiple conferences), though I don't have a link handy because I pretty much only attend cppcon.

[0]: https://stackoverflow.com/questions/10675661/what-exactly-is...

[1]: https://www.youtube.com/results?search_query=cppcon+exceptio...

netr0ute · 2025-08-31T19:09:33 1756667373

> LLVM's C++ API

I think I read something about this but couldn't figure out how to use it because the documentation is horrible. So, I found it easier to implement my own, and as it turns out, there are a few HORRIBLE bugs in the LLVM assembler (from cross reference testing) probably because nobody is using the C++ API.

> There are plenty of cppcon presentations [1] about exceptions, performance, caveats, blah blah.

I don't have enough time to watch these kinds of presentations.

mpyne · 2025-09-01T01:14:06 1756689246

A specific presentation I'd point to is Khalil Estell's presentation on reducing exception code size on embedded platforms at https://www.youtube.com/watch?v=bY2FlayomlE

But honestly you'd get the vast majority of the benefit just by skimming through the slides at https://github.com/CppCon/CppCon2024/blob/main/Presentations...

With a couple of symbols you define yourself a lot of the associated g++ code size is sharply reduced while still allowing exceptions to work. (Slide 60 on)

0x98 · 2025-08-31T23:41:58 1756683718

> I think I read something about this but couldn't figure out how to use it because the documentation is horrible.

Fair enough.

> So, I found it easier to implement my own, and as it turns out, there are a few HORRIBLE bugs in the LLVM assembler (from cross reference testing)

Interesting claim, do you have any examples?

inetknght · 2025-09-01T16:58:33 1756745913

> I don't have enough time to watch these kinds of presentations.

Then let me pick and share some of my favorites that I found enlightening, and summarize with some information that I found useful.

By far, the most useful one is Khalil Estell's presentation last year [0]. It's a fairly fast-paced but relatively deep dive into exception mechanics. At the end, he advocates for a new tool that would audit a program to determine what exceptions could be thrown. I think that's a flipping fantastic idea for a tool. Unfortunately I haven't seen any progress toward it -- if someone here knows where his tool is, or a similar tool, please reply! I did send him an email a few months ago inquiring about it, but haven't received a reply. Nonetheless, the whole presentation was excellent in my opinion. I did see that he had another related presentation at ACCU this year [4] with a topic of "C++ Exceptions are Code Compression" (which I totally can believe -- I've seen it myself in binary sizes), but I haven't seen his presentation yet. I'll watch it later today.

Just about anything from Herb Sutter is good. I don't like that he works for Microsoft, but he does great stuff for C++, including the old Guru of the Week series [1]. In particular, his 2019 presentation [2] describes different error handling techniques, some difficulties and pitfalls in combining libraries with different error handling techniques, and leads up to explaining why std::expected came about. He does pontificate a lot though, so the presentation is fairly high level and slow paced.

Dave Watson's 2017 presentation [3] dives into a few different implementations of stack unwinding. It's good to understand how different compilers implement exceptions with low- or zero-cost overhead and what that "overhead" is really measuring.

So, there's about a half of a day of presentations to watch here. I hope that's not too much for you.

[0]: https://www.youtube.com/watch?v=bY2FlayomlE

[1]: https://herbsutter.com/gotw/

[2]: https://www.youtube.com/watch?v=ARYP83yNAWk

[3]: https://www.youtube.com/watch?v=_Ivd3qzgT7U

[4]: https://www.youtube.com/watch?v=LorcxyJ9zr4

inetknght · 2025-09-01T21:24:20 1756761860

Update: it looks like link [4] is just a rehash of his talk from last year's cppcon [0].

[0]: https://www.youtube.com/watch?v=bY2FlayomlE

[4]: https://www.youtube.com/watch?v=LorcxyJ9zr4

NooneAtAll3 · 2025-08-31T21:19:04 1756675144

isn't your MemoryBank already somewhere in std::pmr?

If I'm honest, I've never looked into pmr, but I always thought that that's where std has arena allocators and stuff

https://en.cppreference.com/w/cpp/header/memory_resource.htm...
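For reference, the arena-style allocator in `std::pmr` is `std::pmr::monotonic_buffer_resource` (C++17). The sketch below shows only the standard-library side; how closely it matches the article's MemoryBank is not something this example claims, and `arena_demo` is a hypothetical name.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <vector>

// Fill a vector whose storage comes from a fixed buffer, not from the heap.
inline std::size_t arena_demo() {
    std::array<std::byte, 4096> storage;

    // Bump allocator over `storage`. The null upstream resource makes any
    // overflow throw std::bad_alloc instead of silently using the heap.
    std::pmr::monotonic_buffer_resource arena(
        storage.data(), storage.size(), std::pmr::null_memory_resource());

    std::pmr::vector<std::uint32_t> words(&arena);
    words.reserve(256);  // one 1 KiB allocation carved out of the buffer
    for (std::uint32_t i = 0; i < 256; ++i) {
        words.push_back(i);
    }
    return words.size();
}
```

Deallocation on a monotonic resource is a no-op; the memory is released all at once when the resource is destroyed, which is the usual arena trade-off.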

msla · 2025-08-31T19:54:11 1756670051

What's the difference between a Programming Furu and a Programming Guru? Is there a joke I'm missing?

netr0ute · 2025-08-31T19:59:41 1756670381

Furus are "fake gurus." It comes from the Fintwit space where "furus" share their +1000% option trades as if they're geniuses in order to get you to sign up for their expensive Substack.

jclarkcom · 2025-08-31T21:34:20 1756676060

You might look into using memory mapped IO for reading input and writing your output files. This can save some memory allocations and file read and write times. I did this with a project where I got more than 10x speed up. For many cases file IO is going to be your bottleneck.
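As a point of reference, reading an input file through a memory mapping looks roughly like this on POSIX systems. This is a sketch only: `count_lines_mmap` is a hypothetical helper, and it makes no claim about whether mapping beats plain `read()` for an assembler, which is exactly what the replies dispute.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count newline characters in a file by mapping it read-only.
// Returns -1 if the file cannot be opened or mapped.
inline long count_lines_mmap(const char* path) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {  // mmap rejects zero-length mappings
        close(fd);
        return 0;
    }

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;

    const char* data = static_cast<const char*>(p);
    long lines = 0;
    for (std::size_t i = 0; i < size; ++i) {
        if (data[i] == '\n') ++lines;  // pages are faulted in on first touch
    }
    munmap(p, size);
    return lines;
}
```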

Sesse__ · 2025-08-31T21:54:59 1756677299

mmap-based I/O still needs to go through the kernel, including memory allocation (in the page cache) and all. If you've got 10x speedup from mmap, it is usually because your explicit I/O was very inefficient; there are situations where mmap is useful, but it's rarely a high-performance strategy, as it's really hard for it to guess what your intended I/O patterns are just from the page faults it's seeing.

jclarkcom · 2025-09-01T07:58:16 1756713496

Windows uses memory mapped IO for loading all executable processes because it allows you to start executing a process after loading a few pages even if the exe is megabytes. You can use the same to reduce latency for starting to assemble data before the rest of the file loads, the rest can be loaded using more efficient asynchronous mechanisms. Using it for output also means your process doesn't wait on flushes, which are also async. And in memory constrained environments the OS doesn’t have to write your data to swap, it can just reload it from the memory mapped file.

Sesse__ · 2025-09-01T11:44:57 1756727097

Linux also uses mmap for running executables. But explicit I/O does not mean you have to start off by a gigabyte-long read().

jclarkcom · 2025-09-01T18:10:07 1756750207

More detailed explanation from ChatGPT. As a quick estimate, you could achieve a >2x speed up using memory mapped files for a typical assembler workload.

https://chatgpt.com/share/68b5e0db-a6d0-8005-9101-d326d2af0a...

Sesse__ · 2025-09-01T19:47:01 1756756021

Why would anyone be interested in arguing against a confused AI?

jclarkcom · 2025-09-01T19:58:03 1756756683

I was trying to provide a more detailed explanation without typing a lot. I studied this problem a lot as a PE at VMware.

Sesse__ · 2025-09-01T20:11:49 1756757509

https://distantprovince.by/posts/its-rude-to-show-ai-output-...

In any case, if you really believe mmap is great for an assembler, then sure, go ahead. But it's not.

jclarkcom · 2025-09-01T20:41:23 1756759283

I implemented an assembler as part of VMware thinapp and this was big performance boost for me but maybe you have a different experience from your efforts?

Sesse__ · 2025-09-01T21:21:30 1756761690

Yes. (I'm not going into a pissing contest.)

jclarkcom · 2025-09-03T08:44:07 1756889047

https://github.com/jclarkcom/assembler-io-benchmark

netr0ute · 2025-07-06T15:27:57 1751815677

I don't remember this being the case, you could reuse your old MC purchase when they made the transition over.

areyourllySorry · 2025-07-06T15:36:21 1751816181

the mojang to minecraft.net transition, yeah. the minecraft.net to microsoft transition, no. https://youtu.be/rUFDRAEducI

netr0ute · 2025-02-22T17:19:10 1740244750

The only thing I don't like about this is the focus on x86 assembly, which is a sinking ship because RISC-V is coming to eat its lunch, FAST.

KeplerBoy · 2025-02-22T17:31:40 1740245500

I could understand if you wrote arm, because that's an architecture with actual marketshare. arguably more marketshare than x86-64 at this point, but you had to choose risc-v for the lols.

wolf550e · 2025-02-22T17:34:36 1740245676

Where are the high performance RISC-V implementations? Those that compete with AMD Zen-5 and Apple M4? Or at least AWS Graviton 4?

zozbot234 · 2025-02-22T22:46:01 1740264361

The Tenstorrent folks are working on that.

high_na_euv · 2025-02-22T17:24:08 1740245048

HackerNews does not reflect real world well

ksec · 2025-02-22T17:30:09 1740245409

The unwritten rule of HN:

You do not criticise The Rusted Holy Grail and the Riscy Silver Bullet.

do_not_redeem · 2025-02-22T17:31:37 1740245497

How would you define "fast"?

hagbard_c · 2025-02-22T17:52:41 1740246761

In relative terms, compared with similarly priced and powered devices on the market. RISC-V does lag behind the others - ARM, x86/64 - here, at least for now.

snvzz · 2025-02-22T17:57:10 1740247030

Not eating. Only drinking water or zero-calorie drinks such as black coffee.

Only while fasting can a person think clearly. When thinking clearly, RISC-V is inevitably chosen as the ISA.

Fasting will also eventually make you hungry. Thus "RISC-V is coming to eat its lunch, FAST."

astrange · 2025-02-23T00:05:29 1740269129

Doesn't RISC-V use vector stream processing instead of SIMD? That's a poor fit for ffmpeg.

astrange · 2025-02-23T03:12:45 1740280365

I should say, I think it would be. I haven't actually tried it and know ARM has added it too, so it'd be interesting to see for sure.

201984 · 2025-02-22T17:39:50 1740245990

Wake me up when a RISC-V processor is on par with an N50.

netr0ute · 2025-01-23T00:50:48 1737593448

This is basically irrelevant now that better ISAs like RISC-V have a fixed instruction length (2 or 4 bytes) so the fancy algorithm here isn't necessary.

lifthrasiir · 2025-01-23T01:57:12 1737597432

That fancy algorithm is relevant to RISC-V (and in fact, most fixed-length ISAs) because loading an immediate into a register needs one or two instructions depending on the immediate; you surely want to elide a redundant LUI instruction if you can. Of course such redundant instructions do no harm by themselves, but that equally applies to x86 as the algorithm is an optimization.
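To illustrate the one-or-two-instruction point, here is a sketch of how an assembler might plan a 32-bit `li` on RV32. `LiPlan` and `plan_li` are hypothetical names, and RV64 (where longer sequences can be needed) is deliberately left out.

```cpp
#include <cstdint>

struct LiPlan {
    bool use_lui;        // emit LUI rd, hi20
    bool use_addi;       // emit ADDI rd, (rd or x0), lo12
    std::uint32_t hi20;  // 20-bit LUI immediate
    std::int32_t lo12;   // sign-extended 12-bit ADDI immediate
    int count() const { return (use_lui ? 1 : 0) + (use_addi ? 1 : 0); }
};

inline LiPlan plan_li(std::int32_t imm) {
    // Small values fit in ADDI's 12-bit signed immediate: no LUI needed.
    if (imm >= -2048 && imm <= 2047) {
        return {false, true, 0, imm};
    }
    const std::uint32_t u = static_cast<std::uint32_t>(imm);
    // ADDI sign-extends its immediate, so round the upper part up whenever
    // bit 11 is set; the low part then comes out negative.
    const std::uint32_t hi20 = (u + 0x800u) >> 12;
    const std::int32_t lo12 = static_cast<std::int32_t>(u - (hi20 << 12));
    // If the low 12 bits are zero, the ADDI would be redundant: LUI alone.
    return {true, lo12 != 0, hi20, lo12};
}
```

For example, `0x12345000` needs only LUI, while `0x12345678` needs LUI plus ADDI. The choice changes the size of the emitted code, which is the kind of decision the comment is referring to.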

Coolbeanstoo · 2025-01-23T00:56:58 1737593818

As a result of RISC-V existing, all x86 processors have ceased to exist or be produced.

snvzz · 2025-01-23T04:49:15 1737607755

Accurate, if said sometime in the future rather than today.

saagarjha · 2025-01-23T08:36:12 1737621372

There are still people making z80 machines today, so no.

remexre · 2025-01-23T04:07:05 1737605225

This same problem applies to RISC-V with the C extension, because the J and JAL instructions have a larger range than the C.J and C.JAL instructions.
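The range difference can be sketched as a simple check. `JumpForm` and `pick_jump` are hypothetical names; the ranges follow from C.J/C.JAL encoding a 12-bit signed byte offset (about ±2 KiB) and JAL a 21-bit one (about ±1 MiB), both in multiples of 2.

```cpp
#include <cstdint>

enum class JumpForm { Compressed, Full, OutOfRange };

// Pick the shortest jump encoding that can reach `offset` bytes away.
inline JumpForm pick_jump(std::int64_t offset) {
    if (offset % 2 != 0) return JumpForm::OutOfRange;  // bit 0 is not encoded
    if (offset >= -2048 && offset <= 2046) {
        return JumpForm::Compressed;  // C.J / C.JAL, 2 bytes
    }
    if (offset >= -1048576 && offset <= 1048574) {
        return JumpForm::Full;        // JAL, 4 bytes
    }
    return JumpForm::OutOfRange;      // needs a longer sequence
}
```

The difficulty is that choosing the 2-byte form shrinks the code, which changes other offsets, so an assembler generally has to iterate until the choices stop changing.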

tliltocatl · 2025-01-23T08:20:07 1737620407

Having fixed instruction length doesn't make the need to load large constants magically disappear. These just get split between multiple instructions. If anything, RISC-V might be worse. See also https://maskray.me/blog/2021-03-14-the-dark-side-of-riscv-li....

nicebyte · 2025-01-23T01:00:37 1737594037

ARM would have been a better example because the number of people that care about RISC-V is a rounding error compared to x86 or ARM.

netr0ute · on Nov 12, 2024

> Qobuz removed a range of releases a couple months ago at short notice, including from users' accounts.

What was the deal with this?

Springtime · on Nov 12, 2024

It wasn't explained officially. I assume some distribution arrangement changed but the artists/releases were so varied and from various labels that I can't determine the relationship. I'd bought a handful of releases there and all of them were affected.

netr0ute · on Oct 13, 2024

Losing muscle

pessimizer · on Oct 13, 2024

Bodybuilders were the first people to start intermittent fasting after the mouse study. The title of the front page of https://leangains.com/ is "Leangains - Birthplace of Intermittent Fasting"

April 14, 2010: https://leangains.com/the-leangains-guide/

netr0ute · on Oct 11, 2024

The Milk-V Pioneer has 64 out of order cores and supports 128GB of ECC memory!

SigmundA · on Oct 11, 2024

Its Sophon SG2042 SOC has about the same per core performance as an A72 like in a Rpi 4 or Graviton 1 from 2018...

ramon156 · on Oct 11, 2024

I don't know why people expect RISC-V to already be on the level ARM and x64 are. The fact RISC-V even exists to begin with is amazing.

My opinion is definitely biased, though. Only time will tell

seanw444 · on Oct 11, 2024

The fact that large corporations like Google and Facebook have incentives to have a better alternative to x86 and ARM for the data center is very beneficial too, and can only speed development up.

netr0ute · on July 20, 2024

Missing RISC-V

mebeim · on July 20, 2024

That's the next arch I want to add but it takes a bit of work, sooner or later I will add it though :')

mfranc42 · on July 20, 2024

I'm missing s390x.

stevefolta · on July 20, 2024

Yeah, it seems odd that it has PowerPC but not RISC-V.