
Author here. I think today that, apart from making my case in a bit of an obnoxious tone, I also somewhat overstated it: while it's true that many "high-level" constructs have a cost that will not magically go away due to any logic built into hardware, at least not fully, it also ought to be true that a lot can be done in hardware to make software's life easier given a particular HLL programming model, and I'm hardly an expert on this. My real interests are in accelerator development, which starts at the GPU and moves further away from the CPU - so lower-level and gnarlier than C in terms of programming model.

I will however say that the Reduceron and, in general, the idea of doing FP in hardware in the most direct way are a terrible waste of resources, and I'm pretty sure this approach loses to a good compiler targeting a von Neumann machine on overall efficiency.

The way to go is not to make a hardware interpreter; that is no better than a processor with a for loop instruction added to better support C. The trick is to carefully partition sw and hw responsibilities, as in the model to which C+Unix/RISC+MMU converged.



I'm curious whether you think the ideal boundary between SW/HW might've shifted in the last ~40 years, as the things we use computers for have drastically changed?

I know basically nothing about hardware, but I know the software layer from the OS/compiler up through the UI. There's a fair bit of evidence that things we've traditionally assumed belong in the kernel actually belong in userspace, and they're being reinvented in userspace as a result. For example, most modern languages & frameworks put some form of scheduler in the standard libs - we're reimplementing the abstraction of a thread as promises or fibers or async/await or callbacks. Many big Internet companies disable virtual memory in their production servers, because once the box begins swapping you might as well count it as down. Many common business apps program to a database, not a filesystem, and the database in turn uses block-based data structures like B-trees and SSTables, but has to implement them on top of a filesystem.

At the same time, the classic OS protection boundary is the process, but the unit of code-sharing in the open-source world is the library. As a result, the protection mechanisms that OSes have gotten very good at are largely useless at preventing huge security violations from careless coding in a library dependency.

Most of these came from computers being used outside of the original domains that the system software developers assumed, eg. nobody in the 1970s would've imagined 10 million GitHub users of widely different skill levels all swapping code. Knowing what we do now about the big markets for computation, are there additional operations we'd want to put in hardware, or things currently done in hardware that should be moved to software?


There is a fascinating talk by Cliff Click, "A JVM Does That?". At the end he shares some opinions about what should be done by the JVM vs. the OS, and what should change.

video: https://youtu.be/uL2D3qzHtqY

slides: http://www.azulsystems.com/blog/wp-content/uploads/2011/03/2...

I remember a talk where he was also talking about hardware; it does not seem to be this one. For example, a time register would be useful: syscalls like clock_gettime are too slow, and CPU info like cycle counts fails with dynamic frequency scaling.
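
To illustrate in C (my own sketch, assuming x86 and GCC/Clang for the __rdtsc intrinsic): the portable path goes through a syscall or vDSO call per reading, while the raw cycle counter is a single instruction but runs into exactly the frequency-scaling problem above unless the CPU has an invariant TSC.

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>
    #include <x86intrin.h>  /* __rdtsc(), x86-specific */

    int main(void) {
        /* Portable path: one syscall or vDSO call per reading. */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);

        /* Raw cycle counter: one instruction, but its meaning depends
           on the CPU (frequency scaling, invariant TSC support). */
        uint64_t c0 = __rdtsc();
        uint64_t c1 = __rdtsc();

        printf("monotonic: %ld.%09ld s, tsc delta: %llu cycles\n",
               (long)ts.tv_sec, ts.tv_nsec,
               (unsigned long long)(c1 - c0));
        return 0;
    }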


Nit: disabling swapping is not the same thing as disabling virtual memory. Virtual memory is just something that allows swapping; it does not require swapping.


> There's a fair bit of evidence that things we've traditionally assumed belong in the kernel actually belong in userspace, and they're being reinvented in userspace as a result.

That's basically the pitch given by unikernels like Mirage: functionality which was traditionally implemented by an OS, like storage, is turned into a library that gets compiled into the application like any other. If an application wants to access a filesystem on the disk it can use an appropriate library; if instead it wants to manage the data being stored on (some section of) the disk directly, it just needs a different library. That way, applications like relational databases can claim their own section of the disk and read/write it directly, to avoid the performance and reliability (e.g. caching/flushing) penalties of going via a filesystem.


I think all of your points are very valid, but they're focusing on one part - virtual memory - which you suggest removing (and this can be done today by not using that part of the hardware; in that case the penalty of carrying the unused hardware around is AFAIK fairly small.) My original point was that adding (or changing) hardware to accommodate HLLs is not going to buy you as much performance as people think, and this concerns a different part of the sw/hw boundary: basically, what should compilers/interpreters/runtimes be doing vs what should be handled at the ISA level, whereas your points are about what protection mechanisms we want and who among hardware, software and OS should do what. (I guess I should have said C/RISC and kept Unix/MMU out of it, as I did in TFA.)

What will actually happen with protection mechanisms I don't know; certainly Unix-style mechanisms are spreading to ever more places, with, say, HSA's idea of accelerators and CPUs being aware of the same virtual memory maps. Compatibility is a very strong force here. On the other hand, there's a lot of stuff happening with memory protection disabled, as you described. My predictions here are going to be less educated than many others', to be honest, because I deal with embedded systems whereas most of the exciting stuff here happens in servers, I'd guess. (But I can tell you that in automotive embedded systems, of all places, not only are Unix-style processes gaining traction right now but so are hypervisors running multiple actual OSes, some of them POSIXy, sharing chips. So that's a data point showing a trend in the "more of the same" direction.)


One simple and concrete request:

Traps on integer overflow:

http://blog.regehr.org/archives/1154
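
To illustrate what's at stake (my sketch, not from the linked post): with the GCC/Clang overflow builtins, checked arithmetic in C today means an explicit test and branch around every operation you care about - that branch is roughly what a trapping add would make free.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Checked addition as compilers expose it today: test and branch. */
    static int32_t checked_add(int32_t a, int32_t b) {
        int32_t result;
        if (__builtin_add_overflow(a, b, &result)) {
            fprintf(stderr, "integer overflow in checked_add\n");
            abort();  /* roughly what a hardware trap would do for free */
        }
        return result;
    }

    int main(void) {
        printf("%d\n", checked_add(2000000000, 100000000));  /* overflows */
        return 0;
    }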


A promising alternative architecture that moves some features previously handled in software into hardware [1]. The execution model still closely matches current architectures.

See also different approaches to programming that make space-time tradeoffs more explicit [2], and that use natural-law-like principles to distribute computation across a simpler but highly connected computing fabric [3].

[1] https://millcomputing.com/

[2] http://web.mit.edu/jakebeal/www/Publications/PTRSA2015-Space...

[3] http://blob.lri.fr/


Alan Kay takes part in discussions here occasionally. He seems like a pretty easygoing guy (but you might want to rein it in a bit), so you could probably just email him...


Alan Kay likes to fail at least 90÷ of the time (it shows you aim high enough) and says the industry is too dumb to digest good ideas. I like to succeed at least 90÷ of the time so that the dumb industry keeps employing me. I'm afraid we have irreconcilable differences. (And this is me reining it in A LOT right here. Don't get me started...)


90÷

To be clear, you mean 90%, correct? If so, the ÷ symbol (which I just learned is called an obelus) is typically used for division—I've never seen it used to mean percent. Is this a locale difference? A keyboard issue? I've seen that Android users will sometimes mistype ℅ for % due to their proximity on a certain keyboard, for example.


Maybe they were holding their phone at an angle?


Can you elaborate a bit on why you think the Reduceron, or using an FPGA alongside a CPU, is not a good idea? I thought that since clocks aren't going to get much higher, that is the future: maybe compilers will start generating some kind of VHDL that can make the app you spend most of your CPU time on much faster (the theoretical possibilities seem great with big enough FPGAs).


Speaking as someone who programs FPGAs for a living, they are good for three things (I'm simplifying a bit here):

* interfacing with digital electronics, implementing low-level protocols, and deterministic/real-time control systems

* emulating ASICs for verification

* speeding up a small subset of highly specialized algorithms

Of those, only the last one would apply in the context of this thread. However, due to their structure and inherent tradeoffs, they are completely incapable of speeding up general purpose computation. As for specialized computation, if the algorithms rely heavily on floating-point ops, a GPU will nearly always be faster and cheaper.


I think the new Stratix might well beat GPUs, no?

The Reduceron specifically tries to quickly perform applications of lambda expressions that GHC will try to avoid generating in the first place. It speeds things up using several memory banks etc., but it still does things that shouldn't be done at all, and the overhead is there, at least in area and power.


I think FPGAs are too expensive to be used for general purpose computing. If on top of the chip price you add the development time, it's just not cost effective. A high-end FPGA will cost you thousands of dollars, and you won't be able to easily convert software code to HDL. A very high-end GPU will be cheaper and easier to develop for.

There are situations where an FPGA is better suited, of course (very low latency real-time signal processing, for instance), but for general purpose computing FPGAs are not exactly ready for primetime IMO.


> I think the new Stratix might well beat GPUs, no?

Adding to what simias said, even an FPGA with built-in floating-point primitives can beat a GPU (in floating-point-heavy computations when the measure is performance/cost) only if the algorithm doesn't fit well onto the GPU architecture – for example, if you can make use of the highly flexible SRAM banks on the FPGA. I suppose there exist such workloads, but they're rare.

Also, keep in mind that no FPGA comes even close to the raw external memory bandwidth of modern GPUs.


Altera/Intel and Xilinx have FPGAs coming out this year with HBM which should make them competitive with existing GPU memory bandwidth.


Even #2 is questionable. I had a front-row view of a company that made a chip very quickly, and one way they were able to do it was to not bother emulating the ASIC in an FPGA. There are some very nice open-source hardware development tools that basically obviate that need.


Once you get to the point where you need ASIC emulation, there are no open-source tools that are up to the task.

You don't need emulation for simple stuff like Bitcoin miners or other small and/or highly regular chips. You use it if you're developing a large SoC that takes tens of millions of dollars to develop. It takes months after you finish your HDL code before you get the first silicon back from the fab, and you don't want to wait that long before you can start testing your custom software.

So, no, #2 isn't questionable, it's routine practice. In fact, the largest FPGAs by Xilinx and Altera are structured explicitly with that use case in mind.


Emulation via FPGA or dedicated HDL emulator (a special supercomputer designed for running Verilog/VHDL, very fast, very good, very expensive) is also essential for functional verification of things like CPUs.

For example, booting Linux and running a simple application can take many billions of cycles. You simply can't simulate that many; you need something faster. (You can simulate a few billion overnight with an appropriate server farm, but that's across many tests using many simulator instances.)


Unless you're NVIDIA, Intel, AMD, or Qualcomm, Apple, Samsung (you get the idea) why would you ever want to build anything besides "small and/or highly regular chips"?

I think there's a lot of interest in designing "highly regular" chips. And it's definitely possible to go quite far with open source tools. I've seen a 16-core general purpose chip with full 64-bit IEEE FP, ALU, and memory instructions operating at 1 MHz (real speed) as a gate-level simulation on a desktop computer. This could potentially be "running linux" at a reasonable (if sluggish) speed.


> Unless you're NVIDIA, Intel, AMD, or Qualcomm, Apple, Samsung (you get the idea) why would you ever want to build anything besides "small and/or highly regular chips"?

What's your point? The ASIC emulation market exists, there are several companies that build and sell ASIC emulators, and Xilinx and Altera cater to that market with dedicated FPGA devices. I'm not sure why you're arguing here.


I'm just a curious hardware development newbie passing by but would you be willing to share the open-source development tools used? It would be really interesting to take a peek at something that was used to develop a chip very quickly. Most of the hardware stuff seems to be quite complicated and not all that open.


I'm not the original poster, but perhaps he's thinking of https://chisel.eecs.berkeley.edu/ I believe it can (or could?) generate C++ code which compiles into a program that simulates your design.


They wanted to use chisel for the whole stack, but that was impossible because of one of their engineers.


Check out Verilator.


Just curious, what did you mean by this:

> If your architecture meets these requirements, I'll consider a physical implementation very seriously (because we could use that kind of thing), and if it works out, you'll get a chip

Fabbing someone else's idea sounds expensive. What did you have in mind?


People put experimental digital blocks in ASICs all the time; it isn't necessarily that expensive. It's a bit like taking an extra pair of shoes on holiday - in general, I'm a bit concerned about reaching the airline's baggage weight limit. If you ask me to add your shoes to my bag before I start packing, I'm going to say no. If you ask at the end, and I've got some space left, then fine.

However, he's probably talking about an FPGA implementation. That'd be sufficient to prove the concept. Once you've gone that far, you can normally do some simulations to predict the energy consumption on a real chip.


I was at the time of writing, and still am, an accelerator architect, and I'd gladly use someone's idea in a mass-market product (ASIC) if they didn't mind. However, it is also true that working on any real product means that many valid ideas useful in some contexts will not be useful for me, and I guess this is true for many ideas for speeding up higher-level programming models; perhaps it was misleading of me to fail to point this out. (As I said, I don't love the tone of that article; it is unfortunately very effective, as my articles written in that tone around 2008 tend to resurface more often than articles written in a nicer, more balanced tone and with way more technical detail from around, say, 2012-2013. What my takeaway is wrt future writing, I'm still not quite sure.)


What do you mean by "for loop instruction added"? How would that look from a developer's perspective and what could be done in hardware to improve efficiency?


I used that as an example of a bad idea; I don't have details on this bad idea, but you could have an instruction looking at init, bound and increment registers and a constant telling it where the loop ends, and voila, the processor runs for loops without needing lower-level increment and branch instructions, and it shaves off one instruction (not a cycle, necessarily, but an instruction):

    FOR counter_reg, init_val_reg, bound_reg, step_reg, END_OF_LOOP
    ...
  END_OF_LOOP:

...instead of:

    MOVE counter_reg, init_val_reg
  START_OF_LOOP:
    ...
    ADD counter_reg, step_reg
    BRANCH_LESS_THAN counter_reg, bound_reg, START_OF_LOOP

I was saying that this obviously not-so-good idea is not much different in spirit from building hardware for quickly creating and applying lambda terms, which is what the Reduceron does. Lowering lambda calculus to simpler operations so that lambda expressions are not represented in a runtime data structure at all much of the time, the way GHC and other compilers approach the problem, is a better idea.


I see - sorry, I'd missed the "not" in "the way not to go"


Yossi, do you think there's a market for special matrix mult machines that use low-precision FP, maybe as a systolic array?


This is OT, but - most of linear algebra dies quickly even in single precision, meaning that your equation-system solver produces a solution that doesn't really solve the equations, etc. One exception is neural networks, where Google's TPU is just the start, and in general GPUs, while beating CPUs, leave a lot of room for improvement.


Actor-based dynamic language author here. (It doesn't matter which one; I think I speak for all of us.) Thank you for being honest with us; we are not always a very performance-oriented group.

We're generally in favor of things which accelerate message passing between shared-nothing concurrent actors. Hardware mailboxes or transactional memory are nifty. OS-supported message queues are nifty; can those be lowered to hardware in a useful way?
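
For concreteness, one example of the OS-supported queues I mean is the POSIX mqueue API; a rough C sketch (queue name and sizes picked arbitrarily):

    #include <fcntl.h>
    #include <mqueue.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Rough sketch: name and sizes are arbitrary. Link with -lrt. */
        struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = 64 };
        mqd_t q = mq_open("/actor_demo", O_CREAT | O_RDWR, 0600, &attr);
        if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

        const char *msg = "hello";
        mq_send(q, msg, strlen(msg) + 1, 0);       /* enqueue, priority 0 */

        char buf[64];                              /* must be >= mq_msgsize */
        ssize_t n = mq_receive(q, buf, sizeof buf, NULL);
        if (n >= 0) printf("got: %s\n", buf);

        mq_close(q);
        mq_unlink("/actor_demo");
        return 0;
    }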


I'm not sure whether today's coherent caches, atomic operations etc. are a poor fit for what you want to do, leaving much room for improvement. (I'm sure someone familiar with, say, the Go stack will be able to say more. I can say that for computational parallelism everything is fine with current hw, but there 100K tasks would map to dozens of threads, tops; if you want 100K concurrent actors, maybe things look different. At any rate, I don't see how the shared-nothing part creates a problem hardware can solve here; maybe there are problems in the "lots and lots of concurrent actors" part, but I'm not sure.)

Incidentally, IMO shared-nothing is an inherently inefficient model for multiple actors cooperating to perform a single computation, and nothing done in hardware can fully eliminate the cost introduced by the model (and if something can be done, it can be done by code analysis transforming the code into a more efficient shared-memory model.) This is not to say that there's no value in such a system - far from it - just that it's a poor fit for some things, which can only be mapped onto it with some overhead that hardware cannot eliminate.


Well, I never thought I'd be plugging my PhD research here, but:

"Asynchronous Remote Stores for Inter-Core Communication" http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.592...

To my knowledge, this is still the only hardware-assisted message passing scheme that is virtualisable (ie compatible with a "real" OS like Linux).

Hardware mailboxes are great, but time-sharing OSs can't deal with finite hardware resources that can't be swapped out easily. Software-based queues die a fiery death thanks to cache coherency - reading something that another core just wrote will block you for hundreds of cycles.
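
To make the failure mode concrete, here's a minimal C11 sketch of the kind of software queue I mean - a single-producer/single-consumer ring (names and sizes are just for illustration). The acquire-load of an index the other core just wrote is where the hundreds-of-cycles coherence stall lands:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define QCAP 1024  /* power of two */

    struct spsc_queue {
        /* Indices on separate cache lines to avoid false sharing; the
           true sharing on head/tail is the unavoidable coherence cost. */
        _Alignas(64) _Atomic uint32_t head;  /* advanced by consumer */
        _Alignas(64) _Atomic uint32_t tail;  /* advanced by producer */
        _Alignas(64) uint64_t slots[QCAP];
    };

    static bool spsc_push(struct spsc_queue *q, uint64_t v) {
        uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
        uint32_t h = atomic_load_explicit(&q->head, memory_order_acquire);
        if (t - h == QCAP) return false;  /* full: back pressure */
        q->slots[t % QCAP] = v;
        atomic_store_explicit(&q->tail, t + 1, memory_order_release);
        return true;
    }

    static bool spsc_pop(struct spsc_queue *q, uint64_t *v) {
        uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
        /* This load typically misses: the producer core owns the line. */
        uint32_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (t == h) return false;  /* empty */
        *v = q->slots[h % QCAP];
        atomic_store_explicit(&q->head, h + 1, memory_order_release);
        return true;
    }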


Virtualizable hardware-assisted message passing is awesome. (MIPS for instance had a big fat ISA extension for hardware-assisted message passing and cheap hardware multithreading which Linux couldn't use and they then threw out the window exactly when they introduced hardware virtualization of the entire set of processor resources.)

As to software-based queues dying a fiery death - in what scenarios? As I said in a sister comment, I (think that I) know that things work out in computational parallelism scenarios where many tasks are mapped onto a thread pool, TBB-style; that is, I don't think the hardware overhead is ridiculously large in these systems. Where do things go badly? 100K lightweight threads communicating via channels, Go-style?


Whoa - MIPS virtualised their message passing hardware? How?!

Software-based queues die a fiery death when the latency of a send/receive is critical, because you end up stalling on a really slow cache-coherence operation. So, for example, anything like a cross-thread RPC call takes ages (you wait for the transmission and wait for a response, so it's much slower than a function call and often slower than a system call - the Barrelfish research OS suffers a bunch from this). There are also algorithms you just can't parallelise, because you can't split them into large chunks, and if you split them into small chunks the cost of communicating so frequently destroys your performance. (E.g. there was a brave attempt to parallelise the inner loop of bzip2 - which resists coarse parallelisation thanks to loop-carried dependencies - this way.)

Software based queues perform just fine on throughput, though - if you're asynchronous enough to let a few messages build up in your message queue, you'll only pay the latency penalty once per cache line when you drain it (and with a good prefetcher, even less than that).

The examples you cite are actually both instances of software cunningly working within the limits of slow inter-core communication. Work-queue algorithms typically keep local, per-core queues and only rebalance tasks between cores ("work stealing") infrequently, so as to offset how expensive that operation is. Lightweight threads with blocking messages (like Go or Occam or some microkernels) work by turning most message sends into context switches within one core - when you send a message on a Go channel, you can just jump right into the code that receives it. Again, they can then rebalance infrequently. (For an extra bonus, by making it easy to create 100k "threads", they hope to engage in latency-hiding for individual threads - and once you're in "throughput" mode it's all gravy).
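
To make the batching point concrete, here's a hypothetical drain loop reusing the spsc_queue sketch from my comment above (so it isn't self-contained): one acquire-load of the producer's index covers however many messages have built up, and the coherence miss is amortized across all of them.

    /* Hypothetical batched drain over the spsc_queue sketched above:
       read the producer's tail once, then consume everything up to it. */
    static uint32_t spsc_drain(struct spsc_queue *q,
                               void (*handle)(uint64_t)) {
        uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
        uint32_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
        for (uint32_t i = h; i != t; i++)
            handle(q->slots[i % QCAP]);
        atomic_store_explicit(&q->head, t, memory_order_release);
        return t - h;  /* number of messages drained in this pass */
    }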


> Whoa - MIPS virtualised their message passing hardware? How?!

No, I meant to say that they simply obsoleted that part of their architecture when they added virtualization, because they couldn't virtualize it.

> Eg there was a brave attempt to parallelise the inner loop of bzip2 - which resists coarse parallelisation thanks to loop-carried dependencies - this way.

So you say you can do hardware-assisted message passing that can be virtualized and can speed up bzip2 by parallelizing? How few instructions per RPC call does it take for you to still be efficient vs today's software-based messaging? (This is getting fairly interesting and it should be particularly interesting to serious CPU vendors.)


This is getting deep in an ageing thread - do you want to take this to email? (It's in my profile)

Pipelined bzip2 wasn't in the evaluation for my research, but I bet remote stores would get considerably better results than software queues. Parallelising one algorithm is something of a stunt, and gets you just a single data point. Instead, I did a bunch of different benchmarks (microbenchmarks for FIFO queues and synchronisation barriers; larger benchmarks including a parallel profiler and a variable-granularity MapReduce to measure how far remote stores could move the break-even point for communication vs computation; and an oddball parallel STM system that I'd previously demonstrated on dedicated (FPGA) hardware). I got around an order of magnitude on all of them (some a little less, some much more).

The writeup starts on page 59 of https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-831.pdf and the evaluation on page 65.

Looking back, I seriously regret not taking more time to sit down and write it up more clearly, because I do think this should be interesting to serious CPU vendors. However, by that point I had reached the point of "I'm fed up with this PhD; I'm going home now". As I knew I didn't want to stay in academia, I published in a mediocre venue rather than revising for a better one, and went off to Silicon Valley instead. Your comments have made me re-read my old work, and it's painful to wonder how much further it could have gone if I had explained it better.


> Hardware mailboxes or transactional memory

If you do them in hardware, they always come bounded: at most n elements of size m bytes, and both numbers are usually single-digit. If you want to lift that limitation, it usually ends up just as slow as doing it in software.


Unbounded queues are arguably not a good idea (although single-digit bounds are possibly too low?); at the very least there would probably need to be some concept of back pressure.


Have you tried, or heard of anyone trying, to tinker with caching attributes (i.e. uncached, write-combining, etc.) for message-passing buffers? I think you do need to be in kernel mode to be able to change the attributes, but you can access the memory from userspace once it's set up.

Sounds to me like you could get some improvement in shared-nothing message passing by avoiding traffic on the CPU core interconnect due to unnecessary caching.

That said, I only have experience tinkering with caching attributes in CPU-to-GPU communication, and I'm not very familiar with the internals of CPU interconnects, so take my words with a grain of salt.


Ick, no. Nasty as the latencies of cache coherence are, going out to memory every time will slaughter your performance coming and going. Do not want.


There are various levels of caching, which have very different performance characteristics with different access patterns from multiple cores. Don't dismiss it outright. You can reduce the pressure on the cache coherency protocol by making different tradeoffs.


Yes there are, and there were a bunch of interesting research machines in the 90s that played with weird cache modes for shared data. (If you're interested, I can go dig out the literature review section of my thesis for you. There were some weird and wacky schemes, none of which saw industrial deployment).

But the bottom line is that the cache options available in modern desktop/server processors won't really help. They all basically disable the cache in one way or another. (They're really intended for controlling memory-mapped devices.) And while the latency of fetching something out of another core's L1 cache is nasty, going all the way out to DRAM for every access is really, really slow.

So, I'll stand by my summary, albeit less caustically. While it's an interesting idea, you really don't want to do what you suggest.


It seems to me that Intel TSX and general work on improved atomics are what you want. The high-level constructs themselves probably shouldn't be directly in hardware.



