
the limit isn't the processor, it's how fast memory is. With DDR3-1600, it should be 11.97 GiB/s (12.8 GB/s)

I don't understand this reasoning. Why is it being limited to main memory speed? Surely the yes program, the fragments of the OS being used, and the program reading the data all fit within the L2 cache?



It is NOT limited to the external RAM speed, and the best proof is that it actually uses over 40 GB/s of memory bandwidth.

For each byte of data passing through pv:

1. the byte is read from yes's memory into CPU registers by the write syscall handler in the kernel

2. the byte is written to kernel's internal buffer associated with the pipe

3. the byte is read back in the read syscall called by pv

4. the byte is written to a buffer in pv memory

5. and that's the end, because the write syscall executed by pv on /dev/null very likely doesn't actually bother reading the submitted buffer at all

edit: Actually it might only be 20 GB/s, because on Linux pv seems to use the splice syscall to transfer data from stdin to stdout without copying through userspace.

This is also the reason why further "optimization" of this program in assembly was a fool's errand: the bulk of the CPU load is in the kernel.
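
For concreteness, a stripped-down yes looks roughly like this (a sketch assuming the 8 KB buffer mentioned further down the thread, not the actual coreutils source):

    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Fill an 8 KB buffer with "y\n" once, then reuse it forever. */
        char buf[8192];
        for (size_t i = 0; i + 2 <= sizeof(buf); i += 2)
            memcpy(buf + i, "y\n", 2);

        for (;;) {
            /* Steps 1 and 2 above happen inside this call: the kernel
               reads buf from this process and copies it into the pipe's
               internal buffer. */
            if (write(STDOUT_FILENO, buf, sizeof(buf)) < 0)
                return 1;
        }
    }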


All good, valid points, but I'm still a bit surprised that the limit is not higher; I thought that the L2 cache was over an order of magnitude faster than main memory (plus, as someone pointed out in the reddit thread, the peak memory performance should really be double the quoted 12 GB/s due to dual-channel memory).

The actual throughput, once you include the OS copying, is either 2 or 4 times the quoted speed (depending on splice usage), so we're either at the theoretical speed of main memory, or double it. Intuitively, I'd still have expected a larger multiple.

(A quick search doesn't find me any reliable Intel L1/L2 cache speeds/multipliers to quote, so I admit this comment is more speculation than it should be!)
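
For what it's worth, a back-of-envelope check using numbers quoted elsewhere in this thread (the ~9.3 GiB/s pipe rate comes from the taskset measurements further down):

    pipe throughput:            ~9.3 GiB/s  (~10 GB/s)
    without splice (4 copies):  ~40 GB/s of memory traffic
    with splice (2 copies):     ~20 GB/s of memory traffic
    DDR3-1600, dual channel:    2 x 12.8 GB/s = 25.6 GB/s peak

So the no-splice figure exceeds what dual-channel DDR3 could sustain at all, which only works if a large share of that traffic is absorbed by the caches.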


L2 cache and copying "y" bytes have very little to do with this; I suspect if you could produce high-granularity timings it would almost all be in the syscall overhead.

See e.g. https://stackoverflow.com/questions/23599074/system-calls-ov... where read() was benchmarked at ~638 ns per call.

(Many, many years ago I was working on the Zeus web server, and we went to surprising lengths to avoid syscalls for performance.)


Yes they do.

A read() syscall takes longer than a getpid() syscall because read() has more work to do: it actually copies len bytes of data, which takes time (and is faster or slower depending on whether the data is cache-hot).

What we call the "syscall overhead" is what happens before and after the actual data copy: switching between user mode and kernel mode.

You make that overhead negligible by calling read() with a large size.
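
A rough way to see this amortization yourself (a quick sketch, error handling omitted; absolute numbers will vary by machine and kernel):

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* Read 'total' bytes from fd in chunks of 'chunk' bytes, report the rate. */
    static void bench(int fd, size_t chunk, size_t total)
    {
        static char buf[1 << 20];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t done = 0; done < total; )
            done += read(fd, buf, chunk);   /* error handling omitted */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%8zu-byte reads: %.2f MB/s\n", chunk, total / s / 1e6);
    }

    int main(void)
    {
        int fd = open("/dev/zero", O_RDONLY);
        bench(fd, 64, 1 << 26);        /* per-syscall overhead dominates */
        bench(fd, 1 << 20, 1 << 30);   /* overhead amortized away */
        return 0;
    }

The small-read case is dominated by the per-call cost discussed above; the large-read case should approach plain copy speed.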


(Many, many years ago I was working on the Zeus web server, and we went to surprising lengths to avoid syscalls for performance.)

Snap!

IIRC, ZWS used shared memory to mirror the results of the time() syscall across processes, to save a few nanoseconds on some operating systems :) That was before Linux and other OSs used techniques like the vsyscall/VDSO mentioned in the stackoverflow discussion...
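
Something in the spirit of this sketch, I imagine (a hypothetical reconstruction of the idea, not the actual ZWS code): a dedicated process keeps a timestamp fresh in a shared mapping, and everyone else reads it with a plain load instead of a syscall.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    /* Shared timestamp: updated by one "clock" process, read by the rest. */
    static volatile time_t *shared_now;

    void init_shared_clock(void)
    {
        shared_now = mmap(NULL, sizeof(*shared_now),
                          PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (fork() == 0) {                 /* dedicated updater process */
            for (;;) {
                *shared_now = time(NULL);  /* one syscall per second... */
                sleep(1);
            }
        }
    }

    /* ...so this costs a memory load, not a syscall. */
    time_t cheap_time(void) { return *shared_now; }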


The yes and pv processes are not scheduled on the same CPU core, so they use different L2 caches.


I wonder if `taskset -c1 yes | taskset -c1 pv > /dev/null` would significantly change the throughput.


    $ yes |pv > /dev/null
    46.6GiB 0:00:05 [9.33GiB/s]

    $ taskset 1 yes |taskset 1 pv > /dev/null
    32.9GiB 0:00:05 [6.58GiB/s]

    $ taskset 1 yes |taskset 2 pv > /dev/null
    45.7GiB 0:00:05 [9.13GiB/s]

    $ taskset 1 yes |taskset 4 pv > /dev/null
    45.7GiB 0:00:05 [9.18GiB/s]
Very rough numbers - the 9.13/9.33 difference flip-flopped when I ran the commands again. Binding both processes to the same core is definitely a performance hit, though. There might be some gain from a shared cache, but it's more than lost through the lack of parallelism.

I tried masks 2 and 4 as I wasn't sure how 'real' cores vs. hyperthread siblings are numbered. These numbers are from an i7-7700K.


How do you know that the dataset fits in L2?

Assuming pv uses splice(), there is only one copy in the workload: copy_from_user() from the fixed source buffer to some kernel-allocated page; those pages are then spliced to /dev/null.

If the pages are not "recycled" (through an LRU allocation scheme), the destination changes every time and the L2 cache is constantly thrashed.
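
For reference, that splice() fast path is roughly the following (a minimal sketch, not pv's actual code; splice() requires at least one of the two fds to be a pipe, which stdin is in this pipeline):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Move data from the pipe on stdin to stdout without copying it
       through a userspace buffer. */
    int main(void)
    {
        for (;;) {
            ssize_t n = splice(STDIN_FILENO, NULL, STDOUT_FILENO, NULL,
                               64 * 1024, SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n <= 0)
                return n == 0 ? 0 : 1;   /* EOF or error */
        }
    }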


I only learned of pv from this article, so I can't speak much about its buffering. I would guess that the kernel tries to re-use recently freed pages to minimise cache thrashing. But in any case, on the yes side, the program isn't re-allocating its 8 KB buffer after every write(), so a lot of data is being re-read from the same memory location.


As another point in your favor, one of the commenters reached 123 GB/s by modifying both yes and pv.

https://www.reddit.com/r/unix/comments/6gxduc/how_is_gnu_yes...


In general, only the CPU itself sees the L2 cache. Anything you see on another device (screen, disk, NIC, etc.) has been flushed out of the cache.


Sure, but this is a pipe between two tiny processes, hopefully with very little else running on the computer at the time (otherwise all bets are off for any benchmarking). There's no 'real' I/O going on (in terms of screen, disk, NIC, and so on).

There's no reason that the L2 cache needs to be flushed at any point - the caches deal with physical memory rather than virtualised address space, so the fact that there are two processes here shouldn't stop the caching from working.


I am probably way behind the current state of CPUs, judging by the downvotes I got, so if you are saying there is no reason and the data can be written to a device without leaving the CPU, I will just concede my ignorance.


Don't fret about the downvotes, these magic internet points aren't redeemable anywhere :)

It's completely possible for the data to not (all) leave the CPU. If the caches are large enough, then the pages full of "y\n" will still be resident in the cache when the next iteration of the program overwrites the same pages again. In that case the CPU has no need to send the original page out to main memory.


If you did for(;;) buffer[(i++)%size] = 'y'; then you'd be correct. However, you do I/O, and the 'y's appear at the I/O driver and have to be made visible to the device it's driving, which can be anything, including a process on another CPU core in another socket. If they remained in the issuing CPU's cache, I fail to see how the destination device could possibly see them. There are some devices which can snoop the cache (like SoC GPUs, and other CPU sockets on some architectures), but the snooping is much slower than the memory bus. Writing to memory is the only way that a) guarantees the data is available elsewhere and b) is the fastest.
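
Fleshing out that first case for contrast (a sketch: the buffer is far smaller than L2, and volatile just keeps the compiler from eliding the stores):

    #include <stddef.h>

    /* Stores that never need to leave the cache: the same 64 KB is
       overwritten forever, so the lines stay resident and (mostly)
       never have to be written back to main memory. */
    static volatile char buffer[64 * 1024];

    int main(void)
    {
        for (size_t i = 0; ; i++)
            buffer[i % sizeof(buffer)] = 'y';   /* no I/O anywhere */
    }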



