I personally wouldn't do any of these for an average project. They seem like micro-optimizations. But this is missing the largest Go optimizations I can think of, and the ones I actually would do: string vs. []byte, and memory allocation.
A lot of Go programs spend a lot of time converting between []byte and string. The conversion is actually really slow, and keeping your string data as byte slices is much faster; since many Go libraries can work with either, the conversion is often not necessary.
Also, memory allocations and garbage collection are not free. If you have a hot path that gets hit on every request or in a loop, reusing the same slices can save a lot of time, assuming you can do so without introducing bugs and/or concurrency issues.
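A minimal sketch of the slice-reuse idea (the function and names here are illustrative, not from the repo): resetting a buffer with `buf[:0]` keeps its capacity, so after the first few iterations the loop stops allocating entirely.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendRecord writes "name:id" into buf and returns the grown slice.
// strconv.AppendInt formats directly into the buffer, avoiding the
// intermediate string that strconv.Itoa would allocate.
func appendRecord(buf []byte, id int, name string) []byte {
	buf = append(buf, name...)
	buf = append(buf, ':')
	return strconv.AppendInt(buf, int64(id), 10)
}

func main() {
	var buf []byte
	for i, name := range []string{"alice", "bob"} {
		buf = buf[:0] // reset length, keep capacity: no new allocation after warm-up
		buf = appendRecord(buf, i, name)
		fmt.Println(string(buf)) // alice:0, then bob:1
	}
}
```

The same pattern generalizes to any per-request scratch buffer, as long as the buffer isn't shared across goroutines.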
Dumb question, but can you give an example of using []byte over string?
I've got a project where strings aren't exactly crucial, but we're storing a ton of them in memory. We store them as string and, frankly, rarely access most of them - though we do some string comparisons on the hot path.
Using []byte sounds interesting, but troublesome at the same time. Namely, a []byte is awkward to read characters from, as you'd eventually have to decode runes from it anyway, no?
The key takeaway is that there is a performance cost to converting between []byte and string, and minimizing the number of conversions is reasonably low-hanging fruit when looking for performance optimizations.
>Using []byte sounds interesting, but troublesome at the same time. Namely, a []byte is awkward to read characters from, as you'd eventually have to decode runes from it anyway, no?
Tons of string manipulation just needs to split on delimiters like \n, ., ;, or -, none of which need runes.
Besides that, string comparisons don't need runes either (assuming what you're comparing is normalized to the same bytes - which it will be, if you apply the same input processing to anything you store and to the strings you query with).
[]byte is crucial if you need to frequently modify the contents. The classic example is building a string with a loop that appends a new character on each iteration. Using += to concatenate strings inside such a loop will be slow; the Right Thing To Do is to preallocate a []byte with sufficient capacity, append to it in a loop, and then (if necessary) convert to a string only at the end.
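A minimal sketch of that pattern (function names here are made up for illustration): the `+=` version allocates a fresh string on every iteration, while the `[]byte` version allocates once up front and converts once at the end.

```go
package main

import "fmt"

// buildSlow concatenates with +=, allocating a new string each iteration.
func buildSlow(n int) string {
	s := ""
	for i := 0; i < n; i++ {
		s += "x"
	}
	return s
}

// buildFast preallocates capacity once and appends bytes in place.
func buildFast(n int) string {
	buf := make([]byte, 0, n)
	for i := 0; i < n; i++ {
		buf = append(buf, 'x')
	}
	return string(buf) // single conversion at the end
}

func main() {
	fmt.Println(buildSlow(5) == buildFast(5)) // true: same result, fewer allocations
}
```

`strings.Builder` in the standard library wraps essentially the same trick behind a nicer API, if you'd rather not manage the slice yourself.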
Isn't there a premade library that will abstract these optimizations away? e.g. a rope library. Looks like there's a few libraries out there that implement this data structure. (https://en.wikipedia.org/wiki/Rope_(data_structure))
Yes, it's just []rune, which has just a bit of special sauce in the runtime that allows you to convert it directly to a string [1], and some functions that convert UTF-8 to and from it. However rune is a 32-bit number sufficient for storing a Unicode code point. Generally you're dealing in UTF-8 and prefer to just pass that through code. I think I've only ever use a []rune once, for when I was doing some very heavy-duty Unicode-aware code, and I still ended up refactoring it into using UTF-8 directly in the end.
[1]: https://play.golang.org/p/5DEzw85J5Ob Note this is a conversion; the string is UTF-8 and the runes are 32-bit ints, so this creates a new string.
There's some context missing here. AFAIK the author of the repositories was specifically doing micro-benchmarks here to inform the design of his https://github.com/cornelk/hashmap library. I think for that use case (designing a high-performance lock-free data structure), that kind of micro-benchmarking makes a lot of sense.
That hash map looks good and I'm thinking we'd probably benefit from using it on the hot path of some code we have that needs to be highly scalable.
(Both links were submitted to HN, but only this one seems to have landed on the front page.)
Somewhat related; I was building a game a few years ago and passing 4x4 matrices (128 bytes) by value was faster than pass by reference. I don't know exactly how many bytes can be passed by value before it becomes slower, but please don't read this and mindlessly assume pass by value is slower.
This is not a very good example of benchmarking. First, all of the benchmarks need to be run multiple times (e.g. -count=5 at least) to get reliable results. Second, benchmarking like this is really meant for analyzing a single operation; it confuses the analysis when a repetition of 1024 operations is benchmarked, so that the numbers have to be divided by 1024 to get comparable figures. The Go benchmarking library can reliably measure operations a few instructions long, so there is no need to repeat them N times per measurement. Sometimes you will want to see how an operation changes based on something like length, e.g. hashing 8 versus 128 characters, but in those cases the operation is "hashing" and you should have separate benchmarks to measure performance at different lengths - which is a great use case for sub-benchmarks with b.Run(). Finally, as others have mentioned, most of these are micro-optimizations that make your code less maintainable; e.g. you should always iterate over a slice with range unless you have some logical reason not to.
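A sketch of what that looks like: one hash per iteration, with the input length varied per benchmark rather than via an inner repetition loop. In a normal _test.go file each size would hang off `b.Run(fmt.Sprintf("len=%d", size), ...)`; here `testing.Benchmark` drives the same bodies outside `go test` so the example is self-contained (the function name is mine, not from the repo).

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// benchSHA256 returns a benchmark measuring a single hash of a size-byte
// input per iteration - no inner loop, so ns/op needs no dividing.
func benchSHA256(size int) func(b *testing.B) {
	data := make([]byte, size)
	return func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sha256.Sum256(data)
		}
	}
}

func main() {
	for _, size := range []int{8, 128} {
		r := testing.Benchmark(benchSHA256(size))
		fmt.Printf("len=%d: %d ns/op\n", size, r.NsPerOp())
	}
}
```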
Yep! 1024. Good eye! Some tests, like the atomic ones, use an even larger constant (which explains the crazy 355971 nsec/op result for an atomic increment):
Yes, that one really caught my eye, because if an atomic int increment took ~356 microseconds, I have some code that would have exploded a long time ago.
I'm not sure what the idea is there, the benchmark system can accommodate small operations, I've used it before on things like single increments or single interface calls and gotten reasonable answers.
Also benchmarking atomic int increment without any contention is not necessarily "useless", but certainly not a full picture if one is investigating using it in a contended data structure. (My usage of it is mostly just counters, where I expect thousands upon thousands of instructions to be between the increment of any given one of them, meaning there's probably no contention to speak of even on production multi-core servers, so I have no idea how they perform under serious contention. But AFAIK it's just the hardware-supported atomic instructions, so what I don't know about is the hardware, not really Go per se.)
> I'm not sure what the idea is there, the benchmark system can accommodate small operations, I've used it before on things like single increments or single interface calls and gotten reasonable answers.
Yeah, I've been wondering if I'm missing something, because adding a second redundant loop to b.N is common in go repos, but seems pointless and needlessly obscuring.
These results should be taken with a big grain of salt. A lot of stuff true in this code will not be true at all in YOUR code. This benchmarking code, for instance, is trivial and many parts will get optimized by the compiler. This likely won't be true in your more complex code.
That can be a real pitfall for benchmarking. I recall looking at some promotional material touting the performance advantages of Rust against some other language (C?). It involved a loop, so I kept increasing the loop count to try to get the Rust execution to take the same time as the other language. It took me a while, but I finally figured out that the way the code was structured, the Rust compiler could see that the results of the computation were never used, and the loop was simply optimized away.
Benchmarks measure the execution time of algorithms - or, if you're not careful, the ability of the compiler to optimize away unnecessary code.
(IMO the 'performance comparison' that was vastly in favor of Rust was a bit disingenuous.)
Unless it is a hot path and the CPU is your bottleneck, most of these optimizations would make the code less maintainable / less safe, and I wouldn't personally do any of them.
What good is maintainable, easily extendable, neatly structured code when it's slow and doesn't meet the execution time criteria imposed by your client?
For most products the "time" that matters is "time to market" and "time to resolution" on bugs. Most business uses cases don't require the kind of performance we expect of them.
Depending on industry, time is relative. These optimizations will save you milliseconds on tight loops. In reality most apps aren't spending their time on tight loops. They are spending their time waiting for I/O or using inefficient algorithms. Neither of which these optimizations solve for.
If you are underperforming spec by 100ms it is likely you have one big change you need to make to save 100ms not 100 small micro-optimizations.
As you stated yourself: Most projects, not all.
The 1% of projects that are performance-critical still matter to thousands of engineers.
Just because most people don't have to do this doesn't mean there aren't still people who have to optimize wherever possible.
I hope you understand that optimization without profiling is worthless. Once profiling identifies something like your data structure as a bottleneck, it might be worth investigating that hint.
And these kind of articles are a nice summary of work being done, giving an overview over techniques and small tweaks that you might not have thought about previously.
It is good work by the author and he deserves to be recognized for it, not dismissed with the bland statement "don't waste your time with optimization."
Please use defer when there is only one code path as well unless it is a hot function and you have measured a noticeable performance decrease.
Code will get changed over time, and one code path often becomes two or more. Chances are, the next developer who adds code will not realize that the resource needs to be freed or closed, and will simply forget to add the defer. The only place where defer really gets in your way is error handling (because it can get verbose if you handle errors in a defer).
In the grand scheme of things, defer is not noticeably slower.
The 5300+ ns/op is for BenchMarkSize (1024) defer operations, so we are actually paying about 5 ns per defer. Not sure why the author chose to report the ns/op number that way.
"so we are actually paying about 5ns/op for defer"
Yeah, it's in the class of "things you should know about when optimizing tight loops", but not "things you should always be worrying about"; remember 5 ns is on the order of a branch mispredict or an L2 cache miss [1]. If you haven't already squeezed out all the main-memory accesses you can from your algorithm at ~100 ns a pop, or if you're in code dealing with networks (routinely microseconds even within datacenters; milliseconds if you have to leave), or files, or anything else like that, micro-optimizing defers is not going to have any visible effect on your speed, but it can very badly damage your code's correctness and ease of writing and modifying.
The other thing to know about defers is that they are not declarations, they are instructions; every time the program counter moves past one, another defer call is added to the stack of calls to make while exiting the function. So you do have to be careful about deferring in a loop, not so much because it's "dangerous" but just because it's easy to be unaware that the defers are not declarations. If you intend to do it, it's fine, it can just bite you if you do it accidentally.
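A sketch of the loop pitfall and the usual fix (function names are mine): the first version keeps every file open until the whole function returns; wrapping the body in a closure releases each file at the end of its own iteration.

```go
package main

import (
	"fmt"
	"os"
)

// Each executed defer statement pushes another pending call onto the
// function's deferred stack, so deferring inside a loop accumulates N
// open files that are only closed when processAllLeaky returns.
func processAllLeaky(paths []string) error {
	for _, p := range paths {
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close() // piles up: nothing closes until the function exits
		// ... read f ...
	}
	return nil
}

// Wrapping the body in a function scopes each defer to one iteration.
func processAll(paths []string) error {
	for _, p := range paths {
		if err := func() error {
			f, err := os.Open(p)
			if err != nil {
				return err
			}
			defer f.Close() // runs when this closure returns, once per iteration
			// ... read f ...
			return nil
		}(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(processAll(nil) == nil) // true: no paths, nothing to open
}
```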
Thanks, that is what I was looking for. I used defer just as convenience and wasn't aware (anymore) that it does something different than just pushing the function call to the end. Maybe it's time to take the Golang tour again ;-)
Defer is for reclaiming resources, or other cleanup at the end of a function which cannot otherwise be handled by the language, and (usually?) this is not going to be in your hot path; for example opening and closing a file many many times is not really a good thing to do anyway.
That repo doesn't encourage anything. It's presenting data that the author found useful for another project they were working on, which presumably did benefit from micro-optimization. You are right that micro-optimization is usually not needed, but the repo and its author never claimed otherwise.
Your assessment of the situation encourages mindless ignoring of optimization.
It always depends on the situation. Stuff that is done once in a blue moon... I don't care, as long as it's reasonably fast.
For stuff that's on the hot path you sometimes have to pull up your sleeves and optimize where possible.
At work I just had to implement a custom heap because the one provided by the standard library didn't fit our problem.
Please don't discourage people creating this kind of content. It matters to far more people than you might have in mind.
Most of the time it doesn't matter - except for when it does. Your statement is overreaching and is a kind of premature optimization in itself: it chooses to skip understanding when it is important to optimize in this manner and when it isn't.
The author's use case is very beneficial to optimize in this manner.
There are several ways to iterate over a slice in Go.
They do have some performance differences, depending on the element types of the slices. See the example at the end of this article for details: https://go101.org/article/value-copy-cost.html
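A quick sketch of the three common forms (the element type is contrived to make the difference visible): the two-value `range` copies each element into the loop variable, which costs real time when elements are large.

```go
package main

import "fmt"

type big struct{ a, b, c, d [16]int64 } // 512-byte elements

func main() {
	s := make([]big, 3)
	sum := 0

	// Index-only range: no element copy, cheap regardless of element size.
	for i := range s {
		sum += int(s[i].a[0])
	}

	// Two-value range: copies each 512-byte element into v per iteration.
	for _, v := range s {
		sum += int(v.a[0])
	}

	// Classic index loop: equivalent in cost to the index-only range.
	for i := 0; i < len(s); i++ {
		sum += int(s[i].a[0])
	}

	fmt.Println(sum) // 0: the slice is zero-valued
}
```

For small element types (ints, pointers, small structs) the compiler generates essentially the same code for all three, which is why `range` is the right default.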
Sure, "premature optimization is the root of all evil" and so on. But what I find much worse, is that hashing algorithms are put side-by-side without any context, making it look like MD5 is better than SHA-2. That's downright irresponsible.