Golang Benchmarks

Various benchmarks for different patterns in Go. Some of these are implementations of, or were inspired by, this excellent article on performance in Go. Furthermore, the Golang wiki provides a list of compiler optimizations.

Lies, damned lies, and benchmarks.

Allocate on Stack vs Heap

allocate_stack_vs_heap_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkAllocateFooStack | 1000000000 | 2.27 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkAllocateBarStack | 1000000000 | 2.27 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkAllocateFooHeap | 50000000 | 29.0 ns/op | 32 B/op | 1 allocs/op |
| BenchmarkAllocateBarHeap | 50000000 | 30.2 ns/op | 32 B/op | 1 allocs/op |
| BenchmarkAllocateSliceHeapNoEscape | 50000000 | 32.3 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkAllocateSliceHeapEscape | 5000000 | 260 ns/op | 1024 B/op | 1 allocs/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark just looks at the difference in performance between allocating a struct on the stack versus on the heap. As expected, allocating a struct on the stack is much faster than allocating it on the heap. The two structs I used in the benchmark are below:

type Foo struct {
	foo int64
	bar int64
	baz int64
}

type Bar struct {
	foo int64
	bar int64
	baz int64
	bah int64
}

One interesting thing is that although Foo is only 24 bytes, 32 bytes are allocated when it is placed on the heap, and 32 bytes are allocated for Bar on the heap as well. When I first saw this, my initial suspicion was that Go's memory allocator allocates memory in certain bin sizes rather than the exact size of the struct; since there is no bin size between 24 and 32 bytes, Foo was allocated in the next largest bin size, which is 32 bytes. This blog post examines a similar phenomenon in Rust and its memory allocator jemalloc. As for Go, I found the following in the file runtime/malloc.go:

// Allocating a small object proceeds up a hierarchy of caches:
//
//	1. Round the size up to one of the small size classes
//	   and look in the corresponding mspan in this P's mcache.
//  ...

The last two benchmarks look at an optimization the Go compiler performs. If it can prove through escape analysis that a slice does not escape the calling function, then it allocates the data for the slice on the stack instead of the heap. More information can be found on this golang-nuts post.
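As a rough sketch of that distinction (the function names here are mine, not the ones in the test file), a slice that never leaves its function can be stack allocated, while one that is returned must be heap allocated. Running go build -gcflags=-m prints the compiler's escape analysis decisions.

// sumLocal's slice never leaves the function, so escape analysis can prove
// it does not escape and the backing array can live on the stack.
func sumLocal() int {
	s := make([]int, 128)
	for i := range s {
		s[i] = i
	}
	total := 0
	for _, v := range s {
		total += v
	}
	return total
}

// leakSlice returns its slice, so the slice escapes the function and the
// backing array must be allocated on the heap.
func leakSlice() []int {
	s := make([]int, 128)
	for i := range s {
		s[i] = i
	}
	return s
}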

Append

append_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkAppendLoop | 500000 | 2456 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkAppendVariadic | 20000000 | 97.1 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.8.1 darwin/amd64

This benchmark looks at the performance difference between appending the values of one slice into another slice one by one, i.e. dst = append(dst, src[i]), versus appending them all at once, i.e. dst = append(dst, src...). As the benchmarks show, using the variadic approach is faster. My suspicion is that this is because the compiler can optimize this away into a single memcpy.
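A minimal sketch of the two patterns being compared (function names are illustrative, not the ones in the test file):

func appendLoop(dst, src []int) []int {
	for i := range src {
		dst = append(dst, src[i]) // one element per append call
	}
	return dst
}

func appendVariadic(dst, src []int) []int {
	return append(dst, src...) // one call; the runtime can do a bulk copy
}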

Atomic Operations

atomic_operations_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkAtomicLoad32 | 2000000000 | 1.77 ns/op |
| BenchmarkAtomicLoad64 | 1000000000 | 1.79 ns/op |
| BenchmarkAtomicStore32 | 50000000 | 27.5 ns/op |
| BenchmarkAtomicStore64 | 100000000 | 25.2 ns/op |
| BenchmarkAtomicAdd32 | 50000000 | 27.1 ns/op |
| BenchmarkAtomicAdd64 | 50000000 | 27.8 ns/op |
| BenchmarkAtomicCAS32 | 50000000 | 28.8 ns/op |
| BenchmarkAtomicCAS64 | 50000000 | 28.6 ns/op |

Generated using go version go1.7.5 darwin/amd64

These benchmarks look at various atomic operations on 32- and 64-bit integers. The only thing that really stands out is that loads are significantly faster than all other operations. I suspect there are two reasons for this: there is no cache-line invalidation because only reads are performed, and on x86_64, loads and stores using movq are atomic when performed on naturally aligned addresses. I took a look at the Load64 function in src/sync/atomic/asm_amd64.go:

TEXT ·LoadInt64(SB),NOSPLIT,$0-16
	JMP	·LoadUint64(SB)

TEXT ·LoadUint64(SB),NOSPLIT,$0-16
	MOVQ	addr+0(FP), AX
	MOVQ	0(AX), AX
	MOVQ	AX, val+8(FP)
	RET

It uses Go's assembly language, which I'm not too familiar with, but it appears to move the address of the integer into the AX register in the first instruction, move the value pointed to by that address into the AX register in the second instruction, and then move that value into the return value of the function in the third instruction. On x86_64 this Go assembly can likely be translated directly into movq instructions, and since movq is atomic when executed on naturally aligned addresses, the load will be atomic as well.
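For reference, the operations being measured correspond to the functions in the sync/atomic package; a small sketch of each (the variable names are mine):

import "sync/atomic"

var counter32 int32
var counter64 int64

func atomicOperations() {
	_ = atomic.LoadInt64(&counter64)             // load
	atomic.StoreInt32(&counter32, 42)            // store
	atomic.AddInt64(&counter64, 1)               // add
	atomic.CompareAndSwapInt64(&counter64, 1, 2) // compare-and-swap
}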

Bit Tricks

bit_tricks_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkBitTricksModPowerOfTwo | 2000000000 | 0.84 ns/op |
| BenchmarkBitTricksModNonPowerOfTwo | 2000000000 | 1.58 ns/op |
| BenchmarkBitTricksAnd | 2000000000 | 0.46 ns/op |
| BenchmarkBitTricksDividePowerOfTwo | 2000000000 | 0.72 ns/op |
| BenchmarkBitTricksDivideNonPowerOfTwo | 2000000000 | 1.09 ns/op |
| BenchmarkBitTricksShift | 2000000000 | 0.52 ns/op |

Generated using go version go1.8.1 darwin/amd64

These benchmarks look at some micro-optimizations that can be performed when doing division or modulo. The first three benchmarks show the overhead of modulo and how a modulo by a power of two can be replaced with a bitwise AND, which is a faster operation. Likewise, the last three benchmarks show the overhead of division and how division by a power of two can be sped up by performing a right shift.
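A small sketch of the identities involved, assuming a power-of-two divisor such as 8 (the function name is mine):

// For a power of two divisor such as 8, the modulus and the division can be
// replaced with a bitwise AND and a right shift respectively.
func modAndDivByEight(x uint64) (mod, div uint64) {
	mod = x & 7  // same result as x % 8
	div = x >> 3 // same result as x / 8
	return mod, div
}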

Buffered vs Synchronous Channel

buffered_vs_unbuffered_channel_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkSynchronousChannel | 5000000 | 240 ns/op |
| BenchmarkBufferedChannel | 10000000 | 108 ns/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark examines the speed with which one can put objects onto a channel and comes from this golang-nuts forum post. Using a buffered channel is over twice as fast as using a synchronous channel, which makes sense since the goroutine putting objects into the channel need not wait until an object is taken out of the channel before placing another object into it.
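A sketch of the difference being measured; the only thing that changes between the two cases is the channel's capacity (the function name and buffer size are illustrative):

func sendAll(values []int, capacity int) {
	ch := make(chan int, capacity) // capacity 0: synchronous; capacity > 0: buffered
	done := make(chan struct{})
	go func() {
		for range ch { // drain the channel
		}
		close(done)
	}()
	for _, v := range values {
		ch <- v // with a buffer, this only blocks once the buffer is full
	}
	close(ch)
	<-done
}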

Channel vs Ring Buffer

channel_vs_ring_buffer_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkChannelSPSC | 20000000 | 102 ns/op | 8 B/op | 1 allocs/op |
| BenchmarkRingBufferSPSC | 20000000 | 72.2 ns/op | 8 B/op | 1 allocs/op |
| BenchmarkChannelSPMC | 3000000 | 464 ns/op | 8 B/op | 1 allocs/op |
| BenchmarkRingBufferSPMC | 1000000 | 1065 ns/op | 8 B/op | 1 allocs/op |
| BenchmarkChannelMPSC | 3000000 | 447 ns/op | 8 B/op | 1 allocs/op |
| BenchmarkRingBufferMPSC | 300000 | 5033 ns/op | 9 B/op | 1 allocs/op |
| BenchmarkChannelMPMC | 10000 | 193557 ns/op | 8016 B/op | 1000 allocs/op |
| BenchmarkRingBufferMPMC | 30 | 34618237 ns/op | 8000 B/op | 1000 allocs/op |

Generated using go version go1.8.3 darwin/amd64

The blog post So You Wanna Go Fast? also took a look at using channels versus using a lock-free ring buffer. I decided to run similar benchmarks myself and the results are above. The suffixes SPSC, SPMC, MPSC, and MPMC refer to Single Producer Single Consumer, Single Producer Multi Consumer, Multi Producer Single Consumer, and Multi Producer Multi Consumer respectively. The blog post found that for the SPSC case, a channel was faster than a ring buffer when the tests were run on a single thread (GOMAXPROCS=1), but the ring buffer was faster when the tests were run on multiple threads (GOMAXPROCS=8). The blog post also examined the SPMC and MPMC cases and found similar results: channels were faster when run on a single thread and the ring buffer was faster when the tests were run on multiple threads. I ran all the tests with GOMAXPROCS=4, which is the number of logical CPU cores on the machine I ran the tests on (a 2015 MacBook Pro with a 3.1 GHz Intel Core i7 processor, which has 2 physical CPUs, sysctl hw.physicalcpu, and 4 logical CPUs, sysctl hw.logicalcpu). Ultimately, the benchmarks I ran produced different results. They show that in the SPSC and SPMC cases the performance of a channel and a ring buffer are similar, with the ring buffer holding a small advantage. However, in the MPSC and MPMC cases a channel performed much better than a ring buffer did.

defer

defer_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkMutexUnlock | 50000000 | 25.8 ns/op |
| BenchmarkMutexDeferUnlock | 20000000 | 92.1 ns/op |

Generated using go version go1.7.5 darwin/amd64

defer carries a slight performance cost, so for simple use cases it may be preferable to call any cleanup code manually. As this blog post notes, defer can be called from within conditional blocks and deferred calls must also run if a function panics. Therefore, the compiler can't simply insert the deferred call wherever the function returns; defer has to be handled in a more nuanced way, resulting in the performance hit. There is, in fact, an open issue to address the performance cost of defer. Another discussion suggests calling defer mu.Unlock() before one calls mu.Lock() so the defer call is moved out of the critical path:

defer mu.Unlock()
mu.Lock()

False Sharing

false_sharing_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkIncrementFalseSharing | 3000 | 453087 ns/op |
| BenchmarkIncrementNoFalseSharing | 5000 | 246124 ns/op |
| BenchmarkIncrementNoFalseSharingLocalVariable | 20000 | 71624 ns/op |

Generated using go version go1.8.3 darwin/amd64

This example demonstrates the effects of false sharing when multiple goroutines update adjacent variables. In the first benchmark, although the goroutines each update a different variable, those variables lie on the same cache line, so the updates contend with one another. In the second benchmark, however, we introduce some padding to ensure the integers are on different cache lines, so the updates won't interfere with each other. Finally, the last benchmark performs the increments on a local variable and only writes the result to the shared slice at the end.
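A sketch of the padding idea, assuming 64-byte cache lines (the type and field names are mine, not necessarily those in the test file):

import "sync"

// paddedCounter occupies a full 64-byte cache line on typical x86_64 CPUs,
// so increments of neighbouring counters don't invalidate each other's lines.
type paddedCounter struct {
	n int64
	_ [56]byte // 8 bytes of counter + 56 bytes of padding = 64 bytes
}

func incrementAll(counters []paddedCounter, iterations int) {
	var wg sync.WaitGroup
	for i := range counters {
		wg.Add(1)
		go func(c *paddedCounter) {
			defer wg.Done()
			for j := 0; j < iterations; j++ {
				c.n++
			}
		}(&counters[i])
	}
	wg.Wait()
}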

Function Call

function_call_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkPointerToStructMethodCall | 2000000000 | 0.32 ns/op |
| BenchmarkInterfaceMethodCall | 2000000000 | 1.90 ns/op |
| BenchmarkFunctionPointerCall | 2000000000 | 1.91 ns/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark looks at the overhead of three different kinds of function calls: calling a method on a pointer to a struct, calling a method on an interface, and calling a function through a function pointer stored in a struct field. As expected, the method call on the pointer to the struct is the fastest since the compiler knows at compile time which function is being called, whereas the others do not: the interface method call relies on dynamic dispatch at runtime to determine which function to call, and likewise the function to call through the function pointer is determined at runtime. The latter two have almost identical performance.
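A sketch of the three call shapes (the type, interface, and function names are mine; fn is assumed to have been populated, e.g. &adder{fn: func(x int) int { return x + 1 }}):

type adder struct {
	fn func(int) int // function pointer stored in a struct field
}

func (a *adder) addOne(x int) int { return x + 1 }

type incrementer interface {
	addOne(x int) int
}

func callAll(a *adder, i incrementer, x int) int {
	x = a.addOne(x) // direct call: the target is known at compile time
	x = i.addOne(x) // interface call: the target is found via dynamic dispatch
	x = a.fn(x)     // function pointer call: the target is loaded at runtime
	return x
}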

Interface conversion

interface_conversion_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkInterfaceConversion | 2000000000 | 1.32 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkNoInterfaceConversion | 2000000000 | 0.85 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.8.1 darwin/amd64

This benchmark looks at the overhead of converting an interface to its concrete type. Surprisingly, the overhead of the type assertion, while not zero, is pretty minimal at only about 0.5 nanoseconds.
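A minimal sketch of the two cases (the function names are mine, not the ones in the test file):

// withConversion receives the value through an interface and must assert it
// back to its concrete type before using it.
func withConversion(v interface{}) int {
	return v.(int) + 1
}

// withoutConversion already has the concrete type in hand.
func withoutConversion(v int) int {
	return v + 1
}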

Map Lookup

map_lookup_test.go

Generated using go version go1.9.3 darwin/amd64

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkMapUint64 | 50000 | 24322 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkMapString1 | 100000 | 22587 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkMapString10 | 50000 | 28892 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkMapString100 | 50000 | 35785 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkMapString1000 | 20000 | 85820 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkMapString10000 | 2000 | 1046144 ns/op | 0 B/op | 0 allocs/op |

This benchmark looks at the time taken to perform lookups in a map with different key types. The motivation for this benchmark comes from a talk given by Björn Rabenstein titled "How to Optimize Go Code for Really High Performance". In the talk, Björn includes a section looking at the performance of map lookups with different types of keys. He found that as the string keys got longer, the performance of the lookup got significantly worse. This isn't entirely unexpected: a map lookup requires us to hash the string, and the longer the string, the longer that hash calculation will take; it also has to compare the lookup key with the key in the corresponding bucket to verify that they match, and here again the longer the string, the longer that comparison will take. As a result of this degradation, Prometheus decided to use maps with int64 keys and perform collision detection themselves.

Memset optimization

memset_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkSliceClearZero/1K | 100000000 | 12.9 ns/op |
| BenchmarkSliceClearZero/16K | 10000000 | 167 ns/op |
| BenchmarkSliceClearZero/128K | 300000 | 3994 ns/op |
| BenchmarkSliceClearNonZero/1K | 3000000 | 497 ns/op |
| BenchmarkSliceClearNonZero/16K | 200000 | 7891 ns/op |
| BenchmarkSliceClearNonZero/128K | 20000 | 79763 ns/op |

Generated using go version go1.9.2 darwin/amd64

This benchmark looks at the Go compiler's optimization for clearing slices to the respective type's zero value. Specifically, if s is a slice or an array then the following loop is optimized with memclr calls:

for i := range s {
	s[i] = <zero value for the element type of s>
}

If the value is not the zero value of the type, though, then the loop is not optimized, as the benchmarks show. The library go-memset provides a function which optimizes clearing byte slices with any value, not just zero.

Mutex

mutex_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkNoMutexLock | 2000000000 | 1.18 ns/op |
| BenchmarkRWMutexReadLock | 30000000 | 54.5 ns/op |
| BenchmarkRWMutexLock | 20000000 | 96.0 ns/op |
| BenchmarkMutexLock | 20000000 | 78.7 ns/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark looks at the cost of acquiring different kinds of locks. In the first benchmark we don't acquire any lock. In the second benchmark we acquire a read lock on a RWMutex. In the third we acquire a write lock on a RWMutex. And in the last benchmark we acquire a regular Mutex lock.
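A sketch of the three lock kinds (the variable and function names are mine, not the ones in the test file):

import "sync"

var (
	mu      sync.Mutex
	rwmu    sync.RWMutex
	counter int64
)

func readWithRWMutex() int64 {
	rwmu.RLock() // read lock: many readers may hold it simultaneously
	v := counter
	rwmu.RUnlock()
	return v
}

func writeWithRWMutex(v int64) {
	rwmu.Lock() // write lock: exclusive access
	counter = v
	rwmu.Unlock()
}

func writeWithMutex(v int64) {
	mu.Lock() // plain Mutex: always exclusive
	counter = v
	mu.Unlock()
}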

Non-cryptographic Hash functions

non_cryptographic_hash_functions_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkHash32Fnv | 20000000 | 70.3 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32Fnva | 20000000 | 70.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64Fnv | 20000000 | 71.1 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64Fnva | 20000000 | 77.1 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32Crc | 30000000 | 87.5 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64Crc | 10000000 | 175 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32Adler | 30000000 | 40.3 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32Xxhash | 30000000 | 46.1 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64Xxhash | 30000000 | 47.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32Murmur3 | 20000000 | 59.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash128Murmur3 | 20000000 | 63.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64CityHash | 30000000 | 57.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash128CityHash | 20000000 | 113 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32FarmHash | 30000000 | 44.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64FarmHash | 50000000 | 26.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash128FarmHash | 30000000 | 40.3 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64SipHash | 30000000 | 39.3 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash128SipHash | 30000000 | 44.9 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64HighwayHash | 50000000 | 36.9 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32SpookyHash | 30000000 | 58.1 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64SpookyHash | 20000000 | 62.7 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash128SpookyHash | 30000000 | 68.2 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHashMD5 | 10000000 | 169 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64MetroHash | 100000000 | 18.6 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash128MetroHash | 30000000 | 48.8 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.8.3 darwin/amd64

These benchmarks look at the speed of various non-cryptographic hash function implementations in Go.

Pass By Value vs Reference

pass_by_value_vs_reference_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkPassByReferenceOneWord | 1000000000 | 2.20 ns/op |
| BenchmarkPassByValueOneWord | 1000000000 | 2.58 ns/op |
| BenchmarkPassByReferenceFourWords | 500000000 | 2.71 ns/op |
| BenchmarkPassByValueFourWords | 1000000000 | 2.78 ns/op |
| BenchmarkPassByReferenceEightWords | 1000000000 | 2.32 ns/op |
| BenchmarkPassByValueEightWords | 300000000 | 4.35 ns/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark looks at the performance cost of passing a variable by reference versus passing it by value. For small structs there doesn't appear to be much of a difference, but as the structs get larger we start to see a bit of a difference, which is to be expected since the larger the struct, the more words that have to be copied onto the function's stack when it is passed by value.
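A sketch of the eight-word case (the type and function names are mine, not the ones in the test file):

// eightWords is 64 bytes on a 64-bit platform: eight machine words.
type eightWords struct {
	a, b, c, d, e, f, g, h int64
}

// byValue copies all eight words onto the callee's stack frame.
func byValue(w eightWords) int64 { return w.a + w.h }

// byReference copies only one word: the pointer.
func byReference(w *eightWords) int64 { return w.a + w.h }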

Pool

pool_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkAllocateBufferNoPool | 20000000 | 118 ns/op | 368 B/op | 2 allocs/op |
| BenchmarkChannelBufferPool | 10000000 | 213 ns/op | 43 B/op | 0 allocs/op |
| BenchmarkSyncBufferPool | 50000000 | 27.7 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark compares three different memory allocation schemes. The first approach just allocates its buffer on the heap normally; after it's done using the buffer, the buffer will eventually be garbage collected. The second approach uses Go's sync.Pool type, which caches objects between runs of the garbage collector. The last approach uses a channel to permanently pool objects. The difference between the last two approaches is that sync.Pool dynamically resizes itself and clears items from the pool during a GC run. Two good resources to learn more about pools in Go are the blog posts Using Buffer Pools with Go and How to Optimize Garbage Collection in Go.
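A sketch of the two pooling approaches (the names and the pool size of 64 are mine, not necessarily what the test file uses):

import (
	"bytes"
	"sync"
)

// syncPool caches buffers between uses; it may be emptied during a GC run.
var syncPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// channelPool permanently holds up to 64 buffers.
var channelPool = make(chan *bytes.Buffer, 64)

func getBuffer() *bytes.Buffer {
	select {
	case b := <-channelPool:
		return b
	default:
		return new(bytes.Buffer) // pool empty: allocate a fresh buffer
	}
}

func putBuffer(b *bytes.Buffer) {
	b.Reset()
	select {
	case channelPool <- b:
	default: // pool full: let the GC reclaim the buffer
	}
}

func useBuffers() {
	// sync.Pool usage.
	b := syncPool.Get().(*bytes.Buffer)
	b.WriteString("hello")
	b.Reset()
	syncPool.Put(b)

	// channel-based pool usage.
	c := getBuffer()
	c.WriteString("hello")
	putBuffer(c)
}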

Pool Put Non Interface

pool_put_non_interface_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkPoolM3XPutSlice | 5000000 | 282 ns/op | 32 B/op | 1 allocs/op |
| BenchmarkPoolM3XPutPointerToSlice | 5000000 | 327 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkPoolSyncPutSlice | 10000000 | 184 ns/op | 32 B/op | 1 allocs/op |
| BenchmarkPoolSyncPutPointerToSlice | 10000000 | 177 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.8.3 darwin/amd64

This benchmark looks at the cost of pooling slices. Since slices are three words, they cannot be coerced into interfaces without an allocation; see the comments on this CL for more details. Consequently, putting a slice into a pool requires the three words of the slice header to be allocated on the heap. This cost will admittedly likely be offset by the savings from pooling the actual data backing the slice, but these benchmarks isolate just that cost. Indeed, we see that although putting a slice into a pool does require an additional allocation, there does not appear to be a significant cost in speed.
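A sketch of the two ways of putting a slice into a sync.Pool (the names and the 1024-byte capacity are mine):

import "sync"

// slicePool stores slice headers directly; every Put converts a three-word
// slice header into an interface{}, which forces a heap allocation.
var slicePool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 1024) },
}

// slicePtrPool stores pointers to slices; a pointer fits in an interface{}
// without an extra allocation.
var slicePtrPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, 1024)
		return &b
	},
}

func usePools() {
	s := slicePool.Get().([]byte)
	slicePool.Put(s[:0]) // allocates the slice header on the heap

	p := slicePtrPool.Get().(*[]byte)
	*p = (*p)[:0]
	slicePtrPool.Put(p) // no per-Put allocation
}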

Rand

rand_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkGlobalRandInt63 | 20000000 | 115 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkLocalRandInt63 | 300000000 | 3.95 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkGlobalRandFloat64 | 20000000 | 96.5 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkLocalRandFloat64 | 200000000 | 6.00 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.8.3 darwin/amd64

Go's math/rand package exposes various functions for generating random numbers, for example Int63. These functions use a global Rand struct which is created by the package. This struct uses a lock to serialize access to its random number source, which can lead to contention if multiple goroutines are all trying to generate random numbers using the global struct. Consequently, these benchmarks look at the performance improvement that comes from giving each goroutine its own Rand struct so it doesn't need to acquire the shared lock. This blog post explores similar optimizations for using the math/rand package for those who are interested.
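A sketch of the per-goroutine approach (the function name is mine; seeding from the clock is just for illustration):

import (
	"math/rand"
	"time"
)

// generate uses a goroutine-local *rand.Rand, so it never touches the lock
// protecting the package-level source. Note that an individual *rand.Rand is
// itself not safe for concurrent use and should not be shared.
func generate(n int) []int64 {
	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	out := make([]int64, n)
	for i := range out {
		out[i] = rng.Int63() // rand.Int63() would contend on the global lock
	}
	return out
}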

Random Bounded Numbers

random_bounded_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkStandardBoundedRandomNumber | 100000000 | 18.3 ns/op |
| BenchmarkBiasedFastBoundedRandomNumber | 100000000 | 11.0 ns/op |
| BenchmarkUnbiasedFastBoundedRandomNumber | 50000000 | 40.5 ns/op |

Generated using go version go1.8.1 darwin/amd64

Benchmarks for three different algorithms for generating a random bounded number, as discussed in the blog post Fast random shuffling. The first result is the standard approach of generating a random number and taking its modulus with the bound. The second approach implements the algorithm discussed in the aforementioned blog post, which avoids the modulus operator. The third is an unbiased version of the second algorithm. As mentioned in the article, for most applications the second algorithm will be sufficient, since the bias it introduces is likely smaller than the bias of the pseudo-random number generator being used.
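A sketch of the three approaches described in the post, using math/rand as the source (the function names are mine, not the ones in the test file):

import "math/rand"

// standardBounded uses the usual modulo reduction.
func standardBounded(n uint32) uint32 {
	return rand.Uint32() % n
}

// biasedFastBounded maps a 32-bit random value into [0, n) with a multiply
// and a shift; it is very slightly biased.
func biasedFastBounded(n uint32) uint32 {
	return uint32((uint64(rand.Uint32()) * uint64(n)) >> 32)
}

// unbiasedFastBounded rejects the few values that would introduce bias.
func unbiasedFastBounded(n uint32) uint32 {
	product := uint64(rand.Uint32()) * uint64(n)
	low := uint32(product)
	if low < n {
		threshold := -n % n // (2^32 - n) % n, computed in uint32 arithmetic
		for low < threshold {
			product = uint64(rand.Uint32()) * uint64(n)
			low = uint32(product)
		}
	}
	return uint32(product >> 32)
}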

Range over Arrays and Slices

range_array_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkIndexRangeArray | 100000000 | 10.6 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkIndexValueRangeArray | 100000000 | 14.1 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkIndexValueRangeArrayPtr | 100000000 | 10.1 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkIndexSlice | 100000000 | 10.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkIndexValueSlice | 100000000 | 10.3 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.8.3 darwin/amd64

These tests look at different ways to range over an array or slice. The first three benchmarks range over an array. The first uses just the index into the array (for i := range a), the second uses both the index and the value (for i, v := range a), and the third uses the index and value while ranging over a pointer to the array (for i, v := range &a). What's interesting to note is that the second benchmark is noticeably slower than the other two. This is because Go makes a copy of the array when you range over both the index and the value. Another example of this can be seen in this tweet by Damian Gryski, and there is even a linter to catch it. The last two benchmarks look at ranging over a slice: the first uses just the index into the slice and the second uses both the index and the value. Unlike in the case of an array, there is no difference in performance here.
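A sketch of the copy being made and how ranging over a pointer avoids it (the names and the array size are mine):

var a [1024]int64

// sumIndexValue ranges with both the index and the value, which iterates
// over a copy of the entire array.
func sumIndexValue() int64 {
	var total int64
	for _, v := range a {
		total += v
	}
	return total
}

// sumIndexValuePtr ranges over a pointer to the array, avoiding the copy.
func sumIndexValuePtr() int64 {
	var total int64
	for _, v := range &a {
		total += v
	}
	return total
}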

Reducing an Integer

reduction_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkReduceModuloPowerOfTwo | 500000000 | 3.41 ns/op |
| BenchmarkReduceModuloNonPowerOfTwo | 500000000 | 3.44 ns/op |
| BenchmarkReduceAlternativePowerOfTwo | 2000000000 | 0.84 ns/op |
| BenchmarkReduceAlternativeNonPowerOfTwo | 2000000000 | 0.84 ns/op |

Generated using go version go1.8.1 darwin/amd64

This benchmark compares two different approaches for reducing an integer into a given range. The first two benchmarks use the traditional approach of taking the modulus of the integer with the length of the target range. The latter two benchmarks implement an alternative approach described in A fast alternative to the modulo reduction. As the benchmarks show, the alternative approach provides superior performance. It was invented because modulus division is a slow instruction on modern processors compared to other common instructions, and while a modulus by a power of two can be replaced with a bitwise AND, the same cannot be done when the bound is not a power of two. The alternative approach is fair in that every integer in the range [0,N) will have either ceil(2^32/N) or floor(2^32/N) integers in the range [0,2^32) mapped to it. However, unlike modulus division, which preserves the lower-order bits of information (so that k and k+1 map to different integers as long as N != 1), the alternative implementation preserves the higher-order bits (so k and k+1 have a much higher likelihood of being mapped to the same integer). This means it can't be used in hashmaps that use probing to resolve collisions, since probing usually adds the probe offset to the lower-order bits (for example, linear probing adds 1 to the hash value), though one can certainly imagine a probing function that adds the offset to the higher-order bits instead.
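The core of the alternative reduction is a single multiply and shift; a sketch (the function name is mine):

// reduce maps x into [0, n) with a multiply and a shift instead of a
// modulus. It keeps the high-order bits of x rather than the low-order ones.
func reduce(x, n uint32) uint32 {
	return uint32((uint64(x) * uint64(n)) >> 32)
}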

Slice Initialization Append vs Index

slice_intialization_append_vs_index_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkSliceInitializationAppend | 10000000 | 132 ns/op | 160 B/op | 1 allocs/op |
| BenchmarkSliceInitializationIndex | 10000000 | 119 ns/op | 160 B/op | 1 allocs/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark looks at slice initialization with append versus using an explicit index. I ran this benchmark a few times and it seesawed back and forth. Ultimately, I think they compile down into the same code so there probably isn't any actual performance difference. I'd like to take an actual look at the assembly that they are compiled to and update this section in the future.
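A sketch of the two initialization styles (the length and function names are mine; both versions perform exactly one allocation):

const length = 20 // arbitrary length for illustration

func initializeAppend() []int {
	s := make([]int, 0, length) // length 0, capacity preallocated
	for i := 0; i < length; i++ {
		s = append(s, i)
	}
	return s
}

func initializeIndex() []int {
	s := make([]int, length) // full length up front
	for i := 0; i < length; i++ {
		s[i] = i
	}
	return s
}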

String Concatenation

string_concatenation_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkStringConcatenation | 20000000 | 83.9 ns/op | 64 B/op | 1 allocs/op |
| BenchmarkStringBuffer | 10000000 | 131 ns/op | 64 B/op | 1 allocs/op |
| BenchmarkStringJoin | 10000000 | 144 ns/op | 128 B/op | 2 allocs/op |
| BenchmarkStringConcatenationShort | 50000000 | 25.4 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark looks at three different ways to perform string concatenation: the first uses the builtin + operator, the second uses a bytes.Buffer, and the third uses strings.Join. It seems using + is preferable to either of the other approaches, which are similar in performance.
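A sketch of the three approaches (the function names are mine, not the ones in the test file):

import (
	"bytes"
	"strings"
)

func concatPlus(a, b, c string) string {
	return a + b + c // a single runtime concatenation call
}

func concatBuffer(a, b, c string) string {
	var buf bytes.Buffer
	buf.WriteString(a)
	buf.WriteString(b)
	buf.WriteString(c)
	return buf.String() // copies the buffer's bytes into a new string
}

func concatJoin(a, b, c string) string {
	return strings.Join([]string{a, b, c}, "")
}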

The last benchmark highlights a neat optimization Go performs when concatenating strings with +. The documentation for the string concatenation function in runtime/string.go states:

// The constant is known to the compiler.
// There is no fundamental theory behind this number.
const tmpStringBufSize = 32

type tmpBuf [tmpStringBufSize]byte

// concatstrings implements a Go string concatenation x+y+z+...
// The operands are passed in the slice a.
// If buf != nil, the compiler has determined that the result does not
// escape the calling function, so the string data can be stored in buf
// if small enough.
func concatstrings(buf *tmpBuf, a []string) string {
  ...
}

That is, if the compiler determines that the resulting string does not escape the calling function, it will allocate a 32-byte buffer on the stack which can be used as the underlying buffer for the string if it is 32 bytes or less. In the last benchmark, the resulting string is in fact less than 32 bytes, so it can be stored on the stack, saving a heap allocation.

Type Assertion

type_assertion_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkTypeAssertion | 2000000000 | 0.97 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark looks at the performance cost of a type assertion. I was a little surprised to find it was so cheap.

Write Bytes vs String

write_bytes_vs_string_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkWriteBytes | 100000000 | 18.7 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkWriteString | 20000000 | 63.3 ns/op | 64 B/op | 1 allocs/op |
| BenchmarkWriteUnafeString | 100000000 | 21.1 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.7.5 darwin/amd64

Go's io.Writer interface has a single Write method, which takes a byte slice as an argument. Passing a string to it requires a conversion to a byte slice, which entails a heap allocation. These benchmarks look at the performance cost of writing a byte slice, of converting a string to a byte slice and then writing it, and of using the unsafe and reflect packages to create a byte slice that points to the data underlying the string without an allocation.
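One common shape of that unsafe trick, for the Go versions used here, looks roughly like the sketch below (the function name is mine, and this is not necessarily the exact code in the test file). It is officially discouraged: the returned slice aliases immutable string data and must never be written to.

import (
	"reflect"
	"unsafe"
)

// unsafeBytes returns a []byte that shares the string's backing data without
// copying or allocating. The returned slice must never be modified, and the
// string must stay reachable while the slice is in use.
func unsafeBytes(s string) []byte {
	sh := (*reflect.StringHeader)(unsafe.Pointer(&s))
	var b []byte
	bh := (*reflect.SliceHeader)(unsafe.Pointer(&b))
	bh.Data = sh.Data
	bh.Len = sh.Len
	bh.Cap = sh.Len
	return b
}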