Golang Benchmarks

Various benchmarks for different patterns in Go. Some of these are implementations of, or were inspired by, this excellent article on performance in Go. Furthermore, the Golang wiki provides a list of compiler optimizations.

Lies, damned lies, and benchmarks.

Allocate on Stack vs Heap

allocate_stack_vs_heap_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkAllocateFooStack | 1000000000 | 2.27 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkAllocateBarStack | 1000000000 | 2.27 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkAllocateFooHeap | 50000000 | 29.0 ns/op | 32 B/op | 1 allocs/op |
| BenchmarkAllocateBarHeap | 50000000 | 30.2 ns/op | 32 B/op | 1 allocs/op |
| BenchmarkAllocateSliceHeapNoEscape | 50000000 | 32.3 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkAllocateSliceHeapEscape | 5000000 | 260 ns/op | 1024 B/op | 1 allocs/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark just looks at the difference in performance between allocating a struct on the stack versus on the heap. As expected, allocating a struct on the stack is much faster than allocating it on the heap. The two structs I used in the benchmark are below:

type Foo struct {
	foo int64
	bar int64
	baz int64
}

type Bar struct {
	foo int64
	bar int64
	baz int64
	bah int64
}

One interesting thing is that although Foo is only 24 bytes, 32 bytes are allocated when it is placed on the heap, and 32 bytes are allocated for Bar on the heap as well. When I first saw this, my initial suspicion was that Go's memory allocator allocates memory in certain bin sizes rather than the exact size of the struct; since there is no bin size between 24 and 32 bytes, Foo was allocated in the next largest bin size, which is 32 bytes. This blog post examines a similar phenomenon in Rust and its memory allocator jemalloc. As for Go, I found the following in the file runtime/malloc.go:

// Allocating a small object proceeds up a hierarchy of caches:
//
//	1. Round the size up to one of the small size classes
//	   and look in the corresponding mspan in this P's mcache.
//  ...

The last two benchmarks look at an optimization the Go compiler performs. If it can prove through escape analysis that a slice does not escape the calling function, then it allocates the data for the slice on the stack instead of the heap. More information can be found on this golang-nuts post.
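As a rough sketch of that distinction (the function names here are mine, not the ones in the test file), a slice that never leaves its function can be stack allocated, while one that is returned must be heap allocated. Running go build -gcflags=-m prints the compiler's escape analysis decisions.

// sumLocal's slice never leaves the function, so escape analysis can prove
// it does not escape and the backing array can live on the stack.
func sumLocal() int {
	s := make([]int, 128)
	for i := range s {
		s[i] = i
	}
	total := 0
	for _, v := range s {
		total += v
	}
	return total
}

// leakSlice returns its slice, so the slice escapes the function and the
// backing array must be allocated on the heap.
func leakSlice() []int {
	s := make([]int, 128)
	for i := range s {
		s[i] = i
	}
	return s
}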

Append

append_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkAppendLoop | 500000 | 2456 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkAppendVariadic | 20000000 | 97.1 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.8.1 darwin/amd64

This benchmark looks at the performance difference between appending the values of one slice into another slice one by one, i.e. dst = append(dst, src[i]), versus appending them all at once, i.e. dst = append(dst, src...). As the benchmarks show, using the variadic approach is faster. My suspicion is that this is because the compiler can optimize this away into a single memcpy.
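A minimal sketch of the two patterns being compared (function names are illustrative, not the ones in the test file):

func appendLoop(dst, src []int) []int {
	for i := range src {
		dst = append(dst, src[i]) // one element per append call
	}
	return dst
}

func appendVariadic(dst, src []int) []int {
	return append(dst, src...) // one call; the runtime can do a bulk copy
}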

Atomic Operations

atomic_operations_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkAtomicLoad32 | 2000000000 | 1.77 ns/op |
| BenchmarkAtomicLoad64 | 1000000000 | 1.79 ns/op |
| BenchmarkAtomicStore32 | 50000000 | 27.5 ns/op |
| BenchmarkAtomicStore64 | 100000000 | 25.2 ns/op |
| BenchmarkAtomicAdd32 | 50000000 | 27.1 ns/op |
| BenchmarkAtomicAdd64 | 50000000 | 27.8 ns/op |
| BenchmarkAtomicCAS32 | 50000000 | 28.8 ns/op |
| BenchmarkAtomicCAS64 | 50000000 | 28.6 ns/op |

Generated using go version go1.7.5 darwin/amd64

These benchmarks look at various atomic operations on 32- and 64-bit integers. The only thing that really stands out is that loads are significantly faster than all other operations. I suspect there are two reasons for this: there is no cache-line invalidation because only reads are performed, and on x86_64, loads and stores using movq are atomic when performed on naturally aligned addresses. I took a look at the Load64 function in src/sync/atomic/asm_amd64.go:

TEXT ·LoadInt64(SB),NOSPLIT,$0-16
	JMP	·LoadUint64(SB)

TEXT ·LoadUint64(SB),NOSPLIT,$0-16
	MOVQ	addr+0(FP), AX
	MOVQ	0(AX), AX
	MOVQ	AX, val+8(FP)
	RET

It uses Go's assembly language, which I'm not too familiar with, but it appears to move the address of the integer into the AX register in the first instruction, move the value pointed to by that address into the AX register in the second instruction, and then move that value into the return value of the function in the third instruction. On x86_64 this Go assembly can likely be translated directly into movq instructions, and since movq is atomic when executed on naturally aligned addresses, the load will be atomic as well.
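For reference, the operations being measured correspond to the functions in the sync/atomic package; a small sketch of each (the variable names are mine):

import "sync/atomic"

var counter32 int32
var counter64 int64

func atomicOperations() {
	_ = atomic.LoadInt64(&counter64)             // load
	atomic.StoreInt32(&counter32, 42)            // store
	atomic.AddInt64(&counter64, 1)               // add
	atomic.CompareAndSwapInt64(&counter64, 1, 2) // compare-and-swap
}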

Bit Tricks

bit_tricks_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkBitTricksModPowerOfTwo | 2000000000 | 0.84 ns/op |
| BenchmarkBitTricksModNonPowerOfTwo | 2000000000 | 1.58 ns/op |
| BenchmarkBitTricksAnd | 2000000000 | 0.46 ns/op |
| BenchmarkBitTricksDividePowerOfTwo | 2000000000 | 0.72 ns/op |
| BenchmarkBitTricksDivideNonPowerOfTwo | 2000000000 | 1.09 ns/op |
| BenchmarkBitTricksShift | 2000000000 | 0.52 ns/op |

Generated using go version go1.8.1 darwin/amd64

These benchmarks look at some micro-optimizations that can be performed when doing division or modulo. The first three benchmarks show the overhead of modulo and how a modulo by a power of two can be replaced with a bitwise AND, which is a faster operation. Likewise, the last three benchmarks show the overhead of division and how division by a power of two can be sped up by performing a right shift.
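A small sketch of the identities involved, assuming a power-of-two divisor such as 8 (the function name is mine):

// For a power of two divisor such as 8, the modulus and the division can be
// replaced with a bitwise AND and a right shift respectively.
func modAndDivByEight(x uint64) (mod, div uint64) {
	mod = x & 7  // same result as x % 8
	div = x >> 3 // same result as x / 8
	return mod, div
}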

Buffered vs Synchronous Channel

buffered_vs_unbuffered_channel_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkSynchronousChannel | 5000000 | 240 ns/op |
| BenchmarkBufferedChannel | 10000000 | 108 ns/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark examines the speed with which one can put objects onto a channel and comes from this golang-nuts forum post. Using a buffered channel is over twice as fast as using a synchronous channel, which makes sense since the goroutine putting objects into the channel need not wait until an object is taken out of the channel before placing another object into it.
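A sketch of the difference being measured; the only thing that changes between the two cases is the channel's capacity (the function name and buffer size are illustrative):

func sendAll(values []int, capacity int) {
	ch := make(chan int, capacity) // capacity 0: synchronous; capacity > 0: buffered
	done := make(chan struct{})
	go func() {
		for range ch { // drain the channel
		}
		close(done)
	}()
	for _, v := range values {
		ch <- v // with a buffer, this only blocks once the buffer is full
	}
	close(ch)
	<-done
}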

Channel vs Ring Buffer

channel_vs_ring_buffer_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkChannelSPSC | 20000000 | 102 ns/op | 8 B/op | 1 allocs/op |
| BenchmarkRingBufferSPSC | 20000000 | 72.2 ns/op | 8 B/op | 1 allocs/op |
| BenchmarkChannelSPMC | 3000000 | 464 ns/op | 8 B/op | 1 allocs/op |
| BenchmarkRingBufferSPMC | 1000000 | 1065 ns/op | 8 B/op | 1 allocs/op |
| BenchmarkChannelMPSC | 3000000 | 447 ns/op | 8 B/op | 1 allocs/op |
| BenchmarkRingBufferMPSC | 300000 | 5033 ns/op | 9 B/op | 1 allocs/op |
| BenchmarkChannelMPMC | 10000 | 193557 ns/op | 8016 B/op | 1000 allocs/op |
| BenchmarkRingBufferMPMC | 30 | 34618237 ns/op | 8000 B/op | 1000 allocs/op |

Generated using go version go1.8.3 darwin/amd64

The blog post So You Wanna Go Fast? also took a look at using channels versus using a lock-free ring buffer. I decided to run similar benchmarks myself and the results are above. The suffixes SPSC, SPMC, MPSC, and MPMC refer to Single Producer Single Consumer, Single Producer Multi Consumer, Multi Producer Single Consumer, and Multi Producer Multi Consumer respectively. The blog post found that for the SPSC case, a channel was faster than a ring buffer when the tests were run on a single thread (GOMAXPROCS=1), but the ring buffer was faster when the tests were run on multiple threads (GOMAXPROCS=8). The blog post also examined the SPMC and MPMC cases and found similar results: channels were faster when run on a single thread and the ring buffer was faster when the tests were run on multiple threads. I ran all the tests with GOMAXPROCS=4, which is the number of logical CPU cores on the machine I ran the tests on (a 2015 MacBook Pro with a 3.1 GHz Intel Core i7 processor, which has 2 physical CPUs, sysctl hw.physicalcpu, and 4 logical CPUs, sysctl hw.logicalcpu). Ultimately, the benchmarks I ran produced different results. They show that in the SPSC and SPMC cases the performance of a channel and a ring buffer are similar, with the ring buffer holding a small advantage. However, in the MPSC and MPMC cases a channel performed much better than a ring buffer did.

defer

defer_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkMutexUnlock | 50000000 | 25.8 ns/op |
| BenchmarkMutexDeferUnlock | 20000000 | 92.1 ns/op |

Generated using go version go1.7.5 darwin/amd64

defer carries a slight performance cost, so for simple use cases it may be preferable to call any cleanup code manually. As this blog post notes, defer can be called from within conditional blocks and deferred calls must also run if a function panics. Therefore, the compiler can't simply insert the deferred call wherever the function returns; defer has to be handled in a more nuanced way, resulting in the performance hit. There is, in fact, an open issue to address the performance cost of defer. Another discussion suggests calling defer mu.Unlock() before one calls mu.Lock() so the defer call is moved out of the critical path:

defer mu.Unlock()
mu.Lock()

False Sharing

false_sharing_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkIncrementFalseSharing | 3000 | 453087 ns/op |
| BenchmarkIncrementNoFalseSharing | 5000 | 246124 ns/op |
| BenchmarkIncrementNoFalseSharingLocalVariable | 20000 | 71624 ns/op |

Generated using go version go1.8.3 darwin/amd64

This example demonstrates the effects of false sharing when multiple goroutines update adjacent variables. In the first benchmark, although the goroutines each update a different variable, those variables lie on the same cache line, so the updates contend with one another. In the second benchmark, however, we introduce some padding to ensure the integers are on different cache lines, so the updates won't interfere with each other. Finally, the last benchmark performs the increments on a local variable and only writes the result to the shared slice at the end.
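A sketch of the padding idea, assuming 64-byte cache lines (the type and field names are mine, not necessarily those in the test file):

import "sync"

// paddedCounter occupies a full 64-byte cache line on typical x86_64 CPUs,
// so increments of neighbouring counters don't invalidate each other's lines.
type paddedCounter struct {
	n int64
	_ [56]byte // 8 bytes of counter + 56 bytes of padding = 64 bytes
}

func incrementAll(counters []paddedCounter, iterations int) {
	var wg sync.WaitGroup
	for i := range counters {
		wg.Add(1)
		go func(c *paddedCounter) {
			defer wg.Done()
			for j := 0; j < iterations; j++ {
				c.n++
			}
		}(&counters[i])
	}
	wg.Wait()
}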

Function Call

function_call_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkPointerToStructMethodCall | 2000000000 | 0.32 ns/op |
| BenchmarkInterfaceMethodCall | 2000000000 | 1.90 ns/op |
| BenchmarkFunctionPointerCall | 2000000000 | 1.91 ns/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark looks at the overhead of three different kinds of function calls: calling a method on a pointer to a struct, calling a method on an interface, and calling a function through a function pointer stored in a struct field. As expected, the method call on the pointer to the struct is the fastest since the compiler knows at compile time which function is being called, whereas the others do not: the interface method call relies on dynamic dispatch at runtime to determine which function to call, and likewise the function to call through the function pointer is determined at runtime. The latter two have almost identical performance.
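A sketch of the three call shapes (the type, interface, and function names are mine; fn is assumed to have been populated, e.g. &adder{fn: func(x int) int { return x + 1 }}):

type adder struct {
	fn func(int) int // function pointer stored in a struct field
}

func (a *adder) addOne(x int) int { return x + 1 }

type incrementer interface {
	addOne(x int) int
}

func callAll(a *adder, i incrementer, x int) int {
	x = a.addOne(x) // direct call: the target is known at compile time
	x = i.addOne(x) // interface call: the target is found via dynamic dispatch
	x = a.fn(x)     // function pointer call: the target is loaded at runtime
	return x
}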

Interface conversion

interface_conversion_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkInterfaceConversion | 2000000000 | 1.32 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkNoInterfaceConversion | 2000000000 | 0.85 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.8.1 darwin/amd64

This benchmark looks at the overhead of converting an interface to its concrete type. Surprisingly, the overhead of the type assertion, while not zero, is pretty minimal at only about 0.5 nanoseconds.
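A minimal sketch of the two cases (the function names are mine, not the ones in the test file):

// withConversion receives the value through an interface and must assert it
// back to its concrete type before using it.
func withConversion(v interface{}) int {
	return v.(int) + 1
}

// withoutConversion already has the concrete type in hand.
func withoutConversion(v int) int {
	return v + 1
}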

Map Lookup

map_lookup_test.go

Generated using go version go1.9.3 darwin/amd64

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkMapUint64 | 50000 | 24322 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkMapString1 | 100000 | 22587 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkMapString10 | 50000 | 28892 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkMapString100 | 50000 | 35785 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkMapString1000 | 20000 | 85820 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkMapString10000 | 2000 | 1046144 ns/op | 0 B/op | 0 allocs/op |

This benchmark looks at the time taken to perform lookups in a map with different key types. The motivation for this benchmark comes from a talk given by Björn Rabenstein titled "How to Optimize Go Code for Really High Performance". In the talk, Björn includes a section looking at the performance of map lookups with different types of keys. He found that as the string keys got longer, the performance of the lookup got significantly worse. This isn't entirely unexpected: a map lookup requires us to hash the string, and the longer the string, the longer that hash calculation will take; it also has to compare the lookup key with the key in the corresponding bucket to verify that they match, and here again the longer the string, the longer that comparison will take. As a result of this degradation, Prometheus decided to use maps with int64 keys and perform collision detection themselves.

Memset optimization

memset_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkSliceClearZero/1K | 100000000 | 12.9 ns/op |
| BenchmarkSliceClearZero/16K | 10000000 | 167 ns/op |
| BenchmarkSliceClearZero/128K | 300000 | 3994 ns/op |
| BenchmarkSliceClearNonZero/1K | 3000000 | 497 ns/op |
| BenchmarkSliceClearNonZero/16K | 200000 | 7891 ns/op |
| BenchmarkSliceClearNonZero/128K | 20000 | 79763 ns/op |

Generated using go version go1.9.2 darwin/amd64

This benchmark looks at the Go compiler's optimization for clearing slices to the respective type's zero value. Specifically, if s is a slice or an array then the following loop is optimized with memclr calls:

for i := range s {
	s[i] = <zero value for the element type of s>
}

If the value is not the zero value of the type, though, then the loop is not optimized, as the benchmarks show. The library go-memset provides a function which optimizes clearing byte slices with any value, not just zero.

Mutex

mutex_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkNoMutexLock | 2000000000 | 1.18 ns/op |
| BenchmarkRWMutexReadLock | 30000000 | 54.5 ns/op |
| BenchmarkRWMutexLock | 20000000 | 96.0 ns/op |
| BenchmarkMutexLock | 20000000 | 78.7 ns/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark looks at the cost of acquiring different kinds of locks. In the first benchmark we don't acquire any lock. In the second benchmark we acquire a read lock on a RWMutex. In the third we acquire a write lock on a RWMutex. And in the last benchmark we acquire a regular Mutex lock.
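A sketch of the three lock kinds (the variable and function names are mine, not the ones in the test file):

import "sync"

var (
	mu      sync.Mutex
	rwmu    sync.RWMutex
	counter int64
)

func readWithRWMutex() int64 {
	rwmu.RLock() // read lock: many readers may hold it simultaneously
	v := counter
	rwmu.RUnlock()
	return v
}

func writeWithRWMutex(v int64) {
	rwmu.Lock() // write lock: exclusive access
	counter = v
	rwmu.Unlock()
}

func writeWithMutex(v int64) {
	mu.Lock() // plain Mutex: always exclusive
	counter = v
	mu.Unlock()
}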

Non-cryptographic Hash functions

non_cryptographic_hash_functions_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkHash32Fnv | 20000000 | 70.3 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32Fnva | 20000000 | 70.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64Fnv | 20000000 | 71.1 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64Fnva | 20000000 | 77.1 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32Crc | 30000000 | 87.5 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64Crc | 10000000 | 175 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32Adler | 30000000 | 40.3 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32Xxhash | 30000000 | 46.1 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64Xxhash | 30000000 | 47.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32Murmur3 | 20000000 | 59.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash128Murmur3 | 20000000 | 63.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64CityHash | 30000000 | 57.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash128CityHash | 20000000 | 113 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32FarmHash | 30000000 | 44.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64FarmHash | 50000000 | 26.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash128FarmHash | 30000000 | 40.3 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64SipHash | 30000000 | 39.3 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash128SipHash | 30000000 | 44.9 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64HighwayHash | 50000000 | 36.9 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash32SpookyHash | 30000000 | 58.1 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64SpookyHash | 20000000 | 62.7 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash128SpookyHash | 30000000 | 68.2 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHashMD5 | 10000000 | 169 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash64MetroHash | 100000000 | 18.6 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkHash128MetroHash | 30000000 | 48.8 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.8.3 darwin/amd64

These benchmarks look at the speed of various non-cryptographic hash function implementations in Go.

Pass By Value vs Reference

pass_by_value_vs_reference_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkPassByReferenceOneWord | 1000000000 | 2.20 ns/op |
| BenchmarkPassByValueOneWord | 1000000000 | 2.58 ns/op |
| BenchmarkPassByReferenceFourWords | 500000000 | 2.71 ns/op |
| BenchmarkPassByValueFourWords | 1000000000 | 2.78 ns/op |
| BenchmarkPassByReferenceEightWords | 1000000000 | 2.32 ns/op |
| BenchmarkPassByValueEightWords | 300000000 | 4.35 ns/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark looks at the performance cost of passing a variable by reference versus passing it by value. For small structs there doesn't appear to be much of a difference, but as the structs get larger we start to see a bit of a difference, which is to be expected since the larger the struct, the more words that have to be copied onto the function's stack when it is passed by value.
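A sketch of the eight-word case (the type and function names are mine, not the ones in the test file):

// eightWords is 64 bytes on a 64-bit platform: eight machine words.
type eightWords struct {
	a, b, c, d, e, f, g, h int64
}

// byValue copies all eight words onto the callee's stack frame.
func byValue(w eightWords) int64 { return w.a + w.h }

// byReference copies only one word: the pointer.
func byReference(w *eightWords) int64 { return w.a + w.h }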

Pool

pool_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkAllocateBufferNoPool | 20000000 | 118 ns/op | 368 B/op | 2 allocs/op |
| BenchmarkChannelBufferPool | 10000000 | 213 ns/op | 43 B/op | 0 allocs/op |
| BenchmarkSyncBufferPool | 50000000 | 27.7 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark compares three different memory allocation schemes. The first approach just allocates its buffer on the heap normally; after it's done using the buffer, the buffer will eventually be garbage collected. The second approach uses Go's sync.Pool type, which caches objects between runs of the garbage collector. The last approach uses a channel to permanently pool objects. The difference between the last two approaches is that sync.Pool dynamically resizes itself and clears items from the pool during a GC run. Two good resources to learn more about pools in Go are the blog posts Using Buffer Pools with Go and How to Optimize Garbage Collection in Go.
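A sketch of the two pooling approaches (the names and the pool size of 64 are mine, not necessarily what the test file uses):

import (
	"bytes"
	"sync"
)

// syncPool caches buffers between uses; it may be emptied during a GC run.
var syncPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// channelPool permanently holds up to 64 buffers.
var channelPool = make(chan *bytes.Buffer, 64)

func getBuffer() *bytes.Buffer {
	select {
	case b := <-channelPool:
		return b
	default:
		return new(bytes.Buffer) // pool empty: allocate a fresh buffer
	}
}

func putBuffer(b *bytes.Buffer) {
	b.Reset()
	select {
	case channelPool <- b:
	default: // pool full: let the GC reclaim the buffer
	}
}

func useBuffers() {
	// sync.Pool usage.
	b := syncPool.Get().(*bytes.Buffer)
	b.WriteString("hello")
	b.Reset()
	syncPool.Put(b)

	// channel-based pool usage.
	c := getBuffer()
	c.WriteString("hello")
	putBuffer(c)
}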

Pool Put Non Interface

pool_put_non_interface_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkPoolM3XPutSlice | 5000000 | 282 ns/op | 32 B/op | 1 allocs/op |
| BenchmarkPoolM3XPutPointerToSlice | 5000000 | 327 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkPoolSyncPutSlice | 10000000 | 184 ns/op | 32 B/op | 1 allocs/op |
| BenchmarkPoolSyncPutPointerToSlice | 10000000 | 177 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.8.3 darwin/amd64

This benchmark looks at the cost of pooling slices. Since slices are three words, they cannot be coerced into interfaces without an allocation; see the comments on this CL for more details. Consequently, putting a slice into a pool requires the three words of the slice header to be allocated on the heap. This cost will admittedly likely be offset by the savings from pooling the actual data backing the slice, but these benchmarks isolate just that cost. Indeed, we see that although putting a slice into a pool does require an additional allocation, there does not appear to be a significant cost in speed.
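A sketch of the two ways of putting a slice into a sync.Pool (the names and the 1024-byte capacity are mine):

import "sync"

// slicePool stores slice headers directly; every Put converts a three-word
// slice header into an interface{}, which forces a heap allocation.
var slicePool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 1024) },
}

// slicePtrPool stores pointers to slices; a pointer fits in an interface{}
// without an extra allocation.
var slicePtrPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, 1024)
		return &b
	},
}

func usePools() {
	s := slicePool.Get().([]byte)
	slicePool.Put(s[:0]) // allocates the slice header on the heap

	p := slicePtrPool.Get().(*[]byte)
	*p = (*p)[:0]
	slicePtrPool.Put(p) // no per-Put allocation
}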

Rand

rand_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkGlobalRandInt63 | 20000000 | 115 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkLocalRandInt63 | 300000000 | 3.95 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkGlobalRandFloat64 | 20000000 | 96.5 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkLocalRandFloat64 | 200000000 | 6.00 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.8.3 darwin/amd64

Go's math/rand package exposes various functions for generating random numbers, for example Int63. These functions use a global Rand struct which is created by the package. This struct uses a lock to serialize access to its random number source, which can lead to contention if multiple goroutines are all trying to generate random numbers using the global struct. Consequently, these benchmarks look at the performance improvement that comes from giving each goroutine its own Rand struct so it doesn't need to acquire the shared lock. This blog post explores similar optimizations for using the math/rand package for those who are interested.
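A sketch of the per-goroutine approach (the function name is mine; seeding from the clock is just for illustration):

import (
	"math/rand"
	"time"
)

// generate uses a goroutine-local *rand.Rand, so it never touches the lock
// protecting the package-level source. Note that an individual *rand.Rand is
// itself not safe for concurrent use and should not be shared.
func generate(n int) []int64 {
	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	out := make([]int64, n)
	for i := range out {
		out[i] = rng.Int63() // rand.Int63() would contend on the global lock
	}
	return out
}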

Random Bounded Numbers

random_bounded_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkStandardBoundedRandomNumber | 100000000 | 18.3 ns/op |
| BenchmarkBiasedFastBoundedRandomNumber | 100000000 | 11.0 ns/op |
| BenchmarkUnbiasedFastBoundedRandomNumber | 50000000 | 40.5 ns/op |

Generated using go version go1.8.1 darwin/amd64

Benchmarks for three different algorithms for generating a random bounded number, as discussed in the blog post Fast random shuffling. The first result is the standard approach of generating a random number and taking its modulus with the bound. The second approach implements the algorithm discussed in the aforementioned blog post, which avoids the modulus operator. The third is an unbiased version of the second algorithm. As mentioned in the article, for most applications the second algorithm will be sufficient, since the bias it introduces is likely smaller than the bias of the pseudo-random number generator being used.
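A sketch of the three approaches described in the post, using math/rand as the source (the function names are mine, not the ones in the test file):

import "math/rand"

// standardBounded uses the usual modulo reduction.
func standardBounded(n uint32) uint32 {
	return rand.Uint32() % n
}

// biasedFastBounded maps a 32-bit random value into [0, n) with a multiply
// and a shift; it is very slightly biased.
func biasedFastBounded(n uint32) uint32 {
	return uint32((uint64(rand.Uint32()) * uint64(n)) >> 32)
}

// unbiasedFastBounded rejects the few values that would introduce bias.
func unbiasedFastBounded(n uint32) uint32 {
	product := uint64(rand.Uint32()) * uint64(n)
	low := uint32(product)
	if low < n {
		threshold := -n % n // (2^32 - n) % n, computed in uint32 arithmetic
		for low < threshold {
			product = uint64(rand.Uint32()) * uint64(n)
			low = uint32(product)
		}
	}
	return uint32(product >> 32)
}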

Range over Arrays and Slices

range_array_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkIndexRangeArray | 100000000 | 10.6 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkIndexValueRangeArray | 100000000 | 14.1 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkIndexValueRangeArrayPtr | 100000000 | 10.1 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkIndexSlice | 100000000 | 10.4 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkIndexValueSlice | 100000000 | 10.3 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.8.3 darwin/amd64

These tests look at different ways to range over an array or slice. The first three benchmarks range over an array. The first uses just the index into the array (for i := range a), the second uses both the index and the value (for i, v := range a), and the third uses the index and value while ranging over a pointer to the array (for i, v := range &a). What's interesting to note is that the second benchmark is noticeably slower than the other two. This is because Go makes a copy of the array when you range over both the index and the value. Another example of this can be seen in this tweet by Damian Gryski, and there is even a linter to catch it. The last two benchmarks look at ranging over a slice: the first uses just the index into the slice and the second uses both the index and the value. Unlike in the case of an array, there is no difference in performance here.
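A sketch of the copy being made and how ranging over a pointer avoids it (the names and the array size are mine):

var a [1024]int64

// sumIndexValue ranges with both the index and the value, which iterates
// over a copy of the entire array.
func sumIndexValue() int64 {
	var total int64
	for _, v := range a {
		total += v
	}
	return total
}

// sumIndexValuePtr ranges over a pointer to the array, avoiding the copy.
func sumIndexValuePtr() int64 {
	var total int64
	for _, v := range &a {
		total += v
	}
	return total
}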

Reducing an Integer

reduction_test.go

| Benchmark Name | Iterations | Per-Iteration |
| --- | --- | --- |
| BenchmarkReduceModuloPowerOfTwo | 500000000 | 3.41 ns/op |
| BenchmarkReduceModuloNonPowerOfTwo | 500000000 | 3.44 ns/op |
| BenchmarkReduceAlternativePowerOfTwo | 2000000000 | 0.84 ns/op |
| BenchmarkReduceAlternativeNonPowerOfTwo | 2000000000 | 0.84 ns/op |

Generated using go version go1.8.1 darwin/amd64

This benchmark compares two different approaches for reducing an integer into a given range. The first two benchmarks use the traditional approach of taking the modulus of the integer with the length of the target range. The latter two benchmarks implement an alternative approach described in A fast alternative to the modulo reduction. As the benchmarks show, the alternative approach provides superior performance. It was invented because modulus division is a slow instruction on modern processors compared to other common instructions, and while a modulus by a power of two can be replaced with a bitwise AND, the same cannot be done when the bound is not a power of two. The alternative approach is fair in that every integer in the range [0,N) will have either ceil(2^32/N) or floor(2^32/N) integers in the range [0,2^32) mapped to it. However, unlike modulus division, which preserves the lower-order bits of information (so that k and k+1 map to different integers as long as N != 1), the alternative implementation preserves the higher-order bits (so k and k+1 have a much higher likelihood of being mapped to the same integer). This means it can't be used in hashmaps that use probing to resolve collisions, since probing usually adds the probe offset to the lower-order bits (for example, linear probing adds 1 to the hash value), though one can certainly imagine a probing function that adds the offset to the higher-order bits instead.
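The core of the alternative reduction is a single multiply and shift; a sketch (the function name is mine):

// reduce maps x into [0, n) with a multiply and a shift instead of a
// modulus. It keeps the high-order bits of x rather than the low-order ones.
func reduce(x, n uint32) uint32 {
	return uint32((uint64(x) * uint64(n)) >> 32)
}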

Slice Initialization Append vs Index

slice_intialization_append_vs_index_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkSliceInitializationAppend | 10000000 | 132 ns/op | 160 B/op | 1 allocs/op |
| BenchmarkSliceInitializationIndex | 10000000 | 119 ns/op | 160 B/op | 1 allocs/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark looks at slice initialization with append versus using an explicit index. I ran this benchmark a few times and it seesawed back and forth. Ultimately, I think they compile down into the same code so there probably isn't any actual performance difference. I'd like to take an actual look at the assembly that they are compiled to and update this section in the future.
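A sketch of the two initialization styles (the length and function names are mine; both versions perform exactly one allocation):

const length = 20 // arbitrary length for illustration

func initializeAppend() []int {
	s := make([]int, 0, length) // length 0, capacity preallocated
	for i := 0; i < length; i++ {
		s = append(s, i)
	}
	return s
}

func initializeIndex() []int {
	s := make([]int, length) // full length up front
	for i := 0; i < length; i++ {
		s[i] = i
	}
	return s
}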

String Concatenation

string_concatenation_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkStringConcatenation | 20000000 | 83.9 ns/op | 64 B/op | 1 allocs/op |
| BenchmarkStringBuffer | 10000000 | 131 ns/op | 64 B/op | 1 allocs/op |
| BenchmarkStringJoin | 10000000 | 144 ns/op | 128 B/op | 2 allocs/op |
| BenchmarkStringConcatenationShort | 50000000 | 25.4 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark looks at three different ways to perform string concatenation: the first uses the builtin + operator, the second uses a bytes.Buffer, and the third uses strings.Join. It seems using + is preferable to either of the other approaches, which are similar in performance.
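A sketch of the three approaches (the function names are mine, not the ones in the test file):

import (
	"bytes"
	"strings"
)

func concatPlus(a, b, c string) string {
	return a + b + c // a single runtime concatenation call
}

func concatBuffer(a, b, c string) string {
	var buf bytes.Buffer
	buf.WriteString(a)
	buf.WriteString(b)
	buf.WriteString(c)
	return buf.String() // copies the buffer's bytes into a new string
}

func concatJoin(a, b, c string) string {
	return strings.Join([]string{a, b, c}, "")
}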

The last benchmark highlights a neat optimization Go performs when concatenating strings with +. The documentation for the string concatenation function in runtime/string.go states:

// The constant is known to the compiler.
// There is no fundamental theory behind this number.
const tmpStringBufSize = 32

type tmpBuf [tmpStringBufSize]byte

// concatstrings implements a Go string concatenation x+y+z+...
// The operands are passed in the slice a.
// If buf != nil, the compiler has determined that the result does not
// escape the calling function, so the string data can be stored in buf
// if small enough.
func concatstrings(buf *tmpBuf, a []string) string {
  ...
}

That is, if the compiler determines that the resulting string does not escape the calling function, it will allocate a 32-byte buffer on the stack which can be used as the underlying buffer for the string if it is 32 bytes or less. In the last benchmark, the resulting string is in fact less than 32 bytes, so it can be stored on the stack, saving a heap allocation.

Type Assertion

type_assertion_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkTypeAssertion | 2000000000 | 0.97 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.7.5 darwin/amd64

This benchmark looks at the performance cost of a type assertion. I was a little surprised to find it was so cheap.

Write Bytes vs String

write_bytes_vs_string_test.go

| Benchmark Name | Iterations | Per-Iteration | Bytes Allocated per Operation | Allocations per Operation |
| --- | --- | --- | --- | --- |
| BenchmarkWriteBytes | 100000000 | 18.7 ns/op | 0 B/op | 0 allocs/op |
| BenchmarkWriteString | 20000000 | 63.3 ns/op | 64 B/op | 1 allocs/op |
| BenchmarkWriteUnafeString | 100000000 | 21.1 ns/op | 0 B/op | 0 allocs/op |

Generated using go version go1.7.5 darwin/amd64

Go's io.Writer interface has a single Write method, which takes a byte slice as an argument. Passing a string to it requires a conversion to a byte slice, which entails a heap allocation. These benchmarks look at the performance cost of writing a byte slice, of converting a string to a byte slice and then writing it, and of using the unsafe and reflect packages to create a byte slice that points to the data underlying the string without an allocation.
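One common shape of that unsafe trick, for the Go versions used here, looks roughly like the sketch below (the function name is mine, and this is not necessarily the exact code in the test file). It is officially discouraged: the returned slice aliases immutable string data and must never be written to.

import (
	"reflect"
	"unsafe"
)

// unsafeBytes returns a []byte that shares the string's backing data without
// copying or allocating. The returned slice must never be modified, and the
// string must stay reachable while the slice is in use.
func unsafeBytes(s string) []byte {
	sh := (*reflect.StringHeader)(unsafe.Pointer(&s))
	var b []byte
	bh := (*reflect.SliceHeader)(unsafe.Pointer(&b))
	bh.Data = sh.Data
	bh.Len = sh.Len
	bh.Cap = sh.Len
	return b
}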