* blog posts - http://jmoiron.net/blog/go-performance-tales/ - use integer map keys if possible - hard to compete with Go's map implementation; esp. if your data structure has lots of pointer chasing - aes-ni instructions make string hashing much faster - prefer structs to maps if you know the map keys (esp. coming from perl, etc) - channels are useful, but slow; raw atomics can help with performance - cgo has overhead - profile before optimizing - http://slideshare.net/cloudflare/go-profiling-john-graham-cumming ( https://www.youtu.be/_41bkNr7eik ) - don't waste programmer cycles saving the wrong CPU cycles (or memory allocations) - bash$ time; time.Now()/time.Since(); pprof.StartCPUProfile/pprof.StopCPUProfile; go tool pprof http://.../profile - bash$ ps; runtime.ReadMemStats(); runtime.WriteHeapProfile(); go tool pprof http://.../heap - slice operations are sometimes O(n) - https://golang.org/pkg/runtime/debug/ - sync.Pool (basically) - https://methane.github.io/2015/02/reduce-allocation-in-go-code - 1. correctness is important - 2. BenchmarkXXX with b.ReportAllocs() (or -benchmem when running) - 3. allocfreetrace=1 produces stack trace on every allocation - strategies: - avoid string concat; use []byte+append() (+strconv.AppendInt(), ...) - benchcmp - avoid time.Format - avoid range when iterating strings ([]rune conversion + utf8 decoding) - can append string to []byte - write two versions, one for string, one for []byte (avoids conversion+copy (sometimes...)) - reuse existing buffers instead of creating new ones - http://bravenewgeek.com/so-you-wanna-go-fast/ - performance fast vs. delivery fast; make the right decision - lock-free ring buffer vs. channels: faster except with GOMAXPROCS=1 - defer has a cost (allocation+cpu) BenchmarkMutexDeferUnlock-8 20000000 96.6 ns/op BenchmarkMutexUnlock-8 100000000 19.5 ns/op - reflection+json - ffjson avoids reflection - msgp avoids json - interfaces have dynamic dispatch which can't be inlined - => use concrete types (+ code duplication) - heap vs. stack; escape analysis - lots of short-lived objects is expensive for the gc - sync.Pool reuses objects *between* gc runs - you need your own free list to hold onto things between gc runs (but now you're subverting the purpose of a garbage collector) - false sharing - custom lock-free data structures: fast but *hard* - "Speed comes at the cost of simplicity, at the cost of development time, and at the cost of continued maintenance. Choose wisely." - https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs - http://blog.golang.org/profiling-go-programs - https://medium.com/%40hackintoshrao/daily-code-optimization-using-benchmarks-and-profiling-in-golang-gophercon-india-2016-talk-874c8b4dc3c5 - If you're writing benchmarks, read http://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go - cache line explanation: http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast_22.html - avoiding false sharing: http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206 - how does this translate to go? http://www.catb.org/esr/structure-packing/ - https://en.wikipedia.org/wiki/Amdahl%27s_law - https://github.com/ardanlabs/gotraining/tree/master/topics/profiling - https://github.com/ardanlabs/gotraining/tree/master/topics/benchmarking - http://dave.cheney.net/2015/11/29/a-whirlwind-tour-of-gos-runtime-environment-variables - https://github.com/davecheney/high-performance-go-workshop cgo: cgo has overhead (which has only gotten more expensive over time) -- ~200 ns/call ssa backend means less difference in codegen really thing if you want cgo: http://dave.cheney.net/2016/01/18/cgo-is-not-go videos: https://gophervids.appspot.com/#tags=optimization -- figure out which of these are specifically worth listing "Profiling and Optimizng Go" (Uber) https://www.youtube.com/watch?v=N3PWzBeLX2M https://go-talks.appspot.com/github.com/davecheney/presentations/writing-high-performance-go.slide https://www.youtube.com/watch?v=zWp0N9unJFc Björn Rabenstein https://docs.google.com/presentation/d/1Zu0BdbhMRar7ycEwDi8jepGokTXTDXlKFf7C13tusuI/edit https://www.youtube.com/watch?v=ZuQcbqYK0BY https://go-talks.appspot.com/github.com/mkevac/golangmoscow2016/gomeetup.slide CppCon 2014: Chandler Carruth "Efficiency with Algorithms, Performance with Data Structures" https://www.youtube.com/watch?v=fHNmRkzxHWs Performance Engineering of Software Systems http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-172-performance-engineering-of-software-systems-fall-2010/ https://talks.golang.org/2013/highperf.slide#1 Machine Architecture: Things Your Programming Language Never Told You https://www.youtube.com/watch?v=L7zSU9HI-6I 7 Ways to Profile Go Applications https://www.youtube.com/watch?v=2h_NFBFrciI asm: https://golang.org/doc/asm https://goroutines.com/asm http://www.doxsey.net/blog/go-and-assembly posts: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html https://arxiv.org/abs/1509.05053 http://grokbase.com/t/gg/golang-nuts/155ea0t5hf/go-nuts-after-set-gomaxprocs-different-machines-have-different-bahaviors-some-speed-up-some-slow-down http://grokbase.com/t/gg/golang-nuts/14138jw64s/go-nuts-concurrent-read-write-of-different-parts-of-a-slice Escape Analysis Flaws https://docs.google.com/document/d/1CxgUBPlx9iJzkz9JWkb6tIpTe5q32QDmz8l0BouG0Cw/preview tools: https://github.com/golang/go/issues/14304 (creation of /x/perf ) https://godoc.org/github.com/aclements/go-perf https://godoc.org/rsc.io/benchstat https://github.com/uber/go-torch https://github.com/rakyll/gom https://github.com/tam7t/sigprof https://github.com/aybabtme/dpprof https://github.com/wblakecaldwell/profiler https://github.com/MiniProfiler/go https://perf.wiki.kernel.org/index.php/Main_Page https://github.com/dominikh/go-structlayout http://www.brendangregg.com/perf.html papers: https://www.akkadia.org/drepper/cpumemory.pdf https://software.intel.com/sites/default/files/article/392271/aos-to-soa-optimizations-using-iterative-closest-point-mini-app.pdf stackoverflow: https://stackoverflow.com/questions/19397699/why-struct-with-padding-fields-works-faster/19397791#19397791 https://stackoverflow.com/questions/10017026/no-speedup-in-multithread-program/10017482#10017482