a few more notes
parent 1878a28b6a
commit 302751cfc1
@@ -227,3 +227,6 @@ The key to performance is elegance, not battalions of special cases.
 You're bound to be unhappy if you optimize everything.
 — Donald Knuth
 
+You'll never know how bad things are until you look.
+- Howard Chu
+
@@ -206,7 +206,8 @@ also give you an idea of where to start. If you need only a 10-20%
 performance improvement, you can probably get that with some implementation
 tweaks and smaller fixes. If you need a factor of 10x or more, then just
 replacing a multiplication with a left-shift isn't going to cut it. That's
-probably going to call for changes up and down your stack.
+probably going to call for changes up and down your stack, possibly redesigning
+large portions of the system with these performance goals in mind.
 
 Good performance work requires knowledge at many different levels, from
 system design, networking, hardware (CPU, caches, storage), algorithms,
@@ -467,6 +468,12 @@ sorting will pay off. On the other hand, if you're mostly doing lookups,
 maybe having an array was the wrong choice and you'd be better off paying the
 O(1) lookup cost for a map instead.
 
+Being able to analyze your problem in terms of big-O notation also means you can
+figure out if you're already at the limit for what is possible for your problem,
+and if you need to change approaches in order to speed things up. For example,
+finding the minimum of an unsorted list is `O(n)`, because you have to look at
+every single item. There's no way to make that faster.
+
 If your data structure is static, then you can generally do much better than
 the dynamic case. It becomes easier to build an optimal data structure
 customized for exactly your lookup patterns. Solutions like minimal perfect
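As a minimal Go sketch of the `O(n)` lower bound mentioned in the added paragraph (the helper name is illustrative, not from the notes): every element has to be inspected at least once, so a single linear pass is already optimal for unsorted input.

```go
package main

import "fmt"

// minOf returns the minimum of an unsorted slice. Each element is looked at
// exactly once, so this is O(n); no algorithm can do asymptotically better
// on unsorted input. (Illustrative helper, not part of the original notes.)
func minOf(xs []int) (int, bool) {
	if len(xs) == 0 {
		return 0, false
	}
	min := xs[0]
	for _, x := range xs[1:] {
		if x < min {
			min = x
		}
	}
	return min, true
}

func main() {
	m, _ := minOf([]int{7, 3, 9, 1, 4})
	fmt.Println(m) // 1
}
```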
@@ -926,6 +933,7 @@ allocate it. But you also pay every time the garbage collection runs.
 * API design to limit allocations:
   * allow passing in buffers so caller can reuse rather than forcing an allocation
   * you can even modify a slice in place carefully while you scan over it
+  * passing in a struct could allow caller to stack allocate it
 * reducing pointers to reduce gc scan times
   * pointer-free slices
   * maps with both pointer-free keys and values
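A hedged sketch of the "allow passing in buffers" bullet above, in the style of the standard library's strconv.AppendInt: the caller owns the buffer and can reuse it across calls, so the API itself forces no allocation (the function name here is made up for illustration).

```go
package main

import "fmt"

// AppendGreeting appends its output to dst and returns the extended slice.
// Because the caller supplies dst, the buffer can be reused call after call
// instead of allocating a fresh one each time. (Illustrative API, not from the notes.)
func AppendGreeting(dst []byte, name string) []byte {
	dst = append(dst, "hello, "...)
	dst = append(dst, name...)
	return dst
}

func main() {
	buf := make([]byte, 0, 64) // one allocation, owned by the caller
	for _, name := range []string{"ana", "bob"} {
		buf = AppendGreeting(buf[:0], name) // reuse the same backing array
		fmt.Println(string(buf))
	}
}
```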
@@ -965,6 +973,7 @@ allocate it. But you also pay every time the garbage collection runs.
 * but "off-heap", so ignored by gc (but so would a pointerless slice)
 * need to think about serialization format: how to deal with pointers, indexing (mph, index header)
 * speedy de-serialization
+* binary wire protocol to struct when you already have the buffer
 * string <-> slice conversion, []byte <-> []uint32, ...
 * int to bool unsafe hack (but cmov) (but != 0 is also branch-free)
 * padding:
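Since the window above ends at the "* padding:" stub, here is a small hedged sketch of what that bullet is getting at: field ordering changes how much alignment padding the compiler inserts (the struct names are invented; the sizes assume a typical 64-bit platform).

```go
package main

import (
	"fmt"
	"unsafe"
)

// Padded interleaves 1-byte fields with an 8-byte field, so alignment padding
// is inserted around b: typically 24 bytes on 64-bit.
type Padded struct {
	a bool
	b int64
	c bool
}

// Packed puts the widest field first and groups the small ones together,
// shrinking the struct to typically 16 bytes on 64-bit.
type Packed struct {
	b int64
	a bool
	c bool
}

func main() {
	fmt.Println(unsafe.Sizeof(Padded{}), unsafe.Sizeof(Packed{})) // usually 24 16
}
```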
@@ -1023,7 +1032,7 @@ Techniques specific to the architecture running the code
 * introduction to CPU caches
 * performance cliffs
 * building intuition around cache-lines: sizes, padding, alignment
-* OS tools to view cache-misses
+* OS tools to view cache-misses (perf)
 * maps vs. slices
 * SOA vs AOS layouts: row-major vs. column major; when you have an X, do you need another X or do you need a Y?
 * temporal and spatial locality: use what you have and what's nearby as much as possible
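To make the "SOA vs AOS layouts" bullet concrete, a hedged sketch of the two layouts (the Point types and the X field are invented for illustration): scanning one field of a struct-of-arrays touches only that field's cache lines, while the array-of-structs layout drags the unused fields along.

```go
package main

import "fmt"

// AoS: every element carries all three fields, so summing X also pulls Y and Z
// through the cache.
type PointAoS struct{ X, Y, Z float64 }

// SoA: each field is stored in its own dense slice, so a scan over X is
// sequential and cache-friendly.
type PointsSoA struct{ X, Y, Z []float64 }

func sumXAoS(ps []PointAoS) (s float64) {
	for i := range ps {
		s += ps[i].X
	}
	return s
}

func sumXSoA(ps PointsSoA) (s float64) {
	for _, x := range ps.X {
		s += x
	}
	return s
}

func main() {
	aos := []PointAoS{{1, 2, 3}, {4, 5, 6}}
	soa := PointsSoA{X: []float64{1, 4}, Y: []float64{2, 5}, Z: []float64{3, 6}}
	fmt.Println(sumXAoS(aos), sumXSoA(soa)) // 5 5
}
```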
@@ -1050,7 +1059,7 @@ Techniques specific to the architecture running the code
 
 * sorting data can help improve performance via both cache locality and branch prediction, even taking into account the time it takes to sort
 * function call overhead: inliner is getting better
-* reduce data copies
+* reduce data copies (including for repeated large lists of function params)
 
 * Comment about Jeff Dean's 2002 numbers (plus updates)
 * cpus have gotten faster, but memory hasn't kept up
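One hedged sketch of the "sorting data can help ... branch prediction" bullet: the same data-dependent branch becomes far more predictable once the input is sorted. This only shows the shape of the experiment; whether the sort actually pays for itself has to be measured with a benchmark.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// countAbove contains a branch whose outcome depends on the data. On random
// input the branch predictor guesses wrong roughly half the time; on sorted
// input it mispredicts only around the threshold crossing.
func countAbove(xs []int, threshold int) int {
	n := 0
	for _, x := range xs {
		if x > threshold {
			n++
		}
	}
	return n
}

func main() {
	xs := make([]int, 1<<20)
	for i := range xs {
		xs[i] = rand.Intn(256)
	}
	fmt.Println(countAbove(xs, 127)) // time this on the shuffled slice...
	sort.Ints(xs)
	fmt.Println(countAbove(xs, 127)) // ...then again after sorting: same answer
}
```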