# Writing and Optimizing Go code
This document outlines best practices for writing high-performance Go code.
At the moment, it's a collection of links to videos, slides, and blog posts
("awesome-golang-performance"), but I would like this to evolve into a longer
book format where the content is here instead of external. The links should be
sorted into categories.

While we will discuss making individual services faster (caching, etc.),
designing performant distributed systems is beyond the scope of this work.

All the content will be licensed under CC-BY-SA.

This book is split into different sections:

1) basic tips for writing not-slow software
    * CS 101-level stuff
2) tips for writing fast software
    * Go-specific sections on how to get the best from Go
3) advanced tips for writing *really* fast software
    * For when your optimized code isn't fast enough

### When and Where to Optimize
I'm putting this first because it's really the most important step. Should
you even be doing this at all?

Every optimization has a cost. Generally this cost is expressed in terms of
code complexity or cognitive load -- optimized code is rarely simpler than
the unoptimized version.

But there's another side that I'll call the economics of optimization. As a
programmer, your time is valuable. There's the opportunity cost of what else
you could be working on for your project, which bugs to fix, which features
to add. Optimizing things is fun, but it's not always the right task to
choose. Performance is a feature, but so is shipping, and so is correctness.

Choose the most important thing to work on. Sometimes this isn't an
optimization at all. Sometimes it's not an actual CPU optimization, but a
user-experience one: making something start up faster by doing computation in
the background after drawing the main window, for example.

Sometimes this will be obvious: an hourly report that completes in three hours
is probably less useful than one that completes in less than one.

Just because something is easy to optimize doesn't mean it's worth
optimizing. Ignoring low-hanging fruit is a valid development strategy.

Think of this as optimizing *your* time.
Choosing what to optimize. Choosing when to optimize.

Knuth's "premature optimization" quote is frequently misread. He advises
forgetting about small efficiencies about 97% of the time, but he also says
not to pass up the opportunities in the critical 3%. Work on those important 3%.

TPOP (*The Practice of Programming*): Should you optimize? "Yes, but only if
the problem is important, the program is genuinely too slow, and there is some
expectation that it can be made faster while maintaining correctness,
robustness, and clarity."

Fast software or fast deployment.

http://bitfunnel.org/strangeloop has numbers: a hypothetical search engine
needing 30k machines at $1k USD / year each costs $30M/year, so doubling the
speed of your software can save $15M/year. Even a developer spending an
entire year to shave off 1% will pay for itself.

In the vast majority of cases, the size and speed of a program are not a
concern. The easiest optimization is not having to do it. The second easiest
optimization is just buying faster hardware.

Once you've decided you're going to change your program, keep reading.

### How to Optimize

## Optimization Workflow

Before we get into the specifics, let's talk about the general process of
optimization.

Optimization is a form of refactoring. But each step, rather than improving
some aspect of the source code (code duplication, clarity, etc), improves
some aspect of the performance: lower CPU, memory usage, latency, etc. This
improvement generally comes at the cost of readability. This means that in
addition to a comprehensive set of unit tests (to ensure your changes haven't
broken anything), you also need a good set of benchmarks to ensure your
changes are having the desired effect on performance. You must be able to
verify that your change really *is* lowering CPU. Sometimes a change you
thought would improve performance will actually turn out to have zero or
negative effect. Always make sure you undo your change in these cases.

The benchmarks you are using must be correct and provide reproducible numbers
on representative workloads. If individual runs have too high a variance, it
will make small improvements more difficult to spot. You will need to use
benchstat or equivalent statistical tests and won't be able to just eyeball
it. (Note that using statistical tests is a good idea anyway.) The steps to
run the benchmarks should be documented, and any custom scripts and tooling
should be committed to the repository with instructions for how to run them.
Be mindful of large benchmark suites that take a long time to run: they will
make the development iterations slower.

(Note also that anything that can be measured can be optimized. Make sure
you're measuring the right thing.)

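
To make the "unit tests plus benchmarks" requirement concrete, here is a
minimal sketch using the standard `testing` package. The two functions being
compared, `concatNaive` and `concatBuilder`, are hypothetical stand-ins and
not code from any real project. Benchmarks normally live in a `_test.go` file
and run with `go test -bench`; `testing.Benchmark` is used here only so the
example fits in one runnable file.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// concatNaive is a hypothetical "before" implementation: repeated
// string concatenation, which reallocates as the string grows.
func concatNaive(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// concatBuilder is a hypothetical "after" implementation using
// strings.Builder.
func concatBuilder(parts []string) string {
	var sb strings.Builder
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

// sink keeps results alive so the work under test is not discarded.
var sink string

func main() {
	parts := make([]string, 1000)
	for i := range parts {
		parts[i] = "x"
	}

	// Correctness first: timings are meaningless if the outputs differ.
	if concatNaive(parts) != concatBuilder(parts) {
		panic("implementations disagree")
	}

	naive := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = concatNaive(parts)
		}
	})
	builder := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = concatBuilder(parts)
		}
	})

	fmt.Println("naive:  ", naive, naive.MemString())
	fmt.Println("builder:", builder, builder.MemString())
}
```

A single run like this is only a starting point. To decide whether a
difference is real, collect several runs of each version and compare them
with a statistical tool such as benchstat.
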
The next step is to decide what you are optimizing for. If the goal is to
improve CPU, what is an acceptable speed? Do you want to improve the current
performance by 2x? 10x? Can you state it as "problem of size N in less than
time T"? Are you trying to reduce memory usage? By how much? How much slower
is acceptable for what change in memory usage? What are you willing to give
up in exchange for lower space?

Optimizing for service latency is a trickier proposition. Entire books have
been written on how to performance test web servers. The primary issue is
that, unlike single-threaded code, where the performance is fairly consistent
for a given problem size, a web service doesn't have a single number. A
proper web-service benchmark suite will provide a latency distribution for a
given reqs/second level. ...

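
As a small illustration of what "a latency distribution" means in practice,
the sketch below summarizes a set of measured request latencies as
percentiles rather than a single average. The helper names (`percentile`,
`sampleLatencies`, `millis`) and the synthetic data are assumptions made for
this example; a real load-testing tool would collect the samples for you.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// millis converts a count of milliseconds to a time.Duration.
func millis(n int) time.Duration {
	return time.Duration(n) * time.Millisecond
}

// sampleLatencies returns synthetic latencies of 1ms, 2ms, ..., n ms.
// Real measurements would come from a load test.
func sampleLatencies(n int) []time.Duration {
	s := make([]time.Duration, n)
	for i := range s {
		s[i] = millis(i + 1)
	}
	return s
}

// percentile returns the p-th percentile (0 < p <= 100) using the
// nearest-rank method. It sorts a copy, leaving the input untouched.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := make([]time.Duration, len(samples))
	copy(sorted, samples)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	// Nearest rank: smallest value with at least p% of samples at or below it.
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	samples := sampleLatencies(100)
	samples = append(samples, 2*time.Second) // one slow outlier

	for _, p := range []float64{50, 90, 99, 100} {
		fmt.Printf("p%v: %v\n", p, percentile(samples, p))
	}
}
```

The outlier barely moves the median but dominates the top of the
distribution, which is why a single number is not enough to describe a
service.
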
The performance goals must be specific. You will (almost) always be able to
make something faster. Optimizing is frequently a game of diminishing
returns. You need to know when to stop.

Dan Luu's talk also points out the advantage of rough calculations to
determine if your target performance figures are reasonable.

For greenfield development, you shouldn't leave all benchmarking and
performance numbers until the end. It's easy to say "we'll fix it later", but
if performance is really important it will be a design consideration from the
start. Any significant architectural changes required to fix performance
issues will be too risky near the deadline. Note that *during* development,
the focus should be on reasonable program design, algorithms, and data
structures. Optimizing at lower levels of the stack should wait until later
in the development cycle when a more complete view of the system performance
is available. Any full-system profiles you do while the system is incomplete
will give a skewed view of where the bottlenecks will be in the finished
system.

Write code that you can benchmark. You can profile larger systems, but with
benchmarks you want to test isolated pieces. You need to be able to extract
and set up sufficient context so that benchmarks test enough and are
representative.

The difference between what your target is and the current performance will
also give you an idea of where to start. If you need only a 10%-20%
performance improvement, you can probably get that with some implementation
tweaks and smaller fixes. If you need a factor of 10x or more, then just
replacing a multiplication with a left-shift isn't going to cut it. That's
probably going to call for changes up and down your stack.

Good performance work requires knowledge at many different levels: system
design, networking, hardware (CPU, caches, storage), algorithms, tuning, and
debugging. With limited time and resources, consider which level will give
the most improvement: it won't always be algorithm or program tuning.

In general, optimizations should proceed from top to bottom. Optimizations at
the system level will have more impact than expression-level ones. Make sure
you're solving the problem at the appropriate level.

This book is mostly going to talk about reducing CPU usage, reducing memory
usage, and reducing latency. It's good to point out that you can very rarely
do all three. Maybe CPU time is lower, but now your program uses more
memory. Maybe you need to reduce memory space, but now the program will take
longer.

Amdahl's Law tells us to focus on the bottlenecks. If you double the speed of
a routine that only takes 5% of the runtime, that's only a 2.5% speedup in
total wall-clock. On the other hand, speeding up a routine that takes 80% of
the time by only 10% will improve runtime by almost 8%. Profiles will help
identify where time is actually spent.

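
The arithmetic behind those figures can be checked with a few lines of Go.
This is a sketch of the standard Amdahl's Law formula; the function names
`overallSpeedup` and `runtimeSaved` are made up for this example.

```go
package main

import "fmt"

// overallSpeedup applies Amdahl's Law. fraction is the share of total
// runtime spent in the routine (0..1) and factor is how many times faster
// that routine becomes. The result is the speedup of the whole program.
func overallSpeedup(fraction, factor float64) float64 {
	return 1 / ((1 - fraction) + fraction/factor)
}

// runtimeSaved returns the share of the original runtime that disappears.
func runtimeSaved(fraction, factor float64) float64 {
	return 1 - 1/overallSpeedup(fraction, factor)
}

func main() {
	fmt.Printf("5%% routine made 2x faster:    %.1f%% of runtime saved\n",
		100*runtimeSaved(0.05, 2))
	fmt.Printf("80%% routine made 1.1x faster: %.1f%% of runtime saved\n",
		100*runtimeSaved(0.80, 1.1))
}
```

Note that the second figure depends on how "10% faster" is read: a 1.1x
speedup of the routine saves about 7.3% of total runtime, while cutting the
routine's time by 10% saves exactly 8%.
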
When optimizing, you want to reduce the amount of work the CPU has to do.

A smarter algorithm can drastically reduce CPU time.
30k x improvement

Compiler optimizations double performance every 18 years (Proebsting's Law).

Program tuning, like compiler optimizations, can only make a small dent in
the total runtime. Large wins will almost always come from an algorithmic
change or data structure change, a fundamental shift in how your program is
organized.

A profiler might show you that lots of time is spent in a particular routine.
It could be this is an expensive routine, or it could be a cheap routine that
is just called many many times. Rather than immediately trying to speed up
that one routine, see if you can reduce the number of times it's called or
eliminate it completely. We'll discuss more concrete optimization strategies
in the next section.

The Three Optimization Questions:

- Do we have to do this at all? The fastest code is the code that's never run.
- If yes, is this the best algorithm?
- If yes, is this the best *implementation* of this algorithm?

### Concrete optimization tips

Jon Bentley's 1982 work "Writing Efficient Programs" approached program
optimization as an engineering problem: Benchmark. Analyze. Improve. Verify.
Iterate. A number of his tips are now done automatically by compilers. A
programmer's job is to use the transformations compilers *can't* do.

There are summaries of this book:

- http://www.crowl.org/lawrence/programming/Bentley82.html
- http://www.geoffprewett.com/BookReviews/WritingEfficientPrograms.html

and the program tuning rules:

- https://web.archive.org/web/20080513070949/http://www.cs.bell-labs.com/cm/cs/pearls/apprules.html

When thinking of changes you can make to your program, there are two basic
options: you can either change your data or you can change your code.

## Data Changes

Changing your data means either adding to or altering the representation of
the data you're processing.

(Some of these rely on changing the O() associated with different aspects of
the data structure.)

Ideas for augmenting your data structure:

- extra fields: For example, store the size of a linked list rather than
  iterating when asked for it. Or store additional pointers to frequently
  needed other nodes to avoid multiple searches (for example, "backwards"
  links in a doubly-linked list to make removal O(1)). These sorts of changes
  are useful when the data you need is cheap to store and keep up-to-date.

- extra search indexes: Most data structures are designed for a single type
  of query. If you need two different query types, having an additional
  "view" onto your data can be a large improvement. For example, a []struct
  referenced by ID but sometimes by string can gain a map[string]id (or
  \*struct).

- extra information about elements: for example, a bloom filter. These need
  to be small and fast to not overwhelm the rest of the data structure.

- if queries are expensive, add a cache. We're all familiar with memcache,
  but there are also in-process caches.
    * over the wire, the network + cost of serialization will hurt
    * in-process caches avoid that, but now you need to worry about expiration
    * even a single item can help (logfile time parse example)

TODO: "cache" might not even be key-value, just a pointer to where you were
working. This can be as simple as a "search finger"

These are all clear examples of "do less work" at the data structure level.
They all cost space. Most of the time if you're optimizing for CPU, your
program will use more memory. This is the classic space-time trade-off:
https://en.wikipedia.org/wiki/Space%E2%80%93time_tradeoff

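
The "extra search index" idea can be sketched as follows. The `User` and
`Store` types are hypothetical and exist only for illustration; the sketch
also assumes names are unique, which a real system would have to enforce or
handle.

```go
package main

import "fmt"

// User is a hypothetical record type used only for this illustration.
type User struct {
	ID   int
	Name string
}

// Store keeps users in a slice addressed by ID and adds a second index so
// that lookups by name do not need a linear scan.
type Store struct {
	users  []User         // primary storage: the slice position is the ID
	byName map[string]int // extra search index: name -> ID (names assumed unique)
}

// NewStore returns an empty Store with its index initialized.
func NewStore() *Store {
	return &Store{byName: make(map[string]int)}
}

// Add appends a user and keeps the index in sync. The cost of the extra
// index is paid here, on every write, and in the memory the map occupies.
func (s *Store) Add(name string) int {
	id := len(s.users)
	s.users = append(s.users, User{ID: id, Name: name})
	s.byName[name] = id
	return id
}

// ByID is a direct slice access.
func (s *Store) ByID(id int) (User, bool) {
	if id < 0 || id >= len(s.users) {
		return User{}, false
	}
	return s.users[id], true
}

// ByName uses the extra index instead of scanning the slice.
func (s *Store) ByName(name string) (User, bool) {
	id, ok := s.byName[name]
	if !ok {
		return User{}, false
	}
	return s.users[id], true
}

func main() {
	s := NewStore()
	s.Add("alice")
	s.Add("bob")

	u, ok := s.ByName("bob")
	fmt.Println(u, ok)
}
```

The trade-off is the one described above: every write now updates two
structures, and the index costs memory, in exchange for faster reads.
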

If your program uses too much memory, it's also possible to go the other way.
Reduce space usage in exchange for increased computation. Rather than storing
things, calculate them every time. You can also compress the data in memory
and decompress it on the fly when you need it.

There's a book available online covering techniques for reducing the space
used by your programs. While it was originally written targeting embedded
developers, the ideas are applicable for programs on modern hardware dealing
with huge amounts of data. http://www.smallmemory.com/

- Rearrange your data: eliminate padding, remove extra fields.
- Change to a slower data structure that uses less space.
- Skip a pointer-heavy tree structure and use a slice and linear search instead.
- Use a custom compression format for your data: floating point (go-tsz),
  integers (delta, xor + huffman).

We will talk more about data layouts later.

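
As an illustration of "eliminate padding", the sketch below compares two
struct layouts holding the same fields. The type names `Padded` and `Compact`
are made up for this example, and the exact sizes depend on the platform's
alignment rules.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Padded interleaves small and large fields, so the compiler has to insert
// padding to keep the int64 aligned.
type Padded struct {
	A bool
	B int64
	C bool
}

// Compact holds the same fields ordered from largest to smallest, which
// lets the two bools share the space after the int64.
type Compact struct {
	B int64
	A bool
	C bool
}

// sizes reports the in-memory size of each layout in bytes.
func sizes() (padded, compact uintptr) {
	return unsafe.Sizeof(Padded{}), unsafe.Sizeof(Compact{})
}

func main() {
	p, c := sizes()
	fmt.Printf("Padded: %d bytes, Compact: %d bytes\n", p, c)
}
```

On amd64 this typically reports 24 and 16 bytes. The saving only matters
when you hold many instances, for example in a large slice.
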
Modern computers and the memory hierarchy make the space/time trade-off less
clear. It's very easy for lookup tables to be "far away" in memory (and
therefore expensive to access), making it faster to just recompute a value
every time it's needed.

This also means that benchmarking will frequently show improvements that are
not realized in the production system due to cache contention (e.g., lookup
tables are in the processor cache during benchmarking but always flushed by
"real data" when used in a real system). Google's Jump Hash paper in fact
addressed this directly, comparing performance on both a contended and
uncontended processor cache. See graphs 4 and 5 in the Jump Hash paper:
https://arxiv.org/pdf/1406.2294.pdf

TODO: how to simulate a contended cache, show incremental cost

Another aspect to consider is data-transfer time. Generally network and disk
access is very slow, so the time saved by loading a compressed chunk will be
much greater than the extra CPU time required to decompress the data once it
has been fetched. As always, benchmark. A binary format will generally
be smaller and faster to parse than a text one, but at the cost of no longer
being as human readable.

## Algorithmic Changes

If you're not changing the data, the other main option is to change the code.

The biggest improvement is likely to come from an algorithmic change. This
is the equivalent of replacing bubble sort with quicksort to go from an
O(n^2) sort to O(n log n), or replacing a linear scan (O(n)) through an array
that used to be small with a map lookup (O(1)).

This is how software becomes slow. Structures originally designed for one use
are repurposed for something they weren't designed for. This happens
gradually.

It's important to have an intuitive grasp of the different big-O levels.
Choose the right data structure for your problem. You don't always have to
shave cycles, but this just prevents dumb performance issues that might not
be noticed until much later.

The basic classes of complexity are:

* O(1): a field access, array or map lookup
* O(log n): binary search
* O(n): simple loop
* O(n log n): divide-and-conquer
* O(n\*m): nested loop
* combinatoric - look out!!

Link: bigocheatsheet.com

Let's say you need to search through an unsorted set of data. "I should
use a binary search" you think, knowing that a binary search is O(log n),
which is faster than the O(n) linear scan. However, a binary search requires
that the data is sorted, which means you'll need to sort it first, which will
take O(n log n) time. If you're doing lots of searches, then the upfront cost
of sorting will pay off. On the other hand, if you're mostly doing lookups,
maybe having an array was the wrong choice and you'd be better off paying the
O(1) lookup cost for a map instead.

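
The three options in that paragraph look like this in Go. This is only a
sketch: the function names are made up for illustration, and which option
wins for a given workload is something to measure rather than assume.

```go
package main

import (
	"fmt"
	"sort"
)

// linearContains scans unsorted data: O(n) per query, no setup cost.
func linearContains(data []int, x int) bool {
	for _, v := range data {
		if v == x {
			return true
		}
	}
	return false
}

// sortedContains requires sorted input: O(log n) per query, after a
// one-time O(n log n) sort paid by the caller.
func sortedContains(sorted []int, x int) bool {
	i := sort.SearchInts(sorted, x)
	return i < len(sorted) && sorted[i] == x
}

// buildSet pays O(n) time and extra memory once so that each later lookup
// is O(1) on average.
func buildSet(data []int) map[int]struct{} {
	set := make(map[int]struct{}, len(data))
	for _, v := range data {
		set[v] = struct{}{}
	}
	return set
}

func main() {
	data := []int{42, 7, 19, 3, 88, 61}

	fmt.Println("linear:", linearContains(data, 19))

	sorted := append([]int(nil), data...) // sort a copy, keep the original order
	sort.Ints(sorted)
	fmt.Println("binary:", sortedContains(sorted, 19))

	set := buildSet(data)
	_, ok := set[19]
	fmt.Println("map:   ", ok)
}
```
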
Choose the simplest reasonable data structure and move on. CS 101, writing
"not-slow software". Don't be dumb. This should be your default development
mode. If you know you need random access, don't choose a linked-list.
If you know you need in-order traversal, don't use a map.

http://daslab.seas.harvard.edu/rum-conjecture/

When writing a package to be used by others, avoid the temptation to
optimize up front for every single use case. This will result in unreadable
code. Data structures by design are effectively single-purpose. You can
neither read minds nor predict the future. If a user says "Your package is
too slow for this use case", a reasonable answer might be "Then use this
other package over here". A package should "do one thing well".

Sometimes hybrid data structures will provide the performance improvement you
need. For example, by bucketing your data you can limit your search to a
single bucket. This still pays the theoretical cost of O(n), but the constant
will be smaller. We'll revisit these kinds of tweaks when we get to program
tuning.

Two things that people forget when discussing big-O notation:

One: there's a constant factor involved. Two algorithms which have the same
algorithmic complexity can have different constant factors. Imagine looping
over a list 100 times vs just looping over it once. Even though both are
O(n), one has a constant factor that's 100 times higher.

These constant factors are why even though merge sort, quicksort, and
heapsort are all O(n log n) (on average, in quicksort's case), everybody uses
quicksort because it's the fastest. It has the smallest constant factor.

The second thing is that big-O only says "as n grows to infinity". It says
nothing about small n. "As the numbers get big, this is the growth factor
that will dominate the run time."

There's frequently a cut-off point below which a dumber algorithm is faster.
A nice example comes from the Go standard library's `sort` package. At the
time of writing, most of the time it's using quicksort, but it has a
shell-sort pass then insertion sort when the partition size drops below 12
elements.

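
The cut-off idea can be sketched with a simple hybrid sort. This is not the
standard library's implementation: it is a plain quicksort that hands small
partitions to insertion sort, and the `cutoff` value is an assumption that
should be tuned by benchmarking on real data.

```go
package main

import "fmt"

// cutoff is an assumed threshold for this sketch; the best value depends
// on the data and the hardware.
const cutoff = 12

// insertionSort is O(n^2) but has very low overhead on short slices.
func insertionSort(a []int) {
	for i := 1; i < len(a); i++ {
		for j := i; j > 0 && a[j] < a[j-1]; j-- {
			a[j], a[j-1] = a[j-1], a[j]
		}
	}
}

// partition picks the middle element as the pivot (Lomuto scheme) and
// returns the pivot's final index. Elements before it are smaller, elements
// after it are greater than or equal to it.
func partition(a []int) int {
	last := len(a) - 1
	mid := len(a) / 2
	a[mid], a[last] = a[last], a[mid]
	pivot := a[last]
	store := 0
	for i := 0; i < last; i++ {
		if a[i] < pivot {
			a[i], a[store] = a[store], a[i]
			store++
		}
	}
	a[store], a[last] = a[last], a[store]
	return store
}

// hybridSort uses quicksort on large partitions and falls back to
// insertion sort once a partition is at most cutoff elements long.
func hybridSort(a []int) {
	for len(a) > cutoff {
		p := partition(a)
		// Recurse into the smaller side and loop on the larger one to
		// keep the stack depth bounded.
		if p < len(a)-p-1 {
			hybridSort(a[:p])
			a = a[p+1:]
		} else {
			hybridSort(a[p+1:])
			a = a[:p]
		}
	}
	insertionSort(a)
}

func main() {
	data := []int{9, 4, 7, 1, 8, 2, 6, 3, 5, 0, 15, 11, 13, 12, 14, 10, 4, 9}
	hybridSort(data)
	fmt.Println(data)
}
```
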
The memory hierarchy in modern computers confuses the issue here a little
bit, in that caches prefer the predictable access of scanning a slice to the
effectively random access of chasing a pointer. Still, it's best to begin
with a good algorithm. We will talk about this in the hardware-specific
section.

"The fight may not always go to the strongest, nor the race to the fastest, but that's the way to bet." -- Rudyard Kipling.

Sometimes the best algorithm for a particular problem is not a single
algorithm, but a collection of algorithms specialized for slightly different
input classes. This "polyalgorithm" quickly detects what kind of input it
needs to deal with and then dispatches to the appropriate code path. This is
what the sorting package mentioned above does: determine the problem size and
choose a different algorithm. The `strings` and `bytes` packages do something
similar, detecting and specializing for different cases. As with data
compression, the more you know about what your input looks like, the better
your custom solution can be. Even if an optimization is not always applicable,
complicating your code by determining that it's safe to use and executing
different logic can be worth it.

In addition to combining quicksort, shell sort, and insertion sort, the
`sort` package also tracks the recursion depth of quicksort and calls
heapsort if necessary.

## Benchmark Inputs

Know how big each of your input sizes is likely to be in production.

Your benchmarks must use appropriately-sized inputs. As we've seen, different
algorithms make sense at different input sizes. If your expected input range
is <100, then your benchmarks should reflect that. Otherwise, choosing an
algorithm which is optimal for n=10^6 might not be the fastest.

Be able to generate representative test data. Different distributions of data
can provoke different behaviours in your algorithm: think of the classic
"quicksort is O(n^2) when the data is sorted" example. Similarly,
interpolation search is O(log log n) for uniform random data, but O(n) worst
case. Knowing what your inputs look like is the key to both representative
benchmarks and for choosing the best algorithm. If the data you're using to
test isn't representative of real workloads, you can easily end up optimizing
for one particular data set, "overfitting" your code to work best with one specific
set of inputs.

Also note that some issues that are not apparent on your laptop might be
visible once you deploy to production and are hitting 250k reqs/second on a
40 core server.

## Program Tuning

Program tuning used to be an art form, but then compilers got better. So now
it turns out that compilers can optimize straightforward code better than
complicated code. The Go compiler still has a long way to go to match gcc and
clang, but it does mean that you need to be careful when tuning and
especially when upgrading that your code doesn't become "worse". There are
definitely cases where tweaks to work around the lack of a particular
compiler optimization became slower once the compiler was improved.

If you are working around a specific runtime or compiler code generation
issue, always document your change with a link to the upstream issue. This
will allow you to quickly revisit your optimization once the bug is fixed.

Fight the temptation to cargo cult folklore-based "performance tips".

Program tuning is an iterative process. Keep revisiting your code and seeing
what changes can be made. Ensure you're making progress at each step.
Frequently one improvement will enable others to be made. (Now that I'm not
doing A, I can simplify B by doing C instead.) This means you need to keep
looking at the entire picture and not get too obsessed with one small set of
lines.

Once you've settled on the right algorithm, program tuning is the process of
improving the implementation of that algorithm. In Big-O notation, this is
the process of reducing the constants associated with your program.

All program tuning is either making a slow thing fast, or doing a slow thing
fewer times. Algorithmic changes also fall into these categories, but we're
going to be looking at smaller changes. Exactly how you do this varies as
technologies change.

Making a slow thing fast might be replacing SHA1 or `hash/fnv` with a faster
hash function. Doing a slow thing fewer times might be saving the result of
the hash calculation of a large file so you don't have to do it multiple
times.

Keep comments. If something doesn't need to be done, explain why. Frequently
when optimizing an algorithm you'll discover steps that don't need to be
performed under some circumstances. Document them. Somebody else might think
it's a bug and needs to be put back.

An empty program gives the wrong answer in no time at all.
It's easy to be fast if you don't have to be correct.

Cache common cases:

- Your cache doesn't even need to be huge. I optimized a log processing
  script to cache the previous time passed to `time.Parse()` for a
  significant speedup.
- But beware cache invalidation, thread issues, etc.
- Random cache eviction is fast and sufficiently effective.
    - only put "some" items in cache (probabilistically) to limit cache size
      to popular items with minimal logic
- Compare the cost of cache logic to the cost of refetching the data.

I've done experiments with a network trace that showed even an optimal
cache wasn't worth it. Your expected hit ratio is important. You'll want to
export the ratio to your monitoring stack. Changing ratios will show a
shift in traffic. Then it's time to revisit the cache size or the
expiration policy.

This also means your benchmark data needs to be representative of the real
world. If repeated requests are sufficiently rare, it's more expensive to
keep them around than to recompute them. If your benchmark data consists of
only the same repeated request, your cache will give an inaccurate view of
the performance.

Program tuning:

- if possible, keep the old implementation around for testing
- if not possible, generate sufficient golden test cases to compare output
- exploit a mathematical identity: https://go-review.googlesource.com/c/go/+/85477, multiplication with addition, ...
- just clearing the parts you used, rather than an entire array
- best done in tiny steps, a few statements at a time
- moving from floating point math to integer math
    - or mandelbrot removing sqrt, or lttb removing abs
- cheap checks before more expensive checks:
    - e.g., strcmp before regexp, (q.v., bloom filter before query)
- common cases before rare cases
    - i.e., avoid extra tests that always fail
- remove branches from inner loops

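
As an example of "cheap checks before more expensive checks", the sketch
below runs a substring test before a regular expression. The pattern and the
`errorCode` function are hypothetical. Whether the pre-check pays off depends
on how often lines fail it and on what the regexp engine already optimizes,
so benchmark it on your own data.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// errorLine is a hypothetical pattern. Every match must contain the literal
// "ERROR", which is what makes the cheap pre-check safe.
var errorLine = regexp.MustCompile(`ERROR code=(\d+)`)

// errorCode extracts the numeric code from a matching log line.
func errorCode(line string) (string, bool) {
	// Cheap check first: lines without "ERROR" can never match, so they
	// are rejected without running the regexp.
	if !strings.Contains(line, "ERROR") {
		return "", false
	}
	m := errorLine.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	lines := []string{
		"INFO request served in 12ms",
		"ERROR code=503 upstream unavailable",
		"ERROR without a code",
	}
	for _, l := range lines {
		code, ok := errorCode(l)
		fmt.Printf("%q -> %q %v\n", l, code, ok)
	}
}
```

The pre-check encodes an assumption about the pattern. If the pattern
changes, the check has to change with it, which is why it is documented next
to the pattern.
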

Many folk-lore performance tips for tuning rely on poorly optimizing
compilers and encourage the programmer to do these transformations by hand:
hoisting invariant calculations out of loops, using shift instead of
multiply, loop unrolling, common sub-expression elimination, ...

The transformations the compiler can't do rely on you knowing things about
the algorithm, about your input data, about invariants in your system, and
other assumptions you can make, and factoring that implicit knowledge into
removing or altering steps in the data structure.

Every optimization codifies an assumption about your data. These *must* be
documented and, even better, tested for. These assumptions are going to be
where your program crashes, slows down, or starts returning incorrect data as
the system evolves.

Program tuning improvements are cumulative. Five 3% improvements add up to
roughly a 15% improvement. When making optimizations, it's worth thinking
about the expected performance improvement. Replacing a hash function with a
faster one is a constant factor improvement.

Hashing map k/v: iterate over sorted keys/values, *or* track hashes as everything comes in and xor them together

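
The second option in the map-hashing note can be sketched as below. The
choice of FNV-1a, the zero-byte separator, and the function names are all
assumptions made for this example. XOR-combining is not collision resistant,
so it suits change detection rather than anything security related.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// entryHash hashes one key/value pair. The zero byte separates key from
// value so that ("ab", "c") and ("a", "bc") hash differently, assuming keys
// never contain a zero byte.
func entryHash(k, v string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(k))
	h.Write([]byte{0})
	h.Write([]byte(v))
	return h.Sum64()
}

// mapHash combines per-entry hashes with XOR. XOR is commutative, so the
// result does not depend on map iteration order and no sorting is needed.
func mapHash(m map[string]string) uint64 {
	var sum uint64
	for k, v := range m {
		sum ^= entryHash(k, v)
	}
	return sum
}

func main() {
	m := map[string]string{"host": "a.example", "port": "8080"}
	h := mapHash(m)
	fmt.Printf("full hash:        %016x\n", h)

	// Incremental update: XOR out the old entry and XOR in the new one,
	// without rehashing the rest of the map.
	h ^= entryHash("port", "8080")
	h ^= entryHash("port", "9090")
	m["port"] = "9090"
	fmt.Printf("incremental hash: %016x\n", h)
	fmt.Printf("recomputed hash:  %016x\n", mapHash(m))
}
```

Compared with sorting the keys first, this avoids the O(n log n) sort and
allows the hash to be maintained as entries are added or removed.
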
Log parsing example:

- time parsing is slow
- adding a single item cache is good
- removing time parsing and doing some integer math by hand is again faster
- general algorithm is slow, you can be faster because you know more about your problem
- but the code is more closely tied to exactly what you need; harder to change

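
A single-item cache for the log parsing case might look like the sketch
below. The timestamp layout and the `cachedParser` type are assumptions for
illustration. The idea relies on consecutive log lines often sharing the same
timestamp string.

```go
package main

import (
	"fmt"
	"time"
)

// layout is an assumed timestamp format with one-second resolution.
const layout = "2006-01-02 15:04:05"

// cachedParser remembers the last string it parsed successfully.
// It is not safe for concurrent use: guard it with a mutex or use one
// parser per goroutine.
type cachedParser struct {
	valid     bool
	lastInput string
	lastTime  time.Time
	hits      int
	misses    int
}

// Parse returns the cached result when the input repeats and falls back to
// time.Parse otherwise. Failed parses are not cached.
func (p *cachedParser) Parse(s string) (time.Time, error) {
	if p.valid && s == p.lastInput {
		p.hits++
		return p.lastTime, nil
	}
	t, err := time.Parse(layout, s)
	if err != nil {
		return time.Time{}, err
	}
	p.misses++
	p.valid, p.lastInput, p.lastTime = true, s, t
	return t, nil
}

func main() {
	p := &cachedParser{}
	stamps := []string{
		"2018-01-20 00:42:01",
		"2018-01-20 00:42:01",
		"2018-01-20 00:42:01",
		"2018-01-20 00:42:02",
	}
	for _, s := range stamps {
		if _, err := p.Parse(s); err != nil {
			fmt.Println("parse error:", err)
		}
	}
	fmt.Printf("hits=%d misses=%d\n", p.hits, p.misses)
}
```

The hit and miss counters are there so the hit ratio can be observed. If the
input is not mostly ordered, the cache adds a string comparison per call and
saves nothing.
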
Optimization is specialization, and specialized code is more fragile to
change than general purpose code.

The standard library implementations need to be "fast enough" for most cases.
If you have higher performance needs you will probably need specialized
implementations.

Profile regularly to track the performance characteristics of your
system and be prepared to re-optimize as your traffic changes. Know the
limits of your system and have good metrics that allow you to predict when
you will hit those limits.

When the usage of your application changes, different pieces may become
hotspots. Revisit previous optimizations and decide if they're still worth
it, and revert to more readable code when possible. I had one system where I
had optimized process startup time with a complex set of mmap, reflect, and
unsafe. Once we changed how the system was deployed, this code was no longer
required and I replaced it with much more readable regular file operations.

## Optimization workflow summary

All optimizations should follow these steps:

1. determine your performance goals and confirm you are not meeting them
1. profile to identify the areas to improve. This can be CPU, heap allocations, or goroutine blocking.
1. benchmark to determine the speedup your solution will provide, using
   the built-in benchmarking framework (<http://golang.org/pkg/testing/>).
   Make sure you're benchmarking the right thing on your target operating system and architecture.
1. profile again afterwards to verify the issue is gone
1. use <https://godoc.org/golang.org/x/perf/benchstat> or
   <https://github.com/codahale/tinystat> to verify that a set of timings
   are 'sufficiently' different for an optimization to be worth the
   added code complexity.
1. use <https://github.com/tsenart/vegeta> for load testing HTTP services
1. make sure your latency numbers make sense: <https://youtu.be/lJ8ydIuPFeU>

The first step is important. It tells you when and where to start optimizing.
More importantly, it also tells you when to stop. Pretty much all
optimizations add code complexity in exchange for speed. And you can *always*
make code faster. It's a balancing act.
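
As a rough sketch of the benchmarking step, the program below compares two versions of the same function with the `testing` package. `concatPlus` and `concatBuilder` are hypothetical stand-ins for the "before" and "after" versions of a hot path; they are not taken from any real project. In practice these would normally be `BenchmarkXxx` functions in a `_test.go` file run with `go test -bench=. -benchmem`; `testing.Benchmark` is used here only so the sketch runs as a standalone program.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// concatPlus is a hypothetical "before" version: repeated string
// concatenation builds a new string on every iteration.
func concatPlus(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// concatBuilder is a hypothetical "after" version using strings.Builder.
func concatBuilder(parts []string) string {
	var sb strings.Builder
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

// result keeps the output alive so the compiler cannot discard the call.
var result string

// benchmark runs f repeatedly over a fixed input and reports timings
// and allocation counts.
func benchmark(f func([]string) string) testing.BenchmarkResult {
	parts := strings.Split(strings.Repeat("x,", 100), ",")
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			result = f(parts)
		}
	})
}

func main() {
	before := benchmark(concatPlus)
	after := benchmark(concatBuilder)
	fmt.Println("plus:   ", before.String(), before.MemString())
	fmt.Println("builder:", after.String(), after.MemString())
}
```

A single run like this is noisy. Collecting several runs of each version and feeding them to benchstat is what tells you whether the difference is real.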
## Tooling
## Introductory Profiling

Techniques applicable to source code in general

1. introduction to pprof
   - go tool pprof (and <https://github.com/google/pprof>)
1. Writing and running (micro)benchmarks
   - profile, extract hot code to benchmark, optimize benchmark, profile.
   - -cpuprofile / -memprofile / -benchmem
   - 0.5 ns/op means it was optimized away -> how to avoid
   - tips for writing good microbenchmarks (remove unnecessary work, but add baselines)
1. How to read pprof output
1. What are the different pieces of the runtime that show up
1. Macro-benchmarks (Profiling in production)
   - net/http/pprof
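
One common way to keep a microbenchmark from being optimized away is to store the result in a package-level variable. The sketch below illustrates the idea; `popcount`, `benchDiscarded`, `benchSink`, and `sink` are names made up for this example. Whether the compiler actually removes the unused call depends on the Go version and the function being measured, so treat a suspiciously low ns/op as a hint to check rather than as proof.

```go
package main

import (
	"fmt"
	"math/bits"
	"testing"
)

// sink is a package-level variable. Writing the final result here makes
// the computation observable, so the compiler has to keep it.
var sink int

// popcount is a tiny function that is likely to be inlined.
func popcount(x uint64) int { return bits.OnesCount64(x) }

// benchDiscarded ignores the result. After inlining, the compiler is
// free to drop the work, leaving little more than an empty loop.
func benchDiscarded(b *testing.B) {
	for i := 0; i < b.N; i++ {
		popcount(uint64(i))
	}
}

// benchSink accumulates into a local and stores it once at the end, so
// the loop body cannot be eliminated as dead code.
func benchSink(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r += popcount(uint64(i))
	}
	sink = r
}

func main() {
	fmt.Println("discarded:", testing.Benchmark(benchDiscarded))
	fmt.Println("sink:     ", testing.Benchmark(benchSink))
}
```

Adding an empty-loop baseline next to the real benchmark makes it easier to see how much of the reported time is loop overhead.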
## Tracer
## Advanced Techniques

- Techniques specific to the architecture running the code
- introduction to CPU caches
  - performance cliffs
  - building intuition around cache-lines: sizes, padding, alignment
  - false-sharing
  - true sharing -> sharding
  - OS tools to view cache-misses
  - maps vs. slices
  - SOA vs AOS layouts
  - reducing pointer chasing
- branch prediction
- function call overhead
- Comment about Jeff Dean's 2002 numbers (plus updates)
  - CPUs have gotten faster, but memory hasn't kept up
## Garbage Collection

- Stack vs. heap allocations
- What causes heap allocations?
- Understanding escape analysis (and its current limitations)
- API design to limit allocations: allow passing in buffers so the caller can reuse them rather than forcing an allocation
  - you can even modify a slice in place carefully while you scan over it
- reducing pointers to reduce GC scan times
  - pointer-free map keys
- GOGC
- buffer reuse (sync.Pool vs. custom via go-slab, etc.)
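
Two of the items above can be sketched briefly. `appendKV` is a hypothetical encoder written in the same "append" style as `strconv.AppendInt`: the caller passes in the destination buffer and can reuse it across calls. `filterInPlace` is a hypothetical example of modifying a slice while scanning over it, reusing the input's backing array for the output.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendKV appends "key=val" to dst and returns the extended slice.
// Because the caller owns dst, it can reuse one buffer for many calls.
func appendKV(dst []byte, key string, val int64) []byte {
	dst = append(dst, key...)
	dst = append(dst, '=')
	dst = strconv.AppendInt(dst, val, 10)
	return dst
}

// filterInPlace keeps only the even values. The result shares the
// backing array with xs; the write position never passes the read
// position, so no element is overwritten before it has been read.
// The original contents of xs are clobbered.
func filterInPlace(xs []int) []int {
	out := xs[:0]
	for _, x := range xs {
		if x%2 == 0 {
			out = append(out, x)
		}
	}
	return out
}

func main() {
	buf := make([]byte, 0, 64)
	for i := int64(0); i < 3; i++ {
		buf = appendKV(buf[:0], "n", i) // reuse the same backing array
		fmt.Printf("%s\n", buf)
	}
	fmt.Println(filterInPlace([]int{1, 2, 3, 4, 5, 6}))
}
```

The trade-off with both patterns is aliasing: the caller has to know that the returned slice shares memory with what was passed in.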
## Runtime and compiler
- cost of calls via interfaces (indirect calls on the CPU level)
- runtime.convT2E / runtime.convT2I
- type assertions vs. type switches
- defer
- special-case map implementations for ints, strings
- bounds check elimination
- []byte <-> string copies, map optimizations
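
As an illustration of bounds check elimination, the sketch below uses the "early bounds check" hint that also appears in the standard library's `encoding/binary` package. The function names are made up for this example. How many checks the compiler removes depends on the Go version; building with `-gcflags=-d=ssa/check_bce/debug=1` should report the checks that remain.

```go
package main

import "fmt"

// loadUint32Naive reads four bytes in increasing index order. Each
// index expression may need its own bounds check.
func loadUint32Naive(b []byte) uint32 {
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}

// loadUint32Hint touches the highest index first. Once b[3] is known
// to be in range, the compiler can prove the other accesses are safe.
func loadUint32Hint(b []byte) uint32 {
	_ = b[3] // early bounds check: panics here if len(b) < 4
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}

func main() {
	b := []byte{0x01, 0x02, 0x03, 0x04}
	fmt.Printf("%#x %#x\n", loadUint32Naive(b), loadUint32Hint(b))
}
```

Both functions return the same value and both panic on short input; the hint only changes where the check happens.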
## Common gotchas with the standard library
- time.After() leaks until it fires
- Reusing HTTP connections...
- ....
- rand.Int() and friends are mutex protected, and new random sources are expensive to create
  - consider alternate random number generation (go-pcgr, xorshift)
## Unsafe

- And all the dangers that go with it
- Common uses for unsafe
  - mmap'ing data files
  - struct padding
  - speedy de-serialization
  - string <-> slice conversion, []byte <-> []uint32, ...
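
As a sketch of the string <-> slice conversion item, the hypothetical `bytesToString` below reinterprets a byte slice as a string without copying. It relies on the slice and string headers starting with the same two words (data pointer and length), which is an assumption about the gc toolchain's layout rather than a language guarantee. Go 1.20 added `unsafe.String` and `unsafe.SliceData`, which are the better choice where available.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString returns a string that shares memory with b.
// The caller must not modify b afterwards: the rest of the program
// assumes strings are immutable, and breaking that leads to bugs that
// are very hard to track down.
func bytesToString(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

func main() {
	b := []byte("hello")
	s := bytesToString(b)
	fmt.Println(s, len(s))
}
```

This saves one allocation and one copy per conversion. Benchmark first: the compiler already avoids the copy in some cases, such as a map lookup written as `m[string(b)]`.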
## cgo
- Performance characteristics of cgo calls
- Tricks to reduce the costs: batching
- Rules on passing pointers between Go and C
- syso files
## Assembly

- Stuff about writing assembly code for Go
- compilers improve; the bar is high
- replace as little as possible to make an impact
- very important to benchmark: improvements can be huge (10x for go-highway), zero (go-speck), or the code can even get slower (no inlining)
- always have a pure-Go version (noasm build tag): testing, arm, gccgo
- brief intro to syntax
- calling convention
- using opcodes unsupported by the asm
- notes about why intrinsics are hard
- all the tooling to make this easier: asmfmt, peachpy, c2goasm, ...

## Alternate implementations

- Popular replacements for standard library packages:
  - encoding/json -> ffjson
  - net/http -> fasthttp (but incompatible API)
  - regexp -> ragel (or other regular expression package)
  - serialization
    - encoding/gob -> <https://github.com/alecthomas/go_serialization_benchmarks>
    - protobuf -> <https://github.com/gogo/protobuf>
    - all formats have trade-offs: choose one that matches what you need
      (encoded space, decoding speed, language/tooling compatibility, ...)
  - database/sql -> jackc/pgx, ...
- gccgo

## Tooling

Look at some more interesting/advanced tooling

- perf (perf2pprof)

## Appendix: Implementing Research Papers

Tips for implementing papers: (For `algorithm` read also `data structure`)

* Don't. Start with the obvious solution and reasonable data structures.

* "Modern" algorithms tend to have lower theoretical complexities but high constant factors and lots of implementation complexity.

  "The fastest algorithm can frequently be replaced by one that is almost as fast and much easier to understand."
  (Douglas W. Jones, University of Iowa)

  Some examples:

  * Treaps vs. red-black/AVL trees
  * Raft is "easier" to understand than Paxos
  * Fibonacci heaps are notoriously difficult to get right *and* have a huge constant factor

  The payoff has to be large enough to justify the added complexity.
  Cache algorithms are one example: much higher complexity for only a small
  improvement in hit ratio. Of course, you may not be able to test this until
  you have a working implementation and have integrated it into your program.

  Sometimes the paper will have graphs, but much like the trend towards
  publishing only positive results, these will tend to be skewed in favour of
  showing how good the new algorithm is.

* Choose the right paper.

* Look for the paper their algorithm claims to beat and implement that.
  Frequently, earlier papers will be easier to understand and necessarily have
  simpler algorithms.

  Not all papers are good.

  Look at the context the paper was written in. Determine its assumptions about
  the hardware: disk space, memory usage, etc. Some older papers make
  tradeoffs that were reasonable in the 70s or 80s but don't
  necessarily apply to your use case. For example, some streaming algorithms
  are designed for router hardware, which makes them a pain to translate into
  software.

  Make sure the assumptions the algorithm makes about your data hold.

  This will take some digging. You probably don't want to implement the
  first paper you find.

* Make sure you understand the algorithm. This sounds obvious, but it will be impossible to debug otherwise.

  <https://blizzard.cs.uwaterloo.ca/keshav/home/Papers/data/07/paper-reading.pdf>

  A good understanding may allow you to extract the key idea from the paper
  and possibly apply just that to your problem, which may be simpler than
  reimplementing the entire thing.

* The original paper for a data structure or algorithm isn't always the best. Later papers may have better explanations.

* Some papers release reference source code which you can compare against, but

  1. academic code is almost universally terrible
  1. beware licensing restrictions
  1. beware bugs

  Also look out for other implementations on GitHub: they may have the same (or different!) bugs as yours.

See also:

* <https://www.youtube.com/watch?v=8eRx5Wo3xYA>
* <http://codecapsule.com/2012/01/18/how-to-implement-a-paper/>