Java Microbenchmark Harness: The Lesser of Two Evils

Measuring performance is a fine art, and measuring performance with microbenchmarks is doubly so. There are many caveats to take care of when designing an experiment that involves microbenchmarks. Solid experience with VM technology is required, and exposure to the quirks of the particular VM is a plus. Some say benchmarking is inherently evil. We agree with that assertion, but we also realize that benchmarking is nevertheless essential, and that we need to learn how to benchmark without shooting ourselves in the foot all the time. In this session, we take a crash course in fine microbenchmarking, introducing OpenJDK's Java Microbenchmark Harness (JMH) along the way.

  • Java Microbenchmark Harness (the lesser of two evils) Aleksey Shipilev aleksey.shipilev@oracle.com, @shipilev
  • The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
  • Intro: Why would we even listen to this guy? ex-«Intel, Apache Harmony performance geek», ex-«SPEC tech. representative for Oracle», in-«Oracle/OpenJDK performance geek». Guilty for: (1) lots of shameful internal stuff, (2) SPECjbb2013, (3) concurrency improvements (e.g. @Contended), (4) Java Concurrency Stress Tests (jcstress), (5) Java Microbenchmark Harness (jmh).
  • Intro: Obligatory JVMLS reference. This talk was also well received at JVMLS 2013.
  • Basics
  • Basics: Benchmarks are experiments. Computer Science → Software Engineering: build software to meet functional requirements; mostly don’t care about HW and data specifics; abstract and composable, «formal science». Software Performance Engineering: «Real world strikes back!»; exploring complex interactions between hardware, software, and data; based on empirical evidence, i.e. «natural science».
  • Basics: Experimental Control. Any experiment requires a control: sometimes just a few baseline measurements, sometimes a vast web of support experiments. Software-specific: peek under the hood! Experiments assume a hypothesis (model), against which we do the control.
  • Basics: Common Wisdom. Microbenchmarks are bad.
  • Basics: The Root Cause. «Premature optimization is the root of all evil» (Knuth, 1974)
  • Basics: The Root Cause. «Premature Optimization is the root of all evil» (Shipilev, 2013)
  • Basics: Evil Optimizations. Optimizations distort the performance models! Applied in «common» (>50%) cases; unclear interdependencies; «black box» abstraction fails big time. Example: «new MyObject()»: allocated in TLAB? allocated in LOB? scalarized? eliminated?
  • Basics: Know Thy Optimizations. Understanding the performance model is the road to awe; this is the endgame result for benchmarking. Benchmarking is for exploring the performance models (which also helps to get better at benchmarking). Every new optimization ⇒ new hassle for everyone.
  • Basics: Benchmarks vs. Optimizations. Ground Rule: benchmarking is the (endless) fight against the optimizations. Corollary: the benchmarking harness’s #1 priority is managing the optimizations.
  • Basics: JMH. Java Microbenchmark Harness: http://openjdk.java.net/projects/code-tools/jmh/ Works around pitfalls, tailored to HotSpot/OpenJDK specifics; bug fixes as the VM evolves, or as we discover more. We (perf team) validate benches by rewriting them with JMH; facilitates peer review.
  • Basics: JMH API Sneak Peek. Let users declare the benchmark body:

        @GenerateMicroBenchmark
        public void helloWorld() {
            // do something here
        }

    ...then generate lots of supporting synthetic code around that body. (At this point, simply generating the auxiliary subclass works fine, but it is limiting for some cases.)
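    For a concrete picture, here is a minimal, self-contained benchmark in the same spirit (a sketch: class and method names are made up, and it uses the current annotation names, since later JMH releases renamed @GenerateMicroBenchmark to @Benchmark):

        import java.util.concurrent.TimeUnit;

        import org.openjdk.jmh.annotations.Benchmark;
        import org.openjdk.jmh.annotations.BenchmarkMode;
        import org.openjdk.jmh.annotations.Mode;
        import org.openjdk.jmh.annotations.OutputTimeUnit;
        import org.openjdk.jmh.annotations.Scope;
        import org.openjdk.jmh.annotations.State;
        import org.openjdk.jmh.runner.Runner;
        import org.openjdk.jmh.runner.RunnerException;
        import org.openjdk.jmh.runner.options.Options;
        import org.openjdk.jmh.runner.options.OptionsBuilder;

        @BenchmarkMode(Mode.AverageTime)
        @OutputTimeUnit(TimeUnit.NANOSECONDS)
        @State(Scope.Thread)
        public class HelloBenchmark {

            double x = Math.PI;

            @Benchmark
            public double logOfX() {
                // Return the result so the generated harness code consumes it
                // and the JIT cannot dead-code-eliminate the call.
                return Math.log(x);
            }

            public static void main(String[] args) throws RunnerException {
                Options opt = new OptionsBuilder()
                        .include(HelloBenchmark.class.getSimpleName())
                        .build();
                new Runner(opt).run();
            }
        }

    JMH generates the measurement loops, warmup handling, and reporting around the annotated method; the user code stays this small.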
  • Basics: Getting the units right. *Benchmarks:
        kilo:  > 1000 s, Linpack
        _____: 1...1000 s, SPECjvm2008, SPECjbb2013
        milli: 1...1000 ms, SPECjvm98, SPECjbb2005
        micro: 1...1000 us, single webapp request
        nano:  1...1000 ns, single operations
        pico:  1...1000 ps, pipelining
  • Basics: ...increaseth sorrow. Benchmarks amplify all the effects visible at the same scale. Millibenchmarks are not really hard; microbenchmarks are challenging, but OK; nanobenchmarks are the damned beasts! Picobenchmarks...
  • Basics: Warmup Definition. «Warmup» = waiting for the transient responses to settle down. Every online optimization requires warmup; JIT compilation is NOT the only online optimization. Ok, «watch -XX:+PrintCompilation»?
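    In JMH, the warmup and measurement phases are declared explicitly, so the transient part of the run is discarded by construction. A sketch, assuming the current annotation names (the class name is made up):

        import java.util.concurrent.TimeUnit;

        import org.openjdk.jmh.annotations.Benchmark;
        import org.openjdk.jmh.annotations.Fork;
        import org.openjdk.jmh.annotations.Measurement;
        import org.openjdk.jmh.annotations.Scope;
        import org.openjdk.jmh.annotations.State;
        import org.openjdk.jmh.annotations.Warmup;

        // Discard ten one-second iterations before measuring, so JIT compilation and
        // the other online optimizations have (hopefully) settled; then measure ten more.
        @Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
        @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
        @Fork(3)
        @State(Scope.Thread)
        public class WarmedUpBench {

            double x = 42.0;

            @Benchmark
            public double measure() {
                return Math.log(x);
            }
        }

    The iteration counts are guesses, not guarantees: check the per-iteration scores in the output and extend the warmup if they have not reached a plateau.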
  • Basics: Warmup plateaus
  • Major pitfalls
  • Major pitfalls: The Goal. The goal for this section is to scare you away from: (blindly) building the benchmark harnesses; (blindly) trusting the benchmark harnesses; (blindly) trusting the benchmarks; (blindly) being generally blind about benchmarks.
  • System: Optimization Quiz (A). Let us run the empty benchmark. System reports 4 online CPUs.
        Threads   Ops/nsec      Scale
        1         3.06 ± 0.10
        2         5.72 ± 0.10   1.87 ± 0.03
        4         5.87 ± 0.02   1.91 ± 0.03
    Question 1: Why no change going for 2 → 4 threads? Question 2: Why only 1.87x change going for 1 → 2 threads?
  • System: Power management. Running the dummy benchmark, down-clocked to 2.0 GHz:
        Threads   Ops/nsec      Scale
        1         1.97 ± 0.02
        2         3.94 ± 0.05   2.00 ± 0.02
        4         4.03 ± 0.04   2.04 ± 0.02
  • System: Power management. Many subsystems balance power-vs-performance (ex.: cpufreq, SpeedStep, Cool&Quiet, TurboBoost). Downside: breaks the homogeneity of time. Remedy: disable power management, fix the CPU clock frequency. JMH remedy: run longer, do not park threads.
  • System: OS Schedulers. OS schedulers balance affinity-vs-power (ex.: Solaris schedulers, Linux power-efficient taskqueues). Downside: breaks the processing symmetry. Remedy: tighten up scheduling policies. JMH remedy: run longer, do not park threads.
  • System: Time Sharing. Time sharing systems balance utilization (ex.: everywhere). Downside: thread start/stop is not instantaneous, thread run time is non-deterministic, the load is non-uniform. Remedy: make sure everything runs before measuring. JMH remedy: «bogus iterations».
  • System: Time Sharing, #2. JMH provides the remedy – bogus iterations.
  • System: Time Sharing, Quiz (B)
        public void measure() {
            long startTime = System.nanoTime();
            while (!isDone) {
                work();
            }
            println(System.nanoTime() - startTime);
        }
  • System: Time Sharing, Quiz (B). «Is there a problem, officer?»
        public void measure() {
            long realTime = 0;
            while (!isDone) {
                setup();    // skip this
                long time = System.nanoTime();
                work();
                realTime += (System.nanoTime() - time);
            }
            println(realTime);
        }
  • System: Time Sharing, Quiz (B). Measuring the reciprocal throughput via total/iteration time (chart: throughput, ops/us vs. number of threads; series: «timing the entire loop» vs. «timing the sum over iterations»). The throughput grows past the CPU count – WTF?!
  • System: Time Sharing, Quiz (B)
        public void measure() {
            long startTime = System.nanoTime();
            long realTime = 0;
            while (!isDone) {
                setup();    // skip this
                long time = System.nanoTime();
                work();
                realTime += (System.nanoTime() - time);
                // ...WHOOPS, WE DE-SCHEDULE HERE...
            }
            println(realTime);
            println(System.nanoTime() - startTime);
        }
  • System: Time Sharing. Time sharing gives the illusion of running multiple threads simultaneously. Downside: this illusion is broken for performance. Remedy: do NOT overload the system! JMH remedy: big red warning sign.
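    The Quiz (B) effect can be reproduced outside JMH. The sketch below (plain Java, all names made up) oversubscribes the machine and reports the same work twice: as the sum of per-thread self-timed throughputs, and against the wall clock. Time a thread spends de-scheduled or in the untimed setup never enters its self-timed sum, so every thread looks like it owns a full core and the summed «throughput» keeps growing with the thread count, while the wall clock tells the real story; the exact gap depends on the OS scheduler:

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.atomic.AtomicLong;
        import java.util.concurrent.atomic.DoubleAdder;

        public class OversubscriptionDemo {
            static volatile boolean done;
            static volatile long sink;

            static void untimedSetup() {            // like setup() in the quiz: excluded from timing
                long s = 0;
                for (int i = 0; i < 10_000; i++) s += i ^ (i << 1);
                sink = s;
            }

            static void work() {                    // the timed part
                long s = 0;
                for (int i = 0; i < 10_000; i++) s += i * i;
                sink = s;
            }

            public static void main(String[] args) throws InterruptedException {
                int threads = 4 * Runtime.getRuntime().availableProcessors(); // deliberately overloaded
                AtomicLong totalOps = new AtomicLong();
                DoubleAdder selfTimed = new DoubleAdder();   // sum of per-thread (ops / self-timed ns)

                List<Thread> pool = new ArrayList<>();
                long wallStart = System.nanoTime();
                for (int i = 0; i < threads; i++) {
                    Thread t = new Thread(() -> {
                        long ops = 0, inner = 0;
                        while (!done) {
                            untimedSetup();
                            long t0 = System.nanoTime();
                            work();
                            inner += System.nanoTime() - t0; // de-scheduling outside this window is invisible
                            ops++;
                        }
                        totalOps.addAndGet(ops);
                        if (inner > 0) selfTimed.add(ops / (double) inner);
                    });
                    t.start();
                    pool.add(t);
                }
                Thread.sleep(5000);
                done = true;
                for (Thread t : pool) t.join();
                long wallTime = System.nanoTime() - wallStart;

                System.out.printf("sum of self-timed throughputs: %.6f ops/ns%n", selfTimed.doubleValue());
                System.out.printf("wall-clock throughput:         %.6f ops/ns%n", totalOps.get() / (double) wallTime);
            }
        }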
  • VM: Optimization Quiz (C)
        @GenerateMicroBenchmark
        public void baseline() {
        }                                                   //  0.5 ± 0.1 ns

        @GenerateMicroBenchmark
        public void measureWrong() {
            Math.log(x);
        }                                                   //  0.5 ± 0.1 ns

        @GenerateMicroBenchmark
        public double measureRight() {
            return Math.log(x);
        }                                                   // 34.0 ± 1.0 ns
  • VM: Dead-code elimination. Compilers are good at eliminating redundant code. Downside: can remove (parts of) the benchmarked code. Remedy: consume the results, depend on the results, provide the side effect. JMH remedy: API support.
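    In JMH terms, «consume the results» boils down to either returning the value or sinking it into a blackhole. A sketch with the current API names (later releases renamed @GenerateMicroBenchmark to @Benchmark and spell the sink org.openjdk.jmh.infra.Blackhole; the class and field names below are made up):

        import org.openjdk.jmh.annotations.Benchmark;
        import org.openjdk.jmh.annotations.Scope;
        import org.openjdk.jmh.annotations.State;
        import org.openjdk.jmh.infra.Blackhole;

        @State(Scope.Thread)
        public class DceExamples {

            double x = 42.0;
            double y = 17.0;

            @Benchmark
            public double returnIt() {
                // The return value is fed into a blackhole by the generated harness code.
                return Math.log(x);
            }

            @Benchmark
            public void sinkIt(Blackhole bh) {
                // Explicit sinks, for benchmarks that produce more than one result.
                bh.consume(Math.log(x));
                bh.consume(Math.log(y));
            }
        }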
  • VM: Avoiding dead-code elimination. DCE is somewhat easy to avoid for primitives: primitives have binary combinators! Caveat #1: combinator cost? Caveat #2: low-range primitives enable speculation (boolean).
        int sum = 0;
        for (int i = 0; i < 100; i++) {
            sum += op(i);
        }
        return sum;    // consume in caller
  • VM: Avoiding dead-code elimination. DCE is hard to avoid for references. Caveat #1: fast object combinator, anyone? Caveat #2: need to escape the object to limit thread-local optimizations. Caveat #3: publishing the object ⇒ reference heap write ⇒ store barrier.
  • VM: DCE, Blackholes. JMH provides «Blackholes». A Blackhole consumes the value:
        class Blackhole {
            void consume(int v)    { doMagic(v); }
            void consume(Object o) { doMagic(o); }
        }
    Returns are implicitly fed into the blackhole. The user can request an additional blackhole ⇒ heap writes again, dammit!
  • VM: Avoiding dead-code elimination, Blackholes. Relatively easy for primitives:
        class Blackhole {
            static volatile Wrapper NULL;
            volatile int g1 = 1, g2 = 2;
            void consume(int v) {
                if (v == g1 & v == g2) {
                    NULL.field = 0;    // implicit NPE
                }
            }
        }
  • VM: DCE, Blackholes. Harder for references:
        class Blackhole {
            Object sink;
            int prngState;
            int prngMask;
            void consume(Object v) {
                if ((next(prngState) & prngMask) == 0) {
                    sink = v;    // store barrier here
                    prngMask = (prngMask << 1) + 1;
                }
            }
        }
  • VM: Optimization Quiz (D)
        @GenerateMicroBenchmark
        public void baseline() {
        }                                                   //  0.5 ± 0.1 ns

        @GenerateMicroBenchmark
        public double measureWrong() {
            return Math.log(42);
        }                                                   //  1.0 ± 0.1 ns

        private double x = 42;

        @GenerateMicroBenchmark
        public double measureRight() {
            return Math.log(x);
        }                                                   // 34.0 ± 1.0 ns
  • VM: Constant folding, etc. Compilers are good at partial evaluation («all right, all right! It is not really the PE»). Downside: can remove (parts of) the benchmarked code. Remedy: make the sources unpredictable. JMH remedy: API support.
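    «Make the sources unpredictable» usually just means reading the inputs from non-final @State fields instead of constants. A sketch (current annotation names, made-up class name):

        import org.openjdk.jmh.annotations.Benchmark;
        import org.openjdk.jmh.annotations.Scope;
        import org.openjdk.jmh.annotations.State;

        @State(Scope.Thread)
        public class FoldingExamples {

            static final double CONSTANT = 42.0;   // compile-time constant: foldable
            double x = 42.0;                       // instance field: must be re-read from the heap

            @Benchmark
            public double measureWrong() {
                return Math.log(CONSTANT);         // risks being folded into a pre-computed constant
            }

            @Benchmark
            public double measureRight() {
                return Math.log(x);                // input is unpredictable to the JIT, the work stays in
            }
        }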
  • VM: CSE. JMH prevents load commoning across @GMB calls:
        double x;

        @GenerateMicroBenchmark
        double doWork() {
            doStuff(x);
        }

        volatile boolean done;
        void doMeasure() {
            while (!done) {
                doWork();
            }
        }
    (i.e. read everything from the heap ⇒ you are good!)
  • VM: DCE, CSE... Same thing! Losing either a source or a sink loses part of the benchmark. Silently.
  • VM: Optimization Quiz (E)
        // changing N, will performance differ?
        static int N = 100;

        @GenerateMicroBenchmark
        public int test() {
            return doWork(N);
        }

        int x = 1, y = 2;

        private int doWork(int reps) {
            int s = 0;
            for (int i = 0; i < reps; i++)
                s += (x + y);
            return s;
        }
  • VM: Optimization Quiz (E), #2
        N        ns/call           ns/add
        1        1.5 ± 0.1         1.5 ± 0.1
        10       2.0 ± 0.1         0.1 ± 0.01
        100      2.7 ± 0.2         0.05 ± 0.02
        1000     68.8 ± 0.9        0.07 ± 0.01
        10000    410.3 ± 2.1       0.04 ± 0.01
        100000   3836.1 ± 40.6     0.04 ± 0.01
    Which one to believe? 0.04 ns/add ⇒ 25 adds/ns ⇒ GTFO!
  • VM: Loop unrolling. Loop unrolling greatly expands the scope of optimizations. Downside: assume a single loop iteration is 𝑀 ns; after unrolling, the effective cost is 𝛼𝑀 ns, where 𝛼 ∈ [0; +∞). Remedy: avoid unrollable loops, limit the effect of unrolling. JMH remedy: proper handling of CSE/DCE nullifies loop-unrolling effects.
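    The practical advice is to let the harness own the repetition and benchmark a single operation; where an internal batch loop is unavoidable, JMH's @OperationsPerInvocation at least keeps the reported units per operation (it does not stop the unrolling itself). A sketch (current annotation names, made-up class name):

        import org.openjdk.jmh.annotations.Benchmark;
        import org.openjdk.jmh.annotations.OperationsPerInvocation;
        import org.openjdk.jmh.annotations.Scope;
        import org.openjdk.jmh.annotations.State;

        @State(Scope.Thread)
        public class LoopExamples {

            int x = 1, y = 2;

            @Benchmark
            public int measureOne() {
                // Preferred: one operation per invocation; JMH supplies the outer loop.
                return x + y;
            }

            @Benchmark
            @OperationsPerInvocation(100)
            public int measureBatch() {
                // If you must loop, at least report per-operation numbers; the loop is
                // still exposed to unrolling and pipelining, so treat the score with suspicion.
                int s = 0;
                for (int i = 0; i < 100; i++) {
                    s += (x + y);
                }
                return s;
            }
        }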
  • VM: Optimization Quiz (F)
        interface M {
            void inc();
        }

        abstract class AM implements M {
            int c;
            public void inc() { c++; }
        }

        class M1 extends AM {}
        class M2 extends AM {}
  • VM: Optimization Quiz (F), #2
        M m1 = new M1();
        M m2 = new M2();

        @GenerateMicroBenchmark
        public void testM1() { test(m1); }

        @GenerateMicroBenchmark
        public void testM2() { test(m2); }

        void test(M m) {
            for (int i = 0; i < 100; i++)
                m.inc();
        }
  • VM: Optimization Quiz (F), #3
        test             ns/op
        testM1           4.6 ± 0.1
        testM2           36.0 ± 0.4
        repeat testM1    35.8 ± 0.4
        forked testM1    4.5 ± 0.1
        forked testM2    4.5 ± 0.1
  • VM: Profile feedback. Dynamic optimizations can use runtime information (ex.: call profile, type profile, CHA info). Downside: big difference between running multiple benchmarks and running a single benchmark in the VM. Remedy: warm up all benchmarks together, OR fork the JVMs. JMH remedy: bulk warmup support; forking.
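    As a sketch of the forking remedy, Quiz (F) re-rendered with the current annotation names (the surrounding class is made up); with a non-zero fork count every @Benchmark method runs in its own freshly started JVM, so the type profile poisoned by testM1 never reaches testM2:

        import org.openjdk.jmh.annotations.Benchmark;
        import org.openjdk.jmh.annotations.Fork;
        import org.openjdk.jmh.annotations.Scope;
        import org.openjdk.jmh.annotations.State;

        @State(Scope.Thread)
        @Fork(5)   // five fresh JVMs per benchmark: clean profiles, and run-to-run variance becomes visible
        public class ForkedBench {

            interface M { void inc(); }
            static abstract class AM implements M { int c; public void inc() { c++; } }
            static class M1 extends AM {}
            static class M2 extends AM {}

            M m1 = new M1();
            M m2 = new M2();

            @Benchmark
            public void testM1() { test(m1); }

            @Benchmark
            public void testM2() { test(m2); }

            private void test(M m) {
                for (int i = 0; i < 100; i++) {
                    m.inc();
                }
            }
        }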
  • VM: Optimization Quiz (G)
  • VM: Optimization Quiz (G), #2
  • VM: Run-to-run variance. Many scalable algorithms are inherently non-deterministic! (Ex.: memory allocators, profiler counters, non-fair locks, concurrent data structures, some other intelligent tricks up our sleeve...) Downside: (potentially) (devastatingly) large run-to-run variance. Remedy: replays within every subsystem, multiple JVM runs. JMH remedy: multiple forks.
  • VM: Inlining budgets. Inlining is the uber-optimization. Downside: you cannot inline everything ⇒ subtle inlining budget considerations. Remedy: smaller methods, smaller loops, examining -XX:+PrintInlining, forcing inlining. JMH remedy: generated code peels potentially hot loops; user-friendly @CompileControl.
  • VM: Inlining example. Small hot method: inlining budget starts here.
        public void testLong_loop(Loop loop, Result r, MyBenchmark bench) {
            long ops = 0;
            r.start = System.nanoTime();
            do {
                bench.testLong();    // @GenerateMicroBenchmark
                ops++;
            } while (!loop.isDone);
            r.end = System.nanoTime();
            r.ops = ops;
        }
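    On the user side, inlining can be steered per method. In current JMH releases the annotation is spelled @CompilerControl; a sketch (made-up class and method names):

        import org.openjdk.jmh.annotations.Benchmark;
        import org.openjdk.jmh.annotations.CompilerControl;
        import org.openjdk.jmh.annotations.Scope;
        import org.openjdk.jmh.annotations.State;

        @State(Scope.Thread)
        public class InliningExamples {

            double x = 42.0;

            @Benchmark
            public double measure() {
                return helper(x);
            }

            // Keep the helper out of the caller's inlining budget (and visible as a separate
            // frame in profiles), at the price of a real call. Mode.INLINE forces the opposite.
            @CompilerControl(CompilerControl.Mode.DONT_INLINE)
            private double helper(double v) {
                return Math.log(v);
            }
        }

    To see what the JIT actually decided, run with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining.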
  • CPU: Optimization Quiz (H)
        @State
        public class TreeMapBench {
            Map<String, String> map = new TreeMap<>();

            @Setup
            public void setup() {
                populate(map);
            }

            @GenerateMicroBenchmark
            public void test(BlackHole bh) {
                for (String key : map.keySet()) {
                    String value = map.get(key);
                    bh.consume(value);
                }
            }
        }
  • CPU: Optimization Quiz (H), #2
        @GenerateMicroBenchmark
        public void test(BlackHole bh) {
            for (String key : map.keySet()) {
                String value = map.get(key);
                bh.consume(value);
            }
        }

                              Exclusive    Shared
        Throughput, op/sec    615 ± 12     828 ± 21
        Threads               4            4
        Maps                  4            1
        Footprint, Kb         1024         256
  • CPU: Cache capacity. DRAM memory is too far and too slow: cache the hottest stuff in the on-die SRAM cache! Downside: remarkably different performance for memory accesses, depending on your luck. Remedy: track the memory footprint; multiple experiments with different problem sizes; shared/distinct data for the worker threads. JMH remedy: @State scopes.
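    Tracking the footprint is easiest when the problem size is an explicit benchmark parameter, so the same code is measured with working sets on both sides of each cache level. A sketch using @Param (listed as future work at the end of this deck, available in current JMH releases; class and field names are made up):

        import java.util.Random;
        import java.util.concurrent.TimeUnit;

        import org.openjdk.jmh.annotations.Benchmark;
        import org.openjdk.jmh.annotations.BenchmarkMode;
        import org.openjdk.jmh.annotations.Mode;
        import org.openjdk.jmh.annotations.OutputTimeUnit;
        import org.openjdk.jmh.annotations.Param;
        import org.openjdk.jmh.annotations.Scope;
        import org.openjdk.jmh.annotations.Setup;
        import org.openjdk.jmh.annotations.State;

        @BenchmarkMode(Mode.AverageTime)
        @OutputTimeUnit(TimeUnit.MICROSECONDS)
        @State(Scope.Thread)
        public class FootprintBench {

            // ~4 KB, ~4 MB and ~256 MB working sets: same code, very different memory behaviour.
            @Param({"1000", "1000000", "64000000"})
            int size;

            int[] data;

            @Setup
            public void setup() {
                data = new int[size];
                Random r = new Random(42);
                for (int i = 0; i < size; i++) {
                    data[i] = r.nextInt();
                }
            }

            @Benchmark
            public long sum() {
                long s = 0;
                for (int v : data) {
                    s += v;
                }
                return s;
            }
        }

    Divide the score by the size to compare per-element costs across the cache levels.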
  • CPU: Optimization Quiz (I). How scalable is this?
        @State(Scope.Benchmark)
        class Shared {
            final int[] c = new int[64];
        }

        @State(Scope.Thread)
        class Local {
            static final AtomicInteger COUNTER = ...;
            final int index = COUNTER.incrementAndGet();
        }

        @GenerateMicroBenchmark
        void work(Shared s, Local l) {
            s.c[l.index]++;
        }
  • CPU: Optimization Quiz (I), #2
        Threads    Average ns/call    Hit
        1          2.0 ± 0.1
        2          18.5 ± 2.4         9x
        4          32.9 ± 6.2         16x
        8          85.4 ± 13.4        42x
        16         208.9 ± 52.1       104x
        32         464.2 ± 46.1       232x
    Why?
  • CPU: Bulk memory transfers. The memory subsystem tracks data in cache-line quanta; cache lines are 32, 64, 128 bytes long. Downside: dense inter-thread accesses are very hard on the memory subsystem (false sharing). Remedy: padding, subclass juggling, @Contended. JMH remedy: control structures are heavily padded, auto-padding for @State.
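    A sketch of the padding and «subclass juggling» remedies in plain Java (all names made up). The alternative, @Contended, is sun.misc.Contended on JDK 8 and jdk.internal.vm.annotation.Contended on later JDKs, and outside the JDK itself it only takes effect with -XX:-RestrictContended:

        public class FalseSharingLayouts {

            // Dense layout: counters of different threads land on the same cache lines,
            // so independent increments still fight over cache-line ownership.
            static final class DenseCounters {
                final long[] slots = new long[64];
                void inc(int index) { slots[index]++; }
            }

            // «Subclass juggling»: superclass fields are laid out before subclass fields,
            // so the hot value is fenced off by at least 56 bytes of padding on each side.
            static class PadBefore { long p0, p1, p2, p3, p4, p5, p6; }
            static class Value extends PadBefore { long value; }
            static final class PaddedCounter extends Value { long q0, q1, q2, q3, q4, q5, q6; }

            static final class PaddedCounters {
                final PaddedCounter[] slots;
                PaddedCounters(int n) {
                    slots = new PaddedCounter[n];
                    for (int i = 0; i < n; i++) slots[i] = new PaddedCounter();
                }
                void inc(int index) { slots[index].value++; }
            }
        }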
  • CPU: Optimization Quiz (J) (credits: Sergey Kuksenko, @kuksenk0). Which one is faster?
    Exhibit B:
        int sum = 0;
        for (int x : a) {
            if (x < 0) { sum -= x; } else { sum += x; }
        }
        return sum;
    Exhibit P:
        int sum = 0;
        for (int x : a) {
            sum += Math.abs(x);
        }
        return sum;
  • CPU: Optimization Quiz (J). Which one is faster?
    E. Branched:
        L0:  mov    0xc(%ecx,%ebp,4),%ebx
             test   %ebx,%ebx
             jl     L1
             add    %ebx,%eax
             jmp    L2
        L1:  sub    %ebx,%eax
        L2:  inc    %ebp
             cmp    %edx,%ebp
             jl     L0
    E. Predicated:
        L0:  mov    0xc(%ecx,%ebp,4),%ebx
             mov    %ebx,%esi
             neg    %esi
             test   %ebx,%ebx
             cmovl  %esi,%ebx
             add    %ebx,%eax
             inc    %ebp
             cmp    %edx,%ebp
             jl     L0
  • CPU: Optimization Quiz (J). Time, nsec/op; regular pattern = (+, –)*; C-A9 numbers use the client compiler.
                              NHM    Bldzr    C-A9    SNB
        branch_regular        0.9    0.8      5.0     0.5
        branch_shuffled       6.2    2.8      9.4     1.0
        branch_sorted         0.9    1.0      5.0     0.6
        predicated_regular    2.0    1.0      5.3     0.8
        predicated_shuffled   2.0    1.0      9.3     0.8
        predicated_sorted     2.0    1.0      5.7     0.8
  • CPU: Branch Prediction. Out-of-order engines speculate a lot, and most of the time (99%+) they are correct! Downside: vastly different performance when speculation fails. Remedy: realistic data! Multiple diverse datasets.
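    A sketch of the «multiple diverse datasets» remedy, parameterizing Quiz (J) over the input pattern (current annotation names, made-up class name):

        import java.util.Random;

        import org.openjdk.jmh.annotations.Benchmark;
        import org.openjdk.jmh.annotations.Param;
        import org.openjdk.jmh.annotations.Scope;
        import org.openjdk.jmh.annotations.Setup;
        import org.openjdk.jmh.annotations.State;

        @State(Scope.Thread)
        public class BranchBench {

            @Param({"regular", "sorted", "shuffled"})
            String pattern;

            int[] a;

            @Setup
            public void setup() {
                a = new int[10_000];
                Random r = new Random(42);
                for (int i = 0; i < a.length; i++) {
                    switch (pattern) {
                        case "regular":  a[i] = (i % 2 == 0) ? i : -i;    break; // alternating signs: trivially predictable
                        case "sorted":   a[i] = i - a.length / 2;         break; // one sign flip in the middle
                        case "shuffled": a[i] = r.nextBoolean() ? i : -i; break; // ~50% mispredictions for the branched version
                    }
                }
            }

            @Benchmark
            public int branched() {
                int sum = 0;
                for (int x : a) {
                    if (x < 0) { sum -= x; } else { sum += x; }
                }
                return sum;
            }

            @Benchmark
            public int predicated() {
                int sum = 0;
                for (int x : a) {
                    sum += Math.abs(x);
                }
                return sum;
            }
        }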
  • Conclusion
  • Conclusion: not as simple as it sounds. You should be scared by now! Resist the urge to: believe the pleasant results; reject the unpleasant results; write the throw-away benchmarks; write the «generic» benchmark harnesses; believe the fancy reports and beautiful APIs; trust the code.
  • Conclusion: Benchmarking is serious. More rigor is never a bad thing! The intuition is almost always wrong (unless you rock). Never trust anything (unless checked before). Ever challenge everything (especially these slides). Embrace failure (especially your failures). Grind your teeth, and redo the tests (especially yours).
  • Conclusion: Things on the list to do. JMH does one thing and does it right: it gets you fewer «back to square one» moments. Other things to improve usability: Java API (in progress); bindings to reporters (in progress); bindings to the other JVM languages; @Param-eters.
  • Thanks!
  • Conclusion: But wait...
  • Conclusion: Alternative Evil. Don’t do any performance assessments at all. You should already know why it is far worse. ...right?
  • Thanks!