
 <h1>Introducing the JetStream benchmark suite</h1>

 <p>Today we are introducing a new WebKit JavaScript benchmark test suite,
 <a href="http://www.browserbench.org/JetStream">JetStream</a>. JetStream codifies what
 our de facto process has been: combining latency and throughput benchmarks with roughly
 equal weighting, and capturing metrics for both traditional JavaScript programming styles
 and new JavaScript-based technologies that have captured our imaginations. Scores on
 JetStream are a good indicator of the performance users would see in advanced web applications
 like games.</p>

 <p>Optimizing the performance of our JavaScript engine is a high priority for the WebKit
 project. Improvements we introduced in the last year include
 <a href="https://bugs.webkit.org/show_bug.cgi?id=112839">concurrent compilation</a>,
 <a href="https://bugs.webkit.org/show_bug.cgi?id=121074">generational GC</a>, and the
 <a href="https://bugs.webkit.org/show_bug.cgi?id=112840">FTL JIT</a>. Engineering such
 improvements requires focus: we try to prioritize high-impact projects over building and
 maintaining complex optimizations that have smaller benefits.
 Thus, we motivate performance work with
 benchmarks that illustrate the kinds of workloads that WebKit users will likely encounter.
 This philosophy of benchmark-driven development has long been part of WebKit.</p>

 <h2>The previous state of JavaScript benchmarking</h2>

 <p>As we made enhancements to the WebKit JavaScript engine, we found that no single
 benchmark suite was entirely representative of the scenarios that we wanted to improve. We
 like <a href="https://www.webkit.org/perf/sunspider/sunspider.html">SunSpider</a> for its
 coverage of commonly-used language constructs and for the fact that its running time is
 representative of the running time of code on the web, but it falls short for measuring
 peak throughput. We like <a href="https://developers.google.com/octane/">Octane</a>, but it
 skews too far in the other direction: it's useful for determining our engine's peak
 throughput but isn't sensitive enough to the performance you'd be most likely to see on typical
 web workloads. It also downplays novel JavaScript technologies like asm.js; only one of
 Octane's 15 benchmarks is an asm.js test, and this test ignores floating-point
 performance.</p>

 <p>Finding good asm.js benchmarks is difficult.  Even though
 <a href="https://github.com/kripken/emscripten">Emscripten</a> is gaining
 mindshare, its tests are long-running and, until recently, lacked a web harness.
 So we built our own asm.js benchmarks by using tests
 from the <a href="http://llvm.org/">LLVM</a>
 <a href="http://llvm.org/docs/TestingGuide.html#test-suite">test suite</a>.
 These C and C++ tests are
 used by LLVM developers to track performance improvements of the clang/LLVM compiler stack.
 Emscripten itself uses LLVM to
 generate JavaScript code. This makes the LLVM test suite particularly appropriate for testing
 how well a JavaScript engine handles code compiled from native languages such as C and C++.
 Another benefit of our new tests is that they
 are much quicker to run than the Emscripten test suite.</p>

 <p>Having good JavaScript benchmarks allows us to confidently pursue ambitious improvements to
 WebKit. For example, SunSpider guided our
 concurrent compilation work, while the asm.js tests and Octane's throughput tests motivated
 our work on the FTL JIT. But basing our testing on
 a hodgepodge of different benchmark suites has become impractical. It's
 difficult to tell contributors what to test when no unified test suite
 can tell them whether their change had the desired effect on performance. We want one test
 suite that can report one score in the end, and we want this one score to be representative
 of WebKit's future direction.</p>

 <h2>Designing the new JetStream benchmark suite</h2>

 <p>Different WebKit components require different approaches to measuring performance.
 In some cases, the obvious approach works pretty well: for example, many layout and
 rendering optimizations can be driven by measuring page load time on representative web pages.
 But measuring the performance of a programming language implementation requires
 more subtlety. We want to increase the benchmarks' sensitivity to core engine improvements, but
 not so much so that we lose perspective on how those engine improvements play out in real
 web sites. We want to minimize the opportunities for system noise to throw off our measurements,
 but anytime a workload is inherently prone to noise, we want a benchmark to show this.
 We want our benchmarks to represent a high-fidelity approximation of the workloads that
 WebKit users are likely to care about.</p>

 <p>JetStream combines a variety of JavaScript benchmarks, covering a range of advanced
 workloads and programming techniques, and reports a single score that balances them using a
 geometric mean. Each test is run three times and scores are reported with 95% confidence
 intervals. Each benchmark measures a distinct workload, and no single optimization technique
 is sufficient to speed up all benchmarks. Some benchmarks demonstrate trade-offs, and aggressive
 or specialized optimization for one benchmark might make another benchmark slower. Demonstrating
 trade-offs is crucial for our work. As discussed in my
 <a href="https://www.webkit.org/blog/3362/introducing-the-webkit-ftl-jit/">previous post about
 our new JIT compiler</a>, WebKit tries to dynamically adapt to workload using different
 execution tiers. But this is never perfect. For example, while our new FTL JIT compiler
 gives us fantastic speed-ups on peak throughput tests, it does cause slight regressions in
 some ramp-up tests. New optimizations for advanced language runtimes often run into such
 trade-offs, and our goal with JetStream is to have a benchmark that informs us about
 the trade-offs that we are making.</p>
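
 <p>As a rough illustration of the scoring described above, the sketch below computes a
 geometric mean over per-benchmark scores and a 95% confidence interval over three runs.
 It is not taken from the JetStream source: the function names and the table of critical
 values are hypothetical, and the use of Student's t distribution for the interval is an
 assumption, since this post does not say how JetStream derives its intervals.</p>

```javascript
// Hypothetical helpers for illustration only; not the JetStream implementation.

// Geometric mean, computed in log space so that long lists of scores
// do not overflow when multiplied together.
function geometricMean(values) {
    const logSum = values.reduce((sum, v) => sum + Math.log(v), 0);
    return Math.exp(logSum / values.length);
}

// Two-sided 95% critical values of Student's t distribution,
// indexed by degrees of freedom (sample count minus one).
const T_95 = { 1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776 };

// Mean and half-width of a 95% confidence interval for a small sample.
// Assumption: the interval is based on Student's t distribution.
function confidenceInterval95(samples) {
    const n = samples.length;
    const mean = samples.reduce((s, v) => s + v, 0) / n;
    // Sample variance with Bessel's correction (divide by n - 1).
    const variance = samples.reduce((s, v) => s + (v - mean) ** 2, 0) / (n - 1);
    const halfWidth = T_95[n - 1] * Math.sqrt(variance / n);
    return { mean, halfWidth };
}
```

 <p>Because the geometric mean multiplies scores rather than adding them, doubling the score
 of any one benchmark changes the overall score by the same factor, whether that benchmark's
 raw numbers are large or small. With only three runs per test, the interval has two degrees
 of freedom, so it stays wide unless the runs agree closely.</p>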

 <p>JetStream includes benchmarks from the SunSpider 1.0.2 and Octane 2 JavaScript
 benchmark suites, along with benchmarks from the open-source LLVM compiler
 project, compiled to JavaScript using Emscripten 1.13. It also includes a
 benchmark based on the HashMap from the open-source Apache Harmony project, hand-translated to
 JavaScript. More information about the benchmarks included in JetStream is
 available on the <a href="http://www.browserbench.org/JetStream-1.0/in-depth.html">JetStream
 In Depth</a> page.</p>

 <p>We're excited to be introducing this new benchmark. To run it, simply visit
 <a href="http://www.browserbench.org/JetStream/">browserbench.org/JetStream</a>. You can
 <a href="http://bugs.webkit.org/">file bugs</a> against the benchmark using WebKit's bug
 management system under the Tools/Tests component.</p>