
* Add more test cases. Categories we'd like to cover (with reasonably
  real-world tests, preferably not microbenchmarks) include:

  (X marks the ones that are fairly well covered now).

  X math (general)
  X bitops
  X 3-d (the math bits)
  - crypto / encoding
  X string processing
  - regexps
  - date processing
  - array processing
  - control flow
  - function calls / recursion
  - object access (unclear if it is possible to make a realistic
    benchmark that isolates this)

  I'd specifically like to add all the Computer Language Shootout
  tests that Mozilla is using.

* Normalize tests. Most of the available test cases have a repeat
  count of some sort, so the time they take can be tuned. The tests
  should be tuned so that each category contributes about the same
  total time, and so that each test within a category contributes
  about the same amount, as sketched below. The open question is
  which implementation should serve as the baseline. My current
  thought is to either pick a specific browser on a specific platform
  (IE 7 or Firefox 2, perhaps), or to target the average that some
  set of same-generation release browsers get on each test. The
  latter is more work. IE 7 is probably a reasonable normalization
  target: it is the latest version of the most popular browser, so
  results on this benchmark will tell you how much you stand to gain
  or lose by using a different browser.
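
  A minimal sketch of how the tuning might work, assuming we have
  per-iteration baseline timings in milliseconds for each test on the
  chosen reference browser (the function name and the 100 ms budget
  are illustrative assumptions, not part of the current harness):

      // Hypothetical helper: baselineTimings maps each test name to
      // the time (ms) one iteration takes on the baseline browser.
      function computeRepeatCounts(baselineTimings, targetMsPerTest) {
          var repeats = {};
          for (var test in baselineTimings) {
              // Use at least one repeat; round to the nearest count.
              repeats[test] = Math.max(1,
                  Math.round(targetMsPerTest / baselineTimings[test]));
          }
          return repeats;
      }

      // Example: a test taking 4 ms per iteration would get a repeat
      // count of 25 under a 100 ms per-test budget.
      var counts = computeRepeatCounts({ "some-test": 4 }, 100);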

* Instead of multiplying the standard error by the normal-distribution
  constant 1.96, the correct way to calculate a 95% confidence
  interval for a small sample is to use Student's t-distribution
  <http://en.wikipedia.org/wiki/Student%27s_t-test>. Basically, this
  involves multiplying the standard error by a value from a two-tailed
  t-distribution table, chosen by the number of degrees of freedom,
  instead of by 1.96. A table is available at
  <http://www.medcalc.be/manual/t-distribution.php>. See the sketch
  below.
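
  A minimal sketch of the calculation (the table holds the standard
  two-tailed critical values for p = 0.05, indexed by degrees of
  freedom; the function name is illustrative):

      // Two-tailed t-distribution critical values for a 95%
      // confidence interval, indexed by degrees of freedom (entry 0
      // is unused, since at least two samples are needed; extend the
      // table for larger sample counts).
      var tTable = [NaN, 12.71, 4.30, 3.18, 2.78, 2.57,
                    2.45, 2.36, 2.31, 2.26, 2.23];

      function confidenceInterval95(samples) {
          var n = samples.length;
          var mean = 0;
          for (var i = 0; i < n; i++)
              mean += samples[i] / n;
          var sumSquares = 0;
          for (var i = 0; i < n; i++)
              sumSquares += (samples[i] - mean) * (samples[i] - mean);
          // Standard error of the mean, from the sample variance.
          var stdErr = Math.sqrt(sumSquares / (n - 1) / n);
          // Multiply by the t value for n - 1 degrees of freedom
          // instead of the normal-distribution constant 1.96.
          return { mean: mean, plusMinus: tTable[n - 1] * stdErr };
      }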

* Add support to compare two different engines (or two builds of the
  same engine) with their runs interleaved.

* Add support to compare two existing sets of saved results.

* Allow the repeat count to be controlled from the browser-hosted
  version and the WebKitTools wrapper script.

* Add support to run only a subset of the tests (in both the
  command-line and web versions).

* Add a profile mode for the command-line version that runs the tests
  repeatedly in a single interpreter instance, for ease of profiling.

* Make the browser-hosted version prettier, both in its general
  design and perhaps by using bar graphs for the output.

* Make it possible to track changes over time and to generate a graph
  per result, showing the result and its error bar for each version.

* Hook up to automated testing / buildbot infrastructure.

* Possibly... add the ability to download iBench from its original
  server, pull out the JS test content, preprocess it, and add it as
  a category to the benchmark.

* Profit.