This article tells the story of the new CPython test.bisect tool to identify failing tests in the CPython test suite.
Modify manually a test file
I am fixing reference leaks since many years. When the test file contains more than 200 tests and is longer than 5,000 lines, it's just not possible to spot a reference leak. Each time, I modified the long test file and actually removes enough code until the file becomes short enough so I can read it.
This method works, but it usually took me 20 to 30 minutes, and so it was common that I made mistakes... and usually had to restart from the start...
First failed attempt
In october 2014, while fixing yet another reference leak in test_capi, Xavier de Gaye was surprised that I identified quickly the leak and wanted to want how I proceeded. I explained my method removing code, but I also asked for a tool.
Xavier created bpo-22607 at 2014-10-11 and wrote a patch based on an integer range to run a subset of tests and did something special on the subTest() context manager. But Georg Brandl wasn't convinced by this approach and... I forgot this issue.
New design: list tests, run a subset
During this quarter, I had to fix dozens of reference leaks but also tests failing with "environment changed": one test method modified "something". It was really painful to identify each time the failing test.
So I created bpo-29512 at 2017-02-09 to ask again the same tool. Technically, I just wanted to run a subset of tests.
While working on OpenStack, I enjoyed the testr tool, a test runner able to list tests and to run a subset of tests. testr also provides a bisection tool to identify a subset of tests enough to reproduce a bug. The subset can contain more than a single test. Sometimes you need to run two tests sequentially to trigger a specific bug, and it's usually long and boring to identify manually these two tests.
I proposed a similar design for my bisection tool. Start by listing all tests, and then:
- create a pure random sample of tests: subset with half the size of the current test set
- If tests still fail, use the subset as the new set. Otherwise, throw the subset.
- Loop until the subset is small enough or the process run longer than 100 iterations.
To list tests, I created bpo-30523 and wrote a patch for the unittest module. Modifying unittest didn't work well with doctests and the command line interface (CLI) didn't work as I wanted. I proposed to modify regrtest instead of unittest.
I proposed to Louie Lu to implement my new idea. I was impressed that he implemented it so quickly and that it worked so well! I just asked him to not exclude doctest test cases, since these test cases were working as expected! I quickly merged his modified patch which adds the --list-cases option to regrtest.
Note: regrtest already had a --list-tests which lists test files, whereas --list-cases lists test methods and doctests.
I created bpo-30540 to add a --matchfile option to regrtest. regrtest already had a --match option, but it was only possible to use the option once, and I wanted to use a text files for my list of tests.
Again, I was surprised that it was so simple to implement the feature. By the way, I modified regrtest --match to allow to specific the option multiple times, to run multiple tests instead of a single one.
New test.bisect tool
Since I had the two key features: regrtest --list-cases and regrtest --matchfile, it became trivial to implement the bisection tool. I wrote a first prototype. The "prototype" worked much better than expected.
My first version required a text file listing test cases. I modified it to run automatically the new --list-cases command.
I extended the tool to not only track reference leaks, but also "environment changed" failures like finding a test which creates a file but doesn't remove it.
I was asked to add this tool in the Python stdlib, so I added it as Lib/test/bisect.py to use it with:
python3 -m test.bisect ...
The test.bisect CLI is similar to the test CLI on purpose.
Reference leak example
I modified test_access() of test_os to add manually a reference leak:
$ ./python -m test -R 3:3 test_os (...) test_os leaked [1, 1, 1] references, sum=3 test_os leaked [1, 1, 1] memory blocks, sum=3 test_os failed in 33 sec (...)
Just replace -m test with -m test.bisect in the command, and you get the guilty method:
$ ./python -m test.bisect -R 3:3 test_os Start bisection with 257 tests Test arguments: -R 3:3 test_os Bisection will stop when getting 1 or less tests (-n/--max-tests option), or after 100 iterations (-N/--max-iter option) [+] Iteration 1: run 128 tests/257 + /home/haypo/prog/python/master/python -m test --matchfile /tmp/tmpvbraed7h -R 3:3 test_os (...) Tests succeeded: skip this subtest, try a new subbset [+] Iteration 2: run 128 tests/257 + /home/haypo/prog/python/master/python -m test --matchfile /tmp/tmpcjqtzgfe -R 3:3 test_os (...) Tests failed: use this new subtest [+] Iteration 3: run 64 tests/128 (...) [+] Iteration 15: run 1 tests/2 (...) Tests (1): * test.test_os.FileTests.test_access Bisection completed in 16 iterations and 0:03:10
The test.bisect command found the bug I introduced: test.test_os.FileTests.test_access.
The command takes a few minutes, but I don't care of its performance as soon as its fully automated! If you use the -o file option, each time the tool is able to reduce the size of the test set, it writes the new list of tests on disk. So even if the tool crashs or fails to find a single failure test, it already helps!
I am now very happy that test.bisect works better than I expected. So I backported it to 2.7, 3.5, 3.6 and master branches, since I want to fix all buildbot failures on all maintained branches.
Environment changed example
While running the previous example, I noticed the following warning:
Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2)
Using the new --fail-env-changed option, it is now posible to check which test of test_os emits such warning:
haypo@selma$ ./python -m test.bisect --fail-env-changed -R 3:3 test_os (...) Tests (1): * test.test_os.TestSendfile.test_keywords Bisection completed in 14 iterations and 0:03:27
I never trust anything, so let's confirm the bug:
haypo@selma$ ./python -m test --fail-env-changed -R 3:3 test_os -m test.test_os.TestSendfile.test_keywords Run tests sequentially 0:00:00 load avg: 0.33 [1/1] test_os Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2) beginning 6 repetitions 123456 Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2) . Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2) .Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2) .Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2) .Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2) .Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2) . test_os failed (env changed) 1 test altered the execution environment: test_os Total duration: 21 sec Tests result: ENV CHANGED
Ok right, there is something wrong with test_keywords(). I just opened the bpo-30908.