Python 3.7 perf_counter() nanoseconds

Since 2012, I have been trying to convert all Python clocks to use internally nanoseconds. The last clock which still used floating point internally was time.perf_counter(). INADA Naoki's new importtime tool was an opportunity for me to have a new look on a tricky integer overflow issue.

Modify importtime to use time.perf_counter() clock

INADA Naoki added to Python 3.7 a new cool -X importtime command line option to analyze the Python import performance. This tool can be used optimize the startup time of your application. Example:

vstinner@apu$ ./python -X importtime -c pass
import time: self [us] | cumulative | imported package
(...)
import time:       901 |       1902 | io
import time:       374 |        374 |       _stat
import time:       663 |       1037 |     stat
import time:       617 |        617 |       genericpath
import time:       877 |       1493 |     posixpath
import time:      3840 |       3840 |     _collections_abc
import time:      2106 |       8474 |   os
import time:       674 |        674 |   _sitebuiltins
import time:       922 |        922 |   sitecustomize
import time:       598 |        598 |   usercustomize
import time:      1444 |      12110 | site

Read Naoki's article How to speed up Python application startup time (Jan 19, 2018) for a concrete analysis of pipenv performance.

Naoki chose to use the time.monotonic() clock internally to measure elapsed time. On Windows, this clock (GetTickCount64() function) has a resolution around 15.6 ms, whereas most Python imports take less than 10 ms, and so most numbers are just zeros. Example:

f:\dev\3x>python -X importtime -c "import idlelib.pyshell"
Running Debug|Win32 interpreter...
import time: self [us] | cumulative | imported package
import time:         0 |          0 |     _codecs
import time:         0 |          0 |   codecs
import time:         0 |          0 |   encodings.aliases
import time:     15000 |      15000 | encodings
import time:         0 |          0 | encodings.utf_8
import time:         0 |          0 | _signal
import time:         0 |          0 | encodings.latin_1
import time:         0 |          0 |     _weakrefset
import time:         0 |          0 |   abc
import time:         0 |          0 | io
import time:         0 |          0 |       _stat
(...)

In bpo-31415, I fixed the issue by adding a new C function _PyTime_GetPerfCounter() to access the time.perf_counter() clock at the C level and I modified "importtime" to use it.

Problem solved! ... almost...

Double integer-float conversions

My commit a997c7b4 of bpo-31415 adding _PyTime_GetPerfCounter() moved the C code from Modules/timemodule.c to Python/pytime.c, but also changed the internal type storing time from floatting point number (C double) to integer number (_PyTyime_t which is int64_t in practice).

The drawback of this change is that time.perf_counter() now converts QueryPerformanceCounter() / QueryPerformanceFrequency() float into a _PyTime_t (integer) and then back to a float, and these conversions cause a precision loss. I computed that the conversions start to loose precision starting after a single second with QueryPerformanceFrequency() equals to 3,579,545 (3.6 MHz).

To fix the precision loss, I modified again time.clock() and time.perf_counter() to not use _PyTime_t anymore, only double.

Grumpy Victor

My change to replace _PyTime_t with double made me grumpy. I have been trying to convert all Python clocks to _PyTime_t since 6 years (2012).

Being blocked by a single clock made me grumpy, especially because the issue is specific to the Windows implementation. The Linux implementation of time.perf_counter() uses clock_gettime() which directly returns nanoseconds as integers, no division needed to get time as _PyTime_t.

I looked at the clock sources in the Linux kernel source code: kernel/time/clocksource.c. Linux clocks only use integers and support nanosecond resolution. I'm always impressed by the quality of the Linux kernel source code, the code is straightforward C code. If Linux is able to use integers for various kinds of clocks, I should be able to use integers for my specific Windows implementations of time.perf_counter(), no?

In practice, the _PyTime_t is a number of nanoseconds, so the computation is:

(QueryPerformanceCounter() * 1_000_000_000) / QueryPerformanceFrequency()

where 1_000_000_000 is the number of nanoseconds in one second. The problem is preventing integer overflow on the first part, using _PyTime_t which is int64_t in practice:

QueryPerformanceCounter() * 1_000_000_000

Some maths to avoid the precision loss

Using a pencil, a sheet of paper and some maths, I found a solution!

(a * b) / q == (a / q) * b + ((a % q) * b) / q

This prevents the risk of integer overflow. C implementation:

Py_LOCAL_INLINE(_PyTime_t)
_PyTime_MulDiv(_PyTime_t ticks, _PyTime_t mul, _PyTime_t div)
{
    _PyTime_t intpart, remaining;
    /* Compute (ticks * mul / div) in two parts to prevent integer overflow:
       compute integer part, and then the remaining part.

       (ticks * mul) / div == (ticks / div) * mul + (ticks % div) * mul / div

       The caller must ensure that "(div - 1) * mul" cannot overflow. */
    intpart = ticks / div;
    ticks %= div;
    remaining = ticks * mul;
    remaining /= div;
    return intpart * mul + remaining;
}

Simplified Windows implementation of perf_counter():

_PyTime_t win_perf_counter(void)
{
    LARGE_INTEGER freq;
    LONGLONG frequency;
    LARGE_INTEGER now;
    LONGLONG ticksll;
    _PyTime_t ticks;

    (void)QueryPerformanceFrequency(&freq);
    frequency = freq.QuadPart;

    QueryPerformanceCounter(&now);
    ticksll = now.QuadPart;
    ticks = (_PyTime_t)ticksll;

    return _PyTime_MulDiv(ticks, SEC_TO_NS, (_PyTime_t)frequency);
}

On Windows, I added the following sanity checks to make sure that integer overflows cannot occur:

/* Check that frequency can be casted to _PyTime_t.

   Make also sure that (ticks * SEC_TO_NS) cannot overflow in
   _PyTime_MulDiv(), with ticks < frequency.

   Known QueryPerformanceFrequency() values:

   * 10,000,000 (10 MHz): 100 ns resolution
   * 3,579,545 Hz (3.6 MHz): 279 ns resolution

   None of these frequencies can overflow with 64-bit _PyTime_t, but
   check for overflow, just in case. */
if (frequency > _PyTime_MAX
    || frequency > (LONGLONG)_PyTime_MAX / (LONGLONG)SEC_TO_NS) {
    PyErr_SetString(PyExc_OverflowError,
                    "QueryPerformanceFrequency is too large");
    return -1;
}

Since I also modified the macOS implementation of time.monotonic() to use _PyTime_MulDiv(), I also added this check for macOS:

/* Make sure that (ticks * timebase.numer) cannot overflow in
   _PyTime_MulDiv(), with ticks < timebase.denom.

   Known time bases:

   * always (1, 1) on Intel
   * (1000000000, 33333335) or (1000000000, 25000000) on PowerPC

   None of these time bases can overflow with 64-bit _PyTime_t, but
   check for overflow, just in case. */
if ((_PyTime_t)timebase.numer > _PyTime_MAX / (_PyTime_t)timebase.denom) {
    PyErr_SetString(PyExc_OverflowError,
                    "mach_timebase_info is too large");
    return -1;
}

pytime.c source code

If you are curious, the full code lives at Python/pytime.c and is currently around 1,100 lines of C code.

Conclusion

INADA Naoki's importtime tool was using time.monotonic() clock which failed to measure short import times on Windows. I modified it to use time.perf_counter() internally to get better precision on Windows. I identified a precision loss caused by my internal _PyTime_t type to store time as nanoseconds. Thanks to maths, I succeeded to use nanoseconds and prevent any risk of integer overflow.

Victor Stinner blog 3