Victor Stinner blog 3https://vstinner.github.io/2024-03-20T17:00:00+01:00Status of the Python Limited C API (March 2024)2024-03-20T17:00:00+01:002024-03-20T17:00:00+01:00Victor Stinnertag:vstinner.github.io,2024-03-20:/status-limited-c-api-march-2024.html<a class="reference external image-reference" href="https://danielazconegui.com/en/prints/ghibli-spyrited-away.html"><img alt="Ghibli - Spirited Away" src="https://vstinner.github.io/images/ghibli-spyrited-away.jpg" /></a>
<p>In Python 3.13, I made multiple enhancements to make the limited C API more
usable:</p>
<ul class="simple">
<li>Add 14 functions to the limited C API.</li>
<li>Make the special debug build <tt class="docutils literal">Py_TRACE_REFS</tt> compatible with the limited
C API.</li>
<li>Enhance Argument Clinic to generate C code using the limited C API.</li>
<li>Add a convenient API to format a type's fully qualified name using the limited
C API (PEP 737).</li>
<li>Add <tt class="docutils literal">_testlimitedcapi</tt> extension.</li>
<li>Convert 16 stdlib extensions to the limited C API.</li>
</ul>
<p>What's Next?</p>
<ul class="simple">
<li>PEP 741: Python Configuration C API.</li>
<li>Py_GetConstant().</li>
<li>Cython and PyO3.</li>
</ul>
<p><em>Drawing: Ghibli - Spirited Away by Daniel Azconegui.</em></p>
<div class="section" id="new-functions">
<h2>New Functions</h2>
<p>I added 14 functions to the limited C API:</p>
<ul class="simple">
<li><tt class="docutils literal">PyDict_GetItemRef()</tt></li>
<li><tt class="docutils literal">PyDict_GetItemStringRef()</tt></li>
<li><tt class="docutils literal">PyImport_AddModuleRef()</tt></li>
<li><tt class="docutils literal">PyLong_AsInt()</tt></li>
<li><tt class="docutils literal">PyMem_RawCalloc()</tt></li>
<li><tt class="docutils literal">PyMem_RawFree()</tt></li>
<li><tt class="docutils literal">PyMem_RawMalloc()</tt></li>
<li><tt class="docutils literal">PyMem_RawRealloc()</tt></li>
<li><tt class="docutils literal">PySys_Audit()</tt></li>
<li><tt class="docutils literal">PySys_AuditTuple()</tt></li>
<li><tt class="docutils literal">PyType_GetFullyQualifiedName()</tt></li>
<li><tt class="docutils literal">PyType_GetModuleName()</tt></li>
<li><tt class="docutils literal">PyWeakref_GetRef()</tt></li>
<li><tt class="docutils literal">Py_IsFinalizing()</tt></li>
</ul>
<p>It makes code using these functions <strong>compatible with the limited C API</strong>.</p>
</div>
<div class="section" id="py-trace-refs">
<h2>Py_TRACE_REFS</h2>
<p>I modified the special debug build <tt class="docutils literal">Py_TRACE_REFS</tt>. Instead of adding two
members to <tt class="docutils literal">PyObject</tt> to create a doubly linked list of all objects, I added
a hash table to track all objects.</p>
<p>Since the <tt class="docutils literal">PyObject</tt> structure is no longer modified, this special debug
build is now <strong>ABI compatible</strong> with the <strong>release build</strong>! Moreover, it also
becomes compatible with the <strong>limited C API</strong>!</p>
</div>
<div class="section" id="argument-clinic">
<h2>Argument Clinic</h2>
<p>I modified Argument Clinic (AC) to generate C code compatible with the limited
C API.</p>
<p>First, I moved private functions used by Argument Clinic to the internal C API
and modified Argument Clinic to generate <tt class="docutils literal">#include</tt> directives to get these
functions. Then I modified Argument Clinic to use only the limited C API and to
stop generating these <tt class="docutils literal">#include</tt> directives.</p>
<p>At the beginning, only some converters were supported, and only the slower
<tt class="docutils literal">METH_VARARGS</tt> calling convention was available.</p>
<p>Now, more and more converters and formats are supported, and the regular
efficient <tt class="docutils literal">METH_FASTCALL</tt> calling convention is used.</p>
<div class="section" id="example">
<h3>Example</h3>
<p>Example from the <tt class="docutils literal">grp</tt> extension:</p>
<pre class="literal-block">
/*[clinic input]
grp.getgrgid
id: object
Return the group database entry for the given numeric group ID.
</pre>
<p>Python 3.12 uses the <strong>private</strong> <tt class="docutils literal">_PyArg_UnpackKeywords()</tt> function:</p>
<pre class="literal-block">
args = _PyArg_UnpackKeywords(args, nargs, NULL, kwnames, &_parser, 1, 1, 0, argsbuf);
if (!args) {
goto exit;
}
id = args[0];
return_value = grp_getgrgid_impl(module, id);
</pre>
<p>Python 3.13 now uses the public <tt class="docutils literal">PyArg_ParseTupleAndKeywords()</tt> function of
the <strong>limited C API</strong>:</p>
<pre class="literal-block">
if (!PyArg_ParseTupleAndKeywords(args, kwargs, "O:getgrgid", _keywords,
&id))
goto exit;
return_value = grp_getgrgid_impl(module, id);
</pre>
</div>
</div>
<div class="section" id="pep-737-format-type-name">
<h2>PEP 737: Format Type Name</h2>
<p>One issue that I had with Argument Clinic was to <strong>format an error message</strong>
with the limited C API. I cannot use the private <tt class="docutils literal">_PyArg_BadArgument()</tt>
function, nor access <tt class="docutils literal">PyTypeObject.tp_name</tt> (the structure is opaque in the
limited C API) to format a type name. While the limited C API provides
<tt class="docutils literal">PyType_GetName()</tt> and <tt class="docutils literal">PyType_GetQualName()</tt>, their output is still
different from how Python formats type names in error messages.</p>
<p>I proposed different APIs but there was no agreement. So I decided to write
<a class="reference external" href="https://peps.python.org/pep-0737/">PEP 737</a> "C API to format a type fully
qualified name".</p>
<p>After four months of discussions, the <strong>Steering Council</strong> decided to accept it
in Python 3.13.</p>
<p>Changes:</p>
<ul class="simple">
<li>Add <tt class="docutils literal">PyType_GetFullyQualifiedName()</tt> function.</li>
<li>Add <tt class="docutils literal">PyType_GetModuleName()</tt> function.</li>
<li>Add <tt class="docutils literal">%T</tt>, <tt class="docutils literal">%#T</tt>, <tt class="docutils literal">%N</tt> and <tt class="docutils literal">%#N</tt> formats to
<tt class="docutils literal">PyUnicode_FromFormat()</tt>.</li>
</ul>
<p>I also proposed adding a new <tt class="docutils literal">type.__fully_qualified_name__</tt> attribute, and a
few methods to format the fully qualified name of a type in Python. But the
Steering Council was not convinced and asked me to <strong>remove these Python
changes</strong> until someone comes up with a strong use case for this attribute and
these methods.</p>
<p>In <strong>2018</strong>, I made a <strong>first attempt</strong> at a similar change, but I had to
revert it. I created a discussion on the python-dev mailing list, but we failed
to reach a consensus.</p>
<p>In <strong>2011</strong>, I had already asked to stop the <strong>cargo cult</strong> of truncating type
names, but I didn't follow through on my idea by proactively removing the
truncation.</p>
<div class="section" id="example-1">
<h3>Example</h3>
<p>Example of the code generating an error message in the <tt class="docutils literal">pwd</tt> extension.</p>
<p>Python 3.12 uses the <strong>private</strong> <tt class="docutils literal">_PyArg_BadArgument()</tt> function:</p>
<pre class="literal-block">
_PyArg_BadArgument("getpwnam", "argument", "str", arg);
</pre>
<p>Python 3.13 now uses the new <tt class="docutils literal">%T</tt> format (PEP 737) of the <strong>limited C API</strong>:</p>
<pre class="literal-block">
PyErr_Format(PyExc_TypeError,
"getpwnam() argument must be str, not %T",
arg);
</pre>
</div>
</div>
<div class="section" id="add-testlimitedcapi-extension">
<h2>Add _testlimitedcapi extension</h2>
<p>In Python 3.12, C API tests are split into two categories:</p>
<ul class="simple">
<li><tt class="docutils literal">_testcapi</tt>: public C API</li>
<li><tt class="docutils literal">_testinternalcapi</tt>: internal C API (<tt class="docutils literal">Py_BUILD_CORE</tt>)</li>
</ul>
<p>I added a third <tt class="docutils literal">_testlimitedcapi</tt> extension to test the limited C API
(<tt class="docutils literal">Py_LIMITED_API</tt>). I moved tests using the limited C API from
<tt class="docutils literal">_testcapi</tt> to <tt class="docutils literal">_testlimitedcapi</tt>.</p>
<p>The difference between <tt class="docutils literal">_testcapi</tt> and <tt class="docutils literal">_testlimitedcapi</tt> is that the
<tt class="docutils literal">_testlimitedcapi</tt> extension is built with the <tt class="docutils literal">Py_LIMITED_API</tt> macro
defined, and so can only access the limited C API.</p>
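<p>For reference, an extension opts into the limited C API by defining <tt class="docutils literal">Py_LIMITED_API</tt> before the first <tt class="docutils literal">#include</tt> of <tt class="docutils literal">Python.h</tt>; for example, to target the Python 3.13 stable ABI:</p>

```c
/* Must come before the first #include of Python.h. */
#define Py_LIMITED_API 0x030d0000   /* Python 3.13 */
#include <Python.h>
```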
</div>
<div class="section" id="convert-stdlib-extensions-to-the-limited-c-api">
<h2>Convert stdlib extensions to the limited C API</h2>
<p>In August 2023, I proposed to:
<a class="reference external" href="https://discuss.python.org/t/use-the-limited-c-api-for-some-of-our-stdlib-c-extensions/32465">Use the limited C API for some of our stdlib C extensions</a>.</p>
<p>In March 2024, there are now <strong>16</strong> C extensions built with the limited C API:</p>
<ul class="simple">
<li><tt class="docutils literal">_ctypes_test</tt></li>
<li><tt class="docutils literal">_multiprocessing.posixshmem</tt></li>
<li><tt class="docutils literal">_scproxy</tt></li>
<li><tt class="docutils literal">_stat</tt></li>
<li><tt class="docutils literal">_statistics</tt></li>
<li><tt class="docutils literal">_testimportmultiple</tt></li>
<li><tt class="docutils literal">_testlimitedcapi</tt></li>
<li><tt class="docutils literal">_uuid</tt></li>
<li><tt class="docutils literal">errno</tt></li>
<li><tt class="docutils literal">fcntl</tt></li>
<li><tt class="docutils literal">grp</tt></li>
<li><tt class="docutils literal">md5</tt></li>
<li><tt class="docutils literal">pwd</tt></li>
<li><tt class="docutils literal">resource</tt></li>
<li><tt class="docutils literal">termios</tt></li>
<li><tt class="docutils literal">winsound</tt></li>
</ul>
<p>Other stdlib C extensions use the internal C API for various reasons, or use
functions which are missing from the limited C API. The remaining issues should
be analyzed on a case-by-case basis.</p>
<p>This work shows that non-trivial C extensions can be written using only the
limited C API version 3.13.</p>
</div>
<div class="section" id="what-s-next">
<h2>What's Next?</h2>
<div class="section" id="pep-741-python-configuration-c-api">
<h3>PEP 741: Python Configuration C API</h3>
<p>In Python 3.8, I added the <tt class="docutils literal">PyConfig</tt> API to configure the Python
initialization. Problem: it has no stable ABI and is excluded from the limited
C API.</p>
<p>Recently, I proposed <a class="reference external" href="https://peps.python.org/pep-0741/">PEP 741: Python Configuration C API</a>, which is built on top of
<tt class="docutils literal">PyConfig</tt>, provides a stable ABI, and is compatible with the limited C API. I
submitted PEP 741 to the Steering Council.</p>
</div>
<div class="section" id="py-getconstant">
<h3>Py_GetConstant()</h3>
<p>Accessing constants currently reads private ABI symbols. For example, the <tt class="docutils literal">Py_None</tt> API
reads the private <tt class="docutils literal">_Py_NoneStruct</tt> symbol at the stable ABI level.</p>
<p>I <a class="reference external" href="https://github.com/python/cpython/pull/116883">proposed</a> changing the
constant implementations to use function calls instead. For example, reading
<tt class="docutils literal">Py_None</tt> would call <tt class="docutils literal">Py_GetConstant(Py_CONSTANT_NONE)</tt>. An advantage is
that it adds 5 more constants: zero, one, the empty string, the empty bytes string, and
the empty tuple. For example, <tt class="docutils literal">Py_GetConstant(Py_CONSTANT_ZERO)</tt> gives the number
<tt class="docutils literal">0</tt>, and the function cannot fail.</p>
</div>
<div class="section" id="cython-and-pyo3">
<h3>Cython and PyO3</h3>
<p>The Cython and PyO3 projects are two big consumers of the C API.</p>
<p>While Cython has an experimental build mode for the limited C API, it's still
incomplete. It would be nice to complete it to cover more use cases and more
APIs.</p>
<p>PyO3 can use the limited C API, but still uses the non-limited API for some use
cases. It would be interesting to only use the limited C API. PEP 741 would
also be interesting for embedding Python in Rust.</p>
</div>
</div>
Remove private C API functions2023-12-15T23:00:00+01:002023-12-15T23:00:00+01:00Victor Stinnertag:vstinner.github.io,2023-12-15:/remove-c-api-funcs-313.html<a class="reference external image-reference" href="https://en.wikipedia.org/wiki/The_Seasons_(Mucha)"><img alt="Mucha painting: the 4 seasons" src="https://vstinner.github.io/images/mucha_seasons.jpg" /></a>
<p>In Python 3.13 alpha 1, I removed more than 300 private C API functions. Even
though I announced my plan early in July, users didn't "embrace" my plan and didn't
agree with the rationale. I reverted 50 functions in the alpha 2 release to
calm down the situation and have more time to replace private functions with
public functions.</p>
<p><em>Painting: The Seasons by Czech visual artist Alphonse Mucha (1900)</em></p>
<div class="section" id="remove-private-functions">
<h2>Remove private functions</h2>
<p>On June 25th, I created <a class="reference external" href="https://github.com/python/cpython/issues/106084">issue gh-106084</a>: "Remove private C API
functions from abstract.h".</p>
<blockquote>
Over the years, we accumulated many <strong>private</strong> functions as part of the
<strong>public</strong> C API in abstract.h header file. I propose to remove them: move
them to the <strong>internal</strong> C API.</blockquote>
<p>On July 1st, I created the meta <a class="reference external" href="https://github.com/python/cpython/issues/106320">issue gh-106320</a>: "Remove private C API
functions". The issue has 63 pull requests (a lot!), 53 comments and more than
300 events (created by commits and pull requests), which makes the issue hard
to navigate.</p>
<p>On July 3rd, <strong>Petr Viktorin</strong> shared his concerns:</p>
<blockquote>
<p>Please be careful about assuming that the <strong>underscore</strong> means a function
is <strong>private</strong>. AFAIK, that rule first appears for <a class="reference external" href="https://docs.python.org/3.10/c-api/stable.html#stable">3.10</a>, and was only
properly formalized in <a class="reference external" href="https://peps.python.org/pep-0689/">PEP 689</a>, for
Python 3.12.</p>
<p>For older functions, please consider if they should be added to the
unstable API. IMO it's better to call them “underscored” than “private”.</p>
<p>See also: historical note in the <a class="reference external" href="https://devguide.python.org/developer-workflow/c-api/index.html#private-names">devguide</a>.</p>
</blockquote>
<p>On July 4th, <strong>Petr</strong> posted on Discourse: <a class="reference external" href="https://discuss.python.org/t/pssst-lets-treat-all-api-in-public-headers-as-public/28916">(pssst) Let's treat all API in
public headers as public</a>.</p>
</div>
<div class="section" id="remove-more-private-functions">
<h2>Remove more private functions</h2>
<p>By July 4th, I had removed <a class="reference external" href="https://github.com/python/cpython/issues/106320#issuecomment-1620749616">181 private functions</a>.</p>
<p>On July 4th, I identified that <a class="reference external" href="https://github.com/python/cpython/issues/106320#issuecomment-1620773057">34 projects</a> in the
PyPI top 5,000 are affected by these removals.</p>
<p>On July 7th, I <a class="reference external" href="https://github.com/python/pythoncapi-compat/pull/62">added PyObject_Vectorcall()</a> to the
pythoncapi-compat project.</p>
<p>On July 9th, I started the discussion:
<a class="reference external" href="https://discuss.python.org/t/c-api-how-much-private-is-the-private-py-identifier-api/29190">C API: How much private is the private _Py_IDENTIFIER() API?</a></p>
<p>On July 13th, I asked if <a class="reference external" href="https://github.com/python/cpython/issues/106320#issuecomment-1633302147">the PyComplex API</a>
should be made private or not. Petr noticed that this API was documented.</p>
<p>On July 23rd, I tried to build numpy, but I was blocked by Cython, which was broken by my
changes. I created <a class="reference external" href="https://github.com/python/cpython/issues/107076">issue gh-107076</a>: "C API: Cython 3.0 uses
private functions removed in Python 3.13 (numpy 1.25.1 fails to build)".</p>
<p>On July 23rd, I found that the private <tt class="docutils literal">_PyTuple_Resize()</tt> function is documented. I
proposed <a class="reference external" href="https://github.com/python/cpython/pull/107139">adding a new internal _PyTupleBuilder API</a> to replace
<tt class="docutils literal">_PyTuple_Resize()</tt>.</p>
<p>On July 23rd, I proposed:
<a class="reference external" href="https://discuss.python.org/t/c-api-my-plan-to-clarify-private-vs-public-functions-in-python-3-13/30131">C API: My plan to clarify private vs public functions in Python 3.13</a>.</p>
<blockquote>
Private API has multiple issues: they are usually <strong>not documented</strong>, <strong>not
tested</strong>, and so their <strong>behavior may change</strong> without any warning or
anything. Also, they can be <strong>removed anytime</strong> without any notice.</blockquote>
<ul class="simple">
<li>Phase 1: Remove as many private APIs as possible.</li>
<li>Phase 2 (Python 3.13 alpha 1): revert removals if needed to make sure that Cython, numpy and pip
work.</li>
<li>Phase 3 (Python 3.13 beta 1): consider reverting more removals if needed.</li>
</ul>
<p>On July 24th, I created the PR <a class="reference external" href="https://github.com/python/cpython/pull/107068">Remove private _PyCrossInterpreterData API</a>. <strong>Eric Snow</strong> asked me
to keep this private API since it's used by 3rd party C extensions.</p>
<p>On August 24th, I created <a class="reference external" href="https://github.com/python/cpython/issues/108444">issue gh-108444</a> to add <tt class="docutils literal">PyLong_AsInt()</tt>
public function, replacing the removed <tt class="docutils literal">_PyLong_AsInt()</tt> function.</p>
<p>On September 4th, I looked at the <tt class="docutils literal">_PyArg</tt> API. I started the discussion:
<a class="reference external" href="https://discuss.python.org/t/use-the-limited-c-api-for-some-of-our-stdlib-c-extensions/32465">Use the limited C API for some of our stdlib C extensions</a>.</p>
<p>On September 4th, <a class="reference external" href="https://discuss.python.org/t/c-api-my-plan-to-clarify-private-vs-public-functions-in-python-3-13/30131/9">I declared</a>:</p>
<blockquote>
I declare that the Python 3.13 <strong>season of “removing as many private C API
as possible” ended</strong>! I stop here until Python 3.14.</blockquote>
<p>Python 3.12 exports <strong>385</strong> private functions. After the cleanup, Python 3.13
only exports <strong>86</strong> private functions: I removed 299 functions. I closed the
issue.</p>
</div>
<div class="section" id="python-3-13-alpha-1-negative-feedback">
<h2>Python 3.13 alpha 1 negative feedback</h2>
<p>On October 13th, <strong>Python 3.13 alpha 1 was released</strong> with my changes
removing around 300 private C API functions.</p>
<p>On October 14th, <strong>Guido van Rossum</strong> <a class="reference external" href="https://github.com/python/cpython/issues/106320#issuecomment-1762755146">asked</a>:</p>
<blockquote>
Thanks for the list. Should we <strong>encourage</strong> various <strong>projects to test
3.13a1</strong>, which just came out? Is there a way we can encourage them more?</blockquote>
<p>On October 30th, <strong>Stefan Behnel</strong>, Cython creator, posted the message:
<a class="reference external" href="https://discuss.python.org/t/python-3-13-alpha-1-contains-breaking-changes-whats-the-plan/37490">Python 3.13 alpha 1 contains breaking changes, what's the plan?</a>.
He also <a class="reference external" href="https://github.com/python/cpython/issues/106320#issuecomment-1772735064">commented on the issue</a>.
Extract:</p>
<blockquote>
I just came across this issue. Let me express my general disapproval
regarding deliberate breakage, which this issue appears to be entirely
about. As far as I can see, none of these removals was motivated. The mere
idea of removing existing API "because we can" is entirely foreign to me.</blockquote>
<p>On October 31st, <strong>Petr</strong> asked the Steering Council:
<a class="reference external" href="https://github.com/python/steering-council/issues/212">Is it OK to remove _PyObject_Vectorcall?</a>
about the removal of old aliases with underscore, such as
<tt class="docutils literal">_PyObject_Vectorcall</tt>.
I didn't know that these names were part of <a class="reference external" href="https://peps.python.org/pep-0590/">PEP 590 – Vectorcall: a fast
calling protocol for CPython</a>, nothing was
written about that in the header files.</p>
<p>On November 2nd, <strong>Guido</strong> <a class="reference external" href="https://github.com/python/cpython/issues/106320#issuecomment-1790832433">wrote</a>
(where WG stands for C API Working Group):</p>
<blockquote>
<p>We can talk till we’re blue in the face but please no more action (i.e., no
more moving/removing APIs) until the full WG has had a chance to discuss
this and make a decision.</p>
<p>(Restoring removed APIs at users’ requests is fine.)</p>
</blockquote>
<p>On November 3rd, <strong>Gregory P. Smith</strong> <a class="reference external" href="https://github.com/python/cpython/issues/111481#issuecomment-1794211126">wrote</a>:</p>
<blockquote>
<p>I'd much prefer 'revert' for any API anyone is found using in 3.13.</p>
<p>We need to treat 3.13 as a more special than usual release and aim to
minimize compatibility headaches for existing project code. That way more
things that build and run on 3.12 build can run on 3.13 as is or with
minimal work.</p>
<p>This will enable ecosystem code owners to focus on the bigger picture task
of enabling existing code to be built and tested on an experimental pep703
free-threading build rather than having a pile of unrelated cleanup trivia
blocking that.</p>
</blockquote>
<p>On November 7th, my colleague <strong>Karolina Surma</strong> posted a report: <a class="reference external" href="https://discuss.python.org/t/ongoing-packages-rebuild-with-python-3-13-in-fedora/38134">Ongoing packages'
rebuild with Python 3.13 in Fedora</a>.
She did great bug triage work, counting build failures per C API issue while
recompiling 4,000+ Python packages in Fedora with Python 3.13.</p>
<p>On November 13th, <strong>Petr</strong> also identified that the private PyComplex API, such as
the <tt class="docutils literal">_Py_c_sum()</tt> function, was documented. Moreover, <a class="reference external" href="https://github.com/python/cpython/issues/112019">issue gh-112019</a> was created to ask to
revert these APIs.</p>
</div>
<div class="section" id="revert-in-python-3-13-alpha-2">
<h2>Revert in Python 3.13 alpha 2</h2>
<p>On November 13th, I created <a class="reference external" href="https://github.com/python/cpython/issues/112026">issue gh-112026</a>: "[C API] Revert of private
functions removed in Python 3.13 causing most problems". I made 4 changes:</p>
<ul class="simple">
<li>Add again <tt class="docutils literal"><unistd.h></tt> include in Python.h</li>
<li>Restore removed private C API</li>
<li>Restore removed _PyDict_GetItemStringWithError()</li>
<li>Add again _PyThreadState_UncheckedGet() function</li>
</ul>
<p>I selected functions by looking at bug reports, <strong>Karolina</strong>'s report, and by
trying to build numpy and cffi. With my reverts, numpy built successfully, and
cffi built successfully with a minor change that I reported upstream
(<a class="reference external" href="https://github.com/python-cffi/cffi/pull/34">cffi: Use PyErr_FormatUnraisable() on Python 3.13</a>).</p>
<p>In total, I restored <a class="reference external" href="https://github.com/python/cpython/issues/112026#issuecomment-1813191948">50 private functions</a>.</p>
<p>On November 22nd, <strong>Python 3.13 alpha 2 was released</strong> with these restored
functions. It seems like the situation is calmer now.</p>
<p>Reverting was part of my initial plan; it had been clearly announced from the
beginning. But I didn't expect that so many people would test Python 3.13 alpha
1 as soon as it was released (in October)! Usually, we only start to get feedback
around beta 1 (in May). I had like <strong>2 weeks to fix most issues instead of 7
months</strong>. It was really stressful for me.</p>
<p>I <a class="reference external" href="https://discuss.python.org/t/python-3-13-alpha-1-contains-breaking-changes-whats-the-plan/37490/29">posted a message to apologize</a>
and to give the context of this work. Extract:</p>
<blockquote>
<p>Following the announced plan, I reverted 50 private APIs which were
removed in Python 3.13 alpha 1. These APIs will be available again in the
incoming Python 3.13 alpha 2 (scheduled next Tuesday).</p>
<p>I <strong>planned to make Cython, numpy and cffi compatible</strong> with Python 3.13
<strong>alpha 1</strong>. Well, I missed this release. With the reverted changes, numpy
1.26.2 can be built successfully, and cffi 1.16.0 just requires a single
change. So we should be good (or almost good) for Python 3.13
<strong>alpha 2</strong>.</p>
<p>(...)</p>
<p>I’m sorry if some people felt that this C API work was forced on them and
their opinion was not taken into account. We heard you and we took your
feedback into account. It took me time to adjust my plan according to early
received feedback. I expected to have 6 months to work step by step. Well,
I had 2 weeks instead 🙂</p>
</blockquote>
</div>
<div class="section" id="add-public-functions">
<h2>Add public functions</h2>
<p>On October 30th, I created <a class="reference external" href="https://github.com/python/cpython/issues/111481">issue gh-111481</a>: "[C API] Meta issue: add
new public functions with doc+tests to replace removed private functions".</p>
<p>So far, I added 7 public functions to Python 3.13:</p>
<ul class="simple">
<li><tt class="docutils literal">PyDict_Pop()</tt></li>
<li><tt class="docutils literal">PyDict_PopString()</tt></li>
<li><tt class="docutils literal">PyList_Clear()</tt></li>
<li><tt class="docutils literal">PyList_Extend()</tt></li>
<li><tt class="docutils literal">PyLong_AsInt()</tt></li>
<li><tt class="docutils literal">Py_HashPointer()</tt></li>
<li><tt class="docutils literal">Py_IsFinalizing()</tt></li>
</ul>
<p>More functions are coming soon: I have many open pull requests!</p>
<p>Adding new functions is slower than I expected. The good part is that many
people are reviewing the APIs, and the new public APIs are way better than
the old private ones: less error-prone, can be more efficient, etc. At least,
the conversion from private to public is moving steadily: functions are added
one by one.</p>
</div>
Design the API of a new PyDict_GetItemRef() function2023-11-16T20:00:00+01:002023-11-16T20:00:00+01:00Victor Stinnertag:vstinner.github.io,2023-11-16:/c-api-dict-getitemref.html<p>Last June, I proposed adding a new <tt class="docutils literal">PyDict_GetItemRef()</tt> function to Python
3.13 C API. Every aspect of the API design was discussed at length. I will
explain how the API was designed, and finish with the future creation of the
C API Working Group.</p>
<img alt="Psyche Revived by Cupid's Kiss" src="https://vstinner.github.io/images/amour_psychee.jpg" />
<p>Photo: <em>Psyche Revived by Cupid's Kiss</em> sculpture by Antonio Canova.</p>
<div class="section" id="add-pyimport-addmoduleref-function">
<h2>Add PyImport_AddModuleRef() function</h2>
<p>In June, while reading Python C code, I found <a class="reference external" href="https://github.com/python/cpython/blob/8cd70eefc7f3363cfa0d43f34522c3072fa9e160/Python/import.c#L345-L369">surprising code</a>:
the <tt class="docutils literal">PyImport_AddModuleObject()</tt> function creates a <strong>weak reference</strong> to the
module returned by <tt class="docutils literal">import_add_module()</tt>, calls <tt class="docutils literal">Py_DECREF()</tt> on the module,
and then tries to get the module back from the weak reference: it can be NULL if
the reference count was one. I expected just a <tt class="docutils literal">Py_DECREF()</tt>, but no,
complicated code involving a weak reference is needed to prevent a crash.</p>
<p>So I <a class="reference external" href="https://github.com/python/cpython/issues/105922">added</a> the new
<a class="reference external" href="https://docs.python.org/dev/c-api/import.html#c.PyImport_AddModuleRef">PyImport_AddModuleRef() function</a> to
directly return the strong reference, avoiding the creation of a temporary
weak reference.</p>
<p>Note: The API of the new PyImport_AddModuleRef() function is <a class="reference external" href="https://github.com/python/cpython/issues/106915">still being
discussed and may change in the near future</a>.</p>
</div>
<div class="section" id="add-pyweakref-getref-function">
<h2>Add PyWeakref_GetRef() function</h2>
<p>Shortly after, I <a class="reference external" href="https://github.com/python/cpython/issues/105927">added</a> the
new <a class="reference external" href="https://docs.python.org/dev/c-api/weakref.html#c.PyWeakref_GetRef">PyWeakref_GetRef() function</a>. It is
similar to <tt class="docutils literal">PyWeakref_GetObject()</tt>, but returns a strong reference instead of
a borrowed reference.</p>
<p>Since I listed <a class="reference external" href="https://pythoncapi.readthedocs.io/bad_api.html#borrowed-references">Bad C API</a> in my
"Design a new better C API for Python" project in 2018, I have been fighting
against borrowed references, since they cause multiple issues such as:</p>
<ul class="simple">
<li>Subtle crashes in C extensions.</li>
<li>Make the C API implementation in PyPy more complicated: see
<a class="reference external" href="https://www.pypy.org/posts/2018/09/inside-cpyext-why-emulating-cpython-c-8083064623681286567.html">Inside cpyext: Why emulating CPython C API is so Hard</a>
(2018) by Antonio Cuni.</li>
<li>Unknown object lifetimes, preventing optimization opportunities.</li>
<li>Make the C API less regular and harder to use: some functions return a new
reference, others return a borrowed reference.</li>
</ul>
<p>In 2020, my first attempt to <a class="reference external" href="https://github.com/python/cpython/issues/86460">add a new PyTuple_GetItemRef() function</a> was rejected.</p>
</div>
<div class="section" id="pydict-getitemref-easy">
<h2>PyDict_GetItemRef(): easy!</h2>
<p>Since adding the
<tt class="docutils literal">PyImport_AddModuleRef()</tt> and <tt class="docutils literal">PyWeakref_GetRef()</tt> functions went well
(quick discussion, no major disagreement), I felt lucky and
proposed <a class="reference external" href="https://github.com/python/cpython/issues/106004">adding a new PyDict_GetItemRef() function</a>. It should be easy as well,
right? The discussion started in the issue and continued in the associated
<a class="reference external" href="https://github.com/python/cpython/pull/106005">pull request</a>.</p>
<p>The idea of <tt class="docutils literal">PyDict_GetItemRef()</tt> is to replace the <tt class="docutils literal">PyDict_GetItem()</tt>
function, which returns a borrowed reference and ignores all errors:
<tt class="docutils literal">hash(key)</tt> error, <tt class="docutils literal">key == key2</tt> comparison error, <tt class="docutils literal">KeyboardInterrupt</tt>,
etc.</p>
<p>There is already the <tt class="docutils literal">PyDict_GetItemWithError()</tt> function which reports
errors. But it returns a borrowed reference and its API has an issue: when it
returns <tt class="docutils literal">NULL</tt>, the caller must check <tt class="docutils literal">PyErr_Occurred()</tt> to know whether an
exception is set or the key is simply missing. This problem was the <a class="reference external" href="https://github.com/capi-workgroup/problems/issues/1">very first
issue</a> created in the
Problems project of the C API Working Group.</p>
<p>This Problems project is a collaborative work to collect C API issues. By the
way, the <a class="reference external" href="https://peps.python.org/pep-0733/">PEP 733 – An Evaluation of Python’s Public C API</a> was published on October 16: a summary of
these problems.</p>
</div>
<div class="section" id="pydict-getitemref-api-version-1">
<h2>PyDict_GetItemRef(): API version 1</h2>
<p>I proposed the API:</p>
<pre class="literal-block">
int PyDict_GetItemRef(PyObject *mp, PyObject *key, PyObject **pvalue)
int PyDict_GetItemStringRef(PyObject *mp, const char *key, PyObject **pvalue)
</pre>
<p>Return <tt class="docutils literal">0</tt> on success, or <tt class="docutils literal"><span class="pre">-1</span></tt> on error. Simple, right?</p>
<p><strong>Gregory Smith</strong> was supportive:</p>
<blockquote>
I'm in favor of this because I don't think we should have public APIs that
(a) require a value check + <tt class="docutils literal">PyErr_Occurred()</tt> call pattern - a frequent
source of lurking bugs - or (b) return borrowed references. Yes I know we
already have them, that's missing the point. The point is that with these
in place, we can promote their use over the others because these are better
in all respects.</blockquote>
<p>Later, I discovered that the draft <a class="reference external" href="https://peps.python.org/pep-0703/">PEP 703 – Making the Global Interpreter
Lock Optional in CPython</a> proposed adding
a <tt class="docutils literal">PyDict_FetchItem()</tt> similar to my proposed <tt class="docutils literal">PyDict_GetItemRef()</tt>
function.</p>
</div>
<div class="section" id="api-version-2-change-the-return-value">
<h2>API version 2: Change the Return Value</h2>
<p><strong>Mark Shannon</strong> asked:</p>
<blockquote>
What's the rationale for not distinguishing between found and not found in
the return value? See: <a class="reference external" href="https://github.com/python/devguide/issues/1121">Document the preferred style for API functions with
three, four or five-way returns</a>.</blockquote>
<p>I modified the API to return <tt class="docutils literal">1</tt> if the key is present and return <tt class="docutils literal">0</tt> if
the key is missing.</p>
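<p>The resulting calling convention can be sketched with a small self-contained
stub (a hypothetical stand-in written for this article, not CPython's real
code): <tt class="docutils literal"><span class="pre">-1</span></tt> means error, <tt class="docutils literal">0</tt> means missing key, <tt class="docutils literal">1</tt> means found:</p>
<pre class="literal-block">
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

/* Hypothetical stub modeling the PyDict_GetItemRef() calling convention:
 * return -1 on error, 0 if the key is missing, 1 if found. On success,
 * the value is stored in *result; in the real API, it is a strong
 * reference that the caller must release with Py_DECREF(). */
static const char *stored_key = "answer";
static const char *stored_value = "42";

static int
stub_get_item_ref(const char *key, const char **result)
{
    *result = NULL;
    if (key == NULL) {
        return -1;   /* model a hash(key) or comparison error */
    }
    if (strcmp(key, stored_key) != 0) {
        return 0;    /* missing key: no error */
    }
    *result = stored_value;
    return 1;        /* found */
}
</pre>
<p>With this convention, the caller never needs an extra <tt class="docutils literal">PyErr_Occurred()</tt> call
to distinguish a missing key from an error.</p>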
<p>By the way, <strong>Erlend Aasland</strong> added <a class="reference external" href="https://devguide.python.org/developer-workflow/c-api/index.html#guidelines-for-expanding-changing-the-public-api">C API guidelines</a>
in the Python Developer Guide (devguide) about function return values.</p>
</div>
<div class="section" id="function-name">
<h2>Function Name</h2>
<p><strong>Serhiy Storchaka</strong> had concerns about the name:</p>
<blockquote>
The only problem is that functions with so similar names have completely
different interface. It is pretty confusing. Would not be better to name it
<tt class="docutils literal">PyDict_LookupItem</tt> or like? It may be worth to add also
<tt class="docutils literal">PyMapping_LookupItem</tt> for convenience.</blockquote>
<p><strong>Mark Shannon</strong> added:</p>
<blockquote>
<p>Can we come up with a better name than <tt class="docutils literal">PyDict_GetItemRef</tt>?
I see why you are adding <tt class="docutils literal">Ref</tt> to the end, but all API functions should
return new references, so it is a bit like calling the function
PyDict_GetItemNotWrong.</p>
<p>Obviously, the ideal name [<tt class="docutils literal">PyDict_GetItem()</tt>] is already taken. Anyone
have any suggestions for a better name?</p>
</blockquote>
<p><strong>Sam Gross</strong> wrote:</p>
<blockquote>
<p>In the context of PEP 703, I think it would be better to have variations
that only change one axis of the semantics (e.g., new vs. borrowed, error
vs. no error) and have the naming reflect that. For example, PEP 703
proposes:</p>
<p><tt class="docutils literal">PyDict_FetchItem</tt> for <tt class="docutils literal">PyDict_GetItem</tt> and
<tt class="docutils literal">PyDict_FetchItemWIthError</tt> for <tt class="docutils literal">PyDict_GetItemWithError</tt>.</p>
</blockquote>
<p>I created <a class="reference external" href="https://github.com/capi-workgroup/problems/issues/52">Naming convention for new C API functions</a> to discuss the <tt class="docutils literal">Ref</tt>
suffix for new functions returning a strong reference.</p>
<p>PEP 703 proposes the <tt class="docutils literal">PyDict_FetchItem()</tt> name.</p>
</div>
<div class="section" id="first-argument-type">
<h2>First Argument Type</h2>
<p><strong>Mark Shannon</strong> had concerns about the first argument type:</p>
<blockquote>
Using <tt class="docutils literal">PyObject*</tt> is needlessly throwing away type information.</blockquote>
<p><strong>Erlend Aasland</strong> added:</p>
<blockquote>
Why not strongly typed, since it is a <tt class="docutils literal">PyDict_</tt> API?</blockquote>
</div>
<div class="section" id="pull-request-approvals-and-the-function-name-strikes-back">
<h2>Pull Request Approvals And The Function Name Strikes Back</h2>
<p><strong>Erlend</strong> and <strong>Gregory</strong> approved my pull request.</p>
<p><strong>Erlend</strong> wrote:</p>
<blockquote>
I'm approving this. A new naming scheme makes sense for a new API; I'm not
sure it makes sense to try and enforce a new scheme in the current API. For
now, there is already precedence of the <tt class="docutils literal">Ref</tt> suffix in the current API;
I'm ok with that. Also, the current API uses <tt class="docutils literal">PyObject*</tt> all over the
place. If we are to change this, we practically will end up with a
completely new API; AFAICS, there is no problem with sticking to the
current practice.</blockquote>
<p>Then the discussion about the function name came back. So <strong>Gregory</strong> asked the
Steering Council: <a class="reference external" href="https://github.com/python/steering-council/issues/201">Should we add non-borrowed-ref public C APIs, if
so, is there a naming convention?</a>. He asked two
questions:</p>
<ul class="simple">
<li>Q1: Should we add non-borrowed-reference public C APIs where only
borrowed-reference ones exist?</li>
<li>Q2: If yes to Q1, is there a preferred naming convention to use for new
public C APIs that return a strong reference when the earlier APIs these
would be parallel versions of only returned a borrowed reference?</li>
</ul>
<p>Later, <strong>Serhiy Storchaka</strong> also approved the pull request:</p>
<blockquote>
<p>In general, I support adding this function. The benefits:</p>
<ul class="simple">
<li>Returns a strong reference. It will save from some errors and may be
better for PyPy.</li>
<li>Save CPU time for calling PyErr_Occurred().</li>
</ul>
</blockquote>
<p>The PR had a total of 3 approvals.</p>
</div>
<div class="section" id="api-version-3-use-pydictobject">
<h2>API version 3: use PyDictObject</h2>
<p>When I asked <strong>Mark</strong> again for his opinion on the API, he wrote:</p>
<blockquote>
I'm opposed because making ad-hoc changes like this is going to make the
C-API worse, not better.</blockquote>
<p>I made the change <strong>Mark</strong> asked for: I changed the first parameter type from
<tt class="docutils literal">PyObject*</tt> to <tt class="docutils literal">PyDictObject*</tt>. API version 3:</p>
<pre class="literal-block">
int PyDict_GetItemRef(PyDictObject *op, PyObject *key, PyObject **pvalue)
</pre>
</div>
<div class="section" id="disagreement-on-the-pydictobject-type">
<h2>Disagreement On The PyDictObject Type</h2>
<p><strong>Serhiy</strong> was against the change:</p>
<blockquote>
I dislike using concrete struct types instead of <tt class="docutils literal">PyObject*</tt> in API,
especially in public API. Isn't there a rule forbidding this?</blockquote>
<p>In May, <strong>Mark</strong> created <a class="reference external" href="https://github.com/capi-workgroup/problems/issues/31">The C API is weakly typed</a> discussion in the
Problems project.</p>
<p>During the discussion, <strong>Erlend</strong> created <a class="reference external" href="https://github.com/python/devguide/issues/1127">Document guidelines for when to use
dynamically typed APIs</a> in
the devguide to try to find a consensus regarding guidelines for weakly/strongly
typed APIs.</p>
<p>There are two questions:</p>
<ul class="simple">
<li>Use <tt class="docutils literal">PyObject*</tt> or <tt class="docutils literal">PyDictObject*</tt> type for the parameter.</li>
<li>Check the type at runtime, or don't check for best performance (use an
assertion in debug mode).</li>
</ul>
<p><strong>Serhiy</strong> wrote:</p>
<blockquote>
<p>It is not about runtime checking.</p>
<p>It is about requiring to cast the argument to <tt class="docutils literal">PyDictObject*</tt> every time
you use the function: <tt class="docutils literal"><span class="pre">PyDict_GetItemRef((PyDictObject*)foo,</span> bar, &baz)</tt>.</p>
<p>It is tiresome, and it is unsafe, because the compiler will not reject the
code if <tt class="docutils literal">foo</tt> is <tt class="docutils literal">int</tt> or <tt class="docutils literal">const char*</tt>.</p>
</blockquote>
<p><strong>Gregory</strong> added:</p>
<blockquote>
Our C API only accepts plain <tt class="docutils literal">PyObject*</tt> as input to all our public
APIs. Otherwise user code will be littered with typecasts all over the
place.</blockquote>
<p><strong>Gregory</strong> removed his approval.</p>
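<p>The trade-off can be illustrated with a self-contained sketch (stub types
invented for this article, not the real CPython structs): a strongly typed
parameter forces a cast at each call site, and the cast silences the compiler
even when the pointer has the wrong type, while a weakly typed parameter can
check the type at runtime:</p>
<pre class="literal-block">
/* Stub types playing the roles of PyObject and PyDictObject. */
typedef struct { int kind; } ObjStub;
typedef struct { ObjStub base; int used; } DictStub;

/* Strongly typed: callers holding an ObjStub* must cast at each call,
 * and the cast also compiles for pointers which are not DictStub*. */
static int
used_strong(DictStub *mp)
{
    return mp->used;
}

/* Weakly typed: no cast needed, the type is checked at runtime. */
static int
used_weak(ObjStub *mp)
{
    if (mp->kind != 1) {
        return -1;   /* runtime type check failed */
    }
    return ((DictStub *)mp)->used;
}
</pre>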
</div>
<div class="section" id="revert-back-to-pyobject-type-api-version-2">
<h2>Revert: Back To PyObject Type (API Version 2)</h2>
<p>Since <strong>Serhiy</strong> and <strong>Gregory</strong> were against the change, I reverted it to move
back to the <tt class="docutils literal">PyObject*</tt> type. <strong>Serhiy</strong> and <strong>Erlend</strong> confirmed their
approval.</p>
<p>I created the issue <a class="reference external" href="https://github.com/capi-workgroup/problems/issues/55">Design a brand new C API with new PyCAPI_ prefix where all
functions respect new guidelines</a> in the Problems
project to discuss the creation of a brand new API. I suggested that <strong>Mark</strong>
only consider changing the weakly typed <tt class="docutils literal">PyObject*</tt> type to the strongly typed
<tt class="docutils literal">PyDictObject*</tt> in such a new API.</p>
</div>
<div class="section" id="more-changes-api-version-4">
<h2>More changes? API version 4</h2>
<p><strong>Petr Viktorin</strong> joined the discussion and proposed a late change:</p>
<blockquote>
FWIW, here's a possible new variant: you could set result to <tt class="docutils literal">NULL</tt> in
which case the result isn't stored/incref'd. And that would start a
convention of how to turn a get operation into a membership test. (And the
Lookup name would fit that better.)</blockquote>
<p>I didn't take <strong>Petr</strong>'s suggestion since <strong>Serhiy</strong> pointed out that there is
already the <tt class="docutils literal">PyDict_Contains()</tt> function to test whether a dictionary contains a
key.</p>
<p><strong>Mark Shannon</strong> wrote:</p>
<blockquote>
If this function is to take <tt class="docutils literal">PyObject*</tt>, as <strong>Erlend</strong> seems to insist,
then it shouldn't raise a <tt class="docutils literal">SystemError</tt> when passed something other than
a dict. It should raise a <tt class="docutils literal">TypeError</tt>.</blockquote>
<p>I modified the API (version 4) to raise <tt class="docutils literal">TypeError</tt> if the first argument
is not a dictionary, instead of <tt class="docutils literal">SystemError</tt>.</p>
</div>
<div class="section" id="merge-the-change">
<h2>Merge The Change</h2>
<p>After around one month of intense discussion, I merged my change adding the
<tt class="docutils literal">PyDict_GetItemRef()</tt> function (<a class="reference external" href="https://github.com/python/cpython/commit/41ca16455188db806bfc7037058e8ecff2755e6c">commit</a>)
with <a class="reference external" href="https://github.com/python/cpython/pull/106005#issuecomment-1646249360">a summary of the discussion</a>.</p>
<p>I also <a class="reference external" href="https://github.com/python/pythoncapi-compat/commit/eaff3c172f94ed32ac38860c38d7a8fa27483e57">added the function to pythoncapi-compat project</a>.</p>
<p>Final API:</p>
<pre class="literal-block">
int PyDict_GetItemRef(PyObject *p, PyObject *key, PyObject **result)
int PyDict_GetItemStringRef(PyObject *p, const char *key, PyObject **result)
</pre>
<p>Documentation:</p>
<ul class="simple">
<li><a class="reference external" href="https://docs.python.org/dev/c-api/dict.html#c.PyDict_GetItemRef">PyDict_GetItemRef</a></li>
<li><a class="reference external" href="https://docs.python.org/dev/c-api/dict.html#c.PyDict_GetItemStringRef">PyDict_GetItemStringRef</a></li>
</ul>
<p>Using the <a class="reference external" href="https://pythoncapi-compat.readthedocs.io/">pythoncapi-compat project</a>, you can use this new API right
now on all Python versions!</p>
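<p>Here is a sketch of the caller-side pattern (the <tt class="docutils literal">get_answer()</tt> helper is a
made-up example, not a CPython function):</p>
<pre class="literal-block">
#include &lt;Python.h&gt;

/* Sketch: look up a key and convert the value to a C long.
 * Return -1 on error (exception set), 0 if the key is missing,
 * 1 on success. */
static int
get_answer(PyObject *dict, const char *key, long *out)
{
    PyObject *value;
    int rc = PyDict_GetItemStringRef(dict, key, &value);
    if (rc <= 0) {
        return rc;       /* error (-1) or missing key (0) */
    }
    *out = PyLong_AsLong(value);
    Py_DECREF(value);    /* release our strong reference */
    if (*out == -1 && PyErr_Occurred()) {
        return -1;
    }
    return 1;
}
</pre>
<p>No <tt class="docutils literal">PyErr_Occurred()</tt> call is needed to distinguish a missing key from an
error, and there is no borrowed reference which could be invalidated while the
dictionary is mutated.</p>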
</div>
<div class="section" id="how-to-take-decisions">
<h2>How To Take Decisions?</h2>
<p>The discussions took place in many different places:</p>
<ul class="simple">
<li>My Python issue</li>
<li>My Python pull request</li>
<li>Multiple Problems issues</li>
<li>Multiple devguide issues</li>
<li>Steering Council issue</li>
</ul>
<p>The discussion was heated. <strong>Erlend</strong> decided to take a break:</p>
<blockquote>
I'm taking a break from the C API discussions; I'm removing myself from
this PR for now</blockquote>
<p>While the change was approved by 3 core developers, there was not strictly a
consensus since <strong>Mark</strong> did not formally approve the change. Some people asked
to wait until general guidelines for new APIs were decided, <strong>before</strong>
making further C API changes.</p>
<p><strong>Gregory</strong> opened a Steering Council issue on July 2. I asked for an update
on July 17. Three meetings later, they had not had the opportunity to discuss the
question. They were busy discussing the heavy <a class="reference external" href="https://peps.python.org/pep-0703/">PEP 703 – Making the Global
Interpreter Lock Optional in CPython</a>. I
merged my change before the Steering Council spoke up. I proposed to revert
the change if needed. On July 25, <strong>Gregory</strong> replied in the name of the
Steering Council:</p>
<blockquote>
The steering council chatted about non-borrowed-ref and naming conventions
today. We want to <strong>delegate</strong> this to the <strong>C API working group</strong> to come
back with a broader recommendation. <strong>Irit Katriel</strong> has put together the
initial draft of <a class="reference external" href="https://github.com/capi-workgroup/problems/blob/main/capi_problems.rst">An Evaluation of Python's Public C API</a>
for example.</blockquote>
<p>The problem was that the C API Working Group was just a GitHub organization; it
was not an organized group with designated members.</p>
</div>
<div class="section" id="c-api-working-group">
<h2>C API Working Group</h2>
<p>From October 9 to 14, there was a Core Dev Sprint in Brno (Czech Republic). I
gave a talk about the C API status and my C API agenda: <a class="reference external" href="https://github.com/vstinner/talks/blob/main/2023-CoreDevSprint-Brno/c-api.pdf">slides of my C API
talk</a>.
At the end, I called for the creation of a formal C API Working Group to unblock
the situation.</p>
<p>During the sprint, after my talk, <strong>Guido van Rossum</strong> wrote <a class="reference external" href="https://peps.python.org/pep-0731/">PEP 731 – C API
Working Group Charter</a> with 5 members:</p>
<ul class="simple">
<li><strong>Steve Dower</strong></li>
<li><strong>Irit Katriel</strong></li>
<li><strong>Guido van Rossum</strong></li>
<li><strong>Victor Stinner</strong> (me)</li>
<li><strong>Petr Viktorin</strong></li>
</ul>
<p>Once the PEP was published, it was <a class="reference external" href="https://discuss.python.org/t/pep-731-c-api-working-group-charter/36117">discussed on discuss.python.org</a>.
Two weeks later, <strong>Guido</strong> submitted the PEP to the Steering Council: <a class="reference external" href="https://github.com/python/steering-council/issues/210">PEP 731
-- C API Working Group Charter</a>.</p>
<p>The Steering Council has not made a decision yet. Previously, the Steering
Council expressed its desire to delegate some C API decisions to a C API
Working Group.</p>
</div>
My contributions to Python (July 2023)2023-07-08T23:00:00+02:002023-07-08T23:00:00+02:00Victor Stinnertag:vstinner.github.io,2023-07-08:/contrib-python-july-2023.html<p>In 2023, between May 4 and July 8, I made 144 commits in the Python main
branch. In this article, I describe the most important Python contributions
that I made to Python 3.12 and Python 3.13 in these months.</p>
<a class="reference external image-reference" href="https://twitter.com/foxes_in_love/status/1668558475490742277"><img alt="Foxes in Love: Cuddle" src="https://vstinner.github.io/images/foxes_in_love_cuddle.jpg" /></a>
<p><em>Drawing: Foxes in Love: Cuddle</em></p>
<div class="section" id="summary">
<h2>Summary</h2>
<ul class="simple">
<li>Add PyImport_AddModuleRef() and PyWeakref_GetRef().</li>
<li>Py_INCREF() and Py_DECREF() become opaque function calls in the limited C API.</li>
<li>PyList_SET_ITEM() and PyTuple_SET_ITEM() now check index bounds.</li>
<li>Define "Soft Deprecation" in PEP 387; getopt and optparse are soft
deprecated.</li>
<li>Document how to replace imp with importlib.</li>
<li>Remove 19 stdlib modules.</li>
<li>Remove locale.resetlocale() and logging.Logger.warn().</li>
<li>Remove 181 private C API functions.</li>
</ul>
</div>
<div class="section" id="pep-594">
<h2>PEP 594</h2>
<p>In Python 3.13, I removed 19 modules deprecated in Python 3.11 by PEP 594:</p>
<ul class="simple">
<li>aifc</li>
<li>audioop</li>
<li>cgi</li>
<li>cgitb</li>
<li>chunk</li>
<li>crypt</li>
<li>imghdr</li>
<li>mailcap</li>
<li>nis</li>
<li>nntplib</li>
<li>ossaudiodev</li>
<li>pipes</li>
<li>sndhdr</li>
<li>spwd</li>
<li>sunau</li>
<li>telnetlib</li>
<li>uu</li>
<li>xdrlib</li>
</ul>
<p><em>Zachary Ware</em> removed the last deprecated module, msilib, so PEP 594 is
now fully implemented in Python 3.13!</p>
<p>I announced the change: <a class="reference external" href="https://discuss.python.org/t/pep-594-has-been-implemented-python-3-13-removes-20-stdlib-modules/27124">PEP 594 has been implemented: Python 3.13 removes 20
stdlib modules</a>.</p>
<p>Removing imghdr caused me some trouble with building the Python documentation.
The Sphinx version we used relied on imghdr, but recent Sphinx versions no
longer use it. I updated the Sphinx version to work around this issue.</p>
</div>
<div class="section" id="c-api-strong-reference">
<h2>C API: Strong reference</h2>
<p><strong>tl; dr I added PyImport_AddModuleRef() and PyWeakref_GetRef() to Python 3.13
to return strong references, instead of borrowed references.</strong></p>
<p>When I <a class="reference external" href="https://pythoncapi.readthedocs.io/">analyzed issues of the Python C API</a>, I quickly identified that the usage of
borrowed references causes a lot of trouble. By the way, I recently
updated the <a class="reference external" href="https://pythoncapi.readthedocs.io/bad_api.html#functions">list of the 41 functions returning borrowed references</a>. This issue is
also tracked as <a class="reference external" href="https://github.com/capi-workgroup/problems/issues/21">Returning borrowed references is fundamentally unsafe</a> in the recently
created <a class="reference external" href="https://github.com/capi-workgroup/problems/">Problems</a> project of
the new C API workgroup.</p>
<p>In Python 3.10, I added the <tt class="docutils literal">Py_NewRef()</tt> and <tt class="docutils literal">Py_XNewRef()</tt> functions, which
have better semantics: they create a new strong reference to a Python object.
I also added the <tt class="docutils literal">PyModule_AddObjectRef()</tt> function, variant of
<tt class="docutils literal">PyModule_AddObject()</tt>, which returns a strong reference. And I added
<a class="reference external" href="https://docs.python.org/dev/glossary.html#term-borrowed-reference">borrowed reference</a> and
<a class="reference external" href="https://docs.python.org/dev/glossary.html#term-strong-reference">strong reference</a> terms to
the glossary.</p>
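<p>A minimal sketch of the <tt class="docutils literal">Py_NewRef()</tt> idiom (the <tt class="docutils literal">first_item_ref()</tt> helper is a
made-up example, not a CPython function):</p>
<pre class="literal-block">
#include &lt;Python.h&gt;

/* Return the first item of a list as a strong reference, using
 * Py_NewRef() instead of the "Py_INCREF(item); return item;" idiom. */
static PyObject *
first_item_ref(PyObject *list)
{
    PyObject *item = PyList_GetItem(list, 0);  /* borrowed reference */
    if (item == NULL) {
        return NULL;   /* IndexError on an empty list */
    }
    return Py_NewRef(item);   /* create a new strong reference */
}
</pre>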
<p>In Python 3.13, I added two functions:</p>
<ul class="simple">
<li><strong>PyImport_AddModuleRef()</strong>: variant of <tt class="docutils literal">PyImport_AddModule()</tt></li>
<li><strong>PyWeakref_GetRef()</strong>: variant of <tt class="docutils literal">PyWeakref_GetObject()</tt>.
I also deprecated <tt class="docutils literal">PyWeakref_GetObject()</tt> and <tt class="docutils literal">PyWeakref_GET_OBJECT()</tt>
functions.</li>
</ul>
<p>I updated pythoncapi-compat to <a class="reference external" href="https://pythoncapi-compat.readthedocs.io/en/latest/api.html#python-3-13">provide these functions to Python 3.12 and
older</a>.</p>
<p>I also added <tt class="docutils literal">Py_TYPE()</tt> to <tt class="docutils literal">Doc/data/refcounts.dat</tt>, the file listing how C
functions handle references; it's maintained manually.</p>
<p>Now I'm working on adding <strong>PyDict_GetItemRef()</strong> but the API and the function
name are causing more friction: see the <a class="reference external" href="https://github.com/python/cpython/pull/106005">pull request</a>. Recently,
PyDict_GetItemRef() API was raised to the Steering Council:
<a class="reference external" href="https://github.com/python/steering-council/issues/201">decision: Should we add non-borrowed-ref public C APIs, if so, is there a
naming convention?</a></p>
</div>
<div class="section" id="c-api-pylist-set-item">
<h2>C API: PyList_SET_ITEM()</h2>
<p><strong>tl;dr In Python 3.13, PyList_SET_ITEM() and PyTuple_SET_ITEM() now check
index bounds.</strong></p>
<p>In Python 3.9, <tt class="docutils literal">Include/cpython/listobject.h</tt> was created for the PyList API
excluded from the limited C API. <tt class="docutils literal">PyList_SET_ITEM()</tt> was implemented as:</p>
<pre class="literal-block">
#define PyList_SET_ITEM(op, i, v) (_PyList_CAST(op)->ob_item[i] = (v))
</pre>
<p>In Python 3.10, the <a class="reference external" href="https://github.com/python/cpython/issues/74644">return value was removed to fix a bug</a> by adding a <tt class="docutils literal">(void)</tt> cast:
<pre class="literal-block">
#define PyList_SET_ITEM(op, i, v) ((void)(_PyList_CAST(op)->ob_item[i] = (v)))
</pre>
<p>In Python 3.11, <a class="reference external" href="https://peps.python.org/pep-0670/">PEP 670: Convert macros to functions in the Python C API</a> was accepted and I converted the macro to
a static inline function:</p>
<pre class="literal-block">
static inline void
PyList_SET_ITEM(PyObject *op, Py_ssize_t index, PyObject *value) {
    PyListObject *list = _PyList_CAST(op);
    list->ob_item[index] = value;
}
</pre>
<p>I tried to add an assertion in <tt class="docutils literal">PyTuple_SET_ITEM()</tt> to check index bounds,
but I got assertion failures when running the Python test suite, related to
PyStructSequence, which inherits from PyTuple.</p>
<p>Recently, I tried again. I updated the PyStructSequence API to check the index
bounds differently. The tricky part is that getting the number of fields of a
PyStructSequence requires getting an item from a dictionary, and
<tt class="docutils literal">PyDict_GetItemWithError()</tt> can raise an exception. Moreover,
<tt class="docutils literal">PyStructSequence_SET_ITEM()</tt> was still implemented as a macro in Python
3.12:</p>
<pre class="literal-block">
#define PyStructSequence_SET_ITEM(op, i, v) PyTuple_SET_ITEM((op), (i), (v))
</pre>
<p>Old PyStructSequence_SetItem() implementation:</p>
<pre class="literal-block">
void
PyStructSequence_SetItem(PyObject* op, Py_ssize_t i, PyObject* v)
{
    PyStructSequence_SET_ITEM(op, i, v);
}
</pre>
<p>New implementation:</p>
<pre class="literal-block">
void
PyStructSequence_SetItem(PyObject *op, Py_ssize_t index, PyObject *value)
{
    PyTupleObject *tuple = _PyTuple_CAST(op);
    assert(0 <= index);
#ifndef NDEBUG
    Py_ssize_t n_fields = REAL_SIZE(op);
    assert(n_fields >= 0);
    assert(index < n_fields);
#endif
    tuple->ob_item[index] = value;
}
</pre>
<p>The <tt class="docutils literal">REAL_SIZE()</tt> macro is only available in <tt class="docutils literal">Objects/structseq.c</tt>.
Exposing it in the public C API would be a bad idea, so I just converted the
PyStructSequence_SET_ITEM() macro into an alias of PyStructSequence_SetItem():</p>
<pre class="literal-block">
#define PyStructSequence_SET_ITEM PyStructSequence_SetItem
</pre>
<p>This way, PyStructSequence_SET_ITEM() and PyStructSequence_SetItem() are
implemented as opaque function calls.</p>
<p>So it became possible to check index bounds in PyList_SET_ITEM():</p>
<pre class="literal-block">
static inline void
PyList_SET_ITEM(PyObject *op, Py_ssize_t index, PyObject *value) {
    PyListObject *list = _PyList_CAST(op);
    assert(0 <= index);
    assert(index < Py_SIZE(list));
    list->ob_item[index] = value;
}
</pre>
<p>I had to modify code which called PyList_SET_ITEM() <em>before</em> setting the list
size: the list_extend() and _PyList_AppendTakeRef() functions. The size is now
set before PyList_SET_ITEM() is called.</p>
<p>I made a similar change to <tt class="docutils literal">PyTuple_SET_ITEM()</tt> to also check the index.</p>
<p>These bounds checks are implemented with assertions: they are only enabled if
Python is built in debug mode or built with assertions.</p>
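<p>The effect of the new check can be modeled with a self-contained stub (written
for this article, not CPython's real code): with assertions enabled, an
out-of-range index aborts immediately instead of silently corrupting memory:</p>
<pre class="literal-block">
#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;

/* Stub playing the roles of a list object and of PyList_SET_ITEM()
 * with the Python 3.13 bounds check. */
typedef struct {
    ptrdiff_t size;     /* plays the role of Py_SIZE() */
    void *items[8];
} ListStub;

static void
stub_set_item(ListStub *list, ptrdiff_t index, void *value)
{
    assert(0 <= index);
    assert(index < list->size);   /* the new bounds check */
    list->items[index] = value;
}

static void *
stub_get_item(const ListStub *list, ptrdiff_t index)
{
    return list->items[index];
}
</pre>
<p>Calling <tt class="docutils literal">stub_set_item()</tt> with index 2 on a stub of size 2 would fail the
assertion, just like the new <tt class="docutils literal">PyList_SET_ITEM()</tt> check.</p>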
</div>
<div class="section" id="c-api-python-3-12-py-incref">
<h2>C API: Python 3.12 Py_INCREF()</h2>
<p><strong>tl;dr I changed Py_INCREF() and Py_DECREF() to be implemented as opaque
function calls in any version of the limited C API if Python is built in debug
mode.</strong></p>
<p>In Python 3.12, <a class="reference external" href="https://peps.python.org/pep-0683/">PEP 683 – Immortal Objects, Using a Fixed Refcount</a> was implemented. It made the Py_INCREF() and
Py_DECREF() static inline functions even more complicated than before. The
implementation required exposing the private <tt class="docutils literal">_Py_IncRefTotal_DO_NOT_USE_THIS()</tt>
and <tt class="docutils literal">_Py_DecRefTotal_DO_NOT_USE_THIS()</tt> functions in the stable ABI for debug
builds of Python, even though the function names say "DO NOT USE THIS".</p>
<p>In Python 3.10, I modified Py_INCREF() and Py_DECREF() to implement them as
opaque function calls in the limited C API version 3.10 or newer if Python is
built in debug mode (if <tt class="docutils literal">Py_REF_DEBUG</tt> macro is defined). Thanks to this
change, the limited C API is supported if Python is built in debug mode since
Python 3.10.</p>
<p>In Python 3.12, I <strong>modified Py_INCREF() and Py_DECREF() to implement them as
opaque function calls in all limited C API versions</strong>, not only in the limited C
API version 3.10 and newer, if Python is built in debug mode. This way,
implementation details are now hidden and no longer leaked in the stable ABI. I
removed <tt class="docutils literal">_Py_NegativeRefcount()</tt> in the limited C API and I removed
<tt class="docutils literal">_Py_IncRefTotal_DO_NOT_USE_THIS()</tt> and <tt class="docutils literal">_Py_DecRefTotal_DO_NOT_USE_THIS()</tt>
in the stable ABI.</p>
<p>Later, I discovered that my fix broke backward compatibility with Python 3.9.
My implementation used <tt class="docutils literal">_Py_IncRef()</tt> and <tt class="docutils literal">_Py_DecRef()</tt> that I added to
Python 3.10. I updated the implementation to use <tt class="docutils literal">Py_IncRef()</tt> and
<tt class="docutils literal">Py_DecRef()</tt> on Python 3.9 and older; these functions have been available since
Python 2.4.</p>
</div>
<div class="section" id="c-api-py-incref-opaque-function-call">
<h2>C API: Py_INCREF() opaque function call</h2>
<p><strong>tl;dr I changed Py_INCREF() and Py_DECREF() to be implemented as opaque
function calls in the limited C API version 3.12.</strong> (also in the regular
release build, not only in the debug build)</p>
<p>In Python 3.8, I converted Py_INCREF() and Py_DECREF() macros to static inline
functions. I already wanted to convert them to opaque function calls, but that
can have a significant performance cost, so I left them as static inline
functions.</p>
<p>As a follow-up of my Python 3.12 Py_INCREF() fix for the debug build, I
modified Py_INCREF() and Py_DECREF() in Python 3.12 to always implement them
as <strong>opaque function calls in the limited C API version 3.12</strong> and newer.</p>
<ul class="simple">
<li>Discussion: <a class="reference external" href="https://discuss.python.org/t/limited-c-api-implement-py-incref-and-py-decref-as-function-calls/27592">Limited C API: implement Py_INCREF() and Py_DECREF() as function calls</a></li>
<li><a class="reference external" href="https://github.com/python/cpython/pull/105388">Pull request</a></li>
</ul>
<p>For me, it's a <strong>major enhancement</strong> to make the stable ABI more <strong>future
proof</strong> by leaking fewer implementation details.</p>
<p><a class="reference external" href="https://github.com/python/cpython/blob/da98ed0aa040791ef08b24befab697038c8c9fd5/Include/object.h#L613-L622">Code</a>:</p>
<pre class="literal-block">
static inline Py_ALWAYS_INLINE void Py_INCREF(PyObject *op)
{
#if defined(Py_LIMITED_API) && (Py_LIMITED_API+0 >= 0x030c0000 || defined(Py_REF_DEBUG))
    // Stable ABI implements Py_INCREF() as a function call on limited C API
    // version 3.12 and newer, and on Python built in debug mode. _Py_IncRef()
    // was added to Python 3.10.0a7, use Py_IncRef() on older Python versions.
    // Py_IncRef() accepts NULL whereas _Py_IncRef() doesn't.
# if Py_LIMITED_API+0 >= 0x030a00A7
    _Py_IncRef(op);
# else
    Py_IncRef(op);
# endif
#else
    ...
#endif
}
</pre>
</div>
<div class="section" id="tests">
<h2>Tests</h2>
<p>The Python test runner <em>regrtest</em> has specific constraints because tests
are run in subprocesses, on different platforms, with custom encodings
and options. Over the last year, an annoying regrtest bug came and went: if
a subprocess standard output (stdout) could not be decoded, the test was treated
as a success! I fixed <a class="reference external" href="https://github.com/python/cpython/issues/101634">the bug</a> and made the code more
reliable by treating this class of bug as a test failure.</p>
<p>I fixed test_counter_optimizer() of test_capi when run twice: it now creates a
new function at each call, so each run starts in a known state. Previously, the
second run started in a different state since the function was already optimized.</p>
<p>I cleaned up the old test_ctypes. My main goal was to remove <tt class="docutils literal">from ctypes import
*</tt> so that pyflakes can be used on these tests. I found many skipped tests: I
re-enabled 3 of them and removed the other ones. I also removed dead code.</p>
<p>I removed test_xmlrpc_net: it was skipped since 2017. The public
<tt class="docutils literal">buildbot.python.org</tt> server has no XML-RPC interface anymore, and no
replacement public XML-RPC server was found in 6 years.</p>
<p>I fixed dangling threads in <tt class="docutils literal">test_importlib.test_side_effect_import()</tt>: the
import spawns threads, so the test now waits until they complete.</p>
</div>
<div class="section" id="c-api-deprecate">
<h2>C API: Deprecate</h2>
<p>I listed <a class="reference external" href="https://docs.python.org/dev/whatsnew/3.13.html#pending-removal-in-python-3-14">pending C API removals</a>
in the What's New in Python 3.13 document.</p>
<p>I deprecated multiple APIs:</p>
<ul class="simple">
<li>Py_UNICODE and PY_UNICODE_TYPE</li>
<li>PyImport_ImportModuleNoBlock()</li>
<li>Py_HasFileSystemDefaultEncoding</li>
</ul>
<p>I deprecated legacy Python initialization functions:</p>
<ul class="simple">
<li>PySys_ResetWarnOptions()</li>
<li>Py_GetExecPrefix()</li>
<li>Py_GetPath()</li>
<li>Py_GetPrefix()</li>
<li>Py_GetProgramFullPath()</li>
<li>Py_GetProgramName()</li>
<li>Py_GetPythonHome()</li>
</ul>
<p>I removed the PyArg_Parse() deprecation. The deprecation was added in 2007 as
a comment in the documentation, but the function remains relevant in Python
3.13 for some specific use cases.</p>
</div>
<div class="section" id="soft-deprecation">
<h2>Soft Deprecation</h2>
<p><strong>tl; dr The getopt module is now soft deprecated.</strong></p>
<p>I updated <a class="reference external" href="https://peps.python.org/pep-0387/">PEP 387: Backwards Compatibility Policy</a> to add <a class="reference external" href="https://peps.python.org/pep-0387/#soft-deprecation">Soft Deprecation</a>:</p>
<blockquote>
<p>A soft deprecation can be used when using an API which should no longer be
used to write new code, but it remains safe to continue using it in
existing code. The API remains documented and tested, but will not be
developed further (no enhancement).</p>
<p>The main difference between a “soft” and a (regular) “hard” deprecation is
that the soft deprecation does not imply scheduling the removal of the
deprecated API.</p>
</blockquote>
<p>I converted the <strong>optparse</strong> deprecation to a <strong>soft deprecation</strong>.</p>
<p>I soft deprecated the <strong>getopt</strong> module: it remains available and maintained,
but argparse should be preferred for new projects.</p>
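<p>For new code, a typical <tt class="docutils literal">getopt.getopt()</tt> option loop maps naturally to
argparse. A minimal sketch (the option names are made up for the example):</p>

```python
import argparse

# Roughly equivalent to getopt.getopt(argv, "vo:", ["verbose", "output="])
# followed by a manual loop over the parsed (option, value) pairs.
parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", action="store_true")
parser.add_argument("-o", "--output", default="out.txt")

args = parser.parse_args(["-v", "-o", "result.txt"])
assert args.verbose is True
assert args.output == "result.txt"
```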
</div>
<div class="section" id="deprecate">
<h2>Deprecate</h2>
<p>I deprecated the <tt class="docutils literal">getmark()</tt>, <tt class="docutils literal">setmark()</tt> and <tt class="docutils literal">getmarkers()</tt> methods of
the Wave_read and Wave_write classes. These methods only existed for
compatibility with the aifc module, but they did nothing or always failed, and
the aifc module was removed in Python 3.13.</p>
<p>I also deprecated <tt class="docutils literal">SetPointerType()</tt> and <tt class="docutils literal">ARRAY()</tt> functions of ctypes.</p>
</div>
<div class="section" id="c-api-remove">
<h2>C API: Remove</h2>
<ul class="simple">
<li>I removed the following old functions for configuring the Python initialization,
which I deprecated in Python 3.11:<ul>
<li>PySys_AddWarnOptionUnicode()</li>
<li>PySys_AddWarnOption()</li>
<li>PySys_AddXOption()</li>
<li>PySys_HasWarnOptions()</li>
<li>PySys_SetArgvEx()</li>
<li>PySys_SetArgv()</li>
<li>PySys_SetPath()</li>
<li>Py_SetPath()</li>
<li>Py_SetProgramName()</li>
<li>Py_SetPythonHome()</li>
<li>Py_SetStandardStreamEncoding()</li>
<li>_Py_SetProgramFullPath()</li>
</ul>
</li>
<li>I also removed deprecated "call" functions:<ul>
<li>PyCFunction_Call()</li>
<li>PyEval_CallFunction()</li>
<li>PyEval_CallMethod()</li>
<li>PyEval_CallObject()</li>
<li>PyEval_CallObjectWithKeywords()</li>
</ul>
</li>
<li>I removed deprecated PyEval_AcquireLock() and PyEval_InitThreads() functions.</li>
<li>Remove old aliases which were kept for backwards compatibility with Python 3.8:<ul>
<li>_PyObject_CallMethodNoArgs()</li>
<li>_PyObject_CallMethodOneArg()</li>
<li>_PyObject_CallOneArg()</li>
<li>_PyObject_FastCallDict()</li>
<li>_PyObject_Vectorcall()</li>
<li>_PyObject_VectorcallMethod()</li>
<li>_PyVectorcall_Function()</li>
</ul>
</li>
</ul>
</div>
<div class="section" id="remove">
<h2>Remove</h2>
<p>I removed the <strong>locale.resetlocale()</strong> function, but I failed to remove
locale.getdefaultlocale() in Python 3.13: INADA-san asked me to keep it.</p>
<p>I removed the untested and undocumented <strong>logging.Logger.warn()</strong> method.</p>
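<p>Code still calling the removed <tt class="docutils literal">warn()</tt> method only needs to switch to
<tt class="docutils literal">warning()</tt>; a quick check:</p>

```python
import io
import logging

# Logger.warn() was an undocumented alias of Logger.warning(): replacing
# warn() with warning() is the whole migration.
stream = io.StringIO()
logger = logging.getLogger("demo")
logger.addHandler(logging.StreamHandler(stream))

logger.warning("disk almost full")
assert "disk almost full" in stream.getvalue()
```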
<p>Oh, and I had forgotten to remove the <strong>cafile</strong>, <strong>capath</strong> and <strong>cadefault</strong>
parameters of the <strong>urllib.request.urlopen()</strong> function: this is now also done in
Python 3.13. I removed similar parameters from many other modules in Python 3.12.</p>
</div>
<div class="section" id="cleanup">
<h2>Cleanup</h2>
<p>As usual, I removed a bunch of unused imports (in the stdlib, tests and tools).</p>
<p>I reimplemented the xmlrpc.client <tt class="docutils literal">_iso8601_format()</tt> function with
<tt class="docutils literal">datetime.datetime.isoformat()</tt>. The timezone is ignored on purpose: the
XML-RPC specification doesn't explain how to handle it, and many implementations
ignore it.</p>
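<p>The idea can be sketched as follows; this is an illustration of the approach,
not necessarily the exact stdlib code (XML-RPC formats dates as
<tt class="docutils literal">YYYYMMDDTHH:MM:SS</tt>, without dashes):</p>

```python
from datetime import datetime, timezone

def iso8601_format(value):
    # XML-RPC doesn't define how to handle timezones: drop tzinfo on purpose.
    if value.tzinfo is not None:
        value = value.replace(tzinfo=None)
    # XML-RPC dates use no dashes in the date part: "19980717T14:08:55".
    return value.isoformat(timespec="seconds").replace("-", "")

dt = datetime(1998, 7, 17, 14, 8, 55, tzinfo=timezone.utc)
assert iso8601_format(dt) == "19980717T14:08:55"
```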
</div>
<div class="section" id="port-imp-code-to-importlib">
<h2>Port imp code to importlib</h2>
<p>The importlib module was added in Python 3.1 and became the default
in Python 3.3. The imp module was deprecated in Python 3.4 but was only removed
in Python 3.12. Replacing imp code with importlib is not trivial: importlib
has a different design and API.</p>
<p>I wrote documentation on how to port imp code to importlib in <a class="reference external" href="https://docs.python.org/dev/whatsnew/3.12.html#removed">What's New in
Python 3.12</a>.</p>
<p>I proposed <a class="reference external" href="https://github.com/python/cpython/pull/105755">adding importlib.util.load_source_path() function</a>, but I understood that the
devil is in the details: it's hard to decide how to handle the <tt class="docutils literal">sys.modules</tt>
cache. I gave up and instead added a recipe to the What's New in Python 3.12
documentation:</p>
<pre class="literal-block">
import importlib.util
import importlib.machinery

def load_source(modname, filename):
    loader = importlib.machinery.SourceFileLoader(modname, filename)
    spec = importlib.util.spec_from_file_location(modname, filename, loader=loader)
    module = importlib.util.module_from_spec(spec)
    # The module is always executed and not cached in sys.modules.
    # Uncomment the following line to cache the module.
    # sys.modules[module.__name__] = module
    loader.exec_module(module)
    return module
</pre>
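<p>The recipe can be exercised with a module written to a temporary file:</p>

```python
import importlib.machinery
import importlib.util
import os
import sys
import tempfile

def load_source(modname, filename):
    loader = importlib.machinery.SourceFileLoader(modname, filename)
    spec = importlib.util.spec_from_file_location(modname, filename, loader=loader)
    module = importlib.util.module_from_spec(spec)
    loader.exec_module(module)
    return module

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "plugin.py")
    with open(path, "w") as f:
        f.write("ANSWER = 42\n")

    module = load_source("plugin", path)
    assert module.ANSWER == 42
    # Unlike the old imp.load_source(), the module is not cached.
    assert "plugin" not in sys.modules
```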
<p>There are many projects affected by the imp removal and porting them is not
easy. See <a class="reference external" href="https://discuss.python.org/t/how-do-i-migrate-from-imp/27885">How do I migrate from imp?</a> discussion.</p>
</div>
<div class="section" id="c-api-remove-private-functions">
<h2>C API: Remove private functions</h2>
<p>Last but not least, in <a class="reference external" href="https://github.com/python/cpython/issues/106320">issue #106320</a>, I <strong>removed</strong> no less
than <strong>181 private C API functions</strong>.</p>
<p>As a reaction to my changes, a discussion was started to propose <a class="reference external" href="https://discuss.python.org/t/pssst-lets-treat-all-api-in-public-headers-as-public/28916">treating
private functions as public functions</a>.</p>
<p>I'm now working on identifying projects affected by these removals and on
proposing solutions for the most commonly used removed functions like the
<tt class="docutils literal">_PyObject_Vectorcall()</tt> alias.</p>
<p>The list of the 181 removed private C API functions:</p>
<ul class="simple">
<li><tt class="docutils literal">_PyArg_NoKwnames()</tt></li>
<li><tt class="docutils literal">_PyBytesWriter_Alloc()</tt></li>
<li><tt class="docutils literal">_PyBytesWriter_Dealloc()</tt></li>
<li><tt class="docutils literal">_PyBytesWriter_Finish()</tt></li>
<li><tt class="docutils literal">_PyBytesWriter_Init()</tt></li>
<li><tt class="docutils literal">_PyBytesWriter_Prepare()</tt></li>
<li><tt class="docutils literal">_PyBytesWriter_Resize()</tt></li>
<li><tt class="docutils literal">_PyBytesWriter_WriteBytes()</tt></li>
<li><tt class="docutils literal">_PyCodecInfo_GetIncrementalDecoder()</tt></li>
<li><tt class="docutils literal">_PyCodecInfo_GetIncrementalEncoder()</tt></li>
<li><tt class="docutils literal">_PyCodec_DecodeText()</tt></li>
<li><tt class="docutils literal">_PyCodec_EncodeText()</tt></li>
<li><tt class="docutils literal">_PyCodec_Forget()</tt></li>
<li><tt class="docutils literal">_PyCodec_Lookup()</tt></li>
<li><tt class="docutils literal">_PyCodec_LookupTextEncoding()</tt></li>
<li><tt class="docutils literal">_PyComplex_FormatAdvancedWriter()</tt></li>
<li><tt class="docutils literal">_PyDeadline_Get()</tt></li>
<li><tt class="docutils literal">_PyDeadline_Init()</tt></li>
<li><tt class="docutils literal">_PyErr_CheckSignals()</tt></li>
<li><tt class="docutils literal">_PyErr_FormatFromCause()</tt></li>
<li><tt class="docutils literal">_PyErr_GetExcInfo()</tt></li>
<li><tt class="docutils literal">_PyErr_GetHandledException()</tt></li>
<li><tt class="docutils literal">_PyErr_GetTopmostException()</tt></li>
<li><tt class="docutils literal">_PyErr_ProgramDecodedTextObject()</tt></li>
<li><tt class="docutils literal">_PyErr_SetHandledException()</tt></li>
<li><tt class="docutils literal">_PyException_AddNote()</tt></li>
<li><tt class="docutils literal">_PyImport_AcquireLock()</tt></li>
<li><tt class="docutils literal">_PyImport_FixupBuiltin()</tt></li>
<li><tt class="docutils literal">_PyImport_FixupExtensionObject()</tt></li>
<li><tt class="docutils literal">_PyImport_GetModuleAttr()</tt></li>
<li><tt class="docutils literal">_PyImport_GetModuleAttrString()</tt></li>
<li><tt class="docutils literal">_PyImport_GetModuleId()</tt></li>
<li><tt class="docutils literal">_PyImport_IsInitialized()</tt></li>
<li><tt class="docutils literal">_PyImport_ReleaseLock()</tt></li>
<li><tt class="docutils literal">_PyImport_SetModule()</tt></li>
<li><tt class="docutils literal">_PyImport_SetModuleString()</tt></li>
<li><tt class="docutils literal">_PyInterpreterState_Get()</tt></li>
<li><tt class="docutils literal">_PyInterpreterState_GetConfig()</tt></li>
<li><tt class="docutils literal">_PyInterpreterState_GetConfigCopy()</tt></li>
<li><tt class="docutils literal">_PyInterpreterState_GetMainModule()</tt></li>
<li><tt class="docutils literal">_PyInterpreterState_HasFeature()</tt></li>
<li><tt class="docutils literal">_PyInterpreterState_SetConfig()</tt></li>
<li><tt class="docutils literal">_PyLong_AsTime_t()</tt></li>
<li><tt class="docutils literal">_PyLong_FromTime_t()</tt></li>
<li><tt class="docutils literal">_PyModule_CreateInitialized()</tt></li>
<li><tt class="docutils literal">_PyOS_URandom()</tt></li>
<li><tt class="docutils literal">_PyOS_URandomNonblock()</tt></li>
<li><tt class="docutils literal">_PyObject_CallMethod()</tt></li>
<li><tt class="docutils literal">_PyObject_CallMethodId()</tt></li>
<li><tt class="docutils literal">_PyObject_CallMethodIdNoArgs()</tt></li>
<li><tt class="docutils literal">_PyObject_CallMethodIdObjArgs()</tt></li>
<li><tt class="docutils literal">_PyObject_CallMethodIdOneArg()</tt></li>
<li><tt class="docutils literal">_PyObject_CallMethodNoArgs()</tt></li>
<li><tt class="docutils literal">_PyObject_CallMethodOneArg()</tt></li>
<li><tt class="docutils literal">_PyObject_CallOneArg()</tt></li>
<li><tt class="docutils literal">_PyObject_FastCallDict()</tt></li>
<li><tt class="docutils literal">_PyObject_HasLen()</tt></li>
<li><tt class="docutils literal">_PyObject_MakeTpCall()</tt></li>
<li><tt class="docutils literal">_PyObject_RealIsInstance()</tt></li>
<li><tt class="docutils literal">_PyObject_RealIsSubclass()</tt></li>
<li><tt class="docutils literal">_PyObject_Vectorcall()</tt></li>
<li><tt class="docutils literal">_PyObject_VectorcallMethod()</tt></li>
<li><tt class="docutils literal">_PyObject_VectorcallMethodId()</tt></li>
<li><tt class="docutils literal">_PySequence_BytesToCharpArray()</tt></li>
<li><tt class="docutils literal">_PySequence_IterSearch()</tt></li>
<li><tt class="docutils literal">_PyStack_AsDict()</tt></li>
<li><tt class="docutils literal">_PyThreadState_GetDict()</tt></li>
<li><tt class="docutils literal">_PyThreadState_Prealloc()</tt></li>
<li><tt class="docutils literal">_PyThread_CurrentExceptions()</tt></li>
<li><tt class="docutils literal">_PyThread_CurrentFrames()</tt></li>
<li><tt class="docutils literal">_PyTime_Add()</tt></li>
<li><tt class="docutils literal">_PyTime_As100Nanoseconds()</tt></li>
<li><tt class="docutils literal">_PyTime_AsMicroseconds()</tt></li>
<li><tt class="docutils literal">_PyTime_AsMilliseconds()</tt></li>
<li><tt class="docutils literal">_PyTime_AsNanoseconds()</tt></li>
<li><tt class="docutils literal">_PyTime_AsNanosecondsObject()</tt></li>
<li><tt class="docutils literal">_PyTime_AsSecondsDouble()</tt></li>
<li><tt class="docutils literal">_PyTime_AsTimespec()</tt></li>
<li><tt class="docutils literal">_PyTime_AsTimespec_clamp()</tt></li>
<li><tt class="docutils literal">_PyTime_AsTimeval()</tt></li>
<li><tt class="docutils literal">_PyTime_AsTimevalTime_t()</tt></li>
<li><tt class="docutils literal">_PyTime_AsTimeval_clamp()</tt></li>
<li><tt class="docutils literal">_PyTime_FromMicrosecondsClamp()</tt></li>
<li><tt class="docutils literal">_PyTime_FromMillisecondsObject()</tt></li>
<li><tt class="docutils literal">_PyTime_FromNanoseconds()</tt></li>
<li><tt class="docutils literal">_PyTime_FromNanosecondsObject()</tt></li>
<li><tt class="docutils literal">_PyTime_FromSeconds()</tt></li>
<li><tt class="docutils literal">_PyTime_FromSecondsObject()</tt></li>
<li><tt class="docutils literal">_PyTime_FromTimespec()</tt></li>
<li><tt class="docutils literal">_PyTime_FromTimeval()</tt></li>
<li><tt class="docutils literal">_PyTime_GetMonotonicClock()</tt></li>
<li><tt class="docutils literal">_PyTime_GetMonotonicClockWithInfo()</tt></li>
<li><tt class="docutils literal">_PyTime_GetPerfCounter()</tt></li>
<li><tt class="docutils literal">_PyTime_GetPerfCounterWithInfo()</tt></li>
<li><tt class="docutils literal">_PyTime_GetSystemClock()</tt></li>
<li><tt class="docutils literal">_PyTime_GetSystemClockWithInfo()</tt></li>
<li><tt class="docutils literal">_PyTime_MulDiv()</tt></li>
<li><tt class="docutils literal">_PyTime_ObjectToTime_t()</tt></li>
<li><tt class="docutils literal">_PyTime_ObjectToTimespec()</tt></li>
<li><tt class="docutils literal">_PyTime_ObjectToTimeval()</tt></li>
<li><tt class="docutils literal">_PyTime_gmtime()</tt></li>
<li><tt class="docutils literal">_PyTime_localtime()</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_ClearTraces()</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_GetMemory()</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_GetObjectTraceback()</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_GetTraceback()</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_GetTracebackLimit()</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_GetTracedMemory()</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_GetTraces()</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_Init()</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_IsTracing()</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_ResetPeak()</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_Start()</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_Stop()</tt></li>
<li><tt class="docutils literal">_PyUnicodeTranslateError_Create()</tt></li>
<li><tt class="docutils literal">_PyUnicodeWriter_Dealloc()</tt></li>
<li><tt class="docutils literal">_PyUnicodeWriter_Finish()</tt></li>
<li><tt class="docutils literal">_PyUnicodeWriter_Init()</tt></li>
<li><tt class="docutils literal">_PyUnicodeWriter_PrepareInternal()</tt></li>
<li><tt class="docutils literal">_PyUnicodeWriter_PrepareKindInternal()</tt></li>
<li><tt class="docutils literal">_PyUnicodeWriter_WriteASCIIString()</tt></li>
<li><tt class="docutils literal">_PyUnicodeWriter_WriteChar()</tt></li>
<li><tt class="docutils literal">_PyUnicodeWriter_WriteLatin1String()</tt></li>
<li><tt class="docutils literal">_PyUnicodeWriter_WriteStr()</tt></li>
<li><tt class="docutils literal">_PyUnicodeWriter_WriteSubstring()</tt></li>
<li><tt class="docutils literal">_PyUnicode_AsASCIIString()</tt></li>
<li><tt class="docutils literal">_PyUnicode_AsLatin1String()</tt></li>
<li><tt class="docutils literal">_PyUnicode_AsUTF8String()</tt></li>
<li><tt class="docutils literal">_PyUnicode_CheckConsistency()</tt></li>
<li><tt class="docutils literal">_PyUnicode_Copy()</tt></li>
<li><tt class="docutils literal">_PyUnicode_DecodeRawUnicodeEscapeStateful()</tt></li>
<li><tt class="docutils literal">_PyUnicode_DecodeUnicodeEscapeInternal()</tt></li>
<li><tt class="docutils literal">_PyUnicode_DecodeUnicodeEscapeStateful()</tt></li>
<li><tt class="docutils literal">_PyUnicode_EQ()</tt></li>
<li><tt class="docutils literal">_PyUnicode_EncodeCharmap()</tt></li>
<li><tt class="docutils literal">_PyUnicode_EncodeUTF16()</tt></li>
<li><tt class="docutils literal">_PyUnicode_EncodeUTF32()</tt></li>
<li><tt class="docutils literal">_PyUnicode_EncodeUTF7()</tt></li>
<li><tt class="docutils literal">_PyUnicode_Equal()</tt></li>
<li><tt class="docutils literal">_PyUnicode_EqualToASCIIId()</tt></li>
<li><tt class="docutils literal">_PyUnicode_EqualToASCIIString()</tt></li>
<li><tt class="docutils literal">_PyUnicode_FastCopyCharacters()</tt></li>
<li><tt class="docutils literal">_PyUnicode_FastFill()</tt></li>
<li><tt class="docutils literal">_PyUnicode_FindMaxChar()</tt></li>
<li><tt class="docutils literal">_PyUnicode_FormatAdvancedWriter()</tt></li>
<li><tt class="docutils literal">_PyUnicode_FormatLong()</tt></li>
<li><tt class="docutils literal">_PyUnicode_FromASCII()</tt></li>
<li><tt class="docutils literal">_PyUnicode_FromId()</tt></li>
<li><tt class="docutils literal">_PyUnicode_InsertThousandsGrouping()</tt></li>
<li><tt class="docutils literal">_PyUnicode_JoinArray()</tt></li>
<li><tt class="docutils literal">_PyUnicode_ScanIdentifier()</tt></li>
<li><tt class="docutils literal">_PyUnicode_TransformDecimalAndSpaceToASCII()</tt></li>
<li><tt class="docutils literal">_PyUnicode_WideCharString_Converter()</tt></li>
<li><tt class="docutils literal">_PyUnicode_WideCharString_Opt_Converter()</tt></li>
<li><tt class="docutils literal">_PyUnicode_XStrip()</tt></li>
<li><tt class="docutils literal">_PyVectorcall_Function()</tt></li>
<li><tt class="docutils literal">_Py_AtExit()</tt></li>
<li><tt class="docutils literal">_Py_CheckFunctionResult()</tt></li>
<li><tt class="docutils literal">_Py_CoerceLegacyLocale()</tt></li>
<li><tt class="docutils literal">_Py_FatalErrorFormat()</tt></li>
<li><tt class="docutils literal">_Py_FdIsInteractive()</tt></li>
<li><tt class="docutils literal">_Py_FreeCharPArray()</tt></li>
<li><tt class="docutils literal">_Py_GetConfig()</tt></li>
<li><tt class="docutils literal">_Py_IsCoreInitialized()</tt></li>
<li><tt class="docutils literal">_Py_IsFinalizing()</tt></li>
<li><tt class="docutils literal">_Py_IsInterpreterFinalizing()</tt></li>
<li><tt class="docutils literal">_Py_LegacyLocaleDetected()</tt></li>
<li><tt class="docutils literal">_Py_RestoreSignals()</tt></li>
<li><tt class="docutils literal">_Py_SetLocaleFromEnv()</tt></li>
<li><tt class="docutils literal">_Py_VaBuildStack()</tt></li>
<li><tt class="docutils literal">_Py_add_one_to_index_C()</tt></li>
<li><tt class="docutils literal">_Py_add_one_to_index_F()</tt></li>
<li><tt class="docutils literal">_Py_c_abs()</tt></li>
<li><tt class="docutils literal">_Py_c_diff()</tt></li>
<li><tt class="docutils literal">_Py_c_neg()</tt></li>
<li><tt class="docutils literal">_Py_c_pow()</tt></li>
<li><tt class="docutils literal">_Py_c_prod()</tt></li>
<li><tt class="docutils literal">_Py_c_quot()</tt></li>
<li><tt class="docutils literal">_Py_c_sum()</tt></li>
<li><tt class="docutils literal">_Py_gitidentifier()</tt></li>
<li><tt class="docutils literal">_Py_gitversion()</tt></li>
</ul>
</div>
Convert macros to functions in the Python C API2022-12-12T23:00:00+01:002022-12-12T23:00:00+01:00Victor Stinnertag:vstinner.github.io,2022-12-12:/c-api-convert-macros-functions.html<a class="reference external image-reference" href="https://www.exemplaire-editions.fr/librairie/livre/loeil-du-cyclone"><img alt="L'oeil du cyclone - Théo Grosjean" src="https://vstinner.github.io/images/loeil_cyclone.jpg" /></a>
<p><em>Drawing: "L'oeil du cyclone" by Théo Grosjean.</em></p>
<div class="section" id="convert-macros-to-functions">
<h2>Convert macros to functions</h2>
<p>For 4 years, between Python 3.7 (2018) and Python 3.12 (2022), I made many
changes to macros in the Python C API to make the API less error-prone (avoiding
<a class="reference external" href="https://gcc.gnu.org/onlinedocs/cpp/Macro-Pitfalls.html">macro pitfalls</a>) and
to better define the API (parameter types and return types, variable scope, etc.).
<a class="reference external" href="https://peps.python.org/pep-0670/">PEP 670</a> "Convert macros to functions in
the Python C API" describes the rationale of these changes at length.</p>
<p>I moved private functions to the internal C API to reduce the C API size.</p>
<p>Some changes are also related to preparing the API to make members of
structures like <tt class="docutils literal">PyObject</tt> or <tt class="docutils literal">PyTypeObject</tt> private.</p>
<p>Converting macros and static inline functions to regular functions hides
implementation details and moves the API towards the limited C API and the
stable ABI (build a C extension once, use the binary on multiple Python
versions). Regular functions are also usable from programming languages and use
cases which cannot use C macros or C static inline functions.</p>
<p>Most macros are converted to static inline functions, rather than regular
functions, to have no impact on performance.</p>
<p>This work was made incrementally in 5 Python versions (3.8, 3.9, 3.10, 3.11 and
3.12) to limit the number of impacted projects at each Python release.</p>
<p>Changing the <tt class="docutils literal">Py_TYPE()</tt> and <tt class="docutils literal">Py_SIZE()</tt> macros impacted the most
projects. The change landed in Python 3.11: during the Python 3.10 development
cycle, it had to be reverted since it impacted too many projects.</p>
<p>Note: I didn't modify all macros and functions listed in this article; it's a
collaborative work as usual.</p>
</div>
<div class="section" id="statistics">
<h2>Statistics</h2>
<p><a class="reference external" href="https://pythoncapi.readthedocs.io/stats.html">Statistics on public functions</a>:</p>
<ul class="simple">
<li>Python 3.7: 893 regular functions, 315 macros.</li>
<li>Python 3.12: 943 regular functions, 246 macros, 69 static inline functions.</li>
</ul>
<p>Cumulative changes on macros between Python 3.7 and Python 3.12 on public,
private and internal APIs:</p>
<ul class="simple">
<li>Converted 88 macros to static inline functions</li>
<li>Converted 11 macros to regular functions</li>
<li>Converted 3 static inline functions to regular functions</li>
<li>Removed 47 macros</li>
</ul>
<p>See <a class="reference external" href="https://pythoncapi.readthedocs.io/stats.html">Statistics on the Python C API</a> for more numbers.</p>
</div>
<div class="section" id="python-3-12">
<h2>Python 3.12</h2>
<p>Convert 39 macros to static inline functions:</p>
<ul class="simple">
<li><tt class="docutils literal">PyCell_GET()</tt></li>
<li><tt class="docutils literal">PyCell_SET()</tt></li>
<li><tt class="docutils literal">PyCode_GetNumFree()</tt></li>
<li><tt class="docutils literal">PyDict_GET_SIZE()</tt></li>
<li><tt class="docutils literal">PyFloat_AS_DOUBLE()</tt></li>
<li><tt class="docutils literal">PyFunction_GET_ANNOTATIONS()</tt></li>
<li><tt class="docutils literal">PyFunction_GET_CLOSURE()</tt></li>
<li><tt class="docutils literal">PyFunction_GET_CODE()</tt></li>
<li><tt class="docutils literal">PyFunction_GET_DEFAULTS()</tt></li>
<li><tt class="docutils literal">PyFunction_GET_GLOBALS()</tt></li>
<li><tt class="docutils literal">PyFunction_GET_KW_DEFAULTS()</tt></li>
<li><tt class="docutils literal">PyFunction_GET_MODULE()</tt></li>
<li><tt class="docutils literal">PyInstanceMethod_GET_FUNCTION()</tt></li>
<li><tt class="docutils literal">PyMemoryView_GET_BASE()</tt></li>
<li><tt class="docutils literal">PyMemoryView_GET_BUFFER()</tt></li>
<li><tt class="docutils literal">PyMethod_GET_FUNCTION()</tt></li>
<li><tt class="docutils literal">PyMethod_GET_SELF()</tt></li>
<li><tt class="docutils literal">PySet_GET_SIZE()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_HIGH_SURROGATE()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_ISALNUM()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_ISSPACE()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_IS_HIGH_SURROGATE()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_IS_LOW_SURROGATE()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_IS_SURROGATE()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_JOIN_SURROGATES()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_LOW_SURROGATE()</tt></li>
<li><tt class="docutils literal">_PyGCHead_FINALIZED()</tt></li>
<li><tt class="docutils literal">_PyGCHead_NEXT()</tt></li>
<li><tt class="docutils literal">_PyGCHead_PREV()</tt></li>
<li><tt class="docutils literal">_PyGCHead_SET_FINALIZED()</tt></li>
<li><tt class="docutils literal">_PyGCHead_SET_NEXT()</tt></li>
<li><tt class="docutils literal">_PyGCHead_SET_PREV()</tt></li>
<li><tt class="docutils literal">_PyGC_FINALIZED()</tt></li>
<li><tt class="docutils literal">_PyGC_SET_FINALIZED()</tt></li>
<li><tt class="docutils literal">_PyObject_GC_IS_TRACKED()</tt></li>
<li><tt class="docutils literal">_PyObject_GC_MAY_BE_TRACKED()</tt></li>
<li><tt class="docutils literal">_PyObject_SIZE()</tt></li>
<li><tt class="docutils literal">_PyObject_VAR_SIZE()</tt></li>
<li><tt class="docutils literal">_Py_AS_GC()</tt></li>
</ul>
<p>Remove 5 macros:</p>
<ul class="simple">
<li><tt class="docutils literal">PyUnicode_AS_DATA()</tt></li>
<li><tt class="docutils literal">PyUnicode_AS_UNICODE()</tt></li>
<li><tt class="docutils literal">PyUnicode_GET_DATA_SIZE()</tt></li>
<li><tt class="docutils literal">PyUnicode_GET_SIZE()</tt></li>
<li><tt class="docutils literal">PyUnicode_WSTR_LENGTH()</tt></li>
</ul>
<p>The following 4 macros can be used as l-values in Python 3.12:</p>
<ul class="simple">
<li><tt class="docutils literal">PyList_GET_ITEM()</tt></li>
<li><tt class="docutils literal">PyTuple_GET_ITEM()</tt></li>
<li><tt class="docutils literal">PyDescr_NAME()</tt></li>
<li><tt class="docutils literal">PyDescr_TYPE()</tt></li>
</ul>
<p>Code patterns like <tt class="docutils literal">&PyTuple_GET_ITEM(tuple, 0)</tt> and <tt class="docutils literal">&PyList_GET_ITEM(list,
0)</tt> are still commonly used to get direct access to items as <tt class="docutils literal">PyObject**</tt>.
<tt class="docutils literal">PyDescr_NAME()</tt> and <tt class="docutils literal">PyDescr_TYPE()</tt> are used by SWIG: see
<a class="reference external" href="https://bugs.python.org/issue46538">https://bugs.python.org/issue46538</a></p>
</div>
<div class="section" id="python-3-11">
<h2>Python 3.11</h2>
<p>Convert 33 macros to static inline functions:</p>
<ul class="simple">
<li><tt class="docutils literal">PyByteArray_AS_STRING()</tt></li>
<li><tt class="docutils literal">PyByteArray_GET_SIZE()</tt></li>
<li><tt class="docutils literal">PyBytes_AS_STRING()</tt></li>
<li><tt class="docutils literal">PyBytes_GET_SIZE()</tt></li>
<li><tt class="docutils literal">PyCFunction_GET_CLASS()</tt></li>
<li><tt class="docutils literal">PyCFunction_GET_FLAGS()</tt></li>
<li><tt class="docutils literal">PyCFunction_GET_FUNCTION()</tt></li>
<li><tt class="docutils literal">PyCFunction_GET_SELF()</tt></li>
<li><tt class="docutils literal">PyList_GET_SIZE()</tt></li>
<li><tt class="docutils literal">PyList_SET_ITEM()</tt></li>
<li><tt class="docutils literal">PyTuple_GET_SIZE()</tt></li>
<li><tt class="docutils literal">PyTuple_SET_ITEM()</tt></li>
<li><tt class="docutils literal">PyUnicode_AS_DATA()</tt></li>
<li><tt class="docutils literal">PyUnicode_AS_UNICODE()</tt></li>
<li><tt class="docutils literal">PyUnicode_CHECK_INTERNED()</tt></li>
<li><tt class="docutils literal">PyUnicode_DATA()</tt></li>
<li><tt class="docutils literal">PyUnicode_GET_DATA_SIZE()</tt></li>
<li><tt class="docutils literal">PyUnicode_GET_LENGTH()</tt></li>
<li><tt class="docutils literal">PyUnicode_GET_SIZE()</tt></li>
<li><tt class="docutils literal">PyUnicode_IS_ASCII()</tt></li>
<li><tt class="docutils literal">PyUnicode_IS_COMPACT()</tt></li>
<li><tt class="docutils literal">PyUnicode_IS_COMPACT_ASCII()</tt></li>
<li><tt class="docutils literal">PyUnicode_IS_READY()</tt></li>
<li><tt class="docutils literal">PyUnicode_MAX_CHAR_VALUE()</tt></li>
<li><tt class="docutils literal">PyUnicode_READ()</tt></li>
<li><tt class="docutils literal">PyUnicode_READY()</tt></li>
<li><tt class="docutils literal">PyUnicode_READ_CHAR()</tt></li>
<li><tt class="docutils literal">PyUnicode_WRITE()</tt></li>
<li><tt class="docutils literal">PyWeakref_GET_OBJECT()</tt></li>
<li><tt class="docutils literal">Py_SIZE()</tt>: <tt class="docutils literal">Py_SET_SIZE()</tt> must be used to set an object size</li>
<li><tt class="docutils literal">Py_TYPE()</tt>: <tt class="docutils literal">Py_SET_TYPE()</tt> must be used to set an object type</li>
<li><tt class="docutils literal">_PyUnicode_COMPACT_DATA()</tt></li>
<li><tt class="docutils literal">_PyUnicode_NONCOMPACT_DATA()</tt></li>
</ul>
<p>Convert 2 macros to regular functions:</p>
<ul class="simple">
<li><tt class="docutils literal">PyType_SUPPORTS_WEAKREFS()</tt></li>
<li><tt class="docutils literal">Py_GETENV()</tt></li>
</ul>
<p>Remove 11 macros:</p>
<ul class="simple">
<li>Moved to the internal C API:<ul>
<li><tt class="docutils literal">PyHeapType_GET_MEMBERS()</tt>: renamed to <tt class="docutils literal">_PyHeapType_GET_MEMBERS()</tt></li>
<li><tt class="docutils literal">_Py_InIntegralTypeRange()</tt></li>
<li><tt class="docutils literal">_Py_IntegralTypeMax()</tt></li>
<li><tt class="docutils literal">_Py_IntegralTypeMin()</tt></li>
<li><tt class="docutils literal">_Py_IntegralTypeSigned()</tt></li>
</ul>
</li>
<li><tt class="docutils literal">PyFunction_AS_FRAME_CONSTRUCTOR()</tt></li>
<li><tt class="docutils literal">Py_FORCE_DOUBLE()</tt></li>
<li><tt class="docutils literal">Py_OVERFLOWED()</tt></li>
<li><tt class="docutils literal">Py_SET_ERANGE_IF_OVERFLOW()</tt></li>
<li><tt class="docutils literal">Py_SET_ERRNO_ON_MATH_ERROR()</tt></li>
<li><tt class="docutils literal">_Py_SET_EDOM_FOR_NAN()</tt></li>
</ul>
<p>Add <tt class="docutils literal">_Py_RVALUE()</tt> to 7 macros to disallow using them as l-values:</p>
<ul class="simple">
<li><tt class="docutils literal">_PyGCHead_SET_FINALIZED()</tt></li>
<li><tt class="docutils literal">_PyGCHead_SET_NEXT()</tt></li>
<li><tt class="docutils literal">asdl_seq_GET()</tt></li>
<li><tt class="docutils literal">asdl_seq_GET_UNTYPED()</tt></li>
<li><tt class="docutils literal">asdl_seq_LEN()</tt></li>
<li><tt class="docutils literal">asdl_seq_SET()</tt></li>
<li><tt class="docutils literal">asdl_seq_SET_UNTYPED()</tt></li>
</ul>
<p>Note: the <tt class="docutils literal">PyCell_SET()</tt> macro was modified to use <tt class="docutils literal">_Py_RVALUE()</tt>, but it
already used <tt class="docutils literal">(void)</tt> in Python 3.10.</p>
</div>
<div class="section" id="python-3-10">
<h2>Python 3.10</h2>
<p>Convert 3 macros to regular functions:</p>
<ul class="simple">
<li><tt class="docutils literal">PyDescr_IsData()</tt></li>
<li><tt class="docutils literal">PyExceptionClass_Name()</tt></li>
<li><tt class="docutils literal">PyIter_Check()</tt></li>
</ul>
<p>Convert 2 macros to static inline functions:</p>
<ul class="simple">
<li><tt class="docutils literal">PyObject_TypeCheck()</tt></li>
<li><tt class="docutils literal">Py_REFCNT()</tt>: <tt class="docutils literal">Py_SET_REFCNT()</tt> must be used to set an object reference
count</li>
</ul>
<p>Remove 6 macros:</p>
<ul class="simple">
<li><tt class="docutils literal">PyAST_Compile()</tt></li>
<li><tt class="docutils literal">PyParser_SimpleParseFile()</tt></li>
<li><tt class="docutils literal">PyParser_SimpleParseString()</tt></li>
<li><tt class="docutils literal">PySTEntry_Check()</tt>: moved to the internal C API</li>
<li><tt class="docutils literal">_PyErr_OCCURRED()</tt></li>
<li><tt class="docutils literal">_PyList_ITEMS()</tt>: moved to the internal C API</li>
</ul>
<p>Modify 3 macros to disallow using them as l-values by adding a <tt class="docutils literal">(void)</tt> cast:</p>
<ul class="simple">
<li><tt class="docutils literal">PyCell_SET()</tt></li>
<li><tt class="docutils literal">PyList_SET_ITEM()</tt></li>
<li><tt class="docutils literal">PyTuple_SET_ITEM()</tt></li>
</ul>
</div>
<div class="section" id="python-3-9">
<h2>Python 3.9</h2>
<p>Convert 6 macros to regular functions:</p>
<ul class="simple">
<li><tt class="docutils literal">PyIndex_Check()</tt></li>
<li><tt class="docutils literal">PyObject_CheckBuffer()</tt></li>
<li><tt class="docutils literal">PyObject_GET_WEAKREFS_LISTPTR()</tt></li>
<li><tt class="docutils literal">PyObject_IS_GC()</tt></li>
<li><tt class="docutils literal">Py_EnterRecursiveCall()</tt></li>
<li><tt class="docutils literal">Py_LeaveRecursiveCall()</tt></li>
</ul>
<p>Convert 5 macros to static inline functions:</p>
<ul class="simple">
<li><tt class="docutils literal">PyType_Check()</tt></li>
<li><tt class="docutils literal">PyType_CheckExact()</tt></li>
<li><tt class="docutils literal">PyType_HasFeature()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_COPY()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_FILL()</tt></li>
</ul>
<p>Convert 3 static inline functions to regular functions:</p>
<ul class="simple">
<li><tt class="docutils literal">_Py_Dealloc()</tt></li>
<li><tt class="docutils literal">_Py_ForgetReference()</tt></li>
<li><tt class="docutils literal">_Py_NewReference()</tt></li>
</ul>
<p>Remove 18 macros:</p>
<ul class="simple">
<li>Moved to the internal C API:<ul>
<li><tt class="docutils literal">PyDoc_STRVAR_shared()</tt></li>
<li><tt class="docutils literal">PyObject_GC_IS_TRACKED()</tt></li>
<li><tt class="docutils literal">PyObject_GC_MAY_BE_TRACKED()</tt></li>
<li><tt class="docutils literal">Py_AS_GC()</tt></li>
<li><tt class="docutils literal">_PyGCHead_FINALIZED()</tt></li>
<li><tt class="docutils literal">_PyGCHead_NEXT()</tt></li>
<li><tt class="docutils literal">_PyGCHead_PREV()</tt></li>
<li><tt class="docutils literal">_PyGCHead_SET_FINALIZED()</tt></li>
<li><tt class="docutils literal">_PyGCHead_SET_NEXT()</tt></li>
<li><tt class="docutils literal">_PyGCHead_SET_PREV()</tt></li>
<li><tt class="docutils literal">_PyGC_SET_FINALIZED()</tt></li>
</ul>
</li>
<li><tt class="docutils literal">Py_UNICODE_MATCH()</tt></li>
<li><tt class="docutils literal">_Py_DEC_TPFREES()</tt></li>
<li><tt class="docutils literal">_Py_INC_TPALLOCS()</tt></li>
<li><tt class="docutils literal">_Py_INC_TPFREES()</tt></li>
<li><tt class="docutils literal">_Py_MakeEndRecCheck()</tt></li>
<li><tt class="docutils literal">_Py_MakeRecCheck()</tt></li>
<li><tt class="docutils literal">_Py_RecursionLimitLowerWaterMark()</tt></li>
</ul>
</div>
<div class="section" id="python-3-8">
<h2>Python 3.8</h2>
<p>Convert 9 macros to static inline functions:</p>
<ul class="simple">
<li><tt class="docutils literal">Py_DECREF()</tt></li>
<li><tt class="docutils literal">Py_INCREF()</tt></li>
<li><tt class="docutils literal">Py_XDECREF()</tt></li>
<li><tt class="docutils literal">Py_XINCREF()</tt></li>
<li><tt class="docutils literal">_PyObject_CallNoArg()</tt></li>
<li><tt class="docutils literal">_PyObject_FastCall()</tt></li>
<li><tt class="docutils literal">_Py_Dealloc()</tt></li>
<li><tt class="docutils literal">_Py_ForgetReference()</tt></li>
<li><tt class="docutils literal">_Py_NewReference()</tt></li>
</ul>
<p>Remove 7 macros:</p>
<ul class="simple">
<li><tt class="docutils literal">_PyGCHead_DECREF()</tt></li>
<li><tt class="docutils literal">_PyGCHead_REFS()</tt></li>
<li><tt class="docutils literal">_PyGCHead_SET_REFS()</tt></li>
<li><tt class="docutils literal">_PyGC_REFS()</tt></li>
<li><tt class="docutils literal">_PyObject_GC_TRACK()</tt>: moved to the internal C API</li>
<li><tt class="docutils literal">_PyObject_GC_UNTRACK()</tt>: moved to the internal C API</li>
<li><tt class="docutils literal">_Py_CHECK_REFCNT()</tt></li>
</ul>
</div>
Debug a Python reference leak2022-11-04T13:00:00+01:002022-11-04T13:00:00+01:00Victor Stinnertag:vstinner.github.io,2022-11-04:/debug-python-refleak.html<a class="reference external image-reference" href="https://twitter.com/djamilaknopf/status/1587441869403099136"><img alt="Childhood memories in the countryside" src="https://vstinner.github.io/images/refleak.jpg" /></a>
<p>This morning, I got <a class="reference external" href="https://mail.python.org/archives/list/buildbot-status@python.org/message/MU2EJRTFF4ZCYTDXYER7KCL3IQUM5F3T/">this email</a>
from the buildbot-status mailing list:</p>
<blockquote>
The Buildbot has detected a new failure on builder PPC64LE Fedora Rawhide
<strong>Refleaks</strong> 3.x while building Python.</blockquote>
<p>I get many buildbot failure reports per month (by email), but I like to debug
reference leaks: they are more challenging :-) I decided to write this article
to document and explain my work on maintaining Python (buildbots).</p>
<p>I truncated the output of most commands in this article to make it easier
to read.</p>
<p>Drawing: <a class="reference external" href="https://twitter.com/djamilaknopf/status/1587441869403099136">Childhood memories in the countryside</a> by <a class="reference external" href="https://twitter.com/djamilaknopf/">Djamila
Knopf</a>.</p>
<div class="section" id="reproduce-the-bug">
<h2>Reproduce the bug</h2>
<p>I look into <a class="reference external" href="https://buildbot.python.org/all/#builders/300/builds/548">buildbot logs</a>:</p>
<pre class="literal-block">
test_int leaked [1, 1, 1] references, sum=3
</pre>
<p>Aha, interesting: the <tt class="docutils literal">test_int</tt> test leaks Python strong references; each
test iteration leaks exactly one reference. Well, in short, it leaks memory.</p>
<p>I build Python to check if the refleak is still there:</p>
<pre class="literal-block">
git switch main
make clean
./configure --with-pydebug
make
</pre>
<p>The main branch is currently at this commit:</p>
<pre class="literal-block">
$ git show main
commit 2844aa6a8eb1d486b5c432f0ed33a2082998f41e
(...)
</pre>
<p>I run the test with <tt class="docutils literal"><span class="pre">-R</span> 3:3</tt> to check for reference leaks:</p>
<pre class="literal-block">
$ ./python -m test -R 3:3 test_int
(...)
test_int leaked [1, 1, 1] references, sum=3
(...)
Total duration: 4.8 sec
</pre>
<p>Great! It's still there, it's a real regression. I told you, I love this kind
of bug :-)</p>
</div>
<div class="section" id="identify-which-test-leaks-test-bisect-cmd">
<h2>Identify which test leaks (test.bisect_cmd)</h2>
<pre class="literal-block">
$ ./python -m test test_int --list-cases|wc -l
42
$ wc -l Lib/test/test_int.py
885 Lib/test/test_int.py
</pre>
<p><tt class="docutils literal">test_int</tt> has only 42 methods and takes 4.8 seconds to run (with <tt class="docutils literal"><span class="pre">-R</span>
3:3</tt>). That's small, but the file is made of 885 lines of Python code. I'm
lazy, I don't want to read so many lines. I will use <tt class="docutils literal">python <span class="pre">-m</span>
test.bisect_cmd</tt> to identify which test method leaks so I have less test code
to read and reproducing the test will be even faster.</p>
<p>I run <tt class="docutils literal">python <span class="pre">-m</span> test.bisect_cmd</tt>:</p>
<pre class="literal-block">
$ ./python -m test.bisect_cmd -R 3:3 test_int
(...)
[+] Iteration 17: run 1 tests/2
(...)
test_int leaked [1, 1, 1] references, sum=3
(...)
* test.test_int.PyLongModuleTests.test_pylong_misbehavior_error_path_from_str
</pre>
<p>I love watching this tool doing my job, I don't have anything to do! :-)</p>
<p>I confirm that the <tt class="docutils literal">test_pylong_misbehavior_error_path_from_str()</tt> test
leaks:</p>
<pre class="literal-block">
$ ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
test_int leaked [1, 1, 1] references, sum=3
Total duration: 445 ms
</pre>
<p>The <tt class="docutils literal">test_pylong_misbehavior_error_path_from_str()</tt> method is only 17 lines
of code, it's way better than 885 lines of code (52x less code to read). And
reproducing the bug now only takes 445 ms instead of 4.8 seconds (10x faster).</p>
<p>At this point, there is the brave method of looking into the C code: Python is
made of 500 000 lines of C code. Good luck! Or maybe there is another way?</p>
</div>
<div class="section" id="git-bisection">
<h2>Git bisection</h2>
<p>Again, I'm lazy. I always begin with the "divide and conquer" method. A Git
bisection is an efficient method for that.</p>
<p>I start <tt class="docutils literal">git bisect</tt>:</p>
<pre class="literal-block">
git bisect reset
git bisect start --term-bad=leak --term-good=noleak
git bisect leak # we just saw that current commit leaks
</pre>
<p>Defining "good" and "bad" terms helps me a lot to prevent mistakes: it's a nice
Git bisect feature! In the past, I always picked the wrong one at some point
which messed up the whole bisection.</p>
<p>Ok, now how can I know when the leak was introduced? Well, I like to move in
the past step by step: one day, two days, one week, one month, one year, etc.</p>
<p>I pick a random commit merged yesterday:</p>
<pre class="literal-block">
$ date
Fri Nov 4 11:55:12 CET 2022
$ git log
(...)
commit 016c7d37b6acfe2203542a2655080c6402b3be1f
Date: Thu Nov 3 23:21:01 2022 +0000
(...)
commit 4c4b5ce2e529a1279cd287e2d2d73ffcb6cf2ead
Date: Thu Nov 3 16:18:38 2022 -0700
(...)
</pre>
<p>I'm not lucky at my first bet, the code already leaked yesterday:</p>
<pre class="literal-block">
$ git checkout 4c4b5ce2e529a1279cd287e2d2d73ffcb6cf2ead^C
$ make && ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
test_int leaked [1, 1, 1] references, sum=3
</pre>
<p>I repeat the process, I pick a random commit the day before:</p>
<pre class="literal-block">
$ git log
(...)
commit f3007ac3702ea22c7dd0abf8692b1504ea3c9f63
Author: Victor Stinner <vstinner@python.org>
Date: Wed Nov 2 20:45:58 2022 +0100
(...)
</pre>
<p>To my great pleasure, I pick a commit made by myself. Maybe I'm lucky and
I'm the one who introduced the leak :-D</p>
<pre class="literal-block">
$ git checkout f3007ac3702ea22c7dd0abf8692b1504ea3c9f63
$ make && ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
(...)
Tests result: NO TESTS RAN
</pre>
<p>"NO TESTS RAN" means that the test doesn't exist. Oh wait, the test didn't
exist 2 days ago? So the test itself is new? Well, no tests ran also means...
"no leak".</p>
<p>I will make the assumption that "NO TESTS RAN" means "no leak" and see what's
going on:</p>
<pre class="literal-block">
$ git bisect noleak
Bisecting: 13 revisions left to test after this (roughly 4 steps)
$ make && ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
Tests result: NO TESTS RAN
$ git bisect noleak
Bisecting: 6 revisions left to test after this (roughly 3 steps)
$ make && ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
Tests result: NO TESTS RAN
$ git bisect noleak
Bisecting: 3 revisions left to test after this (roughly 2 steps)
$ make && ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
Tests result: NO TESTS RAN
$ git bisect noleak
Bisecting: 1 revision left to test after this (roughly 1 step)
$ make && ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
test_int leaked [1, 1, 1] references, sum=3
$ git bisect leak
Bisecting: 0 revisions left to test after this (roughly 0 steps)
$ make && ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
test_int leaked [1, 1, 1] references, sum=3
vstinner@mona$ git bisect leak
4c4b5ce2e529a1279cd287e2d2d73ffcb6cf2ead is the first leak commit
commit 4c4b5ce2e529a1279cd287e2d2d73ffcb6cf2ead
Author: Gregory P. Smith <greg@krypto.org>
Date: Thu Nov 3 16:18:38 2022 -0700
gh-90716: bugfixes and more tests for _pylong. (#99073)
* Properly decref on _pylong import error.
* Improve the error message on _pylong TypeError.
* Fix the assertion error in pydebug builds to be a TypeError.
* Tie the return value comments together.
These are minor followups to issues not caught among the reviewers on
https://github.com/python/cpython/pull/96673.
Lib/test/test_int.py | 39 +++++++++++++++++++++++++++++++++++++++
Objects/longobject.c | 15 +++++++++++----
2 files changed, 50 insertions(+), 4 deletions(-)
</pre>
<p>In total, it took 7 <tt class="docutils literal">git bisect</tt> steps to identify a single commit. That's
quick! I also love this tool, I feel that it does my job!</p>
<p>Sometimes, I mess up with Git bisection. Here, <a class="reference external" href="https://github.com/python/cpython/commit/4c4b5ce2e529a1279cd287e2d2d73ffcb6cf2ead">the guilty commit</a>
seems like a good candidate since it changes <tt class="docutils literal">Objects/longobject.c</tt> which is
C code, so it can likely introduce a leak. Moreover, this C file is the
implementation of the Python <tt class="docutils literal">int</tt> type, so it is directly related to
<tt class="docutils literal">test_int</tt> (the test suite of the <tt class="docutils literal">int</tt> type).</p>
<p>Just in case, I manually test the leak before/after:</p>
<pre class="literal-block">
# after
$ git checkout 4c4b5ce2e529a1279cd287e2d2d73ffcb6cf2ead
$ make && ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
test_int leaked [1, 1, 1] references, sum=3
# before
$ git checkout 4c4b5ce2e529a1279cd287e2d2d73ffcb6cf2ead^
$ make && ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
Tests result: NO TESTS RAN
</pre>
<p>Ok, there is no doubt anymore: the commit introduced the leak. But since the
commit also adds the leaking test, maybe the leak already existed, and it's
just that nobody noticed the leak before.</p>
</div>
<div class="section" id="debug-the-leak">
<h2>Debug the leak</h2>
<p>Since I identified the commit introducing the leak, I only have to review code
changes by this single commit. But to debug the code, I prefer to come back to
the main branch. To prepare a fix, I will have to start from the main branch
anyway.</p>
<p>Go back to the main branch:</p>
<pre class="literal-block">
$ git bisect reset
$ git switch main
</pre>
<p>The second command is useless, I was already on the main branch. I made so
many mistakes with Git in the past that I got into the habit of doing things
very carefully. I don't mind doing things twice, just in case. It's cheaper
than messing with the Git god! Trust me.</p>
<p>Just in case, I double check that the leak is still there in the main branch:</p>
<pre class="literal-block">
$ make && ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
test_int leaked [1, 1, 1] references, sum=3
</pre>
<p>Ok, we are good to start debugging. Let me open Lib/test/test_int.py and look
for the test_pylong_misbehavior_error_path_from_str() method:</p>
<pre class="literal-block">
@support.cpython_only # tests implementation details of CPython.
@unittest.skipUnless(_pylong, "_pylong module required")
@mock.patch.object(_pylong, "int_from_string")
def test_pylong_misbehavior_error_path_from_str(
self, mock_int_from_str):
big_value = '7'*19_999
with support.adjust_int_max_str_digits(20_000):
mock_int_from_str.return_value = b'not an int'
with self.assertRaises(TypeError) as ctx:
int(big_value)
self.assertIn('_pylong.int_from_string did not',
str(ctx.exception))
mock_int_from_str.side_effect = RuntimeError("test123")
with self.assertRaises(RuntimeError):
int(big_value)
</pre>
<p>Always divide and conquer: let me try to make the code as short as possible (7
lines); I also make the "big_value" smaller:</p>
<pre class="literal-block">
@mock.patch.object(_pylong, "int_from_string")
def test_pylong_misbehavior_error_path_from_str(self, mock_int_from_str):
big_value = '7' * 9999
with support.adjust_int_max_str_digits(10_000):
mock_int_from_str.return_value = b'not an int'
with self.assertRaises(TypeError) as ctx:
int(big_value)
</pre>
<p>Ok, so the test is about converting a long string (9999 decimal digits) to an
integer using the new <tt class="docutils literal">_pylong</tt> module which is implemented
in pure Python (<tt class="docutils literal">Lib/_pylong.py</tt>) and called from C code
(<tt class="docutils literal">Objects/longobject.c</tt>). Well, I followed recent developments, so I don't
have to dig into the C code to know that. It helps!</p>
<p>If I search for <tt class="docutils literal">_pylong</tt> in <tt class="docutils literal">Objects/longobject.c</tt>, I find this
interesting function:</p>
<pre class="literal-block">
/* asymptotically faster str-to-long conversion for base 10, using _pylong.py */
static int
pylong_int_from_string(const char *start, const char *end, PyLongObject **res)
{
PyObject *mod = PyImport_ImportModule("_pylong");
...
}
</pre>
<p>With a quick look, I don't see any obvious reference leak in this code. I add
<tt class="docutils literal">printf()</tt> to make sure that I'm looking at the right function:</p>
<pre class="literal-block">
static int
pylong_int_from_string(const char *start, const char *end, PyLongObject **res)
{
...
PyObject *s = PyUnicode_FromStringAndSize(start, end-start);
if (s == NULL) {
Py_DECREF(mod);
goto error;
}
printf("pylong_int_from_string()\n");
PyObject *result = PyObject_CallMethod(mod, "int_from_string", "O", s);
...
}
</pre>
<p>I added the print before the int_from_string() call, since this function is
overridden by the test.</p>
<p>I build Python and run the test:</p>
<pre class="literal-block">
$ make
$ ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
(...)
beginning 6 repetitions
123456
pylong_int_from_string()
.pylong_int_from_string()
.pylong_int_from_string()
.pylong_int_from_string()
.pylong_int_from_string()
.pylong_int_from_string()
(...)
</pre>
<p>Ok, I'm looking at the right place. The print happens when the test runs. So
which code path is taken? Let me add print calls <em>after</em> the function call:</p>
<pre class="literal-block">
static int
pylong_int_from_string(const char *start, const char *end, PyLongObject **res)
{
...
PyObject *result = PyObject_CallMethod(mod, "int_from_string", "O", s);
Py_DECREF(s);
Py_DECREF(mod);
if (result == NULL) {
printf("pylong_int_from_string() error\n"); // <====== ADD
goto error;
}
if (!PyLong_Check(result)) {
printf("pylong_int_from_string() wrong type\n"); // <====== ADD
PyErr_SetString(PyExc_TypeError,
"_pylong.int_from_string did not return an int");
goto error;
}
printf("pylong_int_from_string() ok\n"); // <====== ADD
...
}
</pre>
<p>Test output:</p>
<pre class="literal-block">
...
pylong_int_from_string() wrong type
.pylong_int_from_string() wrong type
.pylong_int_from_string() wrong type
...
</pre>
<p>Aha, the bug should be around the <tt class="docutils literal">if (!PyLong_Check(result))</tt> code path. Oh
wait... <tt class="docutils literal">result</tt> is a Python object, and in this code path, the function exits
without returning <tt class="docutils literal">result</tt> to the caller, nor removing the reference to
<tt class="docutils literal">result</tt>. That's our leak!</p>
</div>
<div class="section" id="write-a-fix">
<h2>Write a fix</h2>
<p>To write a fix, I start by reverting all local changes (remove debug traces,
restore the original test code):</p>
<pre class="literal-block">
$ git checkout .
</pre>
<p>I write a fix:</p>
<pre class="literal-block">
$ git diff
diff --git a/Objects/longobject.c b/Objects/longobject.c
index a872938990..652fdb7974 100644
--- a/Objects/longobject.c
+++ b/Objects/longobject.c
@@ -2376,6 +2376,7 @@ pylong_int_from_string(const char *start, const char *end, PyLongObject **res)
goto error;
}
if (!PyLong_Check(result)) {
+ Py_DECREF(result);
PyErr_SetString(PyExc_TypeError,
"_pylong.int_from_string did not return an int");
goto error;
</pre>
<p>I build and test my fix:</p>
<pre class="literal-block">
$ make && ./python -m test -R 3:3 test_int -m test_pylong_misbehavior_error_path_from_str
(...)
Tests result: SUCCESS
</pre>
<p>Ok, the leak is fixed! So it was just a missing <tt class="docutils literal">Py_DECREF()</tt> in code
recently added to Python. It's a common mistake. By the way, when I looked at
the code the first time, I also missed this "obvious" leak.</p>
<p>I prepare a PR:</p>
<pre class="literal-block">
$ git switch -c int_str
$ git commit -a
# Commit message:
# gh-90716: Fix pylong_int_from_string() refleak
</pre>
<p>Let me validate my work from the new clean commit:</p>
<pre class="literal-block">
$ make && ./python -m test -R 3:3 test_int
(...)
Tests result: SUCCESS
</pre>
<p>I complete the commit message using <tt class="docutils literal">git commit <span class="pre">--amend</span></tt>:</p>
<pre class="literal-block">
gh-90716: Fix pylong_int_from_string() refleak
Fix validated by:
$ ./python -m test -R 3:3 test_int
Tests result: SUCCESS
</pre>
<p>I run <tt class="docutils literal">gh_pr.sh</tt> (my short shell script) to create a PR from the command
line.</p>
<p>I add the <tt class="docutils literal">skip news</tt> label on the PR: since this refleak is not part of any
Python release, no user is impacted, so it's not worth documenting. I don't
think that the leaking change is part of Python 3.12 alpha 1. Moreover, only
very few users test alpha releases.</p>
<p>Here it is, my shiny PR fixing the leak! <a class="reference external" href="https://github.com/python/cpython/pull/99094">https://github.com/python/cpython/pull/99094</a></p>
<p>Since Gregory worked on longobject.c recently, I put him in copy of my PR: I
just add the comment <tt class="docutils literal">cc @gpshead</tt> to the PR.</p>
<p>I don't plan to wait for this review: the change is just one line and I'm
confident that it fixes the issue, so I don't need a review.</p>
<p>To finish, I <a class="reference external" href="https://mail.python.org/archives/list/buildbot-status@python.org/message/J3MC7FIPFN6GNQAWQQRHE4EDLE7J2MIQ/">reply by email to the buildbot-status failure email</a>.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>In total, it took me between one and two hours to reproduce, debug and fix this
reference leak.</p>
<p>In the meantime, I also looked into other Python work (and I chatted with
friends!) while the bisection was running or during the Python builds. It's
hard to estimate exactly how much time it takes me to fix a refleak.</p>
<p>I consider that I'm efficient at fixing such leaks since I follow Python
development: I was already aware of the ongoing <tt class="docutils literal">_pylong</tt> work. I
also fixed many refleaks in the past.</p>
<p>By the way, I wrote the <tt class="docutils literal">python <span class="pre">-m</span> test.bisect_cmd</tt> tool exactly to
accelerate my work on debugging reference leaks. I'm now also used to Git
bisection.</p>
<p>For me, <strong>the key of my whole methodology is to "divide and conquer"</strong>:</p>
<ul class="simple">
<li>Reproduce the issue</li>
<li>Get a reproducer</li>
<li>Make the reproducer as fast as possible and as short as possible</li>
<li>Use Git bisection to identify the commit introducing the bug</li>
<li>Add print calls to identify which code paths the failing test
takes</li>
</ul>
<p>Oh by the way, while I was finishing this article, my PR got reviewed and I merged it:
<a class="reference external" href="https://github.com/python/cpython/commit/387f72588d538bc56669f0f28cc41df854fc5b43">my commit fixing the leak</a>!</p>
</div>
Python C API: Add functions to access PyObject2021-10-05T14:00:00+02:002021-10-05T14:00:00+02:00Victor Stinnertag:vstinner.github.io,2021-10-05:/c-api-abstract-pyobject.html<a class="reference external image-reference" href="https://twitter.com/Kekeflipnote/status/1433139994516934663"><img alt="A spider in my bedroom" src="https://vstinner.github.io/images/spider.png" /></a>
<p>The PyObject structure indirectly prevents optimizing CPython. We will see why
and how I prepared the C API to make this structure opaque. It took me a year
and a half to add functions and to introduce <strong>incompatible C API changes</strong>
(fear!).</p>
<p>In February 2020, I started by adding functions like <tt class="docutils literal">Py_SET_TYPE()</tt> to
abstract accesses to the <tt class="docutils literal">PyObject</tt> structure. I modified C extensions of the
standard library to use functions like <tt class="docutils literal">Py_TYPE()</tt> and <tt class="docutils literal">Py_SET_TYPE()</tt>.</p>
<p>I converted the <tt class="docutils literal">Py_TYPE()</tt> macro to a static inline function, but my change
was reverted twice. I had to fix many C extensions and fix a test_exceptions
crash on Windows to be able to finally merge my change in September 2021.</p>
<p>Finally, we will also see what can be done next to be able to fully make the
PyObject structure opaque.</p>
<p>Thanks to <strong>Dong-hee Na</strong>, <strong>Hai Shi</strong> and <strong>Andy Lester</strong> who helped me to
make these changes, and thanks to <strong>Miro Hrončok</strong> who reported C extensions
broken by my incompatible C API changes.</p>
<p>This article is a follow-up of the <a class="reference external" href="https://vstinner.github.io/c-api-opaque-structures.html">Make structures opaque in the Python C API</a> article.</p>
<p><em>Drawing: "A spider in my bedroom" by Kéké</em></p>
<div class="section" id="the-c-api-prevents-to-optimize-cpython">
<h2>The C API prevents optimizing CPython</h2>
<p>The C API allows accessing structure members directly by dereferencing a
<tt class="docutils literal">PyObject*</tt> pointer. Example: getting the reference count of an
object directly:</p>
<pre class="literal-block">
Py_ssize_t get_refcnt(PyObject *obj)
{
return obj->ob_refcnt;
}
</pre>
<p>This ability to access structure members directly prevents optimizing CPython.</p>
<div class="section" id="mandatory-inefficient-boxing-unboxing">
<h3>Mandatory inefficient boxing/unboxing</h3>
<p>The ability to dereference a <tt class="docutils literal">PyObject*</tt> pointer prevents optimizations which
avoid inefficient boxing/unboxing, like tagged pointers or list strategies.</p>
</div>
<div class="section" id="no-tagged-pointer">
<h3>No tagged pointer</h3>
<p>Tagged pointers require adding code to all functions which currently
dereference object pointers. The current C API prevents doing that in C
extensions, since pointers can be dereferenced directly.</p>
</div>
<div class="section" id="no-list-strategies">
<h3>No list strategies</h3>
<p>Since all Python object structures must start with a <tt class="docutils literal">PyObject ob_base;</tt>
member, it is not possible to make other structures opaque before PyObject is
made opaque. It prevents implementing PyPy list strategies to reduce the memory
footprint, like storing an array of numbers directly as numbers, not as boxed
numbers (<tt class="docutils literal">PyLongObject</tt> objects).</p>
<p>Currently, the <tt class="docutils literal">PyListObject</tt> structure cannot be made opaque. If
<tt class="docutils literal">PyListObject</tt> could be made opaque, it would be possible to store an array
of numbers directly as numbers, and to box objects in <tt class="docutils literal">PyList_GetItem()</tt> on
demand.</p>
</div>
<div class="section" id="no-moving-garbage-collector">
<h3>No moving garbage collector</h3>
<p>Being able to dereference a <tt class="docutils literal"><span class="pre">PyObject*</span></tt> pointer also prevents moving
objects in memory. A moving garbage collector can compact memory to reduce
fragmentation. Currently, it cannot be implemented in CPython.</p>
</div>
<div class="section" id="cannot-allocate-temporarily-objects-on-the-stack">
<h3>Cannot allocate temporarily objects on the stack</h3>
<p>In CPython, all objects must be allocated on the heap. If an object were
allocated on the stack, stored in a list, and the list remained accessible
after the function returned, the stack memory would no longer be valid: the
list would be corrupted.</p>
<p>If objects were only referenced through opaque handles, as in the HPy project,
it would be possible to copy an object from the stack to heap memory when the
object is added to the list.</p>
</div>
<div class="section" id="reference-counting-doesn-t-scale">
<h3>Reference counting doesn't scale</h3>
<p>The <tt class="docutils literal">PyObject</tt> structure has a reference count (the <tt class="docutils literal">ob_refcnt</tt> member),
but reference counting is a performance bottleneck when the same objects are
used from multiple threads running in parallel: contention quickly arises on
the memory cache line containing the <tt class="docutils literal">PyObject.ob_refcnt</tt> counter. This is
especially true for the most commonly used Python objects, like the None and
True singletons, which all CPUs want to read or modify in parallel.</p>
<p>This problem killed the Gilectomy project which attempted to remove the GIL
from CPython.</p>
<p>A <a class="reference external" href="https://en.wikipedia.org/wiki/Tracing_garbage_collection">tracing garbage collector</a> doesn't need
reference counting, but it cannot be implemented currently because of the
<tt class="docutils literal">PyObject</tt> structure.</p>
</div>
</div>
<div class="section" id="creation-of-the-issue-feb-2020">
<h2>Creation of the issue (Feb 2020)</h2>
<p>In February 2020, I created <a class="reference external" href="https://bugs.python.org/issue39573">bpo-39573</a>: "[C API] Make PyObject an opaque
structure in the limited C API". It is related to my work on <a class="reference external" href="https://www.python.org/dev/peps/pep-0620/">PEP 620
(Hide implementation details from the C API)</a>.</p>
<p>My initial plan was to make the PyObject structure fully opaque in the C API.</p>
</div>
<div class="section" id="add-functions">
<h2>Add functions</h2>
<p>In Python 3.8, the <tt class="docutils literal">Py_REFCNT()</tt> and <tt class="docutils literal">Py_TYPE()</tt> macros could still be used
to directly set an object's reference count or type:</p>
<pre class="literal-block">
Py_REFCNT(obj) = new_refcnt;
Py_TYPE(obj) = new_type;
</pre>
<p>Such syntax requires direct access to the <tt class="docutils literal">PyObject.ob_refcnt</tt> and
<tt class="docutils literal">PyObject.ob_type</tt> members as l-values.</p>
<p>In Python 3.9, I added the Py_SET_REFCNT() and Py_SET_TYPE() functions to add
an abstraction over <tt class="docutils literal">PyObject</tt> members, and I added <tt class="docutils literal">Py_SET_SIZE()</tt> to
abstract the <tt class="docutils literal">PyVarObject.ob_size</tt> member.</p>
<p>In Python 3.9, I also added the <tt class="docutils literal">Py_IS_TYPE(obj, type)</tt> helper function to
test an object's type. It is equivalent to <tt class="docutils literal">Py_TYPE(obj) == type</tt>.</p>
</div>
<div class="section" id="use-py-type-and-py-set-size-in-the-stdlib">
<h2>Use Py_TYPE() and Py_SET_SIZE() in the stdlib</h2>
<p>I modified the standard library (C extensions) to no longer access
<tt class="docutils literal">PyObject</tt> and <tt class="docutils literal">PyVarObject</tt> members directly:</p>
<ul class="simple">
<li>Replace <tt class="docutils literal"><span class="pre">"obj->ob_refcnt"</span></tt> with <tt class="docutils literal">Py_REFCNT(obj)</tt></li>
<li>Replace <tt class="docutils literal"><span class="pre">"obj->ob_type"</span></tt> with <tt class="docutils literal">Py_TYPE(obj)</tt></li>
<li>Replace <tt class="docutils literal"><span class="pre">"obj->ob_size"</span></tt> with <tt class="docutils literal">Py_SIZE(obj)</tt></li>
<li>Replace <tt class="docutils literal">"Py_REFCNT(obj) = new_refcnt"</tt> with <tt class="docutils literal">Py_SET_REFCNT(obj, new_refcnt)</tt></li>
<li>Replace <tt class="docutils literal">"Py_TYPE(obj) = new_type"</tt> with <tt class="docutils literal">Py_SET_TYPE(obj, new_type)</tt></li>
<li>Replace <tt class="docutils literal">"Py_SIZE(obj) = new_size"</tt> with <tt class="docutils literal">Py_SET_SIZE(obj, new_size)</tt></li>
<li>Replace <tt class="docutils literal">"Py_TYPE(obj) == type"</tt> test with <tt class="docutils literal">Py_IS_TYPE(obj, type)</tt></li>
</ul>
</div>
<div class="section" id="enforce-py-set-type">
<h2>Enforce Py_SET_TYPE()</h2>
<p>In Python 3.10, I converted the Py_REFCNT(), Py_TYPE() and Py_SIZE() macros to
static inline functions, so that <tt class="docutils literal">Py_TYPE(obj) = new_type</tt> becomes a
compiler error.</p>
<p>Static inline functions still access <tt class="docutils literal">PyObject</tt> and <tt class="docutils literal">PyVarObject</tt>
members directly at the ABI level, and so don't achieve the initial goal: "make
the PyObject structure opaque". Not accessing members at the ABI level can have
a negative impact on performance, so I prefer to address that later. I already
got enough backfire with the other C API changes that I made :-)</p>
</div>
<div class="section" id="broken-c-extensions-first-revert">
<h2>Broken C extensions (first revert)</h2>
<p>Converting the Py_TYPE() and Py_SIZE() macros to static inline functions broke
16 C extensions:</p>
<ul class="simple">
<li><strong>Cython</strong></li>
<li>PyPAM</li>
<li>bitarray</li>
<li>boost</li>
<li>breezy</li>
<li>duplicity</li>
<li>gobject-introspection</li>
<li>immutables</li>
<li>mercurial</li>
<li><strong>numpy</strong></li>
<li>pybluez</li>
<li>pycurl</li>
<li>pygobject3</li>
<li>pylibacl</li>
<li>pyside2</li>
<li>rdiff-backup</li>
</ul>
<p>In November 2020, during the Python 3.10 devcycle, I preferred to revert
Py_TYPE() and Py_SIZE() changes.</p>
<p>I kept the Py_REFCNT() change since it only broke a single C extension
(PySide2) and it was simple to update it to Py_SET_REFCNT().</p>
</div>
<div class="section" id="pythoncapi-compat">
<h2>pythoncapi_compat</h2>
<p>I created the <a class="reference external" href="https://github.com/pythoncapi/pythoncapi_compat">pythoncapi_compat</a> project to provide the
following functions to Python 3.8 and older:</p>
<ul class="simple">
<li><tt class="docutils literal">Py_SET_REFCNT()</tt></li>
<li><tt class="docutils literal">Py_SET_TYPE()</tt></li>
<li><tt class="docutils literal">Py_SET_SIZE()</tt></li>
<li><tt class="docutils literal">Py_IS_TYPE()</tt></li>
</ul>
<p>I also wrote an upgrade_pythoncapi.py script to upgrade C extensions to use
these functions without losing support for Python 3.8 and older.</p>
<p>Using the pythoncapi_compat project, I managed to update multiple C
extensions to prepare them for Py_TYPE() becoming a static inline function.</p>
</div>
<div class="section" id="test-exceptions-crash-second-revert">
<h2>test_exceptions crash (second revert)</h2>
<p>In June 2021, during the Python 3.11 devcycle, I changed Py_TYPE() and
Py_SIZE() again, since <a class="reference external" href="https://bugs.python.org/issue39573#msg401378">most C extensions had been fixed in the meantime</a>.</p>
<p>Problem: <tt class="docutils literal">test_recursion_in_except_handler()</tt> of <tt class="docutils literal">test_exceptions</tt> started
to crash on a Python debug build on Windows: see <a class="reference external" href="https://bugs.python.org/issue44348">bpo-44348</a>.</p>
<p>Since nobody understood the issue, it was decided to revert my change again to
repair buildbots.</p>
</div>
<div class="section" id="fix-baseexception-deallocator">
<h2>Fix BaseException deallocator</h2>
<p>In September 2021, I looked into the test_exceptions crash. In a <strong>debug build</strong>,
the MSC compiler <strong>doesn't inline</strong> calls to static inline functions. Because
of that, converting the Py_TYPE() macro to a static inline function <strong>increases
the stack memory usage</strong> of a Python debug build on Windows.</p>
<p>I proposed to enable compiler optimizations when building Python in debug mode
on Windows, to inline calls to static inline functions like Py_TYPE(). This
idea was rejected, since the debug build must remain fully usable in a
debugger.</p>
<p>I looked at the crash again and found the root issue:
test_recursion_in_except_handler() creates chains of exceptions. When an
exception is deallocated, it calls the deallocator of another exception, and so on.</p>
<ul class="simple">
<li>recurse_in_except() sub-test creates chains of 11 nested deallocator calls</li>
<li>recurse_in_body_and_except() sub-test creates a chain of <strong>8192 nested deallocator calls</strong></li>
</ul>
<p>I proposed a change to use the <strong>trashcan mechanism</strong>, which limits the call
stack to 50 function calls. I checked with a benchmark that the performance
overhead is acceptable. My change fixed the test_exceptions crash!</p>
</div>
<div class="section" id="close-the-pyobject-issue">
<h2>Close the PyObject issue</h2>
<p>Since most C extensions and test_exceptions were fixed, I was able to change
Py_TYPE() and Py_SIZE() for the third time. My final commit:
<a class="reference external" href="https://github.com/python/cpython/commit/cb15afcccffc6c42cbfb7456ce8db89cd2f77512">Py_TYPE becomes a static inline function</a>.</p>
<p>I changed the issue topic to restrict it to adding functions to access PyObject
members; previously, the goal was to make the PyObject structure opaque.
It took a year and a half to make all these changes.</p>
</div>
<div class="section" id="what-s-next-to-make-pyobject-opaque">
<h2>What's Next to Make PyObject opaque?</h2>
<p>The <tt class="docutils literal">PyObject</tt> structure is used to define the structures of all Python
types, like <tt class="docutils literal">PyListObject</tt>. All structures start with <tt class="docutils literal">PyObject ob_base;</tt>,
and so the compiler must have access to the <tt class="docutils literal">PyObject</tt> structure.</p>
<p>Moreover, the <tt class="docutils literal">PyType_FromSpec()</tt> and <tt class="docutils literal">PyType_Spec</tt> API indirectly uses
<tt class="docutils literal">sizeof(PyObject)</tt> in the <tt class="docutils literal">PyType_Spec.basicsize</tt> member when defining a
type.</p>
<p>One option to make the <tt class="docutils literal">PyObject</tt> structure opaque would be to modify the
<tt class="docutils literal">PyObject</tt> structure to make it empty, and move its members into a new
private <tt class="docutils literal">_PyObject</tt> structure. This <tt class="docutils literal">_PyObject</tt> structure would be
allocated before the <tt class="docutils literal">PyObject*</tt> pointer, same idea as the current
<tt class="docutils literal">PyGC_Head</tt> header which is also allocated before the <tt class="docutils literal">PyObject*</tt> pointer.</p>
<p>These changes are more complex than I expected, so I prefer to open a new issue
later to propose them. The performance impact of these changes must also be
checked with benchmarks, to ensure that there is no overhead, or that the
overhead is acceptable.</p>
</div>
C API changes between Python 3.5 to 3.102021-10-04T15:00:00+02:002021-10-04T15:00:00+02:00Victor Stinnertag:vstinner.github.io,2021-10-04:/c-api-python3_10-changes.html<img alt="Homer Simpson hiding" src="https://vstinner.github.io/images/homer_hiding.webp" />
<p>I'm trying to enhance and to fix the Python C API for 5 years. My first goal
was to shrink the C API without breaking third party C extensions. I hid many
private functions from the public functions: I moved them to the "internal C
API". I also deprecated and removed many functions.</p>
<p>Between Python 3.5 and 3.10, 80 symbols have been removed. Python 3.10 is the
first Python version exporting fewer symbols than its predecessor!</p>
<p>Since Python 3.8, the C API is organized as 3 parts:</p>
<ol class="arabic simple">
<li><tt class="docutils literal">Include/</tt> directory: Limited API</li>
<li><tt class="docutils literal">Include/cpython/</tt> directory: CPython implementation details</li>
<li><tt class="docutils literal">Include/internal/</tt> directory: The internal API</li>
</ol>
<p>The devguide <a class="reference external" href="https://devguide.python.org/c-api/">Changing Python’s C API</a>
documentation now gives guidelines for C API additions, like avoiding borrowed
references.</p>
<p>The limited C API got a few more functions, whereas broken and private
functions have been removed. The Stable ABI is now explicitly defined and
documented in the <a class="reference external" href="https://docs.python.org/dev/c-api/stable.html#stable">C API Stability</a> page.</p>
<p>This article lists all C API changes, not only the ones done by me.</p>
<div class="section" id="shrink-the-the-c-api">
<h2>Shrink the C API</h2>
<p>Between Python 3.5 and 3.10, 80 symbols (functions or variables) have been
removed, 3 structures have been removed, and 21 functions have been deprecated.
In the meantime, other symbols have been added at each Python version to
implement new Python features.</p>
<p>Python 3.10 is the first Python version exporting fewer symbols than its
predecessor.</p>
<div class="section" id="python-3-6">
<h3>Python 3.6</h3>
<p>Deprecate 4 functions:</p>
<ul class="simple">
<li><tt class="docutils literal">PyUnicode_AsDecodedObject()</tt></li>
<li><tt class="docutils literal">PyUnicode_AsDecodedUnicode()</tt></li>
<li><tt class="docutils literal">PyUnicode_AsEncodedObject()</tt></li>
<li><tt class="docutils literal">PyUnicode_AsEncodedUnicode()</tt></li>
</ul>
</div>
<div class="section" id="python-3-7">
<h3>Python 3.7</h3>
<ul class="simple">
<li>Deprecate <tt class="docutils literal">PyOS_AfterFork()</tt></li>
<li>Remove <tt class="docutils literal">PyExc_RecursionErrorInst</tt> singleton (also removed in Python 3.6.4).</li>
</ul>
</div>
<div class="section" id="python-3-8">
<h3>Python 3.8</h3>
<p>Remove 3 functions:</p>
<ul class="simple">
<li><tt class="docutils literal">PyByteArray_Init()</tt></li>
<li><tt class="docutils literal">PyByteArray_Fini()</tt></li>
<li><tt class="docutils literal">PyEval_ReInitThreads()</tt></li>
</ul>
<p>Remove 1 structure:</p>
<ul class="simple">
<li><tt class="docutils literal">PyInterpreterState</tt> (moved to the internal C API)</li>
</ul>
</div>
<div class="section" id="python-3-9">
<h3>Python 3.9</h3>
<p>Remove 32 symbols:</p>
<ul class="simple">
<li><tt class="docutils literal">PyAsyncGen_ClearFreeLists()</tt></li>
<li><tt class="docutils literal">PyCFunction_ClearFreeList()</tt></li>
<li><tt class="docutils literal">PyCmpWrapper_Type</tt></li>
<li><tt class="docutils literal">PyContext_ClearFreeList()</tt></li>
<li><tt class="docutils literal">PyDict_ClearFreeList()</tt></li>
<li><tt class="docutils literal">PyFloat_ClearFreeList()</tt></li>
<li><tt class="docutils literal">PyFrame_ClearFreeList()</tt></li>
<li><tt class="docutils literal">PyFrame_ExtendStack()</tt></li>
<li><tt class="docutils literal">PyList_ClearFreeList()</tt></li>
<li><tt class="docutils literal">PyMethod_ClearFreeList()</tt></li>
<li><tt class="docutils literal">PyNoArgsFunction</tt> type</li>
<li><tt class="docutils literal">PyNullImporter_Type</tt></li>
<li><tt class="docutils literal">PySet_ClearFreeList()</tt></li>
<li><tt class="docutils literal">PySortWrapper_Type</tt></li>
<li><tt class="docutils literal">PyTuple_ClearFreeList()</tt></li>
<li><tt class="docutils literal">PyUnicode_ClearFreeList()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_MATCH()</tt></li>
<li><tt class="docutils literal">_PyAIterWrapper_Type</tt></li>
<li><tt class="docutils literal">_PyBytes_InsertThousandsGrouping()</tt></li>
<li><tt class="docutils literal">_PyBytes_InsertThousandsGroupingLocale()</tt></li>
<li><tt class="docutils literal">_PyDebug_PrintTotalRefs()</tt></li>
<li><tt class="docutils literal">_PyFloat_Digits()</tt></li>
<li><tt class="docutils literal">_PyFloat_DigitsInit()</tt></li>
<li><tt class="docutils literal">_PyFloat_Repr()</tt></li>
<li><tt class="docutils literal">_PyThreadState_GetFrame()</tt> (and <tt class="docutils literal">_PyRuntime.getframe</tt>)</li>
<li><tt class="docutils literal">_PyUnicode_ClearStaticStrings()</tt></li>
<li><tt class="docutils literal">_Py_AddToAllObjects()</tt></li>
<li><tt class="docutils literal">_Py_InitializeFromArgs()</tt></li>
<li><tt class="docutils literal">_Py_InitializeFromWideArgs()</tt></li>
<li><tt class="docutils literal">_Py_PrintReferenceAddresses()</tt></li>
<li><tt class="docutils literal">_Py_PrintReferences()</tt></li>
<li><tt class="docutils literal">_Py_tracemalloc_config</tt></li>
</ul>
<p>Remove 1 structure:</p>
<ul class="simple">
<li><tt class="docutils literal">PyGC_Head</tt> (moved to the internal C API)</li>
</ul>
<p>Deprecate 15 functions:</p>
<ul class="simple">
<li><tt class="docutils literal">PyEval_CallFunction()</tt></li>
<li><tt class="docutils literal">PyEval_CallMethod()</tt></li>
<li><tt class="docutils literal">PyEval_CallObject()</tt></li>
<li><tt class="docutils literal">PyEval_CallObjectWithKeywords()</tt></li>
<li><tt class="docutils literal">PyNode_Compile()</tt></li>
<li><tt class="docutils literal">PyParser_SimpleParseFileFlags()</tt></li>
<li><tt class="docutils literal">PyParser_SimpleParseStringFlags()</tt></li>
<li><tt class="docutils literal">PyParser_SimpleParseStringFlagsFilename()</tt></li>
<li><tt class="docutils literal">PyUnicode_AsUnicode()</tt></li>
<li><tt class="docutils literal">PyUnicode_AsUnicodeAndSize()</tt></li>
<li><tt class="docutils literal">PyUnicode_FromUnicode()</tt></li>
<li><tt class="docutils literal">PyUnicode_WSTR_LENGTH()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_COPY()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_FILL()</tt></li>
<li><tt class="docutils literal">_PyUnicode_AsUnicode()</tt></li>
</ul>
</div>
<div class="section" id="python-3-10">
<h3>Python 3.10</h3>
<p>Remove 44 symbols:</p>
<ul class="simple">
<li><tt class="docutils literal">PyAST_Compile()</tt></li>
<li><tt class="docutils literal">PyAST_CompileEx()</tt></li>
<li><tt class="docutils literal">PyAST_CompileObject()</tt></li>
<li><tt class="docutils literal">PyAST_Validate()</tt></li>
<li><tt class="docutils literal">PyArena_AddPyObject()</tt></li>
<li><tt class="docutils literal">PyArena_Free()</tt></li>
<li><tt class="docutils literal">PyArena_Malloc()</tt></li>
<li><tt class="docutils literal">PyArena_New()</tt></li>
<li><tt class="docutils literal">PyFuture_FromAST()</tt></li>
<li><tt class="docutils literal">PyFuture_FromASTObject()</tt></li>
<li><tt class="docutils literal">PyLong_FromUnicode()</tt></li>
<li><tt class="docutils literal">PyNode_Compile()</tt></li>
<li><tt class="docutils literal">PyOS_InitInterrupts()</tt></li>
<li><tt class="docutils literal">PyObject_AsCharBuffer()</tt></li>
<li><tt class="docutils literal">PyObject_AsReadBuffer()</tt></li>
<li><tt class="docutils literal">PyObject_AsWriteBuffer()</tt></li>
<li><tt class="docutils literal">PyObject_CheckReadBuffer()</tt></li>
<li><tt class="docutils literal">PyParser_ASTFromFile()</tt></li>
<li><tt class="docutils literal">PyParser_ASTFromFileObject()</tt></li>
<li><tt class="docutils literal">PyParser_ASTFromFilename()</tt></li>
<li><tt class="docutils literal">PyParser_ASTFromString()</tt></li>
<li><tt class="docutils literal">PyParser_ASTFromStringObject()</tt></li>
<li><tt class="docutils literal">PyParser_SimpleParseFileFlags()</tt></li>
<li><tt class="docutils literal">PyParser_SimpleParseStringFlags()</tt></li>
<li><tt class="docutils literal">PyParser_SimpleParseStringFlagsFilename()</tt></li>
<li><tt class="docutils literal">PyST_GetScope()</tt></li>
<li><tt class="docutils literal">PySymtable_Build()</tt></li>
<li><tt class="docutils literal">PySymtable_BuildObject()</tt></li>
<li><tt class="docutils literal">PySymtable_Free()</tt></li>
<li><tt class="docutils literal">PyUnicode_AsUnicodeCopy()</tt></li>
<li><tt class="docutils literal">PyUnicode_GetMax()</tt></li>
<li><tt class="docutils literal">Py_ALLOW_RECURSION</tt></li>
<li><tt class="docutils literal">Py_END_ALLOW_RECURSION</tt></li>
<li><tt class="docutils literal">Py_SymtableString()</tt></li>
<li><tt class="docutils literal">Py_SymtableStringObject()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_strcat()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_strchr()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_strcmp()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_strcpy()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_strlen()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_strncmp()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_strncpy()</tt></li>
<li><tt class="docutils literal">Py_UNICODE_strrchr()</tt></li>
<li><tt class="docutils literal">_Py_CheckRecursionLimit</tt></li>
</ul>
<p>Remove 1 structure:</p>
<ul class="simple">
<li><tt class="docutils literal">_PyUnicode_Name_CAPI</tt></li>
</ul>
<p>Deprecate 1 function:</p>
<ul class="simple">
<li><tt class="docutils literal">PyUnicode_InternImmortal()</tt></li>
</ul>
<p>Moreover, <tt class="docutils literal">PyUnicode_FromStringAndSize(NULL, size)</tt> and
<tt class="docutils literal">PyUnicode_FromUnicode(NULL, size)</tt> have been deprecated.</p>
</div>
<div class="section" id="statistics">
<h3>Statistics</h3>
<p>Public Python symbols exported with <tt class="docutils literal">PyAPI_FUNC()</tt> and <tt class="docutils literal">PyAPI_DATA()</tt>:</p>
<table border="1" class="docutils">
<colgroup>
<col width="39%" />
<col width="61%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Python</th>
<th class="head">Symbols</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>2.7</td>
<td>891</td>
</tr>
<tr><td>3.6</td>
<td>1041 (+150)</td>
</tr>
<tr><td>3.7</td>
<td>1068 (+27)</td>
</tr>
<tr><td>3.8</td>
<td>1105 (+37)</td>
</tr>
<tr><td>3.9</td>
<td>1115 (+10)</td>
</tr>
<tr><td>3.10</td>
<td>1080 (-35)</td>
</tr>
</tbody>
</table>
<p>Command used to count public symbols:</p>
<pre class="literal-block">
grep -E 'PyAPI_(FUNC|DATA)' Include/*.h Include/cpython/*.h|grep -v ' _Py'|wc -l
</pre>
</div>
</div>
<div class="section" id="reorganize-header-files">
<h2>Reorganize header files</h2>
<p>Since Python 3.8, the C API is organized as 3 parts:</p>
<ol class="arabic simple">
<li><tt class="docutils literal">Include/</tt> directory: Limited API</li>
<li><tt class="docutils literal">Include/cpython/</tt> directory: CPython implementation details</li>
<li><tt class="docutils literal">Include/internal/</tt> directory: The internal API</li>
</ol>
<p>The intent is to help developers think about whether their additions belong in
the limited C API, the CPython C API, or the internal C API.</p>
<div class="section" id="python-3-7-1">
<h3>Python 3.7</h3>
<p>Creation of the <tt class="docutils literal">Include/internal/</tt> directory.</p>
</div>
<div class="section" id="python-3-8-1">
<h3>Python 3.8</h3>
<p>Creation of the <tt class="docutils literal">Include/cpython/</tt> directory.</p>
</div>
<div class="section" id="python-3-10-1">
<h3>Python 3.10</h3>
<p>Move 8 header files from <tt class="docutils literal">Include/</tt> to <tt class="docutils literal">Include/cpython/</tt>:</p>
<ul class="simple">
<li><tt class="docutils literal">odictobject.h</tt></li>
<li><tt class="docutils literal">parser_interface.h</tt></li>
<li><tt class="docutils literal">picklebufobject.h</tt></li>
<li><tt class="docutils literal">pyarena.h</tt></li>
<li><tt class="docutils literal">pyctype.h</tt></li>
<li><tt class="docutils literal">pydebug.h</tt></li>
<li><tt class="docutils literal">pyfpe.h</tt></li>
<li><tt class="docutils literal">pytime.h</tt></li>
</ul>
<p>Python 3.10 added an <a class="reference external" href="https://github.com/python/cpython/blob/master/Include/README.rst">Include/README.rst documentation</a> to explain
this organization and give guidelines for adding new functions. For example,
new functions in the public C API must not steal references nor return borrowed
references. Since then, this documentation has moved to the devguide:
<a class="reference external" href="https://devguide.python.org/c-api/">Changing Python’s C API</a>.</p>
</div>
<div class="section" id="statistics-1">
<h3>Statistics</h3>
<p>Number of lines in C API header files per Python version:</p>
<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="27%" />
<col width="22%" />
<col width="24%" />
<col width="14%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Python</th>
<th class="head">Limited API</th>
<th class="head">CPython API</th>
<th class="head">Internal API</th>
<th class="head">Total</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>2.7</td>
<td>12,686 (100%)</td>
<td>0</td>
<td>0</td>
<td>12,686</td>
</tr>
<tr><td>3.6</td>
<td>16,011 (100%)</td>
<td>0</td>
<td>0</td>
<td>16,011</td>
</tr>
<tr><td>3.7</td>
<td>16,517 (96%)</td>
<td>0</td>
<td>705 (4%)</td>
<td>17,222</td>
</tr>
<tr><td>3.8</td>
<td>13,160 (70%)</td>
<td>3,417 (18%)</td>
<td>2,230 (12%)</td>
<td>18,807</td>
</tr>
<tr><td>3.9</td>
<td>12,264 (62%)</td>
<td>4,343 (22%)</td>
<td>3,066 (16%)</td>
<td>19,673</td>
</tr>
<tr><td>3.10</td>
<td>10,305 (52%)</td>
<td>4,513 (23%)</td>
<td>5,092 (26%)</td>
<td>19,910</td>
</tr>
</tbody>
</table>
<p>Commands:</p>
<ul class="simple">
<li>Limited: <tt class="docutils literal">wc <span class="pre">-l</span> <span class="pre">Include/*.h</span></tt></li>
<li>CPython: <tt class="docutils literal">wc <span class="pre">-l</span> <span class="pre">Include/cpython/*.h</span></tt></li>
<li>Internal: <tt class="docutils literal">wc <span class="pre">-l</span> <span class="pre">Include/internal/*.h</span></tt></li>
</ul>
</div>
</div>
<div class="section" id="changes-in-the-limited-c-api">
<h2>Changes in the Limited C API</h2>
<p>Between Python 3.8 and 3.10, 4 new functions have been added to the limited
C API and 14 symbols (functions or variables) have been removed from it.</p>
<p>The trashcan API was excluded from the limited C API since it never worked
there: its implementation directly accessed PyThreadState members, whereas this
structure is opaque in the limited C API.</p>
<p>On the other hand, the Py_EnterRecursiveCall() and Py_LeaveRecursiveCall()
functions have been added to the limited C API. In Python 3.8, they were
defined as macros directly accessing PyThreadState members. In Python 3.9, they
became opaque function calls and so are now compatible with the stable ABI.</p>
<div class="section" id="python-3-9-1">
<h3>Python 3.9</h3>
<p>Add 3 functions to the limited C API:</p>
<ul class="simple">
<li><tt class="docutils literal">Py_EnterRecursiveCall()</tt></li>
<li><tt class="docutils literal">Py_LeaveRecursiveCall()</tt></li>
<li><tt class="docutils literal">PyFrame_GetLineNumber()</tt></li>
</ul>
<p>Remove 14 symbols from the limited C API:</p>
<ul class="simple">
<li><tt class="docutils literal">PyFPE_START_PROTECT()</tt></li>
<li><tt class="docutils literal">PyFPE_END_PROTECT()</tt></li>
<li><tt class="docutils literal">PyThreadState_DeleteCurrent()</tt></li>
<li><tt class="docutils literal">PyTrash_UNWIND_LEVEL</tt></li>
<li><tt class="docutils literal">Py_TRASHCAN_BEGIN</tt></li>
<li><tt class="docutils literal">Py_TRASHCAN_BEGIN_CONDITION</tt></li>
<li><tt class="docutils literal">Py_TRASHCAN_END</tt></li>
<li><tt class="docutils literal">Py_TRASHCAN_SAFE_BEGIN</tt></li>
<li><tt class="docutils literal">Py_TRASHCAN_SAFE_END</tt></li>
<li><tt class="docutils literal">_PyTraceMalloc_NewReference()</tt></li>
<li><tt class="docutils literal">_Py_CheckRecursionLimit</tt></li>
<li><tt class="docutils literal">_Py_GetRefTotal()</tt></li>
<li><tt class="docutils literal">_Py_NewReference()</tt></li>
<li><tt class="docutils literal">_Py_ForgetReference()</tt></li>
</ul>
</div>
<div class="section" id="python-3-10-2">
<h3>Python 3.10</h3>
<p>Add 1 function to the limited C API:</p>
<ul class="simple">
<li><tt class="docutils literal">PyUnicode_AsUTF8AndSize()</tt></li>
</ul>
</div>
</div>
<div class="section" id="pep-652-maintaining-the-stable-abi">
<h2>PEP 652: Maintaining the Stable ABI</h2>
<p>Petr Viktorin wrote and implemented the <a class="reference external" href="https://www.python.org/dev/peps/pep-0652/">PEP 652: Maintaining the Stable ABI</a> in Python 3.10.</p>
<p>The Stable ABI (Application Binary Interface) for extension modules or
embedding Python is now explicitly defined. The <a class="reference external" href="https://docs.python.org/dev/c-api/stable.html#stable">C API Stability</a> documentation
describes C API and ABI stability guarantees along with best practices for
using the Stable ABI.</p>
</div>
Creation of the pythoncapi_compat project2021-03-30T20:00:00+02:002021-03-30T20:00:00+02:00Victor Stinnertag:vstinner.github.io,2021-03-30:/pythoncapi_compat.html<a class="reference external image-reference" href="https://twitter.com/Kekeflipnote/status/1378034391872638980"><img alt="Strange Cat by Kéké" src="https://vstinner.github.io/images/strange_cat.jpg" /></a>
<p>In 2020, I created a new <a class="reference external" href="https://github.com/pythoncapi/pythoncapi_compat">pythoncapi_compat project</a> to add Python 3.10 support
to C extensions without losing support for old Python versions. It supports
Python 2.7-3.10 and PyPy 2.7-3.7. The project is made of two parts:</p>
<ul class="simple">
<li><tt class="docutils literal">pythoncapi_compat.h</tt>: Header file providing new C API functions to old
Python versions, like <tt class="docutils literal">Py_SET_TYPE()</tt>.</li>
<li><tt class="docutils literal">upgrade_pythoncapi.py</tt>: Script upgrading C extension modules using
<tt class="docutils literal">pythoncapi_compat.h</tt>. For example, it replaces <tt class="docutils literal">Py_TYPE(obj) = type;</tt>
with <tt class="docutils literal">Py_SET_TYPE(obj, type);</tt>.</li>
</ul>
<p>This article is about the creation of the header file and the upgrade script.</p>
<p>Photo: Strange cats 🐾 by Kéké.</p>
<div class="section" id="py-set-type-macro-for-python-3-8-and-older">
<h2>Py_SET_TYPE() macro for Python 3.8 and older</h2>
<div class="section" id="py-type-macro-converted-to-a-static-inline-function">
<h3>Py_TYPE() macro converted to a static inline function</h3>
<p>In May 2020, in <a class="reference external" href="https://bugs.python.org/issue39573">bpo-39573 "Make PyObject an opaque structure"</a>, the <a class="reference external" href="https://github.com/python/cpython/commit/ad3252bad905d41635bcbb4b76db30d570cf0087">Py_TYPE()</a>
macro (change by Dong-hee Na) and the <a class="reference external" href="https://github.com/python/cpython/commit/fe2978b3b940fe2478335e3a2ca5ad22338cdf9c">Py_REFCNT() and Py_SIZE()</a>
macros (changes by me) were converted to static inline functions. This change
broke 17 C extension modules (see my previous article <a class="reference external" href="https://vstinner.github.io/c-api-opaque-structures.html">Make structures opaque
in the Python C API</a>).</p>
<p>I prepared this change in Python 3.9 by adding the Py_SET_REFCNT(), Py_SET_TYPE()
and Py_SET_SIZE() functions, and by modifying Python to use them. I
also <a class="reference external" href="https://github.com/python/cpython/commit/d905df766c367c350f20c46ccd99d4da19ed57d8">added the Py_IS_TYPE() function</a>
which tests the type of an object:</p>
<pre class="literal-block">
static inline int _Py_IS_TYPE(PyObject *ob, PyTypeObject *type) {
    return ob->ob_type == type;
}
#define Py_IS_TYPE(ob, type) _Py_IS_TYPE(_PyObject_CAST(ob), type)
</pre>
<p>For example, <tt class="docutils literal">Py_TYPE(ob) == (tp)</tt> can be replaced with <tt class="docutils literal">Py_IS_TYPE(ob, tp)</tt>.</p>
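<p>Outside of the CPython headers, the shape of such a helper can be sketched with self-contained toy structures (the names below are made up for the example, not the real PyTypeObject and PyObject):</p>

```c
#include <stddef.h>

/* Toy stand-ins for PyTypeObject and PyObject (assumption: simplified). */
struct toytype { const char *tp_name; };
struct toyobj  { struct toytype *ob_type; };

/* Like Py_IS_TYPE(): test the type by comparing the type pointer. */
static inline int toy_is_type(struct toyobj *ob, struct toytype *type)
{
    return ob->ob_type == type;
}
```

The comparison is a single pointer equality test, which is why the CPython version is a cheap static inline function rather than an exported function call.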
</div>
<div class="section" id="cython-and-numpy-fixes">
<h3>Cython and numpy fixes</h3>
<p>I fixed Cython by <a class="reference external" href="https://github.com/cython/cython/commit/d8e93b332fe7d15459433ea74cd29178c03186bd">adding __Pyx_SET_REFCNT() and __Pyx_SET_SIZE() macros</a>:</p>
<pre class="literal-block">
#if PY_VERSION_HEX >= 0x030900A4
#define __Pyx_SET_REFCNT(obj, refcnt) Py_SET_REFCNT(obj, refcnt)
#define __Pyx_SET_SIZE(obj, size) Py_SET_SIZE(obj, size)
#else
#define __Pyx_SET_REFCNT(obj, refcnt) Py_REFCNT(obj) = (refcnt)
#define __Pyx_SET_SIZE(obj, size) Py_SIZE(obj) = (size)
#endif
</pre>
<p>The <a class="reference external" href="https://github.com/numpy/numpy/commit/a96b18e3d4d11be31a321999cda4b795ea9eccaa">numpy fix</a>:</p>
<pre class="literal-block">
#if PY_VERSION_HEX < 0x030900a4
#define Py_SET_TYPE(obj, typ) (Py_TYPE(obj) = typ)
#define Py_SET_SIZE(obj, size) (Py_SIZE(obj) = size)
#endif
</pre>
<p><a class="reference external" href="https://github.com/numpy/numpy/commit/f1671076c80bd972421751f2d48186ee9ac808aa">The numpy fix was updated</a>
to not have a return value by adding <tt class="docutils literal">", (void)0"</tt>:</p>
<pre class="literal-block">
#if PY_VERSION_HEX < 0x030900a4
#define Py_SET_TYPE(obj, type) ((Py_TYPE(obj) = (type)), (void)0)
#define Py_SET_SIZE(obj, size) ((Py_SIZE(obj) = (size)), (void)0)
#endif
</pre>
<p>So the macros better mimic the behavior of the static inline functions.</p>
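<p>To see why the trailing <tt class="docutils literal">", (void)0"</tt> matters, here is a minimal self-contained sketch (toy names, not the CPython or numpy macros): the comma operator gives the whole macro expression the type <tt class="docutils literal">void</tt>, so it can no longer be misused as a value, just like a <tt class="docutils literal">static inline void</tt> function.</p>

```c
/* Toy object: stands in for PyVarObject (assumption, not CPython code). */
struct toy { int size; };

/* Expression macro: evaluates to the assigned value (first numpy fix). */
#define TOY_SET_SIZE_EXPR(obj, n) ((obj)->size = (n))

/* Void macro: ", (void)0" discards the value (updated numpy fix). */
#define TOY_SET_SIZE_VOID(obj, n) (((obj)->size = (n)), (void)0)

static int demo(void)
{
    struct toy t = {0};
    /* The expression form can leak into a larger expression... */
    int leaked = TOY_SET_SIZE_EXPR(&t, 5) + 1;  /* macro "returns" 5 */
    /* ...whereas "TOY_SET_SIZE_VOID(&t, 7) + 1" would not compile. */
    TOY_SET_SIZE_VOID(&t, 7);
    return leaked + t.size;  /* 6 + 7 */
}
```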
</div>
<div class="section" id="c-api-porting-guide">
<h3>C API Porting Guide</h3>
<p>I copied the numpy macros <a class="reference external" href="https://github.com/python/cpython/commit/dc24b8a2ac32114313bae519db3ccc21fe45c982">to the C API section of the Python 3.10 porting
guide (What's New in Python 3.10)</a>. Quote of the Py_SET_TYPE() entry:</p>
<blockquote>
<p>Since <tt class="docutils literal">Py_TYPE()</tt> is changed to the inline static function,
<tt class="docutils literal">Py_TYPE(obj) = new_type</tt> must be replaced with
<tt class="docutils literal">Py_SET_TYPE(obj, new_type)</tt>: see <tt class="docutils literal">Py_SET_TYPE()</tt> (available since
Python 3.9). For backward compatibility, this macro can be used:</p>
<pre class="literal-block">
#if PY_VERSION_HEX < 0x030900A4
# define Py_SET_TYPE(obj, type) ((Py_TYPE(obj) = (type)), (void)0)
#endif
</pre>
</blockquote>
</div>
<div class="section" id="copy-paste-macros">
<h3>Copy/paste macros</h3>
<p>Up to 3 macros must be copied/pasted for backward compatibility in each
project:</p>
<pre class="literal-block">
#if PY_VERSION_HEX < 0x030900A4
# define Py_SET_TYPE(obj, type) ((Py_TYPE(obj) = (type)), (void)0)
#endif
#if PY_VERSION_HEX < 0x030900A4
# define Py_SET_REFCNT(obj, refcnt) ((Py_REFCNT(obj) = (refcnt)), (void)0)
#endif
#if PY_VERSION_HEX < 0x030900A4
# define Py_SET_SIZE(obj, size) ((Py_SIZE(obj) = (size)), (void)0)
#endif
</pre>
<p>These macros started to be copied into multiple projects. Examples:</p>
<ul class="simple">
<li><a class="reference external" href="https://bazaar.launchpad.net/~brz/brz/3.1/revision/7647">breezy</a></li>
<li><a class="reference external" href="https://github.com/numpy/numpy/commit/f1671076c80bd972421751f2d48186ee9ac808aa">numpy</a></li>
<li><a class="reference external" href="https://github.com/pycurl/pycurl/commit/e633f9a1ac4df5e249e78c218d5fbbd848219042">pycurl</a></li>
</ul>
<p>There might be a better way than copying/pasting these compatibility layer in
each project, adding macros one by one...</p>
</div>
</div>
<div class="section" id="creation-of-the-pythoncapi-compat-h-header-file">
<h2>Creation of the pythoncapi_compat.h header file</h2>
<p>While the code for the Py_SET_REFCNT(), Py_SET_TYPE() and Py_SET_SIZE() macros is
short, I also wanted to use the seven new Python 3.9 getter functions on Python
3.8 and older:</p>
<ul class="simple">
<li>Py_IS_TYPE()</li>
<li>PyFrame_GetBack()</li>
<li>PyFrame_GetCode()</li>
<li>PyInterpreterState_Get()</li>
<li>PyThreadState_GetFrame()</li>
<li>PyThreadState_GetID()</li>
<li>PyThreadState_GetInterpreter()</li>
</ul>
<p>In June 2020, I created <a class="reference external" href="https://github.com/pythoncapi/pythoncapi_compat">the pythoncapi_compat project</a> with a
<a class="reference external" href="https://github.com/pythoncapi/pythoncapi_compat/blob/main/pythoncapi_compat.h">pythoncapi_compat.h header file</a>
which defines these functions as static inline functions. An
<tt class="docutils literal">"#if PY_VERSION_HEX"</tt> guard avoids defining a function if it's already
provided by <tt class="docutils literal">Python.h</tt>. Example of the current implementation of
PyThreadState_GetInterpreter() for Python 3.8 and older:</p>
<pre class="literal-block">
// bpo-39947 added PyThreadState_GetInterpreter() to Python 3.9.0a5
#if PY_VERSION_HEX < 0x030900A5
static inline PyInterpreterState *
PyThreadState_GetInterpreter(PyThreadState *tstate)
{
    assert(tstate != NULL);
    return tstate->interp;
}
#endif
</pre>
<p>I wrote tests for each function using a C extension. The project initially
supported Python 3.6 to Python 3.10. The test runner also checks for reference
leaks.</p>
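<p>As an illustration of what such a reference leak check can look like, here is a self-contained sketch (an assumption about the approach, not the actual pythoncapi_compat test runner): compare a global reference counter before and after running a test case, similar to CPython's total reference count check in debug builds.</p>

```c
/* Toy reference counting (not the CPython implementation). */
static int total_refs = 0;
struct obj { int refcnt; };

static void obj_incref(struct obj *o) { o->refcnt++; total_refs++; }
static void obj_decref(struct obj *o) { o->refcnt--; total_refs--; }

/* Run a leak-free test case and report the reference count delta:
   a non-zero result would mean that the test leaked references. */
static int check_refleak(void)
{
    struct obj o = {0};
    int before = total_refs;
    obj_incref(&o);   /* every incref... */
    obj_decref(&o);   /* ...is paired with a decref */
    return total_refs - before;
}
```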
</div>
<div class="section" id="mercurial-and-python-2-7">
<h2>Mercurial and Python 2.7</h2>
<p>The Mercurial project has multiple C extensions, was broken on Python 3.10 by
the Py_TYPE() change, and is one of the last projects still requiring Python 2.7
in 2021. It is a good candidate to check whether pythoncapi_compat.h is useful.</p>
<p><a class="reference external" href="https://bz.mercurial-scm.org/show_bug.cgi?id=6451">I proposed a patch</a> then
<a class="reference external" href="https://foss.heptapod.net/octobus/mercurial-devel/-/merge_requests/61">converted to a merge request</a>. It
got accepted in the "next" branch, but compatibility with Visual Studio 2008
had to be fixed for Python 2.7 on Windows. I fixed pythoncapi_compat.h by
defining <tt class="docutils literal">inline</tt> as <tt class="docutils literal">__inline</tt>:</p>
<pre class="literal-block">
// Compatibility with Visual Studio 2013 and older which don't support
// the inline keyword in C (only in C++): use __inline instead.
#if (defined(_MSC_VER) && _MSC_VER < 1900 \
     && !defined(__cplusplus) && !defined(inline))
# define inline __inline
# define PYTHONCAPI_COMPAT_MSC_INLINE
// These two macros are undefined at the end of this file
#endif
(...)
#ifdef PYTHONCAPI_COMPAT_MSC_INLINE
# undef inline
# undef PYTHONCAPI_COMPAT_MSC_INLINE
#endif
</pre>
<p>I chose to continue writing <tt class="docutils literal">static inline</tt>, so pythoncapi_compat.h remains
close to the Python header files. I also modified the pythoncapi_compat test
suite to test Python 2.7.</p>
</div>
<div class="section" id="pybind11-and-pypy">
<h2>pybind11 and PyPy</h2>
<p>More recently, I added PyPy 2.7, 3.6 and 3.7 support to pythoncapi_compat for
pybind11, since PyPy is tested by the pybind11 CI. The fix is to no longer
define the following functions on PyPy:</p>
<ul class="simple">
<li>PyFrame_GetBack(), _PyFrame_GetBackBorrow()</li>
<li>PyThreadState_GetFrame(), _PyThreadState_GetFrameBorrow()</li>
<li>PyThreadState_GetID()</li>
<li>PyObject_GC_IsTracked()</li>
<li>PyObject_GC_IsFinalized()</li>
</ul>
</div>
<div class="section" id="creation-of-the-upgrade-pythoncapi-py-script">
<h2>Creation of the upgrade_pythoncapi.py script</h2>
<div class="section" id="upgrade-pythoncapi-py">
<h3>upgrade_pythoncapi.py</h3>
<p>In November 2020, I created a new <tt class="docutils literal">upgrade_pythoncapi.py</tt> script to replace
<tt class="docutils literal">"Py_TYPE(obj) = type;"</tt> with <tt class="docutils literal">"Py_SET_TYPE(obj, <span class="pre">type);"</span></tt>. The script is
based on my <a class="reference external" href="https://github.com/vstinner/sixer">old sixer.py project</a> which
adds Python 3 support to a Python project without losing Python 2 support. The
<tt class="docutils literal">upgrade_pythoncapi.py</tt> script uses regular expressions to replace one
pattern with another.</p>
<p>Similar to <tt class="docutils literal">sixer</tt> which adds <tt class="docutils literal">import six</tt> to support Python 2 and Python 3
in a single code base, <tt class="docutils literal">upgrade_pythoncapi.py</tt> adds
<tt class="docutils literal">#include "pythoncapi_compat.h"</tt> to support old and new versions of the
Python C API in a single code base.</p>
<p>I first created a new GitHub project for upgrade_pythoncapi.py, but since it
was too tightly coupled to the pythoncapi_compat.h header file, I moved the
script to the pythoncapi_compat project.</p>
</div>
<div class="section" id="tests">
<h3>Tests</h3>
<p>I added more and more "operations" to update C extensions. For me, <strong>the most
important part is the test suite</strong>, which ensures that the script doesn't
introduce bugs. The test suite contains code which must not be replaced. For
example, it ensures that
<tt class="docutils literal"><span class="pre">frame->f_code</span> = code</tt> is not replaced with <tt class="docutils literal">_PyFrame_GetCodeBorrow(frame) =
code</tt> by mistake.</p>
</div>
<div class="section" id="borrowed-references">
<h3>Borrowed references</h3>
<p>Code accessing <tt class="docutils literal"><span class="pre">frame->f_code</span></tt> directly must use <tt class="docutils literal">PyFrame_GetCode()</tt> but
this function returns a strong reference, whereas
<tt class="docutils literal"><span class="pre">frame->f_code</span></tt> gives a borrowed reference. I added "Borrow" variants of the
functions to <tt class="docutils literal">pythoncapi_compat.h</tt> for <tt class="docutils literal">upgrade_pythoncapi.py</tt>. For
example, <tt class="docutils literal"><span class="pre">frame->f_code</span></tt> is replaced with <tt class="docutils literal">_PyFrame_GetCodeBorrow()</tt> which
is defined as:</p>
<pre class="literal-block">
static inline PyCodeObject*
_PyFrame_GetCodeBorrow(PyFrameObject *frame)
{
    return (PyCodeObject *)_Py_StealRef(PyFrame_GetCode(frame));
}
</pre>
<p>The <tt class="docutils literal">_Py_StealRef(obj)</tt> function converts a strong reference to a borrowed
reference (simplified code):</p>
<pre class="literal-block">
static inline PyObject* _Py_StealRef(PyObject *obj)
{
    Py_DECREF(obj);
    return obj;
}
</pre>
<p>It is the opposite of <tt class="docutils literal">Py_NewRef()</tt>. It is similar to <tt class="docutils literal">Py_DECREF(obj)</tt> but
it can be used as an expression: it returns <em>obj</em>. pythoncapi_compat.h defines
private <tt class="docutils literal">_Py_StealRef()</tt> and <tt class="docutils literal">_Py_XStealRef()</tt> static inline functions.
First I proposed to add them to Python, but I abandoned the idea (see
<a class="reference external" href="https://bugs.python.org/issue42522">bpo-42522</a>).</p>
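<p>The semantics can be sketched with a toy reference counter in plain C (the names are made up for the example; this is not the Python C API): a "new ref" helper creates a strong reference, while a "steal ref" helper drops it but still returns the pointer, turning it into a borrowed reference.</p>

```c
/* Toy refcounted object (assumption: simplified, not PyObject). */
struct obj { int refcnt; };

/* Like Py_NewRef(): create a new strong reference. */
static struct obj *toy_newref(struct obj *o)
{
    o->refcnt++;
    return o;
}

/* Like _Py_StealRef(): drop the strong reference but return the
   pointer, which becomes a borrowed reference. */
static struct obj *toy_stealref(struct obj *o)
{
    o->refcnt--;
    return o;
}
```

Chaining them, <tt class="docutils literal">toy_stealref(toy_newref(o))</tt> leaves the reference count unchanged, which is exactly how the "Borrow" variants wrap the strong-reference getters.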
<p>Thanks to the "Borrow" suffix in function names, it becomes easier to discover
the usage of borrowed references. Using a borrowed reference is unsafe if the
object can be destroyed before the last usage of the borrowed reference. In
case of doubt, it's better to use a strong reference. For example,
<tt class="docutils literal">_PyFrame_GetCodeBorrow()</tt> can be replaced with
<tt class="docutils literal">PyFrame_GetCode()</tt>, but it requires explicitly deleting the created strong
reference with <tt class="docutils literal">Py_DECREF()</tt>.</p>
</div>
</div>
<div class="section" id="practical-solution-for-incompatible-c-api-changes">
<h2>Practical solution for incompatible C API changes</h2>
<p>So far, I have convinced 4 projects to use pythoncapi_compat.h:
bitarray, immutables, Mercurial and python-zstandard.</p>
<p>In my opinion, pythoncapi_compat.h is the right approach to introduce
incompatible C API changes: provide a practical solution to support old and new
Python versions in a single code base.</p>
<p>The next step is to get it adopted more widely and endorsed by the
Python project, maybe by moving it under the PSF organization on GitHub.</p>
</div>
Make structures opaque in the Python C API2021-03-26T12:00:00+01:002021-03-26T12:00:00+01:00Victor Stinnertag:vstinner.github.io,2021-03-26:/c-api-opaque-structures.html<a class="reference external image-reference" href="https://fr.wikipedia.org/wiki/Incendie_du_centre_de_donn%C3%A9es_d%27OVHcloud_%C3%A0_Strasbourg"><img alt="OVHcloud datacenter fire in Strasbourg" src="https://vstinner.github.io/images/incendie-ovh.jpg" /></a>
<p>This article is about changes that I made, with the help of other developers, in
the Python C API in Python 3.8, 3.9 and 3.10 to avoid accessing structure
members: prepare the C API to <a class="reference external" href="https://en.wikipedia.org/wiki/Opaque_data_type">make structures opaque</a>. These changes are related
to my <a class="reference external" href="https://www.python.org/dev/peps/pep-0620/">PEP 620 "Hide implementation …</a></p><a class="reference external image-reference" href="https://fr.wikipedia.org/wiki/Incendie_du_centre_de_donn%C3%A9es_d%27OVHcloud_%C3%A0_Strasbourg"><img alt="OVHcloud datacenter fire in Strasbourg" src="https://vstinner.github.io/images/incendie-ovh.jpg" /></a>
<p>This article is about changes that I made, with the help of other developers, in
the Python C API in Python 3.8, 3.9 and 3.10 to avoid accessing structure
members: prepare the C API to <a class="reference external" href="https://en.wikipedia.org/wiki/Opaque_data_type">make structures opaque</a>. These changes are related
to my <a class="reference external" href="https://www.python.org/dev/peps/pep-0620/">PEP 620 "Hide implementation details from the C API"</a>.</p>
<p>One change had a <strong>negative impact on performance</strong> and had to be
reverted. Making Python slower just to make structures opaque would first
require getting PEP 620 accepted.</p>
<p>While the compatible changes merged in Python 3.8 and Python 3.9 went fine, one
Python 3.10 <strong>incompatible change caused more trouble</strong> and had to be
reverted.</p>
<p>Photo: OVHcloud data center fire in Strasbourg.</p>
<div class="section" id="rationale">
<h2>Rationale</h2>
<p>The C API currently exposes most object structures. C extensions indirectly
access structure members through the API, but can also access them directly.
This causes different issues:</p>
<ul class="simple">
<li>Modifying a structure can break an unknown number of C extensions. To prevent
any risk, CPython core developers avoid modifying structures. Once most
structures are opaque, it will be possible to experiment with <strong>optimizations</strong>
which require deep structure changes without breaking C extensions. The
irony is that we first have to break backward compatibility and C
extensions for that.</li>
<li>Any structure change breaks the ABI. The <strong>stable ABI</strong> solved this issue by
not exposing structures in its limited C API. The idea is to bend the
default C API towards the limited C API to provide a stable ABI for everyone
in the long term.</li>
</ul>
</div>
<div class="section" id="issues">
<h2>Issues</h2>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue39573">PyObject: bpo-39573</a></li>
<li><a class="reference external" href="https://bugs.python.org/issue40170">PyTypeObject: bpo-40170</a></li>
<li><a class="reference external" href="https://bugs.python.org/issue39947">PyThreadState: bpo-39947</a></li>
<li><a class="reference external" href="https://bugs.python.org/issue40421">PyFrameObject: bpo-40421</a></li>
</ul>
</div>
<div class="section" id="opaque-structures">
<h2>Opaque structures</h2>
<ul class="simple">
<li>Python 3.8 made the PyInterpreterState structure opaque.</li>
<li>Python 3.9 made the PyGC_Head structure opaque.</li>
</ul>
</div>
<div class="section" id="add-getter-functions-to-python-3-9">
<h2>Add getter functions to Python 3.9</h2>
<ul class="simple">
<li>PyObject, PyVarObject:<ul>
<li>Py_SET_REFCNT()</li>
<li>Py_SET_TYPE()</li>
<li>Py_SET_SIZE()</li>
<li>Py_IS_TYPE()</li>
</ul>
</li>
<li>PyFrameObject:<ul>
<li>PyFrame_GetCode()</li>
<li>PyFrame_GetBack()</li>
</ul>
</li>
<li>PyThreadState:<ul>
<li>PyThreadState_GetInterpreter()</li>
<li>PyThreadState_GetFrame()</li>
<li>PyThreadState_GetID()</li>
</ul>
</li>
<li>PyInterpreterState:<ul>
<li>PyInterpreterState_Get()</li>
</ul>
</li>
</ul>
<p>PyInterpreterState_Get() can be used to replace <tt class="docutils literal"><span class="pre">PyThreadState_Get()->interp</span></tt>
and <tt class="docutils literal"><span class="pre">PyThreadState_GetInterpreter(PyThreadState_Get())</span></tt>.</p>
</div>
<div class="section" id="convert-macros-to-static-inline-functions-in-python-3-8">
<h2>Convert macros to static inline functions in Python 3.8</h2>
<div class="section" id="macro-pitfalls">
<h3>Macro pitfalls</h3>
<p>Macros are convenient but have <a class="reference external" href="https://gcc.gnu.org/onlinedocs/cpp/Macro-Pitfalls.html">multiple pitfalls</a>. Some macros
can be abused in surprising ways. For example, the following code is valid with
Python 3.9:</p>
<pre class="literal-block">
if (obj == NULL || PyList_SET_ITEM(l, i, obj) < 0) { ... }
</pre>
<p>In Python 3.9, PyList_SET_ITEM() returns <em>obj</em> in this case. Since <em>obj</em> is a
pointer, the test checks if a pointer is negative, which makes no sense
(but is accepted by C compilers by default). This code is likely a confusion
with PyList_SetItem() which returns an int, negative in case of an error.</p>
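<p>The pitfall can be reproduced with a self-contained toy macro (simplified for the example, not the real PyList_SET_ITEM()): because the assignment macro evaluates to a pointer, comparing it to zero compiles, but the "error check" can never be true.</p>

```c
#include <stddef.h>

/* Toy list (assumption: not the CPython structure). */
struct toylist { void *items[8]; };

/* Like the Python 3.9 PyList_SET_ITEM(): the assignment macro
   evaluates to the stored pointer. */
#define TOYLIST_SET_ITEM(l, i, v) ((l)->items[(i)] = (v))

static int buggy_check(struct toylist *l, void *v)
{
    /* Compiles because the macro is an expression, but a valid object
       pointer never compares below the null pointer: the error path
       is dead code, silently hiding the bug. */
    if (TOYLIST_SET_ITEM(l, 0, v) < (void *)0) {
        return -1;
    }
    return 0;
}
```

With a void-returning macro (or a static inline void function), the same <tt class="docutils literal">&lt; 0</tt> comparison becomes a compiler error instead of a silent no-op.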
<p>Zackery Spytz and I modified <a class="reference external" href="https://github.com/python/cpython/commit/556d97f473fa538cef780f84bd29239ecf57d9c5">PyList_SET_ITEM()</a>
and <a class="reference external" href="https://github.com/python/cpython/commit/0ef96c2b2a291c9d2d9c0ba42bbc1900a21e65f3">PyCell_SET()</a>
macros in Python 3.10 to return void.</p>
<p>This change broke alsa-python: I proposed a <a class="reference external" href="https://github.com/alsa-project/alsa-python/commit/5ea2f8709b4d091700750661231f8a3ddce0fc7c">fix which was merged</a>.</p>
<p>One nice side effect of converting macros to static inline functions is that
debuggers and profilers are able to retrieve the name of the function.</p>
</div>
<div class="section" id="converted-macros">
<h3>Converted macros</h3>
<ul class="simple">
<li>Py_INCREF(), Py_XINCREF()</li>
<li>Py_DECREF(), Py_XDECREF()</li>
<li>PyObject_INIT(), PyObject_INIT_VAR()</li>
<li>_PyObject_GC_TRACK(), _PyObject_GC_UNTRACK(), _Py_Dealloc()</li>
</ul>
</div>
<div class="section" id="performance">
<h3>Performance</h3>
<p>Since <tt class="docutils literal">Py_INCREF()</tt> is critical for general Python performance, the impact
of the change was analyzed in depth before <a class="reference external" href="https://github.com/python/cpython/commit/2aaf0c12041bcaadd7f2cc5a54450eefd7a6ff12">being merged</a>
in <a class="reference external" href="https://bugs.python.org/issue35059">bpo-35059</a>. The usage of
<tt class="docutils literal"><span class="pre">__attribute__((always_inline))</span></tt> and <tt class="docutils literal">__forceinline</tt> to force inlining was
rejected.</p>
</div>
<div class="section" id="cast-to-pyobject">
<h3>Cast to PyObject*</h3>
<p>Old Py_INCREF() implementation in Python 3.7:</p>
<pre class="literal-block">
#define Py_INCREF(op) ( \
_Py_INC_REFTOTAL _Py_REF_DEBUG_COMMA \
((PyObject *)(op))->ob_refcnt++)
</pre>
<p>where <tt class="docutils literal">_Py_INC_REFTOTAL _Py_REF_DEBUG_COMMA</tt> becomes <tt class="docutils literal"><span class="pre">_Py_RefTotal++,</span></tt> if
the <tt class="docutils literal">Py_REF_DEBUG</tt> macro is defined, or nothing otherwise. Current
Py_INCREF() implementation in Python 3.10:</p>
<pre class="literal-block">
static inline void _Py_INCREF(PyObject *op)
{
#ifdef Py_REF_DEBUG
    _Py_RefTotal++;
#endif
    op->ob_refcnt++;
}
#define Py_INCREF(op) _Py_INCREF(_PyObject_CAST(op))
</pre>
<p>Most static inline functions go through a macro which casts their argument to
<tt class="docutils literal">PyObject*</tt>:</p>
<pre class="literal-block">
#define _PyObject_CAST(op) ((PyObject*)(op))
</pre>
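<p>The same pattern can be sketched outside of CPython with toy types (the names below are made up for the example): the macro casts any pointer whose first member is the base structure, so callers don't need an explicit cast, while the static inline function keeps type checking on its own parameter.</p>

```c
/* Toy base/subtype layout mimicking PyObject (assumption, not CPython). */
struct base { int refcnt; };
struct sub  { struct base ob_base; int extra; };

/* Like _PyObject_CAST(): cast the argument to the base type. */
#define AS_BASE(op) ((struct base *)(op))

/* Like _Py_INCREF(): the typed static inline function. */
static void base_incref(struct base *op)
{
    op->refcnt++;
}

/* Like "#define Py_INCREF(op) _Py_INCREF(_PyObject_CAST(op))". */
#define INCREF(op) base_incref(AS_BASE(op))
```

Casting a <tt class="docutils literal">struct sub *</tt> to <tt class="docutils literal">struct base *</tt> is well defined in C because the base structure is the first member, which is the same layout guarantee that CPython objects rely on.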
</div>
</div>
<div class="section" id="convert-macros-to-regular-functions-in-python-3-9">
<h2>Convert macros to regular functions in Python 3.9</h2>
<div class="section" id="converted-macros-1">
<h3>Converted macros</h3>
<ul class="simple">
<li>PyIndex_Check()</li>
<li>PyObject_CheckBuffer()</li>
<li>PyObject_GET_WEAKREFS_LISTPTR()</li>
<li>PyObject_IS_GC()</li>
<li>PyObject_NEW(): alias to PyObject_New()</li>
<li>PyObject_NEW_VAR(): alias to PyObjectVar_New()</li>
</ul>
</div>
<div class="section" id="performance-1">
<h3>Performance</h3>
<p>PyType_HasFeature() was modified to always call the PyType_GetFlags() function,
rather than directly accessing <tt class="docutils literal">PyTypeObject.tp_flags</tt>. The problem is that
on macOS, Python is built without LTO, so the PyType_GetFlags() call is not
inlined, making functions like tuplegetter_descr_get() <strong>slower</strong>: see
<a class="reference external" href="https://bugs.python.org/issue39542#msg372962">bpo-39542</a>. I <strong>reverted the
PyType_HasFeature() change</strong> until the PEP 620 is accepted. macOS does not
use LTO to keep support for macOS 10.6 (Snow Leopard): see <a class="reference external" href="https://bugs.python.org/issue41181">bpo-41181</a>.</p>
</div>
<div class="section" id="fast-static-inline-functions">
<h3>Fast static inline functions</h3>
<p>To keep the best performance on Python built without LTO, fast private variants
were added as static inline functions to the internal C API:</p>
<ul class="simple">
<li>_PyIndex_Check()</li>
<li>_PyObject_IS_GC()</li>
<li>_PyType_HasFeature()</li>
<li>_PyType_IS_GC()</li>
</ul>
<p>For example, PyObject_IS_GC() is defined as a function, whereas
_PyObject_IS_GC() is defined as an internal static inline function. Header
file:</p>
<pre class="literal-block">
/* Test if an object implements the garbage collector protocol */
PyAPI_FUNC(int) PyObject_IS_GC(PyObject *obj);
// Fast inlined version of PyObject_IS_GC()
static inline int _PyObject_IS_GC(PyObject *obj)
{
    return (PyType_IS_GC(Py_TYPE(obj))
            && (Py_TYPE(obj)->tp_is_gc == NULL
                || Py_TYPE(obj)->tp_is_gc(obj)));
}
</pre>
<p>C code:</p>
<pre class="literal-block">
int
PyObject_IS_GC(PyObject *obj)
{
    return _PyObject_IS_GC(obj);
}
</pre>
</div>
</div>
<div class="section" id="python-3-10-incompatible-c-api-change">
<h2>Python 3.10 incompatible C API change</h2>
<p>The <tt class="docutils literal">Py_REFCNT()</tt> macro was converted to a static inline function:
<tt class="docutils literal">Py_REFCNT(obj) = refcnt;</tt> now fails with a compiler error. It must be
replaced with <tt class="docutils literal">Py_SET_REFCNT(obj, refcnt)</tt>: Py_SET_REFCNT() was added to
Python 3.9.</p>
</div>
<div class="section" id="the-complex-case-of-py-type-and-py-size-macros">
<h2>The complex case of Py_TYPE() and Py_SIZE() macros</h2>
<div class="section" id="macros-converted-and-then-reverted">
<h3>Macros converted and then reverted</h3>
<p>The <tt class="docutils literal">Py_TYPE()</tt> and <tt class="docutils literal">Py_SIZE()</tt> macros were also converted to static inline
functions in Python 3.10, but the change <a class="reference external" href="https://bugs.python.org/issue39573#msg370303">broke 17 C extensions</a>.</p>
<p>Since the change broke too many C extensions, I reverted it: I
<a class="reference external" href="https://github.com/python/cpython/commit/0e2ac21dd4960574e89561243763eabba685296a">converted Py_TYPE() and Py_SIZE() back to macros</a>
to have more time to fix C extensions.</p>
</div>
<div class="section" id="i-fixed-6-extensions">
<h3>I fixed 6 extensions</h3>
<ul class="simple">
<li>Cython: <a class="reference external" href="https://github.com/cython/cython/commit/d8e93b332fe7d15459433ea74cd29178c03186bd">my fix adding __Pyx_SET_SIZE() and __Pyx_SET_REFCNT()</a></li>
<li>immutables: <a class="reference external" href="https://github.com/MagicStack/immutables/commit/45105ecd8b56a4d88dbcb380fcb8ff4b9cc7b19c">my fix adding pythoncapi_compat.h for Py_SET_SIZE()</a></li>
<li>breezy: <a class="reference external" href="https://bazaar.launchpad.net/~brz/brz/3.1/revision/7647">my fix adding Py_SET_REFCNT() macro</a></li>
<li>bitarray: <a class="reference external" href="https://github.com/ilanschnell/bitarray/commit/a0cca9f2986ec796df74ca8f42aff56c4c7103ba">my fix adding pythoncapi_compat.h</a></li>
<li>python-zstandard: <a class="reference external" href="https://github.com/indygreg/python-zstandard/commit/e5a3baf61b65f3075f250f504ddad9f8612bfedf">my fix adding pythoncapi_compat.h</a>
followed by <a class="reference external" href="https://github.com/indygreg/python-zstandard/commit/477776e6019478ca1c0b5777b073afbec70975f5">a pythoncapi_compat.h update for Python 2.7</a></li>
<li>mercurial: <a class="reference external" href="https://www.mercurial-scm.org/repo/hg/rev/e92ca942ddca">my fix adding pythoncapi_compat.h</a>
followed by a <a class="reference external" href="https://www.mercurial-scm.org/repo/hg/rev/38b9a63d3a13">fix for Python 2.7</a>
(then <a class="reference external" href="https://github.com/pythoncapi/pythoncapi_compat/commit/3e0bde93954ea8df328d36900c7060a3f3433eb0">fixed into upstream pythoncapi_compat.h</a>)</li>
</ul>
</div>
<div class="section" id="extensions-fixed-by-others">
<h3>Extensions fixed by others</h3>
<ul class="simple">
<li>numpy: <a class="reference external" href="https://github.com/numpy/numpy/commit/a96b18e3d4d11be31a321999cda4b795ea9eccaa">fix defining Py_SET_TYPE() and Py_SET_SIZE()</a>,
followed by a <a class="reference external" href="https://github.com/numpy/numpy/commit/f1671076c80bd972421751f2d48186ee9ac808aa">cleanup commit</a></li>
<li>pycurl: <a class="reference external" href="https://github.com/pycurl/pycurl/commit/e633f9a1ac4df5e249e78c218d5fbbd848219042">fix defining Py_SET_TYPE()</a></li>
<li>boost: <a class="reference external" href="https://github.com/boostorg/python/commit/500194edb7833d0627ce7a2595fec49d0aae2484#diff-b06ac66c98951b48056826c904be75263cdf56ec9b79d3274ea493e7d27cbac4">fix adding Py_SET_TYPE() and Py_SET_SIZE() macros</a></li>
<li>duplicity:
<a class="reference external" href="https://git.launchpad.net/duplicity/commit/?id=9c63dcb83e922e0afac206188203891e203b4e66">fix 1</a>,
<a class="reference external" href="https://git.launchpad.net/duplicity/commit/?id=bbaae91b5ac6ef7e295968e508522884609fbf84">fix 2</a></li>
<li>pylibacl: <a class="reference external" href="https://github.com/iustin/pylibacl/commit/26712b8fd92f1146102248cac1c92cb344620eff">fixed</a></li>
<li>gobject-introspection: <a class="reference external" href="https://gitlab.gnome.org/GNOME/gobject-introspection/-/commit/c4d7d21a2ad838077c6310532fdf7505321f0ae7">fix adding Py_SET_TYPE() macro</a></li>
</ul>
</div>
<div class="section" id="extensions-still-not-fixed">
<h3>Extensions still not fixed</h3>
<ul class="simple">
<li>pyside2:<ul>
<li>My patch is not merged upstream yet</li>
<li><a class="reference external" href="https://bugreports.qt.io/browse/PYSIDE-1436">https://bugreports.qt.io/browse/PYSIDE-1436</a></li>
<li><a class="reference external" href="https://src.fedoraproject.org/rpms/python-pyside2/pull-request/7">https://src.fedoraproject.org/rpms/python-pyside2/pull-request/7</a></li>
<li><a class="reference external" href="https://bugzilla.redhat.com/show_bug.cgi?id=1898974">https://bugzilla.redhat.com/show_bug.cgi?id=1898974</a></li>
<li><a class="reference external" href="https://bugzilla.redhat.com/show_bug.cgi?id=1902618">https://bugzilla.redhat.com/show_bug.cgi?id=1902618</a></li>
</ul>
</li>
<li>pybluez: <a class="reference external" href="https://github.com/pybluez/pybluez/pull/371">closed PR (not merged)</a></li>
<li>PyPAM</li>
<li>pygobject3</li>
<li>rdiff-backup</li>
</ul>
</div>
</div>
<div class="section" id="what-s-next">
<h2>What's Next?</h2>
<ul class="simple">
<li>Convert again Py_TYPE() and Py_SIZE() macros to static inline functions.</li>
<li>Add "%T" formatter for <tt class="docutils literal"><span class="pre">Py_TYPE(obj)->tp_name</span></tt>:
see <a class="reference external" href="https://bugs.python.org/issue34595">rejected bpo-34595</a>.</li>
<li>Modify Cython to use getter functions.</li>
<li>Attempt to make some structures opaque, like PyThreadState.</li>
</ul>
</div>
Isolate Python Subinterpreters2020-12-27T22:00:00+01:002020-12-27T22:00:00+01:00Victor Stinnertag:vstinner.github.io,2020-12-27:/isolate-subinterpreters.html<img alt="Christmas gift." src="https://vstinner.github.io/images/christmas-gift.jpg" />
<p>This article is about the work done in Python in 2019 and 2020 to better
isolate subinterpreters. Static types are converted to heap types, extension
modules are converted to use the new multiphase initialization API (PEP 489),
caches, states, singletons and free lists are made per-interpreter, many bugs
have been …</p><img alt="Christmas gift." src="https://vstinner.github.io/images/christmas-gift.jpg" />
<p>This article is about the work done in Python in 2019 and 2020 to better
isolate subinterpreters. Static types are converted to heap types, extension
modules are converted to use the new multiphase initialization API (PEP 489),
caches, states, singletons and free lists are made per-interpreter, many bugs
have been fixed, etc.</p>
<p>Running multiple interpreters in parallel with one "GIL" per interpreter cannot
be done yet, but a lot of complex technical challenges have been solved.</p>
<div class="section" id="why-isolating-subinterpreters">
<h2>Why isolating subinterpreters?</h2>
<p>The final goal is to be able to run multiple interpreters in parallel in the
same process, like one interpreter per CPU, each interpreter running in its own
thread. The principle is the same as the multiprocessing module and has the
same limitations: no Python object can be shared directly between two
interpreters. Later, we can imagine helpers to share mutable Python objects
using proxies which would prevent race conditions.</p>
<p>The work on subinterpreters requires modifying many functions and extension
modules. It will benefit Python in different ways.</p>
<p>Converting static types to heap types and converting extension modules to the
multiphase initialization API (PEP 489) makes extension modules implemented in
C behave closer to modules implemented in Python, which is good for the <a class="reference external" href="https://www.python.org/dev/peps/pep-0399/">PEP
399 -- Pure Python/C Accelerator Module Compatibility Requirements</a>. So <strong>this work also helps
Python implementations other than CPython, like PyPy</strong>.</p>
<p>These changes also destroy more Python objects and release more memory at
Python exit, which matters <strong>when Python is embedded in an application</strong>. Python
should be "stateless" and especially release all memory at exit. This work slowly
fixes <a class="reference external" href="https://bugs.python.org/issue1635741">bpo-1635741: Py_Finalize() doesn't clear all Python objects at exit</a>. Python leaks fewer and fewer Python
objects at exit.</p>
</div>
<div class="section" id="proof-of-concept-in-may-2020">
<h2>Proof-of-concept in May 2020</h2>
<p>In May 2020, I wrote a proof-of-concept to prove the feasibility of the project
and that it is faster than sequential execution: <a class="reference external" href="https://mail.python.org/archives/list/python-dev@python.org/thread/S5GZZCEREZLA2PEMTVFBCDM52H4JSENR/#RIK75U3ROEHWZL4VENQSQECB4F4GDELV">PoC: Subinterpreters
4x faster than sequential execution or threads on CPU-bound workaround</a>.
Benchmark on 4 CPUs:</p>
<ul class="simple">
<li>Sequential: 1.99 sec +- 0.01 sec</li>
<li>Threads: 3.15 sec +- 0.97 sec (1.5x <strong>slower</strong>)</li>
<li>Multiprocessing: 560 ms +- 12 ms (3.6x <strong>faster</strong>)</li>
<li>Subinterpreters: 583 ms +- 7 ms (3.4x <strong>faster</strong>)</li>
</ul>
<p>On this benchmark, subinterpreters are basically as fast as multiprocessing,
which is promising.</p>
</div>
<div class="section" id="experimental-isolated-subintepreters">
<h2>Experimental isolated subinterpreters</h2>
<p>To write this PoC, I added a <tt class="docutils literal"><span class="pre">--with-experimental-isolated-subinterpreters</span></tt>
option to <tt class="docutils literal">./configure</tt> in <a class="reference external" href="https://bugs.python.org/issue40514">bpo-40514</a>
which defines the <tt class="docutils literal">EXPERIMENTAL_ISOLATED_SUBINTERPRETERS</tt> macro. Effects of
this special build:</p>
<ul class="simple">
<li>Make the GIL per-interpreter.</li>
<li><tt class="docutils literal">_xxsubinterpreters.run_string()</tt> releases the GIL when running the
subinterpreter.</li>
<li>Add a thread local storage for the Python thread state ("tstate").</li>
<li>Disable the garbage collector in subinterpreters.</li>
<li>Disable the type attribute lookup cache.</li>
<li>Disable free lists: frame, list, tuple, type attribute lookup cache.</li>
<li>Disable singletons: latin1 characters.</li>
<li>Disable interned strings.</li>
<li>Disable the fast pymalloc memory allocator (force libc malloc memory
allocator).</li>
</ul>
<p>Features are disabled because their implementation is currently not compatible
with multiple interpreters running in parallel.</p>
<p>This special build is designed to be temporary. It should ease the
development of isolated subinterpreters. It will be removed once
subinterpreters are fully isolated (once each interpreter has its own GIL).</p>
</div>
<div class="section" id="convert-static-types-to-heap-types">
<h2>Convert static types to heap types</h2>
<p>Types declared in Python (<tt class="docutils literal">class MyType: ...</tt>) are always "heap types":
types dynamically allocated on the heap memory. Historically, all types
declared in C were declared as "static types": defined statically at build
time.</p>
<p>In C, static types are referenced directly using the <tt class="docutils literal">&</tt> operator to
get their address; they are not copied. For example, the Python <tt class="docutils literal">str</tt> type is
referenced as <tt class="docutils literal">&PyUnicode_Type</tt> in C.</p>
<p>Types are also regular objects (<tt class="docutils literal">PyTypeObject</tt> inherits from <tt class="docutils literal">PyObject</tt>)
and so have a reference count, but the <tt class="docutils literal">PyObject.ob_refcnt</tt> member is not
atomic and must not be modified in parallel. Problem: all interpreters share
the same static types. Static types have other problems:</p>
<ul class="simple">
<li>A type <tt class="docutils literal">__mro__</tt> tuple (<tt class="docutils literal">PyTypeObject.tp_mro</tt> member) has the same
problem of non-atomic reference count.</li>
<li>When a subtype is created, it is stored in the <tt class="docutils literal">PyTypeObject.tp_subclasses</tt>
dictionary member (accessible in Python with the <tt class="docutils literal">__subclasses__()</tt>
method), whereas Python dictionaries are not thread-safe.</li>
<li>Static types behave differently than regular Python types. For example,
usually it is not possible to add an arbitrary attribute or override
an attribute. It goes against the <a class="reference external" href="https://www.python.org/dev/peps/pep-0399/">PEP 399 -- Pure Python/C Accelerator
Module Compatibility Requirements</a> principles.</li>
<li>etc.</li>
</ul>
<p>Right now, <strong>43% (89/206)</strong> of types are declared as heap types. For
comparison, in Python 3.8, only 9% (15/172) of types were declared as heap
types: <strong>74 types</strong> have been converted in the meantime.</p>
<p>TODO: convert the remaining 117 static types: see <a class="reference external" href="https://bugs.python.org/issue40077">bpo-40077</a>.</p>
</div>
<div class="section" id="multiphase-initialization-api">
<h2>Multiphase initialization API</h2>
<p>Historically, extension modules are declared with the <tt class="docutils literal">PyModule_Create()</tt>
function. Usually, such an extension can be instantiated exactly once. It is
stored in an internal <tt class="docutils literal">PyInterpreterState.modules_by_index</tt> list; a unique
index is assigned to the module and stored in <tt class="docutils literal">PyModuleDef.m_base.m_index</tt>.
Usually, such extensions use static global variables.</p>
<p>Such "static" extensions have multiple issues:</p>
<ul class="simple">
<li>The extension cannot be unloaded: its memory is not released at Python exit.
It is an issue when Python is embedded in an application.</li>
<li>The extension behaves differently than modules defined in Python. When an
extension is reimported, its namespace (<tt class="docutils literal">module.__dict__</tt>) is duplicated,
but mutable objects and static global variables are still shared. It goes
against the <a class="reference external" href="https://www.python.org/dev/peps/pep-0399/">PEP 399 -- Pure Python/C Accelerator Module Compatibility
Requirements</a> principles.</li>
<li>etc.</li>
</ul>
<p>In 2013, <strong>Petr Viktorin</strong>, <strong>Stefan Behnel</strong> and <strong>Nick Coghlan</strong> wrote the
<a class="reference external" href="https://www.python.org/dev/peps/pep-0489/">PEP 489 -- Multi-phase extension module initialization</a> which has been approved and
implemented in Python 3.5. For example, the <tt class="docutils literal">_abc</tt> module initialization
function is now just a call to the new <tt class="docutils literal">PyModuleDef_Init()</tt> function:</p>
<pre class="literal-block">
PyMODINIT_FUNC
PyInit__abc(void)
{
return PyModuleDef_Init(&_abcmodule);
}
</pre>
<p>An extension module can have a module state, if <tt class="docutils literal">PyModuleDef.m_size</tt> is
greater than zero. Example:</p>
<pre class="literal-block">
typedef struct {
PyTypeObject *_abc_data_type;
unsigned long long abc_invalidation_counter;
} _abcmodule_state;
static struct PyModuleDef _abcmodule = {
...
.m_size = sizeof(_abcmodule_state), // <=== HERE ===
};
</pre>
<p>The <tt class="docutils literal">PyModule_GetState()</tt> function can be used to retrieve the module state. Example:</p>
<pre class="literal-block">
static inline _abcmodule_state*
get_abc_state(PyObject *module)
{
void *state = PyModule_GetState(module);
assert(state != NULL);
return (_abcmodule_state *)state;
}
static PyObject *
_abc__abc_init(PyObject *module, PyObject *self)
{
_abcmodule_state *state = get_abc_state(module);
...
data = abc_data_new(state->_abc_data_type, NULL, NULL);
...
}
</pre>
<p>Right now, <strong>77% (102/132)</strong> of extension modules use the new multiphase
initialization API (PEP 489). For comparison, in Python 3.8, only 23% (27/118)
of extensions used it: <strong>75 extensions</strong> have been converted in the
meantime.</p>
<p>TODO: convert the remaining 30 extension modules (<a class="reference external" href="https://bugs.python.org/issue1635741">bpo-1635741</a>).</p>
</div>
<div class="section" id="module-states">
<h2>Module states</h2>
<p>Some modules have a state which should be stored in the interpreter, both to
share it between multiple instances of the module and to give access to it in
functions of the public C API (ex: <tt class="docutils literal">PyAST_Check()</tt>).</p>
<p>States made per-interpreter:</p>
<ul class="simple">
<li>2019-05-10: <strong>warnings</strong>
(<a class="reference external" href="https://bugs.python.org/issue36737">bpo-36737</a>,
<a class="reference external" href="https://github.com/python/cpython/commit/86ea58149c3e83f402cecd17e6a536865fb06ce1">commit</a> by <strong>Eric Snow</strong>)</li>
<li>2019-11-07: <strong>parser</strong>
(<a class="reference external" href="https://bugs.python.org/issue36876">bpo-36876</a>,
<a class="reference external" href="https://github.com/python/cpython/commit/9def81aa52adc3cc89554156e40742cf17312825">commit</a> by <strong>Vinay Sajip</strong>)</li>
<li>2019-11-20: <strong>gc</strong>
(<a class="reference external" href="https://bugs.python.org/issue36854">bpo-36854</a>,
<a class="reference external" href="https://github.com/python/cpython/commit/7247407c35330f3f6292f1d40606b7ba6afd5700">commit</a> by me)</li>
<li>2020-11-02: <strong>ast</strong>
(<a class="reference external" href="https://bugs.python.org/issue41796">bpo-41796</a>,
<a class="reference external" href="https://github.com/python/cpython/commit/5cf4782a2630629d0978bf4cf6b6340365f449b2">commit</a> by me)</li>
<li>2020-12-15: <strong>atexit</strong>
(<a class="reference external" href="https://bugs.python.org/issue42639">bpo-42639</a>,
<a class="reference external" href="https://github.com/python/cpython/commit/b8fa135908d294b350cdad04e2f512327a538dee">commit</a> by me)</li>
</ul>
</div>
<div class="section" id="singletons">
<h2>Singletons</h2>
<p>Singletons must not be shared between interpreters.</p>
<p>Singletons made per-interpreter:</p>
<p><a class="reference external" href="https://bugs.python.org/issue38858">bpo-38858</a>:</p>
<ul class="simple">
<li>2019-12-17: small <strong>integer</strong>, the [-5; 256] range
(<a class="reference external" href="https://github.com/python/cpython/commit/630c8df5cf126594f8c1c4579c1888ca80a29d59">commit</a> by me)</li>
</ul>
<p><a class="reference external" href="https://bugs.python.org/issue40521">bpo-40521</a>:</p>
<ul class="simple">
<li>2020-06-04: empty <strong>tuple</strong> singleton
(<a class="reference external" href="https://github.com/python/cpython/commit/69ac6e58fd98de339c013fe64cd1cf763e4f9bca">commit</a> by me)</li>
<li>2020-06-23: empty <strong>bytes</strong> string singleton and single byte character
(<tt class="docutils literal"><span class="pre">b'\x00'</span></tt> to <tt class="docutils literal"><span class="pre">b'\xFF'</span></tt>) singletons
(<a class="reference external" href="https://github.com/python/cpython/commit/c41eed1a874e2f22bde45c3c89418414b7a37f46">commit</a> by me)</li>
<li>2020-06-23: empty <strong>Unicode</strong> string singleton
(<a class="reference external" href="https://github.com/python/cpython/commit/f363d0a6e9cfa50677a6de203735fbc0d06c2f49">commit</a> by me)</li>
<li>2020-06-23: empty <strong>frozenset</strong> singleton
(<a class="reference external" href="https://github.com/python/cpython/commit/261cfedf7657a515e04428bba58eba2a9bb88208">commit</a> by me);
later removed.</li>
<li>2020-06-24: single <strong>Unicode</strong> character (U+0000-U+00FF range)
(<a class="reference external" href="https://github.com/python/cpython/commit/2f9ada96e0d420fed0d09a032b37197f08ef167a">commit</a> by me)</li>
</ul>
<p>I also micro-optimized the code: most singletons are now created at startup,
so there is no longer a need to check at each function call whether they have
been created. Moreover, an assertion now ensures that singletons are no longer
used after they are deleted.</p>
</div>
<div class="section" id="free-lists">
<h2>Free lists</h2>
<p>A free list is a micro-optimization on memory allocations. The memory of
recently destroyed objects is not freed to be able to reuse it for new objects.
Free lists must not be shared between interpreters.</p>
<p>Free lists made per-interpreter (<a class="reference external" href="https://bugs.python.org/issue40521">bpo-40521</a>):</p>
<ul class="simple">
<li>2020-06-04: <strong>slice</strong>
(<a class="reference external" href="https://github.com/python/cpython/commit/7daba6f221e713f7f60c613b246459b07d179f91">commit</a> by me)</li>
<li>2020-06-04: <strong>tuple</strong>
(<a class="reference external" href="https://github.com/python/cpython/commit/69ac6e58fd98de339c013fe64cd1cf763e4f9bca">commit</a> by me)</li>
<li>2020-06-04: <strong>float</strong>
(<a class="reference external" href="https://github.com/python/cpython/commit/2ba59370c3dda2ac229c14510e53a05074b133d1">commit</a> by me)</li>
<li>2020-06-04: <strong>frame</strong>
(<a class="reference external" href="https://github.com/python/cpython/commit/3744ed2c9c0b3905947602fc375de49533790cb9">commit</a> by me)</li>
<li>2020-06-05: <strong>async generator</strong>
(<a class="reference external" href="https://github.com/python/cpython/commit/78a02c2568714562e23e885b6dc5730601f35226">commit</a> by me)</li>
<li>2020-06-05: <strong>context</strong>
(<a class="reference external" href="https://github.com/python/cpython/commit/e005ead49b1ee2b1507ceea94e6f89c28ecf1f81">commit</a> by me)</li>
<li>2020-06-05: <strong>list</strong>
(<a class="reference external" href="https://github.com/python/cpython/commit/88ec9190105c9b03f49aaef601ce02b242a75273">commit</a> by me)</li>
<li>2020-06-23: <strong>dict</strong>
(<a class="reference external" href="https://github.com/python/cpython/commit/b4e85cadfbc2b1b24ec5f3159e351dbacedaa5e0">commit</a> by me)</li>
<li>2020-06-23: <strong>MemoryError</strong>
(<a class="reference external" href="https://github.com/python/cpython/commit/281cce1106568ef9fec17e3c72d289416fac02a5">commit</a> by me)</li>
</ul>
</div>
<div class="section" id="caches">
<h2>Caches</h2>
<p>Caches made per interpreter:</p>
<ul class="simple">
<li>2020-06-04: <strong>slice</strong> cache
(<a class="reference external" href="https://bugs.python.org/issue40521">bpo-40521</a>,
<a class="reference external" href="https://github.com/python/cpython/commit/7daba6f221e713f7f60c613b246459b07d179f91">commit</a> by me)</li>
<li>2020-12-26: <strong>type</strong> attribute lookup cache
(<a class="reference external" href="https://bugs.python.org/issue42745">bpo-42745</a>,
<a class="reference external" href="https://github.com/python/cpython/commit/41010184880151d6ae02a226dbacc796e5c90d11">commit</a> by me)</li>
</ul>
</div>
<div class="section" id="interned-strings-and-identifiers">
<h2>Interned strings and identifiers</h2>
<ul class="simple">
<li>2020-12-25: Per-interpreter identifiers: <tt class="docutils literal">_PyUnicode_FromId()</tt>
(<a class="reference external" href="https://bugs.python.org/issue39465">bpo-39465</a>,
<a class="reference external" href="https://github.com/python/cpython/commit/ba3d67c2fb04a7842741b1b6da5d67f22c579f33">commit</a> by me)</li>
<li>2020-12-26: Per-interpreter interned strings: <tt class="docutils literal">PyUnicode_InternInPlace()</tt>
(<a class="reference external" href="https://bugs.python.org/issue40521">bpo-40521</a>,
<a class="reference external" href="https://github.com/python/cpython/commit/ea251806b8dffff11b30d2182af1e589caf88acf">commit</a> by me)</li>
</ul>
<p>For <tt class="docutils literal">_PyUnicode_FromId()</tt>, I added the <tt class="docutils literal">pycore_atomic_funcs.h</tt> header file
(<a class="reference external" href="https://github.com/python/cpython/commit/52a327c1cbb86c7f2f5c460645889b23615261bf">commit</a>)
which adds functions for atomic memory accesses (to variables of type
<tt class="docutils literal">Py_ssize_t</tt>). It uses <tt class="docutils literal">__atomic_load_n()</tt> and <tt class="docutils literal">__atomic_store_n()</tt> on GCC
and clang, or <tt class="docutils literal">_InterlockedCompareExchange64()</tt> and
<tt class="docutils literal">_InterlockedExchange64()</tt> on MSC (Windows).</p>
<p>First, I tried to use the <tt class="docutils literal">_Py_hashtable</tt> type: <a class="reference external" href="https://github.com/python/cpython/pull/20048">PR 20048</a>. Using <tt class="docutils literal">_Py_hashtable</tt>,
<tt class="docutils literal">_PyUnicode_FromId()</tt> took 15.5 ns +- 0.1 ns. I optimized <tt class="docutils literal">_Py_hashtable</tt>:
<tt class="docutils literal">_PyUnicode_FromId()</tt> took 6.65 ns +- 0.09 ns. But it was still slower than
the reference code: 2.38 ns +- 0.00 ns.</p>
<p>The merged implementation uses an array: a unique index is assigned to each
identifier, and the interned string is stored at that index in the array,
which is made larger on demand. The final change adds about 1 ns per function
call:</p>
<pre class="literal-block">
[ref] 2.42 ns +- 0.00 ns -> [atomic] 3.39 ns +- 0.00 ns: 1.40x slower
</pre>
</div>
<div class="section" id="misc">
<h2>Misc</h2>
<ul class="simple">
<li>2020-03-19: Per-interpreter pending calls
(<a class="reference external" href="https://bugs.python.org/issue39984">bpo-39984</a>,
<a class="reference external" href="https://github.com/python/cpython/commit/50e6e991781db761c496561a995541ca8d83ff87">commit</a> by me).</li>
</ul>
</div>
<div class="section" id="bugfixes">
<h2>Bugfixes</h2>
<ul class="simple">
<li><a class="reference external" href="https://vstinner.github.io/gil-bugfixes-daemon-threads-python39.html">GIL bugfixes for daemon threads in Python 3.9</a></li>
<li>Fix many <a class="reference external" href="https://vstinner.github.io/subinterpreter-leaks.html">leaks discovered by subinterpreters</a></li>
<li>Fix pickling heap types implemented in C with protocols 0 and 1
(<a class="reference external" href="https://bugs.python.org/issue41052">bpo-41052</a>)</li>
</ul>
</div>
<div class="section" id="pep-630-isolating-extension-modules">
<h2>PEP 630: Isolating Extension Modules</h2>
<p>In August 2020, <strong>Petr Viktorin</strong> wrote <a class="reference external" href="https://www.python.org/dev/peps/pep-0630/">PEP 630 -- Isolating Extension Modules</a> which gives practical advice on
how to update an extension module to make it stateless using the previous PEPs
(heap types, multi-phase init, etc.). Once a module is stateless, it becomes
safe to use with subinterpreters running in parallel.</p>
</div>
<div class="section" id="thanks">
<h2>Thanks</h2>
<p>The work on subinterpreters, multiphase init and heap types is a
collaborative effort ongoing for 2 years. I would like to thank the following
developers for helping with this large task:</p>
<ul class="simple">
<li><strong>Christian Heimes</strong></li>
<li><strong>Dong-hee Na</strong></li>
<li><strong>Eric Snow</strong></li>
<li><strong>Erlend Egeberg Aasland</strong></li>
<li><strong>Hai Shi</strong></li>
<li><strong>Mohamed Koubaa</strong></li>
<li><strong>Nick Coghlan</strong></li>
<li><strong>Paulo Henrique Silva</strong></li>
<li><strong>Petr Viktorin</strong></li>
<li><strong>Vinay Sajip</strong></li>
</ul>
<p>Note: Since the work is scattered in many issues and pull requests, it's hard
to track who helped: sorry if I forgot someone! (Please contact me and I
will complete the list.)</p>
</div>
<div class="section" id="what-s-next">
<h2>What's Next?</h2>
<p>There are still multiple interesting technical challenges:</p>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue39511">bpo-39511: Per-interpreter singletons (None, True, False, etc.)</a></li>
<li><a class="reference external" href="https://bugs.python.org/issue40601">bpo-40601: Hide static types from the C API</a></li>
<li>Make pymalloc allocator compatible with subinterpreters.</li>
<li>Make the GIL per interpreter. Maybe even give the choice to share or not
the GIL when a subinterpreter is created.</li>
<li>Make the <tt class="docutils literal">_PyArg_Parser</tt> (<tt class="docutils literal">parser_init()</tt>) function compatible with
subinterpreters. Maybe use a per-interpreter array, a solution similar to
<tt class="docutils literal">_PyUnicode_FromId()</tt>.</li>
<li><a class="reference external" href="https://bugs.python.org/issue15751">bpo-15751: Make the PyGILState API compatible with subinterpreters</a> (issue created in 2012!)</li>
<li><a class="reference external" href="https://bugs.python.org/issue40522">bpo-40522: Get the current Python interpreter state from Thread Local
Storage (autoTSSkey)</a></li>
</ul>
<p>Also, there are still many static types to convert to heap types (<a class="reference external" href="https://bugs.python.org/issue40077">bpo-40077</a>) and many extension modules to convert
to the multiphase initialization API (<a class="reference external" href="https://bugs.python.org/issue1635741">bpo-1635741</a>).</p>
<p>I'm tracking the work in my <a class="reference external" href="https://pythondev.readthedocs.io/subinterpreters.html">Python Subinterpreters</a> page
and in the <a class="reference external" href="https://bugs.python.org/issue40512">bpo-40512: Meta issue: per-interpreter GIL</a>.</p>
</div>
Hide implementation details from the Python C API2020-12-25T22:00:00+01:002020-12-25T22:00:00+01:00Victor Stinnertag:vstinner.github.io,2020-12-25:/hide-implementation-details-python-c-api.html<img alt="My cat attacking the Python C API" src="https://vstinner.github.io/images/pepsie.jpg" />
<p>This article is the history of Python C API discussions over the last 4 years,
and the creation of C API projects: <a class="reference external" href="https://pythoncapi.readthedocs.io/">pythoncapi website</a>, <a class="reference external" href="https://github.com/pythoncapi/pythoncapi_compat">pythoncapi_compat.h header file</a> and <a class="reference external" href="https://hpy.readthedocs.io/">HPy (new clean C API)</a>. More and more people are aware of issues
caused by the C API and are working on solutions.</p>
<p>It took me many iterations to find the right approach to evolve the C API
without breaking too many third-party extension modules. My first ideas were
based on two APIs with some kind of opt-in option. In the end, I decided to fix
the default API directly, and helped maintainers of extension modules update
their projects for incompatible C API changes.</p>
<p>I wrote a <tt class="docutils literal">pythoncapi_compat.h</tt> header file which adds C API functions of
newer Python to old Python versions, down to Python 2.7. I also wrote an
<tt class="docutils literal">upgrade_pythoncapi.py</tt> script to add Python 3.10 support to an extension
module without losing Python 2.7 support: the tool adds <tt class="docutils literal">#include
"pythoncapi_compat.h"</tt>. For example, it replaces <tt class="docutils literal">Py_TYPE(obj) = type</tt>
with <tt class="docutils literal">Py_SET_TYPE(obj, type)</tt>.</p>
<p>The photo: my cat attacking the Python C API.</p>
<div class="section" id="year-2016">
<h2>Year 2016</h2>
<p>Between 2016 and 2017, Larry Hastings worked on removing the GIL in a CPython
fork called "The Gilectomy". He pushed the first commit in April 2016: <a class="reference external" href="https://github.com/larryhastings/gilectomy/commit/4a1a4ff49e34b9705608cad968f467af161dcf02">Removed
the GIL. Don't merge this!</a>
("Few programs work now"). At EuroPython 2016, he gave the talk <a class="reference external" href="https://www.youtube.com/watch?v=fgWUwQVoLHo">Larry Hastings
- The Gilectomy</a> where he
explains that the current parallelism bottleneck is the CPython reference
counting which doesn't scale with the number of threads.</p>
<p>It was yet another hint telling me that "something" should be done to make
the C API more abstract and move away from implementation details like
reference counting. PyPy has also had performance issues with the C API for
many years.</p>
</div>
<div class="section" id="year-2017">
<h2>Year 2017</h2>
<div class="section" id="may">
<h3>May</h3>
<p>In 2017, I discussed with Eric Snow who was working on subinterpreters. He
had to modify public structures, especially the <tt class="docutils literal">PyInterpreterState</tt>
structure. He created the <tt class="docutils literal">Include/internal/</tt> subdirectory for a new
"internal C API" which should not be exported. (Later, he moved the
<tt class="docutils literal">PyInterpreterState</tt> structure to the internal C API in Python 3.8.)</p>
<p>I started discussing C API changes at the Python Language Summit
(PyCon US 2017): <a class="reference external" href="https://github.com/vstinner/conf/raw/master/2017-PyconUS/summit.pdf">"Python performance" slides (PDF)</a>:</p>
<ul class="simple">
<li>Split Include in sub-directories</li>
<li>Move towards a stable ABI by default</li>
</ul>
<p>See also the LWN article: <a class="reference external" href="https://lwn.net/Articles/723752/#723949">Keeping Python competitive</a> by Jake Edge.</p>
</div>
<div class="section" id="july-first-pep-draft">
<h3>July: first PEP draft</h3>
<p>I proposed the first PEP draft to python-ideas:
<a class="reference external" href="https://mail.python.org/archives/list/python-ideas@python.org/thread/6XATDGWK4VBUQPRHCRLKQECTJIPBVNJQ/">PEP: Hide implementation details in the C API</a>.</p>
<p>The idea is to add an opt-in option to distutils to build an extension module
with a new C API, remove implementation details from the new C API, and maybe
later switch to the new C API by default.</p>
</div>
<div class="section" id="september">
<h3>September</h3>
<p>I discussed my C API change ideas at the CPython core dev sprint (at Instagram,
California). The ideas were liked by most (if not all) core developers who are
fine with a minor performance slowdown (caused by replacing macros with
function calls). I wrote <a class="reference external" href="https://vstinner.github.io/new-python-c-api.html">A New C API for CPython</a> blog post about these
discussions.</p>
</div>
<div class="section" id="november">
<h3>November</h3>
<p>I proposed <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-November/150607.html">Make the stable API-ABI usable</a> on
the python-dev list. The idea is to add <tt class="docutils literal">PyTuple_GET_ITEM()</tt> (for example) to
the limited C API but declared as a function call. Later, if enough extension
modules are compatible with the extended limited C API, make it the default.</p>
</div>
</div>
<div class="section" id="year-2018">
<h2>Year 2018</h2>
<p>In July, I created the <a class="reference external" href="https://pythoncapi.readthedocs.io/">pythoncapi website</a> to collect issues of the current C
API, list things to avoid in new functions like borrowed references, and start
to design a new better C API.</p>
<p>In September, Antonio Cuni wrote <a class="reference external" href="https://morepypy.blogspot.com/2018/09/inside-cpyext-why-emulating-cpython-c.html">Inside cpyext: Why emulating CPython C API is
so Hard</a>
article.</p>
</div>
<div class="section" id="year-2019">
<h2>Year 2019</h2>
<p>In February, I sent <a class="reference external" href="https://mail.python.org/archives/list/capi-sig@python.org/thread/WS6ATJWRUQZESGGYP3CCSVPF7OMPMNM6/">Update on CPython header files reorganization</a>
to the capi-sig list.</p>
<ul class="simple">
<li><tt class="docutils literal">Include/</tt>: limited C API</li>
<li><tt class="docutils literal">Include/cpython/</tt>: CPython C API</li>
<li><tt class="docutils literal">Include/internal/</tt>: CPython internal C API</li>
</ul>
<p>In March, I modified the Python debug build to make its ABI compatible with the
release build ABI:
<a class="reference external" href="https://docs.python.org/dev/whatsnew/3.8.html#debug-build-uses-the-same-abi-as-release-build">What’s New In Python 3.8: Debug build uses the same ABI as release build</a>.</p>
<p>In May, I gave a lightning talk <a class="reference external" href="https://github.com/vstinner/conf/blob/master/2019-Pycon/status_stable_api_abi.pdf">Status of the stable API and ABI in Python 3.8</a>,
at the Language Summit (during Pycon US 2019):</p>
<ul class="simple">
<li>Convert macros to static inline functions</li>
<li>Install the internal C API</li>
<li>Debug build now ABI compatible with the release build ABI</li>
<li>Getting rid of global variables</li>
</ul>
<p>By the way, see my <a class="reference external" href="https://vstinner.github.io/split-include-directory-python38.html">Split Include/ directory in Python 3.8</a> article: I converted many macros in
Python 3.8.</p>
<p>In July, the <a class="reference external" href="https://hpy.readthedocs.io/">HPy project</a> was created during
EuroPython at Basel. There was an informal meeting which included core
developers of PyPy (Antonio, Armin and Ronan), CPython (Victor Stinner and Mark
Shannon) and Cython (Stefan Behnel).</p>
<p>In December, Antonio, Armin and Ronan had a small internal sprint to
kick off the development of HPy: <a class="reference external" href="https://morepypy.blogspot.com/2019/12/hpy-kick-off-sprint-report.html">HPy kick-off sprint report</a>.</p>
</div>
<div class="section" id="year-2020">
<h2>Year 2020</h2>
<div class="section" id="april">
<h3>April</h3>
<p>I proposed <a class="reference external" href="https://mail.python.org/archives/list/python-dev@python.org/thread/HKM774XKU7DPJNLUTYHUB5U6VR6EQMJF/#TKHNENOXP6H34E73XGFOL2KKXSM4Z6T2">PEP: Modify the C API to hide implementation details</a>
on the python-dev list. The main idea is to provide a new optimized Python
runtime which is backward incompatible on purpose, and continue to ship the
regular runtime which is fully backward compatible.</p>
</div>
<div class="section" id="june">
<h3>June</h3>
<p>I wrote <a class="reference external" href="https://www.python.org/dev/peps/pep-0620/">PEP 620 -- Hide implementation details from the C API</a> and <a class="reference external" href="https://mail.python.org/archives/list/python-dev@python.org/thread/HKM774XKU7DPJNLUTYHUB5U6VR6EQMJF/">proposed the PEP to
python-dev</a>.
This PEP is my 3rd attempt to fix the C API: I rewrote it from scratch. Python
now distributes a new <tt class="docutils literal">pythoncapi_compat.h</tt> header and a process is defined
to reduce the number of broken C extensions when introducing C API incompatible
changes listed in this PEP.</p>
<p>I created the <a class="reference external" href="https://github.com/pythoncapi/pythoncapi_compat">pythoncapi_compat project</a>: header file providing new
C API functions to old Python versions using static inline functions.</p>
</div>
<div class="section" id="december">
<h3>December</h3>
<p>I wrote a new <tt class="docutils literal">upgrade_pythoncapi.py</tt> script to add Python 3.10
support to an extension module without losing Python 2.7 support. I sent
<a class="reference external" href="https://mail.python.org/archives/list/capi-sig@python.org/thread/LFLXFMKMZ77UCDUFD5EQCONSAFFWJWOZ/">New script: add Python 3.10 support to your C extensions without losing Python
3.6 support</a>
to the capi-sig list.</p>
<p>The pythoncapi_compat project got its first users (bitarray, immutables,
python-zstandard)! It proves that the project is useful and needed.</p>
<p>I collaborated with the HPy project to create a manifesto explaining how the
C API prevents optimizing CPython and makes the CPython C API inefficient on
PyPy. It is still a draft.</p>
</div>
</div>
Leaks discovered by subinterpreters2020-12-23T14:00:00+01:002020-12-23T14:00:00+01:00Victor Stinnertag:vstinner.github.io,2020-12-23:/subinterpreter-leaks.html<p>This article is about old reference leaks discovered or caused by the work on
isolating subinterpreters: leaks in 6 different modules (gc, _weakref, _abc,
_signal, _ast and _thread).</p>
<img alt="_thread GC bug" src="https://vstinner.github.io/images/thread_gc_bug.jpg" />
<div class="section" id="refleaks-buildbot-failures">
<h2>Refleaks buildbot failures</h2>
<p>With my work on isolating subinterpreters, old bugs about Python objects leaked
at Python exit are suddenly becoming blocker issues on buildbots.</p>
<p>As long as subinterpreters still share Python objects with the main interpreter,
it is more or less acceptable to leak these objects at Python exit. Right now (current master
branch), there are still more than 18 000 Python objects which are not
destroyed at Python exit:</p>
<pre class="literal-block">
$ ./python -X showrefcount -c pass
[18411 refs, 6097 blocks]
</pre>
<p>This issue is being solved in the <a class="reference external" href="https://bugs.python.org/issue1635741">bpo-1635741: Py_Finalize() doesn't clear all
Python objects at exit</a> which was
opened almost 14 years ago (2007).</p>
<p>When subinterpreters are better isolated, objects are no longer shared, and
suddenly these leaks make subinterpreter tests fail on Refleak buildbots.
For example, when an extension module is converted to the multiphase
initialization API (PEP 489) or when static types are converted to heap types,
these issues pop up.</p>
<p>It is a blocker issue for me, since I care about having only "green" buildbots (no
test failure); otherwise, more serious regressions can easily be missed.</p>
</div>
<div class="section" id="per-interpreter-gc-state">
<h2>Per-interpreter GC state</h2>
<p>In November 2019, I made the state of the GC module per-interpreter in
<a class="reference external" href="https://bugs.python.org/issue36854">bpo-36854</a>
(<a class="reference external" href="https://github.com/python/cpython/commit/7247407c35330f3f6292f1d40606b7ba6afd5700">commit</a>)
and test_atexit started to leak:</p>
<pre class="literal-block">
$ ./python -m test -R 3:3 test_atexit -m test.test_atexit.SubinterpreterTest.test_callbacks_leak
test_atexit leaked [3988, 3986, 3988] references, sum=11962
</pre>
<p>I fixed the usage of the <tt class="docutils literal">PyModule_AddObject()</tt> function in the <tt class="docutils literal">_testcapi</tt>
module (<a class="reference external" href="https://github.com/python/cpython/commit/310e2d25170a88ef03f6fd31efcc899fe062da2c">commit</a>).</p>
<p>I also pushed a <strong>workaround</strong> in <tt class="docutils literal">finalize_interp_clear()</tt>:</p>
<pre class="literal-block">
+ /* bpo-36854: Explicitly clear the codec registry
+ and trigger a GC collection */
+ PyInterpreterState *interp = tstate->interp;
+ Py_CLEAR(interp->codec_search_path);
+ Py_CLEAR(interp->codec_search_cache);
+ Py_CLEAR(interp->codec_error_registry);
+ _PyGC_CollectNoFail();
</pre>
<p>I dislike having to push a "temporary" workaround, but the Python finalization
is really complex and fragile. Fixing the root issues would require too much
work, whereas I wanted to repair the Refleak buildbots as soon as possible.</p>
<p>In December 2019, the workaround was partially removed (<a class="reference external" href="https://github.com/python/cpython/commit/ac0e1c2694bc199dbd073312145e3c09bee52cc4">commit</a>):</p>
<pre class="literal-block">
- Py_CLEAR(interp->codec_search_path);
- Py_CLEAR(interp->codec_search_cache);
- Py_CLEAR(interp->codec_error_registry);
</pre>
<p>The year after (December 2020), the last GC collection was moved into
<tt class="docutils literal">PyInterpreterState_Clear()</tt>, before finalizing the GC (<a class="reference external" href="https://github.com/python/cpython/commit/eba5bf2f5672bf4861c626937597b85ac0c242b9">commit</a>).</p>
</div>
<div class="section" id="port-weakref-to-multiphase-init">
<h2>Port _weakref to multiphase init</h2>
<p>In March 2020, the <tt class="docutils literal">_weakref</tt> module was ported to the multiphase
initialization API (PEP 489) in <a class="reference external" href="https://bugs.python.org/issue40050">bpo-40050</a> and test_importlib started to leak:</p>
<pre class="literal-block">
$ ./python -m test -R 3:3 test_importlib
test_importlib leaked [6303, 6299, 6303] references, sum=18905
</pre>
<p>The analysis was quite long and complicated. importlib imported some
extension modules twice, and it has to inject frozen modules to "bootstrap" the
import machinery.</p>
<p>In the end, I fixed the issue by removing the now unused <tt class="docutils literal">_weakref</tt> import in
<tt class="docutils literal">importlib._bootstrap_external</tt>
(<a class="reference external" href="https://github.com/python/cpython/commit/83d46e0622d2efdf5f3bf8bf8904d0dcb55fc322">commit</a>).
The fix also avoids importing an extension module twice.</p>
</div>
<div class="section" id="convert-abc-static-types-to-heap-types">
<h2>Convert _abc static types to heap types</h2>
<p>In April 2020, the static types of the <tt class="docutils literal">_abc</tt> extension module were converted
to heap types in <a class="reference external" href="https://bugs.python.org/issue40077">bpo-40077</a>
(<a class="reference external" href="https://github.com/python/cpython/commit/53e4c91725083975598350877e2ed8e2d0194114">commit</a>) and
test_threading started to leak:</p>
<pre class="literal-block">
$ ./python -m test -R 3:3 test_threading
test_threading leaked [19, 19, 19] references, sum=57
</pre>
<p>I created <a class="reference external" href="https://bugs.python.org/issue40149">bpo-40149</a> to track the leak.</p>
<div class="section" id="objects-hold-a-reference-to-heap-types">
<h3>Objects hold a reference to heap types</h3>
<p>In March 2019, the <tt class="docutils literal">PyObject_Init()</tt> function was modified in <a class="reference external" href="https://bugs.python.org/issue35810">bpo-35810</a> to keep a strong reference (<tt class="docutils literal">INCREF</tt>)
to the type if the type is a heap type
(<a class="reference external" href="https://github.com/python/cpython/commit/364f0b0f19cc3f0d5e63f571ec9163cf41c62958">commit</a>):</p>
<pre class="literal-block">
+ if (PyType_GetFlags(tp) & Py_TPFLAGS_HEAPTYPE) {
+ Py_INCREF(tp);
+ }
</pre>
<p>I opened <a class="reference external" href="https://bugs.python.org/issue40217">bpo-40217: The garbage collector doesn't take in account that objects
of heap allocated types hold a strong reference to their type</a> to discuss the regression
(the test_threading leak).</p>
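<p>The effect of that change is observable from pure Python: classes defined in
Python code are heap types, so the GC reports the type among the referents of
its instances. A minimal sketch (plain Python, not the C internals):</p>

```python
import gc

class Heap:  # classes defined in Python code are heap types
    pass

obj = Heap()

# Since bpo-35810, each instance holds a strong reference to its heap
# type; bpo-40217 is about making traverse functions report that
# reference to the GC, so cycles going through the type can be collected.
assert Heap in gc.get_referents(obj)
```

<p>For heap types defined in C, a missing <tt class="docutils literal"><span class="pre">Py_VISIT(Py_TYPE(self))</span></tt> in the
traverse function is exactly what makes such cycles invisible to the GC.</p>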
</div>
<div class="section" id="first-workaround-not-merged-force-a-second-garbage-collection">
<h3>First workaround (not merged): force a second garbage collection</h3>
<p>While analysing the test_threading leak regression, I identified a first
workaround: add a second <tt class="docutils literal">_PyGC_CollectNoFail()</tt> call in
<tt class="docutils literal">finalize_interp_clear()</tt>.</p>
<p>It was only a workaround which helped to understand the issue; it was not
merged.</p>
</div>
<div class="section" id="first-fix-merged-abc-data-traverse">
<h3>First fix (merged): abc_data_traverse()</h3>
<p>I merged a first fix: add a traverse function to the <tt class="docutils literal">_abc._abc_data</tt> type
(<a class="reference external" href="https://github.com/python/cpython/commit/9cc3ebd7e04cb645ac7b2f372eaafa7464e16b9c">commit</a>):</p>
<pre class="literal-block">
+static int
+abc_data_traverse(_abc_data *self, visitproc visit, void *arg)
+{
+ Py_VISIT(self->_abc_registry);
+ Py_VISIT(self->_abc_cache);
+ Py_VISIT(self->_abc_negative_cache);
+ return 0;
+}
</pre>
</div>
<div class="section" id="second-workaround-not-merged-visit-the-type-in-abc-data-traverse">
<h3>Second workaround (not merged): visit the type in abc_data_traverse()</h3>
<p>A second workaround was identified: add <tt class="docutils literal"><span class="pre">Py_VISIT(Py_TYPE(self));</span></tt> to
the new <tt class="docutils literal">abc_data_traverse()</tt> function.</p>
<p>Again, it was only a workaround which helped to understand the issue, but it
was not merged.</p>
</div>
<div class="section" id="second-fix-merged-call-py-visit-py-type-self-automatically">
<h3>Second fix (merged): call Py_VISIT(Py_TYPE(self)) automatically</h3>
<p>20 days after I opened <a class="reference external" href="https://bugs.python.org/issue40217">bpo-40217</a>,
<strong>Pablo Galindo</strong> modified <tt class="docutils literal">PyType_FromSpec()</tt> to add a wrapper around the
traverse function of heap types to ensure that <tt class="docutils literal">Py_VISIT(Py_TYPE(self))</tt> is
always called (<a class="reference external" href="https://github.com/python/cpython/commit/0169d3003be3d072751dd14a5c84748ab63a249f">commit</a>).</p>
</div>
<div class="section" id="last-fix-merged-fix-every-traverse-function">
<h3>Last fix (merged): fix every traverse function</h3>
<p>In May 2020, <strong>Pablo Galindo</strong> changed his mind. He reverted his
<tt class="docutils literal">PyType_FromSpec()</tt> change and instead fixed traverse function of heap types
(<a class="reference external" href="https://github.com/python/cpython/commit/1cf15af9a6f28750f37b08c028ada31d38e818dd">commit</a>).</p>
<p>In the end, <tt class="docutils literal">abc_data_traverse()</tt> calls <tt class="docutils literal">Py_VISIT(Py_TYPE(self))</tt>. The
second "workaround" was the correct fix!</p>
</div>
</div>
<div class="section" id="convert-signal-to-multiphase-init">
<h2>Convert _signal to multiphase init</h2>
<p>In September 2020, <strong>Mohamed Koubaa</strong> ported the <tt class="docutils literal">_signal</tt> module to the
multiphase initialization API (PEP 489) in <a class="reference external" href="https://bugs.python.org/issue1635741">bpo-1635741</a> (<a class="reference external" href="https://github.com/python/cpython/commit/71d1bd9569c8a497e279f2fea6fe47cd70a87ea3">commit 71d1bd95</a>)
and test_interpreters started to leak:</p>
<pre class="literal-block">
$ ./python -m test -R 3:3 test_interpreters
test_interpreters leaked [237, 237, 237] references, sum=711
</pre>
<p>I created <a class="reference external" href="https://bugs.python.org/issue41713">bpo-41713</a> to track the
regression. Since I failed to find a simple fix, I started by reverting the
change which caused Refleak buildbots to fail (<a class="reference external" href="https://github.com/python/cpython/commit/4b8032e5a4994a7902076efa72fca1e2c85d8b7f">commit</a>).</p>
<p>I had to refactor the <tt class="docutils literal">_signal</tt> extension module code with multiple commits
to fix all bugs.</p>
<p>The first fix was to remove the <tt class="docutils literal">IntHandler</tt> variable: there was no need to
keep it alive, since it was only needed once in <tt class="docutils literal">signal_module_exec()</tt>.</p>
<p>The second fix was to close the Windows event at exit:</p>
<pre class="literal-block">
+ #ifdef MS_WINDOWS
+ if (sigint_event != NULL) {
+ CloseHandle(sigint_event);
+ sigint_event = NULL;
+ }
+ #endif
</pre>
<p>The last fix, the most important one, was to clear the strong reference to old
Python signal handlers when <tt class="docutils literal">signal_module_exec()</tt> is called more than once:</p>
<pre class="literal-block">
// If signal_module_exec() is called more than once, we must
// clear the strong reference to the previous function.
Py_XSETREF(Handlers[signum].func, Py_NewRef(func));
</pre>
<p>The <tt class="docutils literal">_signal</tt> module is not well isolated for subinterpreters yet, but at
least it no longer leaks.</p>
</div>
<div class="section" id="per-interpreter-ast-state">
<h2>Per-interpreter _ast state</h2>
<p>In September 2019, the <tt class="docutils literal">_ast</tt> extension module was converted to PEP 384
(stable ABI) in <a class="reference external" href="https://bugs.python.org/issue38113">bpo-38113</a> (<a class="reference external" href="https://github.com/python/cpython/commit/ac46eb4ad6662cf6d771b20d8963658b2186c48c">commit</a>):
the AST state moved into a module state.</p>
<p>This change caused 3 different bugs including crashes (<a class="reference external" href="https://bugs.python.org/issue41194">bpo-41194</a>, <a class="reference external" href="https://bugs.python.org/issue41261">bpo-41261</a>, <a class="reference external" href="https://bugs.python.org/issue41631">bpo-41631</a>). The issue is complex since there are
public C APIs which require access to AST types, whereas it became possible to
have multiple <tt class="docutils literal">_ast</tt> extension module instances.</p>
<p>In July 2020, I fixed the root issue in <a class="reference external" href="https://bugs.python.org/issue41194">bpo-41194</a> by replacing the module state with a
global state (<a class="reference external" href="https://github.com/python/cpython/commit/91e1bc18bd467a13bceb62e16fbc435b33381c82">commit</a>):</p>
<pre class="literal-block">
static astmodulestate global_ast_state;
</pre>
<p>A global state is bad for subinterpreters. In November 2020, I made the AST
state per-interpreter in <a class="reference external" href="https://bugs.python.org/issue41796">bpo-41796</a>
(<a class="reference external" href="https://github.com/python/cpython/commit/5cf4782a2630629d0978bf4cf6b6340365f449b2">commit</a>)
and test_ast started to leak:</p>
<pre class="literal-block">
$ ./python -m test -R 3:3 test_ast
test_ast leaked [23640, 23636, 23640] references, sum=70916
</pre>
<p>The fix is to call <tt class="docutils literal">_PyAST_Fini()</tt> earlier (<a class="reference external" href="https://github.com/python/cpython/commit/fd957c124c44441d9c5eaf61f7af8cf266bafcb1">commit</a>).</p>
<p>Python types contain a reference to themselves in their
<tt class="docutils literal">PyTypeObject.tp_mro</tt> member (the MRO tuple: Method Resolution Order).
<tt class="docutils literal">_PyAST_Fini()</tt> must be called before the last GC collection to destroy AST
types.</p>
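<p>This self-reference is visible from Python: <tt class="docutils literal">tp_mro</tt> is exposed as
<tt class="docutils literal">__mro__</tt>, and the tuple starts with the type itself, so every type sits
in a small reference cycle that only a GC collection can break:</p>

```python
import gc

class Node:
    pass

# The MRO tuple references the type itself: a reference cycle.
assert Node.__mro__[0] is Node

# type's traverse function visits tp_mro, so the GC can see the cycle.
assert any(ref is Node.__mro__ for ref in gc.get_referents(Node))
```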
<p><tt class="docutils literal">_PyInterpreterState_Clear()</tt> now calls <tt class="docutils literal">_PyAST_Fini()</tt>. It now also
calls <tt class="docutils literal">_PyWarnings_Fini()</tt> on subinterpreters, not only on the main
interpreter.</p>
</div>
<div class="section" id="thread-lock-traverse">
<h2>_thread lock traverse</h2>
<p>In December 2020, while I tried to port the <tt class="docutils literal">_thread</tt> extension module to the multiphase initialization API
(PEP 489), test_threading started to leak:</p>
<pre class="literal-block">
$ ./python -m test -R 3:3 test_threading
test_threading leaked [56, 56, 56] references, sum=168
</pre>
<p>As usual, the workaround was to force a second GC collection in <tt class="docutils literal">interpreter_clear()</tt>:</p>
<pre class="literal-block">
/* Last garbage collection on this interpreter */
_PyGC_CollectNoFail(tstate);
+ _PyGC_CollectNoFail(tstate);
_PyGC_Fini(tstate);
</pre>
<p>It took me two days to fully understand the problem. I drew the reference cycles
on paper to help me understand them:</p>
<img alt="_thread GC bug" src="https://vstinner.github.io/images/thread_gc_bug.jpg" />
<p>There are two cycles:</p>
<ul class="simple">
<li>Cycle 1:<ul>
<li>at fork function</li>
<li>-> __main__ module dict</li>
<li>-> at fork function</li>
</ul>
</li>
<li>Cycle 2:<ul>
<li>_thread lock type</li>
<li>-> lock type methods</li>
<li>-> _thread module dict</li>
<li>-> _thread local type</li>
<li>-> _thread module</li>
<li>-> _thread module state</li>
<li>-> _thread lock type</li>
</ul>
</li>
</ul>
<p>Moreover, there is a link between these two reference cycles: an instance of
the lock type.</p>
<p>I fixed the issue by adding a traverse function to the lock type and adding the
<tt class="docutils literal">Py_TPFLAGS_HAVE_GC</tt> flag to the type (<a class="reference external" href="https://github.com/python/cpython/commit/6104013838e181e3c698cb07316f449a0c31ea96">commit</a>):</p>
<pre class="literal-block">
+static int
+lock_traverse(lockobject *self, visitproc visit, void *arg)
+{
+ Py_VISIT(Py_TYPE(self));
+ return 0;
+}
</pre>
</div>
<div class="section" id="notes-on-weird-gc-bugs">
<h2>Notes on weird GC bugs</h2>
<ul class="simple">
<li><tt class="docutils literal">gc.get_referents()</tt> and <tt class="docutils literal">gc.get_referrers()</tt> can be used to check
traverse functions.</li>
<li><tt class="docutils literal">gc.is_tracked()</tt> can be used to check if the GC tracks an object.</li>
<li>Using the <tt class="docutils literal">gdb</tt> debugger on <tt class="docutils literal">gc_collect_main()</tt> helps to see which
objects are collected. See for example the <tt class="docutils literal">finalize_garbage()</tt> function,
which calls finalizers on unreachable objects.</li>
<li>The solution is usually a missing traverse function or a missing
<tt class="docutils literal">Py_VISIT()</tt> in an existing traverse function.</li>
<li>GC bugs are hard to debug :-)</li>
</ul>
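<p>The first two helpers can be demonstrated on a trivial self-referencing
container:</p>

```python
import gc

cycle = []
cycle.append(cycle)  # a list that references itself

# The GC tracks container objects.
assert gc.is_tracked(cycle)

# get_referents() exposes what the traverse function visits: a missing
# Py_VISIT() in C code shows up as a missing entry here.
assert gc.get_referents(cycle) == [cycle]

# get_referrers() goes the other way: which tracked objects reference this one?
assert any(ref is cycle for ref in gc.get_referrers(cycle))
```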
<p>Thanks <strong>Pablo Galindo</strong> for helping me to debug all these tricky GC bugs!</p>
<p>Thanks to everybody who is helping to better isolate subinterpreters by
converting extension modules to the multiphase initialization API (PEP 489) and
by converting dozens of static types to heap types. We have made huge progress in
recent months!</p>
</div>
GIL bugfixes for daemon threads in Python 3.92020-04-04T22:00:00+02:002020-04-04T22:00:00+02:00Victor Stinnertag:vstinner.github.io,2020-04-04:/gil-bugfixes-daemon-threads-python39.html<a class="reference external image-reference" href="https://twitter.com/Bouletcorp/status/1241018332112998401"><img alt="`#CoronaMaison by Boulet" src="https://vstinner.github.io/images/coronamaison_boulet.jpg" /></a>
<p>My previous article <a class="reference external" href="https://vstinner.github.io/daemon-threads-python-finalization-python32.html">Daemon threads and the Python finalization in Python 3.2 and 3.3</a> introduces
issues caused by daemon threads in the Python finalization and past changes to
make them work.</p>
<p>This article is about bugfixes of the infamous GIL (Global Interpreter Lock) in
Python 3.9, between …</p><a class="reference external image-reference" href="https://twitter.com/Bouletcorp/status/1241018332112998401"><img alt="`#CoronaMaison by Boulet" src="https://vstinner.github.io/images/coronamaison_boulet.jpg" /></a>
<p>My previous article <a class="reference external" href="https://vstinner.github.io/daemon-threads-python-finalization-python32.html">Daemon threads and the Python finalization in Python 3.2 and 3.3</a> introduces
issues caused by daemon threads in the Python finalization and past changes to
make them work.</p>
<p>This article is about bugfixes of the infamous GIL (Global Interpreter Lock) in
Python 3.9, between March 2019 and March 2020, for daemon threads during Python
finalization. Some bugs were old: up to 6 years old. Some bugs were triggered
by the on-going work on isolating subinterpreters in Python 3.9.</p>
<p>Drawing: <a class="reference external" href="https://twitter.com/Bouletcorp/status/1241018332112998401">#CoronaMaison by Boulet</a>.</p>
<div class="section" id="fix-1-exit-pyeval-acquirethread-if-finalizing">
<h2>Fix 1: Exit PyEval_AcquireThread() if finalizing</h2>
<p>In March 2019, <strong>Remy Noel</strong> created <a class="reference external" href="https://bugs.python.org/issue36469">bpo-36469</a>: a multithreaded Python application
using 20 daemon threads hangs randomly at exit on Python 3.5:</p>
<blockquote>
The bug happens about once every two weeks on a script that is fired more
than 10K times a day.</blockquote>
<p><strong>Eric Snow</strong> analyzed the bug and understood that it is related to daemon
threads and Python finalization. He identified that the <tt class="docutils literal">PyEval_AcquireLock()</tt>
and <tt class="docutils literal">PyEval_AcquireThread()</tt> functions take the GIL but don't exit the thread
if Python is finalizing.</p>
<p>When Python is finalizing and a daemon thread takes the GIL, Python can hang
randomly.</p>
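<p>The scenario is easy to set up, even if the hang itself was rare: a daemon
thread keeps taking and releasing the GIL while the main thread exits. A
minimal sketch (it exits cleanly on fixed Python versions):</p>

```python
import threading
import time

def worker():
    # A daemon thread repeatedly releases and re-takes the GIL.
    while True:
        time.sleep(0.001)

t = threading.Thread(target=worker, daemon=True)
t.start()
assert t.daemon and t.is_alive()
# When the main thread exits here, Python starts finalizing while the
# daemon thread may be blocked waiting for the GIL: before the fix,
# taking the GIL at that point could make the process hang randomly.
```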
<p>Eric created <a class="reference external" href="https://bugs.python.org/issue36475">bpo-36475</a> to propose to
modify <tt class="docutils literal">PyEval_AcquireLock()</tt> and <tt class="docutils literal">PyEval_AcquireThread()</tt> to also exit
the thread in this case. In April 2019, <strong>Joannah Nanjekye</strong> fixed the issue
with <a class="reference external" href="https://github.com/python/cpython/commit/f781d202a2382731b43bade845a58d28a02e9ea1">commit f781d202</a>:</p>
<pre class="literal-block">
bpo-36475: Finalize PyEval_AcquireLock() and PyEval_AcquireThread() properly (GH-12667)
PyEval_AcquireLock() and PyEval_AcquireThread() now
terminate the current thread if called while the interpreter is
finalizing, making them consistent with PyEval_RestoreThread(),
Py_END_ALLOW_THREADS, and PyGILState_Ensure().
</pre>
<p>The fix adds an <tt class="docutils literal">exit_thread_if_finalizing()</tt> function which exits the thread if
Python is finalizing. This function is called after each <tt class="docutils literal">take_gil()</tt> call.</p>
<p>The fix is very similar to <tt class="docutils literal">PyEval_RestoreThread()</tt> fix made in 2013 (<a class="reference external" href="https://github.com/python/cpython/commit/0d5e52d3469a310001afe50689f77ddba6d554d1">commit
0d5e52d3</a>)
to fix <a class="reference external" href="https://bugs.python.org/issue1856#msg60014">bpo-1856</a> (Python crash
involving daemon threads during Python exit).</p>
</div>
<div class="section" id="fix-2-pyeval-restorethread-on-freed-tstate">
<h2>Fix 2: PyEval_RestoreThread() on freed tstate</h2>
<div class="section" id="concurrent-futures-crash-on-freebsd">
<h3>concurrent.futures crash on FreeBSD</h3>
<p>In December 2019, I reported <a class="reference external" href="https://bugs.python.org/issue39088">bpo-39088</a>:
test_concurrent_futures <strong>crashed randomly</strong> with a coredump on the AMD64 FreeBSD
Shared 3.x buildbot. In March 2020, I managed to reproduce the bug on FreeBSD
and I was able to debug the coredump in gdb:</p>
<pre class="literal-block">
(gdb) frame
#0 0x00000000003b518c in PyEval_RestoreThread (tstate=0x801f23790) at Python/ceval.c:387
387 _PyRuntimeState *runtime = tstate->interp->runtime;
(gdb) p tstate->interp
$3 = (PyInterpreterState *) 0xdddddddddddddddd
</pre>
<p>The Python thread state (<tt class="docutils literal">tstate</tt>) was freed. In debug mode, the "free()"
function of the Python memory allocator fills the freed memory block with the
<tt class="docutils literal">0xDD</tt> byte pattern (<tt class="docutils literal">D</tt> stands for "dead byte") to detect usage of freed
memory.</p>
<p>The problem is that Python finalization had already freed the memory of all
PyThreadState structures by the time <tt class="docutils literal">PyEval_RestoreThread(tstate)</tt> was called by a
daemon thread. <tt class="docutils literal">PyEval_RestoreThread()</tt> dereferences <tt class="docutils literal">tstate</tt>:</p>
<pre class="literal-block">
_PyRuntimeState *runtime = tstate->interp->runtime;
</pre>
<p>This bug is a regression caused by my change:
<a class="reference external" href="https://github.com/python/cpython/commit/01b1cc12e7c6a3d6a3d27ba7c731687d57aae92a">Add PyInterpreterState.runtime field</a>
of <a class="reference external" href="https://bugs.python.org/issue36710">bpo-36710</a>. I replaced:</p>
<pre class="literal-block">
void PyEval_RestoreThread(PyThreadState *tstate) {
_PyRuntimeState *runtime = &_PyRuntime;
...
}
</pre>
<p>with:</p>
<pre class="literal-block">
void PyEval_RestoreThread(PyThreadState *tstate) {
_PyRuntimeState *runtime = tstate->interp->runtime;
...
}
</pre>
</div>
<div class="section" id="fix-pyeval-restorethread-for-daemon-threads">
<h3>Fix PyEval_RestoreThread() for daemon threads</h3>
<p>I created <a class="reference external" href="https://bugs.python.org/issue39877">bpo-39877</a> to investigate
this bug. I managed to reproduce the crash on Linux with a script spawning
daemon threads which sleep randomly between 0.0 and 1.0 second, and by adding
a <tt class="docutils literal">sleep(1);</tt> call at <tt class="docutils literal">Py_RunMain()</tt> exit.</p>
<p>I wrote a <tt class="docutils literal">PyEval_RestoreThread()</tt> fix which accesses
<tt class="docutils literal">_PyRuntimeState.finalizing</tt> without holding the GIL.</p>
<p><strong>Antoine Pitrou</strong> asked me to convert <tt class="docutils literal">_PyRuntimeState.finalizing</tt> to an
atomic variable to avoid inconsistencies in case of parallel accesses. On March
7, 2020, I pushed <a class="reference external" href="https://github.com/python/cpython/commit/7b3c252dc7f44d4bdc4c7c82d225ebd09c78f520">commit 7b3c252d</a>:
<pre class="literal-block">
bpo-39877: _PyRuntimeState.finalizing becomes atomic (GH-18816)
Convert _PyRuntimeState.finalizing field to an atomic variable:
* Rename it to _finalizing
* Change its type to _Py_atomic_address
* Add _PyRuntimeState_GetFinalizing() and _PyRuntimeState_SetFinalizing()
functions
* Remove _Py_CURRENTLY_FINALIZING() function: replace it with testing
directly _PyRuntimeState_GetFinalizing() value
Convert _PyRuntimeState_GetThreadState() to static inline function.
</pre>
<p>The day after, I pushed my fix, <a class="reference external" href="https://github.com/python/cpython/commit/eb4e2ae2b8486e8ee4249218b95d94a9f0cc513e">commit eb4e2ae2</a>:</p>
<pre class="literal-block">
bpo-39877: Fix PyEval_RestoreThread() for daemon threads (GH-18811)
* exit_thread_if_finalizing() does now access directly _PyRuntime
variable, rather than using tstate->interp->runtime since tstate
can be a dangling pointer after Py_Finalize() has been called.
* exit_thread_if_finalizing() is now called *before* calling
take_gil(). _PyRuntime.finalizing is an atomic variable,
we don't need to hold the GIL to access it.
</pre>
<p><tt class="docutils literal">exit_thread_if_finalizing()</tt> is now called <strong>before</strong> <tt class="docutils literal">take_gil()</tt> to
ensure that <tt class="docutils literal">take_gil()</tt> cannot be called with an invalid Python thread state
(<tt class="docutils literal">tstate</tt>).</p>
<p>I commented <em>naively</em>:</p>
<blockquote>
Ok, it should now be fixed.</blockquote>
</div>
</div>
<div class="section" id="clear-python-thread-states-earlier-my-first-failed-attempt-in-2013">
<h2>Clear Python thread states earlier: my first failed attempt in 2013</h2>
<p>In 2013, I opened <a class="reference external" href="https://bugs.python.org/issue19466">bpo-19466</a> to clear
the Python thread states of threads earlier during Python finalization. My
intent was to display <tt class="docutils literal">ResourceWarning</tt> warnings of daemon threads as well.
In November 2013, I pushed <a class="reference external" href="https://github.com/python/cpython/commit/45956b9a33af634a2919ade64c1dd223ab2d5235">commit 45956b9a</a>:</p>
<pre class="literal-block">
Close #19466: Clear the frames of daemon threads earlier during the Python
shutdown to call objects destructors. So "unclosed file" resource warnings
are now correctly emitted for daemon threads.
</pre>
<p>Later, I discovered a crash in the garbage collector while trying to
reproduce a race condition in asyncio: I created <a class="reference external" href="https://bugs.python.org/issue20526">bpo-20526</a>. Sadly, this bug was triggered by my
previous change. I decided that it was safer to revert my change.</p>
<p>By the way, when I looked again at <a class="reference external" href="https://bugs.python.org/issue20526">bpo-20526</a>, I was able to reproduce the
garbage collector bug again, likely because of recent changes. With the help of
<strong>Pablo Galindo Salgado</strong>, we <a class="reference external" href="https://bugs.python.org/issue20526#msg364851">understood the root issue</a>. On March 24, 2020, I pushed
a fix (<a class="reference external" href="https://github.com/python/cpython/commit/5804f878e779712e803be927ca8a6df389d82cdf">commit</a>)
to finally fix this 6-year-old bug! The fix removes the following line from
<tt class="docutils literal">PyThreadState_Clear()</tt>:</p>
<pre class="literal-block">
Py_CLEAR(tstate->frame);
</pre>
</div>
<div class="section" id="fix-3-exit-also-take-gil-at-exit-point-if-finalizing">
<h2>Fix 3: Exit also take_gil() at exit point if finalizing</h2>
<p>After fixing <tt class="docutils literal">PyEval_RestoreThread()</tt>, I decided to attempt again to fix
<a class="reference external" href="https://bugs.python.org/issue19466">bpo-19466</a> (clear earlier Python thread
states). Sadly, I discovered that my <tt class="docutils literal">PyEval_RestoreThread()</tt> fix
<strong>introduced a race condition</strong>!</p>
<p>While the main thread finalizes Python, daemon threads can be waiting for the
GIL: they block in <tt class="docutils literal">take_gil()</tt>. When the main thread releases the GIL during
finalization, a daemon thread takes the GIL instead of exiting. Daemon threads
only check if they must exit <strong>before</strong> trying to take the GIL.</p>
<p>The solution is to call <tt class="docutils literal">exit_thread_if_finalizing()</tt> twice in
<tt class="docutils literal">take_gil()</tt>: before <strong>and</strong> after taking the GIL.</p>
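<p>The double check can be sketched with a plain <tt class="docutils literal">threading.Lock</tt> standing
in for the GIL and an event standing in for the finalizing flag (all names here
are hypothetical, not the real ceval internals):</p>

```python
import threading

gil = threading.Lock()          # stands in for the GIL
finalizing = threading.Event()  # stands in for the "Python is finalizing" flag

def take_gil_sketch():
    # Check *before* blocking: don't wait for a GIL that will never be
    # handed over once finalization has started.
    if finalizing.is_set():
        raise SystemExit
    gil.acquire()
    # Check *again* after acquiring: finalization may have started while
    # this thread was waiting for the lock.
    if finalizing.is_set():
        gil.release()
        raise SystemExit
```

<p>A daemon thread calling this sketch after finalization has started exits
instead of holding the lock, which is the behavior the real fix restores.</p>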
<p>In March 2020, I pushed <a class="reference external" href="https://github.com/python/cpython/commit/9229eeee105f19705f72e553cf066751ac47c7b7">commit 9229eeee</a>:</p>
<pre class="literal-block">
bpo-39877: take_gil() checks tstate_must_exit() twice (GH-18890)
take_gil() now also checks tstate_must_exit() after acquiring
the GIL: exit the thread if Py_Finalize() has been called.
</pre>
<p>I commented:</p>
<blockquote>
<p>I ran multiple times <tt class="docutils literal">daemon_threads_exit.py</tt> with <tt class="docutils literal">slow_exit.patch</tt>:
no crash.</p>
<p>I also ran multiple times <tt class="docutils literal">stress.py</tt> + <tt class="docutils literal">sleep_at_exit.patch</tt> of
bpo-37135: no crash.</p>
<p>And I tested <tt class="docutils literal">asyncio_gc.py</tt> of bpo-19466: no crash neither.</p>
<p><strong>Python finalization now looks reliable.</strong> I'm not sure if it's "more"
reliable than previously, but at least, I cannot get a crash anymore, even
after bpo-19466 has been fixed (clear Python thread states of daemon
threads earlier).</p>
</blockquote>
<p>Funny fact: in June 2019, <strong>Eric Snow</strong> introduced a very similar bug in <a class="reference external" href="https://bugs.python.org/issue36818">bpo-36818</a> with <a class="reference external" href="https://github.com/python/cpython/commit/396e0a8d9dc65453cb9d53500d0a620602656cfe">commit 396e0a8d</a>:
test_multiprocessing_spawn segfaulted on FreeBSD (<a class="reference external" href="https://bugs.python.org/issue37135">bpo-37135</a>). At the time, I didn't have the bandwidth to
investigate the root cause: I just reverted Eric's change to fix the issue.</p>
</div>
<div class="section" id="fix-4-exit-take-gil-while-waiting-for-the-gil-if-finalizing">
<h2>Fix 4: Exit take_gil() while waiting for the GIL if finalizing</h2>
<p>While I was working on moving pending calls from <tt class="docutils literal">_PyRuntime</tt> to
<tt class="docutils literal">PyInterpreterState</tt>, <a class="reference external" href="https://bugs.python.org/issue39984">bpo-39984</a>, I hit
another bug.</p>
<p>On March 18, 2020, I pushed a <tt class="docutils literal">take_gil()</tt> fix to avoid accessing <tt class="docutils literal">tstate</tt>
if Python is finalizing, <a class="reference external" href="https://github.com/python/cpython/commit/29356e03d4f8800b04f799efe7a10e3ce8b16f61">commit 29356e03</a>:</p>
<pre class="literal-block">
bpo-39877: Fix take_gil() for daemon threads (GH-19054)
bpo-39877, bpo-39984: If the thread must exit, don't access tstate to
prevent a potential crash: tstate memory has been freed.
</pre>
<p>And while working on the inefficient signal handling in multithreaded
applications (<a class="reference external" href="https://bugs.python.org/issue40010">bpo-40010</a>), I discovered
that the previous fix was not enough!</p>
<p>On March 19, 2020, I pushed another <tt class="docutils literal">take_gil()</tt> fix to exit the thread
while it is waiting for the GIL if Python is finalizing, <a class="reference external" href="https://github.com/python/cpython/commit/a36adfa6bbf5e612a4d4639124502135690899b8">commit a36adfa6</a>:</p>
<pre class="literal-block">
bpo-39877: 4th take_gil() fix for daemon threads (GH-19080)
bpo-39877, bpo-40010: Add a third tstate_must_exit() check in
take_gil() to prevent using tstate which has been freed.
</pre>
<p>I can only hope that this fix is the last one needed to handle all corner cases
with daemon threads in <tt class="docutils literal">take_gil()</tt> (<a class="reference external" href="https://bugs.python.org/issue39877">bpo-39877</a>)!</p>
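<p>The behavior can be observed from Python. In the sketch below (the embedded script is only illustrative), a daemon thread repeatedly drops and re-takes the GIL by sleeping; when the main thread finalizes Python, the daemon thread silently exits in <tt class="docutils literal">take_gil()</tt> instead of crashing on a freed thread state:</p>

```python
import subprocess
import sys

# The embedded script is illustrative: a daemon thread drops and re-takes
# the GIL in a loop (each sleep does both), then the main thread exits.
script = """
import threading, time

def worker():
    while True:
        time.sleep(0.01)  # each sleep drops and then re-takes the GIL

threading.Thread(target=worker, daemon=True).start()
time.sleep(0.1)
print("main thread exiting")
"""

proc = subprocess.run([sys.executable, "-c", script],
                      capture_output=True, text=True)
print(proc.returncode)
```

<p>On a current CPython, the subprocess exits cleanly (code 0): the daemon thread simply disappears during finalization.</p>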
</div>
<div class="section" id="summary-of-gil-bugfixes">
<h2>Summary of GIL bugfixes</h2>
<p>The GIL got 5 main bugfixes for daemon threads and Python finalization:</p>
<ul class="simple">
<li>May 2011, <strong>Antoine Pitrou</strong>,
<a class="reference external" href="https://github.com/python/cpython/commit/0d5e52d3469a310001afe50689f77ddba6d554d1">commit 0d5e52d3</a>:
<tt class="docutils literal">take_gil()</tt> exits if finalizing <strong>after</strong> taking the GIL (1 check)</li>
<li>April 2019, <strong>Joannah Nanjekye</strong>,
<a class="reference external" href="https://github.com/python/cpython/commit/f781d202a2382731b43bade845a58d28a02e9ea1">commit f781d202</a>:
<tt class="docutils literal">PyEval_AcquireLock()</tt> and <tt class="docutils literal">PyEval_AcquireThread()</tt> also exit if Python is finalizing</li>
<li>March 8, 2020, <strong>Victor Stinner</strong>,
<a class="reference external" href="https://github.com/python/cpython/commit/eb4e2ae2b8486e8ee4249218b95d94a9f0cc513e">commit eb4e2ae2</a>:
<tt class="docutils literal">take_gil()</tt> exits if finalizing <strong>before</strong> taking the GIL (1 check)</li>
<li>March 9, 2020, <strong>Victor Stinner</strong>,
<a class="reference external" href="https://github.com/python/cpython/commit/9229eeee105f19705f72e553cf066751ac47c7b7">commit 9229eeee</a>:
<tt class="docutils literal">take_gil()</tt> exits if finalizing <strong>before and after</strong> taking the GIL (2 checks)</li>
<li>March 19, 2020, <strong>Victor Stinner</strong>,
<a class="reference external" href="https://github.com/python/cpython/commit/a36adfa6bbf5e612a4d4639124502135690899b8">commit a36adfa6</a>:
<tt class="docutils literal">take_gil()</tt> exits if finalizing <strong>before, while, and after</strong> taking the GIL (3 checks)</li>
</ul>
</div>
<h1>Threading shutdown race condition (2020-04-03)</h1>
<p>This article is about a race condition in threading shutdown that I fixed in
Python 3.9 in June 2019. I also forbade spawning daemon threads in
subinterpreters to fix another related bug.</p>
<a class="reference external image-reference" href="https://twitter.com/neeljulien/status/1240292383369150464"><img alt="#CoronaMaison by Julien Neel" src="https://vstinner.github.io/images/coronamaison_jneel.jpg" /></a>
<p>Drawing: <a class="reference external" href="https://twitter.com/neeljulien/status/1240292383369150464">#CoronaMaison by Julien Neel</a>.</p>
<div class="section" id="race-condition-in-threading-shutdown">
<h2>Race condition in threading shutdown</h2>
<div class="section" id="random-test-failure-noticed-on-freebsd-buildbot">
<h3>Random test failure noticed on FreeBSD buildbot</h3>
<p>In March 2019, I noticed that <tt class="docutils literal">test_threading.test_threads_join_2()</tt> was
killed by SIGABRT on the FreeBSD CURRENT buildbot, <a class="reference external" href="https://bugs.python.org/issue36402">bpo-36402</a>:</p>
<pre class="literal-block">
Fatal Python error: Py_EndInterpreter: not the last thread
</pre>
<p>The <tt class="docutils literal">test_threads_join_2()</tt> test <strong>failed randomly</strong> on buildbots when tests
were <strong>run in parallel</strong>, but test_threading <strong>passed</strong> when it was <strong>re-run
sequentially</strong>. Such a failure was silently ignored, since the build was seen
overall as a success.</p>
<p>The <tt class="docutils literal">test_threading.test_threads_join_2()</tt> test was added in 2013 by <a class="reference external" href="https://github.com/python/cpython/commit/7b4769937fb612d576b6829c3b834f3dd31752f1">commit
7b476993</a>.</p>
<p>In 2016, I already reported the same test failure: <a class="reference external" href="https://bugs.python.org/issue27791">bpo-27791</a> (same test, also on FreeBSD). And
Christian Heimes reported a similar issue: <a class="reference external" href="https://bugs.python.org/issue28084">bpo-28084</a>. I simply closed these issues because I
only saw the failure once in 4 months and <strong>I didn't have access to FreeBSD to
attempt to reproduce the crash</strong>.</p>
</div>
<div class="section" id="reproduce-the-race-condition">
<h3>Reproduce the race condition</h3>
<p>In 2019, I had a FreeBSD VM to attempt to reproduce the bug locally.</p>
<p>In June 2019, I found a reliable way to reproduce the bug by <a class="reference external" href="https://github.com/python/cpython/pull/13889/files">adding random
sleeps to the test</a>. With
this patch, I was also able to reproduce the bug on Linux. <strong>I am way more
comfortable debugging an issue on Linux</strong> with my favorite debugging tools!</p>
<p>I identified a race condition in the Python finalization. I also understood
that the bug was not specific to subinterpreters:</p>
<blockquote>
The test shows the bug using subinterpreters (Py_EndInterpreter), but
<strong>the bug also exists in Py_Finalize()</strong> which has the same race condition.</blockquote>
<p>I wrote a patch for <tt class="docutils literal">Py_Finalize()</tt> to help me reproduce the bug without
subinterpreters:</p>
<pre class="literal-block">
+ if (tstate != interp->tstate_head || tstate->next != NULL) {
+ Py_FatalError("Py_EndInterpreter: not the last thread");
+ }
</pre>
</div>
<div class="section" id="threading-shutdown-race-condition-1">
<h3>threading._shutdown() race condition</h3>
<p><tt class="docutils literal">threading._shutdown()</tt> uses <tt class="docutils literal">threading.enumerate()</tt> which iterates over
the <tt class="docutils literal">threading._active</tt> dictionary.</p>
<p><tt class="docutils literal">threading.Thread</tt> registers itself into <tt class="docutils literal">threading._active</tt> when the
thread starts. It unregisters itself from <tt class="docutils literal">threading._active</tt> when it
completes.</p>
<p>The bug occurs when the thread is unregistered while the underlying native
thread is still running and <strong>the Python thread state has not been deleted yet</strong>.</p>
<p><tt class="docutils literal">_thread._set_sentinel()</tt> creates a lock and registers a
<tt class="docutils literal"><span class="pre">tstate->on_delete</span></tt> callback to release this lock. It's called by
<tt class="docutils literal">threading.Thread</tt> when the thread starts to set
<tt class="docutils literal">threading.Thread._tstate_lock</tt>. This lock is used by the
<tt class="docutils literal">threading.Thread.join()</tt> method to wait until the thread completes.</p>
<p><tt class="docutils literal">_thread.start_new_thread()</tt> calls the C function <tt class="docutils literal">t_bootstrap()</tt> which
ends with:</p>
<pre class="literal-block">
tstate->interp->num_threads--;
PyThreadState_Clear(tstate);
PyThreadState_DeleteCurrent();
PyThread_exit_thread();
</pre>
<p>When the native thread completes, <tt class="docutils literal">_PyThreadState_DeleteCurrent()</tt> is called:
it calls <tt class="docutils literal"><span class="pre">tstate->on_delete()</span></tt> callback which releases
<tt class="docutils literal">threading.Thread._tstate_lock</tt> lock.</p>
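<p>The mechanism can be modeled in pure Python. The sketch below is illustrative only (the class and attribute names mimic CPython but are not its real implementation): a sentinel lock plays the role of <tt class="docutils literal">threading.Thread._tstate_lock</tt>, acquired when the thread starts and only released by the equivalent of the <tt class="docutils literal"><span class="pre">tstate->on_delete</span></tt> callback, so that <tt class="docutils literal">join()</tt> really waits until the thread state is gone:</p>

```python
import threading

class SentinelThread:
    """Toy model of CPython's _tstate_lock mechanism (illustrative names)."""

    def __init__(self, target):
        self._target = target
        # Plays the role of threading.Thread._tstate_lock, created by
        # _thread._set_sentinel() in the real implementation.
        self._tstate_lock = threading.Lock()

    def start(self):
        self._tstate_lock.acquire()  # held while the "thread state" is alive
        self._thread = threading.Thread(target=self._bootstrap)
        self._thread.start()

    def _bootstrap(self):
        try:
            self._target()
        finally:
            # Models the tstate->on_delete() callback: the lock is only
            # released once the "thread state" is deleted.
            self._tstate_lock.release()

    def join(self):
        # Wait until the sentinel lock is released, i.e. until the
        # "thread state" has been deleted, then join the native thread.
        self._tstate_lock.acquire()
        self._tstate_lock.release()
        self._thread.join()

results = []
t = SentinelThread(target=lambda: results.append(42))
t.start()
t.join()
print(results)
```

<p>After <tt class="docutils literal">join()</tt> returns, the sentinel lock is guaranteed to have been released by the finalization callback.</p>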
<p>The root issue is that:</p>
<ul class="simple">
<li><tt class="docutils literal">threading._shutdown()</tt> relies on the <tt class="docutils literal">threading._active</tt> dictionary;</li>
<li><tt class="docutils literal">Py_EndInterpreter()</tt> relies on the linked list of Python thread
states of the interpreter (<tt class="docutils literal"><span class="pre">interp->tstate_head</span></tt>).</li>
</ul>
<p>The lock on Python thread states (<tt class="docutils literal">threading.Thread._tstate_lock</tt>) and
<tt class="docutils literal">PyThreadState.on_delete</tt> callback were added in 2013 by <strong>Antoine Pitrou</strong>
to Python 3.4, <a class="reference external" href="https://github.com/python/cpython/commit/7b4769937fb612d576b6829c3b834f3dd31752f1">commit 7b476993</a>
of <a class="reference external" href="https://bugs.python.org/issue18808">bpo-18808</a>:</p>
<pre class="literal-block">
Issue #18808: Thread.join() now waits for the underlying thread state
to be destroyed before returning. This prevents unpredictable aborts
in Py_EndInterpreter() when some non-daemon threads are still running.
</pre>
</div>
<div class="section" id="fix-threading-shutdown">
<h3>Fix threading._shutdown()</h3>
<p>Finally in June 2019, I fixed the race condition in <tt class="docutils literal">threading._shutdown()</tt>
with <a class="reference external" href="https://github.com/python/cpython/commit/468e5fec8a2f534f1685d59da3ca4fad425c38dd">commit 468e5fec</a>:</p>
<pre class="literal-block">
bpo-36402: Fix threading._shutdown() race condition (GH-13948)
Fix a race condition at Python shutdown when waiting for threads. Wait
until the Python thread state of all non-daemon threads get deleted
(join all non-daemon threads), rather than just wait until Python
threads complete.
</pre>
<p>The fix modifies <tt class="docutils literal">threading._shutdown()</tt> to wait until the Python thread
states of all non-daemon threads are deleted, rather than calling the <tt class="docutils literal">join()</tt>
method of all non-daemon threads: the <tt class="docutils literal">join()</tt> method does not ensure that the
Python thread state is deleted.</p>
<p>The Python finalization calls <tt class="docutils literal">threading._shutdown()</tt> to wait until all
threads complete. Only non-daemon threads are awaited: daemon threads can
continue to run after <tt class="docutils literal">threading._shutdown()</tt>.</p>
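<p>This is easy to observe: even though the main thread returns immediately, the interpreter does not exit until the non-daemon thread has completed. A small demonstration, run as a subprocess (the embedded script is illustrative):</p>

```python
import subprocess
import sys

# The embedded script is illustrative: the main thread returns immediately,
# but threading._shutdown() waits for the non-daemon thread at exit.
script = """
import threading, time

def worker():
    time.sleep(0.2)
    print("worker done")

threading.Thread(target=worker).start()
# The main thread ends here; the Python finalization waits for the worker.
"""

proc = subprocess.run([sys.executable, "-c", script],
                      capture_output=True, text=True)
print(proc.stdout.strip())
```
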
<p><tt class="docutils literal">Py_EndInterpreter()</tt> requires that the Python thread states of all threads
have been deleted. <strong>What about daemon threads?</strong> More about that in the next
section ;-)</p>
<p>Note: This change introduced a regression (memory leak) which is not fixed yet:
<a class="reference external" href="https://bugs.python.org/issue37788">bpo-37788</a>.</p>
</div>
</div>
<div class="section" id="forbid-daemon-threads-in-subinterpreters">
<h2>Forbid daemon threads in subinterpreters</h2>
<p>In June 2019, while fixing the threading shutdown, I found a reliable way to
trigger a bug with daemon threads when a subinterpreter is finalized:</p>
<pre class="literal-block">
Fatal Python error: Py_EndInterpreter: not the last thread
</pre>
<p>By design, daemon threads can run after a Python interpreter is finalized,
whereas <tt class="docutils literal">Py_EndInterpreter()</tt> requires that all threads have completed.</p>
<p>I reported <a class="reference external" href="https://bugs.python.org/issue37266">bpo-37266</a> to propose to
forbid the creation of daemon threads in subinterpreters. I fixed the issue
with <a class="reference external" href="https://github.com/python/cpython/commit/066e5b1a917ec2134e8997d2cadd815724314252">commit 066e5b1a</a>:</p>
<pre class="literal-block">
bpo-37266: Daemon threads are now denied in subinterpreters (GH-14049)
In a subinterpreter, spawning a daemon thread now raises an
exception. Daemon threads were never supported in subinterpreters.
Previously, the subinterpreter finalization crashed with a Python
fatal error if a daemon thread was still running.
</pre>
<p>The change adds this check to <tt class="docutils literal">Thread.start()</tt>:</p>
<pre class="literal-block">
if self.daemon and not _is_main_interpreter():
raise RuntimeError("daemon thread are not supported "
"in subinterpreters")
</pre>
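<p>In the main interpreter the check is a no-op: the private helper <tt class="docutils literal">_is_main_interpreter()</tt> returns true and daemon threads start as usual. A quick check (guarded with <tt class="docutils literal">hasattr()</tt> because the helper is private and may change across Python versions):</p>

```python
import _thread
import threading

# _is_main_interpreter() is the private helper used by the check; it may
# change between Python versions, hence the hasattr() guard.
if hasattr(_thread, "_is_main_interpreter"):
    print("main interpreter:", _thread._is_main_interpreter())

# In the main interpreter, spawning a daemon thread is allowed.
t = threading.Thread(target=lambda: None, daemon=True)
t.start()
t.join()
```
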
<p>I commented:</p>
<blockquote>
<strong>Daemon threads must die.</strong> That's a first step towards their death!</blockquote>
<p><strong>Antoine Pitrou</strong> created <a class="reference external" href="https://bugs.python.org/issue39812">bpo-39812: Avoid daemon threads in
concurrent.futures</a> as a follow-up.</p>
<p>In February 2020, when rebuilding Fedora Rawhide with Python 3.9, <strong>Miro
Hrončok</strong> of my team noticed that my change <a class="reference external" href="https://bugzilla.redhat.com/show_bug.cgi?id=1792062">broke the python-jep project</a>. I <a class="reference external" href="https://github.com/ninia/jep/issues/229">reported the bug
upstream</a>. It was fixed by
using regular threads rather than daemon threads: <a class="reference external" href="https://github.com/ninia/jep/commit/a31d461c6cacc96de68d68320eaa83e19a45d0cc">commit</a>.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>A random failure on a FreeBSD buildbot was hiding a severe race condition in
the threading shutdown. The bug had existed since 2013, but was silently
ignored since the test passed when re-run.</p>
<p>The race condition was that the threading shutdown didn't ensure that the
Python thread states of all non-daemon threads were deleted, whereas this is a
<tt class="docutils literal">Py_EndInterpreter()</tt> requirement.</p>
<p>I fixed the threading shutdown by waiting until the Python thread states of all
non-daemon threads are deleted.</p>
<p>I also modified <tt class="docutils literal">Thread.start()</tt> to forbid spawning daemon threads in Python
subinterpreters to fix a related issue.</p>
</div>
<h1>Daemon threads and the Python finalization in Python 3.2 and 3.3 (2020-03-26)</h1>
<a class="reference external image-reference" href="https://twitter.com/LuppiChan/status/1240346448606171136"><img alt="#CoronaMaison by Luppi" src="https://vstinner.github.io/images/coronamaison_luppi.jpg" /></a>
<p>At exit, the Python finalization calls Python object finalizers (the
<tt class="docutils literal">__del__()</tt> method) and deallocates memory. Daemon threads are a special
kind of thread which continues to run during and after the Python finalization.
They cause race conditions and tricky bugs in the Python finalization.</p>
<p>This article covers bugs fixed in the Python finalization in Python 3.2 and
Python 3.3 (2009 to 2011), and a backport in Python 2.7.8 (2014).</p>
<p>Drawing: <a class="reference external" href="https://twitter.com/LuppiChan/status/1240346448606171136">#CoronaMaison by Luppi</a>.</p>
<div class="section" id="daemon-threads">
<h2>Daemon threads</h2>
<p>Python has a special kind of thread: "daemon" threads. The difference with
regular threads is that Python doesn't wait until daemon threads complete at
exit, whereas it waits until all regular ("non-daemon") threads complete.
Example:</p>
<pre class="literal-block">
import threading, time
thread = threading.Thread(target=time.sleep, args=(5.0,), daemon=False)
thread.start()
</pre>
<p>This Python program spawns a regular thread which sleeps for 5 seconds. Python
takes 5 seconds to exit:</p>
<pre class="literal-block">
$ time python3 sleep.py
real 0m5,047s
</pre>
<p>If <tt class="docutils literal">daemon=False</tt> is replaced with <tt class="docutils literal">daemon=True</tt> to spawn a daemon thread
instead, Python exits immediately (57 ms):</p>
<pre class="literal-block">
$ time python3 sleep.py
real 0m0,057s
</pre>
<p>Note: The <tt class="docutils literal">Thread.join()</tt> method can be called explicitly to wait until a
daemon thread completes.</p>
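<p>The difference is easy to measure by running both variants as subprocesses, here with a 1 second sleep to keep the run short (the <tt class="docutils literal">exit_time()</tt> helper is just for this demonstration):</p>

```python
import subprocess
import sys
import time

def exit_time(daemon):
    """Time how long Python takes to exit with one sleeping thread."""
    script = (
        "import threading, time\n"
        "t = threading.Thread(target=time.sleep, args=(1.0,), "
        f"daemon={daemon})\n"
        "t.start()\n"
    )
    start = time.monotonic()
    subprocess.run([sys.executable, "-c", script], check=True)
    return time.monotonic() - start

regular = exit_time(daemon=False)  # waits for the thread: at least 1 second
daemon = exit_time(daemon=True)    # exits without waiting for the thread
print(f"regular: {regular:.2f}s, daemon: {daemon:.2f}s")
```
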
</div>
<div class="section" id="don-t-destroy-the-gil-at-exit">
<h2>Don't destroy the GIL at exit</h2>
<p>In November 2009, <strong>Antoine Pitrou</strong> implemented a new GIL (Global Interpreter
Lock) in Python 3.2: <a class="reference external" href="https://github.com/python/cpython/commit/074e5ed974be65fbcfe75a4c0529dbc53f13446f">commit 074e5ed9</a>.</p>
<p>In September 2010, he found a crash with daemon threads while stressing
<tt class="docutils literal">test_threading</tt>: <a class="reference external" href="https://bugs.python.org/issue9901">bpo-9901: GIL destruction can fail</a>. <tt class="docutils literal">test_finalize_with_trace()</tt> failed
with:</p>
<pre class="literal-block">
Fatal Python error: pthread_mutex_destroy(gil_mutex) failed
</pre>
<p>He pushed a fix for this crash in Python 3.2, <a class="reference external" href="https://github.com/python/cpython/commit/b0b384b7c0333bf1183cd6f90c0a3f9edaadd6b9">commit b0b384b7</a>:</p>
<pre class="literal-block">
Issue #9901: Destroying the GIL in Py_Finalize() can fail if some other
threads are still running. Instead, reinitialize the GIL on a second
call to Py_Initialize().
</pre>
<p>The Python GIL internally uses a lock. If the lock is destroyed while a daemon
thread is waiting for it, the thread can crash. The fix is to <strong>no longer
destroy the GIL at exit</strong>.</p>
</div>
<div class="section" id="exit-the-thread-in-pyeval-restorethread">
<h2>Exit the thread in PyEval_RestoreThread()</h2>
<p>The Python finalization clears and deallocates the "Python thread state" of all
threads (in <tt class="docutils literal">PyInterpreterState_Delete()</tt>), which calls the Python object
finalizers of these threads. Calling a finalizer can drop the GIL to make a
system call: for example, closing a file drops the GIL. When the GIL is
dropped, a daemon thread is awakened to take the GIL. Since its Python thread
state was just deallocated, the daemon thread crashes.</p>
<p>This bug is a race condition. It depends on the order in which threads are
executed, objects are finalized, memory is deallocated, etc.</p>
<p>The crash was first reported in April 2005: <a class="reference external" href="https://bugs.python.org/issue1193099">bpo-1193099: Embedded python thread
crashes</a>. In January 2008, <strong>Gregory P.
Smith</strong> reported <a class="reference external" href="https://bugs.python.org/issue1856#msg60014">bpo-1856: shutdown (exit) can hang or segfault with daemon
threads running</a>. He wrote a
short Python program reproducing the bug: spawn 40 daemon threads which do some
I/O operations and sleep randomly between 0 ms and 5 ms in a loop.</p>
<p><strong>Adam Olsen</strong> <a class="reference external" href="https://bugs.python.org/issue1856#msg60059">proposed a solution</a> (with a patch):</p>
<blockquote>
I think <strong>non-main threads should kill themselves off</strong> if they grab the
interpreter lock and the interpreter is tearing down. They're about to get
killed off anyway, when the process exits.</blockquote>
<p>In May 2011, <strong>Antoine Pitrou</strong> pushed a fix to Python 3.3 (6 years after the
first bug report) which implements this solution, <a class="reference external" href="https://github.com/python/cpython/commit/0d5e52d3469a310001afe50689f77ddba6d554d1">commit 0d5e52d3</a>:</p>
<pre class="literal-block">
Issue #1856: Avoid crashes and lockups when daemon threads run while the
interpreter is shutting down; instead, these threads are now killed when
they try to take the GIL.
</pre>
</div>
<div class="section" id="pyeval-restorethread-fix-explanation">
<h2>PyEval_RestoreThread() fix explanation</h2>
<p>The fix adds a new <tt class="docutils literal">_Py_Finalizing</tt> variable which is set by
<tt class="docutils literal">Py_Finalize()</tt> to the (Python thread state of the) thread which runs the
finalization.</p>
<p>Simplified patch of the <tt class="docutils literal">PyEval_RestoreThread()</tt> fix:</p>
<pre class="literal-block">
@@ -440,6 +440,12 @@ PyEval_RestoreThread()
take_gil(tstate);
+ if (_Py_Finalizing && tstate != _Py_Finalizing) {
+ drop_gil(tstate);
+ PyThread_exit_thread();
+ }
</pre>
<p>If Python is finalizing (<tt class="docutils literal">_Py_Finalizing</tt> is not NULL) and
<tt class="docutils literal">PyEval_RestoreThread()</tt> is called by a thread which is not the thread
running the finalization, the thread exits immediately (it calls
<tt class="docutils literal">PyThread_exit_thread()</tt>).</p>
<p><tt class="docutils literal">PyEval_RestoreThread()</tt> is called when a thread takes the GIL. Typical
example of code which drops the GIL to call a system call (closing a file
descriptor in the <tt class="docutils literal">io.FileIO()</tt> finalizer) and then takes the GIL again:</p>
<pre class="literal-block">
Py_BEGIN_ALLOW_THREADS
close(fd);
Py_END_ALLOW_THREADS
</pre>
<p>The <tt class="docutils literal">Py_BEGIN_ALLOW_THREADS</tt> macro calls <tt class="docutils literal">PyEval_SaveThread()</tt> to drop the
GIL, and the <tt class="docutils literal">Py_END_ALLOW_THREADS</tt> macro calls <tt class="docutils literal">PyEval_RestoreThread()</tt> to
take the GIL. Pseudo-code:</p>
<pre class="literal-block">
PyEval_SaveThread(); // drop the GIL
close(fd);
PyEval_RestoreThread(); // take the GIL
</pre>
<p>With Antoine's fix, if Python is finalizing, a thread now exits immediately
when calling <tt class="docutils literal">PyEval_RestoreThread()</tt>.</p>
</div>
<div class="section" id="revert-take-gil-backport-to-2-7">
<h2>Revert take_gil() backport to 2.7</h2>
<p>In June 2014, <strong>Benjamin Peterson</strong> (Python 2.7 release manager) backported
Antoine's change to Python 2.7: fix included in 2.7.8.</p>
<p>Problem: the Ceph project <a class="reference external" href="https://tracker.ceph.com/issues/8797">started to crash with Python 2.7.8</a>.</p>
<p>In November 2014, the change was reverted in Python 2.7.9: see
<a class="reference external" href="https://bugs.python.org/issue21963">bpo-21963 discussion</a> for the rationale.</p>
<p>In 2014, I already wrote:</p>
<blockquote>
Anyway, <strong>daemon threads are evil</strong> :-( Expecting them to exit cleanly
automatically is not good. Last time I tried to improve code to cleanup
Python at exit in Python 3.4, I also had a regression (just before the
release of Python 3.4.0): see the <a class="reference external" href="https://bugs.python.org/issue21788">issue #21788</a>.</blockquote>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>Daemon threads caused crashes in the Python finalization, first noticed in
2005.</p>
<p>Python 3.2 (released in February 2011) got a new GIL and also a bugfix for
daemon threads. Python 3.3 (released in September 2012) also got a bugfix for
daemon threads. The Python finalization became more reliable.</p>
<p>Changing the Python finalization is risky. A backport of a bugfix into Python 2.7.8
caused a regression which required reverting the bugfix in Python 2.7.9.</p>
</div>
<h1>Python 3.7 Development Mode (2020-01-16)</h1>
<a class="reference external image-reference" href="https://twitter.com/guinoir/status/1217146968029331456"><img alt="Ready to race" src="https://vstinner.github.io/images/ready_to_race.jpg" /></a>
<p>This article describes the discussion on the design of the <a class="reference external" href="https://docs.python.org/dev/using/cmdline.html#id5">development mode
(-X dev)</a> that I <strong>added
to Python 3.7</strong> and how it has been implemented.</p>
<p>The development mode enables runtime checks which are too expensive to be
enabled by default. It can be enabled with the <tt class="docutils literal">python3 <span class="pre">-X</span> dev</tt> command line option
or the <tt class="docutils literal">PYTHONDEVMODE=1</tt> environment variable. It helps developers spot
bugs in their code and prepare for future Python changes.</p>
<p>Drawing: <em>Ready to race, by Guillaume Singelin.</em></p>
<div class="section" id="email-sent-to-python-ideas">
<h2>Email sent to python-ideas</h2>
<p>In March 2016, I proposed <a class="reference external" href="https://mail.python.org/pipermail/python-ideas/2016-March/039314.html">Add a developer mode to Python: -X dev command line
option</a> on
the python-ideas list:</p>
<blockquote>
<p>When I develop on CPython, I'm always building Python in debug mode
using <tt class="docutils literal">./configure <span class="pre">--with-pydebug</span></tt>. This mode enables a <strong>lot</strong> of extra
checks which helps me to detect bugs earlier. The debug mode makes Python
much slower and so is not enabled by default.</p>
<p>I propose to add a "development mode" to Python, to get a few checks
to detect bugs earlier: a new <tt class="docutils literal"><span class="pre">-X</span> dev</tt> command line option. Example:</p>
<pre class="literal-block">
python3.6 -X dev script.py
</pre>
<p>I propose to enable:</p>
<ul class="simple">
<li>Show <tt class="docutils literal">DeprecationWarning</tt> and <tt class="docutils literal">ResourceWarning</tt> warnings: <tt class="docutils literal">python <span class="pre">-Wd</span></tt></li>
<li>Show <tt class="docutils literal">BytesWarning</tt> warning: <tt class="docutils literal">python <span class="pre">-b</span></tt></li>
<li>Enable Python assertions (<tt class="docutils literal">assert</tt>) and set <tt class="docutils literal">__debug__</tt> to True:
remove (or just ignore) <tt class="docutils literal"><span class="pre">-O</span></tt> or <tt class="docutils literal"><span class="pre">-OO</span></tt> command line arguments</li>
<li>faulthandler to get a Python traceback on segfault and fatal errors:
<tt class="docutils literal">python <span class="pre">-X</span> faulthandler</tt></li>
<li>Debug hooks on Python memory allocators: <tt class="docutils literal">PYTHONMALLOC=debug</tt></li>
</ul>
</blockquote>
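<p>As implemented in Python 3.7, these checks are easy to demonstrate: for example, a file left unclosed triggers a <tt class="docutils literal">ResourceWarning</tt> which is silent by default but displayed in development mode:</p>

```python
import os
import subprocess
import sys

# An unclosed file is collected silently by default; -X dev enables
# ResourceWarning. Strip variables which could change the default filters.
env = {k: v for k, v in os.environ.items()
       if k not in ("PYTHONDEVMODE", "PYTHONWARNINGS")}
code = "import os\nf = open(os.devnull)\ndel f"

default = subprocess.run([sys.executable, "-c", code],
                         capture_output=True, text=True, env=env)
dev = subprocess.run([sys.executable, "-X", "dev", "-c", code],
                     capture_output=True, text=True, env=env)

print("default:", "ResourceWarning" in default.stderr)
print("-X dev: ", "ResourceWarning" in dev.stderr)
```
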
<p>I wrote an implementation of this development mode using <tt class="docutils literal">exec()</tt>. <strong>Ronald
Oussoren</strong> <a class="reference external" href="https://bugs.python.org/issue26670#msg262659">commented on my patch</a>:</p>
<blockquote>
Why does this patch execv() the interpreter to set options? I'd expect it
to be possible to get the same result by updating the argument parsing code
in Py_Main.</blockquote>
<p>More on that later :-) <strong>Marc-Andre Lemburg</strong> <a class="reference external" href="https://mail.python.org/pipermail/python-ideas/2016-March/039325.html">didn't buy the idea</a>:</p>
<blockquote>
<strong>I'm not sure whether this would make things easier for the
majority of developers</strong>, e.g. someone not writing C extensions
would likely not be interested in debugging memory allocations
or segfaults, someone spending more time on numerics wouldn't
bother with bytes warnings, etc.</blockquote>
<p>This opinion was shared by <strong>Ethan Furman</strong>, so I gave up at this point and
closed my issue and my PR.</p>
</div>
<div class="section" id="async-keyword-deprecationwarning-and-pep-565">
<h2>async keyword, DeprecationWarning and PEP 565</h2>
<p>On November 1, 2017, Ned Deily, the Python 3.7 release manager,
sent an email to python-dev: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-November/150061.html">Reminder: 12 weeks to 3.7 feature code cutoff</a>.</p>
<p>A discussion started on <tt class="docutils literal">async</tt> and <tt class="docutils literal">await</tt> becoming keywords and how this
incompatible change was conducted. Read LWN article <a class="reference external" href="https://lwn.net/Articles/740804/">Who should see Python
deprecation warnings?</a> (December 2017) by
Jonathan Corbet for the whole story:</p>
<blockquote>
In early November, one sub-thread of a big discussion on preparing for the
Python 3.7 release focused on the await and async identifiers. They will
become keywords in 3.7, meaning that any code using those names for any
other purpose will break. Nick Coghlan observed that <strong>Python 3.6 does not
warn</strong> about the use of those names, calling it "a fairly major
oversight/bug". <strong>In truth, though, Python 3.6 does emit warnings in that
case — but users rarely see them.</strong></blockquote>
<p>The question is who should see <tt class="docutils literal">DeprecationWarning</tt>. A long time ago, it
was decided to hide these warnings by default to avoid bothering users: users
usually cannot fix them, so they are only a source of annoyance.</p>
<p>If the warning is displayed by default, developers can be annoyed by warnings
coming from code that they cannot easily fix, like third-party dependencies.</p>
<p>On November 12, 2017, Nick Coghlan proposed <a class="reference external" href="https://www.python.org/dev/peps/pep-0565/">PEP 565: Show DeprecationWarning
in __main__</a> as a compromise:</p>
<blockquote>
This change will mean that code entered at the interactive prompt and code
in single file scripts will revert to reporting these warnings by default,
while they will <strong>continue to be silenced by default for packaged code</strong>
distributed as part of an importable module.</blockquote>
<p>The PEP has been approved and implemented in Python 3.7. For example,
<tt class="docutils literal">DeprecationWarning</tt> is now displayed by default when running a script and in
the REPL:</p>
<pre class="literal-block">
$ cat example.py
import imp
$ python3 example.py
example.py:1: DeprecationWarning: the imp module is deprecated ...
import imp
$ python3
Python 3.7.6 (default, Dec 19 2019, 22:52:49)
>>> import imp
__main__:1: DeprecationWarning: the imp module is deprecated ...
</pre>
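<p>The difference between the old and the new default actions can be simulated with
the <tt class="docutils literal">warnings</tt> module; a minimal sketch (the <tt class="docutils literal">call_deprecated()</tt> function
name is made up for the example):</p>

```python
import warnings

def call_deprecated():
    warnings.warn("this API is deprecated", DeprecationWarning, stacklevel=2)

# Before PEP 565, the "ignore" action applied to DeprecationWarning everywhere.
with warnings.catch_warnings(record=True) as before_log:
    warnings.simplefilter("ignore", DeprecationWarning)
    call_deprecated()

# PEP 565 restores the "default" action (show once) for __main__ code.
with warnings.catch_warnings(record=True) as after_log:
    warnings.simplefilter("default", DeprecationWarning)
    call_deprecated()

print(len(before_log), len(after_log))  # 0 1
```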
</div>
<div class="section" id="development-mode-proposed-on-python-dev">
<h2>Development mode proposed on python-dev</h2>
<p>I was not convinced that only displaying warnings in the <tt class="docutils literal">__main__</tt> module is
enough to help developers fix issues in their code: a project is much larger
than just this module.</p>
<p>I came back with my idea, now on the python-dev list: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-November/150514.html">Add a developer mode to
Python: -X dev command line option</a>.</p>
<p>This mode shows <tt class="docutils literal">DeprecationWarning</tt> and <tt class="docutils literal">ResourceWarning</tt> in all modules,
not only in the <tt class="docutils literal">__main__</tt> module. In my opinion, having an opt-in mode for
developers is the best option. Python should not spam users with warnings which
are targeting developers.</p>
<p><strong>In the context of Python 3.7 incompatible changes, the feedback was way better
this time.</strong></p>
</div>
<div class="section" id="issues-with-the-python-initialization">
<h2>Issues with the Python initialization</h2>
<p>When I proposed the idea, my plan was to call exec() to replace the current
process with a new process. But when I tried to implement it, it turned out to
be trickier than expected. My first blocker issue was removing the <tt class="docutils literal"><span class="pre">-O</span></tt> option
from the command line. I hate having to parse the command line: it is very
fragile and it's too easy to make mistakes.</p>
<p>So I tried to write a clean implementation: configure Python properly in
"development mode". The blocker issue there was implementing
<tt class="docutils literal">PYTHONMALLOC=debug</tt>. The C code to read and apply the Python configuration
used Python objects before the Python initialization even started. For example,
<tt class="docutils literal"><span class="pre">-W</span></tt> and <tt class="docutils literal"><span class="pre">-X</span></tt> options were stored as Python lists. It means that the Python
memory allocator was used before Python could parse the <tt class="docutils literal">PYTHONMALLOC</tt>
environment variable.</p>
<p>Moreover, the Python configuration is quite complex. Many options are
inter-dependent. For example, the <tt class="docutils literal"><span class="pre">-E</span></tt> command line option ignores
environment variables with a name starting with <tt class="docutils literal">PYTHON</tt>, like
<tt class="docutils literal">PYTHONMALLOC</tt>! Python has to parse the command line before being able to
handle <tt class="docutils literal">PYTHONMALLOC</tt>.</p>
<p>Python lists depend on the memory allocator, which depends on the <tt class="docutils literal">PYTHONMALLOC</tt>
environment variable which depends on the <tt class="docutils literal"><span class="pre">-E</span></tt> command line option which
depends on Python lists...</p>
<p>In short, <strong>it wasn't possible to write a clean implementation of the
development mode without refactoring the Python initialization code</strong>.</p>
</div>
<div class="section" id="refactoring-main-c">
<h2>Refactoring main.c</h2>
<p>For all these reasons, I refactored Python initialization code in <tt class="docutils literal">main.c</tt>,
with <a class="reference external" href="https://bugs.python.org/issue32030">bpo-32030</a> with two <strong>large</strong>
changes:</p>
<ul class="simple">
<li><a class="reference external" href="https://github.com/python/cpython/commit/f7e5b56c37eb859e225e886c79c5d742c567ee95">commit f7e5b56c</a>:
bpo-32030: Split Py_Main() into subfunctions</li>
<li><a class="reference external" href="https://github.com/python/cpython/commit/a7368ac6360246b1ef7f8f152963c2362d272183">commit a7368ac6</a>:
bpo-32030: Enhance Py_Main()</li>
</ul>
</div>
<div class="section" id="add-x-dev-option">
<h2>Add -X dev option</h2>
<p>Once I got enough approval from my peers (core developers), I pushed <a class="reference external" href="https://github.com/python/cpython/commit/ccb0442a338066bf40fe417455e5a374e5238afb">commit
ccb0442a</a>
of <a class="reference external" href="https://bugs.python.org/issue32043">bpo-32043</a> to add the <tt class="docutils literal"><span class="pre">-X</span> dev</tt>
command line option. Thanks to the previous refactoring, the implementation is
less intrusive.</p>
<p>Effects of the development mode:</p>
<ul class="simple">
<li>Add <tt class="docutils literal">default</tt> warnings option. For example, display <tt class="docutils literal">DeprecationWarning</tt>
and <tt class="docutils literal">ResourceWarning</tt> warnings.</li>
<li>Install <a class="reference external" href="https://docs.python.org/dev/c-api/memory.html#c.PyMem_SetupDebugHooks">debug hooks on memory allocators</a> as if
<tt class="docutils literal">PYTHONMALLOC</tt> is set to <tt class="docutils literal">debug</tt>.</li>
<li>Enable my <a class="reference external" href="https://docs.python.org/dev/library/faulthandler.html">faulthandler</a> module to dump the
Python traceback on a crash.</li>
</ul>
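<p>The development mode can be checked at runtime with <tt class="docutils literal">sys.flags.dev_mode</tt>; a
small sketch spawning a child interpreter with <tt class="docutils literal"><span class="pre">-X</span> dev</tt> (assuming the
documented behavior that dev mode adds the <tt class="docutils literal">default</tt> warnings option):</p>

```python
import subprocess
import sys

# Run a child interpreter in development mode and inspect its configuration:
# dev_mode is set, and the "default" warnings filter is added.
code = "import sys; print(sys.flags.dev_mode, sys.warnoptions)"
proc = subprocess.run([sys.executable, "-X", "dev", "-c", code],
                      capture_output=True, text=True, check=True)
output = proc.stdout.strip()
print(output)
```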
</div>
<div class="section" id="add-pythondevmode-environment-variable">
<h2>Add PYTHONDEVMODE environment variable</h2>
<p>In a PR review, Antoine Pitrou <a class="reference external" href="https://github.com/python/cpython/pull/4478#pullrequestreview-77874230">proposed</a>:</p>
<blockquote>
Speaking of which, perhaps it would be nice to set those environment
variables so that child processes launched using subprocess inherit them?</blockquote>
<p>I created <a class="reference external" href="https://bugs.python.org/issue32101">bpo-32101</a> to add
<tt class="docutils literal">PYTHONDEVMODE</tt> environment variable: <a class="reference external" href="https://github.com/python/cpython/commit/5e3806f8cfd84722fc55d4299dc018ad9b0f8401">commit 5e3806f8</a>.</p>
<p>Setting <tt class="docutils literal">PYTHONDEVMODE=1</tt> also enables the development mode in Python
child processes, without having to touch their command line.</p>
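<p>For example, the environment variable propagates naturally through
<tt class="docutils literal">subprocess</tt>; a sketch:</p>

```python
import os
import subprocess
import sys

# PYTHONDEVMODE=1 is inherited by Python child processes, whereas the
# "-X dev" command line option only affects the process it is passed to.
env = dict(os.environ, PYTHONDEVMODE="1")
child = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.flags.dev_mode)"],
    env=env, capture_output=True, text=True, check=True)
dev_mode = child.stdout.strip()
print(dev_mode)
```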
</div>
<div class="section" id="enable-asyncio-debug-mode">
<h2>Enable asyncio debug mode</h2>
<p>I created <a class="reference external" href="https://bugs.python.org/issue32047">bpo-32047: asyncio: enable debug mode when -X dev is used</a> and <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-November/150572.html">asked in the -X dev thread on
python-dev</a>:</p>
<blockquote>
What do you think? Is it ok to include asyncio in the global "developer mode"?</blockquote>
<p>Antoine Pitrou didn't like the idea because asyncio debug mode was "quite
expensive", but Yury Selivanov (one of the asyncio maintainers) and Barry
Warsaw liked the idea, so I merged my PR: <a class="reference external" href="https://github.com/python/cpython/commit/44862df2eeec62adea20672b0fe2a5d3e160569e">commit 44862df2</a>.</p>
<p>Antoine Pitrou created <a class="reference external" href="https://bugs.python.org/issue31970">bpo-31970: asyncio debug mode is very slow</a>. Fortunately, he found a way to make
asyncio debug mode more efficient by truncating tracebacks to 10 frames
(<a class="reference external" href="https://github.com/python/cpython/commit/921e9432a1461bbf312c9c6dcc2b916be6c05fa0">commit 921e9432</a>).</p>
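<p>The asyncio debug mode can also be enabled explicitly; a minimal sketch using
the <tt class="docutils literal">debug</tt> parameter of <tt class="docutils literal">asyncio.run()</tt>:</p>

```python
import asyncio

# -X dev enables asyncio debug mode globally; the same mode can be
# requested explicitly with the debug parameter of asyncio.run().
async def main():
    loop = asyncio.get_running_loop()
    return loop.get_debug()

debug_enabled = asyncio.run(main(), debug=True)
print(debug_enabled)
```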
</div>
<div class="section" id="fix-warnings-filters">
<h2>Fix warnings filters</h2>
<p>While checking warnings filters, I noticed that the development mode was hiding
some ResourceWarning warnings. I completed the documentation and fixed warnings
filters in <a class="reference external" href="https://bugs.python.org/issue32089">bpo-32089</a>.</p>
</div>
<div class="section" id="python-3-8-logs-close-exception">
<h2>Python 3.8 logs close() exception</h2>
<p>By default, Python silently ignores the <tt class="docutils literal">EBADF</tt> error (bad file descriptor),
which can lead to a <strong>severe crash</strong>, <a class="reference external" href="https://bugs.python.org/issue18748">bpo-18748</a> (simplified gdb traceback):</p>
<pre class="literal-block">
Program received signal SIGABRT, Aborted.
[Switching to Thread 0xb7b0eb70 (LWP 17152)]
0xb7fe1424 in __kernel_vsyscall ()
(gdb) bt
#0 0xb7fe1424 in __kernel_vsyscall ()
#1 0xb7e4e941 in *__GI_raise (sig=6)
#2 0xb7e51d72 in *__GI_abort ()
#3 0xb7e8ae15 in __libc_message (do_abort=1, fmt=0xb7f606f5 "%s")
#4 0xb7e8af44 in *__GI___libc_fatal (message=0xb7fc75ec
"libgcc_s.so.1 must be installed for pthread_cancel to work\n")
#5 0xb7fc4ffa in pthread_cancel_init ()
#6 0xb7fc509d in _Unwind_ForcedUnwind (...)
#7 0xb7fc2b98 in *__GI___pthread_unwind (buf=<optimized out>)
#8 0xb7fbcce0 in __do_cancel () at pthreadP.h:265
#9 __pthread_exit (value=0x0) at pthread_exit.c:30
...
</pre>
<p>Notice the <tt class="docutils literal">"libgcc_s.so.1 must be installed for pthread_cancel to work"</tt> error
message: glibc dynamically loads the <tt class="docutils literal">libgcc_s.so.1</tt> library when a thread
completes, but another thread closed its file descriptor!</p>
<p>Worse, <strong>the crash is not deterministic</strong>: it's a <strong>race condition</strong>
which requires many attempts to reproduce, even with an example designed to
trigger the crash!</p>
<p>Since the <tt class="docutils literal">EBADF</tt> error is silently ignored, such an issue is hard to notice
or to debug. I modified the development mode in Python 3.8 to <strong>log close()
exceptions in io.IOBase destructor</strong>.</p>
<p>Always logging the <tt class="docutils literal">close()</tt> exception was not accepted, so having an
opt-in development mode is a good practical compromise!</p>
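<p>A minimal sketch of how <tt class="docutils literal">EBADF</tt> can appear when a file descriptor is closed
behind the back of a file object (here the error is raised by an explicit
<tt class="docutils literal">close()</tt>; in the <tt class="docutils literal">io.IOBase</tt> destructor it used to be silently ignored):</p>

```python
import errno
import os

# Open a file object on one end of a pipe, then close the underlying
# file descriptor directly: the file object now owns a stale fd.
rfd, wfd = os.pipe()
os.close(wfd)
fobj = os.fdopen(rfd, "rb")
os.close(rfd)           # closed behind the file object's back

try:
    fobj.close()        # the second close(rfd) fails with EBADF
    error = None
except OSError as exc:
    error = exc.errno

print(error == errno.EBADF)
```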
</div>
<div class="section" id="python-3-9-checks-encoding-and-errors">
<h2>Python 3.9 checks encoding and errors</h2>
<p>In June 2019, my colleague <strong>Miro Hrončok</strong> reported <a class="reference external" href="https://bugs.python.org/issue37388">bpo-37388</a>:</p>
<blockquote>
<p>I was just bit by specifying an nonexisitng error handler for
bytes.decode() without noticing.</p>
<p>Consider this code:</p>
<pre class="literal-block">
>>> 'a'.encode('cp1250').decode('utf-8', errors='Boom, Shaka Laka, Boom!')
'a'
</pre>
</blockquote>
<p>I modified the development mode in Python 3.9, to also check <em>encoding</em> and
<em>errors</em> arguments on string encoding and decoding operations, like
<tt class="docutils literal">bytes.decode()</tt> or <tt class="docutils literal">str.encode()</tt>.</p>
<p>By default, for best performance, the <em>errors</em> argument is only checked at the
first encoding/decoding error and the <em>encoding</em> argument is sometimes ignored
for empty strings.</p>
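<p>A quick sketch of the difference: decoding pure ASCII never reaches the error
handler, so a bogus <em>errors</em> name normally goes unnoticed, while development
mode rejects it eagerly (assumes Python 3.9 or newer):</p>

```python
import subprocess
import sys

# Without development mode, the bogus errors= handler is never looked up
# because no decoding error occurs...
code = "b'a'.decode('utf-8', errors='Boom, Shaka Laka, Boom!')"
default = subprocess.run([sys.executable, "-c", code],
                         capture_output=True, text=True)

# ... but in development mode, the handler name is checked eagerly.
dev = subprocess.run([sys.executable, "-X", "dev", "-c", code],
                     capture_output=True, text=True)

print(default.returncode, dev.returncode)
```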
<p>Having an opt-in development mode makes it possible to enable additional debug
checks at runtime, without having to worry too much about the performance
overhead.</p>
<p>Note: I love the choice of the example, "Boom, Shaka Laka, Boom!"
from the game Gruntz :-D</p>
</div>
<div class="section" id="development-mode-example">
<h2>Development Mode Example</h2>
<p>Even in the <tt class="docutils literal">__main__</tt> module with PEP 565, <tt class="docutils literal">ResourceWarning</tt> is still not
displayed by default (PEP 565 only shows <tt class="docutils literal">DeprecationWarning</tt>):</p>
<pre class="literal-block">
$ python3 -c 'print(len(open("README.rst").readlines()))'
39
</pre>
<p>The development mode shows the warning:</p>
<pre class="literal-block">
$ python3 -X dev -c 'print(len(open("README.rst").readlines()))'
-c:1: ResourceWarning: unclosed file <_io.TextIOWrapper name='README.rst' mode='r' encoding='UTF-8'>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
39
</pre>
<p>Not closing a resource explicitly can leave a resource open for way longer than
expected. It can cause severe issues at Python exit. It is bad in CPython, but
it is even worse in PyPy. <strong>Closing resources explicitly makes an application
more deterministic and more reliable.</strong></p>
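<p>The <tt class="docutils literal">ResourceWarning</tt> emitted for a leaked file can also be captured
programmatically; a small sketch using a temporary file:</p>

```python
import gc
import os
import tempfile
import warnings

fd, path = tempfile.mkstemp()
os.close(fd)

with warnings.catch_warnings(record=True) as log:
    warnings.simplefilter("always", ResourceWarning)
    open(path)          # leaked: the file object is never closed explicitly
    gc.collect()        # CPython already ran the destructor; needed on PyPy

os.unlink(path)
leaked = any(issubclass(w.category, ResourceWarning) for w in log)
print(leaked)
```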
<p>If one of the development mode effects causes an issue, it is still possible to
override most options. For example,
<tt class="docutils literal">PYTHONMALLOC=default python3 <span class="pre">-X</span> dev ...</tt> command enables the development
mode without installing debug hooks on memory allocators.</p>
</div>
Pass the Python thread state explicitly2020-01-08T15:00:00+01:002020-01-08T15:00:00+01:00Victor Stinnertag:vstinner.github.io,2020-01-08:/cpython-pass-tstate.html<img alt="Python C API" src="https://vstinner.github.io/images/capi.jpg" />
<div class="section" id="keeping-python-competitive">
<h2>Keeping Python competitive</h2>
<p>I have been trying to find ways to make Python more efficient for many years, see for
example my discussion at the Language Summit during Pycon US 2017: <a class="reference external" href="https://lwn.net/Articles/723949/">Keeping
Python competitive</a> (LWN article); <a class="reference external" href="https://github.com/vstinner/talks/blob/master/2017-PyconUS/summit.pdf">slides</a>.
At EuroPython 2019 (Basel), I gave the keynote "Python Performance: Past,
Present and Future": <a class="reference external" href="https://github.com/vstinner/talks/blob/master/2019-EuroPython/python_performance.pdf">slides …</a></p></div><img alt="Python C API" src="https://vstinner.github.io/images/capi.jpg" />
<div class="section" id="keeping-python-competitive">
<h2>Keeping Python competitive</h2>
<p>I have been trying to find ways to make Python more efficient for many years, see for
example my discussion at the Language Summit during Pycon US 2017: <a class="reference external" href="https://lwn.net/Articles/723949/">Keeping
Python competitive</a> (LWN article); <a class="reference external" href="https://github.com/vstinner/talks/blob/master/2017-PyconUS/summit.pdf">slides</a>.
At EuroPython 2019 (Basel), I gave the keynote "Python Performance: Past,
Present and Future": <a class="reference external" href="https://github.com/vstinner/talks/blob/master/2019-EuroPython/python_performance.pdf">slides</a>
and <a class="reference external" href="https://www.youtube.com/watch?v=T6vC_LOHBJ4&feature=youtu.be&t=1875">video</a>. I
gave my vision on the Python performance and listed 3 projects to speedup
Python that I consider as realistic:</p>
<ul class="simple">
<li>subinterpreters: see Eric Snow's <a class="reference external" href="https://github.com/ericsnowcurrently/multi-core-python/">multi-core-python</a> project</li>
<li>better C API: see <a class="reference external" href="https://github.com/pyhandle/hpy">HPy (new C API)</a>
and <a class="reference external" href="https://pythoncapi.readthedocs.io/">pythoncapi.readthedocs.io</a></li>
<li>tracing garbage collector for CPython</li>
</ul>
<p>This article is about <strong>subinterpreters</strong>.</p>
</div>
<div class="section" id="subinterpreters">
<h2>Subinterpreters</h2>
<p>Eric Snow has been working on subinterpreters since 2015; see his first blog post
published in September 2016: <a class="reference external" href="http://ericsnowcurrently.blogspot.com/2016/09/solving-mutli-core-python.html">Solving Multi-Core Python</a>.
See Eric Snow's <a class="reference external" href="https://github.com/ericsnowcurrently/multi-core-python/wiki">multi-core-python project wiki</a> for the whole
history.</p>
<p>In September 2017, he wrote a concrete proposal: <a class="reference external" href="https://www.python.org/dev/peps/pep-0554/">PEP 554: Multiple
Interpreters in the Stdlib</a>.</p>
<p>Eric mentions the <a class="reference external" href="https://www.python.org/dev/peps/pep-0432/">PEP 432: Simplifying the CPython startup sequence</a> as one blocker issue. I fixed
this issue (at least for the subinterpreters case) with my <a class="reference external" href="https://www.python.org/dev/peps/pep-0587/">PEP 587: Python
Initialization Configuration</a> that
I implemented in Python 3.8.</p>
<p>Sadly, implementing subinterpreters in the 30-year-old CPython project is hard
since a lot of code has to be updated. CPython is made of no less than <strong>603K
lines of C code</strong> (and 815K lines of Python code)!</p>
<p>In May 2018, at CPython sprint during Pycon US, I discussed subinterpreters
with Eric Snow and Nick Coghlan. I drew an overview of Python internals and the
different "states" on a whiteboard:</p>
<img alt="Python states" src="https://vstinner.github.io/images/subinterpreters2.jpg" />
<p>Python and Python subinterpreter lifecycles (creation and finalization):</p>
<img alt="Python subinterpreter lifecycle" src="https://vstinner.github.io/images/subinterpreters1.jpg" />
<p>As a follow-up of this meeting, I wrote down the current state and what should
be done: <a class="reference external" href="https://pythoncapi.readthedocs.io/runtime.html">Reorganize Python “runtime”</a>.</p>
</div>
<div class="section" id="getting-the-current-python-thread-state">
<h2>Getting the current Python thread state</h2>
<p>In the current master branch of Python, getting the current Python thread state
is done using these two macros:</p>
<pre class="literal-block">
#define _PyRuntimeState_GetThreadState(runtime) \
((PyThreadState*)_Py_atomic_load_relaxed(&(runtime)->gilstate.tstate_current))
#define _PyThreadState_GET() _PyRuntimeState_GetThreadState(&_PyRuntime)
</pre>
<p>These macros depend on the global <tt class="docutils literal">_PyRuntime</tt> variable: instance of the
<tt class="docutils literal">_PyRuntimeState</tt> structure. There is exactly one instance of
<tt class="docutils literal">_PyRuntimeState</tt>: data shared by all interpreters on purpose (more info
about <tt class="docutils literal">_PyRuntimeState</tt> below).</p>
<p><tt class="docutils literal">_Py_atomic_load_relaxed()</tt> uses an atomic operation which may become a
performance issue if Python is modified to get the Python thread state in more
places. I tried to check if it uses a slow atomic read instruction, but it
seems like only a write uses an explicit memory fence operation: read seems to
be "free" (it's a regular efficient <tt class="docutils literal">MOV</tt> instruction). I only checked the
x86-64 machine code, it may be different on other architectures.</p>
</div>
<div class="section" id="gil-state">
<h2>GIL state</h2>
<p>Currently, the <tt class="docutils literal">_PyRuntimeState</tt> structure has a <tt class="docutils literal">gilstate</tt> field which is
shared between all subinterpreters. The long term goal of the PEP 554
(subinterpreters) is to <strong>have one GIL per subinterpreter</strong> to <strong>execute
multiple interpreters in parallel</strong>. Currently, only one interpreter can be
executed at the same time: there is no parallelism, except when a thread releases
the GIL, which is not the common case.</p>
<p>It's tracked by these two issues:</p>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue10915">Make the PyGILState API compatible with multiple interpreters</a></li>
<li><a class="reference external" href="https://bugs.python.org/issue15751">Support subinterpreters in the GIL state API</a></li>
</ul>
<p>I expect that fixing this issue may require adding a lock somewhere which <strong>can
hurt performance</strong>, depending on how the GIL state is accessed.</p>
</div>
<div class="section" id="passing-a-state-to-internal-function-calls">
<h2>Passing a state to internal function calls</h2>
<p>To avoid any risk of performance penalty with upcoming Python internal changes
for subinterpreters, but also to make things more explicit, I proposed to
<strong>pass explicitly "a state" to internal C function calls</strong>.</p>
<p>First, it wasn't obvious which "state" should be passed: <tt class="docutils literal">_PyRuntimeState</tt>,
<tt class="docutils literal">PyThreadState</tt>, a structure containing both, or something else?</p>
<p>Moreover, it was unclear how to get the runtime from <tt class="docutils literal">PyThreadState</tt>, and how
to get <tt class="docutils literal">PyThreadState</tt> from the runtime.</p>
<p>I started to <strong>pass runtime to some functions</strong> (<tt class="docutils literal">_PyRuntimeState</tt>): <a class="reference external" href="https://bugs.python.org/issue36710">Pass
_PyRuntimeState as an argument rather than using the _PyRuntime global variable</a>.</p>
<p>Then I pushed more changes to <strong>pass tstate to some other functions</strong>
(<tt class="docutils literal">PyThreadState</tt>): <a class="reference external" href="https://bugs.python.org/issue38644">Pass explicitly tstate to function calls</a>.</p>
<p>I added <tt class="docutils literal">PyInterpreterState.runtime</tt> so getting <tt class="docutils literal">_PyRuntimeState</tt> from
<tt class="docutils literal">PyThreadState</tt> is now done using: <tt class="docutils literal"><span class="pre">tstate->interp->runtime</span></tt>. It's no
longer needed to pass <tt class="docutils literal">runtime</tt> <strong>and</strong> <tt class="docutils literal">tstate</tt> to internal functions:
<tt class="docutils literal">tstate</tt> is enough.</p>
<p>Slowly, I modified the internals to only pass <tt class="docutils literal">tstate</tt> to internal functions:
<strong>tstate should become the root object to access all Python states</strong>.</p>
<p>I ended with a thread on the python-dev mailing list to summarize this work:
<a class="reference external" href="https://mail.python.org/archives/list/python-dev@python.org/thread/PQBGECVGVYFTVDLBYURLCXA3T7IPEHHO/#Q4IPXMQIM5YRLZLHADUGSUT4ZLXQ6MYY">Pass the Python thread state to internal C functions</a>.
The feedback was quite positive, most core developers agreed that passing
explicitly tstate is a good practice and the work should be continued.</p>
</div>
<div class="section" id="pyruntimestate-and-pyinterpreterstate">
<h2>_PyRuntimeState and PyInterpreterState</h2>
<p>Currently, some <tt class="docutils literal">_PyRuntimeState</tt> fields are shared by all interpreters,
whereas they should be moved into <tt class="docutils literal">PyInterpreterState</tt>: it's still a work in
progress.</p>
<p>For example, I continued the work started by Eric Snow to move the garbage
collector state from <tt class="docutils literal">_PyRuntimeState</tt> to <tt class="docutils literal">PyInterpreterState</tt>: <a class="reference external" href="https://bugs.python.org/issue36854">GC
operates out of global runtime state</a>.
<p>As explained above, another example is <tt class="docutils literal">gilstate</tt> that should also be moved
to <tt class="docutils literal">PyInterpreterState</tt>, but that's a complex change that should be well
prepared to not break anything.</p>
</div>
<div class="section" id="more-subinterpreter-work">
<h2>More subinterpreter work</h2>
<p>Implementing subinterpreters also requires cleaning up various parts of Python
internals.</p>
<p>For example, I modified Python so Py_NewInterpreter() and Py_EndInterpreter()
(create and finalize a subinterpreter) share more code with Py_Initialize()
and Py_Finalize() (create and finalize the <strong>main</strong> interpreter):
<a class="reference external" href="https://bugs.python.org/issue38858">new_interpreter() should reuse more Py_InitializeFromConfig() code</a>.</p>
<p>There are still many issues to fix: <strong>it's moving slowly but steadily!</strong></p>
</div>
Graphics bugs in Firefox and GNOME2019-10-10T17:00:00+02:002019-10-10T17:00:00+02:00Victor Stinnertag:vstinner.github.io,2019-10-10:/graphics-bugs-firefox-gnome.html<p>After explaining how to <a class="reference external" href="https://vstinner.github.io/debug-hybrid-graphics-issues-linux.html">Debug Hybrid Graphics issues on Linux</a>, here is the story of four graphics bugs
that I had in GNOME and Firefox on my Fedora 30 between May 2018 and September
2019: bugs in gnome-shell, Gtk, Firefox and mutter.</p>
<a class="reference external image-reference" href="https://www.flickr.com/photos/34298393@N06/14488759356/"><img alt="Glitch" src="https://vstinner.github.io/images/glitch.jpg" /></a>
<div class="section" id="gnome-shell-freezes">
<h2>gnome-shell freezes</h2>
<p>In May 2018, six months after …</p></div><p>After explaining how to <a class="reference external" href="https://vstinner.github.io/debug-hybrid-graphics-issues-linux.html">Debug Hybrid Graphics issues on Linux</a>, here is the story of four graphics bugs
that I had in GNOME and Firefox on my Fedora 30 between May 2018 and September
2019: bugs in gnome-shell, Gtk, Firefox and mutter.</p>
<a class="reference external image-reference" href="https://www.flickr.com/photos/34298393@N06/14488759356/"><img alt="Glitch" src="https://vstinner.github.io/images/glitch.jpg" /></a>
<div class="section" id="gnome-shell-freezes">
<h2>gnome-shell freezes</h2>
<p>In May 2018, six months after I got my Lenovo P50 laptop, gnome-shell was
"sometimes" freezing between 1 and 5 seconds. It was annoying because
keystrokes created repeated keys, writing "helloooooooooooooooooooooo" instead of
"hello" for example.</p>
<p>My colleagues led me to <tt class="docutils literal"><span class="pre">#fedora-desktop</span></tt> on the GIMP IRC server where I met
my colleague <strong>Jonas Ådahl</strong> (jadahl) who almost immediately identified my
issue! Extract of the IRC chat:</p>
<pre class="literal-block">
15:03 <vstinner> hello. i upgraded from F27 to F28, and it seems like I
switched from Xorg to Wayland. sometimes, the desktop hangs a few
milliseconds (less than 2 secondes)
15:03 <vstinner> bentiss told me that "libinput error: client bug: timer
event7 keyboard: offset negative (-39ms)" can occur when shell is too
slow
15:04 <vstinner> journalctl shows me frenquently the bug
https://gitlab.gnome.org/GNOME/gnome-shell/issues/1 "Object
Shell.GenericContainer (0x559e6bfddc60), has been already finalized.
Impossible to get any property from it."
15:04 <vstinner> i also get "Window manager warning: last_user_time
(3093467) is greater than comparison timestamp (3093466). This most
likely represents a buggy client sending inaccurate timestamps in
messages such as _NET_ACTIVE_WINDOW. Trying to work around..." errors
in logs (from shell)
15:05 <vstinner> bentiss: ah, i also get "libinput error: client bug: timer
event7 trackpoint: offset negative (-352ms)" errors
15:06 <vstinner> it's a recent laptop, Lenovo P50: 32 GB of RAM, 4 physical
CPUs (8 threads) Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
15:06 <vstinner> so. what can i do to debug such performance issue? may it
come from shell? what does it mean if shell is slow? can it be a GPU
issue? a javascript issue?
...
15:13 <jadahl> vstinner: whats your hardware? Do you have a hybrid gpu
system?
15:13 <jadahl> ah, yes P50
15:14 <jadahl> vstinner: there is a branch on mutter upstream that fixes
that issue. want to compile it to test?
</pre>
<p>Ten minutes after I asked my question, Jonas asked the right question: <strong>Do you
have a hybrid gpu system?</strong></p>
<p>I was able to workaround the issue by connecting my laptop to my TV using the
HDMI port:</p>
<pre class="literal-block">
15:22 < jadahl> for example, IIRC if you have a monitor connected to the
HDMI, the issue will go away since the secondary GPU is always awake
anyway
...
15:31 < vstinner> jadahl: i plugged a HDMI cable to my TV and it seems like
the issue is gone
15:31 < vstinner> jadahl: impressive
</pre>
<p>When an external monitor is used (like a TV plugged on the HDMI port), my
NVIDIA GPU is always active which works around the bug I had in gnome-shell.</p>
<p>Jonas provided me a RPM package for Fedora including his work-in-progress fix:
<a class="reference external" href="https://gitlab.gnome.org/GNOME/mutter/merge_requests/106">Upload HW cursor sprite on-demand</a>. I confirmed that
this change fixed my bug. His mutter change has been merged upstream.</p>
</div>
<div class="section" id="firefox-crash-when-selecting-text">
<h2>Firefox crash when selecting text</h2>
<p>In March 2019, Firefox with Wayland crashed on <tt class="docutils literal">wl_abort()</tt> when selecting
more than 4000 characters in a <tt class="docutils literal">&lt;textarea&gt;</tt>. I found the bug in Gmail when
selecting the whole email text to remove it. Pressing <strong>CTRL + A</strong> or
Right-click + Select All <strong>crashed the whole Firefox process!</strong></p>
<p>I reported the bug to Firefox: <a class="reference external" href="https://bugzilla.mozilla.org/show_bug.cgi?id=1539773">Firefox with Wayland crash on wl_abort() when
selecting more than 4000 characters in a &lt;textarea&gt;</a>.</p>
<p>Running gdb on Firefox caused me some trouble since it's a very large binary with
many libraries. I also read <a class="reference external" href="https://cgit.freedesktop.org/wayland/wayland-protocols/tree/unstable/text-input/text-input-unstable-v3.xml#n138">Wayland protocol specifications</a>.
I managed to analyze the bug and so I reported the bug to Gtk as well, <a class="reference external" href="https://gitlab.gnome.org/GNOME/gtk/issues/1783">On
Wayland, notify_surrounding_text() crash on wl_abort() if text is longer than
4000 bytes</a>:</p>
<blockquote>
According to gdb, <tt class="docutils literal">wl_proxy_marshal_array_constructor_versioned()</tt> calls
<tt class="docutils literal">wl_abort()</tt> because the buffer is too short. It seems like
<tt class="docutils literal">wl_buffer_put()</tt> fails with <tt class="docutils literal">E2BIG</tt>.</blockquote>
<p>Quickly, I identified that <strong>my Gtk bug has already been fixed 3 months before
by Carlos Garnacho</strong> (<a class="reference external" href="https://gitlab.gnome.org/GNOME/gtk/merge_requests/438">imwayland: Respect maximum length of 4000 Bytes on
strings being sent</a>)
and <strong>the fix is part of gtk-3.24.3</strong> ("wayland: Respect length limits in text
protocol" says "Overview of Changes in GTK+ 3.24.3").</p>
<p>I requested to upgrade Gtk in Fedora. But it was not possible since the newer
version changed the theme. I was asked to cherry-pick the fix and that's what I
did: <a class="reference external" href="https://src.fedoraproject.org/rpms/gtk3/pull-request/5">imwayland: Respect maximum length of 4000 Bytes on strings</a>.</p>
<p>My PR was merged and a new package was built. I tested it and confirmed that it
fixed the crash: <a class="reference external" href="https://bodhi.fedoraproject.org/updates/FEDORA-2019-d67ec97b0b">FEDORA-2019-d67ec97b0b</a>. Soon, the
package was pushed to the public Fedora package repository.</p>
<p><strong>That's the cool part about open source: if you have the skills to hack the
code, you can fix an annoying bug which is affecting you!</strong></p>
</div>
<div class="section" id="firefox-wayland-window-partially-or-not-updated-when-switching-between-two-tabs">
<h2>Firefox: [Wayland] Window partially or not updated when switching between two tabs</h2>
<div class="section" id="analyze-the-bug">
<h3>Analyze the bug</h3>
<p>In September 2019, after a large system upgrade (install 6 packages, upgrade
234 packages, remove 5 packages), Firefox sometimes stopped updating the window
content when I switched from one tab to another. Example:</p>
<img alt="Firefox bug of window partially updated" src="https://vstinner.github.io/images/firefox_bug_1.jpg" />
<p>It took me a few hours to analyze the bug to be able to produce a useful bug
report.</p>
<p>I followed the advice in Fedora's guide <a class="reference external" href="https://fedoraproject.org/wiki/How_to_debug_Firefox_problems">How to debug Firefox problems</a>.</p>
<p>First, I tried to <strong>understand which GPU driver is used</strong>. I ended up
blacklisting the nouveau driver in the Linux kernel, to ensure that Firefox was
using my Intel IGP. I still reproduced the bug.</p>
<p>I <strong>disabled all Firefox extensions</strong>: bug reproduced.</p>
<p>Then I created a new Firefox profile and started Firefox in <strong>safe mode</strong>: bug
reproduced.</p>
<p>I tested the latest Firefox binary from mozilla.org (Firefox 69.0): bug
reproduced.</p>
<p>Finally, <strong>I tested Firefox Nightly</strong> from mozilla.org (Firefox 71.0a1): bug
reproduced.</p>
<p>Ok, it was enough data to produce an interesting bug report. I reported
<a class="reference external" href="https://bugzilla.mozilla.org/show_bug.cgi?id=1580152">[Wayland] Window partially or not updated when switching between two tabs</a> to Firefox.</p>
</div>
<div class="section" id="identify-the-regression-using-fedora-packages">
<h3>Identify the regression using Fedora packages</h3>
<p>Then I looked at <tt class="docutils literal">/var/log/dnf.log</tt> and I tried to identify which package
update could explain the regression.</p>
<p>I downgraded <strong>gtk3</strong>-3.24.11-1.fc30.x86_64 to gtk3-3.24.10-1.fc30.x86_64: bug
reproduced.</p>
<p>I rebooted on the oldest available <strong>Linux kernel</strong>, version 5.2.8-200.fc30.x86_64:
bug reproduced. I checked the journalctl logs to see which Linux version I was
running when the bug was first seen: Linux 5.2.9-200.fc30.x86_64.</p>
<p>I don't know why, but <strong>downgrading Firefox was only my 3rd test</strong>.</p>
<p>I downgraded firefox-69.0-2.fc30.x86_64 to firefox-68.0.2-1.fc30.x86_64: the
bug was gone! Ok, so <strong>the regression came from the Firefox package</strong>, and it
was introduced between package versions 68.0.2-1.fc30 and 69.0-2.fc30.</p>
<p>On IRC, I met my colleague <strong>Martin Stránský</strong> who packages Firefox for Fedora.
He told me that he was aware of my bug and might have a fix for it. Great!</p>
<p>Only 9 days later, <strong>Martin Stránský</strong>'s fix was merged in Firefox upstream,
released in Firefox Nightly, and a new package was shipped in Fedora 30!
Thanks Martin for your efficiency!</p>
<p>The final Firefox change is quite large and intrusive: <a class="reference external" href="https://hg.mozilla.org/releases/mozilla-beta/rev/3281a617f22b">[Wayland] Fix rendering
glitches on wayland</a>.</p>
</div>
</div>
<div class="section" id="xwayland-crash-in-xwl-glamor-gbm-create-pixmap">
<h2>Xwayland crash in xwl_glamor_gbm_create_pixmap()</h2>
<p>In September 2019, while I was debugging the previous Firefox bug, I started my
IRC client hexchat. Suddenly, <strong>Xwayland crashed, which closed my whole GNOME
session</strong>! I was testing various GPU configurations to analyze the Firefox
bug.</p>
<p>ABRT managed to rebuild a useless traceback, but it identified an existing bug
report. It added my comment to the <a class="reference external" href="https://bugzilla.redhat.com/show_bug.cgi?id=1729200#c20">[abrt] xorg-x11-server-Xwayland:
OsLookupColor(): Segmentation fault at address 0x28</a> report.</p>
<p>On July 26, 2019 (1 month before I got the bug), <strong>Olivier Fourdan</strong> added <a class="reference external" href="https://bugzilla.redhat.com/show_bug.cgi?id=1729200#c9">an
interesting comment</a>:</p>
<blockquote>
<tt class="docutils literal">glamor_get_modifiers+0x767</tt> is <tt class="docutils literal">xwl_glamor_gbm_create_pixmap()</tt> so this
is the same as <a class="reference external" href="https://bugzilla.redhat.com/show_bug.cgi?id=1729925">bug 1729925</a> fixed upstream with
<a class="reference external" href="https://gitlab.freedesktop.org/xorg/xserver/merge_requests/242">xwayland: Do not free a NULL GBM bo</a>.</blockquote>
<p>So in fact, my bug was already fixed by <strong>Olivier Fourdan</strong> in Xwayland
upstream, but the fix hadn't landed in Fedora yet.</p>
</div>
<div class="section" id="thanks">
<h2>Thanks!</h2>
<p>I would like to thank the following developers who fixed my Fedora 30. What a
coincidence: all four are my colleagues! It seems like Red Hat is investing in
the Linux desktop :-)</p>
<p><a class="reference external" href="https://blogs.gnome.org/carlosg/">Carlos Garnacho</a> (Red Hat).</p>
<a class="reference external image-reference" href="https://www.flickr.com/photos/183829480@N06/48623543091/in/pool-14662216@N23/"><img alt="Carlos Garnacho" src="https://vstinner.github.io/images/carlos_garnacho.jpg" /></a>
<p><a class="reference external" href="https://gitlab.gnome.org/jadahl">Jonas Ådahl</a> (Red Hat).</p>
<a class="reference external image-reference" href="https://www.flickr.com/photos/183829480@N06/48623189663/in/pool-14662216@N23/"><img alt="Jonas Ådahl" src="https://vstinner.github.io/images/jonas_adahl.jpg" /></a>
<p><a class="reference external" href="http://people.redhat.com/stransky/">Martin Stránský</a> (Red Hat).</p>
<a class="reference external image-reference" href="http://people.redhat.com/stransky/"><img alt="Martin Stránský" src="https://vstinner.github.io/images/mstransky.jpg" /></a>
<p><a class="reference external" href="https://en.wikipedia.org/wiki/Olivier_Fourdan">Olivier Fourdan</a> (Red Hat).</p>
<a class="reference external image-reference" href="https://en.wikipedia.org/wiki/Olivier_Fourdan"><img alt="Olivier Fourdan" src="https://vstinner.github.io/images/olivier_fourdan.jpg" /></a>
</div>
Debug Hybrid Graphics issues on Linux2019-09-11T15:50:00+02:002019-09-11T15:50:00+02:00Victor Stinnertag:vstinner.github.io,2019-09-11:/debug-hybrid-graphics-issues-linux.html<p><a class="reference external" href="https://wiki.archlinux.org/index.php/Hybrid_graphics">Hybrid Graphics</a> is a
complex hardware and software solution to achieve longer laptop battery life:
an <strong>integrated</strong> graphics device is used by default, and a <strong>discrete</strong>
graphics device with higher graphics performance is enabled on demand.</p>
<a class="reference external image-reference" href="https://www.theregister.co.uk/2010/02/09/inside_nvidia_optimus/"><img alt="Hybrid Graphics" src="https://vstinner.github.io/images/hybrid_graphics.jpg" /></a>
<p>If it is designed and implemented carefully, users should not notice that a
laptop …</p><p><a class="reference external" href="https://wiki.archlinux.org/index.php/Hybrid_graphics">Hybrid Graphics</a> is a
complex hardware and software solution to achieve longer laptop battery life:
an <strong>integrated</strong> graphics device is used by default, and a <strong>discrete</strong>
graphics device with higher graphics performance is enabled on demand.</p>
<a class="reference external image-reference" href="https://www.theregister.co.uk/2010/02/09/inside_nvidia_optimus/"><img alt="Hybrid Graphics" src="https://vstinner.github.io/images/hybrid_graphics.jpg" /></a>
<p>If it is designed and implemented carefully, users should not notice that a
laptop has two graphical devices.</p>
<p>Sadly, the Linux implementation is not perfect yet. I had to debug different
graphics issues on GNOME last months, so I decided to write down an article
about this technology.</p>
<p>This article is about the <strong>GNOME</strong> desktop environment with <strong>Wayland</strong>
running on <strong>Fedora</strong> 30, with the Linux kernel <strong>vgaswitcheroo</strong> in muxless mode
(more about that below).</p>
<div class="section" id="hybrid-graphics-1">
<h2>Hybrid Graphics</h2>
<p>Hybrid Graphics are known under different names:</p>
<ul class="simple">
<li>Linux kernel <a class="reference external" href="https://www.kernel.org/doc/html/latest/gpu/vga-switcheroo.html">vgaswitcheroo</a></li>
<li><a class="reference external" href="https://wiki.archlinux.org/index.php/PRIME">PRIME</a> in Linux open source
GPU drivers (nouveau, ati, amdgpu and intel), the "muxless" flavor of hybrid graphics</li>
<li><a class="reference external" href="https://wiki.archlinux.org/index.php/bumblebee">Bumblebee</a>:
<a class="reference external" href="https://wiki.archlinux.org/index.php/NVIDIA_Optimus">NVIDIA Optimus</a>
for Linux</li>
<li>"AMD Dynamic Switchable Graphics" for Radeon</li>
<li>"Dual GPUs"</li>
<li>etc.</li>
</ul>
<p>Nowadays, most manufacturers use the <strong>muxless</strong> model:</p>
<blockquote>
Dual GPUs but <strong>only one of them is connected to outputs</strong>. The other one
is merely used to <strong>offload rendering</strong>, its results are copied over PCIe
into the framebuffer. On Linux this is supported with DRI PRIME.</blockquote>
<p>In 2010, the first generation hybrid model used the <strong>muxed</strong> model:</p>
<blockquote>
Dual GPUs with a hardware multiplexer chip to switch outputs between GPUs.
This model makes the user choose (at boot time or at login time) between
the two power/graphics profiles and is almost fixed throughout the user
session.</blockquote>
<p>Note: The development to support hybrid graphics in Linux started in 2010.</p>
</div>
<div class="section" id="does-my-linux-have-hybrid-graphics">
<h2>Does my Linux have Hybrid Graphics?</h2>
<p>On Linux, Hybrid Graphics is used if the <tt class="docutils literal">/sys/kernel/debug/vgaswitcheroo/</tt>
directory exists.</p>
<p>No Hybrid Graphics, single graphics device:</p>
<pre class="literal-block">
$ sudo cat /sys/kernel/debug/vgaswitcheroo/switch
cat: /sys/kernel/debug/vgaswitcheroo/switch: No such file or directory
</pre>
<p>Hybrid Graphics with two graphics devices:</p>
<pre class="literal-block">
$ sudo cat /sys/kernel/debug/vgaswitcheroo/switch
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynOff:0000:01:00.0
</pre>
<p>Command to list graphics devices:</p>
<pre class="literal-block">
$ lspci|grep VGA
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M1000M] (rev a2)
</pre>
</div>
<div class="section" id="hardware">
<h2>Hardware</h2>
<p>My employer gave me a Lenovo P50 laptop for work in December 2017. It is my only
computer at home, so I needed a powerful laptop (even if it's heavy for
traveling to conferences). The CPU, RAM and battery are great, but the hybrid
graphics caused me some headaches.</p>
<p>My Lenovo P50 has two GPUs:</p>
<pre class="literal-block">
$ lspci|grep VGA
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M1000M] (rev a2)
</pre>
<ul class="simple">
<li>The <strong>Integrated Graphics Device</strong> is an <strong>Intel</strong> IGP (Intel HD Graphics 530)</li>
<li>The <strong>Discrete Graphics Device</strong> is a <strong>NVIDIA</strong> GPU (NVIDIA Quadro M1000M)</li>
</ul>
<p>I didn't know that the laptop had two graphics devices when I chose the
laptop model. I discovered hybrid graphics when I started to debug graphics
issues.</p>
</div>
<div class="section" id="bios">
<h2>BIOS</h2>
<p>Hybrid graphics can be configured in the BIOS:</p>
<ul class="simple">
<li><strong>Discrete Graphics mode</strong> will achieve higher graphics performances.</li>
<li><strong>Hybrid Graphics mode</strong> (default) runs as Integrated Graphics mode to
achieve longer battery life, and Discrete Graphics is enabled on demand.</li>
</ul>
<p>On my Lenovo P50, using the <strong>Discrete Graphics mode</strong> removes "00:02.0 VGA
compatible controller: Intel Corporation HD Graphics 530" from <tt class="docutils literal">lspci</tt>
command output: the <strong>Intel IGP is fully disabled</strong>. The Linux kernel only
sees the NVIDIA GPU.</p>
</div>
<div class="section" id="linux-kernel">
<h2>Linux kernel</h2>
<p>On Linux, hybrid graphics is handled by <strong>vgaswitcheroo</strong>:</p>
<pre class="literal-block">
$ sudo cat /sys/kernel/debug/vgaswitcheroo/switch
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynPwr:0000:01:00.0
</pre>
<ul class="simple">
<li><tt class="docutils literal">IGD</tt> stands for <strong>Integrated</strong> Graphics Device</li>
<li><tt class="docutils literal">DIS</tt> stands for <strong>DIScrete</strong> Graphics Device</li>
<li>"+" marks the <strong>active</strong> card</li>
<li><tt class="docutils literal">Pwr</tt>: the graphics device is <strong>always active</strong></li>
<li><tt class="docutils literal">DynPwr</tt>: the graphics device is activated <strong>on demand</strong></li>
</ul>
<p>The last field (ex: <tt class="docutils literal">0000:00:02.0</tt>) is based on the PCI identifier:</p>
<pre class="literal-block">
$ lspci|grep VGA
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M1000M] (rev a2)
</pre>
<p>On my laptop, hybrid graphics is detected by an <a class="reference external" href="https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface">ACPI</a>
"Device-Specific Method" (DSM):</p>
<pre class="literal-block">
$ journalctl -b -k|grep 'VGA switcheroo'
Sep 11 02:29:54 apu kernel: VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle
</pre>
<p>See: <a class="reference external" href="https://www.kernel.org/doc/html/latest/gpu/vga-switcheroo.html">VGA Switcheroo (Linux kernel documentation)</a>.</p>
</div>
<div class="section" id="opengl">
<h2>OpenGL</h2>
<p><a class="reference external" href="https://en.wikipedia.org/wiki/Mesa_(computer_graphics)">Mesa</a> provides
<tt class="docutils literal">glxinfo</tt> utility to get information about the OpenGL driver currently used:</p>
<pre class="literal-block">
$ glxinfo|grep -E 'Device|direct rendering'
direct rendering: Yes
Device: Mesa DRI Intel(R) HD Graphics 530 (Skylake GT2) (0x191b)
</pre>
<p>In this example, the integrated Intel IGP is used.</p>
<p>In Firefox, go to <strong>about:support</strong> page and search for the <tt class="docutils literal">Graphics</tt>
section to get information about compositing, WebGL, GPU, etc.</p>
</div>
<div class="section" id="dri-prime-environment-variable">
<h2>DRI_PRIME environment variable</h2>
<p>Set DRI_PRIME=1 environment variable to run an application with the
<strong>discrete</strong> GPU.</p>
<p>Example:</p>
<pre class="literal-block">
$ DRI_PRIME=1 glxinfo|grep -E 'Device|rendering'
direct rendering: Yes
Device: NV117 (0x13b1)
</pre>
</div>
<div class="section" id="switcheroo-control">
<h2>switcheroo-control</h2>
<p><a class="reference external" href="https://github.com/hadess/switcheroo-control">switcheroo-control</a> is a
daemon controlling <tt class="docutils literal">/sys/kernel/debug/vgaswitcheroo/switch</tt> (Linux kernel).
It can be accessed over DBus.</p>
<p>When the daemon starts, it looks for the <tt class="docutils literal">xdg.force_integrated=VALUE</tt> parameter
in the Linux command line. If <em>VALUE</em> is <tt class="docutils literal">1</tt>, <tt class="docutils literal">true</tt> or <tt class="docutils literal">on</tt>, or if
<tt class="docutils literal">xdg.force_integrated=VALUE</tt> is not found in the command line, the daemon
writes <tt class="docutils literal">DIGD</tt> into <tt class="docutils literal">/sys/kernel/debug/vgaswitcheroo/switch</tt> (delayed
<strong>switch to the integrated graphics device</strong>: my Intel IGP).</p>
<p>If <tt class="docutils literal">xdg.force_integrated=0</tt> is found in the command line, the daemon leaves
<tt class="docutils literal">/sys/kernel/debug/vgaswitcheroo/switch</tt> unchanged.</p>
<p>systemd:</p>
<ul class="simple">
<li>Check if the service is running: <tt class="docutils literal">sudo systemctl status <span class="pre">switcheroo-control.service</span></tt></li>
<li>Disable the service: <tt class="docutils literal">sudo systemctl disable <span class="pre">switcheroo-control.service</span></tt>
and <tt class="docutils literal">sudo systemctl stop <span class="pre">switcheroo-control.service</span></tt></li>
</ul>
<p>On Fedora, switcheroo-control is installed by default.</p>
<p>It is unclear to me if this daemon is still useful for my setup. It seems like
the Linux kernel switcheroo uses the integrated Intel IGP by default
anyway.</p>
</div>
<div class="section" id="disable-the-discrete-gpu-by-blacklisting-its-driver">
<h2>Disable the discrete GPU by blacklisting its driver</h2>
<p>To debug graphical bugs, I wanted to ensure that the discrete NVIDIA GPU is
never used.</p>
<p>I found the solution of fully disabling the nouveau driver in the Linux kernel:
add <tt class="docutils literal">modprobe.blacklist=nouveau</tt> to the Linux kernel command line. On Fedora,
you can use:</p>
<pre class="literal-block">
sudo grubby --update-kernel=ALL --args="modprobe.blacklist=nouveau"
</pre>
<p>To reenable nouveau, remove the parameter. On Fedora:</p>
<pre class="literal-block">
sudo grubby --update-kernel=ALL --remove-args="modprobe.blacklist=nouveau"
</pre>
</div>
<div class="section" id="demo">
<h2>Demo!</h2>
<p>For this test, my laptop is not connected to anything (no power cable, no
external monitor, no dock).</p>
<p>When my laptop is idle (no 3D application is running), the NVIDIA GPU is
<strong>suspended</strong>:</p>
<pre class="literal-block">
$ cat /sys/bus/pci/drivers/nouveau/0000\:01\:00.0/enable
0
$ cat /sys/bus/pci/drivers/nouveau/0000\:01\:00.0/power/runtime_status
suspended
</pre>
<p>I explicitly run a 3D application on it:</p>
<pre class="literal-block">
DRI_PRIME=1 glxgears
</pre>
<p>The NVIDIA GPU becomes <strong>active</strong>:</p>
<pre class="literal-block">
$ cat /sys/bus/pci/drivers/nouveau/0000\:01\:00.0/enable
2
$ cat /sys/bus/pci/drivers/nouveau/0000\:01\:00.0/power/runtime_status
active
</pre>
<p>I stop the 3D application. A few seconds later, the NVIDIA GPU is <strong>suspended</strong>
again:</p>
<pre class="literal-block">
$ cat /sys/bus/pci/drivers/nouveau/0000\:01\:00.0/enable
0
$ cat /sys/bus/pci/drivers/nouveau/0000\:01\:00.0/power/runtime_status
suspended
</pre>
</div>
<div class="section" id="graphics-devices-and-monitors">
<h2>Graphics devices and monitors</h2>
<p>When I disabled the nouveau driver using <tt class="docutils literal">modprobe.blacklist=nouveau</tt> kernel
command line parameter, I was no longer able to use external monitors. I
understood that:</p>
<ul class="simple">
<li>The <strong>Intel</strong> IGP is connected to the <strong>internal</strong> laptop screen</li>
<li>The <strong>NVIDIA</strong> GPU is connected to the <strong>external</strong> monitors (DisplayPort
and HDMI ports)</li>
</ul>
<p>When my laptop has <strong>no external monitor</strong> connected, the <strong>discrete</strong> NVIDIA
GPU is <strong>activated on demand</strong> (suspended when idle).</p>
<p>When I connect my laptop to <strong>two external monitors</strong> (using my dock), the
<strong>discrete</strong> NVIDIA GPU is <strong>always active</strong>:</p>
<pre class="literal-block">
$ cat /sys/bus/pci/drivers/nouveau/0000\:01\:00.0/power/runtime_status
active
</pre>
</div>
<div class="section" id="links">
<h2>Links</h2>
<ul class="simple">
<li><a class="reference external" href="https://wiki.archlinux.org/index.php/Hybrid_graphics">https://wiki.archlinux.org/index.php/Hybrid_graphics</a></li>
<li><a class="reference external" href="https://www.kernel.org/doc/html/latest/gpu/vga-switcheroo.html">https://www.kernel.org/doc/html/latest/gpu/vga-switcheroo.html</a></li>
<li><a class="reference external" href="https://wiki.archlinux.org/index.php/PRIME">https://wiki.archlinux.org/index.php/PRIME</a></li>
<li><a class="reference external" href="https://help.ubuntu.com/community/HybridGraphics">https://help.ubuntu.com/community/HybridGraphics</a></li>
<li><a class="reference external" href="https://en.wikipedia.org/wiki/Nvidia_Optimus">https://en.wikipedia.org/wiki/Nvidia_Optimus</a></li>
<li><a class="reference external" href="https://en.wikipedia.org/wiki/AMD_Hybrid_Graphics">https://en.wikipedia.org/wiki/AMD_Hybrid_Graphics</a></li>
<li><a class="reference external" href="https://nouveau.freedesktop.org/wiki/Optimus">https://nouveau.freedesktop.org/wiki/Optimus</a></li>
</ul>
</div>
Split Include/ directory in Python 3.82019-06-19T12:00:00+02:002019-06-19T12:00:00+02:00Victor Stinnertag:vstinner.github.io,2019-06-19:/split-include-directory-python38.html<a class="reference external image-reference" href="https://www.flickr.com/photos/mortengade/2747989334/"><img alt="Private way. Trespassers and those disposing rubbish will be prosecuted." src="https://vstinner.github.io/images/private_way.jpg" /></a>
<p>In September 2017, during the CPython sprint at Facebook, I proposed my
idea to create <a class="reference external" href="https://vstinner.github.io/new-python-c-api.html">A New C API for CPython</a>.
I'm still working on the Python C API at: <a class="reference external" href="http://pythoncapi.readthedocs.io/">pythoncapi.readthedocs.io</a>.</p>
<p>My analysis is that the C API leaks too many implementation details which
prevent optimizing Python …</p><a class="reference external image-reference" href="https://www.flickr.com/photos/mortengade/2747989334/"><img alt="Private way. Trespassers and those disposing rubbish will be prosecuted." src="https://vstinner.github.io/images/private_way.jpg" /></a>
<p>In September 2017, during the CPython sprint at Facebook, I proposed my
idea to create <a class="reference external" href="https://vstinner.github.io/new-python-c-api.html">A New C API for CPython</a>.
I'm still working on the Python C API at: <a class="reference external" href="http://pythoncapi.readthedocs.io/">pythoncapi.readthedocs.io</a>.</p>
<p>My analysis is that the C API leaks too many implementation details, which
prevent optimizing Python and make the implementation of PyPy (cpyext) more
painful.</p>
<p>In Python 3.8, I created <tt class="docutils literal">Include/cpython/</tt> sub-directory to stop adding new
APIs to the stable API by mistake.</p>
<p>I moved more private functions into the internal C API: <tt class="docutils literal">Include/internal/</tt>
directory.</p>
<p>I also converted some macros like <tt class="docutils literal">Py_INCREF()</tt> and <tt class="docutils literal">Py_DECREF()</tt> to static
inline functions to have well-defined parameter and return types, and to avoid
macro pitfalls.</p>
<p>Finally, I removed 3 functions from the C API.</p>
<div class="section" id="include-internal">
<h2>Include/internal/</h2>
<p>In Python 3.7, <strong>Eric Snow</strong> created <tt class="docutils literal">Include/internal/</tt> sub-directory for
the CPython "internal C API": API which should not be used outside CPython code
base. In Python 3.6, these APIs were surrounded by:</p>
<pre class="literal-block">
#ifdef Py_BUILD_CORE
...
#endif
</pre>
<p>In Python 3.8, I continued this work by moving more private functions into
this directory: see <a class="reference external" href="https://bugs.python.org/issue35081">bpo-35081</a>.</p>
<p>I started a thread on python-dev: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2018-October/155587.html">[Python-Dev] Rename Include/internal/ to
Include/pycore/</a>. But
it was decided to keep the <tt class="docutils literal">Include/internal/</tt> name. It was also decided that internal
header files must not be included implicitly by the generic <tt class="docutils literal">#include
<Python.h></tt>, but included explicitly. For example, when I moved
<tt class="docutils literal">_PyObject_GC_TRACK()</tt> and <tt class="docutils literal">_PyObject_GC_UNTRACK()</tt> to the internal C API,
I had to add <tt class="docutils literal">#include "pycore_object.h"</tt> to 32 C files!</p>
<p><a class="reference external" href="https://bugs.python.org/issue35296">I also modified make install</a> to install
this internal C API, so it can be used for specific needs like debuggers or
profilers which have to access CPython internals (access structure fields) but
cannot call functions. For example, <strong>Eric Snow</strong> moved the <tt class="docutils literal">PyInterpreterState</tt>
structure to the internal C API.</p>
<p>Installing the internal C API eases the migration of APIs to internal: if an API
is still needed after it's moved, it's now possible to opt in to use it.</p>
<p>Using the internal C API requires defining the <tt class="docutils literal">Py_BUILD_CORE_MODULE</tt> macro and
using a different include, like <tt class="docutils literal">#include "internal/pycore_pystate.h"</tt>. It's
more complicated on purpose: this ensures that it's not used by mistake.</p>
<p>Python 3.8 now provides 21 internal header files:</p>
<pre class="literal-block">
pycore_accu.h pycore_getopt.h pycore_pyhash.h
pycore_atomic.h pycore_gil.h pycore_pylifecycle.h
pycore_ceval.h pycore_hamt.h pycore_pymem.h
pycore_code.h pycore_initconfig.h pycore_pystate.h
pycore_condvar.h pycore_object.h pycore_traceback.h
pycore_context.h pycore_pathconfig.h pycore_tupleobject.h
pycore_fileutils.h pycore_pyerrors.h pycore_warnings.h
</pre>
</div>
<div class="section" id="include-cpython">
<h2>Include/cpython/</h2>
<p>The <a class="reference external" href="https://www.python.org/dev/peps/pep-0384/">PEP 384 "Defining a Stable ABI"</a> introduced <tt class="docutils literal">Py_LIMITED_API</tt>
macro to exclude functions from the Python C API. The problem is that when a new API
is added, it has to be explicitly excluded using <tt class="docutils literal">#ifndef Py_LIMITED_API</tt>.
If the author forgets, the function is added to the stable API by mistake.</p>
<p>I proposed to move the API which should be excluded from the stable ABI to a
new subdirectory. I created a <a class="reference external" href="https://discuss.python.org/t/poll-what-is-your-favorite-name-for-the-new-include-subdirectory/477">poll on the sub-directory name</a>:</p>
<ul class="simple">
<li><tt class="docutils literal">Include/cpython/</tt></li>
<li><tt class="docutils literal">Include/board/</tt></li>
<li><tt class="docutils literal">Include/impl/</tt></li>
<li><tt class="docutils literal">Include/pycapi/</tt> (the name that I proposed initially)</li>
<li><tt class="docutils literal">Include/unstable/</tt></li>
<li>other (add comment)</li>
</ul>
<p>The <tt class="docutils literal">Include/cpython/</tt> name won with 100% of the 3 votes (and a few more
supports in the python-dev discussion and in the bug tracker) :-)</p>
<p>I created <a class="reference external" href="https://bugs.python.org/issue35134">bpo-35134: Add a new Include/cpython/ subdirectory for the "CPython
API" with implementation details</a>.</p>
<p>My initial description of the directory content:</p>
<blockquote>
The new subdirectory will contain <tt class="docutils literal">#ifndef Py_LIMITED_API</tt> code, not the
“Stable ABI” of <a class="reference external" href="https://www.python.org/dev/peps/pep-0384/">PEP 384</a>, but
more “implementation details” of CPython.</blockquote>
<p>The change is backward compatible: <tt class="docutils literal">#include <Python.h></tt> will still provide
exactly the same API. For example, <tt class="docutils literal">object.h</tt> automatically includes
<tt class="docutils literal">cpython/object.h</tt>. But <tt class="docutils literal">Include/cpython/</tt> headers must not be included
directly (it would fail with a compilation error).</p>
<p>For example, <tt class="docutils literal">Include/object.h</tt> now ends with:</p>
<pre class="literal-block">
#ifndef Py_LIMITED_API
# define Py_CPYTHON_OBJECT_H
# include "cpython/object.h"
# undef Py_CPYTHON_OBJECT_H
#endif
</pre>
<p><tt class="docutils literal">Include/cpython/object.h</tt> structure (content replaced with <tt class="docutils literal">...</tt>):</p>
<pre class="literal-block">
#ifndef Py_CPYTHON_OBJECT_H
# error "this header file must not be included directly"
#endif
#ifdef __cplusplus
extern "C" {
#endif
...
#ifdef __cplusplus
}
#endif
</pre>
<p>In Python 3.8, the work is not complete. I tried to double- or even
triple-check my changes to ensure that I don't remove an API by mistake. This
work is still on-going in Python 3.9.</p>
</div>
<div class="section" id="summary-of-include-directories">
<h2>Summary of Include/ directories</h2>
<p>The header files have been reorganized to better separate the different kinds
of APIs:</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">Include/*.h</span></tt> should be the portable public stable C API.</li>
<li><tt class="docutils literal"><span class="pre">Include/cpython/*.h</span></tt> should be the unstable C API specific to CPython;
public API, with some private API prefixed by <tt class="docutils literal">_Py</tt> or <tt class="docutils literal">_PY</tt>.</li>
<li><tt class="docutils literal"><span class="pre">Include/internal/*.h</span></tt> is the private internal C API very specific to
CPython. This API comes with no backward compatibility guarantee and should
not be used outside CPython. It is only exposed for very specific needs
like debuggers and profilers which have to access CPython internals
without calling functions. This API is now installed by <tt class="docutils literal">make install</tt>.</li>
</ul>
</div>
<div class="section" id="convert-macros-to-static-inline-functions">
<h2>Convert macros to static inline functions</h2>
<p>In <a class="reference external" href="https://bugs.python.org/issue35059">bpo-35059</a>, I converted some macros
to static inline functions:</p>
<ul class="simple">
<li><tt class="docutils literal">Py_INCREF()</tt>, <tt class="docutils literal">Py_DECREF()</tt></li>
<li><tt class="docutils literal">Py_XINCREF()</tt>, <tt class="docutils literal">Py_XDECREF()</tt></li>
<li><tt class="docutils literal">PyObject_INIT()</tt>, <tt class="docutils literal">PyObject_INIT_VAR()</tt></li>
<li>Private functions: <tt class="docutils literal">_PyObject_GC_TRACK()</tt>, <tt class="docutils literal">_PyObject_GC_UNTRACK()</tt>,
<tt class="docutils literal">_Py_Dealloc()</tt></li>
</ul>
<p>Compared to macros, static inline functions have multiple advantages:</p>
<ul class="simple">
<li>Parameter types and return type are well defined;</li>
<li>They don't have issues specific to macros: see <a class="reference external" href="https://gcc.gnu.org/onlinedocs/cpp/Macro-Pitfalls.html">GCC Macro Pitfalls</a>;</li>
<li>Variables have a well defined local scope.</li>
</ul>
<p>Python 3.7 uses ugly macros with comma and semicolon. Example:</p>
<pre class="literal-block">
#define _Py_REF_DEBUG_COMMA ,
#define _Py_CHECK_REFCNT(OP) /* a semicolon */;
#define _Py_NewReference(op) ( \
_Py_INC_TPALLOCS(op) _Py_COUNT_ALLOCS_COMMA \
_Py_INC_REFTOTAL _Py_REF_DEBUG_COMMA \
Py_REFCNT(op) = 1)
</pre>
<p><a class="reference external" href="https://www.python.org/dev/peps/pep-0007/#c-dialect">Python 3.6 requires the C99 standard of the C dialect</a>. It was time to start
using it :-)</p>
</div>
<div class="section" id="removed-functions">
<h2>Removed functions</h2>
<p><a class="reference external" href="https://bugs.python.org/issue35713">bpo-35713</a>: I removed
<tt class="docutils literal">PyByteArray_Init()</tt> and <tt class="docutils literal">PyByteArray_Fini()</tt> functions. They did nothing
since Python 2.7.4 and Python 3.2.0, were excluded from the limited API (stable
ABI), and were not documented.</p>
<p><a class="reference external" href="https://bugs.python.org/issue36728">bpo-36728</a>: I also removed
<tt class="docutils literal">PyEval_ReInitThreads()</tt> function. It should not be called explicitly: use
<tt class="docutils literal">PyOS_AfterFork_Child()</tt> instead.</p>
</div>
Python 3.8 sys.unraisablehook2019-06-15T01:00:00+02:002019-06-15T01:00:00+02:00Victor Stinnertag:vstinner.github.io,2019-06-15:/sys-unraisablehook-python38.html<a class="reference external image-reference" href="https://www.flickr.com/photos/dawnmanser/8046201692/"><img alt="Hidden kitten" src="https://vstinner.github.io/images/hidden_kitten.jpg" /></a>
<p>I added a new <a class="reference external" href="https://docs.python.org/dev/library/sys.html#sys.unraisablehook">sys.unraisablehook</a> function to
allow to set a custom hook to control how "unraisable exceptions" are handled.
It is already testable in <a class="reference external" href="https://pythoninsider.blogspot.com/2019/06/python-380b1-is-now-available-for.html">Python 3.8 beta1</a>,
released last week!</p>
<p>An "unraisable exception" is an error which happens when Python cannot report
it to the caller. Examples …</p><a class="reference external image-reference" href="https://www.flickr.com/photos/dawnmanser/8046201692/"><img alt="Hidden kitten" src="https://vstinner.github.io/images/hidden_kitten.jpg" /></a>
<p>I added a new <a class="reference external" href="https://docs.python.org/dev/library/sys.html#sys.unraisablehook">sys.unraisablehook</a> function which
allows setting a custom hook to control how "unraisable exceptions" are handled.
It is already testable in <a class="reference external" href="https://pythoninsider.blogspot.com/2019/06/python-380b1-is-now-available-for.html">Python 3.8 beta1</a>,
released last week!</p>
<p>An "unraisable exception" is an error which happens when Python cannot report
it to the caller. Examples: object finalizer error (<tt class="docutils literal">__del__()</tt>), weak
reference callback failure, error during a GC collection. At the C level, the
<tt class="docutils literal">PyErr_WriteUnraisable()</tt> function is called to handle such exceptions.</p>
<p>Designing the new hook was tricky, and so was its implementation.</p>
<p>The photo shows an exception waiting to catch you ;-)</p>
<div class="section" id="kill-python-at-the-first-unraisable-exception">
<h2>Kill Python at the first unraisable exception</h2>
<p>One month ago, <strong>Thomas Grainger</strong> opened <a class="reference external" href="https://bugs.python.org/issue36829">bpo-36829</a>: "CLI option to make
PyErr_WriteUnraisable abort the current process". He wrote:</p>
<blockquote>
Currently it's quite easy for these <strong>errors</strong> to go <strong>unnoticed</strong>. (...)
The point for me is that CI will fail if it happens, then <strong>I can use gdb</strong>
to find out the cause</blockquote>
<p><strong>Zackery Spytz</strong> wrote the <a class="reference external" href="https://github.com/python/cpython/pull/13175">PR 13175</a> to add <tt class="docutils literal"><span class="pre">-X</span> abortunraisable</tt>
command line option. When this option is used, <tt class="docutils literal">PyErr_WriteUnraisable()</tt>
calls <tt class="docutils literal"><span class="pre">Py_FatalError("Unraisable</span> exception")</tt> which calls <tt class="docutils literal">abort()</tt>: it
raises the <tt class="docutils literal">SIGABRT</tt> signal which kills the process by default.</p>
</div>
<div class="section" id="handle-unraisable-exception-in-python-sys-unraisablehook">
<h2>Handle unraisable exception in Python: sys.unraisablehook</h2>
<p>I concur with Thomas that it's easy to miss such exceptions, but I dislike
killing the process. It's not practical to have to use a low-level debugger
like gdb to handle such a bug.</p>
<p>I proposed a different design: add a new <tt class="docutils literal">sys.unraisablehook</tt> hook which
allows arbitrary Python code to handle an "unraisable exception".</p>
<p>I wrote a <a class="reference external" href="https://bugs.python.org/issue36829#msg341868">hook example</a> which
displays the Python stack where the exception occurred using the <tt class="docutils literal">traceback</tt>
module.</p>
<p>I chose to pass a single object as argument to <tt class="docutils literal">sys.unraisablehook</tt>. The
object has 4 attributes:</p>
<ul class="simple">
<li>exc_type: Exception type.</li>
<li>exc_value: Exception value, can be None.</li>
<li>exc_traceback: Exception traceback, can be None.</li>
<li>object: Object causing the exception, can be None.</li>
</ul>
<p>I wanted to design an <strong>extensible API</strong>: keep backward compatibility even
if tomorrow we want to add a new attribute to the object to pass more
information.</p>
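<p>Here is a minimal sketch of what such a hook looks like with this API (the
<tt class="docutils literal">captured</tt> list is mine, for illustration only):</p>
<pre class="literal-block">
import sys

captured = []

def hook(unraisable):
    # The single argument packs everything into attributes, so new
    # fields can be added later without breaking existing hooks.
    captured.append((unraisable.exc_type, unraisable.exc_value,
                     unraisable.object))

sys.unraisablehook = hook

class BrokenDel:
    def __del__(self):
        raise ValueError("del is broken")

obj = BrokenDel()
del obj  # __del__() fails: the hook is called, the caller sees nothing
</pre>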
</div>
<div class="section" id="adding-source-parameter-to-the-warnings-module">
<h2>Adding source parameter to the warnings module</h2>
<p>To explain the rationale of my proposed <tt class="docutils literal">sys.unraisablehook</tt> design (single
object with attributes), let me tell you about my bad experience with the
<tt class="docutils literal">warnings</tt> module.</p>
<div class="section" id="use-tracemalloc-for-resourcewarning">
<h3>Use tracemalloc for ResourceWarning</h3>
<p>In March 2016, I was tired of debugging <tt class="docutils literal">ResourceWarning</tt> warnings: it's
hard to guess where the bug comes from. The warning is logged where the
resource is released, but I was interested in where the resource was allocated.</p>
<p>My <a class="reference external" href="https://docs.python.org/dev/library/tracemalloc.html">tracemalloc</a> module
provides a convenient <a class="reference external" href="https://docs.python.org/dev/library/tracemalloc.html#tracemalloc.get_object_traceback">get_object_traceback()</a>
function which provides the traceback where any Python object has been allocated.</p>
<p>I opened <a class="reference external" href="https://bugs.python.org/issue26604">bpo-26604</a>: "ResourceWarning:
Use tracemalloc to display the traceback where an object was allocated when a
ResourceWarning is emitted".</p>
</div>
<div class="section" id="warnings-hooks-cannot-be-extended">
<h3>warnings hooks cannot be extended</h3>
<p>The problem is that the <tt class="docutils literal">showwarning()</tt> and <tt class="docutils literal">formatwarning()</tt> functions of
<tt class="docutils literal">warnings</tt> can be overridden. They use a fixed number of positional
parameters:</p>
<pre class="literal-block">
def showwarning(message, category, filename, lineno, file=None, line=None): ...
def formatwarning(message, category, filename, lineno, line=None): ...
</pre>
<p>If they are called with an additional parameter, they fail with a
<tt class="docutils literal">TypeError</tt>. I wanted to add a new <tt class="docutils literal">source</tt> parameter to these functions.</p>
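<p>A short sketch shows the problem (the extra <tt class="docutils literal">source</tt> argument passed at
the end is hypothetical here):</p>
<pre class="literal-block">
def my_showwarning(message, category, filename, lineno, file=None, line=None):
    # A typical third-party override using the historical signature
    print(f"{filename}:{lineno}: {category.__name__}: {message}")

# Calling the override with one more positional argument fails:
try:
    my_showwarning("unclosed file", ResourceWarning, "x.py", 1, None, None,
                   "the source object")
except TypeError as exc:
    error = str(exc)  # the override cannot accept the new argument
</pre>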
</div>
<div class="section" id="reuse-existing-warningmessage-class">
<h3>Reuse existing WarningMessage class</h3>
<p>To extend the warnings module, I chose to rely on the existing
<tt class="docutils literal">WarningMessage</tt> class which can be used to "pack" all parameters as a single
object. This class was used by <tt class="docutils literal">catch_warnings</tt> context manager.</p>
<p>I had to add new private <tt class="docutils literal">_showwarnmsg()</tt> and <tt class="docutils literal">_formatwarnmsg()</tt> functions.
They are called with a <tt class="docutils literal">WarningMessage</tt> instance. The implementation has to
detect when <tt class="docutils literal">showwarning()</tt> or <tt class="docutils literal">formatwarning()</tt> is overridden: the
overridden function must be called with the legacy API in this case. This
backward compatibility requirement makes the implementation complex.</p>
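<p>The detection logic can be sketched like this (simplified, with invented
names; the real code lives in the <tt class="docutils literal">warnings</tt> module and is more careful):</p>
<pre class="literal-block">
import warnings

_default_showwarning = warnings.showwarning  # captured at import time

def show_message(msg):
    # msg is a WarningMessage-like object packing all parameters
    current = warnings.showwarning
    if current is not _default_showwarning:
        # Legacy override detected: call it with the legacy positional
        # parameters only; newer fields (like a "source") are dropped.
        current(msg.message, msg.category, msg.filename, msg.lineno,
                msg.file, msg.line)
    else:
        # Modern path: free to use every attribute of msg.
        _default_showwarning(msg.message, msg.category, msg.filename,
                             msg.lineno, msg.file, msg.line)
</pre>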
</div>
<div class="section" id="regression">
<h3>Regression</h3>
<p>After Python 3.6 was released with my new feature, <a class="reference external" href="https://bugs.python.org/issue35178">bpo-35178</a> was reported. The <tt class="docutils literal">warnings</tt> module
called a custom <tt class="docutils literal">formatwarning()</tt> with the <tt class="docutils literal">line</tt> argument passed as a
keyword argument, whereas other arguments are passed as positional arguments.
The <a class="reference external" href="https://github.com/python/cpython/commit/be7c460fb50efe3b88a00281025d76acc62ad2fd">fix was trivial</a>,
but it shows that backward compatibility is hard.</p>
</div>
<div class="section" id="example">
<h3>Example</h3>
<p>By the way, here is an example of the feature using a <tt class="docutils literal">filebug.py</tt> script:</p>
<pre class="literal-block">
def func():
    f = open(__file__)
    f = None

func()
</pre>
<p>The feature adds the "Object allocated at" traceback, whereas the existing
<tt class="docutils literal">f = None</tt> output is worthless.</p>
<pre class="literal-block">
$ python3 -Wd -X tracemalloc=5 filebug.py
filebug.py:3: ResourceWarning: unclosed file <_io.TextIOWrapper name='filebug.py' mode='r' encoding='UTF-8'>
  f = None
Object allocated at (most recent call first):
  File "filebug.py", lineno 2
    f = open(__file__)
  File "filebug.py", lineno 5
    func()
</pre>
</div>
</div>
<div class="section" id="limitations-of-my-unraisablehook-idea">
<h2>Limitations of my unraisablehook idea</h2>
<p>To come back to <a class="reference external" href="https://bugs.python.org/issue36829">bpo-36829</a>, I identified
a limitation in my <tt class="docutils literal">sys.unraisablehook</tt> idea: unraisable exceptions which
occur very late during Python finalization cannot be handled by a custom hook.</p>
<p>Thomas said that he is fine with having to use <tt class="docutils literal">gdb</tt> to debug an issue
during Python finalization.</p>
<p>In my experience, using <tt class="docutils literal">gdb</tt> on system Python is unpleasant, since it's
usually deeply optimized (PGO + LTO optimizations). gdb fails to read variables
which are only displayed as <tt class="docutils literal"><optimized out></tt>. By the way, that's why I fixed
the <a class="reference external" href="https://docs.python.org/dev/whatsnew/3.8.html#debug-build-uses-the-same-abi-as-release-build">debug build of Python to be ABI compatible with a release build</a>,
but that's a different story.</p>
<p>Thomas's idea of killing the process allows detecting unraisable exceptions
whenever they occur.</p>
</div>
<div class="section" id="api-discussed-on-python-dev">
<h2>API discussed on python-dev</h2>
<p>I started a discussion on python-dev to get more feedback: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2019-May/157436.html">bpo-36829: Add
sys.unraisablehook()</a>.</p>
<div class="section" id="new-exception-while-handling-an-exception">
<h3>New exception while handling an exception</h3>
<p><strong>Nathaniel Smith</strong> asked: what happens if a custom hook raises a new exception?</p>
<p>This problem is easy to fix: <tt class="docutils literal">PyErr_WriteUnraisable()</tt> calls the default
hook to handle the new exception (I already implemented this solution).</p>
</div>
<div class="section" id="positional-arguments">
<h3>Positional arguments</h3>
<p><strong>Serhiy Storchaka</strong> <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2019-May/157439.html">preferred</a> passing 5
positional arguments (exc_type, exc_value, exc_tb, obj and msg):</p>
<blockquote>
Currently we have no plans for adding more details, and I do not think that
we will need to do this in future.</blockquote>
<p>Later, he added:</p>
<blockquote>
If you have plans for adding new details in future, I propose to add a 6th
parameter "context" or "extra" (always None currently). It is as extensible
as packing all arguments into a single structure, but you do not need to
introduce the structure type and create its instance until you need to pass
additional info.</blockquote>
</div>
<div class="section" id="reuse-sys-excepthook">
<h3>Reuse sys.excepthook</h3>
<p><strong>Steve Dower</strong> <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2019-May/157453.html">proposed to reuse sys.excepthook</a>, rather
than adding a new hook, and <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2019-May/157465.html">create a new exception to pass extra info</a>.</p>
<p><strong>Nathaniel</strong> <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2019-May/157460.html">explained</a> that
<tt class="docutils literal">sys.excepthook</tt> and <tt class="docutils literal">sys.unraisablehook</tt> have different behaviors and so
need to be separate hooks.</p>
</div>
<div class="section" id="object-resurrection">
<h3>Object resurrection</h3>
<p><strong>Steve Dower</strong> was <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2019-May/157452.html">concerned by object resurrection</a> and
proposed to only pass <tt class="docutils literal">repr(obj)</tt> to the hook.</p>
<p><a class="reference external" href="https://mail.python.org/pipermail/python-dev/2019-May/157463.html">I explained</a> that an
object can only be resurrected after its finalization, which is different from
deallocation. Accessing a finalized object should not crash Python.
Deallocation makes an object unusable, but deallocation only happens
once the last reference to an object is gone, and so the object is no longer
accessible.</p>
<p><a class="reference external" href="https://mail.python.org/pipermail/python-dev/2019-May/157467.html">Nathaniel added</a> that
<tt class="docutils literal">repr()</tt> would limit features of the hook:</p>
<blockquote>
A clever hook might want the actual object, so it can pretty-print it, or
open an interactive debugger and let you examine it, or something.</blockquote>
</div>
<div class="section" id="naming">
<h3>Naming</h3>
<p><strong>Gregory P. Smith</strong> proposed the term "uncatchable" rather than "unraisable".</p>
</div>
<div class="section" id="keyword-only-arguments">
<h3>Keyword-only arguments</h3>
<p><strong>Barry Warsaw</strong> <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2019-May/157457.html">suggested</a> to
consider keyword-only arguments to help future proof the call signature.</p>
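<p>Such a signature could have looked like this (hypothetical sketch, not the
design that was adopted):</p>
<pre class="literal-block">
def unraisablehook(*, exc_type, exc_value=None, exc_traceback=None, obj=None):
    # New keyword parameters could be added later with default values
    # without breaking existing callers.
    return exc_type

# Callers must spell out every argument name:
result = unraisablehook(exc_type=ValueError, obj=None)
</pre>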
</div>
<div class="section" id="avoid-redundant-exc-type-and-exc-traceback-parameters">
<h3>Avoid redundant exc_type and exc_traceback parameters</h3>
<p><strong>Petr Viktorin</strong> <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2019-May/157459.html">asked</a> why
<tt class="docutils literal">(exc_type, exc_value, exc_traceback)</tt> triple is needed, whereas <em>exc_type</em>
could be obtained from <tt class="docutils literal">type(exc_value)</tt> and <em>exc_traceback</em> from
<tt class="docutils literal">exc_value.__traceback__</tt>.</p>
<p><a class="reference external" href="https://mail.python.org/pipermail/python-dev/2019-May/157462.html">I made some tests</a>.
<em>exc_value</em> can be <tt class="docutils literal">NULL</tt> sometimes. In some cases, <em>exc_traceback</em> can be
set, whereas <tt class="docutils literal">exc_value.__traceback__</tt> is not set (<tt class="docutils literal">None</tt>).</p>
</div>
</div>
<div class="section" id="productive-discussion">
<h2>Productive discussion!</h2>
<p>As usual, the python-dev discussion was very productive. Each corner case has
been discussed and the API has been challenged.</p>
<p>Thanks to Petr's remark, I enhanced the existing hook to instantiate an
exception if <em>exc_value</em> is <tt class="docutils literal">NULL</tt>, create a traceback if <em>exc_traceback</em> is
<tt class="docutils literal">NULL</tt>, and set <tt class="docutils literal">exc_value.__traceback__</tt> to the traceback. If one of these
actions fails, the failure is silently ignored.</p>
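<p>This normalization can be sketched in Python (the real code is written in C;
the function name is mine):</p>
<pre class="literal-block">
def normalize(exc_type, exc_value, exc_tb):
    if exc_value is None:
        # Instantiate the exception from its type
        exc_value = exc_type()
    if exc_tb is not None and exc_value.__traceback__ is None:
        try:
            exc_value.__traceback__ = exc_tb
        except Exception:
            # Failures are silently ignored
            pass
    return exc_type, exc_value, exc_tb
</pre>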
<p>I also paid more attention to object resurrection.</p>
<p>After one week of discussion, I was not convinced by the alternative
proposals, whereas multiple core devs wrote that they liked my API.</p>
<p>I decided to push my <a class="reference external" href="https://github.com/python/cpython/commit/ef9d9b63129a2f243591db70e9a2dd53fab95d86">commit ef9d9b63</a>:</p>
<pre class="literal-block">
commit ef9d9b63129a2f243591db70e9a2dd53fab95d86
Author: Victor Stinner <vstinner@redhat.com>
Date:   Wed May 22 11:28:22 2019 +0200

    bpo-36829: Add sys.unraisablehook() (GH-13187)

    Add new sys.unraisablehook() function which can be overridden to
    control how "unraisable exceptions" are handled. It is called when an
    exception has occurred but there is no way for Python to handle it.
    For example, when a destructor raises an exception or during garbage
    collection (gc.collect()).
</pre>
</div>
<div class="section" id="new-err-msg-attribute">
<h2>New err_msg attribute</h2>
<p>Unraisable exceptions were logged with no context, only a hardcoded
"Exception ignored in:" error message.</p>
<p>Early in <tt class="docutils literal">sys.unraisablehook</tt> discussion, <strong>Serhiy</strong> proposed to add a new
<em>err_msg</em> parameter to pass an optional error message.</p>
<p>I implemented this idea in <a class="reference external" href="https://bugs.python.org/issue36829">bpo-36829</a>
with <a class="reference external" href="https://github.com/python/cpython/commit/71c52e3048dd07567f0c690eab4e5d57be66f534">commit 71c52e30</a>:</p>
<pre class="literal-block">
commit 71c52e3048dd07567f0c690eab4e5d57be66f534
Author: Victor Stinner <vstinner@redhat.com>
Date:   Mon May 27 08:57:14 2019 +0200

    bpo-36829: Add _PyErr_WriteUnraisableMsg() (GH-13488)
</pre>
<p>I was able to add a new parameter as a new <em>err_msg</em> attribute without breaking
backward compatibility!</p>
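<p>A hook written before this change keeps working, since it can simply probe
for the new attribute. A sketch (the fallback string mirrors the historical
message):</p>
<pre class="literal-block">
def hook(unraisable):
    # err_msg may be missing or None: fall back to the historical message
    msg = getattr(unraisable, "err_msg", None) or "Exception ignored in"
    return f"{msg}: {unraisable.object!r}"
</pre>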
</div>
<div class="section" id="test-support-catch-unraisable-exception">
<h2>test.support.catch_unraisable_exception()</h2>
<p>I wrote a new context manager catching unraisable exceptions:
<tt class="docutils literal">test.support.catch_unraisable_exception()</tt>. The exception is stored, so it
can be checked inside the context manager, and it is cleared at context manager
exit.</p>
<p>I modified tests to use this new context manager:</p>
<ul class="simple">
<li>test_coroutines</li>
<li>test_cprofile</li>
<li>test_exceptions</li>
<li>test_generators</li>
<li>test_io</li>
<li>test_raise</li>
<li>test_ssl</li>
<li>test_thread</li>
<li>test_yield_from</li>
</ul>
<p>Example:</p>
<pre class="literal-block">
class BrokenDel:
    def __del__(self):
        raise ValueError("del is broken")

obj = BrokenDel()
with support.catch_unraisable_exception() as cm:
    del obj
    self.assertEqual(cm.unraisable.object, BrokenDel.__del__)
</pre>
</div>
<div class="section" id="test-io-memory-leak-regression">
<h2>test_io memory leak regression</h2>
<p>I modified test_io to ignore expected unraisable exceptions:</p>
<pre class="literal-block">
commit c15a682603a47f5aef5025f6a2e3babb699273d6
Author: Victor Stinner <vstinner@redhat.com>
Date:   Thu Jun 13 00:23:49 2019 +0200

    bpo-37223: test_io: silence destructor errors (GH-14031)
</pre>
<p>This change introduced a memory leak, <a class="reference external" href="https://bugs.python.org/issue37261">bpo-37261</a>:</p>
<pre class="literal-block">
test_io leaked [23208, 23204, 23208] references, sum=69620
test_io leaked [7657, 7655, 7657] memory blocks, sum=22969
</pre>
<p>The problem was this <tt class="docutils literal">catch_unraisable_exception</tt> method:</p>
<pre class="literal-block">
def __exit__(self, *exc_info):
    del self.unraisable
    sys.unraisablehook = self._old_hook
</pre>
<p>Sometimes, <tt class="docutils literal">del self.unraisable</tt> triggered a new unraisable exception. At
this point, the <tt class="docutils literal">catch_unraisable_exception</tt> hook was still registered:</p>
<pre class="literal-block">
def _hook(self, unraisable):
    self.unraisable = unraisable
</pre>
<p>At the end, the <tt class="docutils literal">del self.unraisable</tt> instruction <em>indirectly</em> set the
<tt class="docutils literal">self.unraisable</tt> attribute again.</p>
<div class="section" id="first-fix">
<h3>First fix</h3>
<p>First, I suspected that the <tt class="docutils literal">io.BufferedRWPair</tt> object which triggered the
first unraisable exception was <strong>resurrected</strong>, and that <tt class="docutils literal">del
self.unraisable</tt> called its finalizer or deallocator again, which triggered
the <em>same</em> unraisable exception a second time.</p>
<p>My first attempt to fix the issue was to clear the <tt class="docutils literal">sys.unraisablehook</tt> by
setting it to <tt class="docutils literal">None</tt>, and only later delete the attribute:</p>
<pre class="literal-block">
def __exit__(self, *exc_info):
    self.unraisablehook = None
    sys.unraisablehook = self._old_hook
    del self.unraisable
</pre>
<p>If <tt class="docutils literal">self.unraisablehook = None</tt> triggers a new unraisable exception, it is
silently ignored.</p>
</div>
<div class="section" id="second-correct-fix">
<h3>Second correct fix</h3>
<p>But when I chatted with <strong>Pablo Galindo</strong>, he told me that an object cannot be
finalized twice thanks to <strong>Antoine Pitrou</strong>'s <a class="reference external" href="https://www.python.org/dev/peps/pep-0442/">PEP 442: Safe object finalization</a>.</p>
<p>I looked again into gdb. Oh. In fact, it's more subtle. <tt class="docutils literal">del self.unraisable</tt>
clears the last reference to <tt class="docutils literal">BufferedRWPair</tt> which calls its
<strong>deallocator</strong>. The deallocator indirectly calls the <tt class="docutils literal">BufferedWriter</tt>
finalizer; the <tt class="docutils literal">BufferedWriter</tt> was stored in the <tt class="docutils literal">BufferedRWPair</tt>. This
finalizer triggers a new unraisable exception.</p>
<p>So <tt class="docutils literal">BufferedRWPair</tt> does not trigger two unraisable exceptions: the second
one comes from a different object (<tt class="docutils literal">BufferedWriter</tt>).</p>
<p>My final fix is to restore the old hook before deleting the <tt class="docutils literal">unraisable</tt>
attribute:</p>
<pre class="literal-block">
def __exit__(self, *exc_info):
    sys.unraisablehook = self._old_hook
    del self.unraisable
</pre>
<p>And fix test_io using two nested context managers:</p>
<pre class="literal-block">
# Ignore BufferedWriter (of the BufferedRWPair) unraisable exception
with support.catch_unraisable_exception():
    # Ignore BufferedRWPair unraisable exception
    with support.catch_unraisable_exception():
        pair = None
        support.gc_collect()
    support.gc_collect()
</pre>
<p>I also documented corner cases in <tt class="docutils literal">sys.unraisablehook</tt> documentation:</p>
<blockquote>
<p><tt class="docutils literal">sys.unraisablehook</tt> can be overridden to control how unraisable
exceptions are handled.</p>
<p>Storing <em>exc_value</em> using a custom hook can create a <strong>reference cycle</strong>. It
should be cleared explicitly to break the reference cycle when the exception
is no longer needed.</p>
<p>Storing <em>object</em> using a custom hook <strong>can resurrect</strong> it if it is set to an
object which is being finalized. Avoid storing <em>object</em> after the custom
hook completes to avoid resurrecting objects.</p>
</blockquote>
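<p>In practice, a careful hook extracts what it needs immediately and keeps no
reference alive. A sketch following these recommendations:</p>
<pre class="literal-block">
import sys
import traceback

def careful_hook(unraisable):
    # Format immediately instead of storing exc_value (reference cycle)
    # or object (possible resurrection).
    lines = traceback.format_exception_only(unraisable.exc_type,
                                            unraisable.exc_value)
    sys.stderr.write("".join(lines))
    # No reference to unraisable survives past this point.
</pre>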
</div>
</div>
<div class="section" id="regrtest-now-detects-unraisable-exceptions">
<h2>regrtest now detects unraisable exceptions</h2>
<p>Once I fixed tests to silence all expected unraisable exceptions, I created
<a class="reference external" href="https://bugs.python.org/issue37069">bpo-37069</a> to modify regrtest to install
a custom hook. I merged my <a class="reference external" href="https://github.com/python/cpython/commit/95f61c8b1619e736bd5e29a0da0183234634b6e8">commit 95f61c8b</a>:</p>
<pre class="literal-block">
commit 95f61c8b1619e736bd5e29a0da0183234634b6e8
Author: Victor Stinner <vstinner@redhat.com>
Date:   Thu Jun 13 01:09:04 2019 +0200

    bpo-37069: regrtest uses sys.unraisablehook (GH-13759)

    regrtest now uses sys.unraisablehook() to mark a test as "environment
    altered" (ENV_CHANGED) if it emits an "unraisable exception".
    Moreover, regrtest logs a warning in this case.

    Use "python3 -m test --fail-env-changed" to catch unraisable
    exceptions in tests.
</pre>
<p>A test is marked as "environment altered" (ENV_CHANGED) if the test triggers an
unraisable exception. Using <tt class="docutils literal"><span class="pre">--fail-env-changed</span></tt> option (option used by
default on all Python CIs), a test is marked as failed in this case.</p>
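<p>The regrtest hook can be sketched like this (simplified, with invented names;
the real implementation lives in the regrtest code):</p>
<pre class="literal-block">
import sys

environment_altered = False

def regrtest_unraisable_hook(unraisable):
    global environment_altered
    # Mark the current test as ENV_CHANGED and log a warning
    environment_altered = True
    print(f"Warning -- unraisable exception: {unraisable.exc_value!r}",
          file=sys.stderr)
</pre>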
</div>
<div class="section" id="hook-features">
<h2>Hook features</h2>
<p><tt class="docutils literal">sys.unraisablehook</tt> allows setting a custom hook to handle unraisable
exceptions. It enables many interesting features:</p>
<ul class="simple">
<li>Log the exception into system logs, over the network, or open a popup.</li>
<li>Inspect the Python stack: <tt class="docutils literal">traceback.print_stack()</tt></li>
<li>Inspect <em>object</em> content (object which caused the exception)</li>
<li>Get the traceback where <em>object</em> has been allocated:
<tt class="docutils literal">tracemalloc.get_object_traceback()</tt></li>
</ul>
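<p>For example, here is a hook combining several of these features (a sketch:
<tt class="docutils literal">tracemalloc</tt> must have been started early enough to know the allocation
site, otherwise <tt class="docutils literal">get_object_traceback()</tt> returns <tt class="docutils literal">None</tt>):</p>
<pre class="literal-block">
import sys
import traceback
import tracemalloc

def verbose_hook(unraisable):
    print(f"Unraisable exception: {unraisable.exc_value!r}")
    traceback.print_stack()  # current Python stack (written to stderr)
    tb = tracemalloc.get_object_traceback(unraisable.object)
    if tb is not None:
        print("Object allocated at (most recent call first):")
        print("\n".join(tb.format()))

tracemalloc.start(5)  # keep 5 frames per allocation
sys.unraisablehook = verbose_hook
</pre>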
<p>By the way, reimplementing Thomas's initial idea became trivial:</p>
<pre class="literal-block">
import signal, sys

def abort_hook(unraisable):
    signal.raise_signal(signal.SIGABRT)

sys.unraisablehook = abort_hook
</pre>
</div>
<div class="section" id="threading-excepthook">
<h2>threading.excepthook</h2>
<p>Since I was happy with <tt class="docutils literal">sys.unraisablehook</tt>, I decided to work on the
14-year-old issue <a class="reference external" href="https://bugs.python.org/issue1230540">bpo-1230540</a>: I proposed to
add <a class="reference external" href="https://docs.python.org/dev/library/threading.html#threading.excepthook">threading.excepthook()</a>,
but that's a different story!</p>
</div>
asyncio WSASend() memory leak2019-03-06T20:00:00+01:002019-03-06T20:00:00+01:00Victor Stinnertag:vstinner.github.io,2019-03-06:/asyncio-proactor-wsasend-memory-leak.html<a class="reference external image-reference" href="https://www.flickr.com/photos/jronaldlee/5996590138/"><img alt="Leaking tap" src="https://vstinner.github.io/images/leaking_tap.jpg" /></a>
<p>I fixed multiple bugs in asyncio <tt class="docutils literal">ProactorEventLoop</tt> previously. But test_asyncio
still failed sometimes. I noticed a memory leak in <tt class="docutils literal">test_asyncio</tt> which would
haunt me for a year in 2018...</p>
<p><strong>Yet another example of a test failure which looks harmless but hides a
critical bug.</strong> The bug is that sending a …</p><a class="reference external image-reference" href="https://www.flickr.com/photos/jronaldlee/5996590138/"><img alt="Leaking tap" src="https://vstinner.github.io/images/leaking_tap.jpg" /></a>
<p>I fixed multiple bugs in asyncio <tt class="docutils literal">ProactorEventLoop</tt> previously. But test_asyncio
still failed sometimes. I noticed a memory leak in <tt class="docutils literal">test_asyncio</tt> which would
haunt me for a year in 2018...</p>
<p><strong>Yet another example of a test failure which looks harmless but hides a
critical bug.</strong> The bug is that sending a network packet on Windows using
asyncio <tt class="docutils literal">ProactorEventLoop</tt> can leak the packet. With such bug, it is easy to
imagine a very quick increase of the memory footprint of a network server...</p>
<p>I'm curious why nobody noticed it before me. The only explanation I see is
that nobody was running a server using <tt class="docutils literal">ProactorEventLoop</tt>. Before Python
3.8, <tt class="docutils literal">SelectorEventLoop</tt> was the default asyncio event loop on Windows.
<a class="reference external" href="https://bugs.python.org/issue34687">bpo-34687</a>: Andrew Svetlov, Yury
Selivanov and I agreed to make <tt class="docutils literal">ProactorEventLoop</tt> the default in Python
3.8! <tt class="docutils literal">Lib/asyncio/windows_events.py</tt> change of my <a class="reference external" href="https://github.com/python/cpython/commit/6ea29c5e90dde6c240bd8e0815614b52ac307ea1">commit 6ea29c5e</a>:</p>
<pre class="literal-block">
-DefaultEventLoopPolicy = WindowsSelectorEventLoopPolicy
+DefaultEventLoopPolicy = WindowsProactorEventLoopPolicy
</pre>
<p>The bug wasn't a regression. It was only discovered 5 years after the code had
been written, thanks to new tests.</p>
<p><strong>UPDATE:</strong> I updated the article to add the "Regression? Nope" section and
elaborate the Conclusion.</p>
<p>Previous article:
<a class="reference external" href="https://vstinner.github.io/asyncio-proactor-wsarecv-cancellation-data-loss.html">asyncio: WSARecv() cancellation causing data loss</a>.</p>
<div class="section" id="yet-another-random-buildbot-failure">
<h2>Yet another random buildbot failure</h2>
<p>One day at the end of January 2018, I noticed a new failure on the "AMD64
Windows8.1 Refleaks 3.x" buildbot worker. I reported <a class="reference external" href="https://bugs.python.org/issue32710">bpo-32710</a>:</p>
<blockquote>
<p>AMD64 Windows8.1 Refleaks 3.x:
<a class="reference external" href="http://buildbot.python.org/all/#/builders/80/builds/118">http://buildbot.python.org/all/#/builders/80/builds/118</a></p>
<p>test_asyncio leaked [4, 4, 3] memory blocks, sum=11</p>
<p>I reproduced the issue. I'm running test.bisect to try to isolate this bug.</p>
</blockquote>
<p>Only 15 minutes later thanks to my <tt class="docutils literal">test.bisect</tt> tool, I identified the
leaking test, <strong>test_sendfile_close_peer_in_middle_of_receiving()</strong>:</p>
<pre class="literal-block">
It seems to be related to sendfile():

C:\vstinner\python\master>python -m test -R 3:3 test_asyncio \
    -m test.test_asyncio.test_events.ProactorEventLoopTests.test_sendfile_close_peer_in_middle_of_receiving
...
test_asyncio leaked [1, 2, 1] memory blocks, sum=4
</pre>
<p>The test is identified, so it should take a few hours, maximum, to fix the bug,
no? We will see...</p>
</div>
<div class="section" id="april">
<h2>April</h2>
<p>3 months later, I asked:</p>
<blockquote>
The test is still leaking memory blocks. Any progress on investigating the
issue?</blockquote>
<p>Nobody replied.</p>
<p>At that time, I was busy fixing a bunch of various other bugs reported by
buildbots which were easier to fix, and I was kind of exhausted by asyncio: I
didn't want to touch it.</p>
</div>
<div class="section" id="june">
<h2>June</h2>
<p>Oh, I found again this bug while working on my <a class="reference external" href="https://github.com/python/cpython/pull/7827">PR 7827</a> (detect handle leaks on Windows
in regrtest).</p>
<p>In 2018, I was very busy fixing dozens of multiprocessing issues (fixing tests,
but also fixing some bugs in multiprocessing itself).</p>
<p>For example, I noticed another memory leak on AMD64 Windows8.1 Refleaks
3.7, <a class="reference external" href="https://bugs.python.org/issue33735#msg318425">bpo-33735</a>:</p>
<blockquote>
<p><a class="reference external" href="http://buildbot.python.org/all/#/builders/132/builds/154">http://buildbot.python.org/all/#/builders/132/builds/154</a></p>
<p>test_multiprocessing_spawn leaked [1, 2, 1] memory blocks, sum=4</p>
</blockquote>
<p>This test_multiprocessing_spawn leak and the test_asyncio leak on Windows
Refleaks haunted me in 2018...</p>
<p>In fact, it wasn't a real leak. After a few runs, <a class="reference external" href="https://bugs.python.org/issue33735#msg320948">the test stopped leaking</a>:</p>
<pre class="literal-block">
$ ./python -m test test_multiprocessing_spawn \
    -m test.test_multiprocessing_spawn.WithProcessesTestPool.test_imap_unordered \
    -R 1:30
...
test_multiprocessing_spawn leaked [4, 5, 1, 5, 1, 2, 0, 0, 0, ..., 0, 0, 0] memory blocks, sum=18
test_multiprocessing_spawn failed in 42 sec 470 ms
</pre>
<p>I fixed the test with <a class="reference external" href="https://github.com/python/cpython/commit/23401fb960bb94e6ea62d2999527968d53d3fc65">commit
23401fb9</a>.</p>
<p>I fixed other multiprocessing bugs like <a class="reference external" href="https://bugs.python.org/issue33929">bpo-33929</a>.</p>
<p>These multiprocessing bugs kept me busy.</p>
</div>
<div class="section" id="july-december">
<h2>July-December</h2>
<p>Nothing. Nobody looked at the issue.</p>
<p>Again, I was busy fixing various test failures reported by buildbots.</p>
</div>
<div class="section" id="update-in-january-2019">
<h2>Update in January 2019</h2>
<p>In January 2019, after months of hard work on fixing every single buildbot
failure, I realized <strong>suddenly</strong> that the <tt class="docutils literal">test_asyncio</tt> leak, <a class="reference external" href="https://bugs.python.org/issue32710">bpo-32710</a>, was one of the last known unfixed test
failures! So I decided to have a new look at it.</p>
<p>Update on <tt class="docutils literal">test_asyncio.test_sendfile.ProactorEventLoopTests</tt>:</p>
<ul class="simple">
<li><tt class="docutils literal">test_sendfile_close_peer_in_the_middle_of_receiving()</tt> leaks 1 reference per
run: this leak was the obvious bug <a class="reference external" href="https://bugs.python.org/issue35682">bpo-35682</a>, which I already fixed with <a class="reference external" href="https://github.com/python/cpython/commit/80fda712c83f5dd9560d42bf2aa65a72b18b7759">commit
80fda712</a>.</li>
<li><tt class="docutils literal">test_sendfile_fallback_close_peer_in_the_middle_of_receiving()</tt> leaks 1
reference per run: <strong>I don't understand why</strong>.</li>
</ul>
<p>Note: I had to copy/paste these test names a lot of times. Pleeease, for my
comfort, use shorter test names! :-) (I had to copy/paste them, I don't think
that a regular human is able to type these very long names!)</p>
<p>I spent a lot of time investigating the
<tt class="docutils literal">test_sendfile_fallback_close_peer_in_the_middle_of_receiving()</tt> leak and still didn't
understand the issue.</p>
<p>The main loop is <tt class="docutils literal">BaseEventLoop._sendfile_fallback()</tt>. For
the specific case of this test, the loop can be simplified to:</p>
<pre class="literal-block">
proto = _SendfileFallbackProtocol(transp)
try:
    while True:
        data = b'x' * (1024 * 64)
        await proto.drain()
        transp.write(data)
finally:
    await proto.restore()
</pre>
<p>The server closes the connection after it gets 1024 bytes. The client socket
gets a <tt class="docutils literal">ConnectionAbortedError</tt> exception in
<tt class="docutils literal">_ProactorBaseWritePipeTransport._loop_writing()</tt> which calls <tt class="docutils literal">_fatal_error()</tt>:</p>
<pre class="literal-block">
except OSError as exc:
    self._fatal_error(exc, 'Fatal write error on pipe transport')
</pre>
<p><tt class="docutils literal">_fatal_error()</tt> calls <tt class="docutils literal">_force_close()</tt> which sets <tt class="docutils literal">_closing</tt> to
<tt class="docutils literal">True</tt>, and calls <tt class="docutils literal">protocol.connection_lost()</tt>. In the meanwhile,
<tt class="docutils literal">drain()</tt> raises <tt class="docutils literal">ConnectionError</tt> because <tt class="docutils literal">is_closing()</tt> is true:</p>
<pre class="literal-block">
async def drain(self):
    if self._transport.is_closing():
        raise ConnectionError("Connection closed by peer")
    ...
</pre>
<p>Said differently: <strong>everything works as expected</strong>.</p>
</div>
<div class="section" id="regression-caused-by-my-previous-proactor-fix">
<h2>Regression caused by my previous proactor fix?</h2>
<p>I suspected my own <a class="reference external" href="https://github.com/python/cpython/commit/79790bc35fe722a49977b52647f9b5fe1deda2b7">commit 79790bc3</a>
pushed 7 months ago to fix a race condition in WSARecv() causing data loss
(that's my previous article: <a class="reference external" href="https://vstinner.github.io/asyncio-proactor-wsarecv-cancellation-data-loss.html">asyncio: WSARecv() cancellation causing data loss</a>).</p>
<p>Hint: nah, it's unrelated. Moreover, this change has been pushed in May,
whereas I reported <a class="reference external" href="https://bugs.python.org/issue32710">bpo-32710 leak</a> in
January.</p>
</div>
<div class="section" id="short-script-reproducing-the-leak">
<h2>Short script reproducing the leak</h2>
<p><strong>Identifying a leak of a single reference is really hard</strong> since the test uses
hundreds of Python objects! My blocker issue was to repeat the test enough
times to trigger the leak N times rather than getting a leak of exactly a
single Python reference. The problem was that the test failed when run more
than once.</p>
<p>All my previous attempts to identify the bug failed:</p>
<ul class="simple">
<li>Use <tt class="docutils literal">gc.get_referrers()</tt> to track references between Python objects.</li>
<li>Use <tt class="docutils literal">tracemalloc</tt> to track memory usage: the leak is too small, it's lost
in the results "noise".</li>
</ul>
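<p>The first of those attempts can be sketched with a toy example (a minimal
illustration, not the actual test code; the names are made up):</p>
<pre class="literal-block">
import gc

leaked = object()
container = [leaked]  # simulated accidental reference keeping the object alive

# gc.get_referrers() lists every object that refers to its argument,
# which helps to find who keeps a leaked object alive...
referrers = gc.get_referrers(leaked)
assert container in referrers
</pre>
<p>...but with hundreds of live objects involved in a real test, the output of
<tt class="docutils literal">gc.get_referrers()</tt> quickly becomes overwhelming, which is why this
approach failed here.</p>

```python
import gc

leaked = object()
container = [leaked]  # simulated accidental reference keeping the object alive

# gc.get_referrers() lists every object that refers to its argument,
# which helps to find who keeps a leaked object alive...
referrers = gc.get_referrers(leaked)
assert container in referrers
```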
<p>I decided to do what I should have done first: <strong>remove as much code as
possible</strong> to reduce the amount of code that I had to audit. I removed most Python
imports, manually inlined function calls, removed a lot of code which was
unused in the test, etc.</p>
<p>After a few hours, I managed to reduce the giant pile of code used by the test
into a very short script of only 159 lines of Python code: <a class="reference external" href="https://bugs.python.org/file48030/test_aiosend.py">test_aiosend.py</a>. The script doesn't call
the asyncio <tt class="docutils literal">sendfile()</tt> implementation, but uses its own copy of the code,
simplified to do exactly what the test needs:</p>
<pre class="literal-block">
async def sendfile(transp):
    proto = _SendfileFallbackProtocol(transp)
    try:
        data = b'x' * (1024 * 24)
        while True:
            await proto.drain()
            transp.write(data)
    finally:
        await proto.restore()
</pre>
<p>with a local copy of the code of <tt class="docutils literal">_SendfileFallbackProtocol</tt> class.</p>
<p>Having all the code involved in the bug in a single file makes it way more
efficient to follow the control flow and understand what happens.</p>
<p>The original code is waaaaay more complex, scattered across multiple Python
files in <tt class="docutils literal">Lib/asyncio</tt> and <tt class="docutils literal">Lib/test/test_asyncio/</tt> directories.</p>
</div>
<div class="section" id="root-bug-identified-wsasend">
<h2>Root bug identified: WSASend()</h2>
<p><strong>It took me 1 year, a few sleepless nights, multiple attempts to understand
the leak, but I eventually found it!</strong> WSASend() doesn't release the memory if
it fails immediately. I expected something way more complex, but it's that
simple...</p>
<p>Using the <tt class="docutils literal">test_aiosend.py</tt> script that I created, I was finally able to
repeat the test in a loop. Thanks to that, it became obvious using
<tt class="docutils literal">tracemalloc</tt> that the leaked memory was the memory passed to <tt class="docutils literal">WSASend()</tt>.</p>
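<p>That <tt class="docutils literal">tracemalloc</tt> workflow can be sketched as follows, with a fake
leaky function standing in for the real test (names are illustrative, not from
<tt class="docutils literal">test_aiosend.py</tt>):</p>
<pre class="literal-block">
import tracemalloc

kept = []

def leaky_operation():
    # simulated bug: every call keeps a 64 KiB buffer alive forever
    kept.append(b'x' * (64 * 1024))

tracemalloc.start()
before = tracemalloc.take_snapshot()
for _ in range(100):  # repeating the test amplifies the leak above the noise
    leaky_operation()
after = tracemalloc.take_snapshot()

# the biggest difference points at the allocation site of the leak
top_stat = after.compare_to(before, 'lineno')[0]
print(top_stat)
</pre>
<p>Once the leak dominates the snapshot diff, the top statistic names the file
and line of the allocation, which is exactly how the <tt class="docutils literal">WSASend()</tt>
buffer was identified.</p>

```python
import tracemalloc

kept = []

def leaky_operation():
    # simulated bug: every call keeps a 64 KiB buffer alive forever
    kept.append(b'x' * (64 * 1024))

tracemalloc.start()
before = tracemalloc.take_snapshot()
for _ in range(100):  # repeating the test amplifies the leak above the noise
    leaky_operation()
after = tracemalloc.take_snapshot()

# the biggest difference points at the allocation site of the leak
top_stat = after.compare_to(before, 'lineno')[0]
print(top_stat)
```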
<p>I pushed <a class="reference external" href="https://github.com/python/cpython/commit/a234e148394c2c7419372ab65b773d53a57f3625">commit a234e148</a>
to fix <tt class="docutils literal">WSASend()</tt>:</p>
<pre class="literal-block">
commit a234e148394c2c7419372ab65b773d53a57f3625
Author: Victor Stinner <vstinner@redhat.com>
Date:   Tue Jan 8 14:23:09 2019 +0100

    bpo-32710: Fix leak in Overlapped_WSASend() (GH-11469)

    Fix a memory leak in asyncio in the ProactorEventLoop when ReadFile()
    or WSASend() overlapped operation fail immediately: release the
    internal buffer.
</pre>
<p>I was very disappointed by the simplicity of the fix, <strong>it only adds a single
line</strong>:</p>
<pre class="literal-block">
diff --git a/Modules/overlapped.c b/Modules/overlapped.c
index 69875a7f37da..bbaa4fb3008f 100644
--- a/Modules/overlapped.c
+++ b/Modules/overlapped.c
@@ -1011,6 +1012,7 @@ Overlapped_WSASend(OverlappedObject *self, PyObject *args)
         case ERROR_IO_PENDING:
             Py_RETURN_NONE;
         default:
+            PyBuffer_Release(&self->user_buffer);
             self->type = TYPE_NOT_STARTED;
             return SetFromWindowsErr(err);
     }
</pre>
<p>So what? One year to add a single line? That's unfair!</p>
<p>My commit contains a very similar fix for <tt class="docutils literal">do_ReadFile()</tt> used by
<tt class="docutils literal">Overlapped_ReadFile()</tt> and <tt class="docutils literal">Overlapped_ReadFileInto()</tt>.</p>
</div>
<div class="section" id="fixing-more-memory-leaks">
<h2>Fixing more memory leaks</h2>
<p>By the way, the <tt class="docutils literal">_overlapped.Overlapped</tt> type has no traverse function: adding
one may help the garbage collector. Asyncio is famous for building reference
cycles by design in <tt class="docutils literal">Future.set_exception()</tt>.</p>
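<p>The cycle comes from the exception's traceback: it references the frame where
the exception was raised, and that frame's locals reference the future back. A
minimal model of the mechanism (using a stand-in class rather than a real
<tt class="docutils literal">asyncio.Future</tt>):</p>
<pre class="literal-block">
import gc
import weakref

class FakeFuture:
    # stand-in for asyncio.Future, which stores the exception the same way
    def set_exception(self, exc):
        self._exception = exc

def make_future():
    fut = FakeFuture()
    try:
        raise ValueError("boom")
    except ValueError as exc:
        # exc.__traceback__ references this frame, whose locals include fut:
        # fut -> exc -> traceback -> frame -> fut is a reference cycle
        fut.set_exception(exc)
    return fut

gc.disable()
ref = weakref.ref(make_future())
# the cycle keeps the future alive: only the cyclic GC can reclaim it
assert ref() is not None
gc.collect()
assert ref() is None
gc.enable()
</pre>
<p>This is why a missing <tt class="docutils literal">tp_traverse</tt> matters: without it, the garbage
collector cannot see such cycles through a C object.</p>

```python
import gc
import weakref

class FakeFuture:
    # stand-in for asyncio.Future, which stores the exception the same way
    def set_exception(self, exc):
        self._exception = exc

def make_future():
    fut = FakeFuture()
    try:
        raise ValueError("boom")
    except ValueError as exc:
        # exc.__traceback__ references this frame, whose locals include fut:
        # fut -> exc -> traceback -> frame -> fut is a reference cycle
        fut.set_exception(exc)
    return fut

gc.disable()
ref = weakref.ref(make_future())
# the cycle keeps the future alive: only the cyclic GC can reclaim it
alive_before_collect = ref() is not None
gc.collect()
alive_after_collect = ref() is not None
gc.enable()
```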
<p>I wrote <a class="reference external" href="https://github.com/python/cpython/pull/11489">PR 11489</a> to implement
<tt class="docutils literal">tp_traverse</tt> for the <tt class="docutils literal">_overlapped.Overlapped</tt> type. <a class="reference external" href="https://github.com/python/cpython/pull/11489#pullrequestreview-191093765">Serhiy Storchaka
added</a>:</p>
<blockquote>
I suspect that there are leaks when self->type is set to TYPE_NOT_STARTED.</blockquote>
<p>And he was right! I modified my PR to fix all memory leaks. After my PR has
been reviewed, I merged it, <a class="reference external" href="https://github.com/python/cpython/commit/5485085b324a45307c1ff4ec7d85b5998d7d5e0d">commit 5485085b</a>:</p>
<pre class="literal-block">
commit 5485085b324a45307c1ff4ec7d85b5998d7d5e0d
Author: Victor Stinner <vstinner@redhat.com>
Date:   Fri Jan 11 14:35:14 2019 +0100

    bpo-32710: Fix _overlapped.Overlapped memory leaks (GH-11489)

    Fix memory leaks in asyncio ProactorEventLoop on overlapped operation
    failures.

    Changes:

    * Implement the tp_traverse slot in the _overlapped.Overlapped type
      to help to break reference cycles and identify referrers in the
      garbage collector.
    * Always clear overlapped on failure: not only set type to
      TYPE_NOT_STARTED, but release also resources.
</pre>
</div>
<div class="section" id="regression-nope">
<h2>Regression? Nope</h2>
<p>Was the memory leak a regression? Nope. The bug existed since the creation of
the <tt class="docutils literal">overlapped.c</tt> file in the "Tulip" project in 2013, <a class="reference external" href="https://github.com/python/asyncio/commit/27c403531670f52cad8388aaa2a13a658f753fd5">commit 27c40353</a>:</p>
<pre class="literal-block">
commit 27c403531670f52cad8388aaa2a13a658f753fd5
Author: Richard Oudkerk <shibturn@gmail.com>
Date:   Mon Jan 21 20:34:38 2013 +0000

    New experimental iocp branch.
</pre>
<p>Tulip was the old name of the asyncio project, when it was still an external
project on <tt class="docutils literal">code.google.com</tt>. In the meanwhile, <tt class="docutils literal">code.google.com</tt> has been
closed and the project moved to <a class="reference external" href="https://github.com/python/asyncio/">https://github.com/python/asyncio/</a> (now
read-only).</p>
<p><a class="reference external" href="https://github.com/python/asyncio/blob/27c403531670f52cad8388aaa2a13a658f753fd5/overlapped.c#L632-L658">Extract of the original Overlapped_WSASend() implementation</a>,
I added a comment to show the location of the bug:</p>
<pre class="literal-block">
if (!PyArg_Parse(bufobj, "y*", &self->write_buffer))
return NULL;
#if SIZEOF_SIZE_T > SIZEOF_LONG
if (self->write_buffer.len > (Py_ssize_t)PY_ULONG_MAX) {
PyBuffer_Release(&self->write_buffer);
PyErr_SetString(PyExc_ValueError, "buffer to large");
return NULL;
}
#endif
...
self->error = err = (ret < 0 ? WSAGetLastError() : ERROR_SUCCESS);
switch (err) {
case ERROR_SUCCESS:
case ERROR_MORE_DATA:
case ERROR_IO_PENDING:
/********* !!! BUG HERE, BUFFER NOT RELEASED !!! ***********/
Py_RETURN_NONE;
...
}
</pre>
<p><strong>I fixed the memory leak 6 years after the code had been written!</strong></p>
<p>So... why was this bug only discovered in 2018? Multiple very old asyncio bugs
were only discovered recently thanks to more realistic and more advanced
<strong>functional tests</strong>. The first asyncio tests were mostly tiny unit tests
mocking most parts of the code. It made sense in the early days of asyncio, when
the code was not mature.</p>
<p>By the way, the <a class="reference external" href="https://github.com/python/cpython/blob/1f58f4fa6a0e3c60cee8df4a35c8dcf3903acde8/Lib/test/test_asyncio/test_sendfile.py#L446-L457">code of the test</a>
which helped to discover the bug is:</p>
<pre class="literal-block">
def test_sendfile_close_peer_in_the_middle_of_receiving(self):
    srv_proto, cli_proto = self.prepare_sendfile(close_after=1024)
    with self.assertRaises(ConnectionError):
        self.run_loop(
            self.loop.sendfile(cli_proto.transport, self.file))
    self.run_loop(srv_proto.done)

    self.assertTrue(1024 <= srv_proto.nbytes < len(self.DATA),
                    srv_proto.nbytes)
    self.assertTrue(1024 <= self.file.tell() < len(self.DATA),
                    self.file.tell())
    self.assertTrue(cli_proto.transport.is_closing())
</pre>
<p>Note: The test name has been made even longer in the meanwhile (add "the") :-)</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>For such complex bugs, <strong>a reliable debugging method is to remove as much code as
possible</strong> to reduce the number of lines of code that must be read.
<tt class="docutils literal">tracemalloc</tt> remains efficient at identifying a memory leak once a test can be
run in a loop to make the leak more obvious (I was blocked at the beginning
because the test failed when run a second time in a loop).</p>
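<p>The "run it in a loop and amplify the leak" trick can be reduced to a small
helper. This is an illustrative sketch (the <tt class="docutils literal">leaked_blocks()</tt> name is
made up); <tt class="docutils literal">sys.getallocatedblocks()</tt> is the counter that regrtest's
<tt class="docutils literal"><span class="pre">-R</span></tt> option relies on for "memory blocks":</p>
<pre class="literal-block">
import gc
import sys

def leaked_blocks(func, repeat=100):
    """Return how many memory blocks `repeat` calls of func() left behind."""
    # warmup runs, so one-time caches are not counted as leaks
    for _ in range(repeat):
        func()
    gc.collect()
    before = sys.getallocatedblocks()
    for _ in range(repeat):
        func()
    gc.collect()
    return sys.getallocatedblocks() - before

kept = []
no_leak = leaked_blocks(lambda: None)
leak = leaked_blocks(lambda: kept.append(object()))
print(no_leak, leak)  # the leaky function keeps roughly one block per call
</pre>
<p>A real refleak hunt must also deal with warmup noise and caches, which is
exactly why the first runs of a test often report a few spurious "leaked"
blocks.</p>

```python
import gc
import sys

def leaked_blocks(func, repeat=100):
    """Return how many memory blocks `repeat` calls of func() left behind."""
    # warmup runs, so one-time caches are not counted as leaks
    for _ in range(repeat):
        func()
    gc.collect()
    before = sys.getallocatedblocks()
    for _ in range(repeat):
        func()
    gc.collect()
    return sys.getallocatedblocks() - before

kept = []
no_leak = leaked_blocks(lambda: None)
leak = leaked_blocks(lambda: kept.append(object()))
print(no_leak, leak)  # the leaky function keeps roughly one block per call
```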
<p>Lessons learned? You should try to <strong>investigate every single failure of your
CI</strong>. It is important to have a test suite with functional tests. "Mock tests"
are fine to quickly write reliable tests, but they are not enough: functional
tests make the difference.</p>
<p>Thanks <strong>Richard Oudkerk</strong> for your great code to use Windows native APIs in
<strong>asyncio</strong> and <strong>multiprocessing</strong>! I like <a class="reference external" href="https://en.wikipedia.org/wiki/Input/output_completion_port">Windows IOCP</a>, even if the
asyncio implementation is quite complex :-)</p>
<p>Ok, <tt class="docutils literal">_overlapped.Overlapped</tt> should now have a few less memory leaks :-)</p>
</div>
asyncio: WSARecv() cancellation causing data loss2019-01-31T15:20:00+01:002019-01-31T15:20:00+01:00Victor Stinnertag:vstinner.github.io,2019-01-31:/asyncio-proactor-wsarecv-cancellation-data-loss.html<a class="reference external image-reference" href="https://www.flickr.com/photos/joybot/6026542856/"><img alt="Unlocked lock" src="https://vstinner.github.io/images/lock.jpg" /></a>
<p>In December 2017, <strong>Yury Selivanov</strong> pushed the long awaited <tt class="docutils literal">start_tls()</tt>
function.</p>
<p>A newly added test failed on Windows. Later, the test started to fail
randomly on Linux as well. In fact, it was a well hidden race condition in the
asynchronous handshake of <tt class="docutils literal">SSLProtocol</tt> which will take 5 months of work to
be identified and fixed. The bug wasn't a recent regression, but only spotted
thanks to newly added tests.</p>
<p>Even after this bug has been fixed, the same test still failed randomly on
Windows! Once I found how to reproduce the bug, I understood that it's a <strong>very
scary bug</strong>: <tt class="docutils literal">WSARecv()</tt> cancellation randomly caused <strong>data loss</strong>! Again,
it was a very well hidden bug which likely existing since the early days of the
<tt class="docutils literal">ProactorEventLoop</tt> implementation.</p>
<p>Previous article: <a class="reference external" href="https://vstinner.github.io/asyncio-proactor-connect-pipe-race-condition.html">Asyncio: Proactor ConnectPipe() Race Condition</a>.
Next article: <a class="reference external" href="https://vstinner.github.io/asyncio-proactor-wsasend-memory-leak.html">asyncio: WSASend() memory leak</a>.</p>
<div class="section" id="new-start-tls-function">
<h2>New start_tls() function</h2>
<p>The "starttls" feature has been requested since the creation of asyncio. On
October 24, 2013, <strong>Guido van Rossum</strong> created <a class="reference external" href="https://github.com/python/asyncio/issues/79">asyncio issue #79</a>:</p>
<blockquote>
<strong>Glyph [Lefkowitz]</strong> and <strong>Antoine [Pitrou]</strong> really want a API to upgrade an
existing Transport/Protocol pair to SSL/TLS, without having to create a new
protocol.</blockquote>
<p>On March 23, 2015, <strong>Giovanni Cannata</strong> created <a class="reference external" href="https://bugs.python.org/issue23749">bpo-23749</a>, which is basically the same feature
request. I <a class="reference external" href="https://bugs.python.org/issue23749#msg239022">replied</a>:</p>
<blockquote>
asyncio got a new SSL implementation which makes possible to implement
STARTTLS. Are you interested to implement it?</blockquote>
<p><strong>Elizabeth Myers</strong>, <strong>Antoine Pitrou</strong>, <strong>Guido van Rossum</strong> and
<strong>Yury Selivanov</strong> designed the feature. Yury <a class="reference external" href="https://bugs.python.org/issue23749#msg253495">wrote a prototype</a> in 2015 for PostgreSQL. In
2017, <strong>Barry Warsaw</strong> <a class="reference external" href="https://bugs.python.org/issue23749#msg293912">wrote his own implementation for SMTP</a>.</p>
<p>At the end of 2017, <strong>four years</strong> after Guido van Rossum created the feature
request, <strong>Yury Selivanov</strong> implemented the feature and pushed the <a class="reference external" href="https://github.com/python/cpython/commit/f111b3dcb414093a4efb9d74b69925e535ddc470">commit
f111b3dc</a>:</p>
<pre class="literal-block">
commit f111b3dcb414093a4efb9d74b69925e535ddc470
Author: Yury Selivanov <yury@magic.io>
Date:   Sat Dec 30 00:35:36 2017 -0500

    bpo-23749: Implement loop.start_tls() (#5039)
</pre>
</div>
<div class="section" id="sslprotocol-race-condition">
<h2>SSLProtocol Race Condition</h2>
<div class="section" id="test-fails-on-appveyor-windows-temporary-fix">
<h3>Test fails on AppVeyor (Windows): temporary fix</h3>
<p>On December 30, 2017, just after Yury pushed his implementation of
<tt class="docutils literal">start_tls()</tt> (the same day), <strong>Antoine Pitrou</strong> reported <a class="reference external" href="https://bugs.python.org/issue32458">bpo-32458</a>: test_asyncio seems to fail
sporadically on AppVeyor:</p>
<pre class="literal-block">
ERROR: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLS)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\cpython\lib\test\test_asyncio\test_sslproto.py", line 284, in test_start_tls_server_1
    asyncio.wait_for(main(), loop=self.loop, timeout=10))
  File "C:\projects\cpython\lib\asyncio\base_events.py", line 440, in run_until_complete
    return future.result()
  File "C:\projects\cpython\lib\asyncio\tasks.py", line 398, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError
</pre>
<p><strong>Yury Selivanov</strong> <a class="reference external" href="https://bugs.python.org/issue32458#msg309254">wrote</a>:</p>
<blockquote>
I'm leaving on a two-weeks vacation today. To avoid risking breaking the workflow, I'll mask this tests on AppVeyor. I'll investigate this when I get back.</blockquote>
<p>and skipped the test as a <strong>temporary fix</strong>, <a class="reference external" href="https://github.com/python/cpython/commit/0c36bed1c46d07ef91d3e02e69e974e4f3ecd31a">commit 0c36bed1</a>:</p>
<pre class="literal-block">
commit 0c36bed1c46d07ef91d3e02e69e974e4f3ecd31a
Author: Yury Selivanov <yury@magic.io>
Date:   Sat Dec 30 15:40:20 2017 -0500

    bpo-32458: Temporarily mask start-tls proactor test on Windows (#5054)
</pre>
</div>
<div class="section" id="bug-reproduced-on-linux">
<h3>Bug reproduced on Linux</h3>
<p>On May 23, 2018, five months after the bug had been reported, <a class="reference external" href="https://bugs.python.org/issue32458#msg317468">I wrote</a>:</p>
<blockquote>
test_start_tls_server_1() just failed on my Linux. It likely depends on the system load.</blockquote>
<p>Christian Heimes <a class="reference external" href="https://bugs.python.org/issue32458#msg317760">added</a>:</p>
<blockquote>
[On Linux,] It's failing reproducible with OpenSSL 1.1.1 and TLS 1.3
enabled. I haven't seen it failing with TLS 1.2 yet.</blockquote>
<p>On May 28, 2018, I found a reliable way to <a class="reference external" href="https://bugs.python.org/issue32458#msg317833">reproduce the issue on Linux</a>:</p>
<blockquote>
<p>Open 3 terminals and run these commands in parallel:</p>
<ol class="arabic simple">
<li><tt class="docutils literal">./python <span class="pre">-m</span> test test_asyncio <span class="pre">-m</span> test_start_tls_server_1 <span class="pre">-F</span></tt></li>
<li><tt class="docutils literal">./python <span class="pre">-m</span> test <span class="pre">-j16</span> <span class="pre">-r</span></tt></li>
<li><tt class="docutils literal">./python <span class="pre">-m</span> test <span class="pre">-j16</span> <span class="pre">-r</span></tt></li>
</ol>
<p>It's a <strong>race condition</strong> which doesn't depend on the OS, but on the system
load.</p>
</blockquote>
</div>
<div class="section" id="root-issue-identified">
<h3>Root issue identified</h3>
<p>Once I found how to reproduce the bug, I was able to investigate it. I created
<a class="reference external" href="https://bugs.python.org/issue33674">bpo-33674</a>.</p>
<p>I found a race condition in <tt class="docutils literal">SSLProtocol</tt> of <tt class="docutils literal">asyncio/sslproto.py</tt>.
Sometimes, <tt class="docutils literal">_sslpipe.feed_ssldata()</tt> is called before
<tt class="docutils literal">_sslpipe.shutdown()</tt>.</p>
<ul class="simple">
<li><tt class="docutils literal">SSLProtocol.connection_made()</tt> -> <tt class="docutils literal">SSLProtocol._start_handshake()</tt>: <tt class="docutils literal">self._loop.call_soon(self._process_write_backlog)</tt></li>
<li><tt class="docutils literal">SSLProtocol.data_received()</tt>: direct call to <tt class="docutils literal">self._sslpipe.feed_ssldata(data)</tt></li>
<li>Later, <tt class="docutils literal">self._process_write_backlog()</tt> calls <tt class="docutils literal">self._sslpipe.do_handshake()</tt></li>
</ul>
<p>The first <strong>write</strong> is <strong>delayed</strong> by <tt class="docutils literal">call_soon()</tt>, whereas the first
<strong>read</strong> is a <strong>direct call</strong> to the SSL pipe.</p>
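<p>This ordering inversion can be reproduced with a few lines of asyncio,
independent of SSL (a minimal illustration, not the actual
<tt class="docutils literal">SSLProtocol</tt> code):</p>
<pre class="literal-block">
import asyncio

events = []

async def main():
    loop = asyncio.get_running_loop()
    # the "handshake" is scheduled for a later loop iteration...
    loop.call_soon(events.append, 'handshake (call_soon)')
    # ...while "data_received" is invoked directly, so it runs first
    events.append('data_received (direct call)')
    # yield once to let the scheduled callback run
    await asyncio.sleep(0)

asyncio.run(main())
print(events)
# ['data_received (direct call)', 'handshake (call_soon)']
</pre>
<p>The direct call always wins the race: that is exactly how
<tt class="docutils literal">data_received()</tt> could find the protocol before the handshake had
started.</p>

```python
import asyncio

events = []

async def main():
    loop = asyncio.get_running_loop()
    # the "handshake" is scheduled for a later loop iteration...
    loop.call_soon(events.append, 'handshake (call_soon)')
    # ...while "data_received" is invoked directly, so it runs first
    events.append('data_received (direct call)')
    # yield once to let the scheduled callback run
    await asyncio.sleep(0)

asyncio.run(main())
print(events)
# ['data_received (direct call)', 'handshake (call_soon)']
```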
<p>Workaround:</p>
<pre class="literal-block">
diff --git a/Lib/asyncio/sslproto.py b/Lib/asyncio/sslproto.py
index 2bfa45dd15..4a5dbb38a1 100644
--- a/Lib/asyncio/sslproto.py
+++ b/Lib/asyncio/sslproto.py
@@ -592,7 +592,7 @@ class SSLProtocol(protocols.Protocol):
         # (b'', 1) is a special value in _process_write_backlog() to do
         # the SSL handshake
         self._write_backlog.append((b'', 1))
-        self._loop.call_soon(self._process_write_backlog)
+        self._process_write_backlog()
         self._handshake_timeout_handle = \
             self._loop.call_later(self._ssl_handshake_timeout,
                                   self._check_handshake_timeout)
</pre>
<p>Yury Selivanov wrote:</p>
<blockquote>
<p><strong>The fix is correct and the bug is now obvious</strong>: <tt class="docutils literal">data_received()</tt> occurs
pretty much any time after <tt class="docutils literal">connection_made()</tt> call; if <tt class="docutils literal">call_soon()</tt> is
used in <tt class="docutils literal">connection_made()</tt>, <tt class="docutils literal">data_received()</tt> may find the protocol in
an incorrect state.</p>
<p><strong>Kudos Victor for debugging this.</strong></p>
</blockquote>
<p>I pushed <a class="reference external" href="https://github.com/python/cpython/commit/be00a5583a2cb696335c527b921d1868266a42c6">commit be00a558</a>:</p>
<pre class="literal-block">
commit be00a5583a2cb696335c527b921d1868266a42c6
Author: Victor Stinner <vstinner@redhat.com>
Date:   Tue May 29 01:33:35 2018 +0200

    bpo-33674: asyncio: Fix SSLProtocol race (GH-7175)

    Fix a race condition in SSLProtocol.connection_made() of
    asyncio.sslproto: start immediately the handshake instead of using
    call_soon(). Previously, data_received() could be called before the
    handshake started, causing the handshake to hang or fail.
</pre>
<p>... the change is basically a single line change:</p>
<pre class="literal-block">
-        self._loop.call_soon(self._process_write_backlog)
+        self._process_write_backlog()
</pre>
<p>I closed <a class="reference external" href="https://bugs.python.org/issue32458">bpo-32458</a> and <strong>Yury
Selivanov</strong> closed <a class="reference external" href="https://bugs.python.org/issue33674">bpo-33674</a>.</p>
</div>
<div class="section" id="not-a-regression">
<h3>Not a regression</h3>
<p>The SSLProtocol race condition wasn't new: it existed since January 2015,
<a class="reference external" href="https://github.com/python/cpython/commit/231b404cb026649d4b7172e75ac394ef558efe60">commit 231b404c</a>:</p>
<pre class="literal-block">
commit 231b404cb026649d4b7172e75ac394ef558efe60
Author: Victor Stinner <victor.stinner@gmail.com>
Date:   Wed Jan 14 00:19:09 2015 +0100

    Issue #22560: New SSL implementation based on ssl.MemoryBIO

    The new SSL implementation is based on the new ssl.MemoryBIO which is only
    available on Python 3.5. On Python 3.4 and older, the legacy SSL implementation
    (using SSL_write, SSL_read, etc.) is used. The proactor event loop only
    supports the new implementation.

    The new asyncio.sslproto module adds _SSLPipe, SSLProtocol and
    _SSLProtocolTransport classes. _SSLPipe allows to "wrap" or "unwrap" a socket
    (switch between cleartext and SSL/TLS).

    Patch written by Antoine Pitrou. sslproto.py is based on gruvi/ssl.py of the
    gruvi project written by Geert Jansen.

    This change adds SSL support to ProactorEventLoop on Python 3.5 and newer!

    It becomes also possible to implement STARTTTLS: switch a cleartext socket to
    SSL.
</pre>
<p>This is the new cool asynchronous SSL implementation written by <strong>Antoine
Pitrou</strong> and <strong>Geert Jansen</strong>. It took <strong>3 years</strong> and <strong>new functional tests</strong>
to discover the race condition.</p>
</div>
</div>
<div class="section" id="wsarecv-cancellation-causing-data-loss">
<h2>WSARecv() cancellation causing data loss</h2>
<div class="section" id="yet-another-very-boring-buildbot-test-failure">
<h3>Yet another very boring buildbot test failure</h3>
<p>On May 30, 2018, the day after I fixed the SSLProtocol race condition, I created
<a class="reference external" href="https://bugs.python.org/issue33694">bpo-33694</a>.</p>
<p>test_asyncio.test_start_tls_server_1() got multiple fixes recently (see
<a class="reference external" href="https://bugs.python.org/issue32458">bpo-32458</a> and <a class="reference external" href="https://bugs.python.org/issue33674">bpo-33674</a>)... but it still fails on the x86
Windows7 3.x buildbot at revision bb9474f1fb2fc7c7ed9f826b78262d6a12b5f9e8 which
contains all these fixes.</p>
<p>The test fails even when test_asyncio is re-run alone (not when other tests run
in parallel).</p>
<p>Example of failure:</p>
<pre class="literal-block">
ERROR: test_start_tls_server_1 (test.test_asyncio.test_sslproto.ProactorStartTLSTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "...\lib\test\test_asyncio\test_sslproto.py", line 467, in test_start_tls_server_1
    self.loop.run_until_complete(run_main())
  File "...\lib\asyncio\base_events.py", line 566, in run_until_complete
    raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed.
</pre>
<p>The test also fails on x86 Windows7 3.7. Moreover, 3.7 got an additional failure:</p>
<pre class="literal-block">
ERROR: test_pipe_handle (test.test_asyncio.test_windows_utils.PipeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "...\lib\test\test_asyncio\test_windows_utils.py", line 73, in test_pipe_handle
    raise RuntimeError('expected ERROR_INVALID_HANDLE')
RuntimeError: expected ERROR_INVALID_HANDLE
</pre>
</div>
<div class="section" id="unable-to-reproduce-the-bug">
<h3>Unable to reproduce the bug</h3>
<p><strong>Yury Selivanov</strong> <a class="reference external" href="https://bugs.python.org/issue33694#msg318193">failed to reproduce the issue</a> in Windows 7 VM (on macOS) using:</p>
<ol class="arabic simple">
<li>run <tt class="docutils literal">test_asyncio</tt></li>
<li>run <tt class="docutils literal">test_asyncio.test_sslproto</tt></li>
<li>run <tt class="docutils literal">test_asyncio.test_sslproto <span class="pre">-m</span> test_start_tls_server_1</tt></li>
</ol>
<p><strong>Andrew Svetlov</strong> <a class="reference external" href="https://bugs.python.org/issue33694#msg318194">added</a>:</p>
<blockquote>
I used <tt class="docutils literal">SNDBUF</tt> to enforce send buffer overloading. It is not required by
sendfile tests but I thought that better to have non-mocked way to test such
situations. We can remove the socket buffers size manipulation at all
without any problem.</blockquote>
<p>But Yury Selivanov <a class="reference external" href="https://bugs.python.org/issue33694#msg318195">replied</a>:</p>
<blockquote>
When I tried to do that I think <strong>I was having more failures</strong> with that
test. But really up to you.</blockquote>
<p>Over the next days, I reported more and more similar failures on Windows buildbots and
AppVeyor (our Windows CI).</p>
</div>
<div class="section" id="root-issue-identified-pause-reading">
<h3>Root issue identified: pause_reading()</h3>
<p>Since this bug became more and more frequent, I decided to work on it. Yury and
Andrew failed to reproduce it.</p>
<p>On June 7, 2018, I managed to <strong>reproduce the bug on Linux</strong> by <a class="reference external" href="https://bugs.python.org/issue33694#msg318869">inserting a
sleep at the right place</a>...
One hour later, I understood that my patch was wrong: "it introduces a bug in
the test".</p>
<p>On the other hand, I found the root cause: calling <tt class="docutils literal">pause_reading()</tt> and
<tt class="docutils literal">resume_reading()</tt> on the transport is not safe. Sometimes, we lose data.
See the <strong>ugly hack</strong> described in the TODO comment below:</p>
<pre class="literal-block">
class _ProactorReadPipeTransport(_ProactorBasePipeTransport,
                                 transports.ReadTransport):
    """Transport for read pipes."""

    (...)

    def pause_reading(self):
        if self._closing or self._paused:
            return
        self._paused = True

        if self._read_fut is not None and not self._read_fut.done():
            # TODO: This is an ugly hack to cancel the current read future
            # *and* avoid potential race conditions, as read cancellation
            # goes through `future.cancel()` and `loop.call_soon()`.
            # We then use this special attribute in the reader callback to
            # exit *immediately* without doing any cleanup/rescheduling.
            self._read_fut.__asyncio_cancelled_on_pause__ = True

            self._read_fut.cancel()
            self._read_fut = None
            self._reschedule_on_resume = True

        if self._loop.get_debug():
            logger.debug("%r pauses reading", self)
</pre>
<p>If you remove the "ugly hack", the test no longer hangs...</p>
<p>Extract of <tt class="docutils literal">_ProactorReadPipeTransport.set_transport()</tt>:</p>
<pre class="literal-block">
if self.is_reading():
    # reset reading callback / buffers / self._read_fut
    self.pause_reading()
    self.resume_reading()
</pre>
<p>This method <strong>cancels the pending overlapped</strong> <tt class="docutils literal">WSARecv()</tt>, and then creates
a new overlapped <tt class="docutils literal">WSARecv()</tt>.</p>
<p>Even after <tt class="docutils literal">CancelIoEx(old overlapped)</tt>, the IOCP loop still gets an event
for the completion of the cancelled overlapped <tt class="docutils literal">WSARecv()</tt>. Problem: <strong>since
the Python future is cancelled, the event is ignored and so 176 bytes of data
are lost</strong>.</p>
<p>I'm surprised that an overlapped <tt class="docutils literal">WSARecv()</tt> <strong>cancelled</strong> by
<tt class="docutils literal">CancelIoEx()</tt> still returns data when IOCP polls for events.</p>
<p>Something else. The bug occurs when <tt class="docutils literal">CancelIoEx()</tt> (on the current overlapped
<tt class="docutils literal">WSARecv()</tt>) fails internally with <tt class="docutils literal">ERROR_NOT_FOUND</tt>. According to
overlapped.c, it means:</p>
<pre class="literal-block">
/* CancelIoEx returns ERROR_NOT_FOUND if the I/O completed in-between */
</pre>
<p><tt class="docutils literal">HasOverlappedIoCompleted()</tt> returns 0 in that case.</p>
<p>The problem is that currently, <tt class="docutils literal">Overlapped.cancel()</tt> also returns <tt class="docutils literal">None</tt> in
that case, and later the asyncio IOCP loop ignores the completion event and so
<strong>drops incoming received data</strong>.</p>
</div>
<div class="section" id="release-blocker-bug">
<h3>Release blocker bug?</h3>
<p>Yury, Andrew, Ned: I set the priority to release blocker because I'm scared by
what I saw. START TLS has a race condition in its ProactorEventLoop
implementation. But the bug doesn't seem to be specific to START TLS, but rather
to <tt class="docutils literal">transport.set_protocol()</tt>, and even more generally to
<tt class="docutils literal">transport.pause_reading()</tt> / <tt class="docutils literal">transport.resume_reading()</tt>. The bug is quite
severe: we lose data and it's really hard to know why (I spent a few hours
adding many print statements and trying to reproduce it in a tiny reliable unit
test). As an asyncio user, I expect transports to be 100% reliable, and I would
first look into my own code (like the <tt class="docutils literal">start_tls()</tt> implementation in my case).</p>
<p>If the bug were specific to <tt class="docutils literal">start_tls()</tt>, I would suggest "just"
disabling start_tls() on ProactorEventLoop (sorry, Windows!). But since the
data loss seems to affect basically any application using
<tt class="docutils literal">ProactorEventLoop</tt>, I don't see any simple workaround.</p>
<p><strong>My hope is that a fix can be written shortly</strong> to not block the 3.7.0 final
release for too long :-(</p>
<p>Yury, Andrew: Can you please just confirm that it's a regression and that a
release blocker is justified?</p>
</div>
<div class="section" id="functional-test-reproducing-the-bug">
<h3>Functional test reproducing the bug</h3>
<p>I wrote the <a class="reference external" href="https://bugs.python.org/file47632/race.py">race.py script</a>: a simple
echo client and server sending packets in both directions. It pauses/resumes
reading on the client transport every 100 ms to trigger the bug.</p>
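<p>The approach can be sketched as a minimal, self-contained echo test. This is a
hypothetical simplification of race.py (function name and parameters are mine, not
from the script): an echo server, a client sending data, and periodic
<tt class="docutils literal">pause_reading()</tt> / <tt class="docutils literal">resume_reading()</tt> calls
stressing the read path. On a selector event loop no data is lost, which is exactly
what the final check verifies:</p>

```python
import asyncio

async def run_echo_test(npackets=20, size=1024):
    # Minimal sketch in the spirit of race.py: echo the client's bytes
    # back while repeatedly pausing/resuming reading on the client
    # transport (race.py toggles every 100 ms; we toggle per read).
    total = npackets * size

    async def echo(reader, writer):
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
        writer.close()

    server = await asyncio.start_server(echo, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    transport = writer.transport

    writer.write(b"x" * total)
    await writer.drain()

    received = 0
    while received < total:
        # pause/resume reading around each read to stress the transport
        transport.pause_reading()
        await asyncio.sleep(0.001)
        transport.resume_reading()
        chunk = await reader.read(65536)
        if not chunk:
            break
        received += len(chunk)

    writer.close()
    server.close()
    await server.wait_closed()
    return received, total
```

<p>On the buggy <tt class="docutils literal">ProactorEventLoop</tt> of Python 3.7, a test like this could end
with <tt class="docutils literal">received &lt; total</tt>: bytes silently dropped by the cancelled read.</p>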
<p>Using <tt class="docutils literal">ProactorEventLoop</tt> and 2000 packets of 16 KiB, I can easily reproduce
the bug.</p>
<p>So again, it's not related to <tt class="docutils literal">start_tls()</tt>: <tt class="docutils literal">start_tls()</tt> was just one
way to spot the bug.</p>
<p>The bug is in the Proactor transport: the cancellation of an overlapped <tt class="docutils literal">WSARecv()</tt>
sometimes drops packets. The bug occurs when <tt class="docutils literal">CancelIoEx()</tt> fails with
<tt class="docutils literal">ERROR_NOT_FOUND</tt>, which means that the I/O (<tt class="docutils literal">WSARecv()</tt>) completed.</p>
<p>One solution would be to not cancel <tt class="docutils literal">WSARecv()</tt> on pause_reading(): wait
until the current <tt class="docutils literal">WSARecv()</tt> completes, store data somewhere but don't pass
it to <tt class="docutils literal">protocol.data_received()</tt>, and don't schedule a new <tt class="docutils literal">WSARecv()</tt>.
Once reading is resumed: call <tt class="docutils literal">protocol.data_received()</tt> and schedule a new
<tt class="docutils literal">WSARecv()</tt>.</p>
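<p>The workaround can be sketched as a tiny state machine (hypothetical class and
method names, not the actual asyncio implementation): the pending read is never
cancelled; if it completes while reading is paused, the data is parked instead of
being dropped, and delivered on resume:</p>

```python
class PausableReader:
    # Hypothetical sketch of the workaround: never cancel the pending
    # read; park completed data while paused, deliver it on resume.
    def __init__(self, protocol):
        self._protocol = protocol
        self._paused = False
        self._pending_data = None

    def pause_reading(self):
        self._paused = True            # no cancellation: no data loss

    def resume_reading(self):
        self._paused = False
        if self._pending_data is not None:
            data, self._pending_data = self._pending_data, None
            self._protocol.data_received(data)

    def read_completed(self, data):
        # called when the (never cancelled) read finishes
        if self._paused:
            self._pending_data = data  # keep it for resume_reading()
        else:
            self._protocol.data_received(data)
```

<p>The key property: a read that completes during a pause is buffered, so pausing
and resuming can never lose bytes.</p>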
<p>That would be a workaround. I don't know how to really fix <tt class="docutils literal">WSARecv()</tt>
cancellation without losing data. A good start would be to modify
<tt class="docutils literal">Overlapped.cancel()</tt> to return a boolean indicating whether the overlapped
I/O completed even though we just cancelled it. Currently, the corner case
(<tt class="docutils literal">CancelIoEx()</tt> fails with <tt class="docutils literal">ERROR_NOT_FOUND</tt>) is silently ignored, and then
the IOCP loop silently ignores the event of the completed I/O...</p>
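<p>The suggested boolean can be illustrated with plain asyncio futures (a
hypothetical helper, not the real overlapped.c API; the real code would inspect
the <tt class="docutils literal">ERROR_NOT_FOUND</tt> result of <tt class="docutils literal">CancelIoEx()</tt>):</p>

```python
import asyncio

def cancel_pending_read(fut):
    # Hypothetical helper: report whether the I/O had already completed
    # when we tried to cancel it, so the caller can still deliver the
    # data instead of the event loop silently dropping it.
    if fut.done() and not fut.cancelled():
        return True    # completed in-between: data must be delivered
    fut.cancel()
    return False

async def demo():
    loop = asyncio.get_running_loop()
    completed = loop.create_future()
    completed.set_result(b"bytes read before the cancellation")
    pending = loop.create_future()
    return (cancel_pending_read(completed),
            cancel_pending_read(pending),
            pending.cancelled())
```

<p>A <tt class="docutils literal">True</tt> return is the corner case that was being silently ignored:
the read is done and its data must still reach the protocol.</p>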
</div>
<div class="section" id="fix-the-bug-no-longer-cancel-wsarecv">
<h3>Fix the bug: no longer cancel WSARecv()</h3>
<p>On June 8, 2018, I pushed <a class="reference external" href="https://github.com/python/cpython/commit/79790bc35fe722a49977b52647f9b5fe1deda2b7">commit 79790bc3</a>:</p>
<pre class="literal-block">
commit 79790bc35fe722a49977b52647f9b5fe1deda2b7
Author: Victor Stinner <vstinner@redhat.com>
Date: Fri Jun 8 00:25:52 2018 +0200
bpo-33694: Fix race condition in asyncio proactor (GH-7498)
The cancellation of an overlapped WSARecv() has a race condition
which causes data loss because of the current implementation of
proactor in asyncio.
No longer cancel overlapped WSARecv() in _ProactorReadPipeTransport
to work around the race condition.
Remove the optimized recv_into() implementation to get simple
implementation of pause_reading() using the single _pending_data
attribute.
Move _feed_data_to_bufferred_proto() to protocols.py.
Remove set_protocol() method which became useless.
</pre>
<p>I fixed the root issue (in Python 3.7 and the future Python 3.8).</p>
<p>I used my <tt class="docutils literal">race.py</tt> script to validate that the issue was fixed for real.</p>
</div>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>I fixed one race condition in the asynchronous handshake of <tt class="docutils literal">SSLProtocol</tt>.</p>
<p>I found and fixed a data loss bug caused by <tt class="docutils literal">WSARecv()</tt> cancellation.</p>
<p>Lessons learnt from these two bugs:</p>
<ul class="simple">
<li>You should <strong>write an extensive test suite</strong> for your code.</li>
<li>You should <strong>keep an eye on your continuous integration (CI)</strong>: any tiny test
failure can hide a very severe bug.</li>
</ul>
</div>
Asyncio: Proactor ConnectPipe() Race Condition2019-01-30T18:00:00+01:002019-01-30T18:00:00+01:00Victor Stinnertag:vstinner.github.io,2019-01-30:/asyncio-proactor-connect-pipe-race-condition.html<a class="reference external image-reference" href="https://www.flickr.com/photos/phrawr/7612947262/"><img alt="Pipes" src="https://vstinner.github.io/images/pipes.jpg" /></a>
<p>Between December 2014 and January 2015, once I succeeded to fix the root issue
of the random asyncio crashes on Windows (<a class="reference external" href="https://vstinner.github.io/asyncio-proactor-cancellation-from-hell.html">Proactor Cancellation From Hell</a>), I fixed more race conditions
and bugs in <tt class="docutils literal">ProactorEventLoop</tt>:</p>
<ul class="simple">
<li><tt class="docutils literal">ConnectPipe()</tt> Race Condition</li>
<li>Race Condition in <tt class="docutils literal">BaseSubprocessTransport._try_finish()</tt></li>
<li>Close the transport on failure: ResourceWarning</li>
<li>Cleanup code handling pipes</li>
</ul>
<p>Previous article: <a class="reference external" href="https://vstinner.github.io/asyncio-proactor-cancellation-from-hell.html">Proactor Cancellation From Hell</a>. Next article:
<a class="reference external" href="https://vstinner.github.io/asyncio-proactor-wsarecv-cancellation-data-loss.html">asyncio: WSARecv() cancellation causing data loss</a>.</p>
<div class="section" id="connectpipe-race-condition">
<h2>ConnectPipe() Race Condition</h2>
<p>Once I succeeded in fixing the root issue of the random asyncio crashes on Windows
(<a class="reference external" href="https://vstinner.github.io/asyncio-proactor-cancellation-from-hell.html">Proactor Cancellation From Hell</a>), I started to look at the
ConnectPipe special case: <a class="reference external" href="https://github.com/python/asyncio/issues/204">asyncio issue #204: Investigate
IocpProactor.accept_pipe() special case (don't register overlapped)</a> (issue created on 25 Aug
2014).</p>
<p>On January 21, 2015, I opened <a class="reference external" href="https://bugs.python.org/issue23293">bpo-23293: race condition related to
IocpProactor.connect_pipe()</a>.</p>
<p>While fixing <a class="reference external" href="https://bugs.python.org/issue23095">bpo-23095 (race condition when cancelling a _WaitHandleFuture)</a>, I saw that
<tt class="docutils literal">IocpProactor.connect_pipe()</tt> causes "GetQueuedCompletionStatus() returned an
unexpected event" messages to be logged, but also hangs the test suite.</p>
<p><tt class="docutils literal">IocpProactor._register()</tt> contains the comment:</p>
<pre class="literal-block">
# Even if GetOverlappedResult() was called, we have to wait for the
# notification of the completion in GetQueuedCompletionStatus().
# Register the overlapped operation to keep a reference to the
# OVERLAPPED object, otherwise the memory is freed and Windows may
# read uninitialized memory.
#
# For an unknown reason, ConnectNamedPipe() behaves differently:
# the completion is not notified by GetOverlappedResult() if we
# already called GetOverlappedResult(). For this specific case, we
# don't expect notification (register is set to False).
</pre>
<p><tt class="docutils literal">IocpProactor.close()</tt> contains this comment:</p>
<pre class="literal-block">
# The operation was started with connect_pipe() which
# queues a task to Windows' thread pool. This cannot
# be cancelled, so just forget it.
</pre>
<p><tt class="docutils literal">IocpProactor.connect_pipe()</tt> is implemented with <tt class="docutils literal">QueueUserWorkItem()</tt>
which <strong>starts a thread that cannot be interrupted</strong>. Because of that, this
function requires special cases in <tt class="docutils literal">_register()</tt> and <tt class="docutils literal">close()</tt> methods of
<tt class="docutils literal">IocpProactor</tt>.</p>
<p>I proposed a solution to reimplement <tt class="docutils literal">IocpProactor.connect_pipe()</tt> <strong>without
a thread</strong>: <a class="reference external" href="https://code.google.com/p/tulip/issues/detail?id=197">asyncio issue #197: Rewrite IocpProactor.connect_pipe() with
non-blocking calls to avoid non interruptible QueueUserWorkItem()</a>.</p>
<p>On January 22, 2015, I pushed <a class="reference external" href="https://github.com/python/cpython/commit/7ffa2c5fdda8a9cc254edf67c4458b15db1252fa">commit 7ffa2c5f</a>:</p>
<pre class="literal-block">
commit 7ffa2c5fdda8a9cc254edf67c4458b15db1252fa
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Thu Jan 22 22:55:08 2015 +0100
Issue #23293, asyncio: Rewrite IocpProactor.connect_pipe()
</pre>
<p>The change adds <tt class="docutils literal">_overlapped.ConnectPipe()</tt>, which tries to connect to the
pipe for asynchronous I/O (overlapped): it <strong>calls CreateFile() in a loop until
it no longer fails with ERROR_PIPE_BUSY</strong>, using an increasing delay between 1 ms
and 100 ms.</p>
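<p>The retry pattern can be sketched in pure Python (hypothetical names:
<tt class="docutils literal">try_once</tt> stands in for the real <tt class="docutils literal">CreateFile()</tt>
call and <tt class="docutils literal">BlockingIOError</tt> for <tt class="docutils literal">ERROR_PIPE_BUSY</tt>;
the actual delays in the commit are 1 ms growing to 100 ms):</p>

```python
import asyncio

async def connect_pipe(try_once, first_delay=0.001, max_delay=0.1):
    # Sketch of the retry loop described above: keep calling try_once()
    # until it stops failing with a "busy" error, sleeping with an
    # increasing (doubling, capped) delay between attempts.
    delay = first_delay
    while True:
        try:
            return try_once()
        except BlockingIOError:
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)
```

<p>Because the waiting is done with <tt class="docutils literal">asyncio.sleep()</tt>, the loop stays
cancellable: no uninterruptible <tt class="docutils literal">QueueUserWorkItem()</tt> thread is needed.</p>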
</div>
<div class="section" id="race-condition-in-basesubprocesstransport-try-finish">
<h2>Race Condition in BaseSubprocessTransport._try_finish()</h2>
<p>If the process exited before the <tt class="docutils literal">_post_init()</tt> method was called, scheduling
the call to <tt class="docutils literal">_call_connection_lost()</tt> with <tt class="docutils literal">call_soon()</tt> is wrong:
<tt class="docutils literal">connection_made()</tt> must be called before <tt class="docutils literal">connection_lost()</tt>.</p>
<p>Reuse the <tt class="docutils literal">BaseSubprocessTransport._call()</tt> method to schedule the call to
<tt class="docutils literal">_call_connection_lost()</tt> to ensure that <tt class="docutils literal">connection_made()</tt> and
<tt class="docutils literal">connection_lost()</tt> are called in the correct order.</p>
<p>On December 18, 2014, I pushed <a class="reference external" href="https://github.com/python/cpython/commit/1b9763d0a9c62c13dc2a06770032e5906b610c96">commit 1b9763d0</a>.
The explanation is long, but the change is basically a one-line change,
extract:</p>
<pre class="literal-block">
- self._loop.call_soon(self._call_connection_lost, None)
+ self._call(self._call_connection_lost, None)
</pre>
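<p>The idea behind <tt class="docutils literal">_call()</tt> can be shown with a heavily simplified,
hypothetical sketch (not the actual <tt class="docutils literal">BaseSubprocessTransport</tt> code):
callbacks scheduled before <tt class="docutils literal">connection_made()</tt> has run are queued and
only flushed afterwards, so <tt class="docutils literal">connection_lost()</tt> can never be delivered
first:</p>

```python
import asyncio

class SubprocessTransportSketch:
    # Hypothetical simplification of the deferred-call pattern: while
    # _pending_calls is a list, callbacks are queued; once
    # connection_made() has run, they are flushed in order.
    def __init__(self, loop, protocol):
        self._loop = loop
        self._protocol = protocol
        self._pending_calls = []      # queue until connection_made()
        loop.call_soon(self._connection_made)

    def _call(self, cb, *args):
        if self._pending_calls is not None:
            self._pending_calls.append((cb, args))   # defer
        else:
            self._loop.call_soon(cb, *args)

    def _connection_made(self):
        self._protocol.connection_made(self)
        pending, self._pending_calls = self._pending_calls, None
        for cb, args in pending:
            self._loop.call_soon(cb, *args)

events = []

class Proto:
    def connection_made(self, transport):
        events.append("connection_made")
    def connection_lost(self, exc):
        events.append("connection_lost")

async def demo():
    proto = Proto()
    transport = SubprocessTransportSketch(asyncio.get_running_loop(), proto)
    # the process "exits" before connection_made() has been called:
    transport._call(proto.connection_lost, None)
    await asyncio.sleep(0.01)
    return events
```

<p>Even though <tt class="docutils literal">connection_lost</tt> is scheduled first, it is delivered
after <tt class="docutils literal">connection_made</tt>, which is the ordering guarantee the fix
restores.</p>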
<p><strong>Properly ordering callbacks in asyncio is challenging!</strong> The order matters
for the semantics of asyncio: it is part of the design of <a class="reference external" href="https://www.python.org/dev/peps/pep-3156/">PEP 3156 --
Asynchronous IO Support Rebooted: the "asyncio" Module</a>.</p>
</div>
<div class="section" id="close-the-transport-on-failure-resourcewarning">
<h2>Close the transport on failure: ResourceWarning</h2>
<p>On January 15, 2015, I pushed <a class="reference external" href="https://github.com/python/cpython/commit/4bf22e033e975f61c33752db5a3764dc0f7d0b03">commit 4bf22e03</a>,
extract:</p>
<pre class="literal-block">
-    yield from transp._post_init()
+    try:
+        yield from transp._post_init()
+    except:
+        transp.close()
+        raise
</pre>
<p>Later, I will spend a lot of time (and push many more changes) to ensure that
resources are properly released (especially closing transports on failure,
similar to this change).</p>
<p>I will add many <strong>ResourceWarning</strong> warnings in destructors, emitted when a
transport, subprocess or event loop is not closed explicitly.</p>
<p>For example, notice the <tt class="docutils literal">ResourceWarning</tt> in the current destructor of
<tt class="docutils literal">_SelectorTransport</tt>:</p>
<pre class="literal-block">
class _SelectorTransport(transports._FlowControlMixin,
                         transports.Transport):

    def __del__(self, _warn=warnings.warn):
        if self._sock is not None:
            _warn(f"unclosed transport {self!r}", ResourceWarning,
                  source=self)
            self._sock.close()
<p>I even enhanced Python 3.6 to be able to provide the <strong>traceback where the
leaked resource has been allocated</strong> thanks to my <tt class="docutils literal">tracemalloc</tt> module.
Example with <tt class="docutils literal">filebug.py</tt>:</p>
<pre class="literal-block">
def func():
    f = open(__file__)
    f = None

func()
<p>Output with Python 3.6:</p>
<pre class="literal-block">
$ python3 -Wd -X tracemalloc=5 filebug.py
filebug.py:3: ResourceWarning: unclosed file <_io.TextIOWrapper name='filebug.py' mode='r' encoding='UTF-8'>
  f = None
Object allocated at (most recent call first):
  File "filebug.py", lineno 2
    f = open(__file__)
  File "filebug.py", lineno 5
    func()
</pre>
<p>The line where the warning is emitted is usually useless to understand the bug,
whereas the traceback is very useful to identify the leaked resource.</p>
<p>See <a class="reference external" href="https://pythondev.readthedocs.io/debug_tools.html#resourcewarning">my ResourceWarning documentation</a>.</p>
</div>
<div class="section" id="cleanup-code-handling-pipes">
<h2>Cleanup code handling pipes</h2>
<p>Thanks to the new implementation of <tt class="docutils literal">connect_pipe()</tt>, I was able to push
changes to simplify the code and remove various hacks in code handling pipes.</p>
<p><a class="reference external" href="https://github.com/python/cpython/commit/2b77c5467f376257ae22cbfbcb3a0e5e6349e92d">commit 2b77c546</a>:</p>
<pre class="literal-block">
commit 2b77c5467f376257ae22cbfbcb3a0e5e6349e92d
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Thu Jan 22 23:50:03 2015 +0100
asyncio, Tulip issue 204: Fix IocpProactor.accept_pipe()
Overlapped.ConnectNamedPipe() now returns a boolean: True if the pipe is
connected (if ConnectNamedPipe() failed with ERROR_PIPE_CONNECTED), False if
the connection is in progress.
This change removes multiple hacks in IocpProactor.
</pre>
<p><a class="reference external" href="https://github.com/python/cpython/commit/3d2256f671b7ed5c769dd34b27ae597cbc69047c">commit 3d2256f6</a>:</p>
<pre class="literal-block">
commit 3d2256f671b7ed5c769dd34b27ae597cbc69047c
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Mon Jan 26 11:02:59 2015 +0100
Issue #23293, asyncio: Cleanup IocpProactor.close()
The special case for connect_pipe() is not more needed. connect_pipe() doesn't
use overlapped operations anymore.
</pre>
<p><a class="reference external" href="https://github.com/python/cpython/commit/a19b7b3fcafe52b98245e14466ffc4d6750ca4f1">commit a19b7b3f</a>:</p>
<pre class="literal-block">
commit a19b7b3fcafe52b98245e14466ffc4d6750ca4f1
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Mon Jan 26 15:03:20 2015 +0100
asyncio: Fix ProactorEventLoop.start_serving_pipe()
If a client connected before the server was closed: drop the client (close the
pipe) and exit.
</pre>
<p><a class="reference external" href="https://github.com/python/cpython/commit/e0fd157ba0cc92e435e7520b4ff641ca68d72244">commit e0fd157b</a>:</p>
<pre class="literal-block">
commit e0fd157ba0cc92e435e7520b4ff641ca68d72244
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Mon Jan 26 15:04:03 2015 +0100
Issue #23293, asyncio: Rewrite IocpProactor.connect_pipe() as a coroutine
Use a coroutine with asyncio.sleep() instead of call_later() to ensure that the
schedule call is cancelled.
Add also a unit test cancelling connect_pipe().
</pre>
<p><a class="reference external" href="https://github.com/python/cpython/commit/41063d2a59a24e257cd9ce62137e36c862e3ab1e">commit 41063d2a</a>:</p>
<pre class="literal-block">
commit 41063d2a59a24e257cd9ce62137e36c862e3ab1e
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Mon Jan 26 22:30:49 2015 +0100
asyncio, Tulip issue 204: Fix IocpProactor.recv()
If ReadFile() fails with ERROR_BROKEN_PIPE, the operation is not pending: don't
register the overlapped.
I don't know if WSARecv() can fail with ERROR_BROKEN_PIPE. Since
Overlapped.WSARecv() already handled ERROR_BROKEN_PIPE, let me guess that it
has the same behaviour than ReadFile().
</pre>
</div>
Asyncio: Proactor Cancellation From Hell2019-01-28T20:20:00+01:002019-01-28T20:20:00+01:00Victor Stinnertag:vstinner.github.io,2019-01-28:/asyncio-proactor-cancellation-from-hell.html<img alt="South Park Hell" src="https://vstinner.github.io/images/south_park_hell.jpg" />
<p>Between 2014 and 2015, I was working on the new shiny <tt class="docutils literal">asyncio</tt> module
(module added to Python 3.4 released in March 2014). I helped to stabilize the
Windows implementation because... well, nobody else was paying attention to it,
and I was worried that test_asyncio <strong>randomly crashed</strong> on Windows.</p>
<p>One bug really annoyed me: I started to fix it in July 2014, but I only
succeeded in fixing the root issue in January 2015: <strong>six months later</strong>!</p>
<p>It was really difficult to find documentation on IOCP and asynchronous
programming on Windows. <strong>I had to ask someone who had access to the Windows
source code for help</strong> to understand the bug...</p>
<p><strong>Spoiler:</strong> cancelling an overlapped <tt class="docutils literal">RegisterWaitForSingleObject()</tt> with
<tt class="docutils literal">UnregisterWait()</tt> is asynchronous. The asynchronous part is not well
documented and it took me months of debug to understand it. Moreover, the bug
was well hidden for various reasons that we will see below.</p>
<p>Next article: <a class="reference external" href="https://vstinner.github.io/asyncio-proactor-connect-pipe-race-condition.html">Asyncio: Proactor ConnectPipe() Race Condition</a>.</p>
<div class="section" id="fix-cancel-when-called-twice">
<h2>Fix cancel() when called twice</h2>
<p>July 2014, <a class="reference external" href="https://github.com/python/asyncio/issues/195">asyncio issue #195</a>: while working on a
<tt class="docutils literal">SIGINT</tt> signal handler for the <tt class="docutils literal">ProactorEventLoop</tt> on Windows (<a class="reference external" href="https://github.com/python/asyncio/issues/195">asyncio
issue #191</a>), I hit a bug on
Windows: <tt class="docutils literal">_WaitHandleFuture.cancel()</tt> crashed if the wait event had already been
unregistered by <tt class="docutils literal">finish_wait_for_handle()</tt>. The bug was that
<tt class="docutils literal">UnregisterWait()</tt> was called twice.</p>
<p>I pushed <a class="reference external" href="https://github.com/python/cpython/commit/fea6a100dc51012cb0187374ad31de330ebc0035">commit fea6a100</a>
to fix this crash:</p>
<pre class="literal-block">
commit fea6a100dc51012cb0187374ad31de330ebc0035
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Fri Jul 25 00:54:53 2014 +0200
Improve stability of the proactor event loop, especially operations on
overlapped objects (...)
</pre>
<p>Main changes:</p>
<ul class="simple">
<li>Fix a crash: <strong>don't call UnregisterWait() twice if a _WaitHandleFuture
is cancelled twice</strong>.</li>
<li>Fix another crash: <tt class="docutils literal">_OverlappedFuture.cancel()</tt> doesn't cancel the
overlapped anymore if it is already cancelled or completed. Log also an error
if the cancellation failed.</li>
<li><tt class="docutils literal">IocpProactor.close()</tt> now cancels futures rather than directly cancelling
the underlying overlapped objects.</li>
<li>Add a destructor to the <tt class="docutils literal">IocpProactor</tt> class which closes it.</li>
</ul>
</div>
<div class="section" id="clear-reference-from-overlappedfuture-to-overlapped">
<h2>Clear reference from _OverlappedFuture to overlapped</h2>
<p>July 2014, I created <a class="reference external" href="https://github.com/python/asyncio/issues/196">asyncio issue #196</a>:
<tt class="docutils literal">_OverlappedFuture.set_result()</tt> should clear its reference to the
overlapped object.</p>
<p>It is important to explicitly clear references to Python objects as soon as
possible to release resources. Otherwise, an object can remain alive
longer than expected.</p>
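<p>The principle is easy to demonstrate with a <tt class="docutils literal">weakref</tt> (hypothetical
sketch with made-up class names, not the real <tt class="docutils literal">_OverlappedFuture</tt>
code): as soon as the future drops its reference, CPython's reference counting
frees the object immediately:</p>

```python
import weakref

class OverlappedFutureSketch:
    # Hypothetical sketch of the idea behind asyncio issue #196: drop
    # the reference to the overlapped object as soon as the operation
    # is done, so its memory can be released immediately.
    def __init__(self, ov):
        self._ov = ov
        self._result = None

    def set_result(self, result):
        self._result = result
        self._ov = None        # clear the reference as soon as possible

def demo():
    class Overlapped:
        pass
    ov = Overlapped()
    ref = weakref.ref(ov)
    fut = OverlappedFutureSketch(ov)
    del ov                      # the future holds the last reference
    alive_before = ref() is not None
    fut.set_result(b"data")
    alive_after = ref() is not None
    return alive_before, alive_after
```

<p>Before <tt class="docutils literal">set_result()</tt> the object is kept alive by the future; right
after, it is gone.</p>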
<p>I noticed that _OverlappedFuture kept a reference to the underlying overlapped
object even after the asynchronous operation completed. I started to work on a
fix, but I had many issues fixing this bug completely... it was just the
beginning of a long journey.</p>
<div class="section" id="clear-the-reference-on-cancellation-and-error">
<h3>Clear the reference on cancellation and error</h3>
<p>I pushed a first fix: <a class="reference external" href="https://github.com/python/cpython/commit/18a28dc5c28ae9a953f537486780159ddb768702">commit 18a28dc5</a>
clears the reference to the overlapped in <tt class="docutils literal">cancel()</tt> and <tt class="docutils literal">set_exception()</tt>
methods of <tt class="docutils literal">_OverlappedFuture</tt>:</p>
<pre class="literal-block">
commit 18a28dc5c28ae9a953f537486780159ddb768702
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Fri Jul 25 13:05:20 2014 +0200
* _OverlappedFuture.cancel() now clears its reference to the overlapped object.
Make also the _OverlappedFuture.ov attribute private.
* _OverlappedFuture.set_exception() now cancels the overlapped operation.
* (...)
</pre>
<p>I started with this change because it didn't make the tests less stable.</p>
</div>
<div class="section" id="clear-the-reference-in-poll">
<h3>Clear the reference in poll()</h3>
<p>Clearing the reference to the overlapped in <tt class="docutils literal">cancel()</tt> and
<tt class="docutils literal">set_exception()</tt> <strong>works well</strong>. But when I tried to do the same on success (in
<tt class="docutils literal">set_result()</tt>), <strong>I got random errors</strong>. Example:</p>
<pre class="literal-block">
C:\haypo\tulip>\python33\python.exe runtests.py test_pipe
...
Exception RuntimeError: '<_overlapped.Overlapped object at 0x00000000035E7660> s
till has pending operation at deallocation, the process may crash' ignored
...
Fatal read error on pipe transport
protocol: <asyncio.streams.StreamReaderProtocol object at 0x00000000035EE668>
transport: <_ProactorDuplexPipeTransport fd=348>
Traceback (most recent call last):
  File "C:\haypo\tulip\asyncio\proactor_events.py", line 159, in _loop_reading
    data = fut.result() # deliver data later in "finally" clause
  File "C:\haypo\tulip\asyncio\futures.py", line 271, in result
    raise self._exception
  File "C:\haypo\tulip\asyncio\windows_events.py", line 488, in _poll
    value = callback(transferred, key, ov)
  File "C:\haypo\tulip\asyncio\windows_events.py", line 279, in finish_recv
    return ov.getresult()
OSError: [WinError 996] Overlapped I/O event is not in a signaled state
...
</pre>
<p>It seems that the problem only occurs in the fast-path of
<tt class="docutils literal">IocpProactor._register()</tt>, when the overlapped is not added to <tt class="docutils literal">_cache</tt>.</p>
<p>Clearing the reference in <tt class="docutils literal">_poll()</tt>, when <tt class="docutils literal">GetQueuedCompletionStatus()</tt> reads
the status, <strong>works</strong>! I pushed a second fix: <a class="reference external" href="https://github.com/python/cpython/commit/65dd69a3da16257bd86b92900e5ec5a8dd26f1d9">commit 65dd69a3</a>
changes <tt class="docutils literal">_poll()</tt>:</p>
<pre class="literal-block">
commit 65dd69a3da16257bd86b92900e5ec5a8dd26f1d9
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Fri Jul 25 22:36:05 2014 +0200
IocpProactor._poll() clears the reference to the overlapped operation
when the operation is done. (...)
</pre>
</div>
<div class="section" id="ignore-false-alarms">
<h3>Ignore false alarms</h3>
<p>I tried to add the overlapped into <tt class="docutils literal">_cache</tt> but <strong>then the event loop started
to hang or to fail with new errors</strong>.</p>
<p>I analyzed an overlapped <tt class="docutils literal">WSARecv()</tt> which has been cancelled. Just after
calling <tt class="docutils literal">CancelIoEx()</tt>, <tt class="docutils literal">HasOverlappedIoCompleted()</tt> returns 0.</p>
<p>Even after <tt class="docutils literal">GetQueuedCompletionStatus()</tt> read the status,
<tt class="docutils literal">HasOverlappedIoCompleted()</tt> still returns 0.</p>
<p><strong>After hours of debug, I eventually found the main issue!</strong></p>
<p>Sometimes <tt class="docutils literal">GetQueuedCompletionStatus()</tt> returns an overlapped operation which
has not completed yet. I modified <tt class="docutils literal">IocpProactor._poll()</tt> to ignore the false
alarm, <a class="reference external" href="https://github.com/python/cpython/commit/51e44ea66aefb4229e506263acf40d35596d279c">commit 51e44ea6</a>:</p>
<pre class="literal-block">
commit 51e44ea66aefb4229e506263acf40d35596d279c
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Sat Jul 26 00:58:34 2014 +0200
_OverlappedFuture.set_result() now clears its reference to the
overlapped object.
IocpProactor._poll() now also ignores false alarms:
GetQueuedCompletionStatus() returns the overlapped but it is still
pending.
</pre>
<p>The fix adds this comment:</p>
<pre class="literal-block">
# FIXME: why do we get false alarms?
</pre>
</div>
<div class="section" id="keep-a-reference-of-overlapped">
<h3>Keep a reference of overlapped</h3>
<p>To stabilize the code, I modified <tt class="docutils literal">ProactorIocp</tt> to keep a reference to the
overlapped object (it already kept a reference previously, but not in all cases).
<strong>Otherwise the memory may be reused and GetQueuedCompletionStatus() may use
random bytes and behave badly</strong>. I pushed <a class="reference external" href="https://github.com/python/cpython/commit/42d3bdeed6e34117b787d61a471563a0dba6a894">commit 42d3bdee</a>:</p>
<pre class="literal-block">
commit 42d3bdeed6e34117b787d61a471563a0dba6a894
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Mon Jul 28 00:18:43 2014 +0200
ProactorIocp._register() now registers the overlapped
in the _cache dictionary, even if we already got the result. We need to keep a
reference to the overlapped object, otherwise the memory may be reused and
GetQueuedCompletionStatus() may use random bytes and behaves badly.
There is still a hack for ConnectNamedPipe(): the overlapped object is not
registered into _cache if the overlapped object completed directly.
Log also an error in debug mode in ProactorIocp._loop() if we get an unexpected
event.
Add a protection in ProactorIocp.close() to avoid blocking, even if it should
not happen. I still don't understand exactly why some the completion of some
overlapped objects are not notified.
</pre>
<p>The change adds a long comment:</p>
<pre class="literal-block">
# Even if GetOverlappedResult() was called, we have to wait for the
# notification of the completion in GetQueuedCompletionStatus().
# Register the overlapped operation to keep a reference to the
# OVERLAPPED object, otherwise the memory is freed and Windows may
# read uninitialized memory.
#
# For an unknown reason, ConnectNamedPipe() behaves differently:
# the completion is not notified by GetOverlappedResult() if we
# already called GetOverlappedResult(). For this specific case, we
# don't expect notification (register is set to False).
</pre>
<p>I pushed another change to attempt to stabilize the code, <a class="reference external" href="https://github.com/python/cpython/commit/313a9809043ed2ed1ad25282af7169e08cdc92a3">commit 313a9809</a>:</p>
<pre class="literal-block">
commit 313a9809043ed2ed1ad25282af7169e08cdc92a3
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Tue Jul 29 12:58:23 2014 +0200
* _WaitHandleFuture.cancel() now notify IocpProactor through the overlapped
object that the wait was cancelled.
* Optimize IocpProactor.wait_for_handle() gets the result if the wait is
signaled immediatly.
(...)
</pre>
</div>
<div class="section" id="asyncio-issue-196-closed">
<h3>asyncio issue #196 closed</h3>
<p>The initial issue "_OverlappedFuture.set_result() should clear its reference to
the overlapped object" has been fixed, so <strong>I closed this issue</strong>. I didn't
know at that point that not all bugs were fixed yet...</p>
<p>I also opened the new <a class="reference external" href="https://github.com/python/asyncio/issues/204">asyncio issue #204</a> to investigate
<tt class="docutils literal">accept_pipe()</tt> special case. We will analyze this funny bug in another article.</p>
</div>
</div>
<div class="section" id="bpo-23095-race-condition-when-cancelling-a-waithandlefuture">
<h2>bpo-23095: race condition when cancelling a _WaitHandleFuture</h2>
<p>On December 21, 2014, five months after a long series of changes to stabilize
asyncio... <strong>asyncio was still crashing randomly on Windows</strong>! I created
<a class="reference external" href="https://bugs.python.org/issue23095">bpo-23095: race condition when cancelling a _WaitHandleFuture</a>.</p>
<p>On Windows using the IOCP (proactor) event loop, I noticed race conditions when
running the test suite of Trollius (my old deprecated asyncio port to Python
2). For example, sometimes the return code of a process was <tt class="docutils literal">None</tt>, whereas
this case <strong>must never happen</strong>. It looks like the <tt class="docutils literal">wait_for_handle()</tt> method
doesn't behave properly.</p>
<p>When I ran the test suite of asyncio in debug mode (PYTHONASYNCIODEBUG=1),
I sometimes saw the message "GetQueuedCompletionStatus() returned an unexpected
event", which <strong>should never occur either</strong>.</p>
<p>I added debug traces. I saw that <tt class="docutils literal">IocpProactor.wait_for_handle()</tt> later
calls <tt class="docutils literal">PostQueuedCompletionStatus()</tt> through its internal C callback
(<tt class="docutils literal">PostToQueueCallback</tt>). It looked like <strong>sometimes the callback was called
even though the wait had been cancelled/acknowledged</strong> by <tt class="docutils literal">UnregisterWait()</tt>.</p>
<p>... I didn't understand the logic between <tt class="docutils literal">RegisterWaitForSingleObject()</tt>,
<tt class="docutils literal">UnregisterWait()</tt> and the callback ....</p>
<p>It looks like sometimes the overlapped object created in Python
(<tt class="docutils literal">ov = _overlapped.Overlapped(NULL)</tt>) is destroyed before
<tt class="docutils literal">PostToQueueCallback()</tt> is called. In the unit tests, <strong>it doesn't crash
because a different overlapped object is created and it gets the same memory
address</strong> (the memory allocator reuses a just-freed memory block).</p>
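This address-reuse effect can be reproduced from pure Python (a rough analogy only; the real bug lived in C code and Windows callbacks). CPython's small-object allocator tends to hand a just-freed block back to the next allocation of the same size:

```python
def address_reused():
    """Check whether a new object reuses the address of a just-freed one."""
    a = object()
    addr = id(a)        # in CPython, id() is the object's memory address
    del a               # the object is freed immediately (refcount drops to 0)
    b = object()        # the allocator commonly reuses the freed block
    return id(b) == addr

print(address_reused())  # commonly True on CPython, but not guaranteed
```

When this returns True, a stale write through the old address hits a valid but different object: no crash, just baffling behavior, which is exactly what the unit tests observed.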
<p>The implementation of <tt class="docutils literal">wait_for_handle()</tt> had an optimization: it
immediately polls the wait to check whether it has already completed. I tried to
remove it, but I got different issues. If I understood correctly, <strong>this
optimization hides other bugs and reduces the probability of hitting the race
condition</strong>.</p>
<p><tt class="docutils literal">wait_for_handle()</tt> is used to wait for the completion of a subprocess, so
it is exercised by all unit tests running subprocesses, and also directly by the
<tt class="docutils literal">test_wait_for_handle()</tt> and <tt class="docutils literal">test_wait_for_handle_cancel()</tt> tests.
I suspected that running <tt class="docutils literal">test_wait_for_handle()</tt> or
<tt class="docutils literal">test_wait_for_handle_cancel()</tt> triggered the bug.</p>
<p>Removing <tt class="docutils literal">_winapi.CloseHandle(self._iocp)</tt> in <tt class="docutils literal">IocpProactor.close()</tt>
works around the bug. The bug seems to be an unexpected call to
<tt class="docutils literal">PostToQueueCallback()</tt> which calls <tt class="docutils literal">PostQueuedCompletionStatus()</tt> on an
IOCP. Not closing the IOCP means using a different IOCP for each test, so the
unexpected call to <tt class="docutils literal">PostQueuedCompletionStatus()</tt> has no effect on the
following tests.</p>
<p>I rewrote some parts of the IOCP code in asyncio. Maybe I introduced this issue
during the refactoring. Maybe <strong>it already existed before but nobody noticed
it, since asyncio had fewer unit tests back then</strong>.</p>
</div>
<div class="section" id="fixing-the-root-issue-overlapped-cancellation-from-hell">
<h2>Fixing the root issue: Overlapped Cancellation From Hell</h2>
<p>I looked into Twisted's implementation of the proactor, but it didn't support
subprocesses.</p>
<p>I looked at libuv: it supported processes but not cancelling a wait on a
process handle...</p>
<p><strong>I had to ask for help from someone who had access to the Windows source code</strong>
to understand the bug...</p>
<p><strong>After six months of intense debugging, I eventually identified the root
issue</strong> (I pushed the first fix on July 25, 2014). I pushed the <a class="reference external" href="https://github.com/python/cpython/commit/d0a28dee78d099fcadc71147cba4affb6efa0c97">commit
d0a28dee</a>
(<a class="reference external" href="https://bugs.python.org/issue23095">bpo-23095</a>):</p>
<pre class="literal-block">
commit d0a28dee78d099fcadc71147cba4affb6efa0c97
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Wed Jan 21 23:39:51 2015 +0100
Issue #23095, asyncio: Rewrite _WaitHandleFuture.cancel()
</pre>
<p>This change fixes a race condition related to <tt class="docutils literal">_WaitHandleFuture.cancel()</tt>
leading to a Python crash or "GetQueuedCompletionStatus() returned an
unexpected event" logs. Previously, <strong>the cancelled wait could complete after
the overlapped object had already been destroyed</strong>. Sometimes, a
different overlapped was allocated at the same address, emitting a log about an
unexpected completion (but no crash).</p>
<p><tt class="docutils literal">_WaitHandleFuture.cancel()</tt> now <strong>waits until the handle wait is cancelled</strong>
(until the cancellation completes) before clearing its reference to the
overlapped object. To wait until the cancellation completes,
<tt class="docutils literal">UnregisterWaitEx()</tt> is used with an event (instead of using
<tt class="docutils literal">UnregisterWait()</tt>).</p>
<p>To wait for this event, a new <tt class="docutils literal">_WaitCancelFuture</tt> class was added. It's a
simplified version of <tt class="docutils literal">_WaitHandleFuture</tt>. For example, its <tt class="docutils literal">cancel()</tt>
method calls <tt class="docutils literal">UnregisterWait()</tt>, not <tt class="docutils literal">UnregisterWaitEx()</tt>.
<tt class="docutils literal">_WaitCancelFuture</tt> should not be cancelled.</p>
<p>The overlapped object is <strong>kept alive</strong> in <tt class="docutils literal">_WaitHandleFuture</tt> <strong>until the
wait is unregistered</strong>.</p>
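The shape of the fix can be sketched in pure Python, with a <tt class="docutils literal">threading.Event</tt> standing in for the <tt class="docutils literal">UnregisterWaitEx()</tt> event (class and method names below are illustrative, not the real asyncio code):

```python
import threading

class WaitHandleSketch:
    """Release the 'overlapped' buffer only after cancellation is acknowledged."""

    def __init__(self):
        self.buf = bytearray(16)             # stands in for the OVERLAPPED memory
        self._unregistered = threading.Event()

    def _callback(self):
        # Runs on another thread, like PostToQueueCallback(): it writes
        # into the buffer, then acknowledges the cancellation.
        self.buf[0] = 1
        self._unregistered.set()

    def cancel(self):
        threading.Thread(target=self._callback).start()
        # The buggy code dropped self.buf right away; the fix waits until
        # the cancellation completes, like UnregisterWaitEx() plus an event.
        self._unregistered.wait(timeout=5)
        self.buf = None                      # only now is the memory released

w = WaitHandleSketch()
w.cancel()
print(w.buf)  # None: the buffer outlived the last callback write
```

The key invariant is the same as in the real fix: the buffer written by the concurrent callback stays alive until the cancellation acknowledgement, so there is no window for a write into released memory.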
<p>Later, I pushed a few more changes to fix corner cases.</p>
<p><a class="reference external" href="https://github.com/python/cpython/commit/1ca9392c7083972c1953c02e6f2cca54934ce0a6">commit 1ca9392c</a>:</p>
<pre class="literal-block">
commit 1ca9392c7083972c1953c02e6f2cca54934ce0a6
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Thu Jan 22 00:17:54 2015 +0100
Issue #23095, asyncio: IocpProactor.close() must not cancel pending
_WaitCancelFuture futures
</pre>
<p><a class="reference external" href="https://github.com/python/cpython/commit/752aba7f999b08c833979464a36840de8be0baf0">commit 752aba7f</a>:</p>
<pre class="literal-block">
commit 752aba7f999b08c833979464a36840de8be0baf0
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Thu Jan 22 22:47:13 2015 +0100
asyncio: IocpProactor.close() doesn't cancel anymore futures which are already
cancelled
</pre>
<p><a class="reference external" href="https://github.com/python/cpython/commit/24dfa3c1d6b21e731bd167a13153968bba8fa5ce">commit 24dfa3c1</a>:</p>
<pre class="literal-block">
commit 24dfa3c1d6b21e731bd167a13153968bba8fa5ce
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Mon Jan 26 22:30:28 2015 +0100
Issue #23095, asyncio: Fix _WaitHandleFuture.cancel()
If UnregisterWaitEx() fais with ERROR_IO_PENDING, it doesn't mean that the wait
is unregistered yet. We still have to wait until the wait is cancelled.
</pre>
<p>I think that <em>this</em> issue can now be closed: <tt class="docutils literal">UnregisterWaitEx()</tt> really does
what we need in asyncio.</p>
<p>I don't like the complexity of the IocpProactor._unregister() method and of the
_WaitCancelFuture class, but it looks like that's how we are supposed to wait
until a wait on a handle is cancelled...</p>
<p>The Windows IOCP API is much more complex than I expected. It's probably
because some parts (especially <tt class="docutils literal">RegisterWaitForSingleObject()</tt>) are
implemented with threads in user land, not in the kernel.</p>
<p>In short, I'm very happy to have fixed this very complex but also very
annoying IOCP bug in asyncio.</p>
<p>I got a nice comment from <a class="reference external" href="https://bugs.python.org/issue23095#msg234453">Guido van Rossum</a>:</p>
<blockquote>
<strong>Congrats with the fix, and thanks for your perseverance!</strong></blockquote>
</div>
<div class="section" id="summary-of-the-race-condition">
<h2>Summary of the race condition</h2>
<p>Events of the crashing unit test:</p>
<ul class="simple">
<li>The loop (ProactorEventLoop) spawns a subprocess.</li>
<li>The loop creates a _WaitHandleFuture object which creates an overlapped to
wait until the process completes (call <tt class="docutils literal">RegisterWaitForSingleObject()</tt>):
<strong>allocate</strong> memory for the overlapped.</li>
<li>The wait future is cancelled (call <tt class="docutils literal">UnregisterWait()</tt>).</li>
<li>The overlapped is destroyed: <strong>free</strong> overlapped memory.</li>
<li>The overlapped completes: <strong>write</strong> into the overlapped memory.</li>
</ul>
<p>The main issue is the order of the two last events.</p>
<p>Sometimes, the overlapped completed before the memory was freed: everything was
fine.</p>
<p>Sometimes, the overlapped completed after the memory was freed: Python crashed
(segmentation fault).</p>
<p>Sometimes, another _WaitHandleFuture was created in the meantime and created a
second overlapped which was allocated at the same memory address as the freed
memory of the previous overlapped. In this case, when the first overlapped
completed, Python didn't crash but logged an unexpected completion message.</p>
<p>Sometimes, the write was done in freed memory: the write didn't crash Python,
but caused bugs which didn't make sense.</p>
<p>There were even more cases causing even more surprising behaviors.</p>
<p>Summary of the fix:</p>
<ul class="simple">
<li>(... similar steps for the beginning ...)</li>
<li>The wait future is cancelled: <strong>create an event</strong> to wait until the
cancellation completes (call <tt class="docutils literal">UnregisterWaitEx()</tt>).</li>
<li>Wait for the event.</li>
<li>The event is signalled which means that the cancellation completed: <strong>write</strong>
into the overlapped memory.</li>
<li>The overlapped is destroyed: <strong>free</strong> overlapped memory.</li>
</ul>
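The dangerous "write after free" ordering has a convenient pure-Python analogue: a released <tt class="docutils literal">memoryview</tt> rejects late writes with an exception, where raw C memory silently accepts them (or crashes):

```python
buf = bytearray(8)
view = memoryview(buf)   # plays the role of the overlapped memory
view.release()           # "free": the buffer is gone from the view's side
try:
    view[0] = 1          # the late completion write
except ValueError as exc:
    print("late write rejected:", exc)
```

In C there is no such guard: the late write scribbles over whatever now occupies that address, producing the crashes and nonsensical bugs described above.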
</div>
Locale Bugfixes in Python 32019-01-09T00:30:00+01:002019-01-09T00:30:00+01:00Victor Stinnertag:vstinner.github.io,2019-01-09:/locale-bugfixes-python3.html<a class="reference external image-reference" href="https://www.flickr.com/photos/svensson/40467591/"><img alt="Unicode Mixed Bag" src="https://vstinner.github.io/images/unicode_bag.jpg" /></a>
<p>This article describes a few locales bugs that I fixed in Python 3 between 2012
(Python 3.3) and 2018 (Python 3.7):</p>
<ul class="simple">
<li>Support non-ASCII decimal point and thousands separator</li>
<li>Crash with non-ASCII decimal point</li>
<li>LC_NUMERIC encoding different than LC_CTYPE encoding</li>
<li>LC_MONETARY encoding different than LC_CTYPE encoding</li>
<li>Tests non-ASCII locales …</li></ul><a class="reference external image-reference" href="https://www.flickr.com/photos/svensson/40467591/"><img alt="Unicode Mixed Bag" src="https://vstinner.github.io/images/unicode_bag.jpg" /></a>
<p>This article describes a few locales bugs that I fixed in Python 3 between 2012
(Python 3.3) and 2018 (Python 3.7):</p>
<ul class="simple">
<li>Support non-ASCII decimal point and thousands separator</li>
<li>Crash with non-ASCII decimal point</li>
<li>LC_NUMERIC encoding different than LC_CTYPE encoding</li>
<li>LC_MONETARY encoding different than LC_CTYPE encoding</li>
<li>Tests non-ASCII locales</li>
</ul>
<p>See also my previous locale bugfixes: <a class="reference external" href="https://vstinner.github.io/python3-locales-encodings.html">Python 3, locales and encodings</a></p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>Each language and each country has different ways to represent dates, monetary
values, numbers, etc. Unix has "locales" to configure applications for a
specific language and a specific country. For example, there are <tt class="docutils literal">fr_BE</tt> for
Belgium (French) and <tt class="docutils literal">fr_FR</tt> for France (French).</p>
<p>In practice, each locale uses its own encoding, and problems arise when an
application uses a different encoding than the locale. There is an LC_NUMERIC
locale category for numbers, LC_MONETARY for monetary values and LC_CTYPE for the
encoding. Not only is it possible to configure an application to use LC_NUMERIC
with a different encoding than LC_CTYPE, but some users actually use such a
configuration!</p>
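The categories can be queried and set independently with Python's <tt class="docutils literal">locale</tt> module. This sketch sticks to the portable "C" locale, since names like <tt class="docutils literal">fr_FR</tt> or <tt class="docutils literal">fr_BE</tt> vary across systems and may not be installed:

```python
import locale

# LC_ALL sets every category at once; individual categories such as
# locale.LC_NUMERIC or locale.LC_MONETARY can also be set separately.
locale.setlocale(locale.LC_ALL, "C")
conv = locale.localeconv()
print(conv["decimal_point"])         # '.' in the C locale
print(repr(conv["thousands_sep"]))   # '' : the C locale does no grouping
```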
<p>In an application which only uses bytes for text, as Python 2 mostly does,
things are mostly fine: in the worst case, users see <a class="reference external" href="https://en.wikipedia.org/wiki/Mojibake">mojibake</a>, but the application doesn't
"crash" (exit and/or lose data). On the other hand, <strong>Python 3 is designed to
use Unicode for text and raises hard Unicode errors when it fails to decode
bytes or to encode text</strong>.</p>
</div>
<div class="section" id="support-non-ascii-decimal-point-and-thousands-separator">
<h2>Support non-ASCII decimal point and thousands separator</h2>
<p>The Unicode type has been reimplemented in Python 3.3 to use "compact string":
<a class="reference external" href="https://www.python.org/dev/peps/pep-0393/">PEP 393 "Flexible String Representation"</a>. The new implementation is more
complex and the format() function has been limited to ASCII for the decimal
point and thousands separator (format a number using the "n" type).</p>
<p>In January 2012, Stefan Krah noticed the regression (compared to Python 3.2)
and reported <a class="reference external" href="https://bugs.python.org/issue13706">bpo-13706</a>. I fixed the
code to support non-ASCII in format (<a class="reference external" href="https://github.com/python/cpython/commit/a4ac600d6f9c5b74b97b99888b7cf3a7973cadc8">commit a4ac600d</a>).
But when I did more tests, I noticed that the "n" type didn't properly decode
the decimal point and thousands separator, which come from the <tt class="docutils literal">localeconv()</tt>
function as byte strings.</p>
<p>I fixed <tt class="docutils literal">format(int, "n")</tt> with <a class="reference external" href="https://github.com/python/cpython/commit/41a863cb81608c779d60b49e7be8a115816734fc">commit 41a863cb</a>,
which decodes the decimal point and the thousands separator (<tt class="docutils literal">localeconv()</tt> fields) from
the locale encoding rather than Latin-1, using <tt class="docutils literal">PyUnicode_DecodeLocale()</tt>:</p>
<pre class="literal-block">
commit 41a863cb81608c779d60b49e7be8a115816734fc
Author: Victor Stinner <victor.stinner@haypocalc.com>
Date: Fri Feb 24 00:37:51 2012 +0100
Issue #13706: Fix format(int, "n") for locale with non-ASCII thousands separator
* Decode thousands separator and decimal point using PyUnicode_DecodeLocale()
(from the locale encoding), instead of decoding them implicitly from latin1
* Remove _PyUnicode_InsertThousandsGroupingLocale(), it was not used
* Change _PyUnicode_InsertThousandsGrouping() API to return the maximum
character if unicode is NULL
* (...)
</pre>
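As a quick reminder of what the fix protects: the "n" format type is the locale-aware cousin of "d". In the C locale it degrades to plain digits, while the "," option groups independently of the locale:

```python
import locale

locale.setlocale(locale.LC_ALL, "C")
print(format(1234567, "n"))    # '1234567': the C locale has no thousands separator
print(format(1234567, ",d"))   # '1,234,567': grouping independent of the locale
```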
<p>Note: I decided to not fix Python 3.2:</p>
<blockquote>
Hum, <strong>it is not trivial to redo the work on Python 3.2</strong>. I prefer to leave
the code unchanged to not introduce a regression, and I wait until a Python
3.2 user complains (the bug exists since Python 3.0 and nobody complained).</blockquote>
</div>
<div class="section" id="crash-with-non-ascii-decimal-point">
<h2>Crash with non-ASCII decimal point</h2>
<p>Six years later, in June 2018, I noticed that Python does crash when running
tests on locales:</p>
<pre class="literal-block">
$ ./python
Python 3.8.0a0 (heads/master-dirty:bcd3a1a18d, Jun 23 2018, 10:31:03)
[GCC 8.1.1 20180502 (Red Hat 8.1.1-1)] on linux
>>> import locale
>>> locale.str(2.5)
'2.5'
>>> '{:n}'.format(2.5)
'2.5'
>>> locale.setlocale(locale.LC_ALL, '')
'fr_FR.UTF-8'
>>> locale.str(2.5)
'2,5'
>>> '{:n}'.format(2.5)
python: Objects/unicodeobject.c:474: _PyUnicode_CheckConsistency: Assertion `maxchar < 128' failed.
Aborted (core dumped)
</pre>
<p>I reported the issue as <a class="reference external" href="https://bugs.python.org/issue33954">bpo-33954</a>. The
bug only occurs for a decimal point with a code point greater than U+00FF
(255). It was a bug in my <a class="reference external" href="https://bugs.python.org/issue13706">bpo-13706</a>
fix: <a class="reference external" href="https://github.com/python/cpython/commit/a4ac600d6f9c5b74b97b99888b7cf3a7973cadc8">commit a4ac600d</a>.</p>
<p>I pushed a second fix to properly support all cases, <a class="reference external" href="https://github.com/python/cpython/commit/59423e3ddd736387cef8f7632c71954c1859bed0">commit 59423e3d</a>:</p>
<pre class="literal-block">
commit 59423e3ddd736387cef8f7632c71954c1859bed0
Author: Victor Stinner <vstinner@redhat.com>
Date: Mon Nov 26 13:40:01 2018 +0100
bpo-33954: Fix _PyUnicode_InsertThousandsGrouping() (GH-10623)
Fix str.format(), float.__format__() and complex.__format__() methods
for non-ASCII decimal point when using the "n" formatter.
Changes:
* Rewrite _PyUnicode_InsertThousandsGrouping(): it now requires
a _PyUnicodeWriter object for the buffer and a Python str object
for digits.
* Rename FILL() macro to unicode_fill(), convert it to static inline function,
add "assert(0 <= start);" and rework its code.
</pre>
</div>
<div class="section" id="lc-numeric-encoding-different-than-lc-ctype-encoding">
<h2>LC_NUMERIC encoding different than LC_CTYPE encoding</h2>
<p>In August 2017, Petr Viktorin identified a bug in Koji (server building Fedora
packages): <a class="reference external" href="https://bugzilla.redhat.com/show_bug.cgi?id=1484497">UnicodeDecodeError in localeconv() makes test_float fail in Koji</a></p>
<blockquote>
"This is tripped by Python's test suite, namely
test_float.GeneralFloatCases.test_float_with_comma"</blockquote>
<p>He wrote a short reproducer script:</p>
<pre class="literal-block">
import locale
locale.setlocale(locale.LC_ALL, 'C.UTF-8')
locale.setlocale(locale.LC_NUMERIC, 'fr_FR.ISO8859-1')
print(locale.localeconv())
</pre>
<p>Two months later, Charalampos Stratakis reported the bug upstream: <a class="reference external" href="https://bugs.python.org/issue31900">bpo-31900</a>. The problem arises when <strong>the
LC_NUMERIC locale uses a different encoding than the LC_CTYPE encoding</strong>.</p>
<p>The bug was already known:</p>
<ul class="simple">
<li>2015-12-05: Serhiy Storchaka reported <a class="reference external" href="https://bugs.python.org/issue25812">bpo-25812</a> with uk_UA locale</li>
<li>2016-11-03: Guillaume Pasquet reported <a class="reference external" href="https://bugs.python.org/issue28604">bpo-28604</a> with en_GB locale</li>
</ul>
<p>Moreover, <strong>the bug was known since 2009</strong>, Stefan Krah reported a very similar
bug: <a class="reference external" href="https://bugs.python.org/issue7442">bpo-7442</a>. I was even involved in
this issue in 2013, but then I forgot about it (as usual, I am working on too
many issues in parallel :-)).</p>
<p>In 2010, PostgreSQL <a class="reference external" href="https://www.postgresql.org/message-id/20100422015552.4B7E07541D0@cvs.postgresql.org">had the same issue</a>
and <a class="reference external" href="https://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/pg_locale.c?r1=1.53&r2=1.54">fixed the bug by changing temporarily the LC_CTYPE locale to the
LC_NUMERIC locale</a>.</p>
<p>In January 2018, I came back to this 9-year-old bug. I was fixing bugs in the
implementation of my <a class="reference external" href="https://www.python.org/dev/peps/pep-0540/">PEP 540 "Add a new UTF-8 Mode"</a>. I pushed a large change to fix
locale encodings in <a class="reference external" href="https://bugs.python.org/issue29240">bpo-29240</a>, <a class="reference external" href="https://github.com/python/cpython/commit/7ed7aead9503102d2ed316175f198104e0cd674c">commit
7ed7aead</a>:</p>
<pre class="literal-block">
commit 7ed7aead9503102d2ed316175f198104e0cd674c
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Mon Jan 15 10:45:49 2018 +0100
bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)
Modify locale.localeconv(), time.tzname, os.strerror() and other
functions to ignore the UTF-8 Mode: always use the current locale
encoding.
Changes: (...)
</pre>
<p>Stefan Krah asked:</p>
<blockquote>
I have the exact same questions as Marc-Andre. This is one of the reasons
why I blocked the _decimal change. I don't fully understand the role of the
new glibc, since #7442 has existed for ages -- and <strong>it is a open question
whether it is a bug or not</strong>.</blockquote>
<p>I replied:</p>
<blockquote>
<p>Past 10 years, I repeated to every single user I met that "Python 3 is
right, your system setup is wrong". But that's a waste of time. People
continue to associate Python3 and Unicode to annoying bugs, because they
don't understand how locales work.</p>
<p>Instead of having to repeat to each user that "hum, maybe your config is
wrong", <strong>I prefer to support this non convential setup and work as expected
("it just works")</strong>. With my latest implementation, setlocale() is only done
when LC_CTYPE and LC_NUMERIC are different, which is the corner case which
"shouldn't occur in practice".</p>
</blockquote>
<p>Marc-Andre Lemburg added:</p>
<blockquote>
Sounds like a good compromise :-)</blockquote>
<p>After doing more tests on FreeBSD, Linux and macOS, I pushed <a class="reference external" href="https://github.com/python/cpython/commit/cb064fc2321ce8673fe365e9ef60445a27657f54">commit cb064fc2</a>
to fix <a class="reference external" href="https://bugs.python.org/issue31900">bpo-31900</a> by changing
temporarily the LC_CTYPE locale to the LC_NUMERIC locale:</p>
<pre class="literal-block">
commit cb064fc2321ce8673fe365e9ef60445a27657f54
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Mon Jan 15 15:58:02 2018 +0100
bpo-31900: Fix localeconv() encoding for LC_NUMERIC (#4174)
* Add _Py_GetLocaleconvNumeric() function: decode decimal_point and
thousands_sep fields of localeconv() from the LC_NUMERIC encoding,
rather than decoding from the LC_CTYPE encoding.
* Modify locale.localeconv() and "n" formatter of str.format() (for
int, float and complex to use _Py_GetLocaleconvNumeric()
internally.
</pre>
<p>I dislike my own fix because temporarily changing the LC_CTYPE locale impacts
all threads, not only the current thread. But we failed to find another
solution. <strong>The LC_CTYPE locale is only changed if the LC_NUMERIC locale is
different than the LC_CTYPE locale and if the decimal point or the thousands
separator is non-ASCII.</strong></p>
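A Python-level sketch of the same workaround (the real fix is the C function <tt class="docutils literal">_Py_GetLocaleconvNumeric()</tt>; the helper name below is made up for illustration):

```python
import locale

def localeconv_with_numeric_ctype():
    """Read localeconv() while LC_CTYPE temporarily matches LC_NUMERIC."""
    saved_ctype = locale.setlocale(locale.LC_CTYPE)    # query current setting
    numeric = locale.setlocale(locale.LC_NUMERIC)      # query current setting
    try:
        if numeric != saved_ctype:
            # This affects all threads, which is why the real fix only
            # does it when the two locales actually differ.
            locale.setlocale(locale.LC_CTYPE, numeric)
        return locale.localeconv()
    finally:
        locale.setlocale(locale.LC_CTYPE, saved_ctype)

locale.setlocale(locale.LC_ALL, "C")
print(localeconv_with_numeric_ctype()["decimal_point"])  # '.'
```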
<p>Note: I proposed a change to fix the same bug in the <tt class="docutils literal">decimal</tt> module: <a class="reference external" href="https://github.com/python/cpython/pull/5191">PR
#5191</a>, but I abandoned my
patch.</p>
</div>
<div class="section" id="lc-monetary-encoding-different-than-lc-ctype-encoding">
<h2>LC_MONETARY encoding different than LC_CTYPE encoding</h2>
<p>Fixing <a class="reference external" href="https://bugs.python.org/issue31900">bpo-31900</a> drained all my
energy, but sadly... there was a similar bug with LC_MONETARY!</p>
<p>On 2016-11-03, Guillaume Pasquet reported <a class="reference external" href="https://bugs.python.org/issue28604">bpo-28604: Exception raised by
python3.5 when using en_GB locale</a>.</p>
<p>The fix is similar to the LC_NUMERIC fix: temporarily change the LC_CTYPE
locale to the LC_MONETARY locale, <a class="reference external" href="https://github.com/python/cpython/commit/02e6bf7f2025cddcbde6432f6b6396198ab313f4">commit 02e6bf7f</a>:</p>
<pre class="literal-block">
commit 02e6bf7f2025cddcbde6432f6b6396198ab313f4
Author: Victor Stinner <vstinner@redhat.com>
Date: Tue Nov 20 16:20:16 2018 +0100
bpo-28604: Fix localeconv() for different LC_MONETARY (GH-10606)
locale.localeconv() now sets temporarily the LC_CTYPE locale to the
LC_MONETARY locale if the two locales are different and monetary
strings are non-ASCII. This temporary change affects other threads.
Changes:
* locale.localeconv() can now set LC_CTYPE to LC_MONETARY to decode
monetary fields.
* (...)
</pre>
</div>
<div class="section" id="tests-non-ascii-locales">
<h2>Tests non-ASCII locales</h2>
<p>To test my bugfixes, I used manual tests. The first issue was to identify
locales with problematic characters: non-ASCII decimal point or thousands
separator for example. I wrote my own "test suite" for Windows, Linux, macOS
and FreeBSD on my website: <a class="reference external" href="https://vstinner.readthedocs.io/unicode.html#test-non-ascii-characters-with-locales">Test non-ASCII characters with locales</a>.</p>
<p>Example with localeconv() on Fedora 27:</p>
<table border="1" class="docutils">
<colgroup>
<col width="15%" />
<col width="8%" />
<col width="16%" />
<col width="25%" />
<col width="36%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">LC_ALL locale</th>
<th class="head">Encoding</th>
<th class="head">Field</th>
<th class="head">Bytes</th>
<th class="head">Text</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>es_MX.utf8</td>
<td>UTF-8</td>
<td>thousands_sep</td>
<td><tt class="docutils literal">0xE2 0x80 0x89</tt></td>
<td>U+2009</td>
</tr>
<tr><td>fr_FR.UTF-8</td>
<td>UTF-8</td>
<td>currency_symbol</td>
<td><tt class="docutils literal">0xE2 0x82 0xAC</tt></td>
<td>U+20AC (€)</td>
</tr>
<tr><td>ps_AF.utf8</td>
<td>UTF-8</td>
<td>thousands_sep</td>
<td><tt class="docutils literal">0xD9 0xAC</tt></td>
<td>U+066C (٬)</td>
</tr>
<tr><td>uk_UA.koi8u</td>
<td>KOI8-U</td>
<td>currency_symbol</td>
<td><tt class="docutils literal">0xC7 0xD2 0xCE 0x2E</tt></td>
<td>U+0433 U+0440 U+043d U+002E (грн.)</td>
</tr>
<tr><td>uk_UA.koi8u</td>
<td>KOI8-U</td>
<td>thousands_sep</td>
<td><tt class="docutils literal">0x9A</tt></td>
<td>U+00A0</td>
</tr>
</tbody>
</table>
<p>Manual tests became more and more complex, since there are so many cases: each
operating system uses different locale names, and the result depends on the libc
version. After months of manual tests, I wrote my small personal <strong>portable</strong>
locale test suite: <a class="reference external" href="https://github.com/vstinner/misc/blob/master/python/test_all_locales.py">test_all_locales.py</a>.
It supports:</p>
<ul class="simple">
<li>FreeBSD 11</li>
<li>macOS</li>
<li>Fedora (Linux)</li>
</ul>
<p>Example:</p>
<pre class="literal-block">
def test_zh_TW_Big5(self):
loc = "zh_TW.Big5" if BSD else "zh_TW.big5"
if FREEBSD:
currency_symbol = u'\uff2e\uff34\uff04'
decimal_point = u'\uff0e'
thousands_sep = u'\uff0c'
date_str = u'\u661f\u671f\u56db 2\u6708'
else:
currency_symbol = u'NT$'
decimal_point = u'.'
thousands_sep = u','
if MACOS:
date_str = u'\u9031\u56db 2\u6708'
else:
date_str = u'\u9031\u56db \u4e8c\u6708'
self.set_locale(loc, "Big5")
lc = locale.localeconv()
self.assertLocaleEqual(lc['currency_symbol'], currency_symbol)
self.assertLocaleEqual(lc['decimal_point'], decimal_point)
self.assertLocaleEqual(lc['thousands_sep'], thousands_sep)
self.assertLocaleEqual(time.strftime('%A %B', FEBRUARY), date_str)
</pre>
<p>The best would be to integrate these tests directly into the Python test suite,
but it's neither portable nor future-proof, since most constants are hardcoded
and depend on the operating system and the libc version.</p>
</div>
Python 3, locales and encodings2018-09-06T16:00:00+02:002018-09-06T16:00:00+02:00Victor Stinnertag:vstinner.github.io,2018-09-06:/python3-locales-encodings.html<img alt="I □ Unicode" src="https://vstinner.github.io/images/i-square-unicode.jpg" />
<p>Recently, I worked on a change which looked simple: move the code to initialize
the <tt class="docutils literal">sys.stdout</tt> encoding before <tt class="docutils literal">Py_Initialize()</tt>. While I was on it,
I also decided to move the code which selects the Python "filesystem encoding".
I didn't expect that I would spend 2 weeks on these issues …</p><img alt="I □ Unicode" src="https://vstinner.github.io/images/i-square-unicode.jpg" />
<p>Recently, I worked on a change which looked simple: move the code to initialize
the <tt class="docutils literal">sys.stdout</tt> encoding before <tt class="docutils literal">Py_Initialize()</tt>. While I was on it,
I also decided to move the code which selects the Python "filesystem encoding".
I didn't expect that I would spend 2 weeks on these issues... This article
tells the story of my recent journey through locales and encodings on AIX, HP-UX,
Windows, Linux, macOS, Solaris and FreeBSD.</p>
<p>Table of Contents:</p>
<ul class="simple">
<li>Lying HP-UX</li>
<li>Standard streams and filesystem encodings</li>
<li>POSIX locale on FreeBSD</li>
<li>C locale on Windows</li>
<li>Back to stdio encoding</li>
<li>Back to filesystem encoding</li>
<li>Use surrogatepass on Windows</li>
<li>Filesystem encoding documentation</li>
<li>Final FreeBSD 10 issue</li>
<li>Configuration of locales and encodings</li>
</ul>
<div class="section" id="lying-hp-ux">
<h2>Lying HP-UX</h2>
<p>On 2018-08-14, Michael Osipov reported <a class="reference external" href="https://bugs.python.org/issue34403">bpo-34403</a>:
"test_utf8_mode.test_cmd_line() fails on HP-UX due to false assumptions":</p>
<pre class="literal-block">
======================================================================
FAIL: test_cmd_line (test.test_utf8_mode.UTF8ModeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
(...)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
: roman8:['h\xc3\xa9\xe2\x82\xac']
</pre>
<p>Interesting, HP-UX uses "roman8" as its locale encoding. What is this "new"
encoding? Wikipedia: <a class="reference external" href="https://en.wikipedia.org/wiki/HP_Roman#Roman-8">HP Roman-8</a>. Oh, that's even older than
the common ISO 8859 encodings like Latin1!</p>
<p>Michael Felt was working on a similar test_utf8_mode failure on AIX, so the two
of them tried to debug the issue together, but failed to understand it. Osipov
proposed giving up and just skipping the test on HP-UX...</p>
<p>I showed up and proposed a fix for the unit test: <a class="reference external" href="https://github.com/python/cpython/pull/8967/files">PR 8967</a>. The test was hardcoding
the expected locale encoding. I modified the test to query the locale encoding
at runtime instead.</p>
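Querying the announced locale encoding at runtime, instead of hardcoding it, looks like this (<tt class="docutils literal">nl_langinfo()</tt> is Unix-only; it does not exist on Windows):

```python
import locale

locale.setlocale(locale.LC_ALL, "")            # adopt the user's locale
codeset = locale.nl_langinfo(locale.CODESET)   # what the libc announces
print(codeset)   # e.g. 'UTF-8', 'ANSI_X3.4-1968', or 'roman8' on HP-UX
```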
<p>Bad surprise: the test still failed. <a class="reference external" href="https://bugs.python.org/issue34403#msg324219">I commented</a>:</p>
<blockquote>
Hum, it looks like a bug in the C library of HP-UX.</blockquote>
<p>I wrote a C program calling mbstowcs() to check what is the actual encoding
used by the C library: <a class="reference external" href="https://bugs.python.org/file47767/c_locale.c">c_locale.c</a>. <a class="reference external" href="https://bugs.python.org/issue34403#msg324225">Result</a>:</p>
<blockquote>
Well, it confirms what I expected: <tt class="docutils literal">nl_langinfo(CODESET)</tt> announces
<tt class="docutils literal">"roman8"</tt>, but <tt class="docutils literal">mbstowcs()</tt> uses Latin1 encoding in practice.</blockquote>
<p>So I wrote a workaround similar to the one used on FreeBSD and Solaris: check
if the libc announces an encoding different from the real encoding, and if so,
force the usage of the ASCII encoding in Python. See
my <a class="reference external" href="https://github.com/python/cpython/commit/d500e5307aec9c5d535f66d567fadb9c587a9a36">commit d500e530</a>:</p>
<pre class="literal-block">
Author: Victor Stinner <vstinner@redhat.com>
Date: Tue Aug 28 17:27:36 2018 +0200
bpo-34403: On HP-UX, force ASCII for C locale (GH-8969)
On HP-UX with C or POSIX locale, sys.getfilesystemencoding() now returns
"ascii" instead of "roman8" (when the UTF-8 Mode is disabled and the C locale
is not coerced).
nl_langinfo(CODESET) announces "roman8" whereas it uses the Latin1
encoding in practice.
</pre>
<p>Extract of the heuristic code:</p>
<pre class="literal-block">
if (strcmp(encoding, "roman8") == 0) {
    unsigned char ch = (unsigned char)0xA7;
    wchar_t wch;
    size_t res = mbstowcs(&wch, (char*)&ch, 1);
    if (res != (size_t)-1 && wch == L'\xA7') {
        /* On HP-UX with the C locale or the POSIX locale,
           nl_langinfo(CODESET) announces "roman8",
           whereas mbstowcs() uses Latin1 encoding in practice.
           Force ASCII in this case. Roman8 decodes 0xA7
           to U+00CF. Latin1 decodes 0xA7 to U+00A7. */
        return 1;
    }
}
</pre>
<p>Python 3.8 will handle Unicode better on HP-UX. The test_utf8_mode
failure was just a hint at a real underlying bug!</p>
</div>
<div class="section" id="standard-streams-and-filesystem-encodings">
<h2>Standard streams and filesystem encodings</h2>
<p>While reworking the Python initialization, I tried to move <strong>all</strong>
configuration parameters to a new <tt class="docutils literal">_PyCoreConfig</tt> structure. But I knew that
I had missed at least the standard streams encoding (ex: <tt class="docutils literal">sys.stdout.encoding</tt>).
My first attempt to move the code failed: it broke many tests. I created
<a class="reference external" href="https://bugs.python.org/issue34485">bpo-34485</a>: "_PyCoreConfig: add
stdio_encoding and stdio_errors".</p>
<p>While working on the stdio encoding, I recalled that the Python
filesystem encoding is also initialized "late". I created <a class="reference external" href="https://bugs.python.org/issue34523">bpo-34523</a>: "Choose the filesystem encoding before
Python initialization (add _PyCoreConfig.filesystem_encoding)" to move this
code as well.</p>
<p>I quickly had an implementation, but it didn't go as well as expected...</p>
</div>
<div class="section" id="posix-locale-on-freebsd">
<h2>POSIX locale on FreeBSD</h2>
<p><a class="reference external" href="https://bugs.python.org/issue34485">bpo-34485</a>: To me, the "C" and "POSIX"
locales were the same locale: C is an alias of POSIX, or the opposite; it
didn't really matter. But Python handles them differently in some corner
cases. For example, Nick Coghlan's PEP 538 (C locale coercion) is only enabled
if the LC_CTYPE locale is equal to "C", not if it's equal to "POSIX".</p>
<p>In Python 3.5, I changed stdin and stdout error handlers from strict to
surrogateescape if the LC_CTYPE locale is "C": <a class="reference external" href="https://bugs.python.org/issue19977">bpo-19977</a>. But when I tested my
stdio and filesystem changes on Linux, FreeBSD and Windows, I noticed that
I forgot to handle the "POSIX" locale. On FreeBSD, <tt class="docutils literal">LC_ALL=POSIX</tt> and <tt class="docutils literal">LC_ALL=C</tt>
behave differently:</p>
<ul class="simple">
<li>With <tt class="docutils literal">LC_ALL=POSIX</tt> environment, <tt class="docutils literal">setlocale(LC_CTYPE, "")</tt> returns <tt class="docutils literal">"POSIX"</tt></li>
<li>With <tt class="docutils literal">LC_ALL=C</tt> environment, <tt class="docutils literal">setlocale(LC_CTYPE, "")</tt> returns <tt class="docutils literal">"C"</tt></li>
</ul>
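<p>The query/set semantics of the <tt class="docutils literal">setlocale()</tt> calls above can be sketched in Python (a minimal illustration; whether the libc reports "C" or "POSIX" for the environment-based call remains platform-dependent):</p>

```python
import locale

# Passing None queries the current locale without changing it;
# an explicit name sets it; "" would pick the locale from the
# LC_ALL / LC_CTYPE / LANG environment variables (platform-dependent,
# so not demonstrated here).
locale.setlocale(locale.LC_CTYPE, "C")
assert locale.setlocale(locale.LC_CTYPE, None) == "C"
```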
<p>I fixed that to also use the "surrogateescape" error handler for the POSIX
locale on FreeBSD. <a class="reference external" href="https://github.com/python/cpython/commit/315877dc361d554bec34b4b62c270479ad36a1be">Commit 315877dc</a>:</p>
<pre class="literal-block">
Author: Victor Stinner <vstinner@redhat.com>
Date: Wed Aug 29 09:58:12 2018 +0200
bpo-34485: stdout uses surrogateescape on POSIX locale (GH-8986)
Standard streams like sys.stdout now use the "surrogateescape" error
handler, instead of "strict", on the POSIX locale (when the C locale is not
coerced and the UTF-8 Mode is disabled).
Add tests on sys.stdout.errors with LC_ALL=POSIX.
</pre>
<p>The most important change is just one line:</p>
<pre class="literal-block">
- if (strcmp(ctype_loc, "C") == 0) {
+ if (strcmp(ctype_loc, "C") == 0 || strcmp(ctype_loc, "POSIX") == 0) {
return "surrogateescape";
}
</pre>
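<p>What the <tt class="docutils literal">surrogateescape</tt> error handler does can be illustrated in pure Python: undecodable bytes are smuggled into the string as lone surrogates and restored on encoding:</p>

```python
# b"\xe9" is invalid as standalone UTF-8: strict decoding would fail,
# surrogateescape maps the byte 0xE9 to the lone surrogate U+DCE9
data = b"caf\xe9"
text = data.decode("utf-8", "surrogateescape")
assert text == "caf\udce9"
# Encoding with surrogateescape restores the original bytes unchanged
assert text.encode("utf-8", "surrogateescape") == data
```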
<p><a class="reference external" href="https://bugs.python.org/issue34527">bpo-34527</a>: Since I was testing
various configurations, I also noticed that my UTF-8 Mode (PEP 540) had the
same bug. Python 3.7 enables it if the LC_CTYPE locale is equal to "C",
but not if it's equal to "POSIX". I also changed that (<a class="reference external" href="https://github.com/python/cpython/commit/5cb258950ce9b69b1f65646431c464c0c17b1510">commit 5cb25895</a>).</p>
</div>
<div class="section" id="c-locale-on-windows">
<h2>C locale on Windows</h2>
<p>While testing my changes on Windows, I noticed that Python starts with the
LC_CTYPE locale equal to "C", whereas <tt class="docutils literal">locale.setlocale(locale.LC_CTYPE, "")</tt>
changes the LC_CTYPE locale to something like <tt class="docutils literal">English_United States.1252</tt>
(English with the code page 1252). Example with Python 3.6:</p>
<pre class="literal-block">
C:\> python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32
>>> import locale
>>> locale.setlocale(locale.LC_CTYPE, None)
'C'
>>> locale.setlocale(locale.LC_CTYPE, "")
'English_United States.1252'
>>> locale.setlocale(locale.LC_CTYPE, None)
'English_United States.1252'
</pre>
<p>On UNIX, Python 2 starts with the default C locale, whereas Python 3 always
sets the LC_CTYPE locale to my preference. Example on Fedora 28 with
<tt class="docutils literal"><span class="pre">LANG=fr_FR.UTF-8</span></tt>:</p>
<pre class="literal-block">
$ python2 -c 'import locale; print(locale.setlocale(locale.LC_CTYPE, None))'
C
$ python3 -c 'import locale; print(locale.setlocale(locale.LC_CTYPE, None))'
fr_FR.UTF-8
</pre>
<p>I modified Windows to behave as UNIX, <a class="reference external" href="https://github.com/python/cpython/commit/177d921c8c03d30daa32994362023f777624b10d">commit 177d921c</a>:</p>
<pre class="literal-block">
Author: Victor Stinner <vstinner@redhat.com>
Date: Wed Aug 29 11:25:15 2018 +0200
bpo-34485, Windows: LC_CTYPE set to user preference (GH-8988)
On Windows, the LC_CTYPE is now set to the user preferred locale at
startup: _Py_SetLocaleFromEnv(LC_CTYPE) is now called during the
Python initialization. Previously, the LC_CTYPE locale was "C" at
startup, but changed when calling setlocale(LC_CTYPE, "") or
setlocale(LC_ALL, "").
pymain_read_conf() now also calls _Py_SetLocaleFromEnv(LC_CTYPE) to
behave as _Py_InitializeCore(). Moreover, it doesn't save/restore the
LC_ALL anymore.
On Windows, standard streams like sys.stdout now always use
surrogateescape error handler by default (ignore the locale).
</pre>
<p>Example:</p>
<pre class="literal-block">
C:\> python3.6 -c "import locale; print(locale.setlocale(locale.LC_CTYPE, None))"
C
C:\> python3.8 -c "import locale; print(locale.setlocale(locale.LC_CTYPE, None))"
English_United States.1252
</pre>
<p>On Windows, Python 3.8 now starts with the LC_CTYPE locale set to my
preference, as it was already previously done on UNIX.</p>
</div>
<div class="section" id="back-to-stdio-encoding">
<h2>Back to stdio encoding</h2>
<p>After all previous changes and fixes, I was able to push my <a class="reference external" href="https://github.com/python/cpython/commit/dfe0dc74536dfb6f331131d9b2b49557675bb6b7">commit dfe0dc74</a>:</p>
<pre class="literal-block">
Author: Victor Stinner <vstinner@redhat.com>
Date: Wed Aug 29 11:47:29 2018 +0200
bpo-34485: Add _PyCoreConfig.stdio_encoding (GH-8881)
* Add stdio_encoding and stdio_errors fields to _PyCoreConfig.
* Add unit tests on stdio_encoding and stdio_errors.
</pre>
</div>
<div class="section" id="back-to-filesystem-encoding">
<h2>Back to filesystem encoding</h2>
<p><a class="reference external" href="https://github.com/python/cpython/commit/b2457efc78b74a1d6d1b77d11a939e886b8a4e2c">Commit b2457efc</a>:</p>
<pre class="literal-block">
Author: Victor Stinner <vstinner@redhat.com>
Date: Wed Aug 29 13:25:36 2018 +0200
bpo-34523: Add _PyCoreConfig.filesystem_encoding (GH-8963)
_PyCoreConfig_Read() is now responsible to choose the filesystem
encoding and error handler. Using Py_Main(), the encoding is now
chosen even before calling Py_Initialize().
_PyCoreConfig.filesystem_encoding is now the reference, instead of
Py_FileSystemDefaultEncoding, for the Python filesystem encoding.
Changes:
* Add filesystem_encoding and filesystem_errors to _PyCoreConfig
* _PyCoreConfig_Read() now reads the locale encoding for the file
system encoding.
* PyUnicode_EncodeFSDefault() and PyUnicode_DecodeFSDefaultAndSize()
now use the interpreter configuration rather than
Py_FileSystemDefaultEncoding and Py_FileSystemDefaultEncodeErrors
global configuration variables.
* Add _Py_SetFileSystemEncoding() and _Py_ClearFileSystemEncoding()
private functions to only modify Py_FileSystemDefaultEncoding and
Py_FileSystemDefaultEncodeErrors in coreconfig.c.
* _Py_CoerceLegacyLocale() now takes an int rather than
_PyCoreConfig for the warning.
</pre>
</div>
<div class="section" id="use-surrogatepass-on-windows">
<h2>Use surrogatepass on Windows</h2>
<p>While working on the filesystem encoding change, I had a bug in
_freeze_importlib.exe which failed at startup:</p>
<pre class="literal-block">
ValueError: only 'strict' and 'surrogateescape' error handlers are supported, not 'surrogatepass'
</pre>
<p>I used the following workaround in <tt class="docutils literal">_freeze_importlib.c</tt>:</p>
<pre class="literal-block">
#ifdef MS_WINDOWS
    /* bpo-34523: initfsencoding() is not called if _install_importlib=0,
       so interp->fscodec_initialized value remains 0.
       PyUnicode_EncodeFSDefault() doesn't support the "surrogatepass" error
       handler in such case, whereas it's the default error handler on Windows.
       Force the "strict" error handler to work around this bootstrap issue. */
    config.filesystem_errors = "strict";
#endif
</pre>
<p>But I wasn't fully happy with the workaround. When running more manual tests, I
found that the <tt class="docutils literal">PYTHONLEGACYWINDOWSFSENCODING</tt> environment variable wasn't
handled properly. I pushed a first fix,
<a class="reference external" href="https://github.com/python/cpython/commit/c5989cd87659acbfd4d19dc00dbe99c3a0fc9bd2">commit c5989cd8</a>:</p>
<pre class="literal-block">
Author: Victor Stinner <vstinner@redhat.com>
Date: Wed Aug 29 19:32:47 2018 +0200
bpo-34523: Py_DecodeLocale() use UTF-8 on Windows (GH-8998)
Py_DecodeLocale() and Py_EncodeLocale() now use the UTF-8 encoding on
Windows if Py_LegacyWindowsFSEncodingFlag is zero.
pymain_read_conf() now sets Py_LegacyWindowsFSEncodingFlag in its
loop, but restore its value at exit.
</pre>
<p>My intent was to be able to use the <tt class="docutils literal">surrogatepass</tt> error handler. If
<tt class="docutils literal">Py_DecodeLocale()</tt> is hardcoded to use UTF-8 on Windows, we get
access to the <tt class="docutils literal">surrogatepass</tt> error handler. Previously, the <tt class="docutils literal">mbstowcs()</tt>
function was used, and it only supports the <tt class="docutils literal">strict</tt> and
<tt class="docutils literal">surrogateescape</tt> error handlers.</p>
<p>I pushed a second big change to add support for the <tt class="docutils literal">surrogatepass</tt> error
handler in locale codecs, <a class="reference external" href="https://github.com/python/cpython/commit/3d4226a832cabc630402589cc671cc4035d504e5">commit 3d4226a8</a>:</p>
<pre class="literal-block">
Author: Victor Stinner <vstinner@redhat.com>
Date: Wed Aug 29 22:21:32 2018 +0200
bpo-34523: Support surrogatepass in locale codecs (GH-8995)
Add support for the "surrogatepass" error handler in
PyUnicode_DecodeFSDefault() and PyUnicode_EncodeFSDefault()
for the UTF-8 encoding.
Changes:
* _Py_DecodeUTF8Ex() and _Py_EncodeUTF8Ex() now support the
surrogatepass error handler (_Py_ERROR_SURROGATEPASS).
* _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx() now use
the _Py_error_handler enum instead of "int surrogateescape" to pass
the error handler. These functions now return -3 if the error
handler is unknown.
* Add unit tests on _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx()
in test_codecs.
* Rename get_error_handler() to _Py_GetErrorHandler() and expose it
as a private function.
* _freeze_importlib doesn't need config.filesystem_errors="strict"
workaround anymore.
</pre>
<p><tt class="docutils literal">PyUnicode_DecodeFSDefault()</tt> and <tt class="docutils literal">PyUnicode_EncodeFSDefault()</tt> functions
use <tt class="docutils literal">Py_DecodeLocale()</tt> and <tt class="docutils literal">Py_EncodeLocale()</tt> before the Python codec of
the filesystem encoding is loaded. With this big change, <tt class="docutils literal">Py_DecodeLocale()</tt>
and <tt class="docutils literal">Py_EncodeLocale()</tt> now really behave like the Python codec.</p>
<p>Previously, Python started with the <tt class="docutils literal">surrogateescape</tt> error handler, and
switched to the <tt class="docutils literal">surrogatepass</tt> error handler once the Python codec was
loaded.</p>
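<p>The difference between the two handlers can be shown in pure Python (a minimal sketch): <tt class="docutils literal">surrogatepass</tt> writes a lone surrogate as regular UTF-8 bytes, whereas <tt class="docutils literal">surrogateescape</tt> maps it back to its original single byte:</p>

```python
s = "\udc80"  # a lone surrogate, as produced by surrogateescape
# surrogatepass encodes it as the regular 3-byte UTF-8 sequence of U+DC80
assert s.encode("utf-8", "surrogatepass") == b"\xed\xb2\x80"
# surrogateescape instead maps it back to the single original byte
assert s.encode("utf-8", "surrogateescape") == b"\x80"
# decoding with surrogatepass restores the lone surrogate
assert b"\xed\xb2\x80".decode("utf-8", "surrogatepass") == s
```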
</div>
<div class="section" id="filesystem-encoding-documentation">
<h2>Filesystem encoding documentation</h2>
<p>One "last" change: I documented how Python selects the filesystem encoding,
<a class="reference external" href="https://github.com/python/cpython/commit/de427556746aa41a8b5198924ce423021bc0c718">commit de427556</a>:</p>
<pre class="literal-block">
Author: Victor Stinner <vstinner@redhat.com>
Date: Wed Aug 29 23:26:55 2018 +0200
bpo-34523: Py_FileSystemDefaultEncoding NULL by default (GH-9003)
* Py_FileSystemDefaultEncoding and Py_FileSystemDefaultEncodeErrors
default value is now NULL: initfsencoding() set them
during Python initialization.
* Document how Python chooses the filesystem encoding and error
handler.
* Add an assertion to _PyCoreConfig_Read().
</pre>
<p>Documentation:</p>
<pre class="literal-block">
/* Python filesystem encoding and error handler:
sys.getfilesystemencoding() and sys.getfilesystemencodeerrors().
Default encoding and error handler:
* if Py_SetStandardStreamEncoding() has been called: they have the
highest priority;
* PYTHONIOENCODING environment variable;
* The UTF-8 Mode uses UTF-8/surrogateescape;
* locale encoding: ANSI code page on Windows, UTF-8 on Android,
LC_CTYPE locale encoding on other platforms;
* On Windows, "surrogateescape" error handler;
* "surrogateescape" error handler if the LC_CTYPE locale is "C" or "POSIX";
* "surrogateescape" error handler if the LC_CTYPE locale has been coerced
(PEP 538);
* "strict" error handler.
Supported error handlers: "strict", "surrogateescape" and
"surrogatepass". The surrogatepass error handler is only supported
if Py_DecodeLocale() and Py_EncodeLocale() use directly the UTF-8 codec;
it's only used on Windows.
initfsencoding() updates the encoding to the Python codec name.
For example, "ANSI_X3.4-1968" is replaced with "ascii".
On Windows, sys._enablelegacywindowsfsencoding() sets the
encoding/errors to mbcs/replace at runtime.
See Py_FileSystemDefaultEncoding and Py_FileSystemDefaultEncodeErrors.
*/
char *filesystem_encoding;
char *filesystem_errors;
</pre>
</div>
<div class="section" id="final-freebsd-10-issue">
<h2>Final FreeBSD 10 issue</h2>
<p><a class="reference external" href="https://bugs.python.org/issue34544">bpo-34544</a>: The stdio and filesystem
encodings are now properly selected before Py_Initialize(), the LC_CTYPE locale
should be properly initialized, the "POSIX" locale is now properly handled, but
the FreeBSD 10 buildbot still complained about my recent changes... Many
<tt class="docutils literal">test_c_locale_coerce</tt> tests started to fail with:</p>
<blockquote>
Fatal Python error: get_locale_encoding: failed to get the locale encoding: nl_langinfo(CODESET) failed</blockquote>
<p>Sadly, I wasn't able to reproduce the issue on my FreeBSD 11 VM. I also got
access to the FreeBSD CURRENT buildbot, but I also failed to reproduce the bug
there. I was supposed to get access to the FreeBSD 10 buildbot, but there was a
DNS issue.</p>
<p>I had to <em>guess</em> the origin of the bug and I attempted a fix, <a class="reference external" href="https://github.com/python/cpython/commit/f01b2a1b84ee08df73a78cf1017eecf15e3cb995">commit f01b2a1b</a>:</p>
<pre class="literal-block">
Author: Victor Stinner <vstinner@redhat.com>
Date: Mon Sep 3 14:38:21 2018 +0200
bpo-34544: Fix setlocale() in pymain_read_conf() (GH-9041)
bpo-34485, bpo-34544: On some FreeBSD, nl_langinfo(CODESET) fails if
LC_ALL or LC_CTYPE is set to an invalid locale name. Replace
_Py_SetLocaleFromEnv(LC_CTYPE) with _Py_SetLocaleFromEnv(LC_ALL) to
initialize properly locales.
Partially revert commit 177d921c8c03d30daa32994362023f777624b10d.
</pre>
<p>... but it didn't work.</p>
<p>I decided to install a FreeBSD 10 VM and, one week later... I finally succeeded
in reproducing the issue!</p>
<p>The bug was that the <tt class="docutils literal">_Py_CoerceLegacyLocale()</tt> function didn't restore
LC_CTYPE to its previous value when it attempted to coerce the LC_CTYPE locale
but no locale worked.</p>
<p>Previously, it didn't matter, since the LC_CTYPE locale was initialized again
later, or it was saved/restored indirectly. But with my latest changes, the
LC_CTYPE was left unchanged.</p>
<p>The fix is just to restore LC_CTYPE if <tt class="docutils literal">_Py_CoerceLegacyLocale()</tt> fails,
<a class="reference external" href="https://github.com/python/cpython/commit/8ea09110d413829f71d979d8c7073008cb87fb03">commit 8ea09110</a>:</p>
<pre class="literal-block">
Author: Victor Stinner <vstinner@redhat.com>
Date: Mon Sep 3 17:05:18 2018 +0200
_Py_CoerceLegacyLocale() restores LC_CTYPE on fail (GH-9044)
bpo-34544: If _Py_CoerceLegacyLocale() fails to coerce the C locale,
restore the LC_CTYPE locale to its previous value.
</pre>
<p>Finally, I succeeded in doing what I initially wanted: remove the code which
saved/restored the LC_ALL locale. <tt class="docutils literal">pymain_read_conf()</tt> is now really
responsible for setting the LC_CTYPE locale, and it doesn't modify the LC_ALL
locale anymore.</p>
</div>
<div class="section" id="configuration-of-locales-and-encodings">
<h2>Configuration of locales and encodings</h2>
<p>Python has <strong>many</strong> options to configure the locales and encodings.</p>
<p>Main options of Python 3.7:</p>
<ul class="simple">
<li>Legacy Windows stdio (PEP 528)</li>
<li>Legacy Windows filesystem encoding (PEP 529)</li>
<li>C locale coercion (PEP 538)</li>
<li>UTF-8 mode (PEP 540)</li>
</ul>
<p>The combination of C locale coercion and UTF-8 mode is non-obvious and should
be carefully tested!</p>
<p>Environment variables:</p>
<ul class="simple">
<li><tt class="docutils literal">PYTHONCOERCECLOCALE=0</tt></li>
<li><tt class="docutils literal">PYTHONCOERCECLOCALE=1</tt></li>
<li><tt class="docutils literal">PYTHONCOERCECLOCALE=warn</tt></li>
<li><tt class="docutils literal"><span class="pre">PYTHONIOENCODING=:<errors></span></tt></li>
<li><tt class="docutils literal"><span class="pre">PYTHONIOENCODING=<encoding>:<errors></span></tt></li>
<li><tt class="docutils literal"><span class="pre">PYTHONIOENCODING=<encoding></span></tt></li>
<li><tt class="docutils literal">PYTHONLEGACYWINDOWSFSENCODING=1</tt></li>
<li><tt class="docutils literal">PYTHONLEGACYWINDOWSSTDIO=1</tt></li>
<li><tt class="docutils literal">PYTHONUTF8=0</tt></li>
<li><tt class="docutils literal">PYTHONUTF8=1</tt></li>
</ul>
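<p>The effect of <tt class="docutils literal">PYTHONUTF8</tt> can be checked from a child interpreter: <tt class="docutils literal">sys.flags.utf8_mode</tt> reports whether the UTF-8 Mode is enabled (a minimal sketch):</p>

```python
import os
import subprocess
import sys

# Run a child interpreter with PYTHONUTF8=1 and query the UTF-8 Mode flag
env = dict(os.environ, PYTHONUTF8="1")
proc = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
    env=env, capture_output=True, text=True,
)
assert proc.stdout.strip() == "1"
```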
<p>Command line options:</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">-X</span> utf8=0</tt></li>
<li><tt class="docutils literal"><span class="pre">-X</span> utf8</tt> or <tt class="docutils literal"><span class="pre">-X</span> utf8=1</tt></li>
<li><tt class="docutils literal"><span class="pre">-E</span></tt> or <tt class="docutils literal"><span class="pre">-I</span></tt> (ignore <tt class="docutils literal">PYTHON*</tt> environment variables)</li>
</ul>
<p>Global configuration variables:</p>
<ul class="simple">
<li><tt class="docutils literal">Py_FileSystemDefaultEncodeErrors</tt></li>
<li><tt class="docutils literal">Py_FileSystemDefaultEncoding</tt></li>
<li><tt class="docutils literal">Py_LegacyWindowsFSEncodingFlag</tt></li>
<li><tt class="docutils literal">Py_LegacyWindowsStdioFlag</tt></li>
<li><tt class="docutils literal">Py_UTF8Mode</tt></li>
</ul>
<p>_PyCoreConfig:</p>
<ul class="simple">
<li><tt class="docutils literal">coerce_c_locale</tt></li>
<li><tt class="docutils literal">coerce_c_locale_warn</tt></li>
<li><tt class="docutils literal">filesystem_encoding</tt></li>
<li><tt class="docutils literal">filesystem_errors</tt></li>
<li><tt class="docutils literal">stdio_encoding</tt></li>
<li><tt class="docutils literal">stdio_errors</tt></li>
</ul>
<p>The LC_CTYPE locale depends on 3 environment variables:</p>
<ul class="simple">
<li><tt class="docutils literal">LC_ALL</tt></li>
<li><tt class="docutils literal">LC_CTYPE</tt></li>
<li><tt class="docutils literal">LANG</tt></li>
</ul>
<p>Depending on the platform, the following configuration gives a different
LC_CTYPE locale:</p>
<ul class="simple">
<li><tt class="docutils literal">LC_ALL= LC_CTYPE= LANG=</tt> (no variable set)</li>
<li><tt class="docutils literal">LC_ALL= LC_CTYPE=C LANG=</tt> (C locale)</li>
<li><tt class="docutils literal">LC_ALL= LC_CTYPE=POSIX LANG=</tt> (POSIX locale)</li>
</ul>
<p>In case of doubt, I also tested:</p>
<ul class="simple">
<li><tt class="docutils literal">LC_ALL=C LC_CTYPE= LANG=</tt> (C locale)</li>
<li><tt class="docutils literal">LC_ALL=POSIX LC_CTYPE= LANG=</tt> (POSIX locale)</li>
</ul>
<p>The LC_CTYPE encoding (locale encoding) can be queried using
<tt class="docutils literal">nl_langinfo(CODESET)</tt>. On FreeBSD, Solaris, HP-UX and maybe other platforms,
<tt class="docutils literal">nl_langinfo(CODESET)</tt> announces an encoding different from the one actually
used by the <tt class="docutils literal">mbstowcs()</tt> and <tt class="docutils literal">wcstombs()</tt> functions, and so Python forces
the usage of the ASCII encoding.</p>
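<p>Forcing ASCII is safe because ASCII combined with the surrogateescape error handler round-trips arbitrary bytes, as this Python sketch shows:</p>

```python
# Every non-ASCII byte becomes a lone surrogate (U+DC80..U+DCFF) and is
# restored unchanged on encoding, so no byte sequence can make it fail
data = bytes(range(256))
text = data.decode("ascii", "surrogateescape")
assert text.encode("ascii", "surrogateescape") == data
```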
<p>The test matrix of all these configurations and all platforms is quite big.
Honestly, I would not bet that Python 3.8 will behave properly in all possible
cases. At least, I tried to fix all issues that I spotted! Moreover, I added
many tests which should help to detect bugs and prevent regressions.</p>
</div>
Python 3.7 UTF-8 Mode2018-03-27T20:00:00+02:002018-03-27T20:00:00+02:00Victor Stinnertag:vstinner.github.io,2018-03-27:/python37-new-utf8-mode.html<a class="reference external image-reference" href="https://www.flickr.com/photos/99444752@N06/9368903367/"><img alt="Sunrise" src="https://vstinner.github.io/images/sunrise.jpg" /></a>
<p>Since Python 3.0 was released in 2008, each time a user reported an encoding
issue, someone showed up and asked why Python does not "simply" always use UTF-8.
Well, it's not that easy. <strong>UTF-8 is the best encoding in most cases, but it is
still not the best encoding in all cases</strong>, even in 2018. The locale encoding
remains the best default filesystem encoding for Python. I would say that <strong>the
locale encoding is the least bad filesystem encoding</strong>.</p>
<p>This article tells the story of my <a class="reference external" href="https://www.python.org/dev/peps/pep-0540/">PEP 540: Add a new UTF-8 Mode</a> which adds an opt-in option to
<strong>"use UTF-8 everywhere"</strong>. Moreover, the UTF-8 Mode is enabled by the POSIX
locale: <strong>Python 3.7 now uses UTF-8 for the POSIX locale</strong>. My
PEP 540 is complementary to Nick Coghlan's PEP 538.</p>
<p>When I started to write this article, I wrote something like: "Hey! I added a
new option to use UTF-8, enjoy!". Written like that, it seems like using UTF-8
was an obvious choice and that it was really easy to write such PEP. No.
<strong>Nothing was obvious, nothing was simple.</strong></p>
<p>It took me one year to design and implement my PEP 540, and to get it accepted.
I wrote five articles before this one to show that PEP 540 only came after
a long, painful journey, starting with Python 3.0, to choose the best Python
encoding. My PEP relies on all the great work done previously.</p>
<p><strong>This article is the sixth and last in a series of articles telling the
history and rationale of the Python 3 Unicode model for the operating system:</strong></p>
<ol class="arabic simple">
<li><a class="reference external" href="https://vstinner.github.io/python30-listdir-undecodable-filenames.html">Python 3.0 listdir() Bug on Undecodable Filenames</a></li>
<li><a class="reference external" href="https://vstinner.github.io/pep-383.html">Python 3.1 surrogateescape error handler (PEP 383)</a></li>
<li><a class="reference external" href="https://vstinner.github.io/painful-history-python-filesystem-encoding.html">Python 3.2 Painful History of the Filesystem Encoding</a></li>
<li><a class="reference external" href="https://vstinner.github.io/python36-utf8-windows.html">Python 3.6 now uses UTF-8 on Windows</a></li>
<li><a class="reference external" href="https://vstinner.github.io/posix-locale.html">Python 3.7 and the POSIX locale</a></li>
<li><a class="reference external" href="https://vstinner.github.io/python37-new-utf8-mode.html">Python 3.7 UTF-8 Mode</a></li>
</ol>
<div class="section" id="fallback-to-utf-8-if-getting-the-locale-encoding-fails">
<h2>Fallback to UTF-8 if getting the locale encoding fails?</h2>
<p>May 2010, I reported <a class="reference external" href="https://bugs.python.org/issue8610">bpo-8610</a>:
"Python3/POSIX: errors if file system encoding is None". I asked what should
be the default encoding when getting the locale encoding fails. I proposed
to fall back to UTF-8. <a class="reference external" href="https://bugs.python.org/issue8610#msg105008">I wrote</a>:</p>
<blockquote>
<strong>UTF-8 is also an optimist choice</strong>: I bet that more and more operating
systems will move to UTF-8.</blockquote>
<p><a class="reference external" href="https://bugs.python.org/issue8610#msg105010">Marc-Andre commented</a>:</p>
<blockquote>
Ouch, that was a poor choice. <strong>In Python we have a tradition to avoid
guessing</strong>, if possible. Since we cannot guarantee that the file system
will indeed use UTF-8, it would have been safer to use ASCII. Not sure why
this reasoning wasn't applied for the file system encoding.</blockquote>
<p>In practice, Python already used UTF-8 when the filesystem encoding was set to
<tt class="docutils literal">None</tt>. I pushed the <a class="reference external" href="https://github.com/python/cpython/commit/b744ba1d14c5487576c95d0311e357b707600b47">commit b744ba1d</a>
into the Python 3.2 development branch to make the default encoding (UTF-8)
more obvious. But before Python 3.2 was released, I removed the fallback with
my <a class="reference external" href="https://github.com/python/cpython/commit/e474309bb7f0ba6e6ae824c215c45f00db691889">commit e474309b</a>
(Oct 2010):</p>
<blockquote>
<p><tt class="docutils literal">initfsencoding()</tt>: <tt class="docutils literal">get_codeset()</tt> failure is now a fatal error</p>
<p>Don't fallback to UTF-8 anymore to avoid mojibake. I never got any error
from this function.</p>
</blockquote>
</div>
<div class="section" id="the-utf8-option-proposed-for-windows">
<h2>The utf8 option proposed for Windows</h2>
<p>August 2016, <a class="reference external" href="https://bugs.python.org/issue27781">bpo-27781</a>: when <strong>Steve
Dower</strong> <a class="reference external" href="https://vstinner.github.io/python36-utf8-windows.html">was working on changing the filesystem encoding to UTF-8</a>, I was not sure that Windows should use UTF-8
by default. I was more in favor of <strong>making the backward incompatible change an
opt-in option</strong>. <a class="reference external" href="https://bugs.python.org/issue27781#msg272950">I wrote</a>:</p>
<blockquote>
<p><strong>If you go in this direction, I would like to follow you for the UNIX/BSD
side to make the switch portable. I was thinking about "-X utf8" which
avoids to change the command line parser.</strong></p>
<p>If we agree on a plan, <strong>I would like to write it down as a PEP since I
expect a lot of complains and questions which I would prefer to only answer
once</strong> (see for example the length of your thread on python-ideas where
each people repeated the same things multiple times ;-))</p>
</blockquote>
<p><a class="reference external" href="https://bugs.python.org/issue27781#msg272962">I added</a>:</p>
<blockquote>
I mean that <tt class="docutils literal">python3 <span class="pre">-X</span> utf8</tt> should force
<tt class="docutils literal">sys.getfilesystemencoding()</tt> to UTF-8 on UNIX/BSD, it would ignore the
current locale setting.</blockquote>
<p>Since Steve chose to <strong>change the default to UTF-8</strong> on Windows, my <tt class="docutils literal"><span class="pre">-X</span> utf8</tt>
option idea was ignored in this issue.</p>
</div>
<div class="section" id="the-utf8-option-proposed-for-the-posix-locale">
<h2>The utf8 option proposed for the POSIX locale</h2>
<p>September 2016: <strong>Jan Niklas Hasse</strong> opened <a class="reference external" href="https://bugs.python.org/issue28180">bpo-28180</a> about Docker images,
<strong>"sys.getfilesystemencoding() should default to utf-8"</strong>.</p>
<p><a class="reference external" href="https://bugs.python.org/issue28180#msg276707">I proposed again my option</a>:</p>
<blockquote>
I proposed to add <tt class="docutils literal"><span class="pre">-X</span> utf8</tt> command line option for UNIX to force utf8
encoding. Would it work for you?</blockquote>
<p><strong>Jan Niklas Hasse</strong> <a class="reference external" href="https://bugs.python.org/issue28180#msg276709">answered</a>:</p>
<blockquote>
Unfortunately no, as this would mean I'll have to change all my python
invocations in my scripts and it wouldn't work for executable files with</blockquote>
<p>December 2016, <a class="reference external" href="https://bugs.python.org/issue28180#msg283408">I added</a>:</p>
<blockquote>
<p>Usually, when a new option is added to Python, we add a command line option
(-X utf8) but also an environment variable: <strong>I propose PYTHONUTF8=1</strong>.</p>
<p>Use your favorite method to define the env var "system wide" in your docker
containers.</p>
<p>Note: Technically, I'm not sure that it's possible to support -E option
with PYTHONUTF8, since -E comes from the command line, and we first need to
decode command line arguments with an encoding to parse these options....
Chicken-and-egg issue ;-)</p>
</blockquote>
<p><strong>Nick Coghlan</strong> <a class="reference external" href="https://vstinner.github.io/posix-locale.html">wrote his PEP 538 "Coercing the C locale to a UTF-8 based
locale"</a> which has been approved in May 2017
and finally implemented in June 2017.</p>
<p>Again, my utf8 idea was ignored in this issue.</p>
</div>
<div class="section" id="first-version-of-my-pep-540-add-a-new-utf-8-mode">
<h2>First version of my PEP 540: Add a new UTF-8 Mode</h2>
<p>January 2017, as a follow-up of <a class="reference external" href="https://bugs.python.org/issue27781">bpo-27781</a> and <a class="reference external" href="https://bugs.python.org/issue28180">bpo-28180</a>, I wrote the <a class="reference external" href="https://www.python.org/dev/peps/pep-0540/">PEP 540: Add a new UTF-8
Mode</a> and <a class="reference external" href="https://mail.python.org/pipermail/python-ideas/2017-January/044089.html">I posted it to
python-ideas for comments</a>.</p>
<p>Abstract:</p>
<blockquote>
Add a new UTF-8 mode, opt-in option to use UTF-8 for operating system
data instead of the locale encoding. Add <tt class="docutils literal"><span class="pre">-X</span> utf8</tt> command line option
and <tt class="docutils literal">PYTHONUTF8</tt> environment variable.</blockquote>
<p>Ten hours and a few messages later, I <a class="reference external" href="https://mail.python.org/pipermail/python-ideas/2017-January/044099.html">wrote a second version</a>:</p>
<blockquote>
I modified my PEP: <strong>the POSIX locale now enables the UTF-8 mode</strong>.</blockquote>
<p><strong>INADA Naoki</strong> <a class="reference external" href="https://mail.python.org/pipermail/python-ideas/2017-January/044112.html">wrote</a>:</p>
<blockquote>
<p>I want UTF-8 mode is <strong>enabled by default (opt-out option) even if locale
is not POSIX</strong>, like <cite>PYTHONLEGACYWINDOWSFSENCODING</cite>.</p>
<p>Users depends on locale know what locale is and how to configure it. They
can understand difference between locale mode and UTF-8 mode and they can
opt-out UTF-8 mode.</p>
<p><strong>But many people lives in "UTF-8 everywhere" world</strong>, and don't know about
locale.</p>
</blockquote>
<p>Always ignoring the locale to <strong>always use UTF-8 would be a backward
incompatible change</strong>. I wasn't brave enough to propose it: I only
wanted to propose an opt-in option, except for the specific case of the POSIX
locale.</p>
<p>Not only did people have different opinions, but most people had strong opinions
on how to handle Unicode and were not ready for compromises.</p>
</div>
<div class="section" id="third-version-of-my-pep-540">
<h2>Third version of my PEP 540</h2>
<p>One week and 59 emails later, I <a class="reference external" href="https://bugs.python.org/issue29240">implemented my PEP 540</a> and <a class="reference external" href="https://mail.python.org/pipermail/python-ideas/2017-January/044197.html">I wrote a third version of my PEP</a>:</p>
<blockquote>
<p>I made multiple changes since the first version of my PEP:</p>
<ul class="simple">
<li>The <strong>UTF-8 Strict mode now only uses strict for inputs and outputs</strong>:
it keeps surrogateescape for operating system data. Read the "Use the
strict error handler for operating system data" alternative for the
rationale.</li>
<li>The POSIX locale now enables the UTF-8 mode. See the "Don't modify
the encoding of the POSIX locale" alternative for the rationale.</li>
<li>Specify the priority between -X utf8, PYTHONUTF8, PYTHONIOENCODING, etc.</li>
</ul>
<p>The PEP version 3 has a longer rationale with more example. (...)</p>
</blockquote>
<p>The new thread also got 19 emails, total: <strong>78 emails in one month</strong>. The same
month, Nick Coghlan's PEP 538 was also under discussion.</p>
</div>
<div class="section" id="silence-during-one-year">
<h2>Silence during one year</h2>
<p>Because of the tone of the python-ideas threads and because I didn't know how
to deal with Nick Coghlan's PEP 538, <strong>I decided to do nothing for one
year</strong> (January to December 2017).</p>
<p>April 2017, Nick <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-April/147795.html">proposed</a>
<strong>INADA Naoki</strong> as the BDFL Delegate for his PEP 538 and my PEP 540. Guido
<a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-April/147796.html">accepted to delegate</a>.</p>
<p>May 2017, Naoki approved Nick's PEP 538, and Nick implemented it.</p>
</div>
<div class="section" id="pep-540-version-3-posted-to-python-dev">
<h2>PEP 540 version 3 posted to python-dev</h2>
<p>At the end of 2017, when I looked at my contributions in Python 3.7 in the
<a class="reference external" href="https://docs.python.org/dev/whatsnew/3.7.html">What’s New In Python 3.7</a>
document, I didn't see any significant contribution. I wanted to propose
something. Moreover, the deadline for the Python 3.7 feature freeze (first beta
version) was getting close, end of January 2018: see the <a class="reference external" href="https://www.python.org/dev/peps/pep-0537/">PEP 537: Python 3.7
Release Schedule</a>.</p>
<p>December 2017, I decided to move to the next step: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-December/151054.html">I sent my PEP to the
python-dev mailing list</a>.</p>
<p>Guido van Rossum <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-December/151069.html">complained about the length of the PEP</a>:</p>
<blockquote>
<p>I've been discussing this PEP offline with Victor, but he suggested we
should discuss it in public instead.</p>
<p><strong>I am very worried about this long and rambling PEP, and I propose that it
not be accepted without a major rewrite to focus on clarity of the
specification. The "Unicode just works" summary is more a wish than a
proper summary of the PEP.</strong></p>
<p>(...)</p>
<p>So I guess PEP acceptance week is over. :-(</p>
</blockquote>
</div>
<div class="section" id="pep-rewritten-from-scratch">
<h2>PEP rewritten from scratch</h2>
<p>Even though <strong>I was not fully convinced myself that my PEP was a good idea</strong>, I
wanted to get an official vote, to know if my idea should be implemented or
abandoned. I decided to rewrite my PEP from scratch:</p>
<ul class="simple">
<li><a class="reference external" href="https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt">PEP version 3 (before rewrite)</a>:
1,017 lines</li>
<li><a class="reference external" href="https://github.com/python/peps/blob/0bb19ff93af9855db327e9a02f3e86b6f932a25a/pep-0540.txt">PEP version 4 (after rewrite)</a>:
263 lines (26% of the previous version)</li>
</ul>
<p>I reduced the rationale to the strict minimum, to explain <strong>key points</strong> of the
PEP:</p>
<ul class="simple">
<li>Locale encoding and UTF-8</li>
<li>Passthrough undecodable bytes: surrogateescape</li>
<li>Strict UTF-8 for correctness</li>
<li>No change by default for best backward compatibility</li>
</ul>
</div>
<div class="section" id="reading-jpeg-pictures-with-surrogateescape">
<h2>Reading JPEG pictures with surrogateescape</h2>
<p>December 2017, I sent the <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-December/151074.html">shorter PEP version 4 to python-dev</a>.</p>
<p>INADA Naoki, the BDFL-delegate, <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-December/151081.html">spotted a design issue</a>:</p>
<blockquote>
<p>And I have one worrying point. With UTF-8 mode, <strong>open()'s default</strong>
encoding/error handler <strong>is UTF-8/surrogateescape</strong>.</p>
<p>(...)</p>
<p>And <strong>opening binary file without "b" option is very common mistake</strong> of
new developers. If default error handler is surrogateescape, <strong>they lose a
chance to notice their bug</strong>.</p>
</blockquote>
<p>He <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-December/151101.html">gave a concrete example</a>:</p>
<blockquote>
<p>With PEP 538 (C.UTF-8 locale), <tt class="docutils literal">open()</tt> uses UTF-8/strict, not
UTF-8/surrogateescape.</p>
<p>For example, this code raises <tt class="docutils literal">UnicodeDecodeError</tt> with PEP 538 if the
file is JPEG file.</p>
<pre class="literal-block">
with open(fn) as f:
    f.read()
</pre>
</blockquote>
<p><a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-December/151132.html">I replied</a>:</p>
<blockquote>
<p>While I'm not strongly convinced that <tt class="docutils literal">open()</tt> error handler must be
changed for <tt class="docutils literal">surrogateescape</tt>, first <strong>I would like to make sure that
it's really a very bad idea</strong> before changing it :-)</p>
<p>(...)</p>
<p>Using a JPEG image, the example is obviously wrong.</p>
<p>But using surrogateescape on open() has been chosen to <strong>read text files
which are mostly correctly encoded to UTF-8, except a few bytes</strong>.</p>
<p>I'm not sure how to explain the issue. The Mercurial wiki page has a good
example of this issue that they call the <a class="reference external" href="https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem.22">"Makefile problem"</a>.</p>
</blockquote>
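<p>To make the trade-off concrete, here is a small illustration of my own (not
from the thread) of a "mostly UTF-8" text file, such as a Makefile, containing
one undecodable byte:</p>

```python
# A "mostly UTF-8" byte string with one invalid byte:
# 0xE9 is Latin-1 'é', which is not valid UTF-8 in this position.
data = b"all: caf\xe9\n"

# Strict decoding fails on the invalid byte.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the byte through as the surrogate U+DCE9
# and round-trips losslessly back to the original bytes.
text = data.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")

print(strict_ok, round_trip == data)  # → False True
```

This is exactly why surrogateescape is attractive for operating system data,
and why it is dangerous as the default for <tt class="docutils literal">open()</tt>: the
JPEG mistake above would silently "succeed" too.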
<p><strong>Guido van Rossum</strong> <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-December/151134.html">finally convinced me</a>:</p>
<blockquote>
You will quickly get decoding errors, and that is <strong>INADA</strong>'s point.
(Unless you use <tt class="docutils literal"><span class="pre">encoding='Latin-1'</span></tt>.) His worry is that the
surrogateescape error handler makes it so that you won't get decoding
errors, and then <strong>the failure mode is much harder to debug</strong>.</blockquote>
<p>I <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-December/151136.html">wrote a 5th version of my PEP</a>:</p>
<blockquote>
<p>I made the following two changes to the PEP 540:</p>
<ul class="simple">
<li>open() error handler remains <tt class="docutils literal">"strict"</tt></li>
<li>Remove the "Strict UTF8 mode" which doesn't make much sense anymore</li>
</ul>
</blockquote>
</div>
<div class="section" id="last-question-on-locale-getpreferredencoding">
<h2>Last question on locale.getpreferredencoding()</h2>
<p>December 2017, <strong>INADA Naoki</strong> <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-December/151144.html">asked</a>:</p>
<blockquote>
Or <tt class="docutils literal">locale.getpreferredencoding()</tt> returns <tt class="docutils literal"><span class="pre">'UTF-8'</span></tt> in UTF-8 mode too?</blockquote>
<p>Oh, that's a good question! I <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-December/151148.html">looked at the code</a> and
agreed to return UTF-8:</p>
<blockquote>
<p>I checked the stdlib, and I found many places where
<tt class="docutils literal">locale.getpreferredencoding()</tt> is used to get the user preferred
encoding:</p>
<ul class="simple">
<li>builtin <tt class="docutils literal">open()</tt>: default encoding</li>
<li><tt class="docutils literal">cgi.FieldStorage</tt>: encode the query string</li>
<li><tt class="docutils literal">encoding._alias_mbcs()</tt>: check if the requested encoding is the ANSI
code page</li>
<li><tt class="docutils literal">gettext.GNUTranslations</tt>: <tt class="docutils literal">lgettext()</tt> and <tt class="docutils literal">lngettext()</tt> methods</li>
<li><tt class="docutils literal">xml.etree.ElementTree</tt>: <tt class="docutils literal"><span class="pre">ElementTree.write(encoding='unicode')</span></tt></li>
</ul>
<p>In the UTF-8 mode, I would expect that cgi, gettext and xml.etree all use
the UTF-8 encoding by default. So <strong>locale.getpreferredencoding() should
return UTF-8 if the UTF-8 mode is enabled</strong>.</p>
</blockquote>
<p>I <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-December/151151.html">sent a 6th version of my PEP</a>:</p>
<blockquote>
locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8 Mode.</blockquote>
<p>Moreover, I also wrote a new, much better "Relationship with the locale
coercion (PEP 538)" section, replacing the "Annex: Differences between
PEP 538 and PEP 540" section. The new section was requested by many people who
were confused by the relationship between PEP 538 and PEP 540.</p>
<p>Finally, one year after the first PEP version, INADA Naoki <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-December/151193.html">approved my PEP</a>!</p>
</div>
<div class="section" id="first-incomplete-implementation">
<h2>First incomplete implementation</h2>
<p>I started to work on the implementation of my PEP 540 in March 2017. Once the
PEP was approved, I asked INADA Naoki for a review. <a class="reference external" href="https://github.com/python/cpython/pull/855#issuecomment-351089573">He asked me to fix the
command line parsing</a> to handle
the <tt class="docutils literal"><span class="pre">-X</span> utf8</tt> option properly:</p>
<blockquote>
And when <tt class="docutils literal"><span class="pre">-X</span> utf8</tt> option is found, we can decode from <tt class="docutils literal">char **argv</tt>
again. Since <tt class="docutils literal">mbstowcs()</tt> doesn't guarantee round tripping, it is better
than re-encode <tt class="docutils literal">wchar_t **argv</tt>.</blockquote>
<p>Properly implementing the <tt class="docutils literal"><span class="pre">-X</span> utf8</tt> option was tricky. The command line
was parsed as <tt class="docutils literal">wchar_t*</tt> C strings (Unicode), which requires decoding the
<tt class="docutils literal">char** argv</tt> C array of byte strings (bytes). Python starts by decoding byte
strings from the locale encoding. If the utf8 option is detected, the <tt class="docutils literal">argv</tt> byte
strings must be decoded again, but this time from UTF-8. The problem was that the
code was not designed for that, and a lot of code in
<tt class="docutils literal">Py_Main()</tt> had to be refactored.</p>
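<p>The two-pass decoding can be sketched in a few lines of Python (my own
simplification, not CPython's actual C code, and with deliberately naive option
detection):</p>

```python
import locale

def parse_argv(argv_bytes):
    """Decode argv twice if -X utf8 is found (simplified sketch)."""
    # First pass: decode with the locale encoding, as Python does at startup.
    encoding = locale.getpreferredencoding(False)
    argv = [arg.decode(encoding, "surrogateescape") for arg in argv_bytes]
    if "-X" in argv and "utf8" in argv:
        # Chicken-and-egg: the option only becomes visible after decoding,
        # so re-decode the *original* bytes, this time from UTF-8.
        argv = [arg.decode("utf-8", "surrogateescape") for arg in argv_bytes]
    return argv

print(parse_argv([b"-X", b"utf8", b"caf\xc3\xa9"]))  # → ['-X', 'utf8', 'café']
```

The option characters themselves are plain ASCII, so they decode identically in
both passes; only the remaining arguments change meaning.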
<p><a class="reference external" href="https://github.com/python/cpython/pull/855#issuecomment-351252873">I replied</a>:</p>
<blockquote>
<p><tt class="docutils literal">main()</tt> and <tt class="docutils literal">Py_Main()</tt> are very complex. With the <a class="reference external" href="https://www.python.org/dev/peps/pep-0432/">PEP 432</a>, <strong>Nick Coghlan</strong>, <strong>Eric
Snow</strong> and me are working on making this code better. See for example
<a class="reference external" href="https://bugs.python.org/issue32030">bpo-32030</a>.</p>
<p>(...)</p>
<p>For all these reasons, <strong>I propose to merge this uncomplete PR and write a
different PR for the most complex part</strong>, re-encode wchar_t* command line
arguments, implement Py_UnixMain() or another even better option?</p>
</blockquote>
<p>I wanted to get my code merged as soon as possible to make sure that it would
get into the first Python 3.7 beta, to get a longer testing period before
Python 3.7 final.</p>
<p>December 2017, <a class="reference external" href="https://bugs.python.org/issue29240">bpo-29240</a>, I pushed my
<a class="reference external" href="https://github.com/python/cpython/commit/91106cd9ff2f321c0f60fbaa09fd46c80aa5c266">commit 91106cd9</a>:</p>
<blockquote>
<p>PEP 540: Add a new UTF-8 Mode</p>
<ul class="simple">
<li>Add <tt class="docutils literal"><span class="pre">-X</span> utf8</tt> command line option, <tt class="docutils literal">PYTHONUTF8</tt> environment variable
and a new <tt class="docutils literal">sys.flags.utf8_mode</tt> flag.</li>
<li><tt class="docutils literal">locale.getpreferredencoding()</tt> now returns 'UTF-8' in the UTF-8
mode. As a side effect, open() now uses the UTF-8 encoding by
default in this mode.</li>
</ul>
</blockquote>
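<p>The commit's two visible effects can be checked from any Python 3.7+
interpreter by spawning a child process with <tt class="docutils literal"><span class="pre">-X</span> utf8</tt>
(a quick check of my own, not part of the commit):</p>

```python
import subprocess
import sys

# Ask a child interpreter started with -X utf8 for the new
# sys.flags.utf8_mode flag and for locale.getpreferredencoding().
code = ("import sys, locale; "
        "print(sys.flags.utf8_mode); "
        "print(locale.getpreferredencoding(False))")
out = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", code],
    capture_output=True, text=True, check=True,
).stdout.split()
print(out)  # e.g. ['1', 'UTF-8'] (the case of the name varies across versions)
```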
</div>
<div class="section" id="split-py-main-into-subfunctions">
<h2>Split Py_Main() into subfunctions</h2>
<p>November 2017, I created <a class="reference external" href="https://bugs.python.org/issue32030">bpo-32030</a> to
split the big <tt class="docutils literal">Py_Main()</tt> function into smaller subfunctions. My motivation
was to be able to properly implement my PEP 540.</p>
<p>It took me <strong>3 months of work and 45 commits</strong> to completely clean up
<tt class="docutils literal">Py_Main()</tt> and put almost all Python configuration options into the private
C <tt class="docutils literal">_PyCoreConfig</tt> structure.</p>
</div>
<div class="section" id="parse-again-the-command-line-when-x-utf8-is-used">
<h2>Parse the command line again when -X utf8 is used</h2>
<p>December 2017, <a class="reference external" href="https://bugs.python.org/issue32030">bpo-32030</a>, thanks to
the <tt class="docutils literal">Py_Main()</tt> refactoring, I was able to finish the implementation of my
PEP.</p>
<p>I pushed my <a class="reference external" href="https://github.com/python/cpython/commit/9454060e84a669dde63824d9e2fcaf295e34f687">commit 9454060e</a>:</p>
<blockquote>
<p><tt class="docutils literal">Py_Main()</tt> re-reads config if encoding changes</p>
<p>If the encoding change (C locale coerced or UTF-8 Mode changed),
<tt class="docutils literal">Py_Main()</tt> now reads again the configuration with the new encoding.</p>
</blockquote>
<p>If the encoding changed after reading the Python configuration, Python now clears
the configuration and <strong>reads the configuration again with the new encoding.</strong>
The key ability enabled by the refactoring is being able to properly clean up
the whole configuration.</p>
</div>
<div class="section" id="utf-8-mode-and-the-locale-encoding">
<h2>UTF-8 Mode and the locale encoding</h2>
<p>January 2018, while working on <a class="reference external" href="https://bugs.python.org/issue31900">bpo-31900</a> "localeconv() should decode numeric
fields from LC_NUMERIC encoding, not from LC_CTYPE encoding", I tested various
combinations of locales and encodings. <strong>I found bugs with the UTF-8 mode.</strong></p>
<p>When the UTF-8 mode is enabled explicitly by <tt class="docutils literal"><span class="pre">-X</span> utf8</tt>, the intent is to use
UTF-8 "everywhere". Right. But <strong>there are some places where the current
locale encoding really is the correct encoding</strong>, like the <tt class="docutils literal">time.strftime()</tt>
function.</p>
<p><a class="reference external" href="https://bugs.python.org/issue29240">bpo-29240</a>: I pushed a first fix,
<a class="reference external" href="https://github.com/python/cpython/commit/cb3ae5588bd7733e76dc09277bb7626652d9bb64">commit cb3ae558</a>:</p>
<blockquote>
<p>Ignore UTF-8 Mode in the <tt class="docutils literal">time</tt> module</p>
<p><tt class="docutils literal">time.strftime()</tt> must use the current <tt class="docutils literal">LC_CTYPE</tt> encoding, not UTF-8
if the UTF-8 mode is enabled.</p>
</blockquote>
<p>I tested more cases and found... <strong>more bugs</strong>. Other functions must also use the
current locale encoding, rather than UTF-8, when the UTF-8 Mode is enabled.</p>
<p>I pushed a second fix, <a class="reference external" href="https://github.com/python/cpython/commit/7ed7aead9503102d2ed316175f198104e0cd674c">commit 7ed7aead</a>:</p>
<blockquote>
<p>Fix locale encodings in UTF-8 Mode</p>
<p>Modify <tt class="docutils literal">locale.localeconv()</tt>, <tt class="docutils literal">time.tzname</tt>, <tt class="docutils literal">os.strerror()</tt> and
other functions to ignore the UTF-8 Mode: always use the current locale
encoding.</p>
</blockquote>
<p>The second fix documented the encoding used by the public C functions
<a class="reference external" href="https://docs.python.org/dev/c-api/sys.html#c.Py_DecodeLocale">Py_DecodeLocale()</a> and
<a class="reference external" href="https://docs.python.org/dev/c-api/sys.html#c.Py_EncodeLocale">Py_EncodeLocale()</a>:</p>
<blockquote>
<p>Encoding, highest priority to lowest priority:</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">UTF-8</span></tt> on macOS and Android;</li>
<li><tt class="docutils literal"><span class="pre">UTF-8</span></tt> if the Python UTF-8 mode is enabled;</li>
<li><tt class="docutils literal">ASCII</tt> if the <tt class="docutils literal">LC_CTYPE</tt> locale is <tt class="docutils literal">"C"</tt>,
<tt class="docutils literal">nl_langinfo(CODESET)</tt> returns the <tt class="docutils literal">ASCII</tt> encoding (or an alias),
and <tt class="docutils literal">mbstowcs()</tt> and <tt class="docutils literal">wcstombs()</tt> functions uses the
<tt class="docutils literal"><span class="pre">ISO-8859-1</span></tt> encoding.</li>
<li>the current locale encoding.</li>
</ul>
</blockquote>
<p>The fix was complex to write because I had to extend Py_DecodeLocale() and
Py_EncodeLocale() to support the <tt class="docutils literal">strict</tt> error handler internally. I also
extended the API to report an error message (called "reason") on failure.</p>
<p>For example, <tt class="docutils literal">Py_DecodeLocale()</tt> has the prototype:</p>
<pre class="literal-block">
wchar_t*
Py_DecodeLocale(const char* arg, size_t *wlen)
</pre>
<p>whereas the new extended and more generic <tt class="docutils literal">_Py_DecodeLocaleEx()</tt> has a much
more complex prototype:</p>
<pre class="literal-block">
int
_Py_DecodeLocaleEx(const char* arg, wchar_t **wstr, size_t *wlen,
                   const char **reason,
                   int current_locale, int surrogateescape)
</pre>
<p>To decode, there are two main use cases:</p>
<ul class="simple">
<li>(FILENAME) Use UTF-8 if the UTF-8 Mode is enabled, or the locale encoding
otherwise. See the <tt class="docutils literal">Py_DecodeLocale()</tt> documentation for the exact
encoding used; the truth is more complex.</li>
<li>(LOCALE) Always use the current locale encoding</li>
</ul>
<p>(FILENAME) examples:</p>
<ul class="simple">
<li><tt class="docutils literal">Py_DecodeLocale()</tt>, <tt class="docutils literal">PyUnicode_DecodeFSDefaultAndSize()</tt>: use the
<tt class="docutils literal">surrogateescape</tt> error handler</li>
<li><tt class="docutils literal">os.fsdecode()</tt></li>
<li><tt class="docutils literal">os.listdir()</tt></li>
<li><tt class="docutils literal">os.environ</tt></li>
<li><tt class="docutils literal">sys.argv</tt></li>
<li>etc.</li>
</ul>
<p>(LOCALE) examples:</p>
<ul class="simple">
<li><tt class="docutils literal">PyUnicode_DecodeLocale()</tt>: the error handler is passed as an argument and
must be <tt class="docutils literal">strict</tt> or <tt class="docutils literal">surrogateescape</tt></li>
<li><tt class="docutils literal">time.strftime()</tt></li>
<li><tt class="docutils literal">locale.localeconv()</tt></li>
<li><tt class="docutils literal">time.tzname</tt></li>
<li><tt class="docutils literal">os.strerror()</tt></li>
<li><tt class="docutils literal">readline</tt> module: internal <tt class="docutils literal">decode()</tt> function</li>
<li>etc.</li>
</ul>
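<p>The practical difference between the two families can be seen from Python
itself (an illustration of mine, on a POSIX system where the filesystem error
handler is <tt class="docutils literal">surrogateescape</tt>):</p>

```python
import os

raw = b"caf\xff.txt"        # 0xFF can never appear in valid UTF-8

# (FILENAME) path: surrogateescape never fails and round-trips the bytes.
name = os.fsdecode(raw)
assert os.fsencode(name) == raw

# (LOCALE)-style strict decoding rejects the same bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # → False
```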
</div>
<div class="section" id="summary-of-pep-540-history">
<h2>Summary of PEP 540 history</h2>
<ul class="simple">
<li>Version 1: first version sent to python-ideas</li>
<li>Version 2: the POSIX locale now enables the UTF-8 mode</li>
<li>Version 3: the UTF-8 Strict mode now only uses the <tt class="docutils literal">strict</tt> error handler
for inputs and outputs</li>
<li>Version 4: PEP rewritten from scratch to be shorter</li>
<li>Version 5: open() error handler remains <tt class="docutils literal">strict</tt>, and the "Strict UTF8
mode" has been removed</li>
<li>Version 6: locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8
Mode.</li>
</ul>
<p>Abstract of the final approved PEP:</p>
<blockquote>
<p>Add a new "UTF-8 Mode" to enhance Python's use of UTF-8. When UTF-8 Mode
is active, Python will:</p>
<ul class="simple">
<li>use the <tt class="docutils literal"><span class="pre">utf-8</span></tt> encoding, irregardless of the locale currently set by
the current platform, and</li>
<li>change the <tt class="docutils literal">stdin</tt> and <tt class="docutils literal">stdout</tt> error handlers to
<tt class="docutils literal">surrogateescape</tt>.</li>
</ul>
<p>This mode is off by default, but is automatically activated when using
the "POSIX" locale.</p>
<p>Add the <tt class="docutils literal"><span class="pre">-X</span> utf8</tt> command line option and <tt class="docutils literal">PYTHONUTF8</tt> environment
variable to control UTF-8 Mode.</p>
</blockquote>
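<p>As shipped in Python 3.7, the two knobs from the abstract can be checked from
a shell (a quick session of my own, assuming a <tt class="docutils literal">python3</tt> that is 3.7 or newer):</p>

```shell
# Two equivalent ways to opt in to the UTF-8 Mode (Python 3.7+);
# both commands print 1.
python3 -X utf8 -c 'import sys; print(sys.flags.utf8_mode)'
PYTHONUTF8=1 python3 -c 'import sys; print(sys.flags.utf8_mode)'
```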
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>It's now time for a well-deserved nap... until the next major Unicode issue in Python.</p>
<a class="reference external image-reference" href="https://www.flickr.com/photos/manager_2000/2911858714/"><img alt="Tiger nap" src="https://vstinner.github.io/images/tiger_nap.jpg" /></a>
<p>(I love tigers: my favorite animals!)</p>
</div>
Python 3.7 and the POSIX locale2018-03-23T13:00:00+01:002018-03-23T13:00:00+01:00Victor Stinnertag:vstinner.github.io,2018-03-23:/posix-locale.html<a class="reference external image-reference" href="https://www.flickr.com/photos/rj65/15010849568/"><img alt="Bee" src="https://vstinner.github.io/images/bee.jpg" /></a>
<p>During the childhood of Python 3, encoding issues were common, even on well
configured systems. Python used UTF-8 rather than the locale encoding, and so
commonly produced <a class="reference external" href="https://en.wikipedia.org/wiki/Mojibake">mojibake</a>. For
these reasons, when users complained about the Python behaviour with the POSIX
locale, bug reports were closed with a message like: "your system is not
properly configured, please fix your locale".</p>
<p>I only made a first shy change for the POSIX locale in Python 3.5, at the end
of 2013: use <tt class="docutils literal">surrogateescape</tt> for stdin and stdout. We had to wait
for Nick Coghlan in 2017 for significant changes in Python 3.7.</p>
<p>This article explains the slow transition, <strong>six years</strong> since the first bug
report (2011) to the significant change (2017), from "you must fix your locale"
to "maybe Python can do something for you".</p>
<p><strong>This article is the fifth in a series of articles telling the history and
rationale of the Python 3 Unicode model for the operating system:</strong></p>
<ul class="simple">
<li><ol class="first arabic">
<li><a class="reference external" href="https://vstinner.github.io/python30-listdir-undecodable-filenames.html">Python 3.0 listdir() Bug on Undecodable Filenames</a></li>
</ol>
</li>
<li><ol class="first arabic" start="2">
<li><a class="reference external" href="https://vstinner.github.io/pep-383.html">Python 3.1 surrogateescape error handler (PEP 383)</a></li>
</ol>
</li>
<li><ol class="first arabic" start="3">
<li><a class="reference external" href="https://vstinner.github.io/painful-history-python-filesystem-encoding.html">Python 3.2 Painful History of the Filesystem Encoding</a></li>
</ol>
</li>
<li><ol class="first arabic" start="4">
<li><a class="reference external" href="https://vstinner.github.io/python36-utf8-windows.html">Python 3.6 now uses UTF-8 on Windows</a></li>
</ol>
</li>
<li><ol class="first arabic" start="5">
<li><a class="reference external" href="https://vstinner.github.io/posix-locale.html">Python 3.7 and the POSIX locale</a></li>
</ol>
</li>
<li><ol class="first arabic" start="6">
<li><a class="reference external" href="https://vstinner.github.io/python37-new-utf8-mode.html">Python 3.7 UTF-8 Mode</a></li>
</ol>
</li>
</ul>
<div class="section" id="first-rejected-attempt-2011">
<h2>First rejected attempt, 2011</h2>
<p>December 2011, <strong>Martin Packman</strong>, a Bazaar developer, reported <a class="reference external" href="https://bugs.python.org/issue13643">bpo-13643</a> to propose to use UTF-8 in Python if the
locale encoding is ASCII:</p>
<blockquote>
<p>Currently when running Python on a non-OSX posix environment under either
the <strong>C locale</strong>, or with an invalid or missing locale, it's <strong>not possible
to operate using unicode filenames outside the ascii range</strong>. Using bytes
works, as does reading expecting unicode, using the surrogates hack.</p>
<p>This makes robustly working with non-ascii filenames on different platforms
needlessly annoying, given <strong>no modern nix should have problems just using
UTF-8 in these cases</strong>.</p>
<p>See the <a class="reference external" href="https://bugs.launchpad.net/bzr/+bug/794353">downstream bzr bug for more</a>.</p>
<p>One option is to <strong>just use UTF-8</strong> for encoding and decoding filenames
<strong>when otherwise ascii would be used</strong>. As a strict superset, this
shouldn't break too many existing assumptions, and <strong>it's unlikely that
non-UTF-8 filenames will accidentally be mangled due to a locale setting
blip.</strong> See the attached patch for this behaviour change. It does not
include a test currently, but it's possible to write one using subprocess
and overriden <tt class="docutils literal">LANG</tt> and <tt class="docutils literal">LC_ALL</tt> vars.</p>
</blockquote>
<p><a class="reference external" href="https://bugs.python.org/issue13643#msg149928">He added</a>:</p>
<blockquote>
<p>This is more about <strong>un-encodable filenames</strong>.</p>
<p>At the moment work with non-ascii filenames in Python robustly requires two
branches, one using unicode and one that encodes to bytestrings and deals
with the case where the name can't be represented in the declared
filesystem encoding.</p>
<p><strong>That may be something that just had to be lived with</strong>, but it's a little
annoying when even without a UTF-8 locale for a particular process, that's
what most systems will want on disk.</p>
</blockquote>
<p>At this time, I was still traumatised by the <tt class="docutils literal">PYTHONFSENCODING</tt> mess: using a
filesystem encoding different than the locale encoding caused many issues (see
<a class="reference external" href="https://vstinner.github.io/painful-history-python-filesystem-encoding.html">Python 3.2 Painful History of the Filesystem Encoding</a>). <a class="reference external" href="https://bugs.python.org/issue13643#msg149926">I wrote</a>:</p>
<blockquote>
It was already discussed: using a different encoding for filenames and for
other things is really not a good idea. (...)</blockquote>
<p>and <a class="reference external" href="https://bugs.python.org/issue13643#msg149927">I added</a>:</p>
<blockquote>
The right fix is to <strong>fix your locale, not Python</strong>.</blockquote>
<p>Antoine Pitrou <a class="reference external" href="https://bugs.python.org/issue13643#msg149949">suggested fixing the operating system, not Python</a>:</p>
<blockquote>
<p>So <strong>why don't these supposedly "modern" systems at least set the
appropriate environment variables</strong> for Python to infer the proper
character encoding? (since these "modern" systems don't have a
well-defined encoding...)</p>
<p>Answer: because they are not modern at all, <strong>they are antiquated,
inadapted and obsolete pieces of software designed and written by clueless
Anglo-American people</strong>. Please report bugs against these systems. <strong>The
culprit is not Python, it's the Unix crap</strong> and the utterly clueless
attitude of its maintainers ("filesystems are just bytes", yeah,
whatever...).</p>
</blockquote>
<p><strong>Martin Pool</strong> <a class="reference external" href="https://bugs.python.org/issue13643#msg149951">wrote</a>:</p>
<blockquote>
The standard encoding is UTF-8. Python shouldn't need to have a variable
set to tell it this.</blockquote>
<p><a class="reference external" href="https://bugs.python.org/issue13643#msg149952">Antoine replied</a>:</p>
<blockquote>
How so? I don't know of any Linux or Unix spec which says so.</blockquote>
<p>Four days and 34 messages later, <strong>Terry J. Reedy</strong>
<a class="reference external" href="https://bugs.python.org/issue13643#msg150204">closed the issue</a>:</p>
<blockquote>
<p>Martin, after reading most all of the <strong>unusually large sequence of
messages</strong>, I am closing this because <strong>three of the core developers</strong> with
the most experience in this area are <strong>dead-set against your proposal</strong>.</p>
<p>That does not make it 'wrong', but does mean that it will not be approved
and implemented without new data and more persuasive arguments than those
presented so far. I do not see that continued repetition of what has been
said so far will change anything.</p>
</blockquote>
<p>Getting many messages in a short time is common when discussing Unicode issues
:-)</p>
<p>March 2011, <strong>Armin Ronacher</strong> and <strong>Carl Meyer</strong> reported a similar issue:
<a class="reference external" href="https://bugs.python.org/issue11574">bpo-11574</a> and <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2011-March/109361.html">[Python-Dev] Low-Level Encoding Behavior on Python 3</a>. I
closed the issue as "wont fix" in April 2012.</p>
</div>
<div class="section" id="second-attempt-2013">
<h2>Second attempt, 2013</h2>
<p>November 2013, <strong>Sworddragon</strong> reported <a class="reference external" href="https://bugs.python.org/issue19846">bpo-19846</a>: <tt class="docutils literal">LANG=C python3 <span class="pre">-c</span> <span class="pre">'print("\xe4")'</span></tt>
fails with an <tt class="docutils literal">UnicodeEncodeError</tt>.</p>
<p><strong>Antoine Pitrou</strong> wrote a patch to use UTF-8 when the locale encoding is
ASCII, the same approach as the first attempt <a class="reference external" href="https://bugs.python.org/issue13643">bpo-13643</a>.</p>
<p><strong>The patch was incomplete and so caused many issues.</strong> Python used the C codec
of the locale encoding during Python initialization, and so Python had to use
the locale encoding as its filesystem encoding.</p>
<p>I listed all functions that should be modified to fix issues and get a fully
working solution. Nobody came up with a full implementation, likely because
<strong>too many changes were required</strong>.</p>
<p>One month and 66 messages (almost double the previous attempt) later,
again, <a class="reference external" href="https://bugs.python.org/issue19846#msg205675">I closed the issue</a>:</p>
<blockquote>
<p>I'm closing the issue as invalid, because <strong>Python 3 behaviour is correct</strong>
and must not be changed.</p>
<p>Standard streams (sys.stdin, sys.stdout, sys.stderr) use the locale
encoding. (...) These encodings and error handlers can be overridden by the
<strong>PYTHONIOENCODING</strong> environment variable.</p>
</blockquote>
<p>My <a class="reference external" href="https://bugs.python.org/issue19846#msg205675">full comment</a>
describes the encodings used on each platform.</p>
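<p>The failure reported in <tt class="docutils literal"><span class="pre">bpo-19846</span></tt> is easy to reproduce without changing the locale: encoding a non-ASCII character with the ASCII codec and the default <tt class="docutils literal">strict</tt> error handler raises <tt class="docutils literal">UnicodeEncodeError</tt>, which is exactly what <tt class="docutils literal">sys.stdout</tt> did under the C locale. A minimal sketch:</p>

```python
# Reproduce the bpo-19846 failure: under the C locale, sys.stdout used
# the ASCII codec with the "strict" error handler, so print("\xe4")
# raised UnicodeEncodeError.
try:
    "\xe4".encode("ascii")
    raised = False
except UnicodeEncodeError:
    raised = True
print(raised)  # True

# PYTHONIOENCODING can override the stream encoding and error handler;
# for example ascii:backslashreplace keeps the output pure ASCII:
print("\xe4".encode("ascii", "backslashreplace"))  # b'\\xe4'
```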
</div>
<div class="section" id="use-surrogateescape-for-stdin-and-stdout-in-python-3-5">
<h2>Use surrogateescape for stdin and stdout in Python 3.5</h2>
<p>December 2013: Just after closing the second attempt <a class="reference external" href="https://bugs.python.org/issue19846">bpo-19846</a>, I created <a class="reference external" href="https://bugs.python.org/issue19977">bpo-19977</a> to propose to use the
<tt class="docutils literal">surrogateescape</tt> error handler in <tt class="docutils literal">sys.stdin</tt> and <tt class="docutils literal">sys.stdout</tt> for the
POSIX locale.</p>
<p><strong>R. David Murray</strong> <a class="reference external" href="https://bugs.python.org/issue19977#msg206131">disliked my idea</a>:</p>
<blockquote>
<p><strong>Reintroducing moji-bake intentionally doesn't sound like a particularly
good idea</strong>, wasn't that what python3 was supposed to help prevent?</p>
<p>It does seem like a <strong>utf-8 default is the Way of the Future</strong>. Or even the
present, most places.</p>
</blockquote>
<p>March 2014, since <strong>Serhiy Storchaka</strong> and <strong>Nick Coghlan</strong> supported my idea,
I pushed my <a class="reference external" href="https://github.com/python/cpython/commit/7143029d4360637aadbd7ddf386ea5c64fb83095">commit 7143029d</a>
in Python 3.5:</p>
<blockquote>
Issue #19977: When the <tt class="docutils literal">LC_TYPE</tt> locale is the POSIX locale (<tt class="docutils literal">C</tt>
locale), <tt class="docutils literal">sys.stdin</tt> and <tt class="docutils literal">sys.stdout</tt> are now using the
<tt class="docutils literal">surrogateescape</tt> error handler, instead of the <tt class="docutils literal">strict</tt> error handler.</blockquote>
<p>Previously, <strong>Python 3 was very strict on encodings</strong>: core developers were
convinced that they could force developers to fix their applications. This change
was one of the <strong>first Python 3 changes which could produce "mojibake" on purpose</strong>.</p>
<p><strong>Six years after the Python 3.0 release, we started to understand that while
developers can fix their code, we cannot ask users to fix their configuration
("fix their locale").</strong></p>
</div>
<div class="section" id="read-etc-locale-conf">
<h2>Read /etc/locale.conf?</h2>
<p>April 2014, <strong>Nick Coghlan</strong> created <a class="reference external" href="https://bugs.python.org/issue21368">bpo-21368</a>: "Check for systemd locale on
startup if current locale is set to POSIX".</p>
<blockquote>
If a modern Linux system is using systemd as the process manager, then
there will likely be <strong>a "/etc/locale.conf" file</strong> providing settings like
LANG - due to problematic requirements in the POSIX specification, <strong>this
file</strong> (when available) is <strong>likely to be a better "source of truth"
regarding the system encoding</strong> than the environment where the interpreter
process is started, at least when the latter is claiming ASCII as the
default encoding.</blockquote>
<p><a class="reference external" href="https://bugs.python.org/issue21368#msg217328">I disliked the idea</a>:</p>
<blockquote>
I don't think that Python should read such configuration file. If you
consider that something is wrong here, <strong>please report the issue to the C
library</strong>.</blockquote>
<p>Since no consensus was found, no action was taken.</p>
</div>
<div class="section" id="misconfigured-locales-in-docker-images">
<h2>Misconfigured locales in Docker images</h2>
<p>September 2016: <strong>Jan Niklas Hasse</strong> opened <a class="reference external" href="https://bugs.python.org/issue28180">bpo-28180</a>, <strong>"sys.getfilesystemencoding() should
default to utf-8"</strong>.</p>
<blockquote>
<strong>Working with Docker I often end up with an environment where the locale
isn't correctly set.</strong> In these cases <strong>it would be great if
sys.getfilesystemencoding() could default to 'utf-8'</strong> instead of
<tt class="docutils literal">'ascii'</tt>, as it's the encoding of the future and ascii is a subset of it
anyway.</blockquote>
<p>December 2016, <strong>Jan Niklas Hasse</strong> <a class="reference external" href="https://bugs.python.org/issue28180#msg282972">mentioned</a> the <tt class="docutils literal"><span class="pre">C.UTF-8</span></tt> locale:</p>
<blockquote>
<p><a class="reference external" href="https://sourceware.org/glibc/wiki/Proposals/C.UTF-8#Defaults">glibc C.UTF-8 article</a> mentions
that <strong>C.UTF-8 should be glibc's default</strong>.</p>
<p>This bug report <a class="reference external" href="https://sourceware.org/bugzilla/show_bug.cgi?id=17318">also mentions Python</a>. It <strong>hasn't been
fixed yet</strong>, though :/</p>
</blockquote>
<p><strong>Marc-Andre Lemburg</strong> <a class="reference external" href="https://bugs.python.org/issue28180#msg282977">added</a>:</p>
<blockquote>
<p>If we just restrict this to the file system encoding (and not the whole
LANG setting), how about:</p>
<ul class="simple">
<li>default the file system encoding to 'utf-8' and use the surrogate escape
handler as default error handler</li>
<li>add a <tt class="docutils literal">PYTHONFSENCODING</tt> env var to set the file system encoding to
something else (*)</li>
</ul>
<p>(*) I believe we discussed this at some point already, but don't remember the outcome.</p>
</blockquote>
<p>The removed <tt class="docutils literal">PYTHONFSENCODING</tt> environment variable, using a filesystem
encoding different than the locale encoding, caused many issues: see <a class="reference external" href="https://vstinner.github.io/painful-history-python-filesystem-encoding.html">Python
3.2 Painful History of the Filesystem Encoding</a>.</p>
<p><strong>Nick Coghlan</strong> <cite>proposed to experiment using the C.UTF-8 locale</cite> in Fedora
26:</p>
<blockquote>
<p><strong>For Fedora 26,</strong> I'm going to explore the feasibility of patching our system
3.6 installation such that the python3 command itself (rather than the
shared library) <strong>checks for "LC_CTYPE=C"</strong> as almost the first thing it
does, and forcibly <strong>sets LANG and LC_ALL to C.UTF-8</strong> if it gets an answer
it doesn't like. If we're able to do that successfully in the more
constrained environment of a specific recent Fedora release, then I think
it will bode well for doing something similar by default in CPython 3.7</p>
<p><a class="reference external" href="https://bugzilla.redhat.com/show_bug.cgi?id=1404918">Downstream Fedora issue proposing the above idea for F26</a>.</p>
</blockquote>
<p>Fedora 26 integrated a downstream change in Python 3.6:
see <a class="reference external" href="https://fedoraproject.org/wiki/Releases/26/ChangeSet#Python_3_C.UTF-8_locale">Python 3 C.UTF-8 locale</a>.</p>
</div>
<div class="section" id="pep-538-coercing-the-c-locale-to-a-utf-8-based-locale">
<h2>PEP 538: Coercing the C locale to a UTF-8 based locale</h2>
<a class="reference external image-reference" href="http://www.curiousefficiency.org/"><img alt="Nick Coghlan" src="https://vstinner.github.io/images/nick_coghlan.jpg" /></a>
<p>December 2016, as a follow-up of <a class="reference external" href="https://bugs.python.org/issue28180">bpo-28180</a>, <strong>Nick Coghlan</strong> wrote the <a class="reference external" href="https://www.python.org/dev/peps/pep-0538/">PEP
538: Coercing the legacy C locale to a UTF-8 based locale</a> and <a class="reference external" href="https://mail.python.org/pipermail/python-ideas/2017-January/044130.html">posted it to python-ideas
list</a>
and <a class="reference external" href="https://mail.python.org/pipermail/linux-sig/2017-January/000014.html">to the linux-sig list</a>.</p>
<p>April 2017, Nick <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-April/147795.html">proposed</a>
<strong>INADA Naoki</strong> as the BDFL Delegate for his PEP. Guido <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-April/147796.html">agreed to delegate</a>.</p>
<p>May 2017, after 5 months of discussions and changes, INADA Naoki <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-May/148035.html">approved the
PEP</a>.</p>
<p>June 2017, <a class="reference external" href="https://bugs.python.org/issue28180">bpo-28180</a>: Nick Coghlan
pushed the <a class="reference external" href="https://github.com/python/cpython/commit/6ea4186de32d65b1f1dc1533b6312b798d300466">commit 6ea4186d</a>:</p>
<blockquote>
bpo-28180: Implementation for PEP 538 (#659)</blockquote>
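<p>The coercion logic itself is simple to sketch: if the effective <tt class="docutils literal">LC_CTYPE</tt> locale is the legacy C/POSIX locale (or unset), switch it to a UTF-8 based target locale such as <tt class="docutils literal"><span class="pre">C.UTF-8</span></tt> before the interpreter configures its encodings. Below is a rough, simplified Python model of the check; the real implementation lives in C at startup, honors the <tt class="docutils literal">PYTHONCOERCECLOCALE</tt> environment variable, never overrides an explicit <tt class="docutils literal">LC_ALL</tt>, and tries several target locales (<tt class="docutils literal"><span class="pre">C.UTF-8</span></tt>, <tt class="docutils literal">C.utf8</tt>, <tt class="docutils literal"><span class="pre">UTF-8</span></tt>):</p>

```python
# Simplified model of the PEP 538 C locale coercion check.
# The real check is done in C before Python chooses its encodings.
_LEGACY = {"", "C", "POSIX"}

def coerce_c_locale(env):
    """Return the effective LC_CTYPE value for the given environment
    dict, coercing a legacy C locale to C.UTF-8 (simplified)."""
    candidate = env.get("LC_ALL") or env.get("LC_CTYPE") or env.get("LANG") or ""
    if env.get("PYTHONCOERCECLOCALE") == "0":
        return candidate            # coercion explicitly disabled
    if env.get("LC_ALL"):
        return candidate            # PEP 538 never overrides LC_ALL
    if candidate in _LEGACY:
        return "C.UTF-8"            # coerce the legacy C locale
    return candidate

print(coerce_c_locale({}))                      # C.UTF-8
print(coerce_c_locale({"LANG": "C"}))           # C.UTF-8
print(coerce_c_locale({"LC_ALL": "C"}))         # C (not coerced)
print(coerce_c_locale({"LANG": "fr_FR.UTF-8"})) # fr_FR.UTF-8
```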
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>A first attempt to use a different encoding for the POSIX locale was rejected
in 2011. A second attempt was also rejected in 2013.</p>
<p>I modified Python 3.5 in 2014 to use the <tt class="docutils literal">surrogateescape</tt> error handler in
<tt class="docutils literal">stdin</tt> and <tt class="docutils literal">stdout</tt> for the POSIX locale. Six years after the Python 3.0
release, we started to understand that while developers can fix their code, we
cannot ask users to "fix their locale" (configure properly their locale).</p>
<p>In 2016, the problem occurred again with misconfigured locales in Docker
images. In 2017, Nick Coghlan wrote the PEP 538 "Coercing the legacy C locale
to a UTF-8 based locale" which has been approved by INADA Naoki and implemented
in Python 3.7.</p>
</div>
Python 3.6 now uses UTF-8 on Windows2018-03-22T17:00:00+01:002018-03-22T17:00:00+01:00Victor Stinnertag:vstinner.github.io,2018-03-22:/python36-utf8-windows.html<p>September 2016, a few days before the CPython core dev sprint, <strong>Steve Dower</strong>
proposed two major backward incompatible changes for Python 3.6 on Windows:
<a class="reference external" href="https://www.python.org/dev/peps/pep-0528/">PEP 528: Change Windows console encoding to UTF-8</a> and <a class="reference external" href="https://www.python.org/dev/peps/pep-0529/">PEP 529: Change Windows
filesystem encoding to UTF-8</a>.
At the first read, I was sure that …</p><p>September 2016, a few days before the CPython core dev sprint, <strong>Steve Dower</strong>
proposed two major backward incompatible changes for Python 3.6 on Windows:
<a class="reference external" href="https://www.python.org/dev/peps/pep-0528/">PEP 528: Change Windows console encoding to UTF-8</a> and <a class="reference external" href="https://www.python.org/dev/peps/pep-0529/">PEP 529: Change Windows
filesystem encoding to UTF-8</a>.
At first read, I was sure that PEP 529 would break all applications on
Windows. This article tells the story behind the approval of both PEPs.</p>
<p><strong>This article is the fourth in a series of articles telling the history and
rationale of the Python 3 Unicode model for the operating system:</strong></p>
<ul class="simple">
<li><ol class="first arabic">
<li><a class="reference external" href="https://vstinner.github.io/python30-listdir-undecodable-filenames.html">Python 3.0 listdir() Bug on Undecodable Filenames</a></li>
</ol>
</li>
<li><ol class="first arabic" start="2">
<li><a class="reference external" href="https://vstinner.github.io/pep-383.html">Python 3.1 surrogateescape error handler (PEP 383)</a></li>
</ol>
</li>
<li><ol class="first arabic" start="3">
<li><a class="reference external" href="https://vstinner.github.io/painful-history-python-filesystem-encoding.html">Python 3.2 Painful History of the Filesystem Encoding</a></li>
</ol>
</li>
<li><ol class="first arabic" start="4">
<li><a class="reference external" href="https://vstinner.github.io/python36-utf8-windows.html">Python 3.6 now uses UTF-8 on Windows</a></li>
</ol>
</li>
<li><ol class="first arabic" start="5">
<li><a class="reference external" href="https://vstinner.github.io/posix-locale.html">Python 3.7 and the POSIX locale</a></li>
</ol>
</li>
<li><ol class="first arabic" start="6">
<li><a class="reference external" href="https://vstinner.github.io/python37-new-utf8-mode.html">Python 3.7 UTF-8 Mode</a></li>
</ol>
</li>
</ul>
<div class="section" id="pep-529">
<h2>PEP 529</h2>
<p>September 2016, <strong>Steve Dower</strong>, who works for Microsoft, wrote the <a class="reference external" href="https://www.python.org/dev/peps/pep-0529/">PEP 529:
Change Windows filesystem encoding to UTF-8</a> and <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2016-September/146051.html">posted it to python-dev</a> for
comments.</p>
<a class="reference external image-reference" href="http://stevedower.id.au/blog/"><img alt="Steve Dower" src="https://vstinner.github.io/images/steve_dower.jpg" /></a>
<p>Abstract:</p>
<blockquote>
<p><strong>Historically, Python uses the ANSI APIs</strong> for interacting with the
Windows operating system, often via C Runtime functions. However, these
have been long discouraged in favor of the UTF-16 APIs. Within the
operating system, all text is represented as UTF-16, and the ANSI APIs
perform encoding and decoding using the active code page. See Naming Files,
Paths, and Namespaces for more details.</p>
<p>This PEP proposes <strong>changing the default filesystem encoding on Windows to
utf-8</strong>, and changing all filesystem functions to use the Unicode APIs for
filesystem paths. This will not affect code that uses strings to represent
paths, however those that use bytes for paths will now be able to correctly
round-trip all valid paths in Windows filesystems. <strong>Currently, the
conversions between Unicode (in the OS) and bytes (in Python) were lossy</strong>
and would fail to round-trip characters outside of the user's active code
page.</p>
<p>Notably, this does not impact the encoding of the contents of files. These
will continue to default to <tt class="docutils literal">locale.getpreferredencoding()</tt> (for text
files) or plain bytes (for binary files). This only affects the encoding
used when users pass a bytes object to Python where it is then passed to
the operating system as a path name.</p>
</blockquote>
</div>
<div class="section" id="my-analysis">
<h2>My analysis</h2>
<p>Here is my analysis of the rationale for the PEP 529 change.</p>
<p><strong>On Unix, the native type for filenames is bytes</strong>. A filename is seen by the
Linux kernel as an opaque object. The ext4 filesystem stores filenames as
bytes. If a Python 2 application uses Unicode for filenames, filesystem
operations can fail with a Unicode error (encoding or decoding error) depending
on the locale encoding. If the locale encoding is ASCII, Unicode errors are
likely to occur at the first non-ASCII filename. For example, Mercurial handles
filenames as bytes.</p>
<p>On Python 3, handling filenames as Unicode works thanks to the
<tt class="docutils literal">surrogateescape</tt> error handler. <strong>Most Python 2 applications ported to
Python 3 keep their Python 2 support, and so still handle filenames as bytes.</strong></p>
<p>Problems arise when such software is used on Windows.</p>
<p><strong>On Windows, the native type for filenames is Unicode</strong>. Many functions come
in two flavors: "ANSI" (bytes) and "Wide" (Unicode) versions. In my opinion,
the ANSI flavor mostly exists for backward compatibility. In Python 3.5,
passing a filename as bytes uses the ANSI flavor, whereas the Wide flavor is
used for Unicode filenames. The ANSI flavor uses the ANSI code page which is
very limited compared to Unicode, usually only 256 code points or less. Some
filenames not encodable to the ANSI code page simply cannot be opened, renamed,
etc. using the ANSI API.</p>
<p>The other issue is that <strong>some developers only develop on Unix</strong> (ex: Linux or
macOS) <strong>and never test their application on Windows</strong>.</p>
<p>For a better rationale, read the <a class="reference external" href="https://www.python.org/dev/peps/pep-0529/#background">Background section</a> of Steve Dower's PEP
:-)</p>
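<p>In Python, the portable bridge between the two worlds is <tt class="docutils literal">os.fsencode()</tt>/<tt class="docutils literal">os.fsdecode()</tt>, which apply <tt class="docutils literal">sys.getfilesystemencoding()</tt> together with its error handler. After PEP 529, that encoding is UTF-8 on Windows, so bytes paths can round-trip any valid filename. A sketch of the idea:</p>

```python
import os
import sys

# os.fsencode()/os.fsdecode() convert between str and bytes paths using
# sys.getfilesystemencoding() plus its error handler, so code that
# handles filenames as bytes still round-trips non-ASCII names.
name = "caf\xe9.txt"
encoded = os.fsencode(name)
assert os.fsdecode(encoded) == name

print(sys.getfilesystemencoding())  # 'utf-8' on most modern platforms
```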
</div>
<div class="section" id="discussion-at-the-cpython-sprint-and-guido-s-approval">
<h2>Discussion at the CPython sprint and Guido's approval</h2>
<p>Honestly, <strong>at first read, I was sure that PEP 529 would break all
applications on Windows</strong>.</p>
<p>Fortunately, thanks to the PSF and Instagram, I was able to attend my first
CPython sprint at Instagram headquarters: <a class="reference external" href="https://vstinner.github.io/cpython-sprint-2016.html">CPython sprint, september 2016</a>. There I discussed with <strong>Steve, who
reassured me and explained his PEP to me</strong>. Later, we talked with <strong>Guido van
Rossum</strong>.</p>
<p>Even if I liked the idea of using UTF-8, I was still not fully confident that the
change would not break the world. <strong>We agreed to try the change during the
Python 3.6 beta phase</strong>, and to revert it if something bad happened.</p>
<a class="reference external image-reference" href="http://blog.python.org/2016/09/python-core-development-sprint-2016-36.html"><img alt="CPython developers at the Facebook sprint" src="https://vstinner.github.io/images/cpython_sprint_2016_photo.jpg" /></a>
<p>Following this talk, <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2016-September/146277.html">Guido accepted the PEP under conditions</a>:</p>
<blockquote>
<p>I'm hijacking this thread to <strong>provisionally accept PEP 529</strong>. (I'll also
do this for PEP 528, in its own thread.)</p>
<p><strong>I've talked things over with Steve and Victor and we're going to do an
experiment</strong> (as <a class="reference external" href="https://www.python.org/dev/peps/pep-0529/#beta-experiment">now written up in the PEP</a>) to tease out
any issues with this change during the beta. <strong>If serious problems crop up
we may have to roll back the changes and reject the PEP</strong> -- we won't get
another chance at getting this right. (That would also mean that using the
binary filesystem APIs will remain deprecated and will eventually be
disallowed; as long as the PEP remains accepted they are undeprecated.)</p>
<p>Congrats Steve! Thanks for the massive amount of work on the
implementation and the thinking that went into the design. Thanks
everyone else for their feedback.</p>
<p class="attribution">—Guido</p>
</blockquote>
<p><strong>I was honoured that Guido listened to my Unicode experience</strong> to take a
decision on the PEP ;-)</p>
<p>Steve chose the right timing to get his PEP accepted. Thanks to the sprint,
which made it possible to quickly discuss such a backward incompatible change, <strong>the PEP
was approved in just 12 days</strong>! For comparison, some of my PEPs like my
<a class="reference external" href="https://www.python.org/dev/peps/pep-0446/">PEP 446: Make newly created file descriptors non-inheritable</a> (another backward incompatible
change) took 8 months to get accepted.</p>
</div>
<div class="section" id="pep-528-windows-console">
<h2>PEP 528: Windows console</h2>
<p>Just before the PEP 529, Steve Dower also wrote <a class="reference external" href="https://www.python.org/dev/peps/pep-0528/">PEP 528: Change Windows
console encoding to UTF-8</a>. This
change only impacts the Windows console, so there is a lower risk of breaking
the world.</p>
<p>This PEP was also <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2016-September/146278.html">quickly approved by Guido</a>
during the CPython sprint. Steve implemented it in Python 3.6.</p>
<p>Even if it's a smaller change, it is <strong>yet another change towards using UTF-8
everywhere</strong>.</p>
</div>
<div class="section" id="great-success">
<h2>Great success!</h2>
<p>Fortunately, I was wrong about the risk of breaking the world. <strong>No user
complained about these two backward incompatible changes: Python 3.6 on Windows
is a success!</strong></p>
<p>Python 3.6 now has <strong>better Unicode support</strong> on Windows thanks to PEP
528 and PEP 529!</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>September 2016: Steve Dower proposed two major backward incompatible changes
for Python 3.6 on Windows: <a class="reference external" href="https://www.python.org/dev/peps/pep-0528/">PEP 528: Change Windows console encoding to UTF-8</a> and <a class="reference external" href="https://www.python.org/dev/peps/pep-0529/">PEP 529: Change Windows
filesystem encoding to UTF-8</a>.</p>
<p>At first read, I was sure that PEP 529 (filesystem encoding) would break
all applications on Windows.</p>
<p>Thanks to the CPython core dev sprint, I was able to discuss with Steve, who
reassured me and explained his PEP 529 to me. We agreed with Guido van Rossum to
try the change during the Python 3.6 beta phase, and to revert it if something bad
happened. I was honoured that Guido listened to my Unicode experience to take a
decision on the PEP.</p>
<p>The <a class="reference external" href="https://www.python.org/dev/peps/pep-0528/">PEP 528: Change Windows console encoding to UTF-8</a> was also quickly approved,
another change towards using UTF-8 everywhere.</p>
<p>No user complained about these two backward incompatible changes: Python 3.6 on
Windows is a success!</p>
<p>Python 3.6 now has better Unicode support on Windows thanks to PEP 528
and PEP 529!</p>
</div>
Python 3.2 Painful History of the Filesystem Encoding2018-03-15T23:00:00+01:002018-03-15T23:00:00+01:00Victor Stinnertag:vstinner.github.io,2018-03-15:/painful-history-python-filesystem-encoding.html<p>Between Python 3.0 released in 2008 and Python 3.4 released in 2014, the Python
filesystem encoding changed multiple times. <strong>It took 6 years to choose the best
Python filesystem encoding on each platform.</strong></p>
<p><strong>I have been officially promoted as a core developer</strong> in January 2010 by
<strong>Martin von …</strong></p><p>Between Python 3.0 released in 2008 and Python 3.4 released in 2014, the Python
filesystem encoding changed multiple times. <strong>It took 6 years to choose the best
Python filesystem encoding on each platform.</strong></p>
<p><strong>I have been officially promoted as a core developer</strong> in January 2010 by
<strong>Martin von Loewis</strong>. I spent the whole year of 2010 fixing dozens of encoding
issues during the development of Python 3.2, following my Unicode work started
in 2008.</p>
<p>This article is focused on the long discussions to choose the best Python
filesystem encoding on each platform in 2010 for Python 3.2.</p>
<p><strong>This article is the third in a series of articles telling the history and
rationale of the Python 3 Unicode model for the operating system:</strong></p>
<ul class="simple">
<li><ol class="first arabic">
<li><a class="reference external" href="https://vstinner.github.io/python30-listdir-undecodable-filenames.html">Python 3.0 listdir() Bug on Undecodable Filenames</a></li>
</ol>
</li>
<li><ol class="first arabic" start="2">
<li><a class="reference external" href="https://vstinner.github.io/pep-383.html">Python 3.1 surrogateescape error handler (PEP 383)</a></li>
</ol>
</li>
<li><ol class="first arabic" start="3">
<li><a class="reference external" href="https://vstinner.github.io/painful-history-python-filesystem-encoding.html">Python 3.2 Painful History of the Filesystem Encoding</a></li>
</ol>
</li>
<li><ol class="first arabic" start="4">
<li><a class="reference external" href="https://vstinner.github.io/python36-utf8-windows.html">Python 3.6 now uses UTF-8 on Windows</a></li>
</ol>
</li>
<li><ol class="first arabic" start="5">
<li><a class="reference external" href="https://vstinner.github.io/posix-locale.html">Python 3.7 and the POSIX locale</a></li>
</ol>
</li>
<li><ol class="first arabic" start="6">
<li><a class="reference external" href="https://vstinner.github.io/python37-new-utf8-mode.html">Python 3.7 UTF-8 Mode</a></li>
</ol>
</li>
</ul>
<a class="reference external image-reference" href="https://commons.wikimedia.org/wiki/File:Longleat-maze.jpg"><img alt="Maze" src="https://vstinner.github.io/images/maze.jpg" /></a>
<div class="section" id="python-3-0-loves-utf-8">
<h2>Python 3.0 loves UTF-8</h2>
<p>When Python 3.0 was released, it was unclear which encodings should be used
for:</p>
<ul class="simple">
<li>File content: <tt class="docutils literal"><span class="pre">open().read()</span></tt></li>
<li>Filenames: <tt class="docutils literal">os.listdir()</tt>, <tt class="docutils literal">open()</tt>, etc.</li>
<li>Command line arguments: <tt class="docutils literal">sys.argv</tt> and <tt class="docutils literal">subprocess.Popen</tt> arguments</li>
<li>Environment variables: <tt class="docutils literal">os.environ</tt></li>
<li>etc.</li>
</ul>
<p>Python 3.0 was forked from Python 2.6 and functions were modified to use
Unicode. Many Python 3 functions used UTF-8 only because the implementation
was modified to use the default encoding, which is UTF-8: it was not a
deliberate choice.</p>
<p><strong>While UTF-8 is a good choice in most cases, it is not the best choice in
all cases.</strong> Almost everything worked well in Python 3.0 when all data used
UTF-8, but Python 3.0 failed badly if the locale encoding was not UTF-8.</p>
<p>Python 3.1, 3.2 and 3.3 got a lot of changes to adjust encodings in all
corners of the standard library.</p>
<p>Python 3.1 got the <tt class="docutils literal">surrogateescape</tt> error handler (PEP 383) which reduced
Unicode errors: read my previous article <a class="reference external" href="https://vstinner.github.io/pep-383.html">Python 3.1 surrogateescape error
handler (PEP 383)</a>.</p>
</div>
<div class="section" id="add-sys-setfilesystemencoding">
<h2>Add sys.setfilesystemencoding()</h2>
<p>September 2008, <a class="reference external" href="https://bugs.python.org/issue3187">bpo-3187</a>: To fix
<tt class="docutils literal">os.listdir(str)</tt> to support undecodable filenames, <strong>Martin v. Löwis</strong>
<a class="reference external" href="https://bugs.python.org/issue3187#msg74080">proposed a new function to change the filesystem encoding</a>:</p>
<blockquote>
Here is a patch that solves the issue in a different way: it introduces
sys.setfilesystemencoding. <strong>If applications invoke
sys.setfilesystemencoding("iso-8859-1"), all file names can be successfully
converted into a character string.</strong></blockquote>
<p>The ISO-8859-1 encoding has a very interesting property for bytes: it maps
exactly the <tt class="docutils literal">0x00 - 0xff</tt> byte range to the U+0000 - U+00ff Unicode range,
the decoder cannot fail:</p>
<pre class="literal-block">
$ python3.6 -q
>>> all(ord((b'%c' % byte).decode('iso-8859-1')) == byte for byte in range(256))
True
>>> all(ord(('%c' % char).encode('iso-8859-1')) == char for char in range(256))
True
</pre>
<p>Guido van Rossum <a class="reference external" href="https://bugs.python.org/issue3187#msg74173">commented</a>:</p>
<blockquote>
<p>I will check in Victor's changes (with some edits).</p>
<p>Together this means that the various <strong>suggested higher-level solutions</strong>
(like returning path-like objects, or some kind of roundtripping
almost-but-not-quite-utf-8 encoding) <strong>can be implemented in pure Python</strong>.</p>
</blockquote>
<p>October 2008, <strong>Martin v. Löwis</strong> pushed the <a class="reference external" href="https://github.com/python/cpython/commit/04dc25c53728f5c2fe66d9e66af67da0c9b8959d">commit 04dc25c5</a>:</p>
<pre class="literal-block">
Issue #3187: Add sys.setfilesystemencoding.
</pre>
<p>Python 3.0 will be the first major release with this function.</p>
<p>In retrospect, I see this function as asking developers and users to be
smart and choose the encoding themselves.</p>
<p>While the ISO-8859-1 encoding trick is tempting, we will see later that
<tt class="docutils literal">setfilesystemencoding()</tt> is broken by design and so cannot be used in
practice.</p>
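<p>An illustrative sketch of the design flaw: filenames decoded <em>before</em> the switch no longer round-trip to the original on-disk bytes <em>after</em> it, so the application ends up with an inconsistent mix of encodings:</p>

```python
# Why a mutable filesystem encoding is broken by design: a filename
# decoded while the encoding was ISO-8859-1 no longer matches the
# on-disk bytes once the application switches to UTF-8.
on_disk = b"caf\xe9"                      # Latin-1 encoded filename

name = on_disk.decode("iso-8859-1")       # decoded before the switch
assert name.encode("iso-8859-1") == on_disk

# ... sys.setfilesystemencoding("utf-8") happens here ...
assert name.encode("utf-8") != on_disk    # '\xe9' becomes b'\xc3\xa9'
print(name.encode("utf-8"))               # b'caf\xc3\xa9'
```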
</div>
<div class="section" id="what-if-getting-the-locale-encoding-fails">
<h2>What if getting the locale encoding fails?</h2>
<p>May 2010, I reported <a class="reference external" href="https://bugs.python.org/issue8610">bpo-8610</a>,
"Python3/POSIX: errors if file system encoding is None":</p>
<blockquote>
On POSIX (but not on Mac OS X), Python3 calls get_codeset() to get the file
system encoding. If this function fails, sys.getfilesystemencoding()
returns None.</blockquote>
<p>I pushed the <a class="reference external" href="https://github.com/python/cpython/commit/b744ba1d14c5487576c95d0311e357b707600b47">commit b744ba1d</a>:</p>
<blockquote>
Issue #8610: Load file system codec at startup, and <strong>display a fatal error
on failure</strong>. <strong>Set the file system encoding to utf-8</strong> (instead of None)
<strong>if getting the locale encoding failed</strong>, or if nl_langinfo(CODESET)
function is missing.</blockquote>
<p>This change <strong>adds the function initfsencoding()</strong>: logic to initialize the
filesystem encoding.</p>
<p>In practice, Python already used UTF-8 when the filesystem encoding was set to
<tt class="docutils literal">None</tt>, but this change makes the default more obvious. The change also makes
the error case better defined: Python exits immediately with a fatal error.</p>
</div>
<div class="section" id="support-locale-encodings-different-than-utf-8">
<h2>Support locale encodings different than UTF-8</h2>
<p>My biggest Unicode project in Python 3 was to <strong>fix the encoding</strong> in all
corners of the standard library. This task kept me busy between Python 3.0 and
Python 3.4, at least.</p>
<p>May 2010, I created <a class="reference external" href="https://bugs.python.org/issue8611">bpo-8611</a>:</p>
<blockquote>
<strong>Python3 is unable to start</strong> (bootstrap failure) on a POSIX system <strong>if
the locale encoding is different than utf8 and the Python path</strong> (standard
library path where the encoding module is stored) <strong>contains a non-ASCII
character</strong>. (Windows and Mac OS X are not affected by this issue because
the file system encoding is hardcoded.)</blockquote>
<p>For example, <a class="reference external" href="https://bugs.python.org/issue8242">bpo-8242</a> "Improve support
of PEP 383 (surrogates) in Python3" is a meta issue tracking multiple issues:</p>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue7606">bpo-7606</a>:
test_xmlrpc fails with non-ascii path</li>
<li><a class="reference external" href="https://bugs.python.org/issue8092">bpo-8092</a>:
utf8, backslashreplace and surrogates</li>
<li><a class="reference external" href="https://bugs.python.org/issue8383">bpo-8383</a>:
pickle is unable to encode unicode surrogates</li>
<li><a class="reference external" href="https://bugs.python.org/issue8390">bpo-8390</a>:
tarfile: use surrogates for undecode fields</li>
<li><a class="reference external" href="https://bugs.python.org/issue8391">bpo-8391</a>:
os.execvpe() doesn't support surrogates in env</li>
<li><a class="reference external" href="https://bugs.python.org/issue8393">bpo-8393</a>:
subprocess: support undecodable current working directory on POSIX OS</li>
<li><a class="reference external" href="https://bugs.python.org/issue8394">bpo-8394</a>:
ctypes.dlopen() doesn't support surrogates</li>
<li><a class="reference external" href="https://bugs.python.org/issue8412">bpo-8412</a>:
os.system() doesn't support surrogates nor bytes</li>
<li><a class="reference external" href="https://bugs.python.org/issue8467">bpo-8467</a>:
subprocess: surrogates of the error message (Python implementation on non-Windows)</li>
<li><a class="reference external" href="https://bugs.python.org/issue8468">bpo-8468</a>:
bz2: support surrogates in filename, and bytes/bytearray filename</li>
<li><a class="reference external" href="https://bugs.python.org/issue8477">bpo-8477</a>:
_ssl: support surrogates in filenames, and bytes/bytearray filenames</li>
<li><a class="reference external" href="https://bugs.python.org/issue8485">bpo-8485</a>:
Don't accept bytearray as filenames, or simplify the API</li>
</ul>
<p>I fixed all these issues, and reported most of them.</p>
<p>October 2010, finally, five months later, I managed to close the issue!</p>
<blockquote>
Starting at r85691, the full test suite of Python 3.2 pass with ASCII,
ISO-8859-1 and UTF-8 locale encodings in a non-ascii directory.
<strong>The work on this issue is done.</strong></blockquote>
<p>At that time, I didn't know that it would take me a few more years to really
fix <strong>all</strong> encoding issues. For example, it took me <strong>3 years</strong> to modify the
core of the import machinery to pass filenames as Unicode on Windows: <a class="reference external" href="https://bugs.python.org/issue3080">bpo-3080</a> <strong>Full unicode import system</strong>.</p>
</div>
<div class="section" id="add-pythonfsencoding-environment-variable">
<h2>Add PYTHONFSENCODING environment variable</h2>
<p>May 2010, while discussing how to fix <a class="reference external" href="https://bugs.python.org/issue8610">bpo-8610</a> "Python3/POSIX: errors if file system
encoding is None", I asked what the best encoding would be if reading the locale
encoding fails. As a follow-up, <strong>Marc-Andre Lemburg</strong> created <a class="reference external" href="https://bugs.python.org/issue8622">bpo-8622</a>:</p>
<blockquote>
<p>As discussed on issue8610, we need a way to <strong>override the automatic
detection of the file system encoding</strong> - for much the same reasons we also
do for the I/O encoding: the detection mechanism isn't fail-safe.</p>
<p>We should add a new environment variable with the same functionality as
<tt class="docutils literal">PYTHONIOENCODING</tt>:</p>
<pre class="literal-block">
PYTHONFSENCODING: Encoding[:errors] used for file system.
</pre>
</blockquote>
<p>I liked the idea, so I implemented it. August 2010, I pushed the <a class="reference external" href="https://github.com/python/cpython/commit/94908bbc1503df830d1d615e7b57744ae1b41079">commit
94908bbc</a>:</p>
<blockquote>
<p>Issue #8622: Add <tt class="docutils literal">PYTHONFSENCODING</tt> environment variable to override the
filesystem encoding.</p>
<p><tt class="docutils literal">initfsencoding()</tt> displays also a better error message
if <tt class="docutils literal">get_codeset()</tt> failed.</p>
</blockquote>
</div>
<div class="section" id="remove-sys-setfilesystemencoding">
<h2>Remove sys.setfilesystemencoding()</h2>
<p>August 2010, just after adding <tt class="docutils literal">PYTHONFSENCODING</tt>, I opened <a class="reference external" href="https://bugs.python.org/issue9632">bpo-9632</a> to remove the
<tt class="docutils literal">sys.setfilesystemencoding()</tt> function:</p>
<blockquote>
<p>The <tt class="docutils literal">sys.setfilesystemencoding()</tt> function is <strong>dangerous</strong> because it
introduces a lot of inconsistencies: this function is <strong>unable to reencode
all filenames</strong> of all objects (eg. Python is unable to find filenames in
user objects or 3rd party libraries). Eg. if you change the filesystem from
utf8 to ascii, it will not be possible to use existing non-ascii (unicode)
filenames: they will raise UnicodeEncodeError.</p>
<p>As <tt class="docutils literal">sys.setdefaultencoding()</tt> in Python2, I think that
<tt class="docutils literal">sys.setfilesystemencoding()</tt> is the <strong>root of evil</strong> :-)
<strong>PYTHONFSENCODING</strong> (issue #8622) <strong>is the right solution</strong> to set the
filesysteme encoding.</p>
</blockquote>
<p><strong>Marc-Andre Lemburg</strong> complained that applications embedding Python may want
to set the encoding used by Python. I proposed to use the <tt class="docutils literal">PYTHONFSENCODING</tt>
environment variable as a workaround, even if it was not the best option.</p>
<p>One month later, I pushed the <a class="reference external" href="https://github.com/python/cpython/commit/5b519e02016ea3a51f784dee70eead3be4ab1aff">commit 5b519e02</a>:</p>
<blockquote>
Issue #9632: Remove <tt class="docutils literal">sys.setfilesystemencoding()</tt> function: use
<tt class="docutils literal">PYTHONFSENCODING</tt> environment variable to set the filesystem encoding at
Python startup. <tt class="docutils literal">sys.setfilesystemencoding()</tt> created inconsistencies
because it was unable to reencode all filenames of all objects.</blockquote>
</div>
<div class="section" id="reencode-filenames-when-setting-the-filesystem-encoding">
<h2>Reencode filenames when setting the filesystem encoding</h2>
<p>August 2010, I created <a class="reference external" href="https://bugs.python.org/issue9630">bpo-9630</a>:
"Reencode filenames when setting the filesystem encoding".</p>
<p>Since the beginning of 2010, I identified a design flaw in the Python
initialization. Python starts by <strong>decoding strings from the default encoding
UTF-8</strong>. Later, Python reads the locale encoding and loads the Python codec of
this encoding. Then Python <strong>decodes string from the locale encoding</strong>.
Problem: if the locale encoding is not UTF-8, <strong>encoding strings decoded from
UTF-8 to the locale encoding can fail</strong> in different ways.</p>
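<p>The failure modes can be sketched in a few lines of Python (the path and the encodings below are hypothetical, chosen only for illustration):</p>

```python
# The same bytes decoded with the early default (UTF-8) and with the locale
# encoding (here Latin-1) give two different strings, and the early string
# may not be encodable to the locale encoding at all.
path_bytes = b"/home/caf\xc3\xa9"      # UTF-8 bytes for "/home/café"

early = path_bytes.decode("utf-8")     # decoded before the locale codec is loaded
late = path_bytes.decode("latin-1")    # decoded once the locale encoding is known

print(early)   # /home/café
print(late)    # /home/cafÃ©  (mojibake)

try:
    early.encode("ascii")              # fails if the locale encoding is ASCII
except UnicodeEncodeError:
    print("cannot encode to the locale encoding")
```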
<p>I wrote a patch to "reencode" filenames of all module and code objects once the
filesystem encoding is set, in <tt class="docutils literal">initfsencoding()</tt>.</p>
<p>When I wrote the patch, I knew that it was an <strong>ugly hack and not the proper
design</strong>. I proposed to try to avoid importing any Python module before the Python
codec of the locale encoding is loaded, but there was a practical issue: Python
only has built-in implementations (written in C) of the most popular encodings
like ASCII and UTF-8. Some encodings like ISO-8859-15 are only implemented in
Python.</p>
<p>I also proposed to "unload all modules, clear all caches and delete all code
objects" after setting the filesystem encoding. This option would have been very
inefficient and made Python startup even slower, whereas Python 3 startup was
already much slower than Python 2 startup.</p>
<p>September 2010, I pushed the <a class="reference external" href="https://github.com/python/cpython/commit/c39211f51e377919952b139c46e295800cbc2a8d">commit c39211f5</a>:</p>
<blockquote>
<p>Issue #9630: Redecode filenames when setting the filesystem encoding</p>
<p>Redecode the filenames of:</p>
<blockquote>
<ul class="simple">
<li>all modules: __file__ and __path__ attributes</li>
<li>all code objects: co_filename attribute</li>
<li>sys.path</li>
<li>sys.meta_path</li>
<li>sys.executable</li>
<li>sys.path_importer_cache (keys)</li>
</ul>
</blockquote>
<p>Keep weak references to all code objects until <tt class="docutils literal">initfsencoding()</tt> is
called, to be able to redecode co_filename attribute of all code objects.</p>
</blockquote>
<p>The list of weak references to code objects really looked like a hack and I
disliked it, but I failed to find a better way to fix Python startup.</p>
</div>
<div class="section" id="pythonfsencoding-dead-end">
<h2>PYTHONFSENCODING dead end</h2>
<p>Even with my latest big and ugly "redecode filenames when setting the
filesystem encoding" fix, there were <strong>issues when the filesystem encoding was
different than the locale encoding</strong>. I identified 4 bugs:</p>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue9992">bpo-9992</a>, <tt class="docutils literal">sys.argv</tt>: decoded from the <strong>locale</strong> encoding, but subprocess encodes process arguments to the <strong>filesystem</strong> encoding</li>
<li><a class="reference external" href="https://bugs.python.org/issue10014">bpo-10014</a>, <tt class="docutils literal">sys.path</tt>: decoded from the <strong>locale</strong> encoding, but import encodes paths to the <strong>filesystem</strong> encoding</li>
<li><a class="reference external" href="https://bugs.python.org/issue10039">bpo-10039</a>, the script name: read from the command line
(ex: <tt class="docutils literal">python script.py</tt>) and decoded from the <strong>locale</strong> encoding, whereas
it is used to fill <tt class="docutils literal">sys.path[0]</tt> and import encodes paths to the
<strong>filesystem</strong> encoding.</li>
<li><a class="reference external" href="https://bugs.python.org/issue9988">bpo-9988</a>, <tt class="docutils literal">PYTHONWARNINGS</tt> environment variable: decoded from the
<strong>locale</strong> encoding, but <tt class="docutils literal">subprocess</tt> encodes environment variables to the
<strong>filesystem</strong> encoding.</li>
</ul>
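<p>All four bugs share the same shape. A minimal sketch, assuming a Latin-1 locale encoding and an ASCII filesystem encoding (hypothetical values):</p>

```python
# A value decoded with the locale encoding cannot always be re-encoded
# with a different filesystem encoding.
locale_encoding = "latin-1"      # used to decode sys.argv, sys.path, etc.
filesystem_encoding = "ascii"    # used by subprocess and import to encode

arg_bytes = b"caf\xe9"                    # "café" as Latin-1 bytes from the OS
arg = arg_bytes.decode(locale_encoding)   # decoding succeeds: 'café'
try:
    arg.encode(filesystem_encoding)       # encoding back fails
except UnicodeEncodeError:
    print("the value cannot cross the encoding boundary")
```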
<p>October 2010, I wrote an email to the python-dev list: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2010-October/104509.html">Inconsistencies if
locale and filesystem encodings are different</a>. I
proposed two solutions:</p>
<ul class="simple">
<li>(a) use the same encoding to encode and decode values (it can be different
for each issue).</li>
<li>(b) <strong>remove PYTHONFSENCODING variable</strong> and raise an error if locale and
filesystem encodings are different (ensure that both encodings are the same).</li>
</ul>
<p><strong>Marc-Andre Lemburg</strong> <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2010-October/104511.html">replied</a>:</p>
<blockquote>
<p>You have to differentiate between the meaning of a file system
encoding and the locale:</p>
<p>A file system encoding defines how the applications interact
with the file system.</p>
<p>A locale defines how the user expects to interact with the
application.</p>
<p>It is well possible that the two are different. Mac OS X is
just one example. Another common example is having a Unix
account using the C locale (=ASCII) while working on a UTF-8
file system.</p>
</blockquote>
<p>This email is a good example of the dilemma we faced when having to choose <strong>one</strong>
encoding. There is a big temptation to use multiple encodings, but in the end,
<strong>data are not isolated</strong>. A filename can be found in command line arguments
(<tt class="docutils literal">python3 script.py file.txt</tt>), in environment variables
(<tt class="docutils literal">LOG_FILE=log.txt</tt>), in file content (ex: <tt class="docutils literal">Makefile</tt> or a configuration
file), etc. Using multiple encodings does not work in practice.</p>
<img alt="Dead end" src="https://vstinner.github.io/images/dead_end.jpg" />
</div>
<div class="section" id="remove-pythonfsencoding">
<h2>Remove PYTHONFSENCODING</h2>
<p>September 2010, I reported <a class="reference external" href="https://bugs.python.org/issue9992">bpo-9992</a>:
Command-line arguments are not correctly decoded if locale and filesystem
encodings are different.</p>
<p>I proposed a patch to use the <strong>locale encoding</strong> to decode and encode command
line arguments, rather than using the <strong>filesystem encoding</strong>.</p>
<p><strong>Martin v. Löwis</strong> proposed to use the <strong>locale encoding</strong> for the command
line arguments, environment variables and all filenames. <a class="reference external" href="https://bugs.python.org/issue9992#msg118352">My summary</a>:</p>
<blockquote>
<p>You mean that we should use the following encoding:</p>
<ul class="simple">
<li>Mac OS X: UTF-8</li>
<li>Windows: unicode for command line/env, mbcs to decode filenames</li>
<li>others OSes: <strong>locale encoding</strong></li>
</ul>
<p>To do that, we have to:</p>
<ul class="simple">
<li>"others OSes": <strong>delete the PYTHONFSENCODING variable</strong></li>
<li>Mac OS X: use UTF-8 to decode the command line arguments (we can use
<tt class="docutils literal">PyUnicode_DecodeUTF8()</tt> + <tt class="docutils literal">PyUnicode_AsWideCharString()</tt> before
Python is initialized)</li>
</ul>
</blockquote>
<p>October 2010, I pushed the <a class="reference external" href="https://github.com/python/cpython/commit/8f6b6b0cc3febd15e33a96bd31dcb3cbef2ad1ac">commit 8f6b6b0c</a>:</p>
<blockquote>
Issue #9992: Remove PYTHONFSENCODING environment variable.</blockquote>
<p>Two days later, I pushed an important change to <strong>use the locale encoding</strong> and
remove the ugly <tt class="docutils literal">redecode_filenames()</tt> hack, <a class="reference external" href="https://github.com/python/cpython/commit/f3170ccef8809e4a3f82fe9f82dc7a4a486c28c1">commit f3170cce</a>:</p>
<blockquote>
<p>Use locale encoding if <tt class="docutils literal">Py_FileSystemDefaultEncoding</tt> is not set</p>
<ul class="simple">
<li><tt class="docutils literal">PyUnicode_EncodeFSDefault()</tt>, <tt class="docutils literal">PyUnicode_DecodeFSDefaultAndSize()</tt>
and <tt class="docutils literal">PyUnicode_DecodeFSDefault()</tt> use the locale encoding instead of
UTF-8 if <tt class="docutils literal">Py_FileSystemDefaultEncoding</tt> is <tt class="docutils literal">NULL</tt></li>
<li><tt class="docutils literal">redecode_filenames()</tt> functions and <tt class="docutils literal">_Py_code_object_list</tt> (issue #9630)
are no more needed: remove them</li>
</ul>
</blockquote>
<p>This change was made possible by enhancements to
<tt class="docutils literal">PyUnicode_EncodeFSDefault()</tt> and <tt class="docutils literal">PyUnicode_DecodeFSDefaultAndSize()</tt>.
Previously, <strong>these functions used UTF-8</strong> before the filesystem encoding was set. With
my change, these functions <strong>now use the C implementation of the locale
encoding</strong>: <tt class="docutils literal">mbstowcs()</tt> to decode and <tt class="docutils literal">wcstombs()</tt> to encode. In
practice, the code is more complex because Python uses the <tt class="docutils literal">surrogateescape</tt>
error handler.</p>
<p>Using the C implementation of the locale encoding fixed a lot of "bootstrap"
issues of the Python initialization. It works because <strong>the Python codec of the
locale encoding is 100% compatible with the C implementation</strong> of the locale
codec.</p>
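<p>At the Python level, this guarantee is visible through <tt class="docutils literal">os.fsdecode()</tt> and <tt class="docutils literal">os.fsencode()</tt>, which combine the filesystem encoding with the <tt class="docutils literal">surrogateescape</tt> error handler. A small sketch (the exact decoded string depends on your locale):</p>

```python
# Any byte string survives fsdecode() followed by fsencode() on POSIX
# systems, whatever the locale encoding is.
import os

raw = b"caf\xe9"          # not valid UTF-8
name = os.fsdecode(raw)   # e.g. 'caf\udce9' under a UTF-8 locale
assert os.fsencode(name) == raw
```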
</div>
<div class="section" id="encodings-used-by-python-3-2">
<h2>Encodings used by Python 3.2</h2>
<p>February 2011, Python 3.2 was released. Summary of the filesystem encodings
used:</p>
<ul class="simple">
<li><strong>ANSI code page</strong> on Windows;</li>
<li><strong>UTF-8</strong> on macOS;</li>
<li><strong>locale encoding</strong> on other platforms.</li>
</ul>
<p>Note: UTF-8 is used if the <tt class="docutils literal">nl_langinfo(CODESET)</tt> function is not available.</p>
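<p>The chosen encoding can be inspected at runtime with <tt class="docutils literal">sys.getfilesystemencoding()</tt> (the value depends on the platform and, in Python 3.2, on the locale):</p>

```python
# Report which filesystem encoding this interpreter uses.
import sys

print(sys.getfilesystemencoding())  # e.g. 'utf-8', 'mbcs' or a locale encoding
```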
</div>
<div class="section" id="force-ascii-encoding-on-freebsd-and-solaris">
<h2>Force ASCII encoding on FreeBSD and Solaris</h2>
<p>November 2012, I created <a class="reference external" href="https://bugs.python.org/issue16455">bpo-16455</a>:</p>
<blockquote>
<p>On FreeBSD and OpenIndiana, <tt class="docutils literal">sys.getfilesystemencoding()</tt> returns
<tt class="docutils literal">'ascii'</tt> when the locale is not set, whereas the locale encoding is
<tt class="docutils literal"><span class="pre">ISO-8859-1</span></tt> in practice.</p>
<p>This inconsistency causes different issues.</p>
</blockquote>
<p>December 2012, I pushed the <a class="reference external" href="https://github.com/python/cpython/commit/d45c7f8d74d30de0a558b10e04541b861428b7c1">commit d45c7f8d</a>:</p>
<blockquote>
Issue #16455: On FreeBSD and Solaris, if the locale is C, the
ASCII/surrogateescape codec is now used, instead of the locale encoding, to
decode the command line arguments. This change fixes inconsistencies with
os.fsencode() and os.fsdecode() because these operating systems announces
an ASCII locale encoding, whereas the ISO-8859-1 encoding is used in
practice.</blockquote>
<p>Extract of the main comment:</p>
<blockquote>
<p>Workaround FreeBSD and OpenIndiana locale encoding issue with the C locale.
On these operating systems, <strong>nl_langinfo(CODESET) announces an alias of
the ASCII encoding, whereas mbstowcs() and wcstombs() functions use the
ISO-8859-1 encoding</strong>. The problem is that os.fsencode() and
<tt class="docutils literal">os.fsdecode()</tt> use <tt class="docutils literal">locale.getpreferredencoding()</tt> codec. For example,
if command line arguments are decoded by <tt class="docutils literal">mbstowcs()</tt> and encoded back by
<tt class="docutils literal">os.fsencode()</tt>, we get a <tt class="docutils literal">UnicodeEncodeError</tt> instead of retrieving
the original byte string.</p>
<p>The workaround is enabled if <tt class="docutils literal">setlocale(LC_CTYPE, NULL)</tt> returns <tt class="docutils literal">"C"</tt>,
<tt class="docutils literal">nl_langinfo(CODESET)</tt> announces <tt class="docutils literal">"ascii"</tt> (or an alias to ASCII), and
at least one byte in range 0x80-0xff can be decoded from the locale
encoding. The workaround is also enabled on error, for example if getting
the locale failed.</p>
</blockquote>
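<p>The effect of the workaround can be sketched with codecs (the byte value is illustrative): with ASCII and <tt class="docutils literal">surrogateescape</tt> on both sides, undecodable bytes round-trip instead of raising <tt class="docutils literal">UnicodeEncodeError</tt> later in <tt class="docutils literal">os.fsencode()</tt>:</p>

```python
# Announced encoding: ASCII; actual mbstowcs() behaviour: ISO-8859-1.
raw = b"caf\xe9"

# Without the workaround, the round-trip breaks:
decoded = raw.decode("iso-8859-1")      # what mbstowcs() effectively produces
try:
    decoded.encode("ascii")             # what os.fsencode() then attempts
except UnicodeEncodeError:
    print("round-trip broken")

# With ASCII/surrogateescape on both sides, the original bytes come back:
decoded = raw.decode("ascii", "surrogateescape")   # 'caf\udce9'
assert decoded.encode("ascii", "surrogateescape") == raw
```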
<p>Python 3.4 will be the first major release getting the fix (March 2014), but I
also backported the change to the Python 3.2 and 3.3 branches.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p><strong>It took 6 years</strong> to get Python to use the best filesystem encoding.</p>
<p>Python 3.0 mostly uses UTF-8 everywhere, but it was not a deliberate choice and
it caused many issues when the locale encoding was not UTF-8. Python 3.1 got
the <tt class="docutils literal">surrogateescape</tt> error handler (PEP 383) which reduced Unicode errors.</p>
<p>October 2008, <strong>Martin v. Löwis</strong> added <tt class="docutils literal">sys.setfilesystemencoding()</tt> to
Python 3.0.</p>
<p>August 2010, I added a new <tt class="docutils literal">PYTHONFSENCODING</tt> environment variable,
<strong>Marc-Andre Lemburg</strong>'s idea.</p>
<p>September 2010, I removed the <tt class="docutils literal">sys.setfilesystemencoding()</tt> function because
it creates mojibake by design. I also pushed an ugly change to reencode
filenames to fix many <tt class="docutils literal">PYTHONFSENCODING</tt> bugs.</p>
<p>October 2010, I fixed all tests when Python lives in a non-ASCII directory:
the first milestone in supporting locale encodings different than UTF-8. I also
removed the <tt class="docutils literal">PYTHONFSENCODING</tt> environment variable after a long discussion.
Moreover, I pushed the most important Python 3.2 change: <strong>Python now uses the
locale encoding as the filesystem encoding</strong>. This change fixed many issues.</p>
<p>December 2012, I forced the filesystem encoding to ASCII on FreeBSD and Solaris
when the announced locale encoding is wrong.</p>
</div>
Python 3.1 surrogateescape error handler (PEP 383)2018-03-15T18:00:00+01:002018-03-15T18:00:00+01:00Victor Stinnertag:vstinner.github.io,2018-03-15:/pep-383.html<p>In my previous article, I wrote that <tt class="docutils literal">os.listdir(str)</tt> silently ignored
undecodable filenames in Python 3.0 and that lying about the real content of a
directory looks like a very bad idea.</p>
<p><strong>Martin v. Löwis</strong> found a very smart solution to this problem: the
<tt class="docutils literal">surrogateescape</tt> error handler.</p>
<p><strong>This article is the second in a series of articles telling the history and
rationale of the Python 3 Unicode model for the operating system:</strong></p>
<ul class="simple">
<li><ol class="first arabic">
<li><a class="reference external" href="https://vstinner.github.io/python30-listdir-undecodable-filenames.html">Python 3.0 listdir() Bug on Undecodable Filenames</a></li>
</ol>
</li>
<li><ol class="first arabic" start="2">
<li><a class="reference external" href="https://vstinner.github.io/pep-383.html">Python 3.1 surrogateescape error handler (PEP 383)</a></li>
</ol>
</li>
<li><ol class="first arabic" start="3">
<li><a class="reference external" href="https://vstinner.github.io/painful-history-python-filesystem-encoding.html">Python 3.2 Painful History of the Filesystem Encoding</a></li>
</ol>
</li>
<li><ol class="first arabic" start="4">
<li><a class="reference external" href="https://vstinner.github.io/python36-utf8-windows.html">Python 3.6 now uses UTF-8 on Windows</a></li>
</ol>
</li>
<li><ol class="first arabic" start="5">
<li><a class="reference external" href="https://vstinner.github.io/posix-locale.html">Python 3.7 and the POSIX locale</a></li>
</ol>
</li>
<li><ol class="first arabic" start="6">
<li><a class="reference external" href="https://vstinner.github.io/python37-new-utf8-mode.html">Python 3.7 UTF-8 Mode</a></li>
</ol>
</li>
</ul>
<div class="section" id="first-attempt-to-propose-the-solution">
<h2>First attempt to propose the solution</h2>
<p>September 2008, <a class="reference external" href="https://bugs.python.org/issue3187">bpo-3187</a>: While
solutions to fix <tt class="docutils literal">os.listdir(str)</tt> were discussed, <strong>Martin v. Löwis</strong>
<a class="reference external" href="https://bugs.python.org/issue3187#msg73992">proposed a different approach</a>:</p>
<blockquote>
<p>I'd like to propose yet another approach: make sure that <strong>conversion</strong>
according to the file system encoding <strong>always succeeds</strong>. <strong>If an
unconvertable byte is detected, map it into some private-use character.</strong>
To reduce the chance of conflict with other people's private-use
characters, we can use some of the plane 15 private-use characters, e.g.
map byte 0xPQ to U+F30PQ (in two-byte Unicode mode, this would result in
a surrogate pair).</p>
<p>This would make all file names accessible to all text processing
(including glob and friends); UI display would typically either report
an encoding error, or arrange for some replacement glyph to be shown.</p>
<p>There are certain variations of the approach possible, in case there is
objection to a specific detail.</p>
</blockquote>
<p>He amended this proposal:</p>
<blockquote>
<p><strong>James Knight</strong> points out that UTF-8b can be used to give unambiguous
round-tripping of characters in a UTF-8 locale. So I would like to amend my
previous proposal:</p>
<ul class="simple">
<li>for a non-UTF-8 encoding, use private-use characters for roundtripping</li>
<li>if the locale's charset is UTF-8, use UTF-8b as the file system encoding.</li>
</ul>
</blockquote>
<p><strong>But Martin's smart idea was lost</strong> in the middle of a long discussion.</p>
<a class="reference external image-reference" href="https://github.com/loewis"><img alt="Martin v. Löwis" src="https://vstinner.github.io/images/martin_von_loewis.jpg" /></a>
</div>
<div class="section" id="pep-383">
<h2>PEP 383</h2>
<p>April 2009, Martin v. Löwis proposed his idea again, now as the well-defined
<a class="reference external" href="https://peps.python.org/pep-0383">PEP 383</a>: <strong>Non-decodable Bytes in System Character Interfaces</strong>. He <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2009-April/088919.html">posted
his PEP to python-dev</a> for
comments.</p>
<p>Abstract:</p>
<blockquote>
<p>File names, environment variables, and command line arguments are defined
as being character data in POSIX; the C APIs however allow passing
arbitrary bytes - whether these conform to a certain encoding or not.</p>
<p><strong>This PEP proposes a means of dealing with such irregularities by embedding
the bytes in character strings in such a way that allows recreation of the
original byte string.</strong></p>
</blockquote>
<p>The <tt class="docutils literal">surrogateescape</tt> encoding is based on <strong>Markus Kuhn</strong>'s idea that he
called <strong>UTF-8b</strong>. Undecodable bytes in the range <tt class="docutils literal"><span class="pre">0x80-0xff</span></tt> are mapped to
Unicode surrogate characters in the range <tt class="docutils literal">U+DC80</tt> - <tt class="docutils literal">U+DCFF</tt>.</p>
<p>Example:</p>
<pre class="literal-block">
>>> b'nonascii\xff'.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff (...)
>>> b'nonascii\xff'.decode('ascii', 'surrogateescape')
'nonascii\udcff'
>>> 'nonascii\udcff'.encode('ascii', 'surrogateescape')
b'nonascii\xff'
</pre>
<p>Using the <tt class="docutils literal">surrogateescape</tt> error handler, <strong>decoding cannot fail</strong>. For
example, <tt class="docutils literal">os.listdir(str)</tt> no longer silently ignores undecodable filenames,
since all filenames become decodable with any encoding. Moreover, encoding
filenames with <tt class="docutils literal">surrogateescape</tt> returns the original bytes unchanged.</p>
<p><a class="reference external" href="https://mail.python.org/pipermail/python-dev/2009-April/089278.html">The PEP was accepted</a> by
<strong>Guido van Rossum</strong> in less than one week!</p>
</div>
<div class="section" id="implementation">
<h2>Implementation</h2>
<p>May 2009, Martin v. Löwis opened <a class="reference external" href="https://bugs.python.org/issue5915">bpo-5915</a> to get a review of his implementation.</p>
<p>Two days later, after <strong>Benjamin Peterson</strong> and <strong>Antoine Pitrou</strong> reviews,
Martin pushed the <a class="reference external" href="https://github.com/python/cpython/commit/011e8420339245f9b55d41082ec6036f2f83a182">commit 011e8420</a>:</p>
<blockquote>
Issue #5915: Implement PEP 383, Non-decodable Bytes
in System Character Interfaces.</blockquote>
<p>Five days later, Martin renamed his "utf8b" error handler to its final name
<strong>surrogateescape</strong>, <a class="reference external" href="https://github.com/python/cpython/commit/43c57785d3319249c03c3fa46c9df42a8ccd3e52">commit 43c57785</a>:</p>
<blockquote>
Rename utf8b error handler to surrogateescape.</blockquote>
<p><strong>Python 3.1</strong> will be the first release getting the <tt class="docutils literal">surrogateescape</tt> error
handler.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>In Python 3.0, <tt class="docutils literal">os.listdir(str)</tt> silently ignored undecodable filenames,
which was not ideal.</p>
<p><strong>Martin v. Löwis</strong> proposed to apply <strong>Markus Kuhn</strong>'s idea called <strong>UTF-8b</strong>
in Python as a new <tt class="docutils literal">surrogateescape</tt> error handler.</p>
<p>Martin's PEP was approved in less than one week and implemented a few days
later.</p>
<p>Using the <tt class="docutils literal">surrogateescape</tt> error handler, decoding cannot fail:
<tt class="docutils literal">os.listdir(str)</tt> no longer silently ignores undecodable filenames.
Moreover, encoding filenames with <tt class="docutils literal">surrogateescape</tt> returns the original
bytes unchanged.</p>
<p>The <tt class="docutils literal">surrogateescape</tt> error handler fixed a lot of old and very complex
Unicode issues on Unix. It is still widely used in Python 3.6 to <strong>avoid annoying
users with Unicode errors</strong>.</p>
</div>
Python 3.0 listdir() Bug on Undecodable Filenames2018-03-09T13:00:00+01:002018-03-09T13:00:00+01:00Victor Stinnertag:vstinner.github.io,2018-03-09:/python30-listdir-undecodable-filenames.html<p>Ten years ago, when Python 3.0 final was released, <tt class="docutils literal">os.listdir(str)</tt>
<strong>silently ignored undecodable filenames</strong>:</p>
<pre class="literal-block">
$ python3.0
>>> os.mkdir(b'x')
>>> open(b'x/nonascii\xff', 'w').close()
>>> os.listdir('x')
[]
</pre>
<p>You had to use bytes to see all filenames:</p>
<pre class="literal-block">
>>> os.listdir(b'x')
[b'nonascii\xff']
</pre>
<p>If the locale is POSIX or C, listdir() silently ignored all non-ASCII
filenames. Fortunately, <tt class="docutils literal">os.listdir()</tt> accepts <tt class="docutils literal">bytes</tt>, right? In fact, 4
months before the 3.0 final release, this was not the case.</p>
<p>Lying about the real content of a directory looks like a very bad idea. Well,
there is a rationale behind this design. Let me tell you this story, which is
now 10 years old.</p>
<p><strong>This article is the first in a series of articles telling the history and
rationale of the Python 3 Unicode model for the operating system:</strong></p>
<ul class="simple">
<li><ol class="first arabic">
<li><a class="reference external" href="https://vstinner.github.io/python30-listdir-undecodable-filenames.html">Python 3.0 listdir() Bug on Undecodable Filenames</a></li>
</ol>
</li>
<li><ol class="first arabic" start="2">
<li><a class="reference external" href="https://vstinner.github.io/pep-383.html">Python 3.1 surrogateescape error handler (PEP 383)</a></li>
</ol>
</li>
<li><ol class="first arabic" start="3">
<li><a class="reference external" href="https://vstinner.github.io/painful-history-python-filesystem-encoding.html">Python 3.2 Painful History of the Filesystem Encoding</a></li>
</ol>
</li>
<li><ol class="first arabic" start="4">
<li><a class="reference external" href="https://vstinner.github.io/python36-utf8-windows.html">Python 3.6 now uses UTF-8 on Windows</a></li>
</ol>
</li>
<li><ol class="first arabic" start="5">
<li><a class="reference external" href="https://vstinner.github.io/posix-locale.html">Python 3.7 and the POSIX locale</a></li>
</ol>
</li>
<li><ol class="first arabic" start="6">
<li><a class="reference external" href="https://vstinner.github.io/python37-new-utf8-mode.html">Python 3.7 UTF-8 Mode</a></li>
</ol>
</li>
</ul>
<div class="section" id="the-os-walk-bug">
<h2>The os.walk() bug</h2>
<a class="reference external image-reference" href="http://www.dailymail.co.uk/news/article-3592525/Classic-crashes-Incredible-black-white-photos-chaos-roads-early-days-automobile-beautiful-vintage-motors-smashing-trees-careering-canals-plummeting-bridges.html"><img alt="Boston Herald-Traveler photographer Leslie Jones had an eye for a dramatic scene, including when this seven-tonne dump truck plunged through the Warren Avenue bridge, in Boston" src="https://vstinner.github.io/images/car_accident_hole.jpg" /></a>
<p><a class="reference external" href="https://bugs.python.org/issue3187">bpo-3187</a>, June 2008: <strong>Helmut
Jarausch</strong> tested the <strong>first beta release of Python 3.0</strong> and reported a bug
on <tt class="docutils literal">os.walk()</tt> when he tried to walk into his home directory:</p>
<pre class="literal-block">
Traceback (most recent call last):
  File "WalkBug.py", line 5, in <module>
    for Dir, SubDirs, Files in os.walk('/home/jarausch') :
  File "/usr/local/lib/python3.0/os.py", line 278, in walk
    for x in walk(path, topdown, onerror, followlinks):
  File "/usr/local/lib/python3.0/os.py", line 268, in walk
    if isdir(join(top, name)):
  File "/usr/local/lib/python3.0/posixpath.py", line 64, in join
    if b.startswith('/'):
TypeError: expected an object with the buffer interface
</pre>
<p>In Python 3.0b1, <tt class="docutils literal">os.listdir(str)</tt> returned undecodable filenames as
<tt class="docutils literal">bytes</tt>. The caller had to be prepared to get filenames of two types, <tt class="docutils literal">str</tt>
and <tt class="docutils literal">bytes</tt>: this wasn't the case for <tt class="docutils literal">os.walk()</tt>, which failed with a
<tt class="docutils literal">TypeError</tt>.</p>
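The failure mode can be reproduced without os.walk(): here is a minimal sketch of what happens when a bytes entry leaks into str path joining (the path and filename are illustrative).

```python
# Sketch: in 3.0b1, os.listdir(str) could return a mix of str and bytes,
# and joining a str directory with a bytes entry raises a TypeError.
top = '/home/jarausch'    # str path passed to os.walk()
name = b'nonascii\xff'    # undecodable entry, returned as bytes

try:
    path = top + '/' + name   # what the path joining ends up doing
except TypeError:
    print('cannot mix str and bytes')
```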
<p><strong>At first glance, the bug seemed trivial to fix. In fact, although many
solutions were proposed, it took 4 months and 79 messages to fix the bug</strong>.</p>
</div>
<div class="section" id="i-proposed-a-new-filename-class">
<h2>I proposed a new Filename class</h2>
<p>August 2008, <a class="reference external" href="https://bugs.python.org/issue3187#msg71612">my first comment proposed</a> to use a custom "Filename" type
to store the original <tt class="docutils literal">bytes</tt> filename while also giving a Unicode view of
the filename, in a single object, using a hypothetical <tt class="docutils literal">myformat()</tt> function:</p>
<pre class="literal-block">
class Filename:
    def __init__(self, orig):
        self.as_bytes = orig
        self.as_str = myformat(orig)

    def __str__(self):
        return self.as_str

    def __bytes__(self):
        return self.as_bytes
</pre>
<p><strong>Antoine Pitrou</strong> suggested to inherit from <tt class="docutils literal">str</tt>:</p>
<blockquote>
I agree that logically it's the right solution. It's also the most
invasive. If that class is <strong>made a subclass of str</strong>, however, existing
code shouldn't break more than it currently does.</blockquote>
<p>I preferred to inherit from <tt class="docutils literal">bytes</tt> for practical reasons. Antoine noted that
the native type for filenames on Windows is <tt class="docutils literal">str</tt>, and so inheriting from
<tt class="docutils literal">bytes</tt> can be an issue on Windows.</p>
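Antoine's str-subclass suggestion can be sketched like this (hypothetical code, not an API that was ever adopted):

```python
# Hypothetical sketch of Antoine's suggestion: a str subclass, so that
# existing str-based code keeps working, while the original bytes are
# preserved for calls back into the operating system.
class Filename(str):
    def __new__(cls, orig, decoded):
        self = super().__new__(cls, decoded)
        self.as_bytes = orig
        return self

    def __bytes__(self):
        return self.as_bytes

name = Filename(b'nonascii\xff', 'nonascii?')
print(name.upper())    # behaves like a regular str
print(bytes(name))     # ...but still gives back the original bytes
```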
<p>Anyway, <a class="reference external" href="https://bugs.python.org/issue3187#msg71749">Guido van Rossum disliked the idea</a> (comment on InvalidFilename, a
variant of the class):</p>
<blockquote>
I'm not interested in the InvalidFilename class; it's an API complification
that might seem right for your situation but <strong>will hinder most other
people</strong>.</blockquote>
</div>
<div class="section" id="guido-van-rossum-proposed-to-use-replace-error-handler">
<h2>Guido van Rossum proposed to use replace error handler</h2>
<p><strong>Guido van Rossum</strong> <a class="reference external" href="https://bugs.python.org/issue3187#msg71655">proposed to use the replace error handler</a> to prevent decoding errors. For
example, <tt class="docutils literal">b'nonascii\xff'</tt> is decoded as <tt class="docutils literal">'nonascii�'</tt>.</p>
<p>The problem is that this filename cannot be used to read the file content using
<tt class="docutils literal">open()</tt> or to remove the file using <tt class="docutils literal">os.unlink()</tt>, since the operating
system doesn't know the Unicode filename containing the "�" character.</p>
<p>An important property is that <strong>encoding back the Unicode filename to bytes
must return the same original bytes filename</strong>.</p>
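The round-trip property can be checked directly: the replace handler loses it, while the surrogateescape handler that Python 3.1 eventually adopted (PEP 383, the next article in this series) preserves it.

```python
name = b'nonascii\xff'

# The replace error handler loses information: encoding back the decoded
# name gives the UTF-8 encoding of U+FFFD, not the original bytes.
decoded = name.decode('utf-8', 'replace')
assert decoded.encode('utf-8') != name

# The surrogateescape handler (PEP 383, Python 3.1) maps undecodable
# bytes to lone surrogates, so encoding back restores the exact bytes.
decoded = name.decode('utf-8', 'surrogateescape')
assert decoded == 'nonascii\udcff'
assert decoded.encode('utf-8', 'surrogateescape') == name
```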
</div>
<div class="section" id="defer-the-choice-to-the-caller-pass-a-callback">
<h2>Defer the choice to the caller: pass a callback</h2>
<p>As no obvious choice arose, <a class="reference external" href="https://bugs.python.org/issue3187#msg71680">I proposed to use a callback to handle
undecodable filenames</a>.
Pseudo-code:</p>
<pre class="literal-block">
def listdir(path, fallback_decoder=default_fallback_decoder):
    charset = sys.getfilesystemcharset()
    dir_fd = opendir(path)
    try:
        for bytesname in readdir(dir_fd):
            try:
                name = str(bytesname, charset)
            except UnicodeDecodeError:
                name = fallback_decoder(bytesname)
            yield name
    finally:
        closedir(dir_fd)
</pre>
<p>The default behaviour is to raise an exception on decoding error:</p>
<pre class="literal-block">
def default_fallback_decoder(name):
    raise
</pre>
<p>Example of callback returning the raw bytes string unchanged (Python 3.0 beta1
behaviour):</p>
<pre class="literal-block">
def return_undecodable_unchanged(name):
    return name
</pre>
<p>Example to use a custom filename class:</p>
<pre class="literal-block">
class Filename:
    ...

def filename_decoder(name):
    return Filename(name)
</pre>
<p><a class="reference external" href="https://bugs.python.org/issue3187#msg71699">Guido also disliked my callback idea</a>:</p>
<blockquote>
The callback variant is <strong>too complex</strong>; you could <strong>write it yourself by
using os.listdir() with a bytes argument</strong>.</blockquote>
</div>
<div class="section" id="emit-a-warning-on-undecodable-filename">
<h2>Emit a warning on undecodable filename</h2>
<a class="reference external image-reference" href="http://www.unicode.org/"><img alt="Warning: venoumous snakes" src="https://vstinner.github.io/images/warning_venomous_snakes.png" /></a>
<p>As ignoring undecodable filenames in <tt class="docutils literal">os.listdir(str)</tt> slowly became the most
popular option, <strong>Benjamin Peterson</strong> <a class="reference external" href="https://bugs.python.org/issue3187#msg71700">proposed to emit a warning</a> if a filename cannot be decoded,
to ease debugging:</p>
<blockquote>
(...) I don't like the idea of silently losing the contents of a directory.
That's asking for difficult to discover bugs. Could Python emit a warning
in this case?</blockquote>
<p>Guido van Rossum <a class="reference external" href="https://bugs.python.org/issue3187#msg71705">liked the idea</a>:</p>
<blockquote>
This may be the best compromise yet.</blockquote>
<p><strong>Amaury Forgeot d'Arc</strong> <a class="reference external" href="https://bugs.python.org/issue3187#msg73535">asked</a>:</p>
<blockquote>
Does the warning warn multiple times? IIRC the default behaviour is to warn
once.</blockquote>
<p><strong>Benjamin Peterson</strong> <a class="reference external" href="https://bugs.python.org/issue3187#msg73535">replied</a>:</p>
<blockquote>
<strong>Making a warning happen more than once is tricky because it requires
messing with the warnings filter.</strong> This of course takes away some of the
user's control which is one of the main reasons for using the Python
warning system in the first place.</blockquote>
<p>Because of this issue, the warning idea was abandoned.</p>
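The warn-once behaviour that Amaury and Benjamin discussed can be observed with the warnings module: under the standard "default" filter action, a given warning is shown only once per call site. (The warning text below is a stand-in for a hypothetical warning emitted by os.listdir().)

```python
import warnings

def warn_undecodable():
    # Stand-in for a hypothetical warning emitted on undecodable filenames
    warnings.warn("undecodable filename", UnicodeWarning)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("default")
    for _ in range(3):
        warn_undecodable()

# The "default" action deduplicates per (message, category, call site):
# only the first of the three identical warnings is delivered.
assert len(caught) == 1
```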
</div>
<div class="section" id="support-bytes-and-fix-os-listdir">
<h2>Support bytes and fix os.listdir()</h2>
<p>Guido repeated that the best workaround is to pass filenames as <tt class="docutils literal">bytes</tt>,
which is the native type for filenames on Unix, but most functions only
accepted filenames as <tt class="docutils literal">str</tt>.</p>
<p>I started to write multiple patches to support passing filenames as <tt class="docutils literal">bytes</tt>:</p>
<ul class="simple">
<li><tt class="docutils literal">posix_path_bytes.patch</tt>: enhance <tt class="docutils literal">posixpath.join()</tt></li>
<li><tt class="docutils literal">io_byte_filename.patch</tt>: enhance <tt class="docutils literal">open()</tt></li>
<li><tt class="docutils literal">fnmatch_bytes.patch</tt>: enhance <tt class="docutils literal">fnmatch.filter()</tt></li>
<li><tt class="docutils literal">glob1_bytes.patch</tt>: enhance <tt class="docutils literal">glob.glob()</tt></li>
<li><tt class="docutils literal">getcwd_bytes.patch</tt>: <tt class="docutils literal">os.getcwd()</tt> returns bytes if unicode conversion fails</li>
<li><tt class="docutils literal">merge_os_getcwd_getcwdu.patch</tt>: Remove <tt class="docutils literal">os.getcwdu()</tt>;
<tt class="docutils literal">os.getcwd(bytes=True)</tt> returns bytes</li>
<li><tt class="docutils literal">os_getcwdb.patch</tt>: Fix <tt class="docutils literal">os.getcwd()</tt> by using <tt class="docutils literal">PyUnicode_Decode()</tt> and
add <tt class="docutils literal">os.getcwdb()</tt> which returns <tt class="docutils literal">bytes</tt></li>
</ul>
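Two additions from this period, <tt class="docutils literal">os.fsencode()</tt> and <tt class="docutils literal">os.fsdecode()</tt> (bpo-8514, shipped in Python 3.2), are still the standard way to convert filenames between str and bytes. Combined with surrogateescape, they guarantee a lossless round-trip on Unix:

```python
import os

# os.fsdecode()/os.fsencode() use the filesystem encoding with the
# surrogateescape error handler (on Unix), so any bytes filename
# survives a str round-trip unchanged, even if it is undecodable.
raw = b'nonascii\xff'
name = os.fsdecode(raw)          # a str, possibly with lone surrogates
assert os.fsencode(name) == raw  # lossless round-trip
```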
<p>Guido van Rossum created a <a class="reference external" href="https://codereview.appspot.com/3055">review on my combined patches</a>. Then I also combined my patches into a
single <tt class="docutils literal">python3_bytes_filename.patch</tt> file.</p>
<p><strong>After one month of development and 6 versions of the combined patch, Guido
committed my big change</strong> as <a class="reference external" href="https://github.com/python/cpython/commit/f0af3e30db9475ab68bcb1f1ce0b5581e214df76">commit f0af3e30</a>:</p>
<pre class="literal-block">
commit f0af3e30db9475ab68bcb1f1ce0b5581e214df76
Author: Guido van Rossum <guido@python.org>
Date: Thu Oct 2 18:55:37 2008 +0000
Issue #3187: Better support for "undecodable" filenames. Code by Victor
Stinner, with small tweaks by GvR.
Lib/fnmatch.py | 27 ++++---
Lib/genericpath.py | 5 +-
Lib/glob.py | 17 +++--
Lib/io.py | 15 ++--
Lib/posixpath.py | 171 +++++++++++++++++++++++++++++++-----------
Lib/test/test_fnmatch.py | 9 +++
Lib/test/test_posix.py | 2 +-
Lib/test/test_posixpath.py | 150 ++++++++++++++++++++++++++++++++----
Lib/test/test_unicode_file.py | 6 +-
Misc/NEWS | 10 ++-
Modules/posixmodule.c | 90 +++++++++-------------
11 files changed, 358 insertions(+), 144 deletions(-)
</pre>
<p>My change:</p>
<ul class="simple">
<li>Modify <tt class="docutils literal">os.listdir(str)</tt> to <strong>silently ignore undecodable filenames</strong>,
instead of returning them as <tt class="docutils literal">bytes</tt></li>
<li>Add <tt class="docutils literal">os.getcwdb()</tt> function: similar to <tt class="docutils literal">os.getcwd()</tt> but returns the
current working directory as <tt class="docutils literal">bytes</tt>.</li>
<li>Support <tt class="docutils literal">bytes</tt> paths:<ul>
<li><tt class="docutils literal">fnmatch.filter()</tt></li>
<li><tt class="docutils literal">glob.glob1()</tt></li>
<li><tt class="docutils literal">glob.iglob()</tt></li>
<li><tt class="docutils literal">open()</tt></li>
<li><tt class="docutils literal">os.path.isabs()</tt></li>
<li><tt class="docutils literal">os.path.issep()</tt></li>
<li><tt class="docutils literal">os.path.join()</tt></li>
<li><tt class="docutils literal">os.path.split()</tt></li>
<li><tt class="docutils literal">os.path.splitext()</tt></li>
<li><tt class="docutils literal">os.path.basename()</tt></li>
<li><tt class="docutils literal">os.path.dirname()</tt></li>
<li><tt class="docutils literal">os.path.splitdrive()</tt></li>
<li><tt class="docutils literal">os.path.ismount()</tt></li>
<li><tt class="docutils literal">os.path.expanduser()</tt></li>
<li><tt class="docutils literal">os.path.expandvars()</tt></li>
<li><tt class="docutils literal">os.path.normpath()</tt></li>
<li><tt class="docutils literal">os.path.abspath()</tt></li>
<li><tt class="docutils literal">os.path.realpath()</tt></li>
</ul>
</li>
</ul>
</div>
<div class="section" id="more-bytes-patches">
<h2>More bytes patches</h2>
<p>I checked whether other functions accepted passing filenames as <tt class="docutils literal">bytes</tt> and... I
was disappointed. It took me some years to fix the full Python standard
library. Examples of issues between 2008 and 2010:</p>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue4035">bpo-4035</a>: Support bytes in <tt class="docutils literal"><span class="pre">os.exec*()</span></tt></li>
<li><a class="reference external" href="https://bugs.python.org/issue4036">bpo-4036</a>: Support bytes in <tt class="docutils literal">subprocess.Popen()</tt></li>
<li><a class="reference external" href="https://bugs.python.org/issue8513">bpo-8513</a>: <tt class="docutils literal">subprocess</tt>: support bytes program name (POSIX)</li>
<li><a class="reference external" href="https://bugs.python.org/issue8514">bpo-8514</a>: Add <tt class="docutils literal">fsencode()</tt> functions to os module</li>
<li><a class="reference external" href="https://bugs.python.org/issue8603">bpo-8603</a>: Create a bytes version of <tt class="docutils literal">os.environ</tt> and <tt class="docutils literal">getenvb()</tt> -- Add <tt class="docutils literal">os.environb</tt></li>
<li><a class="reference external" href="https://bugs.python.org/issue8412">bpo-8412</a>: <tt class="docutils literal">os.system()</tt> doesn't support surrogates nor bytes</li>
<li><a class="reference external" href="https://bugs.python.org/issue8468">bpo-8468</a>: <tt class="docutils literal">bz2</tt> module: support surrogates in filename, and bytes/bytearray filename</li>
<li><a class="reference external" href="https://bugs.python.org/issue8477">bpo-8477</a>: <tt class="docutils literal">ssl</tt> module: support surrogates in filenames, and bytes/bytearray filenames</li>
<li><a class="reference external" href="https://bugs.python.org/issue8640">bpo-8640</a>: <tt class="docutils literal">subprocess:</tt> canonicalize env to bytes on Unix (Python3)</li>
<li><a class="reference external" href="https://bugs.python.org/issue8776">bpo-8776</a>: Bytes version of <tt class="docutils literal">sys.argv</tt> (REJECTED)</li>
</ul>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>At first glance, <strong>Helmut Jarausch</strong>'s <tt class="docutils literal">os.walk()</tt> bug looked trivial to
fix.</p>
<p>I proposed a <strong>new Filename class</strong> storing filenames as <tt class="docutils literal">bytes</tt> and <tt class="docutils literal">str</tt>,
but Guido van Rossum rejected the idea because this API complification
would <em>hinder most people</em>.</p>
<p>Guido van Rossum proposed to <strong>use the replace error handler</strong>, but decoded
filenames were not recognized by the operating system making them useless for
most cases.</p>
<p>I proposed to <strong>use a callback to handle undecodable filenames</strong>, but Guido van
Rossum also rejected this idea because it was too complex and could be written
using os.listdir() with a bytes argument.</p>
<p>Benjamin Peterson proposed to <strong>emit a warning</strong> when a filename cannot be
decoded, but the idea was abandoned because of the complexity of making the
warnings filters emit the warning multiple times.</p>
<p>I wrote a big change modifying <tt class="docutils literal">os.listdir()</tt> to silently ignore undecodable
filenames, and also modifying a lot of functions to accept filenames as
<tt class="docutils literal">bytes</tt>. I made further changes in the following years to fix the full Python
standard library to accept <tt class="docutils literal">bytes</tt>.</p>
<p>While it "only" took 4 months to fix the <tt class="docutils literal">os.listdir(str)</tt> issue, <strong>this kind
of bug would keep me busy for the next 10 years</strong> (2008-2018)...</p>
<p><strong>This article is the first in a series of articles telling the history and
rationale of the Python 3 Unicode model for the operating system.</strong></p>
</div>
How I fixed a very old GIL race condition in Python 3.72018-03-08T10:00:00+01:002018-03-08T10:00:00+01:00Victor Stinnertag:vstinner.github.io,2018-03-08:/python37-gil-change.html<p><strong>It took me 4 years to fix a nasty bug in the famous Python GIL</strong> (Global
Interpreter Lock), one of the most critical parts of Python. I had to dig into the
Git history to find a <strong>change made 26 years ago</strong> by <strong>Guido van Rossum</strong>: at
this time, <em>threads were something esoteric</em>. Let me tell you my story.</p>
<div class="section" id="fatal-python-error-caused-by-a-c-thread-and-the-gil">
<h2>Fatal Python error caused by a C thread and the GIL</h2>
<p>In March 2014, <strong>Steve Dower</strong> reported the bug <a class="reference external" href="https://bugs.python.org/issue20891">bpo-20891</a> when a "C thread" uses the Python C
API:</p>
<blockquote>
<p>In Python 3.4rc3, calling <tt class="docutils literal">PyGILState_Ensure()</tt> from a thread that was
not created by Python and without any calls to <tt class="docutils literal">PyEval_InitThreads()</tt>
will cause a fatal exit:</p>
<p><tt class="docutils literal">Fatal Python error: take_gil: NULL tstate</tt></p>
</blockquote>
<p>My first comment:</p>
<blockquote>
IMO it's a bug in <tt class="docutils literal">PyEval_InitThreads()</tt>.</blockquote>
<a class="reference external image-reference" href="https://twitter.com/kwinkunks/status/619496450834087938"><img alt="Release the GIL!" src="https://vstinner.github.io/images/release_the_gil.png" /></a>
</div>
<div class="section" id="pygilstate-ensure-fix">
<h2>PyGILState_Ensure() fix</h2>
<p>I forgot about the bug for 2 years. In March 2016, I modified Steve's test
program to make it compatible with Linux (the test was written for Windows). I
managed to reproduce the bug on my computer and wrote a fix for
<tt class="docutils literal">PyGILState_Ensure()</tt>.</p>
<p>One year later, in November 2017, <strong>Marcin Kasperski</strong> asked:</p>
<blockquote>
Is this fix released? I can't find it in the changelog…</blockquote>
<p>Oops, again, I completely forgot this issue! This time, not only did I <strong>apply my
PyGILState_Ensure() fix</strong>, but I also wrote the <strong>unit test</strong>
<tt class="docutils literal">test_embed.test_bpo20891()</tt>:</p>
<blockquote>
Ok, the bug is now fixed in Python 2.7, 3.6 and master (future 3.7). On 3.6
and master, the fix comes with an unit test.</blockquote>
<p>My fix for the master branch, commit <a class="reference external" href="https://github.com/python/cpython/commit/b4d1e1f7c1af6ae33f0e371576c8bcafedb099db">b4d1e1f7</a>:</p>
<pre class="literal-block">
bpo-20891: Fix PyGILState_Ensure() (#4650)
When PyGILState_Ensure() is called in a non-Python thread before
PyEval_InitThreads(), only call PyEval_InitThreads() after calling
PyThreadState_New() to fix a crash.
Add an unit test in test_embed.
</pre>
<p>And I closed the issue <a class="reference external" href="https://bugs.python.org/issue20891">bpo-20891</a>...</p>
</div>
<div class="section" id="random-crash-of-the-test-on-macos">
<h2>Random crash of the test on macOS</h2>
<p>Everything was fine... but one week later, I noticed <strong>random</strong> crashes on
macOS buildbots in my newly added unit test. I managed to reproduce the bug
manually; example of a crash on the 3rd run:</p>
<pre class="literal-block">
macbook:master haypo$ while true; do ./Programs/_testembed bpo20891 ||break; date; done
Lun 4 déc 2017 12:46:34 CET
Lun 4 déc 2017 12:46:34 CET
Lun 4 déc 2017 12:46:34 CET
Fatal Python error: PyEval_SaveThread: NULL tstate
Current thread 0x00007fffa5dff3c0 (most recent call first):
Abort trap: 6
</pre>
<p><tt class="docutils literal">test_embed.test_bpo20891()</tt> on macOS showed a race condition in
<tt class="docutils literal">PyGILState_Ensure()</tt>: the creation of the GIL lock itself... was not
protected by a lock! Adding a new lock to check if Python currently has the GIL
lock doesn't make sense...</p>
<p>I proposed an incomplete fix for <tt class="docutils literal">PyThread_start_new_thread()</tt>:</p>
<blockquote>
I found a working fix: call <tt class="docutils literal">PyEval_InitThreads()</tt> in
<tt class="docutils literal">PyThread_start_new_thread()</tt>. So the GIL is created as soon as a second
thread is spawned. The GIL cannot be created anymore while two threads are
running. At least, with the <tt class="docutils literal">python</tt> binary. It doesn't fix the issue if
a thread is not spawned by Python, but this thread calls
<tt class="docutils literal">PyGILState_Ensure()</tt>.</blockquote>
</div>
<div class="section" id="why-not-always-create-the-gil">
<h2>Why not always create the GIL?</h2>
<p><strong>Antoine Pitrou</strong> asked a simple question:</p>
<blockquote>
Why not <em>always</em> call <tt class="docutils literal">PyEval_InitThreads()</tt> at interpreter
initialization? Are there any downsides?</blockquote>
<p>Thanks to <tt class="docutils literal">git blame</tt> and <tt class="docutils literal">git log</tt>, I found the origin of the code
creating the GIL "on demand", <strong>a change made 26 years ago</strong>!</p>
<pre class="literal-block">
commit 1984f1e1c6306d4e8073c28d2395638f80ea509b
Author: Guido van Rossum <guido@python.org>
Date: Tue Aug 4 12:41:02 1992 +0000
* Makefile adapted to changes below.
* split pythonmain.c in two: most stuff goes to pythonrun.c, in the library.
* new optional built-in threadmodule.c, build upon Sjoerd's thread.{c,h}.
* new module from Sjoerd: mmmodule.c (dynamically loaded).
* new module from Sjoerd: sv (svgen.py, svmodule.c.proto).
* new files thread.{c,h} (from Sjoerd).
* new xxmodule.c (example only).
* myselect.h: bzero -> memset
* select.c: bzero -> memset; removed global variable
(...)
+void
+init_save_thread()
+{
+#ifdef USE_THREAD
+ if (interpreter_lock)
+ fatal("2nd call to init_save_thread");
+ interpreter_lock = allocate_lock();
+ acquire_lock(interpreter_lock, 1);
+#endif
+}
+#endif
</pre>
<p>My guess was that the intent of the dynamically created GIL was to reduce the
"overhead" of the GIL for applications using only a single Python thread (which
never spawn a new Python thread).</p>
<p>Luckily, <strong>Guido van Rossum</strong> was around and was able to elaborate the
rationale:</p>
<blockquote>
Yeah, the original reasoning was that <strong>threads were something esoteric and
not used by most code</strong>, and at the time we definitely felt that <strong>always
using the GIL would cause a (tiny) slowdown</strong> and <strong>increase the risk of
crashes</strong> due to bugs in the GIL code. I'd be happy to learn that we no
longer need to worry about this and <strong>can just always initialize it</strong>.</blockquote>
</div>
<div class="section" id="second-fix-for-py-initialize-proposed">
<h2>Second fix for Py_Initialize() proposed</h2>
<p>I proposed a <strong>second fix</strong> for <tt class="docutils literal">Py_Initialize()</tt> to always create the GIL as
soon as Python starts, and no longer "on demand", to prevent any risk of a race
condition:</p>
<pre class="literal-block">
+ /* Create the GIL */
+ PyEval_InitThreads();
</pre>
<p><strong>Nick Coghlan</strong> asked if I could run my patch through the performance
benchmarks. I ran <a class="reference external" href="http://pyperformance.readthedocs.io/">pyperformance</a> on my <a class="reference external" href="https://github.com/python/cpython/pull/4700/">PR 4700</a>. Differences of at least 5%:</p>
<pre class="literal-block">
haypo@speed-python$ python3 -m perf compare_to \
2017-12-18_12-29-master-bd6ec4d79e85.json.gz \
2017-12-18_12-29-master-bd6ec4d79e85-patch-4700.json.gz \
--table --min-speed=5
+----------------------+--------------------------------------+-------------------------------------------------+
| Benchmark | 2017-12-18_12-29-master-bd6ec4d79e85 | 2017-12-18_12-29-master-bd6ec4d79e85-patch-4700 |
+======================+======================================+=================================================+
| pathlib | 41.8 ms | 44.3 ms: 1.06x slower (+6%) |
+----------------------+--------------------------------------+-------------------------------------------------+
| scimark_monte_carlo | 197 ms | 210 ms: 1.07x slower (+7%) |
+----------------------+--------------------------------------+-------------------------------------------------+
| spectral_norm | 243 ms | 269 ms: 1.11x slower (+11%) |
+----------------------+--------------------------------------+-------------------------------------------------+
| sqlite_synth | 7.30 us | 8.13 us: 1.11x slower (+11%) |
+----------------------+--------------------------------------+-------------------------------------------------+
| unpickle_pure_python | 707 us | 796 us: 1.13x slower (+13%) |
+----------------------+--------------------------------------+-------------------------------------------------+
Not significant (55): 2to3; chameleon; chaos; (...)
</pre>
<p>Oh, 5 benchmarks were slower. Performance regressions are not welcome in
Python: we are working hard on <a class="reference external" href="https://lwn.net/Articles/725114/">making Python faster</a>!</p>
</div>
<div class="section" id="skip-the-failing-test-before-christmas">
<h2>Skip the failing test before Christmas</h2>
<p>I didn't expect that 5 benchmarks would be slower. It required further
investigation, but I didn't have time for that, and I was too shy or ashamed to
take the responsibility of pushing a performance regression.</p>
<p>Before the Christmas holidays, no decision had been taken, while
<tt class="docutils literal">test_embed.test_bpo20891()</tt> was still failing randomly on macOS buildbots.
I <strong>was not comfortable touching a critical part of Python</strong>, its GIL, just
before leaving for two weeks. So I decided to skip <tt class="docutils literal">test_bpo20891()</tt> until
I was back.</p>
<p>No gift for you, Python 3.7.</p>
<a class="reference external image-reference" href="https://drawception.com/panel/drawing/0teL3336/charlie-brown-sad-about-small-christmas-tree/"><img alt="Sad Christmas tree" src="https://vstinner.github.io/images/sad_christmas_tree.png" /></a>
</div>
<div class="section" id="new-benchmark-run-and-second-fix-applied-to-master">
<h2>New benchmark run and second fix applied to master</h2>
<p>At the end of January 2018, I again ran the 5 benchmarks made slower by my PR.
I ran these benchmarks manually on my laptop using CPU isolation:</p>
<pre class="literal-block">
vstinner@apu$ python3 -m perf compare_to ref.json patch.json --table
Not significant (5): unpickle_pure_python; sqlite_synth; spectral_norm; pathlib; scimark_monte_carlo
</pre>
<p>Ok, it confirms that my second fix has <strong>no significant impact on
performance</strong> according to the <a class="reference external" href="http://pyperformance.readthedocs.io/">Python "performance" benchmark suite</a>.</p>
<p>I decided to <strong>push my fix</strong> to the master branch, commit <a class="reference external" href="https://github.com/python/cpython/commit/2914bb32e2adf8dff77c0ca58b33201bc94e398c">2914bb32</a>:</p>
<pre class="literal-block">
bpo-20891: Py_Initialize() now creates the GIL (#4700)
The GIL is no longer created "on demand" to fix a race condition when
PyGILState_Ensure() is called in a non-Python thread.
</pre>
<p>Then I reenabled <tt class="docutils literal">test_embed.test_bpo20891()</tt> on the master branch.</p>
</div>
<div class="section" id="no-second-fix-for-python-2-7-and-3-6-sorry">
<h2>No second fix for Python 2.7 and 3.6, sorry!</h2>
<p><strong>Antoine Pitrou</strong> considered that the backport to Python 3.6 <a class="reference external" href="https://github.com/python/cpython/pull/5421#issuecomment-361214537">should not be
merged</a>:</p>
<blockquote>
I don't think so. People can already call <tt class="docutils literal">PyEval_InitThreads()</tt>.</blockquote>
<p><strong>Guido van Rossum</strong> didn't want to backport this change either. So I only
removed <tt class="docutils literal">test_embed.test_bpo20891()</tt> from the 3.6 branch.</p>
<p>I didn't apply my second fix to Python 2.7 either, for the same reason.
Moreover, Python 2.7 got no unit test, since it was too difficult to backport
it.</p>
<p>At least, Python 2.7 and 3.6 got my first <tt class="docutils literal">PyGILState_Ensure()</tt> fix.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>Python still has some race conditions in corner cases. Such a bug was found in
the creation of the GIL when a C thread starts using the Python API. I pushed a
first fix, but a new and different race condition was found on macOS.</p>
<p>I had to dig into the very old history (1992) of the Python GIL. Luckily,
<strong>Guido van Rossum</strong> was also able to elaborate the rationale.</p>
<p>After a glitch in benchmarks, we agreed to modify Python 3.7 to always create
the GIL, instead of creating it "on demand". The change has no significant
impact on performance.</p>
<p>It was also decided to leave Python 2.7 and 3.6 unchanged, to prevent any risk
of regression: continue to create the GIL "on demand".</p>
<p><strong>It took me 4 years to fix a nasty bug in the famous Python GIL.</strong> I am never
comfortable when touching such a <strong>critical part</strong> of Python. I am now happy that
the bug is behind us: it's now fully fixed in the future Python 3.7!</p>
<p>See <a class="reference external" href="https://bugs.python.org/issue20891">bpo-20891</a> for the full story.
Thanks to all developers who helped me to fix this bug!</p>
</div>
Python 3.7 nanoseconds2018-03-06T16:30:00+01:002018-03-06T16:30:00+01:00Victor Stinnertag:vstinner.github.io,2018-03-06:/python37-pep-564-nanoseconds.html<p>Thanks to my <a class="reference external" href="https://vstinner.github.io/python37-perf-counter-nanoseconds.html">latest change on time.perf_counter()</a>, all Python 3.7 clocks now use
nanoseconds as integers internally. It became possible to propose my old
idea again, getting time as nanoseconds at the Python level, so I wrote a new
<a class="reference external" href="https://peps.python.org/pep-0564">PEP 564</a> "Add new time functions with nanosecond resolution". While the PEP
was discussed, I also deprecated <tt class="docutils literal">time.clock()</tt> and removed
<tt class="docutils literal">os.stat_float_times()</tt>.</p>
<a class="reference external image-reference" href="https://www.flickr.com/photos/dkalo/2909921582/"><img alt="Old clock" src="https://vstinner.github.io/images/clock.jpg" /></a>
<div class="section" id="time-clock">
<h2>time.clock()</h2>
<p>Since I wrote the <a class="reference external" href="https://peps.python.org/pep-0418">PEP 418</a> "Add monotonic time, performance counter, and
process time functions" in 2012, I have disliked <tt class="docutils literal">time.clock()</tt>. This clock is not
portable: on Windows it measures wall-clock time, whereas it measures CPU time on
Unix. Extract of the <a class="reference external" href="https://docs.python.org/dev/library/time.html#time.clock">time.clock() documentation</a>:</p>
<blockquote>
<em>Deprecated since version 3.3: The behaviour of this function depends on
the platform: use perf_counter() or process_time() instead, depending on
your requirements, to have a well defined behaviour.</em></blockquote>
<p>My PEP 418 deprecated <tt class="docutils literal">time.clock()</tt> in the documentation. In <a class="reference external" href="https://bugs.python.org/issue31803">bpo-31803</a>, I modified <tt class="docutils literal">time.clock()</tt> and
<tt class="docutils literal"><span class="pre">time.get_clock_info('clock')</span></tt> to also emit a <tt class="docutils literal">DeprecationWarning</tt> warning.
I replaced <tt class="docutils literal">time.clock()</tt> with <tt class="docutils literal">time.perf_counter()</tt> in tests and demos. I
also removed <tt class="docutils literal">hasattr(time, 'monotonic')</tt> in <tt class="docutils literal">test_time</tt> since
<tt class="docutils literal">time.monotonic()</tt> is always available since Python 3.5.</p>
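<p>As a small sketch (not code from the patches themselves), the replacement
recommended by the documentation is straightforward:</p>
<pre class="literal-block">
import time

# time.perf_counter() is the portable replacement for time.clock()
# when measuring elapsed wall-clock time; time.process_time() is the
# replacement when CPU time is wanted.
start = time.perf_counter()
total = sum(range(10**6))  # some workload to measure
elapsed = time.perf_counter() - start

assert total == 499999500000
assert elapsed >= 0.0
</pre>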
</div>
<div class="section" id="os-stat-float-times">
<h2>os.stat_float_times()</h2>
<p>The <tt class="docutils literal">os.stat_float_times()</tt> function was introduced in Python 2.3 to get file
modification times with sub-second resolution (commit <a class="reference external" href="https://github.com/python/cpython/commit/f607bdaa77475ec8c94614414dc2cecf8fd1ca0a">f607bdaa</a>),
the default was still to get time as seconds (integer). The function was
introduced to provide a smooth transition to time as floating point numbers, while
keeping backward compatibility with Python 2.2.</p>
<p><tt class="docutils literal">os.stat()</tt> was modified to return time as float by default in Python 2.5
(commit <a class="reference external" href="https://github.com/python/cpython/commit/fe33d0ba87f5468b50f939724b303969711f3be5">fe33d0ba</a>).
Python 2.5 was released 11 years ago; I consider that people have had enough time to
migrate their code to float time :-) I modified <tt class="docutils literal">os.stat_float_times()</tt> in
Python 3.1 to emit a <tt class="docutils literal">DeprecationWarning</tt> warning (commit <a class="reference external" href="https://github.com/python/cpython/commit/034d0aa2171688c40cee1a723ddcdb85bbce31e8">034d0aa2</a>
of <a class="reference external" href="https://bugs.python.org/issue14711">bpo-14711</a>).</p>
<p>Finally, I removed <tt class="docutils literal">os.stat_float_times()</tt> in Python 3.7: <a class="reference external" href="https://bugs.python.org/issue31827">bpo-31827</a>.</p>
<p>Serhiy Storchaka proposed to also remove the last three items from
<tt class="docutils literal">os.stat_result</tt>. For example, <tt class="docutils literal">stat_result[stat.ST_MTIME]</tt> could be
replaced with <tt class="docutils literal">stat_result.st_mtime</tt>. But I tried to remove these items and
it broke the <tt class="docutils literal">logging</tt> module, so I decided to leave them unchanged.</p>
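<p>To illustrate the current behaviour (a sketch, not code from the issue):
since the removal, <tt class="docutils literal">os.stat()</tt> simply always returns float timestamps,
and the <tt class="docutils literal">st_*_ns</tt> attributes (added in Python 3.3) expose the same values
as integer nanoseconds:</p>
<pre class="literal-block">
import os

# os.stat() returns timestamps as floats by default since Python 2.5;
# the st_mtime_ns attribute gives the same value as an integer number
# of nanoseconds, without rounding.
st = os.stat(".")
assert isinstance(st.st_mtime, float)
assert isinstance(st.st_mtime_ns, int)
# the float value is the nanosecond value scaled to seconds
assert abs(st.st_mtime - st.st_mtime_ns / 10**9) < 1.0
</pre>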
</div>
<div class="section" id="pep-564-time-time-ns">
<h2>PEP 564: time.time_ns()</h2>
<p>Six years ago (2012), I wrote the <a class="reference external" href="https://peps.python.org/pep-0410">PEP 410</a> "Use decimal.Decimal type for
timestamps" which proposed a large and complex change in all Python functions
returning time to support nanosecond resolution using the <tt class="docutils literal">decimal.Decimal</tt>
type. The PEP was <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2012-February/116837.html">rejected for different reasons</a>.</p>
<p>Since all clocks now use nanoseconds internally in Python 3.7, I proposed a new
<a class="reference external" href="https://peps.python.org/pep-0564">PEP 564</a> "Add new time functions with nanosecond resolution". Abstract:</p>
<blockquote>
<p>Add six new "nanosecond" variants of existing functions to the <tt class="docutils literal">time</tt>
module: <tt class="docutils literal">clock_gettime_ns()</tt>, <tt class="docutils literal">clock_settime_ns()</tt>,
<tt class="docutils literal">monotonic_ns()</tt>, <tt class="docutils literal">perf_counter_ns()</tt>, <tt class="docutils literal">process_time_ns()</tt> and
<tt class="docutils literal">time_ns()</tt>. While similar to the existing functions without the
<tt class="docutils literal">_ns</tt> suffix, they provide nanosecond resolution: they return a number of
nanoseconds as a Python <tt class="docutils literal">int</tt>.</p>
<p>The <tt class="docutils literal">time.time_ns()</tt> resolution is 3 times better than the <tt class="docutils literal">time.time()</tt>
resolution on Linux and Windows.</p>
</blockquote>
<p>People were not convinced by the need for nanosecond resolution, so I
added an "Issues caused by precision loss" section with 2 examples:</p>
<ul class="simple">
<li>Example 1: measure time delta in long-running process</li>
<li>Example 2: compare times with different resolution</li>
</ul>
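<p>The precision loss can be sketched in a few lines (the timestamp below is
arbitrary): a C <tt class="docutils literal">double</tt> has a 53-bit mantissa, so a 2018-era timestamp
stored as a float only has a resolution of roughly 238 nanoseconds, and
converting it back to nanoseconds cannot recover the exact value:</p>
<pre class="literal-block">
# A nanosecond timestamp from 2018 and the same instant as a float
# (seconds): the float cannot represent it exactly, because around
# 1.5e9 seconds a 64-bit float only has ~238 ns of resolution.
t_ns = 1_520_354_388_319_257_562   # int nanoseconds, like time.time_ns()
t = t_ns / 10**9                   # float seconds, like time.time()

# converting the float back to nanoseconds loses the exact value
assert int(t * 10**9) != t_ns
</pre>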
<p>As for my previous PEP 410, many people proposed many alternatives recorded in
the PEP: sub-nanosecond resolution, modifying <tt class="docutils literal">time.time()</tt> result type,
different types, different API, a new module, etc.</p>
<p>Fortunately for me, Guido van Rossum quickly approved my PEP for Python 3.7!</p>
</div>
<div class="section" id="implementaton-of-the-pep-564">
<h2>Implementation of the PEP 564</h2>
<p>I implemented my PEP 564 in <a class="reference external" href="https://bugs.python.org/issue31784">bpo-31784</a>
with the commit <a class="reference external" href="https://github.com/python/cpython/commit/c29b585fd4b5a91d17fc5dd41d86edff28a30da3">c29b585f</a>.
I added 6 new time functions:</p>
<ul class="simple">
<li><tt class="docutils literal">time.clock_gettime_ns()</tt></li>
<li><tt class="docutils literal">time.clock_settime_ns()</tt></li>
<li><tt class="docutils literal">time.monotonic_ns()</tt></li>
<li><tt class="docutils literal">time.perf_counter_ns()</tt></li>
<li><tt class="docutils literal">time.process_time_ns()</tt></li>
<li><tt class="docutils literal">time.time_ns()</tt></li>
</ul>
<p>Example:</p>
<pre class="literal-block">
$ python3.7
Python 3.7.0b2+ (heads/3.7:31e2b76f7b, Mar 6 2018, 15:31:29)
[GCC 7.2.1 20170915 (Red Hat 7.2.1-2)] on linux
>>> import time
>>> time.time()
1520354387.7663522
>>> time.time_ns()
1520354388319257562
</pre>
<p>I also added tests on <tt class="docutils literal">os.times()</tt> in <tt class="docutils literal">test_os</tt>; previously the function
wasn't tested at all!</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>I added 6 new functions to get time with nanosecond resolution, such as
<tt class="docutils literal">time.time_ns()</tt> with my approved <a class="reference external" href="https://peps.python.org/pep-0564">PEP 564</a>. I also modified
<tt class="docutils literal">time.clock()</tt> to emit a <tt class="docutils literal">DeprecationWarning</tt> and I removed the legacy
<tt class="docutils literal">os.stat_float_times()</tt> function.</p>
</div>
Python 3.7 perf_counter() nanoseconds2018-03-06T15:00:00+01:002018-03-06T15:00:00+01:00Victor Stinnertag:vstinner.github.io,2018-03-06:/python37-perf-counter-nanoseconds.html<p>Since 2012, I have been trying to convert all Python clocks to use
nanoseconds internally. The last clock which still used floating point internally was
<tt class="docutils literal">time.perf_counter()</tt>. INADA Naoki's new importtime tool was an opportunity
for me to take a new look at a tricky integer overflow issue.</p>
<div class="section" id="modify-importtime-to-use-time-perf-counter-clock">
<h2>Modify importtime …</h2></div><p>Since 2012, I have been trying to convert all Python clocks to use
nanoseconds internally. The last clock which still used floating point internally was
<tt class="docutils literal">time.perf_counter()</tt>. INADA Naoki's new importtime tool was an opportunity
for me to take a new look at a tricky integer overflow issue.</p>
<div class="section" id="modify-importtime-to-use-time-perf-counter-clock">
<h2>Modify importtime to use time.perf_counter() clock</h2>
<p>INADA Naoki added to Python 3.7 a cool new <a class="reference external" href="https://docs.python.org/dev/using/cmdline.html#id5">-X importtime</a> command line option to
analyze Python import performance. This tool can be used to optimize the
startup time of your application. Example:</p>
<pre class="literal-block">
vstinner@apu$ ./python -X importtime -c pass
import time: self [us] | cumulative | imported package
(...)
import time: 901 | 1902 | io
import time: 374 | 374 | _stat
import time: 663 | 1037 | stat
import time: 617 | 617 | genericpath
import time: 877 | 1493 | posixpath
import time: 3840 | 3840 | _collections_abc
import time: 2106 | 8474 | os
import time: 674 | 674 | _sitebuiltins
import time: 922 | 922 | sitecustomize
import time: 598 | 598 | usercustomize
import time: 1444 | 12110 | site
</pre>
<p>Read Naoki's article <a class="reference external" href="https://dev.to/methane/how-to-speed-up-python-application-startup-time-nkf">How to speed up Python application startup time</a>
(Jan 19, 2018) for a concrete analysis of <tt class="docutils literal">pipenv</tt> performance.</p>
<p>Naoki chose to use the <tt class="docutils literal">time.monotonic()</tt> clock internally to measure elapsed
time. On Windows, this clock (<tt class="docutils literal">GetTickCount64()</tt> function) has a resolution
around 15.6 ms, whereas most Python imports take less than 10 ms, and so most
numbers are just zeros. Example:</p>
<pre class="literal-block">
f:\dev\3x>python -X importtime -c "import idlelib.pyshell"
Running Debug|Win32 interpreter...
import time: self [us] | cumulative | imported package
import time: 0 | 0 | _codecs
import time: 0 | 0 | codecs
import time: 0 | 0 | encodings.aliases
import time: 15000 | 15000 | encodings
import time: 0 | 0 | encodings.utf_8
import time: 0 | 0 | _signal
import time: 0 | 0 | encodings.latin_1
import time: 0 | 0 | _weakrefset
import time: 0 | 0 | abc
import time: 0 | 0 | io
import time: 0 | 0 | _stat
(...)
</pre>
<p>In <a class="reference external" href="https://bugs.python.org/issue31415">bpo-31415</a>, I fixed the issue by
adding a new C function <tt class="docutils literal">_PyTime_GetPerfCounter()</tt> to access the
<tt class="docutils literal">time.perf_counter()</tt> clock at the C level and I modified "importtime" to use
it.</p>
<p>Problem solved! ... almost...</p>
</div>
<div class="section" id="double-integer-float-conversions">
<h2>Double integer-float conversions</h2>
<p>My commit <a class="reference external" href="https://github.com/python/cpython/commit/a997c7b434631f51e00191acea2ba6097691e859">a997c7b4</a>
of <a class="reference external" href="https://bugs.python.org/issue31415">bpo-31415</a> adding
<tt class="docutils literal">_PyTime_GetPerfCounter()</tt> moved the C code from <tt class="docutils literal">Modules/timemodule.c</tt> to
<tt class="docutils literal">Python/pytime.c</tt>, but also changed the internal type storing time from
floating point number (C <tt class="docutils literal">double</tt>) to integer number (<tt class="docutils literal">_PyTime_t</tt>, which
is <tt class="docutils literal">int64_t</tt> in practice).</p>
<p>The drawback of this change is that <tt class="docutils literal">time.perf_counter()</tt> now converts
<tt class="docutils literal">QueryPerformanceCounter() / QueryPerformanceFrequency()</tt> float into a
<tt class="docutils literal">_PyTime_t</tt> (integer) and then back to a float, and these conversions cause a
precision loss. I computed that the conversions start to lose precision
after a single second with <tt class="docutils literal">QueryPerformanceFrequency()</tt> equal to
<tt class="docutils literal">3,579,545</tt> Hz (3.6 MHz).</p>
<p>To fix the precision loss, I modified again <tt class="docutils literal">time.clock()</tt> and
<tt class="docutils literal">time.perf_counter()</tt> to not use <tt class="docutils literal">_PyTime_t</tt> anymore, only double.</p>
</div>
<div class="section" id="grumpy-victor">
<h2>Grumpy Victor</h2>
<img alt="Grumpy" src="https://vstinner.github.io/images/grumpy.jpg" />
<p>My change to replace <tt class="docutils literal">_PyTime_t</tt> with <tt class="docutils literal">double</tt> made me grumpy. I had been
trying to convert all Python clocks to <tt class="docutils literal">_PyTime_t</tt> for 6 years (since 2012).</p>
<p>Being blocked by a single clock made me grumpy, especially because the issue
is specific to the Windows implementation. The Linux implementation of
<tt class="docutils literal">time.perf_counter()</tt> uses <tt class="docutils literal">clock_gettime()</tt> which directly returns
nanoseconds as integers, no division needed to get time as <tt class="docutils literal">_PyTime_t</tt>.</p>
<p>I looked at the clock sources in the Linux kernel source code:
<a class="reference external" href="https://github.com/torvalds/linux/blob/master/kernel/time/clocksource.c">kernel/time/clocksource.c</a>.
Linux clocks only use integers and support nanosecond resolution. I'm always
impressed by the quality of the Linux kernel source code; the code is
straightforward C. If Linux is able to use integers for various kinds of
clocks, I should be able to use integers for my specific Windows
implementations of <tt class="docutils literal">time.perf_counter()</tt>, no?</p>
<p>In practice, a <tt class="docutils literal">_PyTime_t</tt> value is a number of nanoseconds, so the computation
is:</p>
<pre class="literal-block">
(QueryPerformanceCounter() * 1_000_000_000) / QueryPerformanceFrequency()
</pre>
<p>where <tt class="docutils literal">1_000_000_000</tt> is the number of nanoseconds in one second. <strong>The problem
is preventing integer overflow</strong> in the first part, using <tt class="docutils literal">_PyTime_t</tt> which is
<tt class="docutils literal">int64_t</tt> in practice:</p>
<pre class="literal-block">
QueryPerformanceCounter() * 1_000_000_000
</pre>
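<p>To give an idea of the magnitude (a back-of-the-envelope sketch, assuming the
common 10 MHz frequency listed later in this article): the naive product
overflows a signed 64-bit integer after only about 15 minutes of counter ticks.</p>
<pre class="literal-block">
# How quickly does ticks * SEC_TO_NS overflow a signed 64-bit integer?
INT64_MAX = 2**63 - 1
SEC_TO_NS = 10**9
frequency = 10_000_000   # 10 MHz, a common QueryPerformanceFrequency() value

# the product overflows once ticks > INT64_MAX // SEC_TO_NS ticks,
# which corresponds to this many seconds of counter time:
overflow_after = (INT64_MAX // SEC_TO_NS) // frequency
assert overflow_after == 922   # seconds, i.e. about 15 minutes
</pre>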
</div>
<div class="section" id="some-maths-to-avoid-the-precision-loss">
<h2>Some maths to avoid the precision loss</h2>
<p>Using a pencil, a sheet of paper and some maths, I found a solution!</p>
<pre class="literal-block">
(a * b) / q == (a / q) * b + ((a % q) * b) / q
</pre>
<img alt="Math rocks" src="https://vstinner.github.io/images/math_rocks.jpg" />
<p>This prevents the risk of integer overflow. C implementation:</p>
<pre class="literal-block">
Py_LOCAL_INLINE(_PyTime_t)
_PyTime_MulDiv(_PyTime_t ticks, _PyTime_t mul, _PyTime_t div)
{
    _PyTime_t intpart, remaining;
    /* Compute (ticks * mul / div) in two parts to prevent integer overflow:
       compute the integer part, and then the remaining part.
       (ticks * mul) / div == (ticks / div) * mul + (ticks % div) * mul / div
       The caller must ensure that "(div - 1) * mul" cannot overflow. */
    intpart = ticks / div;
    ticks %= div;
    remaining = ticks * mul;
    remaining /= div;
    return intpart * mul + remaining;
}
</pre>
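<p>The identity is easy to check with Python's arbitrary-precision integers (a
sketch, not CPython code): for nonnegative operands, the two-part computation
gives exactly the same result as the naive product.</p>
<pre class="literal-block">
def muldiv(ticks, mul, div):
    # Compute (ticks * mul) // div in two parts, like _PyTime_MulDiv();
    # in C this avoids overflowing ticks * mul whenever the final result
    # itself fits in the integer type.
    intpart, remaining = divmod(ticks, div)
    return intpart * mul + (remaining * mul) // div

SEC_TO_NS = 10**9
frequency = 3_579_545   # Hz, a known QueryPerformanceFrequency() value

# a tick count whose naive product would overflow int64
ticks = 2**62 + 12_345
assert muldiv(ticks, SEC_TO_NS, frequency) == (ticks * SEC_TO_NS) // frequency
</pre>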
<p>Simplified Windows implementation of perf_counter():</p>
<pre class="literal-block">
_PyTime_t win_perf_counter(void)
{
    LARGE_INTEGER freq;
    LONGLONG frequency;
    LARGE_INTEGER now;
    LONGLONG ticksll;
    _PyTime_t ticks;

    (void)QueryPerformanceFrequency(&freq);
    frequency = freq.QuadPart;

    QueryPerformanceCounter(&now);
    ticksll = now.QuadPart;
    ticks = (_PyTime_t)ticksll;

    return _PyTime_MulDiv(ticks, SEC_TO_NS, (_PyTime_t)frequency);
}
</pre>
<p>On Windows, I added the following sanity checks to make sure that integer
overflows cannot occur:</p>
<pre class="literal-block">
/* Check that frequency can be cast to _PyTime_t.
   Make also sure that (ticks * SEC_TO_NS) cannot overflow in
   _PyTime_MulDiv(), with ticks < frequency.

   Known QueryPerformanceFrequency() values:

   * 10,000,000 (10 MHz): 100 ns resolution
   * 3,579,545 Hz (3.6 MHz): 279 ns resolution

   None of these frequencies can overflow with 64-bit _PyTime_t, but
   check for overflow, just in case. */
if (frequency > _PyTime_MAX
    || frequency > (LONGLONG)_PyTime_MAX / (LONGLONG)SEC_TO_NS) {
    PyErr_SetString(PyExc_OverflowError,
                    "QueryPerformanceFrequency is too large");
    return -1;
}
</pre>
<p>Since I also modified the macOS implementation of <tt class="docutils literal">time.monotonic()</tt> to use
<tt class="docutils literal">_PyTime_MulDiv()</tt>, I also added this check for macOS:</p>
<pre class="literal-block">
/* Make sure that (ticks * timebase.numer) cannot overflow in
   _PyTime_MulDiv(), with ticks < timebase.denom.

   Known time bases:

   * always (1, 1) on Intel
   * (1000000000, 33333335) or (1000000000, 25000000) on PowerPC

   None of these time bases can overflow with 64-bit _PyTime_t, but
   check for overflow, just in case. */
if ((_PyTime_t)timebase.numer > _PyTime_MAX / (_PyTime_t)timebase.denom) {
    PyErr_SetString(PyExc_OverflowError,
                    "mach_timebase_info is too large");
    return -1;
}
</pre>
</div>
<div class="section" id="pytime-c-source-code">
<h2>pytime.c source code</h2>
<p>If you are curious, the full code lives at <a class="reference external" href="https://github.com/python/cpython/blob/master/Python/pytime.c">Python/pytime.c</a> and is
currently around 1,100 lines of C code.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>INADA Naoki's importtime tool was using <tt class="docutils literal">time.monotonic()</tt> clock which failed
to measure short import times on Windows. I modified it to use
<tt class="docutils literal">time.perf_counter()</tt> internally to get better precision on Windows. I
identified a precision loss caused by my internal <tt class="docutils literal">_PyTime_t</tt> type to store
time as nanoseconds. Thanks to maths, I succeeded in using nanoseconds while
preventing any risk of integer overflow.</p>
</div>
My contributions to CPython during 2017 Q3: Part 3 (funny bugs)2017-10-19T16:00:00+02:002017-10-19T16:00:00+02:00Victor Stinnertag:vstinner.github.io,2017-10-19:/contrib-cpython-2017q3-part3.html<p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q3
(July, August, September), Part 3 (funny bugs).</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part2.html">My contributions to CPython during 2017 Q3: Part 2 (dangling
threads)</a>.</p>
<p>Summary:</p>
<ul class="simple">
<li>FreeBSD bug: minor() device regression</li>
<li>regrtest snowball effect when hunting memory leaks</li>
<li>Bugfixes</li>
<li>Other Changes</li>
</ul>
<div class="section" id="freebsd-bug-minor-device-regression">
<h2>FreeBSD bug: minor() device regression</h2>
<a class="reference external image-reference" href="https://www.freebsd.org/"><img alt="Logo of the FreeBSD project" src="https://vstinner.github.io/images/freebsd.png" /></a>
<p><a class="reference external" href="https://bugs.python.org/issue31044">bpo-31044</a>: The …</p></div><p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q3
(July, August, September), Part 3 (funny bugs).</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part2.html">My contributions to CPython during 2017 Q3: Part 2 (dangling
threads)</a>.</p>
<p>Summary:</p>
<ul class="simple">
<li>FreeBSD bug: minor() device regression</li>
<li>regrtest snowball effect when hunting memory leaks</li>
<li>Bugfixes</li>
<li>Other Changes</li>
</ul>
<div class="section" id="freebsd-bug-minor-device-regression">
<h2>FreeBSD bug: minor() device regression</h2>
<a class="reference external image-reference" href="https://www.freebsd.org/"><img alt="Logo of the FreeBSD project" src="https://vstinner.github.io/images/freebsd.png" /></a>
<p><a class="reference external" href="https://bugs.python.org/issue31044">bpo-31044</a>: The test_makedev() of
test_posix started to fail in the build 632 (Wed Jul 26 10:47:01 2017) of AMD64
FreeBSD CURRENT. The test failed on debug but also on non-debug buildbots, in
the master and 3.6 branches. It looked more like a change on the buildbot itself, maybe a
FreeBSD upgrade?</p>
<p>Thanks to <strong>koobs</strong>, I had SSH access to the buildbot. I was able to
reproduce the bug manually. I noticed that minor() truncates most significant
bits.</p>
<p>I continued my analysis and found that, on May 23, the FreeBSD <tt class="docutils literal">dev_t</tt> type
changed from 32 bits to 64 bits in the kernel, but the <tt class="docutils literal">minor()</tt> userland
function was not updated.</p>
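<p>The property that test_makedev() essentially checks, and that the FreeBSD bug
broke for 64-bit device numbers, is the round trip between a device number and
its (major, minor) pair. A minimal sketch (Unix-only functions):</p>
<pre class="literal-block">
import os

# Composing a device number from (major, minor) and decomposing it again
# must give back the original values; the FreeBSD bug truncated the
# result of minor() to 32 bits once dev_t became 64-bit.
major, minor = 1234, 5678
dev = os.makedev(major, minor)
assert os.major(dev) == major
assert os.minor(dev) == minor
</pre>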
<p>I reported a bug to FreeBSD: <a class="reference external" href="https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221048">Bug 221048 - minor() truncates device number to
32 bits, whereas dev_t type was extended to 64 bits</a>.</p>
<p>In the meanwhile, I skipped test_posix.test_makedev() on FreeBSD if <tt class="docutils literal">dev_t</tt>
is larger than 32-bit.</p>
<p>Fortunately, the FreeBSD bug was quickly fixed!</p>
</div>
<div class="section" id="regrtest-snowball-effect-when-hunting-memory-leaks">
<h2>regrtest snowball effect when hunting memory leaks</h2>
<p>While trying to fix all reference leaks on the new Windows and Linux "Refleaks"
buildbots, I reported the bug <a class="reference external" href="https://bugs.python.org/issue31217">bpo-31217</a>:</p>
<pre class="literal-block">
test_code leaked [1, 1, 1] memory blocks, sum=3
</pre>
<p>Two weeks after reporting the bug, I was able to reproduce the bug, but <strong>only
with Python compiled in 32-bit mode</strong>. Strange.</p>
<p>I spent one day understanding the bug. I removed as much as possible while
making sure that I could still reproduce the bug. At the end, I wrote <a class="reference external" href="https://bugs.python.org/file47114/leak2.py">leak2.py</a> which reproduces the bug with a
single import: <tt class="docutils literal">import sys</tt>. Even though the script is only 86 lines long, I was
still unable to understand the bug.</p>
<p>My first hypothesis:</p>
<blockquote>
It seems like the "leak" is the call to <tt class="docutils literal">sys.getallocatedblocks()</tt> which
creates a new integer, and the integer is kept alive between two loop
iterations.</blockquote>
<p><strong>Antoine Pitrou</strong> rejected it:</p>
<blockquote>
I doubt it. If that was the case, the reference count would increase as
well.</blockquote>
<p>It was Antoine Pitrou who understood the bug:</p>
<pre class="literal-block">
Ahah.
Actually, it's quite simple :-) On 64-bit Python:
>>> id(82914 - 82913) == id(1)
True
On 32-bit Python:
>>> id(82914 - 82913) == id(1)
False
So the first non-zero alloc_delta really has a snowball effect, as it
creates new memory block which will produce a non-zero alloc_delta on the
next run, etc.
</pre>
<p>I implemented Antoine's idea to fix the bug, <a class="reference external" href="https://github.com/python/cpython/commit/6c2feabc5dac2f3049b15134669e9ad5af573193">commit</a>:</p>
<pre class="literal-block">
Use a pool of integer objects to prevent false alarm when checking for
memory block leaks. Fill the pool with values in -1000..1000 which
are the most common (reference, memory block, file descriptor)
differences.
Co-Authored-By: Antoine Pitrou <pitrou@free.fr>
</pre>
<p>The bug is probably as old as the code hunting memory leaks.</p>
</div>
<div class="section" id="bugfixes">
<h2>Bugfixes</h2>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue30891">bpo-30891</a>: Second fix for
importlib <tt class="docutils literal">_find_and_load()</tt> to handle correctly parallelism with threads.
Call <tt class="docutils literal">sys.modules.get()</tt> in the <tt class="docutils literal">with _ModuleLockManager(name):</tt> block to
protect the dictionary key with the module lock and use an atomic get to
prevent race conditions.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31019">bpo-31019</a>:
<tt class="docutils literal">multiprocessing.Process.is_alive()</tt> now removes the process from the
<tt class="docutils literal">_children set</tt> if the process completed. The change prevents leaking
"dangling" processes.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31326">bpo-31326</a>, <tt class="docutils literal">concurrent.futures</tt>:
<tt class="docutils literal">ProcessPoolExecutor.shutdown()</tt> now explicitly closes the call queue.
Moreover, <tt class="docutils literal">shutdown(wait=True)</tt> now also joins the call queue thread, to
prevent leaking a dangling thread.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31170">bpo-31170</a>: Update libexpat from
2.2.3 to 2.2.4: fix copying of partial characters for UTF-8 input (<a class="reference external" href="https://github.com/libexpat/libexpat/issues/115">libexpat
bug 115</a>). Later, I also
wrote non-regression tests for this bug (libexpat doesn't have any test
for this bug).</li>
<li><a class="reference external" href="https://bugs.python.org/issue31499">bpo-31499</a>, <tt class="docutils literal">xml.etree</tt>:
<tt class="docutils literal">xmlparser_gc_clear()</tt> now sets self.parser to <tt class="docutils literal">NULL</tt> to prevent a crash
in <tt class="docutils literal">xmlparser_dealloc()</tt> if <tt class="docutils literal">xmlparser_gc_clear()</tt> was called previously
by the garbage collector, because the parser was part of a reference cycle.
Fix co-written with <strong>Serhiy Storchaka</strong>.</li>
<li><a class="reference external" href="https://bugs.python.org/issue30892">bpo-30892</a>: Fix <tt class="docutils literal">_elementtree</tt>
module initialization (accelerator of <tt class="docutils literal">xml.etree</tt>), handle correctly
<tt class="docutils literal">getattr(copy, 'deepcopy')</tt> failure to not fail with an assertion error.</li>
</ul>
</div>
<div class="section" id="other-changes">
<h2>Other Changes</h2>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue30866">bpo-30866</a>: Add _testcapi.stack_pointer(). I used it to write the "Stack
consumption" section of a previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q1.html">My contributions to CPython
during 2017 Q1</a></li>
<li><tt class="docutils literal">_ssl</tt>: Fix compiler warning. Cast Py_buffer.len (Py_ssize_t, signed) to
size_t (unsigned) to prevent the "comparison between signed and unsigned
integer expressions" warning.</li>
<li><a class="reference external" href="https://bugs.python.org/issue30486">bpo-30486</a>: Make cell_set_contents() symbol private. Don't export the
<tt class="docutils literal">cell_set_contents()</tt> symbol in the C API.</li>
</ul>
</div>
My contributions to CPython during 2017 Q3: Part 2 (dangling threads)2017-10-19T15:00:00+02:002017-10-19T15:00:00+02:00Victor Stinnertag:vstinner.github.io,2017-10-19:/contrib-cpython-2017q3-part2.html<p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q3
(July, August, September), Part 2: "Dangling threads".</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part1.html">My contributions to CPython during 2017 Q3: Part 1</a>.</p>
<p>Next reports:</p>
<ul class="simple">
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part3.html">My contributions to CPython during 2017 Q3: Part 3 (funny bugs)</a>.</li>
</ul>
<p>Summary:</p>
<ul class="simple">
<li>Bugfixes: Reference cycles</li>
<li>socketserver leaking threads and processes<ul>
<li>test_logging random bug …</li></ul></li></ul><p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q3
(July, August, September), Part 2: "Dangling threads".</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part1.html">My contributions to CPython during 2017 Q3: Part 1</a>.</p>
<p>Next reports:</p>
<ul class="simple">
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part3.html">My contributions to CPython during 2017 Q3: Part 3 (funny bugs)</a>.</li>
</ul>
<p>Summary:</p>
<ul class="simple">
<li>Bugfixes: Reference cycles</li>
<li>socketserver leaking threads and processes<ul>
<li>test_logging random bug</li>
<li>Skip failing tests</li>
<li>Fix socketserver for processes</li>
<li>Fix socketserver for threads</li>
<li>Issue not done yet</li>
</ul>
</li>
<li>Environment altered and dangling threads<ul>
<li>Environment changed</li>
<li>test.support and regrtest enhancements</li>
<li>multiprocessing bug fixes</li>
<li>concurrent.futures bug fixes</li>
<li>test_threading and test_thread</li>
<li>Other fixes</li>
</ul>
</li>
</ul>
<div class="section" id="bugfixes-reference-cycles">
<h2>Bugfixes: Reference cycles</h2>
<p>While fixing "dangling threads" (see below), I found and fixed 4 reference
cycles which caused memory leaks and objects to live longer than expected. I
was surprised that the bug in the common <tt class="docutils literal">socket.create_connection()</tt>
function was not noticed before! So my work on dangling threads was useful!</p>
<p>The typical pattern of such reference cycle is:</p>
<pre class="literal-block">
def func():
    err = None
    try:
        do_something()
    except Exception as exc:
        err = exc
    if err is not None:
        handle_error(err)
    # the exception is stored in the 'err' variable

func()
# surprise, surprise, the exception is still alive at this point!
</pre>
<p>Or the variant:</p>
<pre class="literal-block">
def func():
    try:
        do_something()
    except Exception as exc:
        exc_info = sys.exc_info()
        handle_error(exc_info)
    # the exception is stored in the 'exc_info' variable

func()
# surprise, surprise, the exception is still alive at this point!
</pre>
<p>It's not easy to spot the bug: it is subtle. An exception object in Python
3 has a <tt class="docutils literal">__traceback__</tt> attribute which contains frames. If a frame stores
the exception in a variable, like <tt class="docutils literal">err</tt> in the first example, or <tt class="docutils literal">exc_info</tt>
in the second example, a cycle exists between the exception and frames. In this
case, the exception, the traceback, the frames, <strong>and all variables of all
frames are kept alive</strong> by the reference cycle, <strong>until the cycle is broken by
the garbage collector</strong>.</p>
<p>The problem is that the garbage collector runs only infrequently, so the
cycle may stay alive for a long time.</p>
<p>Sometimes, the reference cycle is even more subtle than the simple examples
above.</p>
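<p>A hedged sketch of the general fix pattern used in the changes below
(<tt class="docutils literal">do_something()</tt> and <tt class="docutils literal">handle_error()</tt> are placeholders, not
functions from the stdlib): clear the local variable in a <tt class="docutils literal">finally</tt>
block, so the frame no longer keeps the exception alive.</p>
<pre class="literal-block">
def do_something():
    raise ValueError("boom")        # placeholder for real work

def handle_error(exc):
    return "handled: %s" % exc      # placeholder for real error handling

def func():
    err = None
    result = None
    try:
        do_something()
    except Exception as exc:
        err = exc
    try:
        if err is not None:
            result = handle_error(err)
    finally:
        # Break the cycle: drop the frame's reference to the exception,
        # whose __traceback__ references this frame.
        err = None
    return result

assert func() == "handled: boom"
</pre>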
<p>Fixed reference cycles:</p>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>,
<tt class="docutils literal">socket.create_connection()</tt>: Fix reference cycle.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31247">bpo-31247</a>: <tt class="docutils literal">xmlrpc.server</tt> now explicitly breaks reference cycles when using
<tt class="docutils literal">sys.exc_info()</tt> in code handling exceptions.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31249">bpo-31249</a>, <tt class="docutils literal">concurrent.futures</tt>:
<tt class="docutils literal">WorkItem.run()</tt> used by ThreadPoolExecutor now explicitly breaks a
reference cycle between an exception object and the <tt class="docutils literal">WorkItem</tt> object.
<tt class="docutils literal">ThreadPoolExecutor.shutdown()</tt> now also clears its threads set.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31238">bpo-31238</a>: <tt class="docutils literal">pydoc</tt>:
<tt class="docutils literal">ServerThread.stop()</tt> now joins itself to wait until
<tt class="docutils literal">DocServer.serve_until_quit()</tt> completes and then explicitly sets its
docserver attribute to None to break a reference cycle. This change was made
to fix <tt class="docutils literal">test_doc</tt>.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31323">bpo-31323</a>: Fix reference leak in
test_ssl. Store exceptions as strings rather than objects to prevent reference
cycles which leak dangling threads.</li>
</ul>
<p>I also started a discussion on reference cycles caused by exceptions:
<a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-September/149586.html">[Python-Dev] Evil reference cycles caused by Exception.__traceback__</a>.
Sadly, no action was taken: no obvious solution was found.</p>
<p>I found the <tt class="docutils literal">socket.create_connection()</tt> reference cycle because of an
unrelated change in test.support:</p>
<pre class="literal-block">
bpo-29639: change test.support.HOST to "localhost"
</pre>
<p>Read <a class="reference external" href="https://bugs.python.org/issue29639#msg302087">my message</a> on bpo-29639
for the full story. Extract:</p>
<blockquote>
Modifying support.HOST to "localhost" triggered a reference cycle!?</blockquote>
</div>
<div class="section" id="socketserver-leaking-threads-and-processes">
<h2>socketserver leaking threads and processes</h2>
<div class="section" id="test-logging-random-bug">
<h3>test_logging random bug</h3>
<p>This story started on July 3, with test_logging failing randomly on FreeBSD,
<a class="reference external" href="https://bugs.python.org/issue30830">bpo-30830</a>:</p>
<pre class="literal-block">
test_output (test.test_logging.HTTPHandlerTest) ... ok
Warning -- threading_cleanup() failed to cleanup -1 threads after 3 sec (count: 0, dangling: 1)
</pre>
<p>I failed to reproduce the bug on my FreeBSD VM or on Linux. The bug only
occurred on one specific FreeBSD buildbot. I even got access to the buildbot...
and I still failed to reproduce the bug! I tried to run test_logging multiple
times in parallel, to increase the system load with my <tt class="docutils literal">system_load.py</tt>
script which spawns Python processes running <tt class="docutils literal">while 1: pass</tt> to stress the
CPU, etc. I felt disappointed.</p>
<p>After one month, I succeeded in reproducing the bug by running two commands in
parallel.</p>
<p>Command 1 to trigger the bug:</p>
<pre class="literal-block">
./python -m test -v test_logging \
--fail-env-changed \
--forever \
-m test.test_logging.DatagramHandlerTest.test_output \
-m test.test_logging.ConfigDictTest.test_listen_config_10_ok \
-m test.test_logging.SocketHandlerTest.test_output
</pre>
<p>Command 2 to stress the system:</p>
<pre class="literal-block">
./python -m test -j4
</pre>
<p>It seems like the Python test suite is a very good tool to stress a system to
trigger a race condition!</p>
<p>Finally, I was able to identify the bug:</p>
<blockquote>
The problem is that <tt class="docutils literal">socketserver.ThreadingMixIn</tt> spawns threads without
waiting for their completion in server_close().</blockquote>
</div>
<div class="section" id="skip-failing-tests">
<h3>Skip failing tests</h3>
<p>To stabilize the buildbots and to be able to work on other bugs, I decided to
first skip all tests using <tt class="docutils literal">socketserver.ThreadingMixIn</tt> until this class was
fixed to prevent "dangling threads".</p>
</div>
<div class="section" id="fix-socketserver-for-processes">
<h3>Fix socketserver for processes</h3>
<p>While trying to see how to fix <tt class="docutils literal">socketserver.ThreadingMixIn</tt>, I understood
that <a class="reference external" href="https://bugs.python.org/issue31151">bpo-31151</a> was a similar bug in
the <tt class="docutils literal">socketserver</tt> module but for processes:</p>
<pre class="literal-block">
test_ForkingUDPServer (test.test_socketserver.SocketServerTest) ... creating server
(...)
Warning -- reap_children() reaped child process 18281
</pre>
<p>My analysis:</p>
<blockquote>
The problem is that <tt class="docutils literal">socketserver.ForkingMixIn</tt> doesn't wait until all
children complete. It only calls <tt class="docutils literal">os.waitpid()</tt> in non-blocking mode
(using <tt class="docutils literal">os.WNOHANG</tt>) after each loop iteration. If a child process
completes after the last call to <tt class="docutils literal">ForkingMixIn.collect_children()</tt>, the
server leaks zombie processes.</blockquote>
<p>I fixed <tt class="docutils literal">socketserver.ForkingMixIn</tt> by modifying the <tt class="docutils literal">server_close()</tt>
method to <strong>block</strong> until all child processes complete: <a class="reference external" href="https://github.com/python/cpython/commit/aa8ec34ad52bb3b274ce91169e1bc4a598655049">commit</a>.</p>
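<p>The idea behind the fix can be sketched with a plain <tt class="docutils literal">os.waitpid()</tt> loop.
The <tt class="docutils literal">collect_children()</tt> helper below is a simplified illustration, not the
actual socketserver code: polling with <tt class="docutils literal">os.WNOHANG</tt> can miss children which
finish later, while passing 0 blocks until each child is reaped.</p>

```python
import os

def collect_children(active_children, block=False):
    # Poll (block=False, like the old per-request collection) or block
    # (block=True, like the fixed server_close()) until children are reaped.
    flags = 0 if block else os.WNOHANG
    for pid in sorted(active_children):
        try:
            done_pid, _status = os.waitpid(pid, flags)
        except ChildProcessError:
            active_children.discard(pid)    # child already reaped
            continue
        if done_pid == pid:                 # with WNOHANG, 0 means "still running"
            active_children.discard(pid)

if hasattr(os, "fork"):                     # POSIX only
    children = set()
    for _ in range(2):
        pid = os.fork()
        if pid == 0:
            os._exit(0)                     # child: exit immediately
        children.add(pid)
    collect_children(children, block=True)  # no zombie processes left behind
    print(children)                         # set()
```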
<p>Just after pushing my fix, I understood that it changed the
<tt class="docutils literal">ForkingMixIn</tt> behaviour. I wrote an email to ask whether this was the right
behaviour or if a change was needed: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-August/148826.html">[Python-Dev] socketserver ForkingMixin waiting for
child processes</a>.
The answer is that not everybody wants this behaviour. Sadly, I didn't have
time yet to let the user choose the behaviour.</p>
</div>
<div class="section" id="fix-socketserver-for-threads">
<h3>Fix socketserver for threads</h3>
<p>Fixing <tt class="docutils literal">socketserver.ForkingMixIn</tt> was simple because the code already tracked
the identifiers of the child processes and already had code to wait for their
completion.</p>
<p>Fixing <tt class="docutils literal">socketserver.ThreadingMixIn</tt> (<a class="reference external" href="https://bugs.python.org/issue31233">bpo-31233</a>) was more complicated since it didn't
keep track of spawned threads.</p>
<p>I chose to keep a list of <tt class="docutils literal">threading.Thread</tt> objects, but only for
non-daemonic threads. <tt class="docutils literal">socketserver.ThreadingMixIn.server_close()</tt> now joins
all threads: <a class="reference external" href="https://github.com/python/cpython/commit/b8f4163da30e16c7cd58fe04f4b17e38d53cd57e">commit</a>.</p>
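<p>The approach can be summarized with a small mixin sketch (illustrative names,
not the actual socketserver implementation): remember each non-daemonic handler
thread, then join them all in <tt class="docutils literal">server_close()</tt>:</p>

```python
import threading
import time

class ThreadTrackingMixIn:
    # Simplified sketch of the bpo-31233 approach: remember each
    # non-daemonic handler thread so that server_close() can join them
    # all instead of leaking "dangling threads".
    daemon_threads = False
    _threads = None

    def process_request(self, request, client_address):
        thread = threading.Thread(target=self.process_request_thread,
                                  args=(request, client_address))
        thread.daemon = self.daemon_threads
        if not thread.daemon:
            if self._threads is None:
                self._threads = []
            self._threads.append(thread)
        thread.start()

    def server_close(self):
        threads, self._threads = self._threads, None
        if threads:
            for thread in threads:
                thread.join()

class DemoServer(ThreadTrackingMixIn):
    def __init__(self):
        self.handled = []

    def process_request_thread(self, request, client_address):
        time.sleep(0.01)                  # simulate request handling
        self.handled.append(request)

server = DemoServer()
for i in range(3):
    server.process_request(i, ("127.0.0.1", 0))
server.server_close()                     # blocks until all handlers complete
print(sorted(server.handled))             # [0, 1, 2]
```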
</div>
<div class="section" id="issue-not-done-yet">
<h3>Issue not done yet</h3>
<p>As I wrote above, the <tt class="docutils literal">socketserver</tt> module still needs to be reworked to let the
user decide whether the server must gracefully wait for child completion or not.
Maybe also expose a method to explicitly wait for children, maybe with a
timeout?</p>
</div>
</div>
<div class="section" id="environment-altered-and-dangling-threads">
<h2>Environment altered and dangling threads</h2>
<p>This part kept me busy for the whole quarter. While trying to fix "all bugs", I
looked at two specific "environment changes": "dangling threads" and "zombie
processes". A dangling thread comes from a test which spawns a thread but doesn't
properly "clean up" the thread.</p>
<p>Leaking threads or processes is a very bad side effect since it is likely to
cause random bugs in following tests.</p>
<p>At the beginning, I expected that only 2 or 3 bugs would need to be fixed. In the
end, it was closer to 100 bugs. I don't regret it: I'm now sure that I made the Python
test suite more reliable, and this work allowed me to catch <strong>and fix</strong> old
reference cycle bugs (see above).</p>
<div class="section" id="environment-changed">
<h3>Environment changed</h3>
<p>To detect bugs, I modified Travis CI jobs, AppVeyor and buildbots to run tests
with <tt class="docutils literal"><span class="pre">--fail-env-changed</span></tt>. With this option, if a test alters the
environment, the full test suite is marked as failed with "ENV_CHANGED".</p>
<p>I also fixed <tt class="docutils literal">python3 <span class="pre">-m</span> test <span class="pre">--fail-env-changed</span> <span class="pre">--forever</span></tt> in <a class="reference external" href="https://bugs.python.org/issue30764">bpo-30764</a>: --forever now stops if a test alters
the environment.</p>
</div>
<div class="section" id="test-support-and-regrtest-enhancements">
<h3>test.support and regrtest enhancements</h3>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue30845">bpo-30845</a>: reap_children() now logs
warnings.</li>
<li><tt class="docutils literal">support.reap_children()</tt> now sets environment_altered to <tt class="docutils literal">True</tt> if a
test leaked a zombie process, to detect bugs using <tt class="docutils literal">python3 <span class="pre">-m</span> test
<span class="pre">--fail-env-changed</span></tt>.</li>
<li>regrtest: count also "env changed" tests as failed tests in the test
progress.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>:
<tt class="docutils literal">support.threading_cleanup()</tt> now emits a warning immediately if there are
threads running in the background, to be able to catch bugs more easily.
Previously, the warning was only emitted if the function failed to cleanup
these threads after 1 second.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>: Add
<tt class="docutils literal">test.support.wait_threads_exit()</tt>. Use <tt class="docutils literal">_thread.count()</tt> to wait until
threads exit. The new context manager prevents the "dangling thread" warning.
Also add the <tt class="docutils literal">support.join_thread()</tt> helper: it joins a thread but raises an
AssertionError if the thread is still alive after <em>timeout</em> seconds.</li>
</ul>
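<p>The <tt class="docutils literal">join_thread()</tt> helper can be sketched as follows (a simplified version,
not the exact test.support code): join the thread, then fail loudly if it is
still alive instead of silently leaving it dangling.</p>

```python
import threading

def join_thread(thread, timeout=30.0):
    # Join the thread, but raise if it is still alive afterwards
    # instead of silently leaving a dangling thread behind.
    thread.join(timeout)
    if thread.is_alive():
        msg = f"failed to join the thread in {timeout:.1f} seconds"
        raise AssertionError(msg)

# A well-behaved thread joins fine...
quick = threading.Thread(target=lambda: None)
quick.start()
join_thread(quick)

# ...while a stuck thread raises AssertionError instead of hanging the test.
stuck = threading.Thread(target=threading.Event().wait, daemon=True)
stuck.start()
try:
    join_thread(stuck, timeout=0.1)
except AssertionError as exc:
    print(exc)   # failed to join the thread in 0.1 seconds
```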
</div>
<div class="section" id="multiprocessing-bug-fixes">
<h3>multiprocessing bug fixes</h3>
<p>The multiprocessing module is very complex. multiprocessing tests have been failing
randomly for years, but nobody seems able to fix them. I can only hope that my
following fixes will help make these tests more reliable.</p>
<ul class="simple">
<li>multiprocessing.Queue.join_thread() now waits until the thread
completes, even if the thread was started by the same process which
created the queue.</li>
<li><a class="reference external" href="https://bugs.python.org/issue26762">bpo-26762</a>: Avoid daemon processes in _test_multiprocessing. test_level() of
_test_multiprocessing._TestLogging now uses regular processes rather than
daemon processes to prevent zombie processes (to not "leak" processes).</li>
<li><a class="reference external" href="https://bugs.python.org/issue26762">bpo-26762</a>: Fix more dangling processes and threads in test_multiprocessing.
Queue: call close() followed by join_thread(). Process: call join() or
self.addCleanup(p.join).</li>
<li><a class="reference external" href="https://bugs.python.org/issue26762">bpo-26762</a>: test_multiprocessing now detects dangling processes and threads
per test case classes.</li>
<li><a class="reference external" href="https://bugs.python.org/issue26762">bpo-26762</a>: test_multiprocessing closes more queues. Explicitly close queues to
make sure that we don't leave dangling threads. test_queue_in_process():
remove unused queue. test_access() joins also the process to fix a random
warning.</li>
<li><a class="reference external" href="https://bugs.python.org/issue26762">bpo-26762</a>: _test_multiprocessing now marks the test as ENV_CHANGED on
dangling process or thread.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31069">bpo-31069</a>, Fix a warning about dangling processes in test_rapid_restart() of
_test_multiprocessing: join the process.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>, test_multiprocessing:
Give 30 seconds to join_process(), instead of 5 or 10 seconds, to wait until
the process completes.</li>
</ul>
</div>
<div class="section" id="concurrent-futures-bug-fixes">
<h3>concurrent.futures bug fixes</h3>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue30845">bpo-30845</a>: Enhance test_concurrent_futures cleanup. Make sure that tests
don't leak threads or processes. Explicitly clear the reference to the
executor to make sure that it's destroyed.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31249">bpo-31249</a>: test_concurrent_futures checks dangling threads. Add a
BaseTestCase class to test_concurrent_futures to check for dangling threads
and processes on all tests, not only tests using ExecutorMixin.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31249">bpo-31249</a>: Fix test_concurrent_futures dangling thread.
ProcessPoolShutdownTest.test_del_shutdown() now closes the call queue and
joins its thread, to prevent leaking a dangling thread.</li>
</ul>
</div>
<div class="section" id="test-threading-and-test-thread">
<h3>test_threading and test_thread</h3>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>: test_threaded_import: fix
test_side_effect_import(). Don't leak the module into sys.modules. Avoid
also dangling threads.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>:
test_thread.test_forkinthread() now waits until the thread completes.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>: Try to fix the
threading_cleanup() warning in test.lock_tests: wait a little bit longer to
give time to the threads to complete. Warning seen on test_thread and
test_importlib.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>: Join threads in test_threading. Call thread.join() to prevent the
"dangling thread" warning.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>: Join timers in
test_threading. Call the .join() method of threading.Timer timers to prevent
the threading_cleanup() warning.</li>
</ul>
</div>
<div class="section" id="other-fixes">
<h3>Other fixes</h3>
<ul class="simple">
<li>test_urllib2_localnet: clear server variable. Set the server attribute to
None in cleanup to avoid dangling threads.</li>
<li><a class="reference external" href="https://bugs.python.org/issue30818">bpo-30818</a>: test_ftplib calls asyncore.close_all(). Always clear asyncore
socket map using asyncore.close_all(ignore_all=True) in tearDown() method.</li>
<li><a class="reference external" href="https://bugs.python.org/issue30908">bpo-30908</a>: Fix dangling thread in test_os.TestSendfile. tearDown() now explicitly
clears the self.server variable to make sure that the thread is
completely cleared when tearDownClass() checks if all threads have been
cleaned up.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31067">bpo-31067</a>: test_subprocess now also calls reap_children() in tearDown(), not
only in setUp().</li>
<li><a class="reference external" href="https://bugs.python.org/issue31160">bpo-31160</a>: Fix test_builtin for zombie process. PtyTests.run_child() now calls
os.waitpid() to read the exit status of the child process to avoid creating
zombie process and leaking processes in the background.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31160">bpo-31160</a>: Fix test_random for zombie process. TestModule.test_after_fork()
now calls os.waitpid() to read the exit status of the child process to avoid
creating a zombie process.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31160">bpo-31160</a>: test_tempfile: TestRandomNameSequence.test_process_awareness() now
calls os.waitpid() to avoid leaking a zombie process.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>: fork_wait.py tests now joins threads, to not leak running threads
in the background.</li>
<li><a class="reference external" href="https://bugs.python.org/issue30830">bpo-30830</a>: test_logging uses threading_setup/cleanup. Replace
@support.reap_threads on some methods with support.threading_setup() in
setUp() and support.threading_cleanup() in tearDown() in BaseTest.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>: test_httpservers joins the server thread.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31250">bpo-31250</a>, test_asyncio: fix dangling threads. Explicitly call
shutdown(wait=True) on executors to wait until all threads complete to
prevent side effects between tests. Fix test_loop_self_reading_exception():
don't mock loop.close(). Previously, the original close() method was called
rather than the mock, because of how set_event_loop() registered loop.close().</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>: Explicitly clear the server attribute in test_ftplib and
test_poplib to prevent dangling thread. Clear also self.server_thread
attribute in TestTimeouts.tearDown().</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>: Join threads in tests. Call thread.join() on threads to prevent
the "dangling threads" warning.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>: Join threads in test_hashlib: use thread.join() to wait until the
parallel hash tasks complete rather than using events. Calling thread.join()
prevents "dangling thread" warnings.</li>
<li><a class="reference external" href="https://bugs.python.org/issue31234">bpo-31234</a>: Join threads in test_queue. Call thread.join() to prevent the
"dangling thread" warning.</li>
</ul>
<p><strong>Next report:</strong> <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part3.html">My contributions to CPython during 2017 Q3: Part 3 (funny
bugs)</a>.</p>
</div>
</div>
My contributions to CPython during 2017 Q3: Part 12017-10-18T15:00:00+02:002017-10-18T15:00:00+02:00Victor Stinnertag:vstinner.github.io,2017-10-18:/contrib-cpython-2017q3-part1.html<p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q3
(july, august, september), Part 1.</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part1.html">My contributions to CPython during 2017 Q2 (part1)</a>.</p>
<p>Next reports:</p>
<ul class="simple">
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part2.html">My contributions to CPython during 2017 Q3: Part 2 (dangling
threads)</a>.</li>
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part3.html">My contributions to CPython during 2017 Q3: Part 3 (funny bugs)</a>.</li>
</ul>
<p>Summary:</p>
<ul class="simple">
<li>Statistics</li>
<li>Security fixes …</li></ul><p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q3
(july, august, september), Part 1.</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part1.html">My contributions to CPython during 2017 Q2 (part1)</a>.</p>
<p>Next reports:</p>
<ul class="simple">
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part2.html">My contributions to CPython during 2017 Q3: Part 2 (dangling
threads)</a>.</li>
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part3.html">My contributions to CPython during 2017 Q3: Part 3 (funny bugs)</a>.</li>
</ul>
<p>Summary:</p>
<ul class="simple">
<li>Statistics</li>
<li>Security fixes</li>
<li>Enhancement: socket.close() now ignores ECONNRESET</li>
<li>Removal of the macOS job of Travis CI</li>
<li>New test.pythoninfo utility</li>
<li>Revert commits if buildbots are broken</li>
<li>Fix the Python test suite</li>
</ul>
<div class="section" id="statistics">
<h2>Statistics</h2>
<pre class="literal-block">
# All branches
$ git log --after=2017-06-30 --before=2017-10-01 --reverse --branches='*' --author=Stinner|grep '^commit ' -c
209
# Master branch only
$ git log --after=2017-06-30 --before=2017-10-01 --reverse --author=Stinner origin/master|grep '^commit ' -c
97
</pre>
<p>Statistics: I pushed <strong>97</strong> commits to the master branch out of a <strong>total of 209
commits</strong>; the remaining 112 commits went to other branches (backports, fixes
specific to Python 2.7, security fixes in Python 3.3 and 3.4, etc.).</p>
</div>
<div class="section" id="security-fixes">
<h2>Security fixes</h2>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue30947">bpo-30947</a>: Update libexpat from 2.2.1 to 2.2.3. Fix applied to master, 3.6,
3.5, 3.4, 3.3 and 2.7 branches! Expat 2.2.2 and 2.2.3 fixed multiple security
vulnerabilities.
<a class="reference external" href="http://python-security.readthedocs.io/vuln/expat_2.2.3.html">http://python-security.readthedocs.io/vuln/expat_2.2.3.html</a></li>
<li>Fix whichmodule() of _pickle: _PyUnicode_FromId() can return NULL, so replace
Py_INCREF() with Py_XINCREF(). Fix Coverity report: CID 1417269.</li>
<li><a class="reference external" href="https://bugs.python.org/issue30860">bpo-30860</a>: <tt class="docutils literal">_PyMem_Initialize()</tt> contains code which is never executed.
Replace the runtime check with a build assertion. Fix Coverity CID 1417587.</li>
</ul>
<p>See also my <a class="reference external" href="http://python-security.readthedocs.io/">python-security website</a>.</p>
</div>
<div class="section" id="enhancement-socket-close-now-ignores-econnreset">
<h2>Enhancement: socket.close() now ignores ECONNRESET</h2>
<p><a class="reference external" href="https://bugs.python.org/issue30319">bpo-30319</a>: socket.close() now ignores ECONNRESET. Previously, many network
tests failed randomly with ConnectionResetError on socket.close().</p>
<p>Patching all functions calling socket.close() would require a lot of work, and
it was surprising to get a "connection reset" when closing a socket.</p>
<p>Who cares that the peer closed the connection, since we are already closing
it!?</p>
<p>Note: socket.close() was modified in Python 3.6 to raise OSError on failure
(<a class="reference external" href="https://bugs.python.org/issue26685">bpo-26685</a>).</p>
</div>
<div class="section" id="removal-of-the-macos-job-of-travis-ci">
<h2>Removal of the macOS job of Travis CI</h2>
<a class="reference external image-reference" href="https://travis-ci.org/"><img alt="call_method microbenchmark" class="align-right" src="https://vstinner.github.io/images/travis-ci.png" /></a>
<p>While the Linux jobs of Travis CI usually took 15 minutes, up to 30 minutes in
the worst case, the macOS job regularly took longer than 30 minutes, sometimes
longer than 1 hour.</p>
<p>While the macOS job was optional, sometimes it went mad and prevented a PR
from being merged. Cancelling the job marked Travis CI as failed on the PR, so
it was still not possible to merge the PR, whereas, again, the job was marked as
optional ("Allowed Failure").</p>
<p>Moreover, when the macOS job failed, the failure was not reported on the PR,
since the job was marked as optional. The only way to notice a failure was to
go to Travis CI and wait at least 30 minutes (whereas the Linux jobs had already
completed and it was already possible to merge the PR...).</p>
<p>I sent a first mail in June: <a class="reference external" href="https://mail.python.org/pipermail/python-committers/2017-June/004661.html">[python-committers] macOS Travis CI job became
mandatory?</a></p>
<p>In september, we decided to remove the macOS job during the CPython sprint at
Instagram (see my previous <a class="reference external" href="https://vstinner.github.io/new-python-c-api.html">New C API</a>
article), to not slowdown our development speed (<a class="reference external" href="https://bugs.python.org/issue31355">bpo-31355</a>). I sent another
email to announce the change: <a class="reference external" href="https://mail.python.org/pipermail/python-committers/2017-September/004824.html">[python-committers] Travis CI: macOS is now
blocking -- remove macOS from Travis CI?</a>.</p>
<p>After the sprint, it was decided not to add the macOS job back, since we have
3 macOS buildbots: they are enough to detect regressions specific to macOS.</p>
<p>After the removal of the macOS job, at the end of September, Travis CI
published an article about the poor performance of their macOS fleet: <a class="reference external" href="https://blog.travis-ci.com/2017-09-22-macos-update">Updating
Our macOS Open Source Offering</a>. Sadly, the article
confirms that the situation is not going to improve quickly.</p>
</div>
<div class="section" id="new-test-pythoninfo-utility">
<h2>New test.pythoninfo utility</h2>
<p>To understand the "Segfault when readline history is more then 2 * history
size" crash of <a class="reference external" href="https://bugs.python.org/issue29854">bpo-29854</a>, I modified
<tt class="docutils literal">test_readline</tt> to log libreadline versions. I also added
<tt class="docutils literal">readline._READLINE_LIBRARY_VERSION</tt>. My colleague <strong>Nir Soffer</strong> wrote the
final readline fix: skip the test on old readline versions.</p>
<p>As a follow-up to this issue, I added a new <tt class="docutils literal">test.pythoninfo</tt> program which logs
a lot of information useful to debug Python tests (<a class="reference external" href="https://bugs.python.org/issue30871">bpo-30871</a>). pythoninfo is now run on
Travis CI, AppVeyor and buildbots.</p>
<p>Example of output:</p>
<pre class="literal-block">
$ ./python -m test.pythoninfo
(...)
_decimal.__libmpdec_version__: 2.4.2
expat.EXPAT_VERSION: expat_2.2.4
gdb_version: GNU gdb (GDB) Fedora 8.0.1-26.fc26
locale.encoding: UTF-8
os.cpu_count: 4
(...)
time.timezone: -3600
time.tzname: ('CET', 'CEST')
tkinter.TCL_VERSION: 8.6
tkinter.TK_VERSION: 8.6
tkinter.info_patchlevel: 8.6.6
zlib.ZLIB_RUNTIME_VERSION: 1.2.11
zlib.ZLIB_VERSION: 1.2.11
</pre>
<p><tt class="docutils literal">test.pythoninfo</tt> can be easily extended to log more information, without
polluting the output of the Python test suite which is already too verbose and
very long.</p>
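<p>Extending it mostly means adding one more "key: value" pair. A minimal sketch
in the same spirit (not the actual test.pythoninfo code): collect flat dotted
keys, and never let one failing probe hide the others:</p>

```python
import platform
import sys

def collect_info():
    # Gather flat "dotted.key: value" pairs, ignoring anything which fails:
    # a missing module must not prevent logging the rest of the info.
    info = {}
    for key, getter in [
        ("platform.platform", platform.platform),
        ("sys.version", lambda: sys.version.replace("\n", " ")),
        ("sys.maxsize", lambda: str(sys.maxsize)),
    ]:
        try:
            info[key] = getter()
        except Exception:
            pass
    return info

for key, value in sorted(collect_info().items()):
    print(f"{key}: {value}")
```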
</div>
<div class="section" id="revert-commits-if-buildbots-are-broken">
<h2>Revert commits if buildbots are broken</h2>
<p>Thanks to my work over the last months on the Python test suite, the buildbots are
now very reliable. When a buildbot fails, it becomes very likely that it's a
real regression, and not a random failure caused by a bug in the Python test
suite.</p>
<p>I proposed a new rule: <strong>revert a change if it breaks buildbots and the bug
cannot be fixed easily</strong>:</p>
<blockquote>
<p>So I would like to set a new rule: if I'm unable to fix buildbots
failures caused by a recent change quickly (say, in less than 2
hours), I propose to revert the change.</p>
<p>It doesn't mean that the commit is bad and must not be merged ever.
No. It would just mean that we need time to work on fixing the issue,
and it shouldn't impact other pending changes, to keep a sane master
branch.</p>
</blockquote>
<p><a class="reference external" href="https://mail.python.org/pipermail/python-committers/2017-June/004588.html">[python-committers] Revert changes which break too many buildbots</a>.</p>
<div class="section" id="test-datetime">
<h3>test_datetime</h3>
<p>The first revert was an enhancement of test_datetime, <a class="reference external" href="https://bugs.python.org/issue30822">bpo-30822</a>:</p>
<pre class="literal-block">
commit 98b6bc3bf72532b784a1c1fa76eaa6026a663e44
Author: Utkarsh Upadhyay <mail@musicallyut.in>
Date: Sun Jul 2 14:46:04 2017 +0200
bpo-30822: Fix testing of datetime module. (#2530)
Only C implementation was tested.
</pre>
<p>I wrote an email to announce the revert: <a class="reference external" href="https://mail.python.org/pipermail/python-committers/2017-July/004673.html">[python-committers] Revert changes
which break too many buildbots</a>.</p>
<p>It took 15 days to decide how to properly fix the issue (exclude <tt class="docutils literal">tzdata</tt>
from the test resources). I don't regret my revert, since having broken buildbots
for 15 days would have been very annoying.</p>
</div>
<div class="section" id="python-gdb-py-fix">
<h3>python-gdb.py fix</h3>
<p>I also reverted this commit of <a class="reference external" href="https://bugs.python.org/issue30983">bpo-30983</a>:</p>
<pre class="literal-block">
commit 2e0f4db114424a00354eab889ba8f7334a2ab8f0
Author: Bruno "Polaco" Penteado <polaco@gmail.com>
Date: Mon Aug 14 23:14:17 2017 +0100
bpo-30983: eval frame rename in pep 0523 broke gdb's python extension (#2803)
pep 0523 renames PyEval_EvalFrameEx to _PyEval_EvalFrameDefault while the gdb python extension only looks for PyEval_EvalFrameEx to understand if it is dealing with a frame.
Final effect is that attaching gdb to a python3.6 process doesnt resolve python objects. Eg. py-list and py-bt dont work properly.
This patch fixes that. Tested locally on python3.6
</pre>
<p>My comment on the issue:</p>
<blockquote>
<p>I chose to revert the change because I don't have the bandwidth right now
to investigate why the change broke test_gdb.</p>
<p>I'm surprised that a change affecting python-gdb.py wasn't properly tested
manually using test_gdb.py :-( I understand that Travis CI doesn't have gdb
and/or that the test pass in some cases?</p>
<p>The revert only gives us more time to design the proper solution.</p>
</blockquote>
<p>Fortunately, a fixed commit was pushed 4 days later, and this one didn't break
the buildbots!</p>
</div>
</div>
<div class="section" id="fix-the-python-test-suite">
<h2>Fix the Python test suite</h2>
<p>As usual, I spent a significant part of my time fixing bugs in the Python test
suite to make it more reliable and more "usable".</p>
<ul>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue30822">bpo-30822</a>: Exclude <tt class="docutils literal">tzdata</tt> from <tt class="docutils literal">regrtest <span class="pre">--all</span></tt>. When running the test suite
using <tt class="docutils literal"><span class="pre">--use=all</span></tt> / <tt class="docutils literal"><span class="pre">-u</span> all</tt>, exclude <tt class="docutils literal">tzdata</tt> since it makes
test_datetime too slow (15-20 min on some buildbots, just this single test
file) which then times out on some buildbots. <tt class="docutils literal"><span class="pre">-u</span> tzdata</tt> must now be
enabled explicitly.</p>
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue30188">bpo-30188</a>, test_nntplib: Catch also
ssl.SSLEOFError in NetworkedNNTPTests.setUpClass(), not only EOFError.
(<em>Sadly, test_nntplib still fails randomly with EOFError or SSLEOFError...</em>)</p>
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue31009">bpo-31009</a>: Fix
<tt class="docutils literal">support.fd_count()</tt> on Windows. Call <tt class="docutils literal">msvcrt.CrtSetReportMode()</tt> to not
kill the process nor log any error on stderr on os.dup(fd) if the file
descriptor is invalid.</p>
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue31034">bpo-31034</a>: Reliable signal handler for test_asyncio. Don't rely on the
current SIGHUP signal handler; make sure that it is set to the default
signal handler, SIG_DFL. A colleague reported to me that the Python test suite
hangs when running test_subprocess_send_signal() of test_asyncio. After
analysing the issue, it turned out that the test hangs because the RPM package
builder ignores SIGHUP.</p>
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue31028">bpo-31028</a>: Fix test_pydoc when run
directly. Fix <tt class="docutils literal">get_pydoc_link()</tt>: get the absolute path to <tt class="docutils literal">__file__</tt> to
prevent relative directories.</p>
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue31066">bpo-31066</a>: Fix
<tt class="docutils literal">test_httpservers.test_last_modified()</tt>. Write the temporary file on disk
and then get its modification time.</p>
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue31173">bpo-31173</a>: Rewrite WSTOPSIG test of test_subprocess.</p>
<p>The current <tt class="docutils literal">test_child_terminated_in_stopped_state()</tt> function creates a
child process which calls <tt class="docutils literal">ptrace(PTRACE_TRACEME, 0, 0)</tt> and then crashes
(SIGSEGV). The problem is that calling <tt class="docutils literal">os.waitpid()</tt> in the parent process is
not enough to reap the process: the child process remains alive, and so the
unit test leaks a child process in a strange state. Closing the child process
requires non-trivial, maybe platform-specific, code.</p>
<p>Remove the functional test and replace it with a unit test which mocks
<tt class="docutils literal">os.waitpid()</tt> using a new <tt class="docutils literal">_testcapi.W_STOPCODE()</tt> function to test the
<tt class="docutils literal">WIFSTOPPED()</tt> path.</p>
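<p>On Linux, a "stopped" wait status encodes the stop signal in the high byte
and the marker <tt class="docutils literal">0x7f</tt> in the low byte; this is what the
glibc <tt class="docutils literal">W_STOPCODE()</tt> macro builds. A minimal sketch of
such a mock status (Linux-specific encoding, assumed here for illustration):</p>

```python
import os
import signal

def w_stopcode(signum):
    # Build a wait status meaning "child stopped by signum".
    # Mirrors the glibc W_STOPCODE() macro; the encoding is Linux-specific.
    return (signum << 8) | 0x7f

status = w_stopcode(signal.SIGSTOP)
assert os.WIFSTOPPED(status)                    # recognized as "stopped"
assert os.WSTOPSIG(status) == signal.SIGSTOP    # the stop signal is recovered
```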
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue31008">bpo-31008</a>: Fix asyncio
test_wait_for_handle on Windows, tolerate a difference of 50 ms.</p>
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue31235">bpo-31235</a>: Fix ResourceWarning in
test_logging: always close all asyncore dispatchers (ignoring errors if any).</p>
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue30121">bpo-30121</a>: Add test_subprocess.test_nonexisting_with_pipes(). Test the Popen
failure when Popen was created with pipes. Also create a NONEXISTING_CMD
variable in test_subprocess.py.</p>
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue31250">bpo-31250</a>, test_asyncio: fix EventLoopTestsMixin.tearDown(). Call
doCleanups() to close the loop after calling executor.shutdown(wait=True).</p>
</li>
<li><p class="first">test_ssl: Implement timeout in ssl_io_loop(). The timeout parameter was not
used.</p>
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue31448">bpo-31448</a>, test_poplib: Call POP3.close(); don't close the sock
attribute directly, to fix a ResourceWarning.</p>
</li>
<li><p class="first">os.test_utime_current(): tolerate 50 ms delta.</p>
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue31135">bpo-31135</a>: ttk: fix the LabeledScale and OptionMenu destroy() methods. Call the
parent destroy() method even if the accessed attribute doesn't exist. The
LabeledScale.destroy() method now also explicitly clears the label and scale
attributes to help the garbage collector destroy all widgets.</p>
</li>
<li><p class="first"><a class="reference external" href="https://bugs.python.org/issue31479">bpo-31479</a>: Always reset the signal alarm in tests. Use
the <tt class="docutils literal">try: ... finally: signal.alarm(0)</tt> pattern to make sure that tests
don't "leak" a pending fatal signal alarm. Move some signal.alarm() calls
into the try block.</p>
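<p>The pattern looks like this (a minimal sketch; in the real tests, the body of
the <tt class="docutils literal">try</tt> block is the test code which must finish
before the alarm fires):</p>

```python
import signal

def alarm_handler(signum, frame):
    # If the alarm ever fires, fail loudly instead of killing the process.
    raise InterruptedError("test took too long, SIGALRM fired")

signal.signal(signal.SIGALRM, alarm_handler)
signal.alarm(10)  # arm a 10-second watchdog alarm
try:
    pass  # ... run the test code here ...
finally:
    signal.alarm(0)  # always cancel the pending alarm, even on failure
```

After the <tt class="docutils literal">finally</tt> block, no alarm is pending anymore, so a later test cannot be killed by a leftover SIGALRM.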
</li>
</ul>
<p><strong>Next report:</strong> <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part2.html">My contributions to CPython during 2017 Q3: Part 2 (dangling
threads)</a>.</p>
</div>
Python Security2017-09-15T22:00:00+02:002017-09-15T22:00:00+02:00Victor Stinnertag:vstinner.github.io,2017-09-15:/python-security.html<p>I have been working on Python security for years, but I have never written
anything about it. Let's fix this!</p>
<div class="section" id="psrt">
<h2>PSRT</h2>
<p>I am part of the Python Security Response Team (PSRT): I get emails sent to
<a class="reference external" href="mailto:security@python.org">security@python.org</a>. I try to analyze each report to validate that the bug
is …</p></div><p>I have been working on Python security for years, but I have never written
anything about it. Let's fix this!</p>
<div class="section" id="psrt">
<h2>PSRT</h2>
<p>I am part of the Python Security Response Team (PSRT): I get emails sent to
<a class="reference external" href="mailto:security@python.org">security@python.org</a>. I try to analyze each report to validate that the bug
is reproducible, find the impacted Python versions and start discussing how to
fix the vulnerability. In some cases, the reported issue is not a security
vulnerability, is not related to CPython, or is already fixed. We get reports
not only about CPython, but also about the Python web sites and other projects
related to Python.</p>
<p>Warning: I don't represent the PSRT, I only speak for myself!</p>
</div>
<div class="section" id="vulnerabilities-sent-to-psrt">
<h2>Vulnerabilities sent to PSRT</h2>
<p>In this article, I will focus on vulnerabilities impacting CPython: the C and
Python code of CPython core and the standard library.</p>
<p>When vulnerabilities are obvious bugs, they are quickly fixed. Done.</p>
<p>But it's not uncommon that fixing a vulnerability impacts backward
compatibility, which is a major concern of CPython core developers. There is
also a risk of rejecting legitimate input data because the added checks are too
strict. We have to be very careful, and so fixing vulnerabilities can take
weeks, if not months in the worst case.</p>
<p>While CPython has few active core developers, the PSRT has even fewer active
members to handle incoming reports. We are volunteers, so please be kind and
patient...</p>
</div>
<div class="section" id="example-of-a-complex-fix">
<h2>Example of a complex fix</h2>
<p>The <a class="reference external" href="https://python-security.readthedocs.io/vuln/urllib_ftp_protocol_stream_injection.html">urllib FTP protocol stream injection</a>
vulnerability was reported to the PSRT at 2016-01-15. The fix was only merged
at 2017-07-26.</p>
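<p>That is more than 18 months between the report and the merged fix:</p>

```python
from datetime import date

reported = date(2016, 1, 15)  # vulnerability reported to the PSRT
merged = date(2017, 7, 26)    # fix merged

print((merged - reported).days)  # 558 days
```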
<p>First, it was not obvious how the vulnerability could be exploited, nor
whether it should be fixed at all.</p>
<p>Then it was not obvious whether the vulnerability should be fixed in the
urllib module or in the ftplib module.</p>
<p>Even though the bug was public, it didn't get much attention. Since I don't
know the urllib module well, I wrote an email to the python-dev mailing
list: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-July/148699.html">Need help to fix urllib(.parse) vulnerabilities</a>.</p>
<p>I proposed a fix for the urllib module: <a class="reference external" href="https://bugs.python.org/issue30713">Reject newline character (U+000A) in
URLs in urllib.parse</a>. But it was
rejected: it was the wrong approach and my checks were too strict in many
cases (they rejected legitimate requests).</p>
<p>The final fix rejects the <tt class="docutils literal">\n</tt> and <tt class="docutils literal">\r</tt> newline characters in the putline()
method of the ftplib module.</p>
</div>
<div class="section" id="track-known-and-fixed-cpython-vulnerabilities">
<h2>Track known and fixed CPython vulnerabilities</h2>
<p>Currently, no fewer than six branches still get security fixes!</p>
<ul class="simple">
<li>Python 2.7</li>
<li>Python 3.3</li>
<li>Python 3.4</li>
<li>Python 3.5</li>
<li>Python 3.6</li>
<li>master: the development branch</li>
</ul>
<p>Last year, I added a table to the Python developer guide to help me to track
the status of each branch: see the <a class="reference external" href="https://devguide.python.org/#status-of-python-branches">Status of Python branches</a>.</p>
<p>This year, I created a tool to help me to track known CPython vulnerabilities:
<a class="reference external" href="https://github.com/vstinner/python-security">python-security project</a> (hosted
at GitHub). The <a class="reference external" href="https://github.com/vstinner/python-security/blob/master/vulnerabilities.yaml">vulnerabilities.yaml file</a>
is a YAML file with one section per vulnerability. Each vulnerability has
a title, link to the Python bug, disclosure date, reported date, commits, etc.</p>
<p>The tool gets the dates of commits and the Git tags which contain each
commit, to infer the first Python version of each branch which contains the
fix. It also builds a timeline to help understand how the vulnerability was
handled.</p>
<p>I also wanted to be more transparent on how we handle vulnerabilities and our
velocity to fix them.</p>
<p>Honestly, I was disappointed that it took so long to fix some vulnerabilities
in the past. Fortunately, it seems like we are more reactive nowadays!</p>
</div>
<div class="section" id="example-of-a-fixed-vulnerability">
<h2>Example of a fixed vulnerability</h2>
<p>Example: <a class="reference external" href="https://python-security.readthedocs.io/vuln/cve-2016-5699_http_header_injection.html">CVE-2016-5699: HTTP header injection</a>.</p>
<p>Right now, Python 3.3 is still vulnerable (my fix was committed; I am now
waiting for Python 3.3.7, which is coming at the end of September).</p>
<p>Since the vulnerability was reported, it took 108 days to merge the fix, and
72 more days (180 days in total) for the first release including the fix
(Python 2.7.10).</p>
<p>Sadly, the PSRT doesn't assign a severity to vulnerabilities yet.</p>
<p>Fortunately, for this vulnerability, web frameworks were able to work around
it with input sanitization.</p>
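<p>Such sanitization boils down to refusing CR and LF characters in header
values before they reach the socket. A minimal, hypothetical sketch (the
helper name is invented for illustration):</p>

```python
def sanitize_header_value(value):
    """Reject HTTP header values that would allow header injection.

    Illustrative sketch: refuse any CR or LF character so that a value
    cannot smuggle extra header lines into the request.
    """
    if "\r" in value or "\n" in value:
        raise ValueError("invalid character in HTTP header value")
    return value

sanitize_header_value("text/html")  # accepted
try:
    sanitize_header_value("x\r\nSet-Cookie: injected=1")
except ValueError:
    pass  # injection attempt rejected
```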
</div>
<div class="section" id="backport-all-fixes">
<h2>Backport all fixes</h2>
<p>In recent months, I backported fixes to the six branches which still accept
security fixes, to respect the contract with our users: we are doing our best
to protect you!</p>
<p>The good news is that with Python 2.7.14 and Python 3.3.7 releases scheduled
this month, all major security vulnerabilities will be fixed in all maintained
Python branches!</p>
<p>Some fixes were not backported on purpose. For example, the <a class="reference external" href="https://python-security.readthedocs.io/vuln/cve-2013-7040_hash_not_properly_randomized.html#cve-2013-7040-hash-not-properly-randomized">CVE-2013-7040:
Hash not properly randomized</a>
vulnerability requires changing the hash algorithm, and we decided not to touch
Python 2.7 and 3.3 for backward compatibility reasons (don't break code relying
on the exact hash function). The issue was fixed in Python 3.4 by using the
SipHash hash algorithm, which uses a hash secret (generated randomly by Python
at startup).</p>
</div>
<div class="section" id="python-security-documentation">
<h2>Python security documentation</h2>
<p>In recent months, I also started to collect random notes about Python security.</p>
<p>Explore my <a class="reference external" href="https://python-security.readthedocs.io/">python-security.readthedocs.io</a> documentation and send me feedback!</p>
</div>
A New C API for CPython2017-09-07T18:00:00+02:002017-09-07T18:00:00+02:00Victor Stinnertag:vstinner.github.io,2017-09-07:/new-python-c-api.html<p>I am currently at a CPython sprint 2017 at Facebook. We are discussing my idea
of writing a new C API for CPython hiding implementation details and replacing
macros with function calls.</p>
<img alt="CPython sprint at Facebook, september 2017" src="https://vstinner.github.io/images/cpython_sprint_sept2017.jpg" />
<p>This article tries to explain why the CPython C API needs to <strong>evolve</strong>.</p>
<div class="section" id="c-api-prevents-further-optimizations">
<h2>C API prevents further optimizations …</h2></div><p>I am currently at a CPython sprint 2017 at Facebook. We are discussing my idea
of writing a new C API for CPython hiding implementation details and replacing
macros with function calls.</p>
<img alt="CPython sprint at Facebook, september 2017" src="https://vstinner.github.io/images/cpython_sprint_sept2017.jpg" />
<p>This article tries to explain why the CPython C API needs to <strong>evolve</strong>.</p>
<div class="section" id="c-api-prevents-further-optimizations">
<h2>C API prevents further optimizations</h2>
<p>The CPython <tt class="docutils literal">PyListObject</tt> type uses an array of <tt class="docutils literal">PyObject*</tt> objects. PyPy
is able to use a C array of integers if the list only contains small integers.
CPython cannot because PyList_GET_ITEM(list, index) is implemented as a macro:</p>
<pre class="literal-block">
#define PyList_GET_ITEM(op, i) (((PyListObject *)(op))->ob_item[i])
</pre>
<p>The macro relies on the <tt class="docutils literal">PyListObject</tt> structure:</p>
<pre class="literal-block">
typedef struct {
    PyVarObject ob_base;
    PyObject **ob_item;     // <-- pointer to real data
    Py_ssize_t allocated;
} PyListObject;

typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size;     /* Number of items in variable part */
} PyVarObject;

typedef struct _object {
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;
</pre>
</div>
<div class="section" id="api-and-abi">
<h2>API and ABI</h2>
<p>Compiling C extension code using <tt class="docutils literal">PyList_GET_ITEM()</tt> produces machine code
accessing <tt class="docutils literal">PyListObject</tt> members. Something like (C pseudo code):</p>
<pre class="literal-block">
PyObject **items;
PyObject *item;
items = (PyObject **)(((char*)list) + 24);
item = items[i];
</pre>
<p>The offset 24 is hardcoded in the C extension object file: the <strong>API</strong>
(<strong>programming</strong> interface) becomes the <strong>ABI</strong> (<strong>binary</strong> interface).</p>
<p>But debug builds use a different memory layout:</p>
<pre class="literal-block">
typedef struct _object {
    struct _object *_ob_next;   // <--- two new fields are added
    struct _object *_ob_prev;   // <--- for debug purpose
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;
</pre>
<p>The machine code becomes something like:</p>
<pre class="literal-block">
items = (PyObject **)(((char*)op) + 40);
item = items[i];
</pre>
<p>The offset changes from 24 to 40 (+16, two pointers of 8 bytes).</p>
<p>C extensions have to be recompiled to work on Python compiled in debug mode.</p>
<p>Another example is Python 2.7 which uses a different ABI for UTF-16 and UCS-4
Unicode string: the <tt class="docutils literal"><span class="pre">--with-wide-unicode</span></tt> configure option.</p>
</div>
<div class="section" id="stable-abi">
<h2>Stable ABI</h2>
<p>If the machine code doesn't hardcode the offset, C extensions only need to be
compiled once.</p>
<p>A solution is to replace PyList_GET_ITEM() <strong>macro</strong> with a <strong>function</strong>:</p>
<pre class="literal-block">
PyObject* PyList_GET_ITEM(PyObject *list, Py_ssize_t index);
</pre>
<p>defined as:</p>
<pre class="literal-block">
PyObject* PyList_GET_ITEM(PyObject *list, Py_ssize_t index)
{
    return ((PyListObject *)list)->ob_item[index];
}
</pre>
<p>The machine code becomes a <strong>function call</strong>:</p>
<pre class="literal-block">
PyObject *item;
item = PyList_GET_ITEM(list, index);
</pre>
</div>
<div class="section" id="specialized-list-for-small-integers">
<h2>Specialized list for small integers</h2>
<p>If C extensions don't access structure members anymore, it becomes
possible to modify the memory layout.</p>
<p>For example, it's possible to design a specialized implementation of
<tt class="docutils literal">PyListObject</tt> for small integers:</p>
<pre class="literal-block">
typedef struct {
    PyVarObject ob_base;
    int use_small_int;
    PyObject **pyobject_array;
    int32_t *small_int_array;   // <-- new compact C array for integers
    Py_ssize_t allocated;
} PyListObject;

PyObject* PyList_GET_ITEM(PyObject *op, Py_ssize_t index)
{
    PyListObject *list = (PyListObject *)op;
    if (list->use_small_int) {
        int32_t item = list->small_int_array[index];
        /* create a new object at each call */
        return PyLong_FromLong(item);
    }
    else {
        return list->pyobject_array[index];
    }
}
</pre>
<p>It's just an example to show that it becomes possible to modify PyObject
structures. I'm not sure that it's useful in practice.</p>
</div>
<div class="section" id="multiple-python-runtimes">
<h2>Multiple Python "runtimes"</h2>
<p>Assuming that all used C extensions use the new stable ABI, we can now imagine
multiple specialized Python runtimes installed in parallel, instead of a single
runtime:</p>
<ul class="simple">
<li>python3.7: regular/legacy CPython, backward compatible</li>
<li>python3.7-dbg: runtime checks to ease debug</li>
<li>fasterpython3.7: use specialized list</li>
<li>etc.</li>
</ul>
<p>The <tt class="docutils literal">python3</tt> runtime would remain <strong>fully</strong> compatible since it would use
the old C API with macros and full structures. So by default, everything will
continue to work.</p>
<p>But the other runtimes require that all imported C extensions were compiled
with the new C API.</p>
<p><tt class="docutils literal"><span class="pre">python3.7-dbg</span></tt> adds more checks tested at runtime. Example:</p>
<pre class="literal-block">
PyObject* PyList_GET_ITEM(PyObject *list, Py_ssize_t index)
{
    assert(PyList_Check(list));
    assert(0 <= index && index < Py_SIZE(list));
    return ((PyListObject *)list)->ob_item[index];
}
</pre>
<p>Currently, some Linux distributions provide a <tt class="docutils literal"><span class="pre">python3-dbg</span></tt> binary, but may
not provide <tt class="docutils literal"><span class="pre">-dbg</span></tt> binary packages of all C extensions. So all C extensions
have to be recompiled manually, which is quite painful (need to install build
dependencies, wait until everything is recompiled, etc.).</p>
</div>
<div class="section" id="experiment-optimizations">
<h2>Experiment optimizations</h2>
<p>With the new C API, it becomes possible to implement a new class of
optimizations.</p>
<div class="section" id="tagged-pointer">
<h3>Tagged pointer</h3>
<p>Store small integers directly into the pointer value. Reduce the memory usage,
avoid expensive unboxing-boxing.</p>
<p>See <a class="reference external" href="https://en.wikipedia.org/wiki/Tagged_pointer">Wikipedia: Tagged pointer</a>.</p>
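<p>The idea can be illustrated in Python (a toy sketch of the encoding, not how
CPython works today): use the low bit of a machine word as a tag, and store a
small integer shifted into the remaining bits. Real object pointers are aligned,
so their low bit is always 0 and the two cases cannot be confused:</p>

```python
def tag_small_int(n):
    # Low bit set to 1 means "this word is an inline integer, not a pointer".
    return (n << 1) | 1

def untag(word):
    if word & 1:         # tagged: decode the integer in place, no heap object
        return word >> 1
    raise TypeError("word is a real pointer, dereference it instead")

assert untag(tag_small_int(42)) == 42
assert untag(tag_small_int(-7)) == -7   # arithmetic shift keeps the sign
```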
</div>
<div class="section" id="no-garbage-collector-gc-at-all">
<h3>No garbage collector (GC) at all</h3>
<p>Python runtime without GC at all. Remove the following header from objects
tracked by the GC:</p>
<pre class="literal-block">
struct {
    union _gc_head *gc_next;
    union _gc_head *gc_prev;
    Py_ssize_t gc_refs;
} PyGC_Head;
</pre>
<p>It would remove 24 bytes per object tracked by the GC.</p>
<p>For comparison, the smallest Python object is "object()" which only takes 16
bytes.</p>
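<p>This size can be checked with <tt class="docutils literal">sys.getsizeof()</tt>
(the value below assumes a typical 64-bit CPython build):</p>

```python
import sys

# The smallest Python object: just a reference count and a type pointer,
# two 8-byte fields on a 64-bit build.
print(sys.getsizeof(object()))  # 16
```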
</div>
<div class="section" id="tracing-garbage-collector-without-reference-counting">
<h3>Tracing garbage collector without reference counting</h3>
<p>This idea is really the most complex and most experimental one, but IMHO it's
required to "unlock" Python performance.</p>
<ul class="simple">
<li>Write a new API to keep track of pointers:<ul>
<li>Declare a variable storing a <tt class="docutils literal">PyObject*</tt> object</li>
<li>Set a pointer</li>
<li>Maybe also read a pointer?</li>
</ul>
</li>
<li>Modify C extensions to use this new API</li>
<li>Implement a tracing garbage collector which can move objects in memory
to compact memory</li>
<li>Remove reference counting</li>
</ul>
<p>It even seems possible to implement a tracing garbage collector <strong>and</strong> use
reference counting. But I'm not an expert in this area, I need to dig into the topic.</p>
<p>Questions:</p>
<ul class="simple">
<li>Is it possible to fix all C extensions to use the new API? Should be an
opt-in option in a first stage.</li>
<li>Is it possible to emulate the Py_INCREF/DECREF API, for backward compatibility,
using a hash table which maintains a reference counter outside <tt class="docutils literal">PyObject</tt>?</li>
<li>Do we need to fix all C extensions?</li>
</ul>
<p>Read also <a class="reference external" href="https://en.wikipedia.org/wiki/Tracing_garbage_collection">Wikipedia: Tracing garbage collection</a>.</p>
</div>
<div class="section" id="gilectomy">
<h3>Gilectomy</h3>
<p>Abstracting the ABI allows customizing the runtime for Gilectomy's needs, to
be able to remove the GIL.</p>
<p>Removing reference counting would make Gilectomy much simpler.</p>
</div>
</div>
My contributions to CPython during 2017 Q2 (part 3)2017-07-13T17:00:00+02:002017-07-13T17:00:00+02:00Victor Stinnertag:vstinner.github.io,2017-07-13:/contrib-cpython-2017q2-part3.html<p>This is the third part of my contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q2 (april, may, june):</p>
<ul class="simple">
<li>Security</li>
<li>Tricky bug: Clang 4.0, dtoa and strict aliasing</li>
<li>sigwaitinfo() race condition in test_eintr</li>
<li>FreeBSD test_subprocess core dump</li>
</ul>
<p>Previous reports:</p>
<ul class="simple">
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part1.html">My contributions to CPython during 2017 Q2 (part 1)</a>.</li>
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part2.html">My contributions to CPython …</a></li></ul><p>This is the third part of my contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q2 (april, may, june):</p>
<ul class="simple">
<li>Security</li>
<li>Tricky bug: Clang 4.0, dtoa and strict aliasing</li>
<li>sigwaitinfo() race condition in test_eintr</li>
<li>FreeBSD test_subprocess core dump</li>
</ul>
<p>Previous reports:</p>
<ul class="simple">
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part1.html">My contributions to CPython during 2017 Q2 (part 1)</a>.</li>
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part2.html">My contributions to CPython during 2017 Q2 (part 2)</a>.</li>
</ul>
<p>Next report:</p>
<ul class="simple">
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part1.html">My contributions to CPython during 2017 Q3: Part 1</a>.</li>
</ul>
<div class="section" id="security">
<h2>Security</h2>
<div class="section" id="backport-fixes">
<h3>Backport fixes</h3>
<p>I am trying to backport all known security fixes to the 6 maintained Python
branches: 2.7, 3.3, 3.4, 3.5, 3.6 and master.</p>
<p>I created the <a class="reference external" href="http://python-security.readthedocs.io/">python-security.readthedocs.io</a> website to track these
vulnerabilities, especially which Python versions are fixed, to identify
missing backports.</p>
<p>Python 2.7, 3.5, 3.6 and master are in quite good shape; I am still working on
backporting fixes into 3.4 and 3.3. Larry Hastings merged my 3.4 backports and
other security fixes, and scheduled a new 3.4.7 release in the coming weeks.
Later, I will try to fix Python 3.3 as well, before its end-of-life, scheduled
for the end of September.</p>
<p>See the <a class="reference external" href="https://docs.python.org/devguide/#status-of-python-branches">Status of Python branches</a> in the
devguide.</p>
</div>
<div class="section" id="libexpat-2-2">
<h3>libexpat 2.2</h3>
<p>Python embeds a copy of libexpat to ease Python compilation on Windows and
macOS. It means that we have to remember to upgrade it at each libexpat
release. This is especially important when security vulnerabilities are fixed
in libexpat.</p>
<p>libexpat 2.2 was released on 2016-06-21 and contains such fixes for
vulnerabilities, see: <a class="reference external" href="http://python-security.readthedocs.io/vuln/cve-2016-0718_expat_2.2_bug_537.html">CVE-2016-0718: expat 2.2, bug #537</a>.</p>
<p>Sadly, it took us a few months to upgrade libexpat. I wrote a short shell
script to easily upgrade libexpat: recreate the <tt class="docutils literal">Modules/expat/</tt> directory
from a libexpat tarball.</p>
<p>My commit:</p>
<blockquote>
<p>bpo-29591: Upgrade Modules/expat to libexpat 2.2 (#2164)</p>
<p>Remove the configuration (<tt class="docutils literal"><span class="pre">Modules/expat/*config.h</span></tt>) of unsupported
platforms: Amiga, MacOS Classic on PPC32, Open Watcom.</p>
<p>Remove XML_HAS_SET_HASH_SALT define: it became useless since our local
expat copy was upgrade to expat 2.1 (it's now expat 2.2.0).</p>
</blockquote>
<p>I upgraded libexpat to 2.2 in the Python 2.7, 3.4, 3.5, 3.6 and master
branches. I still have a pending pull request for 3.3.</p>
</div>
<div class="section" id="libexpat-2-2-1">
<h3>libexpat 2.2.1</h3>
<p>Just after I finally upgraded our libexpat copy to 2.2.0... libexpat 2.2.1 was
released with new security fixes! See <a class="reference external" href="http://python-security.readthedocs.io/vuln/cve-2017-9233_expat_2.2.1.html">CVE-2017-9233: Expat 2.2.1</a></p>
<p>Again, I upgraded libexpat to 2.2.1 in all branches (pending: 3.3), see
bpo-30694. My commit:</p>
<blockquote>
<p>Upgrade expat copy from 2.2.0 to 2.2.1 to get fixes
of multiple security vulnerabilities including:</p>
<ul class="simple">
<li>CVE-2017-9233 (External entity infinite loop DoS),</li>
<li>CVE-2016-9063 (Integer overflow, re-fix),</li>
<li>CVE-2016-0718 (Fix regression bugs from 2.2.0's fix to CVE-2016-0718)</li>
<li>CVE-2012-0876 (Counter hash flooding with SipHash).</li>
</ul>
<p>Note: the CVE-2016-5300 (Use os-specific entropy sources like getrandom)
doesn't impact Python, since Python already gets entropy from the OS to set
the expat secret using <tt class="docutils literal">XML_SetHashSalt()</tt>.</p>
</blockquote>
</div>
<div class="section" id="urllib-splithost-vulnerability">
<h3>urllib splithost() vulnerability</h3>
<p>Vulnerability: <a class="reference external" href="http://python-security.readthedocs.io/vuln/bpo-30500_urllib_connects_to_a_wrong_host.html">bpo-30500: urllib connects to a wrong host</a>.</p>
<p>While it was quick to confirm the vulnerability, it was tricky to decide how to
properly <strong>fix it without breaking backward compatibility</strong>. We had too few
unit tests, and no obvious definition of the <em>expected</em> behaviour. I
contributed to the discussion and to polishing the fix:</p>
<p>bpo-30500 commit:</p>
<blockquote>
Fix urllib.parse.splithost() to correctly parse fragments. For example,
<tt class="docutils literal"><span class="pre">splithost('//127.0.0.1#@evil.com/')</span></tt> now correctly returns the
<tt class="docutils literal">127.0.0.1</tt> host, instead of treating <tt class="docutils literal">@evil.com</tt> as the host in an
authentification (<tt class="docutils literal">login@host</tt>).</blockquote>
<p>Fix applied to master, 3.6, 3.5, 3.4 and 2.7; pending pull request for 3.3.</p>
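<p>The fixed behaviour can also be observed with the modern
<tt class="docutils literal">urllib.parse</tt> API: everything after
<tt class="docutils literal">#</tt> belongs to the fragment and must not leak into the
host:</p>

```python
from urllib.parse import urlsplit

parts = urlsplit("http://127.0.0.1#@evil.com/")
print(parts.hostname)  # 127.0.0.1: the real host
print(parts.fragment)  # @evil.com/: never treated as login@host credentials
```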
</div>
<div class="section" id="travis-ci">
<h3>Travis CI</h3>
<p>I also wrote a pull request to enable Travis CI and AppVeyor on the Python 3.3
and 3.4 branches, so that security fixes are tested by the CI. These changes
are complex and not merged yet, but I am now confident that the CI will be
enabled on 3.4!</p>
<p>My PR for Python 3.4: <a class="reference external" href="https://github.com/python/cpython/pull/2475">[3.4] Backport CI config from master</a>.</p>
</div>
</div>
<div class="section" id="tricky-bug-clang-4-0-dtoa-and-strict-aliasing">
<h2>Tricky bug: Clang 4.0, dtoa and strict aliasing</h2>
<p>Aha, another funny story about compilers: bpo-30104.</p>
<p>I noticed that the following tests started to fail on the "AMD64 FreeBSD
CURRENT Debug 3.x" buildbot:</p>
<ul class="simple">
<li>test_cmath</li>
<li>test_float</li>
<li>test_json</li>
<li>test_marshal</li>
<li>test_math</li>
<li>test_statistics</li>
<li>test_strtod</li>
</ul>
<p>First, I bet on a libc change on FreeBSD. Then, I found that test_strtod fails
on FreeBSD using clang 4.0, but passes on FreeBSD using clang 3.8.</p>
<p>I started to bisect the code on Linux using a subset of <tt class="docutils literal">Python/dtoa.c</tt>:</p>
<ul class="simple">
<li>Start (integrated in CPython code base): 2,876 lines</li>
<li>dtoa2.c (standalone): 2,865 lines</li>
<li>dtoa5.c: 50 lines</li>
</ul>
<p>Extract of dtoa5.c:</p>
<pre class="literal-block">
typedef union { double d; uint32_t L[2]; } U;

struct Bigint { int wds; };

static double
ratio(struct Bigint *a)
{
    U da, db;
    int k, ka, kb;
    double r;

    da.d = 1.682;
    ka = 6;
    db.d = 1.0;
    kb = 5;
    k = ka - kb + 32 * (a->wds - 12);
    printf("k=%i\n", k);
    if (k > 0)
        da.L[1] += k * 0x100000;
    else {
        k = -k;
        db.L[1] += k * 0x100000;
    }
    r = da.d / db.d;
    /* r == 3.364 */
    return r;
}
</pre>
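<p>What the union trick does at the hardware level is scale a double by a power
of two by adding directly to its exponent field: <tt class="docutils literal">L[1]</tt>
is the high 32-bit word of the double, and <tt class="docutils literal">0x100000</tt>
(bit 20 of that word, bit 52 overall) is one unit of the 11-bit biased
exponent. The same bit manipulation can be written in Python with
<tt class="docutils literal">struct</tt> (an illustration of the intended semantics,
assuming IEEE 754 doubles):</p>

```python
import struct

def scale_by_pow2(d, k):
    # Reinterpret the double as a 64-bit integer (what the U union does in C),
    # add k to the 11-bit biased exponent field (bits 52-62), convert back.
    bits = struct.unpack("<Q", struct.pack("<d", d))[0]
    bits += k << 52
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

print(scale_by_pow2(1.682, 1))  # 3.364, exactly 1.682 * 2, like in dtoa5.c
```

Clang's optimizer breaks the C version because reading <tt class="docutils literal">da.d</tt> after writing <tt class="docutils literal">da.L[1]</tt> through the union violates its interpretation of the strict aliasing rules; the Python version has no such issue since the reinterpretation is explicit.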
<p>Even with a very short C code (50 lines) reproducing the bug, I was still
unable to understand it. I read many articles about aliasing, and I still
don't fully understand the bug... I suggest these two good articles:</p>
<ul class="simple">
<li><a class="reference external" href="http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html">Understanding Strict Aliasing</a>
(Mike Acton, June 1, 2006)</li>
<li><a class="reference external" href="http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html">Demystifying The Restrict Keyword</a>
(Mike Acton, May 29, 2006)</li>
</ul>
<p>Anyway, I wanted to report the bug to clang (LLVM), but the LLVM bug tracker was
migrating and I was unable to subscribe to get an account!</p>
<p>In the meanwhile, <strong>Dimitry Andric</strong>, a FreeBSD developer, told me that he got
<em>exactly</em> the same clang 4.0 issue with "dtoa.c" in the <em>julia</em> programming
language. Two months before I saw the same bug, he already reported the bug to
FreeBSD: <a class="reference external" href="https://bugs.freebsd.org/216770">lang/julia: fails to build with clang 4.0</a>, and to clang: <a class="reference external" href="https://bugs.llvm.org//show_bug.cgi?id=31928">After r280351: if/else
blocks incorrectly optimized away?</a>.</p>
<p>The "problem" is that clang
developers disagree that it's a bug. In short, the discussion was about the C
standard: does clang respect the C aliasing rules or not? In the end, clang
developers consider that they are right to optimize. To summarize:</p>
<blockquote>
It's a bug in the code, not in the compiler</blockquote>
<p>So I made a first change to use the <tt class="docutils literal"><span class="pre">-fno-strict-aliasing</span></tt> flag when Python
is compiled with clang:</p>
<blockquote>
Python/dtoa.c is not compiled correctly with clang 4.0 and
optimization level -O2 or higher, because of an aliasing issue on
the double/ULong[2] union.</blockquote>
<p>But this change can make Python slower when compiled with clang, so I was asked
to only compile <tt class="docutils literal">Python/dtoa.c</tt> with this flag:</p>
<blockquote>
On clang, only compile dtoa.c with -fno-strict-aliasing, use strict
aliasing to compile all other C files.</blockquote>
</div>
<div class="section" id="sigwaitinfo-race-condition-in-test-eintr">
<h2>sigwaitinfo() race condition in test_eintr</h2>
<div class="section" id="the-tricky-test-eintr">
<h3>The tricky test_eintr</h3>
<p>When I wrote and implemented the <a class="reference external" href="https://www.python.org/dev/peps/pep-0475/">PEP 475, Retry system calls failing with
EINTR</a>, I didn't expect so many
annoying bugs in the newly written <tt class="docutils literal">test_eintr</tt> unit test. This test
invokes system calls while signals are sent every 100 ms. Usually a test tries
to block on a system call for at least 200 ms, to make sure that the syscall
is interrupted at least once by a signal, and then checks that Python correctly
retries the interrupted system call.</p>
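<p>The idea can be sketched as follows (a minimal Unix-only sketch, not the
actual test code): block in a syscall while an interval timer fires signals,
and check that the call is transparently retried.</p>

```python
import os
import signal
import time

# Unix-only sketch of the test_eintr idea (not the actual test code):
# block in a syscall while an interval timer delivers a signal every
# 100 ms, and check that Python (PEP 475) retries the interrupted call.
def handler(signum, frame):
    pass  # must not raise, so the interrupted syscall is retried

signal.signal(signal.SIGALRM, handler)
signal.setitimer(signal.ITIMER_REAL, 0.1, 0.1)   # SIGALRM every 100 ms

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    time.sleep(0.3)            # keep the parent blocked longer than 200 ms
    os.write(w, b"done")
    os._exit(0)

data = os.read(r, 4)           # interrupted by SIGALRM, retried by Python
signal.setitimer(signal.ITIMER_REAL, 0, 0)       # disarm the timer
os.waitpid(pid, 0)
assert data == b"done"
```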
<p>Since the PEP was implemented, I already fixed many race conditions in
<tt class="docutils literal">test_eintr</tt>, but there was still a race condition on the <tt class="docutils literal">sigwaitinfo()</tt>
unit test. <em>Sometimes</em> on a <em>few specific buildbots</em> (FreeBSD), the test fails
randomly.</p>
</div>
<div class="section" id="first-attempt">
<h3>First attempt</h3>
<p>My first attempt was <a class="reference external" href="http://bugs.python.org/issue25277">bpo-25277</a>,
opened on 2015-09-30. I added faulthandler to dump tracebacks if a test hangs
longer than 10 minutes. Then I changed the sleep from 200 ms to 2 seconds in
the <tt class="docutils literal">sigwaitinfo()</tt> test... just to make the bug less likely, but using a
longer sleep doesn't fix the root issue.</p>
</div>
<div class="section" id="second-attempt">
<h3>Second attempt</h3>
<p>My second attempt was <a class="reference external" href="http://bugs.python.org/issue25868">bpo-25868</a>,
opened on 2015-12-15. I added a pipe to "synchronize the parent and the child
processes", to try to make the sigwaitinfo() test a little bit more reliable. I
also reduced the sleep from 2 seconds to 100 ms.</p>
<p>7 minutes after my fix, <strong>Martin Panter</strong> wrote:</p>
<blockquote>
<p>With the pipe, there is still a potential race after the parent writes to
the pipe and before sigwaitinfo() is invoked, versus the child sleep()
call.</p>
<p>What do you think of my suggestion to block the signal? Then (in theory) it
should be robust, rather than relying on timing.</p>
</blockquote>
<p>I replied that I wasn't sure that the sigwaitinfo() EINTR error would still
be tested if we made his proposed change.</p>
<p>One month later, Martin wrote a patch but I was unable to make a decision on
his change. In September 2016, Martin noticed a new test failure on the FreeBSD
9 buildbot.</p>
</div>
<div class="section" id="third-attempt">
<h3>Third attempt</h3>
<p>My third attempt was bpo-30320, opened on 2017-05-09. This time, I really
wanted to fix <em>all</em> random buildbot failures. Since I was now able to reproduce
the bug on my FreeBSD VM, I was able to write a fix and also to check that:</p>
<ul class="simple">
<li>sigwaitinfo() and sigtimedwait() fail with EINTR and Python automatically
restarts the interrupted syscall</li>
<li>I hacked the test file to only run the sigwaitinfo() and sigtimedwait() unit
tests. Running the test in a loop no longer fails: I ran the test for 5
minutes in 10 shells (10 instances in parallel) with no failure, so the
race condition seems to be gone.</li>
</ul>
<p>So I <a class="reference external" href="https://github.com/python/cpython/commit/211a392cc15f9a7b1b8ce65d8f6c9f8237d1b77f">pushed my fix</a>:</p>
<blockquote>
<p>bpo-30320: test_eintr now uses pthread_sigmask()</p>
<p>Rewrite sigwaitinfo() and sigtimedwait() unit tests for EINTR using
pthread_sigmask() to fix a race condition between the child and the
parent process.</p>
<p>Remove the pipe which was used as a weak workaround against the race
condition.</p>
<p>sigtimedwait() is now tested with a child process sending a signal
instead of testing the timeout feature which is more unstable
(especially regarding to clock resolution depending on the platform).</p>
</blockquote>
<p>To be honest, when I pushed my fix, I wasn't really confident that blocking
the awaited signal was the proper fix.</p>
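<p>The core idea of the fix can be sketched like this (a simplified Unix-only
sketch of the technique, not the actual test code): once the signal is blocked,
it stays pending until sigwaitinfo() consumes it, so the timing between parent
and child no longer matters.</p>

```python
import os
import signal

# Unix-only sketch of the fix (not the actual test code): block the
# signal *before* the child can send it, so it stays pending until
# sigwaitinfo() consumes it: no timing-based race remains.
signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGUSR1])

pid = os.fork()
if pid == 0:
    os.kill(os.getppid(), signal.SIGUSR1)    # may arrive at any time
    os._exit(0)

info = signal.sigwaitinfo([signal.SIGUSR1])  # picks up the pending signal
signal.pthread_sigmask(signal.SIG_UNBLOCK, [signal.SIGUSR1])
os.waitpid(pid, 0)
assert info.si_signo == signal.SIGUSR1
```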
<p>So it took <strong>1 year and 8 months</strong> to really find and fix the root bug.</p>
<p>Sadly, while I was working on dozens of other bugs, I completely lost track of
Martin's patch, even though I had opened bpo-25868. Sorry Martin for forgetting
to review your patch! But when you wrote it, I was unable to test whether
sigwaitinfo() was still failing with EINTR.</p>
</div>
</div>
<div class="section" id="freebsd-test-subprocess-core-dump">
<h2>FreeBSD test_subprocess core dump</h2>
<p>bpo-30448: For one month, some FreeBSD buildbots were emitting this warning,
which started to annoy me, since I was trying to fix <em>all</em> buildbot warnings:</p>
<pre class="literal-block">
Warning -- files was modified by test_subprocess
Before: []
After: ['python.core']
</pre>
<p>I tried and failed to reproduce the warning on my FreeBSD 11 VM. I also asked a
friend to reproduce the bug, but he also failed. I was developing my
<tt class="docutils literal">test.bisect</tt> tool and I wanted to get access to a machine to reproduce the
bug!</p>
<p>Later, <strong>Kubilay Kocak</strong> aka <em>koobs</em> gave me access to his FreeBSD buildbots
and in a few seconds with my new test.bisect tool, I identified that the
<tt class="docutils literal">test_child_terminated_in_stopped_state()</tt> test triggers a deliberate crash,
but doesn't disable core dump creation. The fix is simple: use the
<tt class="docutils literal">test.support.SuppressCrashReport</tt> context manager. Thanks <em>koobs</em> for the
access!</p>
<p>Maybe only FreeBSD 10 and older dump a core on this specific test, not FreeBSD
11. I don't know why. The test is special: it tests a process which crashes
while being traced with <tt class="docutils literal">ptrace()</tt>.</p>
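<p>The fix can be sketched as follows (a simplified sketch, not the actual test
code): on Unix, SuppressCrashReport sets RLIMIT_CORE to 0, which child
processes inherit, so the deliberate crash leaves no <tt class="docutils literal">python.core</tt> file
behind.</p>

```python
import subprocess
import sys
from test.support import SuppressCrashReport

# Sketch of the fix (simplified from the actual test): on Unix,
# SuppressCrashReport sets RLIMIT_CORE to 0, and child processes
# inherit the limit, so the deliberate crash dumps no core file.
code = "import faulthandler; faulthandler._sigsegv()"
with SuppressCrashReport():
    proc = subprocess.run([sys.executable, "-c", code])
assert proc.returncode != 0    # the child crashed, but left no core dump
```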
</div>
My contributions to CPython during 2017 Q2 (part 2)2017-07-13T16:30:00+02:002017-07-13T16:30:00+02:00Victor Stinnertag:vstinner.github.io,2017-07-13:/contrib-cpython-2017q2-part2.html<p>This is the second part of my contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q2 (april, may, june):</p>
<ul class="simple">
<li>Mentoring</li>
<li>Reference and memory leaks</li>
<li>Contributions</li>
<li>Enhancements</li>
<li>Bugfixes</li>
<li>Stars of the CPython GitHub project</li>
</ul>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part1.html">My contributions to CPython during 2017 Q2 (part 1)</a>.</p>
<p>Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part3.html">My contributions to CPython during 2017 Q2 …</a></p><p>This is the second part of my contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q2 (april, may, june):</p>
<ul class="simple">
<li>Mentoring</li>
<li>Reference and memory leaks</li>
<li>Contributions</li>
<li>Enhancements</li>
<li>Bugfixes</li>
<li>Stars of the CPython GitHub project</li>
</ul>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part1.html">My contributions to CPython during 2017 Q2 (part 1)</a>.</p>
<p>Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part3.html">My contributions to CPython during 2017 Q2 (part 3)</a>.</p>
<div class="section" id="mentoring">
<h2>Mentoring</h2>
<p>During this quarter, I tried to mark "easy" issues using a "[EASY]" tag in
their title and the "easy" or "easy C" keyword. I announced these issues on the
<a class="reference external" href="https://www.python.org/dev/core-mentorship/">core-mentorship mailing list</a>.
I asked core developers not to fix these easy issues, but rather to explain how
to fix them. In each issue, I described how to fix it.</p>
<p>It was a success since all easy issues were fixed quickly: usually the PR was
merged less than 24 hours after I created the issue!</p>
<p>I mentored <strong>Stéphane Wirtel</strong> and <strong>Louie Lu</strong> to fix issues (easy or not).
During this quarter, Stéphane Wirtel got <strong>5 commits</strong> merged into master (on a
<strong>total of 11 commits</strong>), and Louie Lu got <strong>6 commits</strong> merged into master (on
a <strong>total of 10 commits</strong>).</p>
<p>They helped me to fix reference leaks spotted by the new Refleaks buildbots.</p>
</div>
<div class="section" id="reference-and-memory-leaks">
<h2>Reference and memory leaks</h2>
<p>Zachary Ware installed Gentoo and Windows buildbots running the Python test
suite with <tt class="docutils literal"><span class="pre">--huntrleaks</span></tt> to detect reference and memory leaks.</p>
<p>I worked hard with others, especially Stéphane Wirtel and Louie Lu, to fix
<em>all</em> reference leaks and memory leaks in Python 2.7, 3.5, 3.6 and master.
Right now, there are no more leaks on Windows! For Gentoo, the buildbot is
currently offline, but I am confident that all leaks are also fixed.</p>
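<p>For reference, hunting reference leaks locally looks like this (a sketch;
the exact buildbot invocation may differ):</p>

```shell
# Run a test with reference-leak hunting, as the Refleaks buildbots do:
# 3 warmup runs, then 3 measured runs per test.
./python -m test --huntrleaks 3:3 test_struct
```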
<ul class="simple">
<li>bpo-30598: _PySys_EndInit() now duplicates warnoptions. Fix a reference leak
in subinterpreters, like test_callbacks_leak() of test_atexit. warnoptions is
a list used to pass options from the command line to the sys module
constructor. Before this change, the list was shared by multiple interpreters,
which is not the expected behaviour: each interpreter should have its own
independent mutable world. This change duplicates the list in each
interpreter, so each interpreter owns its own list and can clear it
independently.</li>
<li>bpo-30601: Fix a refleak in WindowsConsoleIO. Fix a reference leak in
_io._WindowsConsoleIO: PyUnicode_FSDecoder() always initializes decodedname
when it succeeds and doesn't clear the input decodedname object.</li>
<li>bpo-30599: Fix test_threaded_import reference leak. Mock
os.register_at_fork() when importing the random module, since this function
doesn't allow unregistering callbacks and so leaked memory.</li>
<li>2.7: _tkinter: Fix refleak in getint(). PyNumber_Int() creates a new reference:
need to decrement result reference counter.</li>
<li>bpo-30635: Fix refleak in test_c_locale_coercion. When checking for reference
leaks, test_c_locale_coercion is run multiple times and so
_LocaleCoercionTargetsTestCase.setUpClass() is called multiple times.
setUpClass() appends new value at each call, so it looks like a reference
leak. Moving the setup from setUpClass() to setUpModule() avoids this,
eliminating the false alarm.</li>
<li>bpo-30602: Fix refleak in os.spawnve(). When os.spawnve() fails while
handling arguments, free correctly argvlist: pass lastarg+1 rather than
lastarg to free_string_array() to also free the first item.</li>
<li>bpo-30602: Fix refleak in os.spawnv(). When os.spawnv() fails while handling
arguments, free correctly argvlist: pass lastarg+1 rather than lastarg to
free_string_array() to also free the first item.</li>
<li>Fix ref cycles in TestCase.assertRaises(). bpo-23890:
unittest.TestCase.assertRaises() now manually breaks a reference cycle to not
keep objects alive longer than expected.</li>
<li>Python 2.7: bpo-30675: Fix refleak hunting in regrtest. regrtest now warms up
caches: create explicitly all internal singletons which are created on demand
to prevent false positives when checking for reference leaks.</li>
<li>_winconsoleio: Fix memory leak. Fix memory leak when _winconsoleio tries to
open a non-console file: free the name buffer.</li>
<li>bpo-30813: Fix unittest when hunting refleaks. bpo-11798, bpo-16662,
bpo-16935, bpo-30813: Skip
test_discover_with_module_that_raises_SkipTest_on_import() and
test_discover_with_init_module_that_raises_SkipTest_on_import() of
test_unittest when hunting reference leaks using regrtest.</li>
<li>bpo-30704, bpo-30604: Fix memleak in code_dealloc(): free also
co_extra->ce_extras, not only co_extra. Note: Serhiy later rewrote the structure
in master to use a single memory block, implementing my idea.</li>
</ul>
<div class="section" id="python-3-5-regrtest-fix">
<h3>Python 3.5 regrtest fix</h3>
<p>bpo-30675, Fix the multiprocessing code in regrtest:</p>
<ul class="simple">
<li>Rewrite code to pass <tt class="docutils literal">slaveargs</tt> from the master process to worker
processes: reuse the same code as the Python master branch.</li>
<li>Move code to initialize tests in a new <tt class="docutils literal">setup_tests()</tt> function,
similar change was done in the master branch.</li>
<li>In a worker process, call <tt class="docutils literal">setup_tests()</tt> with the namespace built
from <tt class="docutils literal">slaveargs</tt> to initialize tests correctly.</li>
</ul>
<p>Before this change, <tt class="docutils literal">warm_caches()</tt> was not called in worker processes
because the setup was done before rebuilding the namespace from <tt class="docutils literal">slaveargs</tt>.
As a consequence, the <tt class="docutils literal">huntrleaks</tt> feature was unstable. For example,
<tt class="docutils literal">test_zipfile</tt> reported randomly false positive on reference leaks.</p>
</div>
<div class="section" id="false-positives">
<h3>False positives</h3>
<p>bpo-30776: reduce regrtest -R false positives (#2422)</p>
<ul class="simple">
<li>Change the regrtest --huntrleaks checker to decide if a test file
leaks or not. Require that each run leaks at least 1 reference.</li>
<li>Warmup runs are now completely ignored: ignored in the checker test
and not used anymore to compute the sum.</li>
<li>Add a unit test for a reference leak.</li>
</ul>
<p>Example of reference count differences previously considered a failure
(leak) and now considered a success (no leak):</p>
<pre class="literal-block">
[3, 0, 0]
[0, 1, 0]
[8, -8, 1]
</pre>
<p>The same change was done to check for memory leaks.</p>
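<p>The new decision rule can be sketched as follows (a simplified model of the
checker, not the exact regrtest code):</p>

```python
# A test is only reported as leaking if *every* measured run leaked at
# least one reference; warmup runs are ignored entirely.
def check_rc_deltas(deltas):
    return all(delta >= 1 for delta in deltas)

assert not check_rc_deltas([3, 0, 0])   # previously a failure, now OK
assert not check_rc_deltas([0, 1, 0])
assert not check_rc_deltas([8, -8, 1])
assert check_rc_deltas([5, 5, 5])       # a genuine, consistent leak
```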
</div>
</div>
<div class="section" id="contributions">
<h2>Contributions</h2>
<p>This quarter, I helped to merge two contributions:</p>
<ul class="simple">
<li>bpo-9850: Deprecate the macpath module. Co-Authored-By: <strong>Chi Hsuan Yen</strong>.</li>
<li>bpo-30595: Fix multiprocessing.Queue.get(timeout).
multiprocessing.Queue.get() with a timeout now polls its reader in
non-blocking mode if it succeeded in acquiring the lock but the acquire took
longer than the timeout. Co-Authored-By: <strong>Grzegorz Grzywacz</strong>.</li>
</ul>
</div>
<div class="section" id="enhancements">
<h2>Enhancements</h2>
<ul class="simple">
<li>bpo-30265: support.unlink() now only ignores ENOENT and ENOTDIR, instead of
ignoring all OSError exceptions.</li>
<li>bpo-30054: Expose tracemalloc C API: make PyTraceMalloc_Track() and
PyTraceMalloc_Untrack() functions public. numpy is able to use
tracemalloc since numpy 1.13.</li>
</ul>
</div>
<div class="section" id="bugfixes">
<h2>Bugfixes</h2>
<ul class="simple">
<li>bpo-30125: On Windows, faulthandler.disable() now removes the exception
handler installed by faulthandler.enable().</li>
<li>bpo-30284: Fix regrtest for out of tree build. Use a build/ directory in the
build directory, not in the source directory, since the source directory may
be read-only and must not be modified. Fall back to the source directory if
the build directory is not available (missing "abs_builddir" sysconfig
variable).</li>
<li>test_locale now ignores the DeprecationWarning and no longer fails if tests
are run with <tt class="docutils literal">python3 <span class="pre">-Werror</span></tt>. Also fix the deprecation message: add a space.</li>
<li>Fix compiler warnings on AIX: only define get_zone() and get_gmtoff() if
needed.</li>
<li>Fix a compiler warning in tmtotuple(): use the <tt class="docutils literal">time_t</tt> type for the
<tt class="docutils literal">gmtoff</tt> parameter.</li>
<li>bpo-30264: ExpatParser closes the source on error. ExpatParser.parse() of
xml.sax.xmlreader now always closes the source: close the file object or the
urllib object if source is a string (not an open file-like object). The
change fixes a ResourceWarning on parsing error. Add
test_parse_close_source() unit test.</li>
<li>Fix SyntaxWarning on importing test_inspect. Fix the following warning when
test_inspect.py is compiled to test_inspect.pyc:
<tt class="docutils literal">SyntaxWarning: tuple parameter unpacking has been removed in 3.x</tt></li>
<li>bpo-30418: On Windows, subprocess.Popen.communicate() now also ignore EINVAL
on stdin.write(): ignore also EINVAL if the child process is still running
but closed the pipe.</li>
<li>bpo-30257: _bsddb: Fix newDBObject(). Don't set cursorSetReturnsNone to
DEFAULT_CURSOR_SET_RETURNS_NONE anymore if self->myenvobj is set.
Fix a GCC warning on the strange indentation.</li>
<li>bpo-30231: Remove skipped test_imaplib tests. The public cyrus.andrew.cmu.edu
IMAP server (port 993) doesn't accept TLS connection using our self-signed
x509 certificate. Remove the two tests which are already skipped. Write a new
test_certfile_arg_warn() unit test for the certfile deprecation warning.</li>
</ul>
</div>
<div class="section" id="stars-of-the-cpython-github-project">
<h2>Stars of the CPython GitHub project</h2>
<p>On June 30, I wrote <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-June/148523.html">an email to python-dev</a> about
<a class="reference external" href="https://github.com/showcases/programming-languages">GitHub showcase of hosted programming languages</a>: Python is only #11 with
8,539 stars, behind PHP and Ruby! I suggested to "like" ("star"?) the <a class="reference external" href="https://github.com/python/cpython/">CPython
project on GitHub</a> if you like the Python
programming language!</p>
<p>Four days later, <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-July/148548.html">we got +2,389 new stars (8,539 => 10,928)</a>, thank
you! Python moved from the 11th place to the 9th, before Elixir and Julia.</p>
<p>Ben Hoyt <a class="reference external" href="https://www.reddit.com/r/Python/comments/6kg4w0/cpython_recently_moved_to_github_star_the_project/">posted it on reddit.com/r/Python</a>,
where it got a bit of traction. Terry Jan Reedy also <a class="reference external" href="https://mail.python.org/pipermail/python-list/2017-July/723476.html">posted it on python-list</a>.</p>
<p>Screenshot at 2017-07-13 showing Ruby, PHP and CPython:</p>
<a class="reference external image-reference" href="https://github.com/showcases/programming-languages"><img alt="GitHub showcase: Programming languages" src="https://vstinner.github.io/images/github_cpython_stars.png" /></a>
<p>CPython now has 11,512 stars, only 861 stars behind PHP ;-)</p>
</div>
My contributions to CPython during 2017 Q2 (part 1)2017-07-13T16:00:00+02:002017-07-13T16:00:00+02:00Victor Stinnertag:vstinner.github.io,2017-07-13:/contrib-cpython-2017q2-part1.html<p>This is the first part of my contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q2 (april, may, june):</p>
<ul class="simple">
<li>Statistics</li>
<li>Buildbots and test.bisect</li>
<li>Python 3.6.0 regression</li>
<li>struct.Struct.format type</li>
<li>Optimization: one less syscall per open() call</li>
<li>make regen-all</li>
</ul>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q1.html">My contributions to CPython during 2017 Q1</a>.</p>
<p>Next reports …</p><p>This is the first part of my contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q2 (april, may, june):</p>
<ul class="simple">
<li>Statistics</li>
<li>Buildbots and test.bisect</li>
<li>Python 3.6.0 regression</li>
<li>struct.Struct.format type</li>
<li>Optimization: one less syscall per open() call</li>
<li>make regen-all</li>
</ul>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q1.html">My contributions to CPython during 2017 Q1</a>.</p>
<p>Next reports:</p>
<ul class="simple">
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part2.html">My contributions to CPython during 2017 Q2 (part 2)</a>.</li>
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part3.html">My contributions to CPython during 2017 Q2 (part 3)</a>.</li>
<li><a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q3-part1.html">My contributions to CPython during 2017 Q3: Part 1</a>.</li>
</ul>
<div class="section" id="statistics">
<h2>Statistics</h2>
<pre class="literal-block">
# All branches
$ git log --after=2017-03-31 --before=2017-06-30 --reverse --branches='*' --author=Stinner > 2017Q2
$ grep '^commit ' 2017Q2|wc -l
222
# Master branch only
$ git log --after=2017-03-31 --before=2017-06-30 --reverse --author=Stinner origin/master|grep '^commit '|wc -l
85
</pre>
<p>Statistics: <strong>85</strong> commits in the master branch, a <strong>total of 222 commits</strong>:
most (but not all) of the remaining 137 commits are cherry-picked backports to
2.7, 3.5 and 3.6 branches.</p>
<p>Note: I didn't use <tt class="docutils literal"><span class="pre">--no-merges</span></tt> since we no longer use merges, but <tt class="docutils literal">git
<span class="pre">cherry-pick</span> <span class="pre">-x</span></tt>, to <em>backport</em> fixes. Before GitHub, we used to <strong>forwardport</strong>
with Mercurial merges (ex: commit into 3.6, then merge into master).</p>
</div>
<div class="section" id="buildbots-and-test-bisect">
<h2>Buildbots and test.bisect</h2>
<p>Since this article became way too long, I split it into sub-articles:</p>
<ul class="simple">
<li><a class="reference external" href="https://vstinner.github.io/python-test-bisect.html">New Python test.bisect tool</a></li>
<li><a class="reference external" href="https://vstinner.github.io/python-buildbots-2017q2.html">Work on Python buildbots, 2017 Q2</a></li>
</ul>
</div>
<div class="section" id="python-3-6-0-regression">
<h2>Python 3.6.0 regression</h2>
<p>I am ashamed: I introduced a tricky regression in Python 3.6.0 with my work on
FASTCALL optimizations :-( A special way to call C builtin functions was broken:</p>
<pre class="literal-block">
from datetime import datetime
next(iter(datetime.now, None))
</pre>
<p>This code raises a <tt class="docutils literal">StopIteration</tt> exception instead of formatting the
current date and time.</p>
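<p>For the curious, the two-argument form of iter() calls a callable until it
returns the sentinel; on a fixed Python, the call behaves as expected:</p>

```python
from datetime import datetime

# iter(callable, sentinel) calls the callable on each next() until the
# sentinel is returned; datetime.now never returns None, so next() must
# yield the current date and time instead of raising StopIteration.
it = iter(datetime.now, None)
value = next(it)
assert isinstance(value, datetime)
```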
<p>It's even worse: I was aware of the bug and had already fixed it in master, but
I just forgot to backport my fix: bpo-30524, fix _PyStack_UnpackDict().</p>
<p>To prevent regressions, I wrote exhaustive unit tests on the 3 FASTCALL
functions, commit: <a class="reference external" href="https://github.com/python/cpython/commit/3b5cf85edc188345668f987c824a2acb338a7816">bpo-30524: Write unit tests for FASTCALL</a></p>
</div>
<div class="section" id="struct-struct-format-type">
<h2>struct.Struct.format type</h2>
<p>Sometimes, fixing a bug can take longer than expected. In March 2014, <strong>Zbyszek
Jędrzejewski-Szmek</strong> reported a bug on the <tt class="docutils literal">format</tt> attribute of the
<tt class="docutils literal">struct.Struct</tt> class: this attribute type is bytes, whereas a Unicode string
(str) was expected.</p>
<p>I proposed to "just" change the attribute type in December 2014, but it was an
incompatible change which would break backward compatibility. <strong>Martin
Panter</strong> agreed and wrote a patch. <strong>Serhiy Storchaka</strong> asked to discuss such an
incompatible change on python-dev, but then nothing happened for more
than... 2 years!</p>
<p>In March 2017, I converted Martin's old patch into a new GitHub pull
request. <strong>Serhiy</strong> asked again to write to python-dev, so I wrote:
<a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-March/147688.html">Issue #21071: change struct.Struct.format type from bytes to str</a>. And...
I got zero answer.</p>
<p>Well, I didn't expect any, since it's a trivial change, and I don't expect
anyone to rely on the exact <tt class="docutils literal">format</tt> attribute type. Moreover, the
<tt class="docutils literal">struct.Struct</tt> constructor already accepts bytes and str types. If the
attribute is passed to the constructor: it just works.</p>
<p>In June 2017, Serhiy Storchaka replied to my email: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-June/148360.html">If nobody opposed to this
change it will be made in short time.</a></p>
<p>Since nobody replied, again, I just merged my pull request. So it took <strong>3
years and 3 months</strong> to change the type of an uncommon attribute :-)</p>
<p>Note: I never used this attribute... Before reading this issue, I didn't even
know that the <tt class="docutils literal">struct</tt> module has a <tt class="docutils literal">struct.Struct</tt> type...</p>
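<p>After the change (Python 3.7 and later), the attribute round-trips cleanly:</p>

```python
import struct

# Struct.format is now a str (it was bytes); the constructor accepts
# both types, so feeding the attribute back just works.
s = struct.Struct("<hh")
assert isinstance(s.format, str)
assert struct.Struct(s.format).size == s.size == 4
```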
</div>
<div class="section" id="optimization-one-less-syscall-per-open-call">
<h2>Optimization: one less syscall per open() call</h2>
<p>In bpo-30228, I modified FileIO.seek() and FileIO.tell() methods to now set the
internal seekable attribute to avoid one <tt class="docutils literal">fstat()</tt> syscall per Python open()
call in buffered or text mode.</p>
<p>The seekable property is now also more reliable since its value is
set correctly on memory allocation failure.</p>
<p>I still have a second pending pull request to remove one more <tt class="docutils literal">fstat()</tt>
syscall: <a class="reference external" href="https://github.com/python/cpython/pull/1385">bpo-30228: TextIOWrapper uses abs_pos, not tell()</a>.</p>
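<p>The optimization can be illustrated like this (a sketch of the observable
behaviour, not the C implementation): a successful lseek() performed by tell()
proves the file is seekable, so the result is cached.</p>

```python
import io
import os

# A successful tell() (an lseek() under the hood) proves the file is
# seekable, so FileIO caches the answer and seekable() no longer needs
# an extra fstat() syscall.
fd = os.open(os.devnull, os.O_RDONLY)
f = io.FileIO(fd, "r", closefd=True)
f.tell()                       # sets the internal "seekable" flag
was_seekable = f.seekable()    # answered from the cached flag
f.close()
assert was_seekable
```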
</div>
<div class="section" id="make-regen-all">
<h2>make regen-all</h2>
<p>I started to look at bpo-23404, because the Python compilation failed on the
"AMD64 FreeBSD 9.x 3.x" buildbot when trying to regenerate the
<tt class="docutils literal">Include/opcode.h</tt> file.</p>
<div class="section" id="old-broken-make-touch">
<h3>Old broken make touch</h3>
<p>We had a <tt class="docutils literal">make touch</tt> command to work around this file timestamp issue, but
the command used Mercurial, whereas Python migrated to Git last February. The
buildbot "touch" step was removed because <tt class="docutils literal">make touch</tt> was broken.</p>
<p>I was always annoyed by the Makefile wanting to regenerate generated files
because of wrong file modification times, even though the generated files were
already up to date.</p>
<p>The bug annoyed me on OpenIndiana where "make touch" didn't work because the
operating system only provides Python 2.6 and Mercurial didn't work on this
version.</p>
<p>The bug also annoyed me on FreeBSD which has no "python" command, only
"python2.7", and so required manual steps.</p>
<p>The bug was also a pain point when trying to cross-compile Python.</p>
</div>
<div class="section" id="new-shiny-make-regen-all">
<h3>New shiny make regen-all</h3>
<p>I decided to rewrite the Makefile to not regenerate generated files based on
the file modification time anymore. Instead, I added a new <tt class="docutils literal">make <span class="pre">regen-all</span></tt>
command to regenerate explicitly all generated files. Basically, I replaced
<tt class="docutils literal">make touch</tt> with <tt class="docutils literal">make <span class="pre">regen-all</span></tt>.</p>
<p>Changes:</p>
<ul class="simple">
<li>Add a new <tt class="docutils literal">make <span class="pre">regen-all</span></tt> command to rebuild all generated files</li>
<li>Add subcommands to only generate specific files:<ul>
<li><tt class="docutils literal"><span class="pre">regen-ast</span></tt>: Include/Python-ast.h and Python/Python-ast.c</li>
<li><tt class="docutils literal"><span class="pre">regen-grammar</span></tt>: Include/graminit.h and Python/graminit.c</li>
<li><tt class="docutils literal"><span class="pre">regen-importlib</span></tt>: Python/importlib_external.h and Python/importlib.h</li>
<li><tt class="docutils literal"><span class="pre">regen-opcode</span></tt>: Include/opcode.h</li>
<li><tt class="docutils literal"><span class="pre">regen-opcode-targets</span></tt>: Python/opcode_targets.h</li>
<li><tt class="docutils literal"><span class="pre">regen-typeslots</span></tt>: Objects/typeslots.inc</li>
</ul>
</li>
<li>Rename <tt class="docutils literal">PYTHON_FOR_GEN</tt> to <tt class="docutils literal">PYTHON_FOR_REGEN</tt></li>
<li>pgen is now only built by <tt class="docutils literal">make <span class="pre">regen-grammar</span></tt></li>
<li>Add <tt class="docutils literal">$(srcdir)/</tt> prefix to paths to source files to handle correctly
compilation outside the source directory</li>
<li>Remove <tt class="docutils literal">make touch</tt>, <tt class="docutils literal">Tools/hg/hgtouch.py</tt> and <tt class="docutils literal">.hgtouch</tt></li>
</ul>
<p>Note: By default, <tt class="docutils literal">$(PYTHON_FOR_REGEN)</tt> is no longer used nor needed by "make".</p>
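<p>Typical usage, with the command names listed above:</p>

```shell
make regen-opcode      # rebuild Include/opcode.h only
make regen-grammar     # rebuild Include/graminit.h and Python/graminit.c
make regen-all         # rebuild every generated file explicitly
```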
</div>
</div>
Work on Python buildbots, 2017 Q22017-07-13T09:00:00+02:002017-07-13T09:00:00+02:00Victor Stinnertag:vstinner.github.io,2017-07-13:/python-buildbots-2017q2.html<p>I spent the last 6 months on working on buildbots: reduce the failure rate,
send email notification on failure, fix random bugs, detect more bugs using
warnings, backport fixes to older branches, etc. I decided to fix <em>all</em>
buildbots issues: fix all warnings and all unstable tests!</p>
<p>The good news …</p><p>I spent the last 6 months on working on buildbots: reduce the failure rate,
send email notification on failure, fix random bugs, detect more bugs using
warnings, backport fixes to older branches, etc. I decided to fix <em>all</em>
buildbots issues: fix all warnings and all unstable tests!</p>
<p>The good news is that I made great progress and fixed most random failures. A
random failure is now the exception rather than the norm. Some issues were not
bugs in tests, but real race conditions in the code. It's always good to fix
unlikely race conditions before users hit them in production!</p>
<ul class="simple">
<li>Introduction: Python Buildbots</li>
<li>Orange Is The New Color</li>
<li>New buildbot-status Mailing List</li>
<li>Hardware issues<ul>
<li>The vacuum cleaner</li>
<li>The memory stick</li>
</ul>
</li>
<li>Warnings</li>
<li>regrtest</li>
<li>Bug fixes</li>
<li>Python 2.7</li>
<li>Buildbot reports to python-dev</li>
</ul>
<div class="section" id="introduction-python-buildbots">
<h2>Introduction: Python Buildbots</h2>
<p>CPython is running a <a class="reference external" href="https://buildbot.net/">Buildbot</a> server for continuous
integration, but tests are run as post-commit: see <a class="reference external" href="https://www.python.org/dev/buildbot/">Python buildbots</a>. CPython is tested by a wide range of
buildbot slaves:</p>
<ul class="simple">
<li>6 operating systems:<ul>
<li>Linux (Debian, Ubuntu, Gentoo, RHEL, SLES)</li>
<li>Windows (7, 8, 8.1 and 10)</li>
<li>macOS (Tiger, El Capitan, Sierra)</li>
<li>FreeBSD (9, 10, CURRENT)</li>
<li>AIX</li>
<li>OpenIndiana (currently offline)</li>
</ul>
</li>
<li>5 CPU architectures:<ul>
<li>ARMv7</li>
<li>x86 (Intel 32 bit)</li>
<li>x86-64 aka "AMD64" (Intel 64-bit)</li>
<li>PPC64, PPC64LE</li>
<li>s390x</li>
</ul>
</li>
<li>3 C compilers:<ul>
<li>GCC</li>
<li>Clang (FreeBSD, macOS)</li>
<li>Visual Studio (Windows)</li>
</ul>
</li>
</ul>
<p>There are different kinds of tests:</p>
<ul class="simple">
<li>Python test suite: the most common check</li>
<li>Docs: check that the documentation can be built and doesn't contain warnings</li>
<li>Refleaks: check for reference leaks and memory leaks, run the Python test
suite with the <tt class="docutils literal"><span class="pre">--huntrleaks</span></tt> option</li>
<li>DMG: Build the macOS installer with the
<tt class="docutils literal"><span class="pre">Mac/BuildScript/build-installer.py</span></tt> script</li>
</ul>
<p>Python is tested in different configurations:</p>
<ul class="simple">
<li>Debug: <tt class="docutils literal">./configure <span class="pre">--with-pydebug</span></tt>, the most common configuration</li>
<li>Non-debug: release mode, with compiler optimizations</li>
<li>PGO: Profiled Guided Optimization, <tt class="docutils literal">./configure <span class="pre">--enable-optimizations</span></tt></li>
<li>Installed: <tt class="docutils literal">./configure <span class="pre">--prefix=XXX</span> && make install</tt></li>
<li>Shared library (libpython): <tt class="docutils literal">./configure <span class="pre">--enable-shared</span></tt></li>
</ul>
<p>Currently, 4 branches are tested:</p>
<ul class="simple">
<li><tt class="docutils literal">master</tt>: called "3.x" on buildbots</li>
<li><tt class="docutils literal">3.6</tt></li>
<li><tt class="docutils literal">3.5</tt></li>
<li><tt class="docutils literal">2.7</tt></li>
</ul>
<p>There is also <tt class="docutils literal">custom</tt>, a special branch used by core developers for testing
patches.</p>
<p>The buildbot configuration can be found in the <a class="reference external" href="https://github.com/python/buildmaster-config/">buildmaster-config project</a> (start with the
<tt class="docutils literal">master/master.cfg</tt> file).</p>
<p>Note: Thanks to the migration to GitHub, Pull Requests are now tested on Linux,
Windows and macOS by Travis CI and AppVeyor. It's the first time in the CPython
development history that we have automated pre-commit tests!</p>
</div>
<div class="section" id="orange-is-the-new-color">
<h2>Orange Is The New Color</h2>
<p>A buildbot now becomes orange when tests contain warnings.</p>
<p>My first change was to modify the buildbot configuration to extract warnings
from the raw test output into a new "warnings" report, to more easily
detect warnings and tests failing randomly (tests which fail and then pass when
re-run).</p>
<p>Example of an orange build, x86-64 El Capitan 3.x:</p>
<img alt="Buildbot: orange build" src="https://vstinner.github.io/images/buildbot_orange.png" />
<p>Extract of the current <tt class="docutils literal">master/custom/steps.py</tt>:</p>
<pre class="literal-block">
class Test(BaseTest):
    # Regular expression used to catch warnings, errors and bugs
    warningPattern = (
        # regrtest saved_test_environment warning:
        # Warning -- files was modified by test_distutils
        # test.support @reap_threads:
        # Warning -- threading_cleanup() failed to cleanup ...
        r"Warning -- ",
        # Py_FatalError() call
        r"Fatal Python error:",
        # PyErr_WriteUnraisable() exception: usually, error in
        # garbage collector or destructor
        r"Exception ignored in:",
        # faulthandler_exc_handler(): Windows exception handler installed with
        # AddVectoredExceptionHandler() by faulthandler.enable()
        r"Windows fatal exception:",
        # Resource warning: unclosed file, socket, etc.
        # NOTE: match the "ResourceWarning" anywhere, not only at the start
        r"ResourceWarning",
        # regrtest: At least one test failed. Log a warning even if the test
        # passed on the second try, to notify that a test is unstable.
        r'Re-running failed tests in verbose mode',
        # Re-running test 'test_multiprocessing_fork' in verbose mode
        r'Re-running test .* in verbose mode',
        # Thread last resort exception handler in t_bootstrap()
        r'Unhandled exception in thread started by ',
        # test_os leaked [6, 6, 6] memory blocks, sum=18,
        r'test_[^ ]+ leaked ',
    )
    # Use ".*" prefix to search the regex anywhere since stdout is mixed
    # with stderr, so warnings are not always written at the start
    # of a line. The log consumer calls warningPattern.match(line)
    warningPattern = r".*(?:%s)" % "|".join(warningPattern)
    warningPattern = re.compile(warningPattern)
    # if tests have warnings, mark the overall build as WARNINGS (orange)
    warnOnWarnings = True
</pre>
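<p>To see how such a combined pattern behaves, here is a small standalone sketch:
it only reuses a subset of the patterns above, without any of the buildbot
plumbing, and classifies a few sample log lines:</p>

```python
# Standalone sketch of the warning-catching regex above, using only a
# subset of the patterns and none of the buildbot plumbing
import re

warning_patterns = (
    r"Warning -- ",
    r"Fatal Python error:",
    r"Exception ignored in:",
    r"ResourceWarning",
    r"test_[^ ]+ leaked ",
)
# ".*" prefix: stdout is mixed with stderr, so a warning is not always
# written at the start of a line, but the log consumer calls .match(line)
warning_re = re.compile(r".*(?:%s)" % "|".join(warning_patterns))

for line in (
    "Warning -- files was modified by test_distutils",
    "0:00:42 [ 42/405] test_os passed",
    "test_os leaked [6, 6, 6] memory blocks, sum=18",
):
    status = "WARNING" if warning_re.match(line) else "ok"
    print(status, line)
```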
</div>
<div class="section" id="new-buildbot-status-mailing-list">
<h2>New buildbot-status Mailing List</h2>
<p>To check buildbots, I previously had to manually analyze the huge "waterfall"
view of four Python branches: 2.7, 3.5, 3.6 and master ("3.x").</p>
<ul class="simple">
<li><a class="reference external" href="http://buildbot.python.org/all/waterfall?category=3.x.stable&category=3.x.unstable">Python master ("3.x")</a></li>
<li><a class="reference external" href="http://buildbot.python.org/all/waterfall?category=3.6.stable&category=3.6.unstable">Python 3.6</a></li>
<li><a class="reference external" href="http://buildbot.python.org/all/waterfall?category=3.5.stable&category=3.5.unstable">Python 3.5</a></li>
<li><a class="reference external" href="http://buildbot.python.org/all/waterfall?category=2.7.stable&category=2.7.unstable">Python 2.7</a></li>
</ul>
<p>Example of typical buildbot waterfall:</p>
<a class="reference external image-reference" href="http://buildbot.python.org/all/waterfall?category=3.x.stable&category=3.x.unstable"><img alt="Buildbot waterfall" src="https://vstinner.github.io/images/buildbot_waterfall.png" /></a>
<p>The screenshot is obviously truncated since the webpage is giant: I have to
scroll in all directions... It's not convenient to check the status of all
builds, detect random failures, etc.</p>
<p>We also have an IRC bot reporting buildbot failures: when a green (success) or
orange (warning) buildbot becomes red (failure). I wanted to have the same
thing, but by email. Technically, it's trivial to enable email notification,
but I never did it because buildbots were simply too unstable: most failures
were not related to the newly tested changes.</p>
<p>But I decided to fix <em>all</em> buildbot issues, so I enabled email notification
(<a class="reference external" href="https://bugs.python.org/issue30325">bpo-30325</a>). Since May 2017,
buildbots are now sending notifications to a new <a class="reference external" href="https://mail.python.org/mm3/mailman3/lists/buildbot-status.python.org/">buildbot-status mailing list</a>.</p>
<p>I use the mailing list to check whether a failure is already known: I try to
answer all failure notification emails. If the failure is known, I copy the
link to the existing issue. Otherwise, I create a new issue and then copy the
link to it.</p>
</div>
<div class="section" id="hardware-issues">
<h2>Hardware issues</h2>
<p>Unit tests versus real life :-) (or "software versus hardware")</p>
<div class="section" id="the-vacuum-cleaner">
<h3>The vacuum cleaner</h3>
<p>Fixing buildbot issues can sometimes be boring, so let's start with a funny
bug. On June 25, Nick Coghlan wrote to the <a class="reference external" href="https://mail.python.org/mailman/listinfo/python-buildbots">python-buildbots</a> mailing list:</p>
<blockquote>
It looks like the FreeBSD buildbots had an outage a little while ago,
and the FreeBSD 10 one may need a nudge to get back online (the
FreeBSD Current one looks like it came back automatically).</blockquote>
<p>The reason is unexpected :-) <a class="reference external" href="https://mail.python.org/pipermail/python-buildbots/2017-June/000122.html">Kubilay Kocak, owner of the buildbot, answered</a>:</p>
<blockquote>
Vacuum cleaner tripped RCD pulling too much current from the same circuit
as heater was running on. Buildbot worker host on same circuit.</blockquote>
</div>
<div class="section" id="the-memory-stick">
<h3>The memory stick</h3>
<p>I opened at least 50 issues to report random buildbot failures. In the middle
of these issues, you can find <a class="reference external" href="http://bugs.python.org/issue30371">bpo-30371</a>:</p>
<pre class="literal-block">
http://buildbot.python.org/all/builders/AMD64%20Windows7%20SP1%203.x/builds/436/steps/test/logs/stdio
======================================================================
FAIL: test_long_lines (test.test_email.test_email.TestFeedParsers)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_email\test_email.py", line 3526, in test_long_lines
self.assertEqual(m.get_payload(), 'x'*M*N)
AssertionError: 'xxxx[17103482 chars]xxxxxzxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx[2896464 chars]xxxx' != 'xxxx[17103482 chars]xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx[2896464 chars]xxxx'
Notice the "z" in "...xxxxxz...".
</pre>
<p>and:</p>
<pre class="literal-block">
New fail, same buildbot:
======================================================================
FAIL: test_long_lines (test.test_email.test_email.TestFeedParsers)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_email\test_email.py", line 3534, in test_long_lines
self.assertEqual(m.items(), [('a', ''), ('b', 'x'*M*N)])
AssertionError: Lists differ: [('a'[1845894 chars]xxxxxzxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx[18154072 chars]xx')] != [('a'[1845894 chars]xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx[18154072 chars]xx')]
First differing element 1:
('b',[1845882 chars]xxxxxzxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx[18154071 chars]xxx')
('b',[1845882 chars]xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx[18154071 chars]xxx')
[('a', ''),
('b',
Don't click on http://buildbot.python.org/all/builders/AMD64%20Windows7%20SP1%203.x/builds/439/steps/test/logs/stdio
: the log contains lines of 2 MB which make my Firefox super slow :-)
</pre>
<p>Jeremy Kloth, owner of the buildbot, answered:</p>
<blockquote>
Watch this space, but I'm pretty sure that it is (was) bad memory.</blockquote>
<p>He fixed the issue:</p>
<blockquote>
That's the real problem, I'm not <em>sure</em> it's the memory, but it does have
the symptoms. And that is why my buildbot was down earlier, I was
attempting to determine the bad stick and replace it.</blockquote>
</div>
</div>
<div class="section" id="warnings">
<h2>Warnings</h2>
<p>To fix test warnings, I enhanced the test suite to report more information when
a warning is emitted and to make failures easier to detect.</p>
<p>A major change is the new <tt class="docutils literal"><span class="pre">--fail-env-changed</span></tt> option I added to regrtest
(bpo-30764): make tests fail if the "environment" is changed. This option is
now used on buildbots, Travis CI and AppVeyor, but so far only for the
<em>master</em> branch.</p>
<p>Other changes:</p>
<ul class="simple">
<li>The @reap_threads decorator and the threading_cleanup() function of
test.support now log a warning if they fail to cleanup threads. The log may
help to debug warnings like this one seen on the AMD64 FreeBSD CURRENT
Non-Debug 3.x buildbot: "Warning -- threading._dangling was modified by
test_logging".</li>
<li>threading_cleanup() failure now marks the test as ENV_CHANGED. If
threading_cleanup() fails to cleanup threads, a new
support.environment_altered flag is set to true; the flag is used by
save_env, which regrtest uses to check whether a test altered the
environment. At the end, the test file fails with ENV_CHANGED
instead of SUCCESS, to report that it altered the environment.</li>
<li>regrtest: always show before/after values of modified environment.</li>
</ul>
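<p>The mechanism behind <tt class="docutils literal"><span class="pre">--fail-env-changed</span></tt> can be sketched in a few lines. This is
a simplified illustration of the idea, not the actual regrtest code: the real
implementation in test.support and Lib/test/libregrtest snapshots many more
resources (threads, open files, environment variables, etc.):</p>

```python
# Simplified sketch of the --fail-env-changed idea: snapshot parts of the
# "environment" before a test, compare after it, and report ENV_CHANGED.
# Names are illustrative; the real code lives in test.support and
# Lib/test/libregrtest, and checks many more resources.
import os
import sys

def run_test_checking_env(test_func):
    # Snapshot a few resources before running the test
    before = (os.getcwd(), list(sys.path))
    test_func()
    after = (os.getcwd(), list(sys.path))
    # If the test altered the environment, fail with ENV_CHANGED
    # even if all of its assertions passed
    return "SUCCESS" if before == after else "ENV_CHANGED"

def well_behaved_test():
    pass

def leaky_test():
    sys.path.append("/tmp/injected")  # alters the environment

print(run_test_checking_env(well_behaved_test))  # SUCCESS
print(run_test_checking_env(leaky_test))         # ENV_CHANGED
```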
<p>I backported all these changes to the 2.7, 3.5 and 3.6 branches to make sure
that warnings are fixed in all maintained branches.</p>
</div>
<div class="section" id="regrtest">
<h2>regrtest</h2>
<p>As usual, I spent time on our specialized test runner, regrtest:</p>
<ul class="simple">
<li>bpo-30263: regrtest: log system load and the number of CPUs. I tried to find
a relationship between race conditions and the system load. I failed to find
any obvious correlation, but I still consider the system load information
useful.</li>
<li>bpo-27103: regrtest disables -W if -R (reference hunting) is used. Workaround
for a regrtest bug.</li>
</ul>
<p>But the most complex task was to backport <em>all</em> regrtest features and
enhancements from master to regrtest of 3.6, 3.5 and then 2.7 branches.</p>
<p>In Python 3.6, I rewrote the regrtest.py file to split it into smaller files in
a new Lib/test/libregrtest/ library, so it was painful to backport changes to
3.5 (bpo-30383), which still uses a single regrtest.py file.</p>
<p>In Python 2.7 (bpo-30283), it was even worse. Lib/test/regrtest.py uses the old
<tt class="docutils literal">getopt</tt> module to parse the command line instead of the <tt class="docutils literal">argparse</tt> module
used in 3.5 and newer. But I succeeded in backporting all features and
enhancements from master!</p>
<p>Python 2.7, 3.5, 3.6 and master now have almost the same CLI for <tt class="docutils literal">python <span class="pre">-m</span>
test</tt>, almost the same features (except for one or two missing features), and
should provide the same level of information on failures and warnings.</p>
<p>By the way, the new <tt class="docutils literal">test.bisect</tt> tool is now also available in all these
branches. See my <a class="reference external" href="https://vstinner.github.io/python-test-bisect.html">New Python test.bisect tool</a> article.</p>
</div>
<div class="section" id="bug-fixes">
<h2>Bug fixes</h2>
<p>As expected, the longest section here is the list of changes I wrote to fix all
buildbot failures and warnings:</p>
<ul class="simple">
<li>bpo-29972: Skip tests known to fail on AIX. See <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-April/147748.html">[Python-Dev] Fix or drop AIX
buildbot?</a>
email.</li>
<li>bpo-29925: Skip test_uuid1_safe() on OS X Tiger</li>
<li>Fix and optimize test_asyncore.test_quick_connect(). Don't use addCleanup() in
test_quick_connect() because it keeps the Thread object alive and so
@reap_threads times out after 1 second. "./python -m test -v
test_asyncore -m test_quick_connect" now takes 185 ms, instead of 11 seconds.</li>
<li>bpo-30106: Fix test_asyncore.test_quick_connect(). test_quick_connect() runs
a thread for up to 50 seconds, whereas the socket is connected in 0.2 seconds
and then the thread is expected to end in less than 3 seconds. On Linux, the
thread ends quickly because select() seems to always return quickly. On
FreeBSD, sometimes select() fails with timeout and so the thread runs much
longer than expected. Fix the thread timeout to fix a race condition in the
test.</li>
<li>bpo-30106: Fix tearDown() of test_asyncore. Call asyncore.close_all() with
ignore_all=True in the tearDown() method of the test_asyncore base test case.
It prevents keeping alive sockets in asyncore.socket_map if close()
fails with an unexpected error.</li>
<li>bpo-30108: Restore sys.path in test_site. Add setUpModule() and
tearDownModule() functions to test_site to save/restore sys.path at the
module level to prevent warning if the user site directory is created, since
site.addsitedir() modifies sys.path.</li>
<li>bpo-30107: test_io doesn't dump a core file on an expected crash anymore.
test_io has two unit tests which trigger a deadlock:
test_daemon_threads_shutdown_stdout_deadlock() and
test_daemon_threads_shutdown_stderr_deadlock(). These tests call
Py_FatalError() if the expected bug is triggered which calls abort(). Use
test.support.SuppressCrashReport to prevent the creation of a core dump, to
fix the warning:
<tt class="docutils literal">Warning <span class="pre">--</span> files was modified by test_io <span class="pre">(...)</span> After: ['python.core']</tt></li>
<li>bpo-30125: Disable faulthandler to run test_SEH() of test_ctypes to prevent
the following log with a traceback:
<tt class="docutils literal">Windows fatal exception: access violation</tt></li>
<li>bpo-30131: test_logging cleans up threads using @support.reap_threads.</li>
<li>bpo-30132: BuildExtTestCase of test_distutils now uses support.temp_cwd() in
setUp() to remove files created in the current working directory by
BuildExtTestCase unit tests.</li>
<li>bpo-30107: On macOS, test.support.SuppressCrashReport now redirects
/usr/bin/defaults command stderr into a pipe to not pollute stderr. It fixes
a test_io.test_daemon_threads_shutdown_stderr_deadlock() failure when the
CrashReporter domain doesn't exist.</li>
<li>bpo-30175: Skip client cert tests of test_imaplib. The IMAP server
cyrus.andrew.cmu.edu doesn't accept our randomly generated client x509
certificate anymore.</li>
<li>bpo-30175: test_nntplib fails randomly with EOFError in NetworkedNNTPTests.setUpClass():
catch EOFError to skip tests in that case.</li>
<li>bpo-30199: AsyncoreEchoServer of test_ssl now calls
asyncore.close_all(ignore_all=True) to ensure that asyncore.socket_map is
cleared once the test completes, even if ConnectionHandler was not correctly
unregistered. Fix the following warning:
<tt class="docutils literal">Warning <span class="pre">--</span> asyncore.socket_map was modified by test_ssl</tt>.</li>
<li>Fix test_ftplib warning if IPv6 is not available. DummyFTPServer now calls
del_channel() on bind() error to prevent the following warning in
TestIPv6Environment.setUpClass():
<tt class="docutils literal">Warning <span class="pre">--</span> asyncore.socket_map was modified by test_ftplib</tt></li>
<li>bpo-30329: Catch Windows error 10022 on shutdown(). Catch the Windows socket
WSAEINVAL error (code 10022) in imaplib and poplib on shutdown(SHUT_RDWR): An
invalid operation was attempted. This error occurs sometimes on SSL
connections.</li>
<li>bpo-30357: test_thread now uses threading_cleanup(). test_thread: setUp() now
uses support.threading_setup() and support.threading_cleanup() to wait until
threads complete to avoid random side effects on following tests.
Co-Authored-By: <strong>Grzegorz Grzywacz</strong>.</li>
<li>bpo-30339: test_multiprocessing_main_handling timeout.
test_multiprocessing_main_handling: increase the test_source timeout from 10
seconds to 60 seconds, since the test fails randomly on busy buildbots.
Sadly, this change wasn't enough to fix buildbots.</li>
<li>bpo-30387: Fix warning in test_threading. test_is_alive_after_fork() now
joins directly the thread to avoid the following warning added by bpo-30357:
"Warning -- threading_cleanup() failed to cleanup 0 threads after 2 sec
(count: 0, dangling: 21)". Use also a different exit code to catch generic
exit code 1.</li>
<li>bpo-30649: On Windows, test_os now tolerates a delta of 50 ms instead of 20
ms in test_utime_current() and test_utime_current_old(). On other platforms,
reduce the delta from 20 ms to 10 ms. PPC64 Fedora 3.x buildbot requires at
least a delta of 14 ms.</li>
<li>bpo-30595: test_queue_feeder_donot_stop_onexc() of _test_multiprocessing now
uses a timeout of 1 second on Queue.get(), instead of 0.1 second, for slow
buildbots.</li>
<li>bpo-30764, bpo-29335: test_child_terminated_in_stopped_state() of
test_subprocess now uses support.SuppressCrashReport() to prevent the
creation of a core dump on FreeBSD.</li>
<li>bpo-30280: TestBaseSelectorEventLoop of
test.test_asyncio.test_selector_events now correctly closes the event loop:
cleanup its executor to not leak threads: don't override the close() method
of the event loop, only override the _close_self_pipe() method. asyncio base
TestCase now uses threading_setup() and threading_cleanup() of test.support
to cleanup threads.</li>
<li>bpo-26568, bpo-30812: Fix test_showwarnmsg_missing(): restore the attribute
after removing it.</li>
</ul>
</div>
<div class="section" id="python-2-7-1">
<h2>Python 2.7</h2>
<p>I wanted to fix <em>all</em> buildbot issues of <em>all</em> branches including 2.7, whereas
I hadn't touched the Python 2.7 code base much in recent months (years?). During
the first six months of 2017, I backported dozens of commits from master to 2.7!</p>
<p>For example, I added AppVeyor on 2.7: a Windows CI for GitHub!</p>
<p>On Windows we support multiple versions of Visual Studio. I use Visual Studio
2008, whereas most 2.7 Windows buildbots use Visual Studio 2010 or newer. I
fixed sysconfig.is_python_build() if Python is built with Visual Studio 2008
(VS 9.0) (bpo-30342).</p>
<p>Other Python 2.7 changes:</p>
<ul class="simple">
<li>Fix "make tags" command.</li>
<li>bpo-30764: support.SuppressCrashReport backported to 2.7 and "ported" to
Windows. Add Windows support to test.support.SuppressCrashReport: call
SetErrorMode() and CrtSetReportMode(). _testcapi: add CrtSetReportMode() and
CrtSetReportFile() functions and CRT_xxx and CRTDBG_xxx constants needed by
SuppressCrashReport.</li>
<li>bpo-30705: Fix test_regrtest.test_crashed(). Add test.support._crash_python()
which triggers a crash but uses test.support.SuppressCrashReport() to prevent
a crash report from popping up. Modify
test_child_terminated_in_stopped_state() of test_subprocess and
test_crashed() of test_regrtest to use _crash_python().</li>
</ul>
<p>I also backported many fixes written by other developers, including fixes up
to 8 years old!</p>
<p>Usually, <strong>finding</strong> the proper fix takes much more time than the cherry-pick
itself, which is usually straightforward (no conflict, nothing to do). I am
always impressed that Git is able to detect that a file was renamed between
Python 2 and Python 3, and applies the change cleanly!</p>
<p>Example of backports from master to 2.7:</p>
<ul class="simple">
<li>bpo-6393: Fix locale.getpreferredencoding() on macOS. Python crashes on OSX
when <tt class="docutils literal">$LANG</tt> is set to some (but not all) invalid values due to an invalid
result from nl_langinfo(). Fix written in <strong>September 2009</strong> (8 years ago)!</li>
<li>bpo-15526: test_startfile changes the cwd. Try to fix test_startfile's
inability to clean up after itself in time. Patch by <strong>Jeremy Kloth</strong>.
Fix the following support.rmtree() error while trying to remove the temporary
working directory used by Python tests:
"WindowsError: [Error 32] The process cannot access the file because it is
being used by another process: ...".
Original commit written in <strong>September 2012</strong>!</li>
<li>bpo-11790: Fix sporadic failures in
test_multiprocessing.WithProcessesTestCondition.
Fix written in <strong>April 2011</strong>. This backported commit was tricky to
identify!</li>
<li>bpo-8799, fix test_threading: Reduce timing sensitivity of condition test by
explicitly delaying the main thread so that it doesn't race ahead of the
workers. Fix written in <strong>Nov 2013</strong>.</li>
<li>test_distutils: Use EnvironGuard on InstallTestCase, UtilTestCase, and
BuildExtTestCase to prevent the following warning:
<tt class="docutils literal">Warning <span class="pre">--</span> os.environ was modified by test_distutils</tt></li>
<li>Fix test_multiprocessing: Relax test timing (bpo-29861) to avoid sporadic
failures.</li>
</ul>
</div>
<div class="section" id="buildbot-reports-to-python-dev">
<h2>Buildbot reports to python-dev</h2>
<p>I also wrote 3 reports to the Python-Dev mailing list:</p>
<ul class="simple">
<li>May 3: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-May/147838.html">Status of Python buildbots</a></li>
<li>June 8: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-June/148271.html">Buildbot report, june 2017</a></li>
<li>June 29: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-June/148511.html">Buildbot report (almost July)</a></li>
</ul>
</div>
New Python test.bisect tool2017-07-12T15:00:00+02:002017-07-12T15:00:00+02:00Victor Stinnertag:vstinner.github.io,2017-07-12:/python-test-bisect.html<p>This article tells the story of the new CPython <tt class="docutils literal">test.bisect</tt> tool to
identify failing tests in the CPython test suite.</p>
<div class="section" id="modify-manually-a-test-file">
<h2>Modify manually a test file</h2>
<p>I have been fixing reference leaks for many years. When a test file contains more
than 200 tests and is longer than 5,000 lines …</p></div><p>This article tells the story of the new CPython <tt class="docutils literal">test.bisect</tt> tool to
identify failing tests in the CPython test suite.</p>
<div class="section" id="modify-manually-a-test-file">
<h2>Modify manually a test file</h2>
<p>I have been fixing reference leaks for many years. When a test file contains more
than 200 tests and is longer than 5,000 lines, it's just not possible to spot a
reference leak by reading it. Each time, I modified the long test file and <em>removed</em>
enough code until the file became short enough to read.</p>
<p>This method <em>works</em>, but it usually took me 20 to 30 minutes, so it was
common that I made mistakes... and usually had to restart from scratch...</p>
</div>
<div class="section" id="first-failed-attempt">
<h2>First failed attempt</h2>
<p>In October 2014, while fixing <a class="reference external" href="http://bugs.python.org/issue22588#msg228905">yet another reference leak in test_capi</a>, <strong>Xavier de Gaye</strong> was
surprised that I identified the leak so quickly and wanted to know how I
proceeded. I explained my method of removing code, but I also asked for a tool.</p>
<p>Xavier created bpo-22607 on 2014-10-11 and wrote a patch based on an integer
range to run a subset of tests, doing something special with the <tt class="docutils literal">subTest()</tt>
context manager. But <strong>Georg Brandl</strong> wasn't convinced by this approach and...
I forgot about this issue.</p>
</div>
<div class="section" id="new-design-list-tests-run-a-subset">
<h2>New design: list tests, run a subset</h2>
<p>During this quarter, I had to fix dozens of reference leaks but also tests
failing with "environment changed": one test method modified "something". It
was really painful to identify each time the failing test.</p>
<p>So I created bpo-29512 on 2017-02-09 to ask again for such a tool. Technically, I
just wanted to run a subset of tests.</p>
<p>While working on OpenStack, I enjoyed the <tt class="docutils literal">testr</tt> tool, a test runner able to
list tests and to run a subset of tests. <tt class="docutils literal">testr</tt> also provides a bisection
tool to identify a subset of tests enough to reproduce a bug. The subset can
contain more than a single test. Sometimes you need to run two tests
sequentially to trigger a specific bug, and it's usually long and boring to
identify these two tests manually.</p>
<p>I proposed a similar design for my bisection tool. Start by listing all tests,
and then:</p>
<ul class="simple">
<li>create a pure <em>random</em> sample of tests: a subset half the size of the
current test set</li>
<li>If tests still fail, use the subset as the new set. Otherwise, throw the
subset away.</li>
<li>Loop until the subset is small enough or the process has run for more than
100 iterations.</li>
</ul>
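<p>The bisection loop above can be sketched in a few lines of Python. This is a
simplified illustration: run_tests() and the other names here are stand-ins,
not the actual test.bisect code:</p>

```python
# Simplified sketch of the bisection algorithm described above: repeatedly
# try a random half of the test set, and keep the half when the failure
# still reproduces. run_tests() stands in for "run this subset of tests
# and report whether they fail".
import random

def bisect_tests(tests, run_tests, max_tests=1, max_iter=100):
    tests = list(tests)
    for _ in range(max_iter):
        if len(tests) <= max_tests:
            break
        # Pure random sample: half the size of the current test set
        subset = random.sample(tests, len(tests) // 2)
        if run_tests(subset):
            # Tests still fail: use the subset as the new set
            tests = subset
        # Otherwise: throw the subset away and draw a new one
    return tests

# Toy failure: the "bug" triggers whenever test_access is in the subset
all_tests = ["test_%d" % i for i in range(256)] + ["test_access"]
failing = bisect_tests(all_tests, lambda subset: "test_access" in subset)
print(failing)  # usually narrowed down to just ['test_access']
```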
</div>
<div class="section" id="regrtest-list-cases">
<h2>regrtest --list-cases</h2>
<p>To list tests, I created bpo-30523 and wrote a patch for the unittest module.
Modifying unittest didn't work well with doctests and the command line
interface (CLI) didn't work as I wanted. I proposed to modify regrtest instead
of unittest.</p>
<p>I proposed that <strong>Louie Lu</strong> implement my new idea. I was impressed that he
implemented it so quickly and that it worked so well! I just asked him to not
exclude doctest test cases, since these test cases were working as expected! I
quickly merged his modified patch which adds the <tt class="docutils literal"><span class="pre">--list-cases</span></tt> option to
regrtest.</p>
<p>Note: regrtest already had a <tt class="docutils literal"><span class="pre">--list-tests</span></tt> option which lists test <em>files</em>, whereas
<tt class="docutils literal"><span class="pre">--list-cases</span></tt> lists test <em>methods</em> and doctests.</p>
</div>
<div class="section" id="regrtest-matchfile">
<h2>regrtest --matchfile</h2>
<p>I created bpo-30540 to add a --matchfile option to regrtest. regrtest already
had a --match option, but it was only possible to use the option once, and I
wanted to use a text file for my list of tests.</p>
<p>Again, I was surprised that the feature was so simple to implement. By the
way, I modified regrtest --match to allow specifying the option multiple
times, to run multiple tests instead of a single one.</p>
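<p>The idea of the --matchfile filter can be illustrated with a short sketch:
read fnmatch-style patterns (one per line in the real option) and keep only
the matching test identifiers. This is a simplified illustration, not the
actual regrtest code:</p>

```python
# Simplified sketch of a --matchfile-style filter: keep only the test
# identifiers matching at least one fnmatch-style pattern. In regrtest,
# the patterns would be read from the --matchfile text file, one per line.
import fnmatch

def match_tests(test_ids, patterns):
    return [test_id for test_id in test_ids
            if any(fnmatch.fnmatch(test_id, pattern) for pattern in patterns)]

test_ids = [
    "test.test_os.FileTests.test_access",
    "test.test_os.TestSendfile.test_keywords",
    "test.test_email.test_email.TestFeedParsers.test_long_lines",
]
patterns = ["*test_os*"]
print(match_tests(test_ids, patterns))  # keeps only the two test_os tests
```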
</div>
<div class="section" id="new-test-bisect-tool">
<h2>New test.bisect tool</h2>
<p>Since I had the two key features: <tt class="docutils literal">regrtest <span class="pre">--list-cases</span></tt> and <tt class="docutils literal">regrtest
<span class="pre">--matchfile</span></tt>, it became trivial to implement the bisection tool. I wrote a
first prototype. The "prototype" worked much better than expected.</p>
<p>My first version required a text file listing test cases. I modified it to run
the new <tt class="docutils literal"><span class="pre">--list-cases</span></tt> command automatically.</p>
<p>I extended the tool to not only track reference leaks, but also "environment
changed" failures like finding a test which creates a file but doesn't remove
it.</p>
<p>I was asked to add this tool in the Python stdlib, so I added it as
<tt class="docutils literal">Lib/test/bisect.py</tt> to use it with:</p>
<pre class="literal-block">
python3 -m test.bisect ...
</pre>
<p>The test.bisect CLI is similar to the test CLI on purpose.</p>
</div>
<div class="section" id="reference-leak-example">
<h2>Reference leak example</h2>
<p>I modified <tt class="docutils literal">test_access()</tt> of test_os to add manually a reference leak:</p>
<pre class="literal-block">
$ ./python -m test -R 3:3 test_os
(...)
test_os leaked [1, 1, 1] references, sum=3
test_os leaked [1, 1, 1] memory blocks, sum=3
test_os failed in 33 sec
(...)
</pre>
<p>Just replace <tt class="docutils literal"><span class="pre">-m</span> test</tt> with <tt class="docutils literal"><span class="pre">-m</span> test.bisect</tt> in the command, and you get
the guilty method:</p>
<pre class="literal-block">
$ ./python -m test.bisect -R 3:3 test_os
Start bisection with 257 tests
Test arguments: -R 3:3 test_os
Bisection will stop when getting 1 or less tests (-n/--max-tests option), or after 100 iterations (-N/--max-iter option)
[+] Iteration 1: run 128 tests/257
+ /home/haypo/prog/python/master/python -m test --matchfile /tmp/tmpvbraed7h -R 3:3 test_os
(...)
Tests succeeded: skip this subtest, try a new subbset
[+] Iteration 2: run 128 tests/257
+ /home/haypo/prog/python/master/python -m test --matchfile /tmp/tmpcjqtzgfe -R 3:3 test_os
(...)
Tests failed: use this new subtest
[+] Iteration 3: run 64 tests/128
(...)
[+] Iteration 15: run 1 tests/2
(...)
Tests (1):
* test.test_os.FileTests.test_access
Bisection completed in 16 iterations and 0:03:10
</pre>
<p>The <tt class="docutils literal">test.bisect</tt> command found the bug I introduced:
<tt class="docutils literal">test.test_os.FileTests.test_access</tt>.</p>
<p>The command takes a few minutes, but I don't care about its performance as long
as it's fully automated! If you use the <tt class="docutils literal"><span class="pre">-o</span> file</tt> option, each time the tool is
able to reduce the size of the test set, it writes the new list of tests to
disk. So even if the tool crashes or fails to narrow things down to a single
failing test, it already helps!</p>
<p>I am now very happy that <tt class="docutils literal">test.bisect</tt> works better than I expected. So I
backported it to 2.7, 3.5, 3.6 and master branches, since I want to fix <em>all</em>
buildbot failures on <em>all</em> maintained branches.</p>
</div>
<div class="section" id="environment-changed-example">
<h2>Environment changed example</h2>
<p>While running the previous example, I noticed the following warning:</p>
<pre class="literal-block">
Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2)
</pre>
<p>Using the new <tt class="docutils literal"><span class="pre">--fail-env-changed</span></tt> option, it is now possible to check which
test of test_os emits such warning:</p>
<pre class="literal-block">
haypo@selma$ ./python -m test.bisect --fail-env-changed -R 3:3 test_os
(...)
Tests (1):
* test.test_os.TestSendfile.test_keywords
Bisection completed in 14 iterations and 0:03:27
</pre>
<p>I never trust anything, so let's confirm the bug:</p>
<pre class="literal-block">
haypo@selma$ ./python -m test --fail-env-changed -R 3:3 test_os -m test.test_os.TestSendfile.test_keywords
Run tests sequentially
0:00:00 load avg: 0.33 [1/1] test_os
Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2)
beginning 6 repetitions
123456
Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2)
.
Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2)
.Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2)
.Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2)
.Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2)
.Warning -- threading_cleanup() failed to cleanup 0 threads after 3 sec (count: 0, dangling: 2)
.
test_os failed (env changed)
1 test altered the execution environment:
test_os
Total duration: 21 sec
Tests result: ENV CHANGED
</pre>
<p>OK, right: there is something wrong with test_keywords(). I just opened
<a class="reference external" href="http://bugs.python.org/issue30908">bpo-30908</a>.</p>
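<p>For the record, the "dangling" threads in the warning are simply threads that
are still alive when a test returns. A minimal reproducer of the pattern
(hypothetical, not the actual test_keywords() code):</p>

```python
import threading
import time

def leaky_test():
    # Start a worker thread and return without joining it: the thread is
    # still alive after the "test", which regrtest reports as dangling.
    t = threading.Thread(target=time.sleep, args=(0.5,))
    t.start()
    return t

before = set(threading.enumerate())
t = leaky_test()
dangling = [th for th in threading.enumerate() if th not in before]
t.join()  # clean up so this snippet itself doesn't leak a thread
```

regrtest's threading_cleanup() does essentially this comparison before and
after each test, and waits a few seconds for stragglers before warning.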
</div>
My contributions to CPython during 2017 Q12017-07-05T12:00:00+02:002017-07-05T12:00:00+02:00Victor Stinnertag:vstinner.github.io,2017-07-05:/contrib-cpython-2017q1.html<p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q1
(January, February, March):</p>
<ul class="simple">
<li>Statistics</li>
<li>Optimization</li>
<li>Tricky bug</li>
<li>FASTCALL optimizations</li>
<li>Stack consumption</li>
<li>Contributions</li>
<li>os.urandom() and getrandom()</li>
<li>Migration to GitHub</li>
<li>Enhancements</li>
<li>Security</li>
<li>regrtest</li>
<li>Bugfixes</li>
</ul>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q4.html">My contributions to CPython during 2016 Q4</a>. Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part1.html">My contributions to
CPython during 2017 Q2 (part 1 …</a></p><p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2017 Q1
(January, February, March):</p>
<ul class="simple">
<li>Statistics</li>
<li>Optimization</li>
<li>Tricky bug</li>
<li>FASTCALL optimizations</li>
<li>Stack consumption</li>
<li>Contributions</li>
<li>os.urandom() and getrandom()</li>
<li>Migration to GitHub</li>
<li>Enhancements</li>
<li>Security</li>
<li>regrtest</li>
<li>Bugfixes</li>
</ul>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q4.html">My contributions to CPython during 2016 Q4</a>. Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q2-part1.html">My contributions to
CPython during 2017 Q2 (part 1)</a>.</p>
<div class="section" id="statistics">
<h2>Statistics</h2>
<pre class="literal-block">
# All commits
$ git log --after=2016-12-31 --before=2017-04-01 --reverse --branches='*' --author=Stinner > 2017Q1
$ grep '^commit ' 2017Q1|wc -l
121
# Exclude merges
$ git log --no-merges --after=2016-12-31 --before=2017-04-01 --reverse --branches='*' --author=Stinner|grep '^commit '|wc -l
105
# master branch (excluding merges)
$ git log --no-merges --after=2016-12-31 --before=2017-04-01 --reverse --author=Stinner origin/master|grep '^commit '|wc -l
98
# Only merges
$ git log --merges --after=2016-12-31 --before=2017-04-01 --reverse --branches='*' --author=Stinner|grep '^commit '|wc -l
16
</pre>
<p>Statistics: <strong>98</strong> commits in the master branch, 16 merge commits (done using
Mercurial before the migration to GitHub, and then converted to Git), and 7
other commits (likely backports), total: <strong>121</strong> commits.</p>
</div>
<div class="section" id="optimization">
<h2>Optimization</h2>
<p>With the work done in 2016 on FASTCALL, it became much easier to optimize code
by using the new FASTCALL API.</p>
<div class="section" id="python-slots">
<h3>Python slots</h3>
<p>Issue #29507: I worked with <strong>INADA Naoki</strong> to continue the work he did with
<strong>Yury Selivanov</strong> on optimizing method calls. We optimized "slots" implemented
in Python. Slots are an internal mechanism used to call "dunder" methods like
<tt class="docutils literal">__getitem__()</tt>.</p>
<p>For Python methods, get the unbound Python function and prepend arguments with
<em>self</em>, rather than calling the descriptor which creates a temporary
PyMethodObject.</p>
<p>Add a new _PyObject_FastCall_Prepend() function used to call the unbound Python
method with <em>self</em>. It avoids the creation of a temporary tuple to pass
positional arguments.</p>
<p>Avoiding a temporary PyMethodObject and a temporary tuple makes Python slots up
to <strong>1.46x faster</strong>. Microbenchmark on a <tt class="docutils literal">__getitem__()</tt> method implemented
in Python:</p>
<pre class="literal-block">
Median +- std dev: 121 ns +- 5 ns -> 82.8 ns +- 1.0 ns: 1.46x faster (-31%)
</pre>
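<p>The trick can be illustrated in pure Python: calling the dunder through the
type with an explicit <em>self</em> gives the same result while skipping the
temporary bound method:</p>

```python
class Seq:
    def __getitem__(self, index):
        return index * 2

obj = Seq()

# Regular lookup: the descriptor protocol creates a temporary bound method
# (a PyMethodObject at the C level) before the call.
r1 = obj.__getitem__(3)

# What the optimized slot does, conceptually: fetch the unbound function
# from the type and prepend self to the positional arguments.
r2 = type(obj).__getitem__(obj, 3)

assert r1 == r2 == 6
```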
</div>
<div class="section" id="struct-module">
<h3>struct module</h3>
<p>In the issue #29300, <strong>Serhiy Storchaka</strong> and I converted most methods of the
C <tt class="docutils literal">_struct</tt> module to Argument Clinic to make them use the FASTCALL calling
convention. Using METH_FASTCALL avoids the creation of a temporary tuple to pass
positional arguments and so is faster. For example, <tt class="docutils literal"><span class="pre">struct.pack("i",</span> 1)</tt>
becomes <strong>1.56x faster</strong> (-36%):</p>
<pre class="literal-block">
$ ./python -m perf timeit \
-s 'import struct; pack=struct.pack' 'pack("i", 1)' \
--compare-to=../default-ref/python
Median +- std dev: 119 ns +- 1 ns -> 76.8 ns +- 0.4 ns: 1.56x faster (-36%)
Significant (t=295.91)
</pre>
<p>The difference is only <tt class="docutils literal">42.2 ns</tt>, but since the function only takes <tt class="docutils literal">76.8
ns</tt>, the difference is significant. The speedup can also be explained by more
efficient functions used to parse arguments. The new functions now use a cache
on the format string.</p>
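<p>That caching can also be made explicit in user code by precompiling the format
with <tt class="docutils literal">struct.Struct</tt>, which skips even the cache lookup:</p>

```python
import struct

# struct.pack() caches compiled format strings internally; a precompiled
# Struct object makes this explicit and avoids the per-call cache lookup.
packer = struct.Struct("i")
assert packer.pack(1) == struct.pack("i", 1)
assert packer.size == 4  # native C int on common platforms
```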
</div>
<div class="section" id="deque-module">
<h3>deque module</h3>
<p>I made a similar change to the deque type: the index(), insert() and
rotate() methods now use METH_FASTCALL. Speedup:</p>
<ul class="simple">
<li>d.index(): <strong>1.24x faster</strong></li>
<li>d.rotate(1): 1.24x faster</li>
<li>d.insert(): 1.18x faster</li>
<li>d.rotate(): 1.10x faster</li>
</ul>
</div>
</div>
<div class="section" id="tricky-bug">
<h2>Tricky bug</h2>
<div class="section" id="test-exceptions-test-unraisable">
<h3>test_exceptions.test_unraisable()</h3>
<p>The optimization on Python slots (issue #29507) caused a regression in the
test_unraisable() unit test of test_exceptions.</p>
<p>The <tt class="docutils literal">test_unraisable()</tt> method expects that <tt class="docutils literal">PyErr_WriteUnraisable(method)</tt>
fails on <tt class="docutils literal">repr(method)</tt>.</p>
<p>Before the change, <tt class="docutils literal">slot_tp_finalize()</tt> called
<tt class="docutils literal">PyErr_WriteUnraisable()</tt> with a PyMethodObject. In this case,
<tt class="docutils literal">repr(method)</tt> calls <tt class="docutils literal">repr(self)</tt> which is <tt class="docutils literal">BrokenRepr.__repr__()</tt> and
this call raises a new exception.</p>
<p>After the change, <tt class="docutils literal">slot_tp_finalize()</tt> uses an unbound method:
<tt class="docutils literal">repr()</tt> is called on a regular <tt class="docutils literal">__del__()</tt> method which doesn't call
<tt class="docutils literal">repr(self)</tt> and so <tt class="docutils literal">repr()</tt> doesn't fail anymore.</p>
<p>The fix is to remove the BrokenRepr unit test, since
<tt class="docutils literal">PyErr_WriteUnraisable()</tt> doesn't call <tt class="docutils literal">__repr__()</tt> anymore.</p>
<p>The removed test was really implementation specific, and my optimization
"fixed" the bug or "broke" the test. It's hard to say :-)</p>
</div>
<div class="section" id="unittest-assertraises-reference-cycle">
<h3>unittest assertRaises() reference cycle</h3>
<p>In April 2015, <strong>Vjacheslav Fyodorov</strong> reported a reference cycle in the
assertRaises() method of the unittest module: bpo-23890.</p>
<p>When the context manager API of the <tt class="docutils literal">assertRaises()</tt> method is used, the
context manager returns an object which contains the exception. So the
exception is kept alive longer than usual.</p>
<p>Python 3 exceptions now store traceback objects which contain local variables.
If a function stores the current exception in a local variable and the frame of
this function is part of the traceback, we get a reference cycle:</p>
<blockquote>
exception -> traceback -> frame -> variable -> exception</blockquote>
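<p>The cycle is easy to reproduce in a few lines (a minimal sketch of the
pattern, not the unittest code itself):</p>

```python
def make_exc():
    saved = None
    try:
        raise ValueError("boom")
    except ValueError as exc:
        saved = exc  # store the current exception in a local variable
    return saved

err = make_exc()
frame = err.__traceback__.tb_frame
# The traceback keeps the frame alive, and the frame's locals still
# reference the exception: exception -> traceback -> frame -> exception.
assert frame.f_locals["saved"] is err
```

Setting <tt class="docutils literal">saved = None</tt> in a <tt class="docutils literal">finally</tt> block, as in the fix below, breaks the
cycle.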
<p>I fixed the reference cycle by manually clearing local variables. Example of
change of my commit:</p>
<pre class="literal-block">
try:
return context.handle('assertRaises', args, kwargs)
finally:
# bpo-23890: manually break a reference cycle
context = None
</pre>
<p>It's not the first time that I fixed such a reference cycle in the unittest
module. My previous fix was issue #19880: fix a reference leak in
unittest.TestCase. Explicitly break reference cycles between frames and the
<tt class="docutils literal">_Outcome</tt> instance: commit <a class="reference external" href="https://github.com/python/cpython/commit/031bd532c48cf20a9cbf438bdae75dde49e36c51">031bd532</a>.</p>
</div>
</div>
<div class="section" id="fastcall-optimizations">
<h2>FASTCALL optimizations</h2>
<p>FASTCALL is my project to avoid creating a temporary tuple to pass positional
arguments and a temporary dictionary to pass keyword arguments when calling a
function. It optimizes function calls in general.</p>
<p>I continued work on FASTCALL to optimize code further and use FASTCALL in more
cases.</p>
<div class="section" id="recursion-depth">
<h3>Recursion depth</h3>
<p>In the issue #29306, I fixed the usage of Py_EnterRecursiveCall() to correctly
account for the recursion depth, fixing the code responsible for preventing C
stack overflows:</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">*PyCFunction_*Call*()</span></tt> functions now call <tt class="docutils literal">Py_EnterRecursiveCall()</tt>.</li>
<li><tt class="docutils literal">PyObject_Call()</tt> now directly calls <tt class="docutils literal">_PyFunction_FastCallDict()</tt> and
<tt class="docutils literal">PyCFunction_Call()</tt> to avoid calling <tt class="docutils literal">Py_EnterRecursiveCall()</tt> twice per
function call.</li>
</ul>
</div>
<div class="section" id="support-position-arguments">
<h3>Support positional arguments</h3>
<p>The issue #29286 enhanced Argument Clinic to use FASTCALL for functions which
only accept positional arguments:</p>
<ul class="simple">
<li>Rename _PyArg_ParseStack to _PyArg_ParseStackAndKeywords</li>
<li>Add _PyArg_ParseStack() helper function</li>
<li>Add _PyArg_NoStackKeywords() helper function.</li>
<li>Add _PyArg_UnpackStack() function helper</li>
<li>Argument Clinic: Use the METH_FASTCALL calling convention instead of
METH_VARARGS to parse positional arguments and to parse "boring" positional
arguments.</li>
</ul>
</div>
<div class="section" id="functions-converted-to-fastcall">
<h3>Functions converted to FASTCALL</h3>
<ul class="simple">
<li>_hashopenssl module</li>
<li>collections.OrderedDict methods (some of them, not all)</li>
<li>__build_class__(), getattr(), next() and sorted() builtin functions</li>
<li>type_prepare() C function, used in type constructor</li>
<li>dict.get() and dict.setdefault() now use Argument Clinic. The signature of
docstrings is also enhanced. For example, <tt class="docutils literal"><span class="pre">get(...)</span></tt> becomes
<tt class="docutils literal">get(self, key, default=None, /)</tt>. Add also a note explaining why
dict_update() doesn't use METH_FASTCALL.</li>
</ul>
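<p>The docstring signature improvement on dict.get() is visible from Python via
<tt class="docutils literal">inspect</tt>:</p>

```python
import inspect

# After the Argument Clinic conversion, dict.get() exposes a real
# text signature with its default value and positional-only marker.
sig = str(inspect.signature(dict.get))
assert "default=None" in sig
assert sig.endswith("/)")  # "/" marks positional-only parameters
```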
</div>
<div class="section" id="optimizations">
<h3>Optimizations</h3>
<ul class="simple">
<li>Issue #28839: Optimize function_call(), now simply calls
_PyFunction_FastCallDict() which is more efficient (fast paths for the common
case, optimized code object and no keyword argument).</li>
<li>Issue #28839: Optimize _PyFunction_FastCallDict() when kwargs is an empty
dictionary: avoid the creation of a useless empty tuple.</li>
<li>Issue #29259: Write fast path in _PyCFunction_FastCallKeywords() for
METH_FASTCALL, avoid the creation of a temporary dictionary for keyword
arguments.</li>
<li>Issue #29259, #29263: methoddescr_call() creates a PyCFunction object, calls
it and then destroys it. Add a new _PyMethodDef_RawFastCallDict() function to
avoid the temporary PyCFunction object.</li>
<li>PyCFunction_Call() now calls _PyCFunction_FastCallDict()</li>
<li>bpo-29735: Optimize partial_call(): avoid tuple. Add _PyObject_HasFastCall().
Fix also a performance regression in partial_call() if the callable doesn't
support FASTCALL.</li>
</ul>
</div>
<div class="section" id="bugfixes">
<h3>Bugfixes</h3>
<ul class="simple">
<li>Issue #29286: _PyStack_UnpackDict() now returns -1 on error. Change
_PyStack_UnpackDict() prototype to be able to notify of failure when args is
NULL.</li>
<li>Fix a PyCFunction_Call() performance issue. Issue #29259, #29465:
PyCFunction_Call() no longer creates a redundant tuple to pass positional
arguments for METH_VARARGS. Add a new cfunction_call()
subfunction.</li>
</ul>
</div>
<div class="section" id="objects-call-c-file">
<h3>Objects/call.c file</h3>
<p>The issue #29465 moved all C "call" functions to a new
Objects/call.c file. Moving all these functions to the same place should help
keep the code consistent. It might also help the compiler to inline code more
easily, or help to keep more machine code in the CPU instruction cache.</p>
<p>This change was made during the GitHub migration. Since the change is big
(modify many <tt class="docutils literal">.c</tt> files), I got many conflicts and it was annoying to rebase
it. I am now happy to get this <tt class="docutils literal">call.c</tt> file, it already helped me :-)</p>
<p>Having <tt class="docutils literal">call.c</tt> also helps to keep helper functions near their
callers, and avoids exposing them in the C API, even as private
functions.</p>
</div>
<div class="section" id="don-t-optimize-keywords">
<h3>Don't optimize keywords</h3>
<ul class="simple">
<li>Document that _PyFunction_FastCallDict() must copy kwargs. Issue #29318:
Caller and callee functions must not share the dictionary: kwargs must be
copied.</li>
<li>Document why functools.partial() must copy kwargs. Add a comment to prevent
further attempts to avoid a copy for optimization.</li>
</ul>
</div>
</div>
<div class="section" id="stack-consumption">
<h2>Stack consumption</h2>
<p>A FASTCALL micro-optimization was blocked by Serhiy Storchaka because it
increased the C stack consumption. In the past, I never analyzed the C stack
consumption. Since I wanted to get this micro-optimization merged, I tried to
reduce the consumption.</p>
<p>At the beginning, I wrote a function to <strong>measure</strong> the C stack consumption in
a reliable way. It took me a few iterations.</p>
<p>Table showing the C stack consumption in bytes, and the difference compared to
Python 3.5 (last release before I started working on FASTCALL):</p>
<table border="1" class="docutils">
<colgroup>
<col width="27%" />
<col width="22%" />
<col width="7%" />
<col width="22%" />
<col width="22%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Function</th>
<th class="head">2.7</th>
<th class="head">3.5</th>
<th class="head">3.6</th>
<th class="head">3.7</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>test_python_call</td>
<td>1,360 (<strong>+352</strong>)</td>
<td>1,008</td>
<td>1,120 (<strong>+112</strong>)</td>
<td>960 (<strong>-48</strong>)</td>
</tr>
<tr><td>test_python_getitem</td>
<td>1,408 (<strong>+288</strong>)</td>
<td>1,120</td>
<td>1,168 (<strong>+48</strong>)</td>
<td>880 (<strong>-240</strong>)</td>
</tr>
<tr><td>test_python_iterator</td>
<td>1,424 (<strong>+192</strong>)</td>
<td>1,232</td>
<td>1,200 (<strong>-32</strong>)</td>
<td>1,024 (<strong>-208</strong>)</td>
</tr>
<tr><td>Total</td>
<td>4,192 (<strong>+832</strong>)</td>
<td>3,360</td>
<td>3,488 (<strong>+128</strong>)</td>
<td>2,864 (<strong>-496</strong>)</td>
</tr>
</tbody>
</table>
<p>Table showing the number of function calls before a stack overflow,
and the difference compared to Python 3.5:</p>
<table border="1" class="docutils">
<colgroup>
<col width="24%" />
<col width="23%" />
<col width="7%" />
<col width="23%" />
<col width="23%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Function</th>
<th class="head">2.7</th>
<th class="head">3.5</th>
<th class="head">3.6</th>
<th class="head">3.7</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>test_python_call</td>
<td>6,161 (<strong>-2,153</strong>)</td>
<td>8,314</td>
<td>7,482 (<strong>-832</strong>)</td>
<td>8,729 (<strong>+415</strong>)</td>
</tr>
<tr><td>test_python_getitem</td>
<td>5,951 (<strong>-1,531</strong>)</td>
<td>7,482</td>
<td>7,174 (<strong>-308</strong>)</td>
<td>9,522 (<strong>+2,040</strong>)</td>
</tr>
<tr><td>test_python_iterator</td>
<td>5,885 (<strong>-916</strong>)</td>
<td>6,801</td>
<td>6,983 (<strong>+182</strong>)</td>
<td>8,184 (<strong>+1,383</strong>)</td>
</tr>
<tr><td>Total</td>
<td>17,997 (<strong>-4600</strong>)</td>
<td>22,597</td>
<td>21,639 (<strong>-958</strong>)</td>
<td>26,435 (<strong>+3,838</strong>)</td>
</tr>
</tbody>
</table>
<p>Python 3.7 is the best of 2.7, 3.5, 3.6 and 3.7: lowest stack consumption and
maximum number of calls (before a stack overflow) ;-)</p>
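<p>The C-level measurements above require a dedicated harness, but a rough
Python-level analogue of the second table (counting calls before the
interpreter gives up) is easy to write:</p>

```python
import sys

def calls_before_limit():
    # Count how many nested Python calls fit before RecursionError.
    # This measures the interpreter's recursion limit, which exists
    # precisely to keep the C stack from overflowing.
    depth = 0
    def recurse():
        nonlocal depth
        depth += 1
        recurse()
    try:
        recurse()
    except RecursionError:
        pass
    return depth

n = calls_before_limit()
assert 0 < n <= sys.getrecursionlimit()
```

Note that this counts Python frames against sys.getrecursionlimit(), while
the table counts actual C stack usage, so the absolute numbers differ.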
<p>Changes:</p>
<ul class="simple">
<li>call_method() now uses _PyObject_FastCall(). Issue #29233: Replace the
inefficient _PyObject_VaCallFunctionObjArgs() with _PyObject_FastCall() in
call_method() and call_maybe().</li>
<li>Issue #29227: Inline call_function() into _PyEval_EvalFrameDefault() using
Py_LOCAL_INLINE to reduce the stack consumption.</li>
<li>Issue #29234: Inlining _PyStack_AsTuple() into callers increases their stack
consumption, so disable inlining to reduce the stack consumption. Add
_Py_NO_INLINE: use __attribute__((noinline)) with GCC and Clang.</li>
</ul>
</div>
<div class="section" id="contributions">
<h2>Contributions</h2>
<ul class="simple">
<li>Issue #28961: Fix unittest.mock._Call helper: don't ignore the name parameter
anymore. Patch written by <strong>Jiajun Huang</strong>.</li>
<li>Prohibit implicit C function declarations. Issue #27659: use
-Werror=implicit-function-declaration when possible (GCC and Clang, but it
depends on the compiler version). Patch written by <strong>Chi Hsuan Yen</strong>.</li>
</ul>
</div>
<div class="section" id="os-urandom-and-getrandom">
<h2>os.urandom() and getrandom()</h2>
<p>As usual, I had fun with os.urandom() in this quarter (see my previous article
on urandom: <a class="reference external" href="https://vstinner.github.io/pep-524-os-urandom-blocking.html">PEP 524: os.urandom() now blocks on Linux in Python 3.6</a>).</p>
<p>The glibc developers succeeded in implementing a getrandom() function in glibc
2.25 (February 2017) to expose the "new" Linux getrandom() syscall which was
introduced in Linux 3.17 (August 2014). Read the LWN article: <a class="reference external" href="https://lwn.net/Articles/711013/">The long road to
getrandom() in glibc</a>.</p>
<p>I created the issue #29157 because my os.urandom() implementation wasn't ready
for the addition of a getrandom() function to the glibc on Linux. My
implementation using the getrandom() function didn't handle the ENOSYS error
(syscall not supported), raised when Python is compiled on a recent kernel and
glibc but run on an older kernel and glibc.</p>
<p>I rewrote the code to prefer getrandom() over getentropy():</p>
<ul class="simple">
<li>dev_urandom() now calls py_getentropy(): prepare the fallback to support a
getentropy() failure by falling back on reading from /dev/urandom.</li>
<li>Simplify dev_urandom(). pyurandom() is now responsible to call getentropy()
or getrandom(). Enhance also dev_urandom() and pyurandom() documentation.</li>
<li>getrandom() is now preferred over getentropy(). glibc 2.24 implements
getentropy() on Linux using the getrandom() syscall, but getentropy()
doesn't support non-blocking mode. Since getrandom() is tried first, it's no
longer needed to explicitly exclude getentropy() on Solaris. Replace
"if defined(HAVE_GETENTROPY) && !defined(sun)"
with "if defined(HAVE_GETENTROPY)".</li>
<li>Enhance the py_getrandom() documentation. py_getentropy() now supports the
ENOSYS, EPERM and EINTR errors.</li>
</ul>
<p>IMHO the main enhancement was the documentation (comments) of the code. The
main function pyurandom() now has this long comment:</p>
<blockquote>
<p>Read random bytes:</p>
<ul class="simple">
<li>Return 0 on success</li>
<li>Raise an exception (if raise is non-zero) and return -1 on error</li>
</ul>
<p>Used sources of entropy ordered by preference, preferred source first:</p>
<ul class="simple">
<li>CryptGenRandom() on Windows</li>
<li>getrandom() function (ex: Linux and Solaris): call py_getrandom()</li>
<li>getentropy() function (ex: OpenBSD): call py_getentropy()</li>
<li>/dev/urandom device</li>
</ul>
<p>Read from the /dev/urandom device if getrandom() or getentropy() function
is not available or does not work.</p>
<p>Prefer getrandom() over getentropy() because getrandom() supports blocking
and non-blocking mode: see the PEP 524. Python requires non-blocking RNG at
startup to initialize its hash secret, but os.urandom() must block until the
system urandom is initialized (at least on Linux 3.17 and newer).</p>
<p>Prefer getrandom() and getentropy() over reading directly /dev/urandom
because these functions don't need file descriptors and so avoid ENFILE or
EMFILE errors (too many open files): see the issue #18756.</p>
<p>Only the getrandom() function supports non-blocking mode.</p>
<p>Only use RNG running in the kernel. They are more secure because it is
harder to get the internal state of a RNG running in the kernel land than a
RNG running in the user land. The kernel has a direct access to the hardware
and has access to hardware RNG, they are used as entropy sources.</p>
<p>Note: the OpenSSL RAND_pseudo_bytes() function does not automatically reseed
its RNG on fork(), two child processes (with the same pid) generate the same
random numbers: see issue #18747. Kernel RNGs don't have this issue,
they have access to good quality entropy sources.</p>
<p>If raise is zero:</p>
<ul class="simple">
<li>Don't raise an exception on error</li>
<li>Don't call the Python signal handler (don't call PyErr_CheckSignals()) if
a function fails with EINTR: retry directly the interrupted function</li>
<li>Don't release the GIL to call functions.</li>
</ul>
</blockquote>
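<p>The fallback chain can be sketched at the Python level. This is a hypothetical
simplification of the C code: os.getrandom() only exists on Linux (exposed in
Python 3.6), and the CryptGenRandom branch for Windows is omitted:</p>

```python
import os

def read_random(n):
    # Sketch of the preference order: getrandom() first, then the
    # /dev/urandom device.
    getrandom = getattr(os, "getrandom", None)
    if getrandom is not None:
        try:
            return getrandom(n)
        except OSError:
            pass  # e.g. ENOSYS: built on a recent kernel, run on an old one
    with open("/dev/urandom", "rb") as fp:
        return fp.read(n)

data = read_random(16)
```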
</div>
<div class="section" id="migration-to-github">
<h2>Migration to GitHub</h2>
<p>In February 2017, the Mercurial repository was converted to Git and the
development of CPython moved to GitHub at <a class="reference external" href="https://github.com/python/cpython/">https://github.com/python/cpython/</a>. I
helped to polish the migration in early days:</p>
<ul class="simple">
<li>Rename README to README.rst and enhance formatting</li>
<li>bpo-29527: Don't treat warnings as error in Travis docs job</li>
<li>Travis CI: run rstlint.py in the docs job. Currently,
<a class="reference external" href="http://buildbot.python.org/all/buildslaves/ware-docs">http://buildbot.python.org/all/buildslaves/ware-docs</a> buildbot is only run as
post-commit. For example, bpo-29521 (PR#41) introduced two warnings which
were not caught by the Travis CI docs job. Modify the docs job to run
tools/rstlint.py. Also fix the two minor warnings which caused the buildbot
slave to fail. Doc/Makefile: set PYTHON to python3.</li>
<li>Add Travis CI and Codecov badges to README.</li>
<li>Exclude myself from mention-bot. I made changes in almost all CPython files
last 5 years, so mention-bot asks me to review basically all pull requests. I
simply don't have the bandwidth to review everything, sorry! I prefer to
select myself which PR I want to follow.</li>
<li>bpo-27425: Add .gitattributes, fix Windows tests. Mark binary files as binary
in .gitattributes so that Git doesn't translate newline characters in
repositories on Windows.</li>
</ul>
</div>
<div class="section" id="enhancements">
<h2>Enhancements</h2>
<ul class="simple">
<li>Issue #29259: python-gdb.py now also looks for PyCFunction in the current
frame, not only in the older frame. python-gdb.py now also supports
method-wrapper (wrapperobject) objects (Issue #29367).</li>
<li>Issue #26273: Document the new TCP_USER_TIMEOUT and TCP_CONGESTION constants</li>
<li>bpo-29919: Remove unused imports found by pyflakes. Make also minor PEP8
coding style fixes on modified imports.</li>
<li>bpo-29887: Test normalization now fails if download fails; fix also a
ResourceWarning.</li>
</ul>
</div>
<div class="section" id="security">
<h2>Security</h2>
<ul class="simple">
<li>Backport for Python 3.4. Issues #27850 and #27766: Remove 3DES from ssl
default cipher list and add ChaCha20 Poly1305. See the <a class="reference external" href="http://python-security.readthedocs.io/vuln/cve-2016-2183_sweet32_attack_des_3des.html">CVE-2016-2183:
Sweet32 attack (DES, 3DES)</a>
vulnerability.</li>
</ul>
</div>
<div class="section" id="regrtest">
<h2>regrtest</h2>
<p>regrtest is the runner of the Python test suite. Changes:</p>
<ul class="simple">
<li>regrtest: don't fail immediately if a child process crashes. Issue #29362:
Catch a crash of a worker process as a normal failure and continue to run the
next tests. It allows getting the usual test summary: single line result
(OK/FAIL), total duration, etc.</li>
<li>Fix regrtest -j0 -R output: also write dots to stderr, instead of stdout.</li>
</ul>
</div>
<div class="section" id="bugfixes-1">
<h2>Bugfixes</h2>
<ul class="simple">
<li>Issue #29140: Fix hash(datetime.time). Fix time_hash() function: replace
DATE_xxx() macros with TIME_xxx() macros. Before, the hash function used a
wrong value for the microseconds if fold was set (equal to 1).</li>
<li>Issue #29174, #26741: Fix subprocess.Popen.__del__() on Python shutdown.
subprocess.Popen.__del__() now keeps a strong reference to the warnings.warn()
function. The change allows logging the warning late during Python
finalization. Before, the warning was ignored, or an error was logged instead
of the warning.</li>
<li>Issue #25591: Fix test_imaplib if the module ssl is missing.</li>
<li>Fix script_helper.run_python_until_end(): copy the <tt class="docutils literal">SYSTEMROOT</tt> environment
variable. Windows requires at least the SYSTEMROOT environment variable to
start Python. If run_python_until_end() doesn't copy SYSTEMROOT, the
function always fails on Windows.</li>
<li>Fix datetime.fromtimestamp(): check bounds. Issue #29100: Fix
datetime.fromtimestamp() regression introduced in Python 3.6.0: check minimum
and maximum years.</li>
<li>Fix test_datetime on system with 32-bit time_t. Issue #29100: Catch
OverflowError in the new test_timestamp_limits() test.</li>
<li>Fix test_datetime on Windows. Issue #29100: On Windows,
datetime.datetime.fromtimestamp(min_ts) fails with an OSError in
test_timestamp_limits().</li>
<li>bpo-29176: Fix the name of the _curses.window class. Set name to
<tt class="docutils literal">_curses.window</tt> instead of <tt class="docutils literal">_curses.curses window</tt> with a space!?</li>
<li>bpo-29619: os.stat() and os.DirEntry.inode() now convert the inode (st_ino)
using unsigned integers to support very large inodes (larger than 2^31).</li>
</ul>
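<p>The hash fix on the first item can be checked directly: two times which differ
only by <tt class="docutils literal">fold</tt> compare equal, so they must hash equal too:</p>

```python
from datetime import time

t0 = time(1, 2, 3, 4, fold=0)
t1 = time(1, 2, 3, 4, fold=1)

assert t0 == t1              # fold is ignored in comparisons
assert hash(t0) == hash(t1)  # equal objects must have equal hashes
```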
</div>
speed.python.org results: March 20172017-03-29T00:40:00+02:002017-03-29T00:40:00+02:00Victor Stinnertag:vstinner.github.io,2017-03-29:/speed-python-org-march-2017.html<p>In February 2017, CPython moved from Bitbucket with Mercurial to GitHub with
Git: read <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-February/147381.html">[Python-Dev] CPython is now on GitHub</a> by
Brett Cannon.</p>
<p>In 2016, I worked on speed.python.org to automate running benchmarks and make
benchmarks more stable. At the end, I had a single command to:</p>
<ul class="simple">
<li>tune …</li></ul><p>In February 2017, CPython moved from Bitbucket with Mercurial to GitHub with
Git: read <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2017-February/147381.html">[Python-Dev] CPython is now on GitHub</a> by
Brett Cannon.</p>
<p>In 2016, I worked on speed.python.org to automate running benchmarks and make
benchmarks more stable. At the end, I had a single command to:</p>
<ul class="simple">
<li>tune the system for benchmarks</li>
<li>compile CPython using LTO+PGO</li>
<li>install CPython</li>
<li>install performance</li>
<li>run performance</li>
<li>upload results</li>
</ul>
<p>But my tools were written for Mercurial, and speed.python.org uses Mercurial
revisions as keys for changes. Since the CPython repository was converted to
Git, I have to remove all old results and run the old benchmarks again. But
before removing everything, I took screenshots of the most interesting pages. I
would prefer to keep a copy of all data, but that would require writing new
tools and I am not motivated to do that.</p>
<div class="section" id="python-3-7-compared-to-python-2-7">
<h2>Python 3.7 compared to Python 2.7</h2>
<p>Benchmarks where Python 3.7 is <strong>faster</strong> than Python 2.7:</p>
<img alt="python37_faster_py27" src="https://vstinner.github.io/images/speed2017/python37_faster_py27.png" />
<p>Benchmarks where Python 3.7 is <strong>slower</strong> than Python 2.7:</p>
<img alt="python37_slower_py27" src="https://vstinner.github.io/images/speed2017/python37_slower_py27.png" />
</div>
<div class="section" id="significant-optimizations">
<h2>Significant optimizations</h2>
<p>CPython became regularly faster in 2016 on the following benchmarks.</p>
<p>call_method: the main optimization was <a class="reference external" href="https://bugs.python.org/issue26110">Speedup method calls 1.2x</a>:</p>
<img alt="call_method" src="https://vstinner.github.io/images/speed2017/call_method.png" />
<p>float:</p>
<img alt="float" src="https://vstinner.github.io/images/speed2017/float.png" />
<p>hexiom:</p>
<img alt="hexiom" src="https://vstinner.github.io/images/speed2017/hexiom.png" />
<p>nqueens:</p>
<img alt="nqueens" src="https://vstinner.github.io/images/speed2017/nqueens.png" />
<p>pickle_list, something happened near September 2016:</p>
<img alt="pickle_list" src="https://vstinner.github.io/images/speed2017/pickle_list.png" />
<p>richards:</p>
<img alt="richards" src="https://vstinner.github.io/images/speed2017/richards.png" />
<p>scimark_lu, I like the last dot!</p>
<img alt="scimark_lu" src="https://vstinner.github.io/images/speed2017/scimark_lu.png" />
<p>scimark_sor:</p>
<img alt="scimark_sor" src="https://vstinner.github.io/images/speed2017/scimark_sor.png" />
<p>sympy_sum:</p>
<img alt="sympy_sum" src="https://vstinner.github.io/images/speed2017/sympy_sum.png" />
<p>telco is one of the most impressive: it became regularly faster:</p>
<img alt="telco" src="https://vstinner.github.io/images/speed2017/telco.png" />
<p>unpickle_list, something happened between March and May 2016:</p>
<img alt="unpickle_list" src="https://vstinner.github.io/images/speed2017/unpickle_list.png" />
</div>
<div class="section" id="the-enum-change">
<h2>The enum change</h2>
<p>One change related to the <tt class="docutils literal">enum</tt> module had significant impact on the two
following benchmarks.</p>
<p>python_startup:</p>
<img alt="python_startup" src="https://vstinner.github.io/images/speed2017/python_startup.png" />
<p>See "Python startup performance regression" section of <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q4.html">My contributions to
CPython during 2016 Q4</a> for the
explanation on changes around September 2016.</p>
<p>regex_compile became 1.2x slower (312 ms => 376 ms: +20%) because constants
of the <tt class="docutils literal">re</tt> module became <tt class="docutils literal">enum</tt> objects: see <a class="reference external" href="http://bugs.python.org/issue28082">convert re flags to (much
friendlier) IntFlag constants (issue #28082)</a>.</p>
<img alt="regex_compile" src="https://vstinner.github.io/images/speed2017/regex_compile.png" />
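<p>The change is visible from Python: the module-level flags are now members of
<tt class="docutils literal">re.RegexFlag</tt>:</p>

```python
import re

# Since the bpo-28082 change, re's flags are IntFlag members with a
# readable repr instead of bare integers.
assert isinstance(re.IGNORECASE, re.RegexFlag)
assert re.IGNORECASE | re.MULTILINE == re.RegexFlag(re.I | re.M)
```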
</div>
<div class="section" id="benchmarks-became-stable">
<h2>Benchmarks became stable</h2>
<p>The following benchmarks are microbenchmarks which are impacted by many
external factors, so it's hard to get stable results. I'm happy to see that the
results are now stable, even very stable compared to the results when I started
to work on the project!</p>
<p>call_simple:</p>
<img alt="call_simple" src="https://vstinner.github.io/images/speed2017/call_simple.png" />
<p>spectral_norm:</p>
<img alt="spectral_norm" src="https://vstinner.github.io/images/speed2017/spectral_norm.png" />
</div>
<div class="section" id="straight-line">
<h2>Straight line</h2>
<p>It seems like no optimization had a significant impact on the following
benchmarks. You can also see that the benchmarks became stable, making it
easier to detect a performance regression or a significant optimization.</p>
<p>dulwich_log:</p>
<img alt="dulwich_log" src="https://vstinner.github.io/images/speed2017/dulwich_log.png" />
<p>pidigits:</p>
<img alt="pidigits" src="https://vstinner.github.io/images/speed2017/pidigits.png" />
<p>sqlite_synth:</p>
<img alt="sqlite_synth" src="https://vstinner.github.io/images/speed2017/sqlite_synth.png" />
<p>Apart from something around April 2016, the tornado_http result is stable:</p>
<img alt="tornado_http" src="https://vstinner.github.io/images/speed2017/tornado_http.png" />
</div>
<div class="section" id="unstable-benchmarks">
<h2>Unstable benchmarks</h2>
<p>After months of effort to make everything stable, some benchmarks are still
unstable, even if temporary spikes are lower than before. See <a class="reference external" href="https://vstinner.github.io/analysis-python-performance-issue.html">Analysis of a
Python performance issue</a>
for the size of previous temporary performance spikes.</p>
<p>regex_v8:</p>
<img alt="regex_v8" src="https://vstinner.github.io/images/speed2017/regex_v8.png" />
<p>scimark_sparse_mat_mult:</p>
<img alt="scimark_sparse_mat_mult" src="https://vstinner.github.io/images/speed2017/scimark_sparse_mat_mult.png" />
<p>unpickle_pure_python:</p>
<img alt="unpickle_pure_python" src="https://vstinner.github.io/images/speed2017/unpickle_pure_python.png" />
</div>
<div class="section" id="boring-results">
<h2>Boring results</h2>
<p>There is nothing interesting to say on the following benchmark results.</p>
<p>2to3:</p>
<img alt="2to3" src="https://vstinner.github.io/images/speed2017/2to3.png" />
<p>crypto_pyaes:</p>
<img alt="crypto_pyaes" src="https://vstinner.github.io/images/speed2017/crypto_pyaes.png" />
<p>deltablue:</p>
<img alt="deltablue" src="https://vstinner.github.io/images/speed2017/deltablue.png" />
<p>logging_silent:</p>
<img alt="logging_silent" src="https://vstinner.github.io/images/speed2017/logging_silent.png" />
<p>mako:</p>
<img alt="mako" src="https://vstinner.github.io/images/speed2017/mako.png" />
<p>xml_etree_process:</p>
<img alt="xml_etree_process" src="https://vstinner.github.io/images/speed2017/xml_etree_process.png" />
<p>xml_etree_iterparse:</p>
<img alt="xml_etre_iterparse" src="https://vstinner.github.io/images/speed2017/xml_etre_iterparse.png" />
</div>
FASTCALL issues2017-02-25T00:00:00+01:002017-02-25T00:00:00+01:00Victor Stinnertag:vstinner.github.io,2017-02-25:/fastcall-issues.html<p>Here is the raw list of the 46 CPython issues I opened between 2016-04-21 and
2017-02-10 to implement my FASTCALL optimization. Most issues created in 2016
are already part of Python 3.6.0, some are already merged into the future
Python 3.7, the few remaining issues are still …</p><p>Here is the raw list of the 46 CPython issues I opened between 2016-04-21 and
2017-02-10 to implement my FASTCALL optimization. Most issues created in 2016
are already part of Python 3.6.0, some are already merged into the future
Python 3.7, the few remaining issues are still open.</p>
<div class="section" id="fastcall-issues-1">
<h2>27 FASTCALL issues</h2>
<ul class="simple">
<li>2016-04-21: <a class="reference external" href="http://bugs.python.org/issue26814">[WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments</a></li>
<li>2016-05-26: <a class="reference external" href="http://bugs.python.org/issue27128">Add _PyObject_FastCall()</a></li>
<li>2016-08-20: <a class="reference external" href="http://bugs.python.org/issue27809">Add _PyFunction_FastCallDict(): fast call with keyword arguments as a dict</a></li>
<li>2016-08-20: <a class="reference external" href="http://bugs.python.org/issue27810">Add METH_FASTCALL: new calling convention for C functions</a></li>
<li>2016-08-22: <a class="reference external" href="http://bugs.python.org/issue27830">Add _PyObject_FastCallKeywords(): avoid the creation of a temporary dictionary for keyword arguments</a></li>
<li>2016-08-23: <a class="reference external" href="http://bugs.python.org/issue27840">functools.partial: don't copy keywoard arguments in partial_call()?</a> [<strong>REJECTED</strong>]</li>
<li>2016-08-23: <a class="reference external" href="http://bugs.python.org/issue27841">Use fast call in method_call() and slot_tp_new()</a></li>
<li>2016-08-23: <a class="reference external" href="http://bugs.python.org/issue27845">Optimize update_keyword_args() function</a></li>
<li>2016-11-22: <a class="reference external" href="http://bugs.python.org/issue28770">Update python-gdb.py for fastcalls</a></li>
<li>2016-11-30: <a class="reference external" href="http://bugs.python.org/issue28839">_PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()</a> [<strong>REJECTED</strong>]</li>
<li>2016-12-02: <a class="reference external" href="http://bugs.python.org/issue28855">Compiler warnings in _PyObject_CallArg1()</a></li>
<li>2016-12-02: <a class="reference external" href="http://bugs.python.org/issue28858">Fastcall uses more C stack</a></li>
<li>2016-12-09: <a class="reference external" href="http://bugs.python.org/issue28915">Modify PyObject_CallFunction() to use fast call internally</a></li>
<li>2017-01-10: <a class="reference external" href="http://bugs.python.org/issue29227">Reduce C stack consumption in function calls</a></li>
<li>2017-01-10: <a class="reference external" href="http://bugs.python.org/issue29233">call_method(): call _PyObject_FastCall() rather than _PyObject_VaCallFunctionObjArgs()</a></li>
<li>2017-01-11: <a class="reference external" href="http://bugs.python.org/issue29234">Disable inlining of _PyStack_AsTuple() to reduce the stack consumption</a></li>
<li>2017-01-13: <a class="reference external" href="http://bugs.python.org/issue29259">Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects</a> [<strong>REJECTED</strong>]</li>
<li>2017-01-13: <a class="reference external" href="http://bugs.python.org/issue29263">Implement LOAD_METHOD/CALL_METHOD for C functions</a></li>
<li>2017-01-18: <a class="reference external" href="http://bugs.python.org/issue29306">Check usage of Py_EnterRecursiveCall() and Py_LeaveRecursiveCall() in new FASTCALL functions</a></li>
<li>2017-01-19: <a class="reference external" href="http://bugs.python.org/issue29318">Optimize _PyFunction_FastCallDict() for **kwargs</a> [<strong>REJECTED</strong>]</li>
<li>2017-01-24: <a class="reference external" href="http://bugs.python.org/issue29358">Add tp_fastnew and tp_fastinit to PyTypeObject, 15-20% faster object instanciation</a> [<strong>REJECTED</strong>]</li>
<li>2017-01-24: <a class="reference external" href="http://bugs.python.org/issue29360">_PyStack_AsDict(): Don't check if all keys are strings nor if keys are unique</a></li>
<li>2017-01-25: <a class="reference external" href="http://bugs.python.org/issue29367">python-gdb: display wrapper_call()</a></li>
<li>2017-02-05: <a class="reference external" href="http://bugs.python.org/issue29451">Use _PyArg_Parser for _PyArg_ParseStack(): support positional only arguments</a></li>
<li>2017-02-06: <a class="reference external" href="http://bugs.python.org/issue29465">Modify _PyObject_FastCall() to reduce stack consumption</a></li>
<li>2017-02-09: <a class="reference external" href="http://bugs.python.org/issue29507">Use FASTCALL in call_method() to avoid temporary tuple</a></li>
<li>2017-02-10: <a class="reference external" href="http://bugs.python.org/issue29524">Move functions to call objects into a new Objects/call.c file</a></li>
</ul>
</div>
<div class="section" id="issues-converting-functions-to-fastcall">
<h2>3 issues converting functions to FASTCALL</h2>
<ul class="simple">
<li>2017-01-16: <a class="reference external" href="http://bugs.python.org/issue29286">Use METH_FASTCALL in str methods</a></li>
<li>2017-01-18: <a class="reference external" href="http://bugs.python.org/issue29312">Use FASTCALL in dict.update()</a> [<strong>REJECTED</strong>]</li>
<li>2017-02-05: <a class="reference external" href="http://bugs.python.org/issue29452">Use FASTCALL for collections.deque methods: index, insert, rotate</a></li>
</ul>
</div>
<div class="section" id="argument-clinic-issues">
<h2>6 Argument Clinic issues</h2>
<p>Converting code to Argument Clinic converts METH_VARARGS methods to
METH_FASTCALL.</p>
<ul class="simple">
<li>2017-01-16: <a class="reference external" href="http://bugs.python.org/issue29289">Convert OrderedDict methods to Argument Clinic</a></li>
<li>2017-01-17: <a class="reference external" href="http://bugs.python.org/issue29299">Argument Clinic: Fix signature of optional positional-only arguments</a></li>
<li>2017-01-17: <a class="reference external" href="http://bugs.python.org/issue29300">Modify the _struct module to use FASTCALL and Argument Clinic</a></li>
<li>2017-01-17: <a class="reference external" href="http://bugs.python.org/issue29301">decimal: Use FASTCALL and/or Argument Clinic</a></li>
<li>2017-01-18: <a class="reference external" href="http://bugs.python.org/issue29311">Argument Clinic: convert dict methods</a></li>
<li>2017-02-02: <a class="reference external" href="http://bugs.python.org/issue29419">Argument Clinic: inline PyArg_UnpackTuple and PyArg_ParseStack(AndKeyword)?</a></li>
</ul>
</div>
<div class="section" id="other-optimization-issues">
<h2>10 other optimization issues</h2>
<ul class="simple">
<li>2016-08-24: <a class="reference external" href="http://bugs.python.org/issue27848">C function calls: use Py_ssize_t rather than C int for number of arguments</a></li>
<li>2016-09-07: <a class="reference external" href="http://bugs.python.org/issue28004">Optimize bytes.join(sequence)</a> [<strong>REJECTED</strong>]</li>
<li>2016-11-05: <a class="reference external" href="http://bugs.python.org/issue28618">Decorate hot functions using __attribute__((hot)) to optimize Python</a></li>
<li>2016-11-07: <a class="reference external" href="http://bugs.python.org/issue28637">Python startup performance regression</a></li>
<li>2016-11-25: <a class="reference external" href="http://bugs.python.org/issue28800">Add RETURN_NONE bytecode instruction</a> [<strong>REJECTED</strong>]</li>
<li>2016-11-25: <a class="reference external" href="http://bugs.python.org/issue28799">Drop CALL_PROFILE special build?</a></li>
<li>2016-12-09: <a class="reference external" href="http://bugs.python.org/issue28924">Inline PyEval_EvalFrameEx() in callers</a> [<strong>REJECTED</strong>]</li>
<li>2016-12-15: <a class="reference external" href="http://bugs.python.org/issue28977">Document PyObject_CallFunction() special case more explicitly</a></li>
<li>2017-02-06: <a class="reference external" href="http://bugs.python.org/issue29461">Experiment usage of likely/unlikely in CPython core</a></li>
<li>2017-02-08: <a class="reference external" href="http://bugs.python.org/issue29502">Should PyObject_Call() call the profiler on C functions, use C_TRACE() macro?</a></li>
</ul>
</div>
FASTCALL microbenchmarks2017-02-24T22:00:00+01:002017-02-24T22:00:00+01:00Victor Stinnertag:vstinner.github.io,2017-02-24:/fastcall-microbenchmarks.html<p>For my FASTCALL project (CPython optimization avoiding temporary tuples and
dictionaries to pass arguments), I wrote many short microbenchmarks. I grouped
them into a new Git repository: <a class="reference external" href="https://github.com/vstinner/pymicrobench">pymicrobench</a>. Benchmark results are required by
CPython developers to prove that an optimization is worth it. It's not uncommon
that I abandon a …</p><p>For my FASTCALL project (CPython optimization avoiding temporary tuples and
dictionaries to pass arguments), I wrote many short microbenchmarks. I grouped
them into a new Git repository: <a class="reference external" href="https://github.com/vstinner/pymicrobench">pymicrobench</a>. Benchmark results are required by
CPython developers to prove that an optimization is worth it. It's not uncommon
that I abandon a change because the speedup is not significant, makes CPython
slower, or because the change is too complex. Over the last 12 months, I counted
that I abandoned 9 optimization issues, rejected for different reasons, out of a
total of 46 optimization issues.</p>
<p>This article gives Python 3.7 results of these microbenchmarks compared to
Python 3.5 (before FASTCALL). I ignored 3 microbenchmarks which are between 2%
and 5% slower: the code was not optimized and the result is not significant
(less than 10% on a <em>microbenchmark</em> is not significant).</p>
<p>In the results below, the speedup is between 1.11x faster (-10%) and 1.92x
faster (-48%). It's not easy to isolate the speedup of FASTCALL alone, since
Python 3.7 gained many other optimizations after Python 3.5.</p>
<p>Using FASTCALL gives a speedup of around 20 ns, as measured on a patch
converting code to FASTCALL. It's not a lot, but many builtin functions take
less than 100 ns, so 20 ns is significant in practice! Avoiding a tuple to pass
positional arguments is interesting in itself, but FASTCALL also enables
further internal optimizations.</p>
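<p>Per-call timings like the ones in the tables below can be reproduced with the <tt class="docutils literal">timeit</tt> module; a minimal sketch (absolute numbers depend heavily on the CPU, the compiler and the Python version):</p>

```python
import timeit

# Measure the per-call cost of one of the builtin calls benchmarked below.
timer = timeit.Timer('getattr(1, "real")')
number, total = timer.autorange()  # grow the loop count until a run takes >= 0.2 s
print(f"getattr(1, 'real'): {total / number * 1e9:.1f} ns per call")
```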
<p>Microbenchmark on calling builtin functions:</p>
<table border="1" class="docutils">
<colgroup>
<col width="53%" />
<col width="11%" />
<col width="36%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Benchmark</th>
<th class="head">3.5</th>
<th class="head">3.7</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>struct.pack("i", 1)</td>
<td>105 ns</td>
<td>77.6 ns: 1.36x faster (-26%)</td>
</tr>
<tr><td>getattr(1, "real")</td>
<td>79.4 ns</td>
<td>64.4 ns: 1.23x faster (-19%)</td>
</tr>
</tbody>
</table>
<p>Microbenchmark on calling methods of builtin types:</p>
<table border="1" class="docutils">
<colgroup>
<col width="53%" />
<col width="11%" />
<col width="36%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Benchmark</th>
<th class="head">3.5</th>
<th class="head">3.7</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>{1: 2}.get(7, None)</td>
<td>84.9 ns</td>
<td>61.6 ns: 1.38x faster (-27%)</td>
</tr>
<tr><td>collections.deque([None]).index(None)</td>
<td>116 ns</td>
<td>87.0 ns: 1.33x faster (-25%)</td>
</tr>
<tr><td>{1: 2}.get(1)</td>
<td>79.4 ns</td>
<td>59.6 ns: 1.33x faster (-25%)</td>
</tr>
<tr><td>"a".replace("x", "y")</td>
<td>134 ns</td>
<td>101 ns: 1.33x faster (-25%)</td>
</tr>
<tr><td>b"".decode()</td>
<td>71.5 ns</td>
<td>54.5 ns: 1.31x faster (-24%)</td>
</tr>
<tr><td>b"".decode("ascii")</td>
<td>99.1 ns</td>
<td>75.7 ns: 1.31x faster (-24%)</td>
</tr>
<tr><td>collections.deque.rotate(1)</td>
<td>106 ns</td>
<td>82.8 ns: 1.28x faster (-22%)</td>
</tr>
<tr><td>collections.deque.insert()</td>
<td>778 ns</td>
<td>608 ns: 1.28x faster (-22%)</td>
</tr>
<tr><td>b"".join((b"hello", b"world") * 100)</td>
<td>4.02 us</td>
<td>3.32 us: 1.21x faster (-17%)</td>
</tr>
<tr><td>[0].count(0)</td>
<td>53.9 ns</td>
<td>46.3 ns: 1.16x faster (-14%)</td>
</tr>
<tr><td>collections.deque.rotate()</td>
<td>72.6 ns</td>
<td>63.1 ns: 1.15x faster (-13%)</td>
</tr>
<tr><td>b"".join((b"hello", b"world"))</td>
<td>102 ns</td>
<td>89.8 ns: 1.13x faster (-12%)</td>
</tr>
</tbody>
</table>
<p>Microbenchmark on builtin functions calling Python functions (callbacks):</p>
<table border="1" class="docutils">
<colgroup>
<col width="53%" />
<col width="11%" />
<col width="36%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Benchmark</th>
<th class="head">3.5</th>
<th class="head">3.7</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>map(lambda x: x, list(range(1000)))</td>
<td>76.1 us</td>
<td>61.1 us: 1.25x faster (-20%)</td>
</tr>
<tr><td>sorted(list(range(1000)), key=lambda x: x)</td>
<td>90.2 us</td>
<td>78.2 us: 1.15x faster (-13%)</td>
</tr>
<tr><td>filter(lambda x: x, list(range(1000)))</td>
<td>81.8 us</td>
<td>73.4 us: 1.11x faster (-10%)</td>
</tr>
</tbody>
</table>
<p>Microbenchmark on calling slots (<tt class="docutils literal">__getitem__</tt>, <tt class="docutils literal">__init__</tt>, <tt class="docutils literal">__int__</tt>)
implemented in Python:</p>
<table border="1" class="docutils">
<colgroup>
<col width="53%" />
<col width="11%" />
<col width="36%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Benchmark</th>
<th class="head">3.5</th>
<th class="head">3.7</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>Python __getitem__: obj[0]</td>
<td>167 ns</td>
<td>87.0 ns: 1.92x faster (-48%)</td>
</tr>
<tr><td>call_pyinit_kw1</td>
<td>348 ns</td>
<td>240 ns: 1.45x faster (-31%)</td>
</tr>
<tr><td>call_pyinit_kw5</td>
<td>564 ns</td>
<td>401 ns: 1.41x faster (-29%)</td>
</tr>
<tr><td>call_pyinit_kw10</td>
<td>960 ns</td>
<td>734 ns: 1.31x faster (-24%)</td>
</tr>
<tr><td>Python __int__: int(obj)</td>
<td>241 ns</td>
<td>207 ns: 1.16x faster (-14%)</td>
</tr>
</tbody>
</table>
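<p>To give an idea of what these slot microbenchmarks exercise, here is a minimal sketch of a class implementing such slots in Python (the class and attribute names are illustrative, not the exact benchmark code):</p>

```python
class Wrapper:
    """Class whose slots are implemented in Python, not in C."""

    def __init__(self, value=0):
        self.value = value

    def __getitem__(self, index):
        # obj[0] dispatches through this Python-level slot
        return self.value

    def __int__(self):
        # int(obj) dispatches through this Python-level slot
        return self.value

obj = Wrapper(42)
print(obj[0])    # calls the Python __getitem__
print(int(obj))  # calls the Python __int__
```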
<p>Microbenchmark on calling a method descriptor (static method):</p>
<table border="1" class="docutils">
<colgroup>
<col width="53%" />
<col width="11%" />
<col width="36%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Benchmark</th>
<th class="head">3.5</th>
<th class="head">3.7</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>int.to_bytes(1, 4, "little")</td>
<td>177 ns</td>
<td>103 ns: 1.72x faster (-42%)</td>
</tr>
</tbody>
</table>
<p>Benchmarks were run on <tt class="docutils literal"><span class="pre">speed-python</span></tt>, the server used to run CPython benchmarks.</p>
The start of the FASTCALL project2017-02-16T17:00:00+01:002017-02-16T17:00:00+01:00Victor Stinnertag:vstinner.github.io,2017-02-16:/start-fastcall-project.html<div class="section" id="false-start">
<h2>False start</h2>
<p>In April 2016, I experimented with a Python change avoiding a temporary tuple
when calling functions. Builtin functions were between 20 and 50% faster!</p>
<p>Sadly, some benchmarks were randomly slower. It would take me four months to
understand why!</p>
</div>
<div class="section" id="work-on-benchmarks">
<h2>Work on benchmarks</h2>
<p>During four months, I worked on making …</p></div><div class="section" id="false-start">
<h2>False start</h2>
<p>In April 2016, I experimented with a Python change avoiding a temporary tuple
when calling functions. Builtin functions were between 20 and 50% faster!</p>
<p>Sadly, some benchmarks were randomly slower. It would take me four months to
understand why!</p>
</div>
<div class="section" id="work-on-benchmarks">
<h2>Work on benchmarks</h2>
<p>For four months, I worked on making benchmarks more stable. See my previous
blog posts:</p>
<ul class="simple">
<li><a class="reference external" href="https://vstinner.github.io/journey-to-stable-benchmark-system.html">My journey to stable benchmark, part 1 (system)</a> (May 21, 2016)</li>
<li><a class="reference external" href="https://vstinner.github.io/journey-to-stable-benchmark-deadcode.html">My journey to stable benchmark, part 2 (deadcode)</a> (May 22, 2016)</li>
<li><a class="reference external" href="https://vstinner.github.io/journey-to-stable-benchmark-average.html">My journey to stable benchmark, part 3 (average)</a> (May 23, 2016)</li>
<li><a class="reference external" href="https://vstinner.github.io/perf-visualize-system-noise-with-cpu-isolation.html">Visualize the system noise using perf and CPU isolation</a> (June 16, 2016)</li>
<li><a class="reference external" href="https://vstinner.github.io/intel-cpus.html">Intel CPUs: P-state, C-state, Turbo Boost, CPU frequency, etc.</a> (July 15, 2015)</li>
<li><a class="reference external" href="https://vstinner.github.io/intel-cpus-part2.html">Intel CPUs (part 2): Turbo Boost, temperature, frequency and Pstate C0 bug</a>
(September 23, 2016)</li>
<li><a class="reference external" href="https://vstinner.github.io/analysis-python-performance-issue.html">Analysis of a Python performance issue</a>
(November 19, 2016)</li>
<li>...</li>
</ul>
<p>See my talk <a class="reference external" href="https://fosdem.org/2017/schedule/event/python_stable_benchmark/">How to run a stable benchmark</a> that I gave
at FOSDEM 2017 (Brussels, Belgium): slides + video. I listed all the issues
that I had to solve to get reliable benchmarks.</p>
</div>
<div class="section" id="ask-for-permission">
<h2>Ask for permission</h2>
<p>In August 2016, I confirmed that my change didn't introduce any slowdown. So I
asked for
permission on the python-dev mailing list to start pushing changes: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2016-August/145793.html">New
calling convention to avoid temporarily tuples when calling functions</a>.</p>
<p>Guido van Rossum asked me for benchmark results:</p>
<blockquote>
But is there a performance improvement?</blockquote>
</div>
<div class="section" id="benchmark-results">
<h2>Benchmark results</h2>
<p>On micro-benchmarks, FASTCALL is much faster:</p>
<ul class="simple">
<li><tt class="docutils literal">getattr(1, "real")</tt> becomes <strong>44%</strong> faster</li>
<li><tt class="docutils literal">list(filter(lambda x: x, <span class="pre">list(range(1000))))</span></tt> becomes <strong>31%</strong> faster</li>
<li><tt class="docutils literal">namedtuple.attr</tt> (read the attribute) becomes <strong>23%</strong> faster</li>
<li>...</li>
</ul>
<p>Full results:</p>
<ul class="simple">
<li><a class="reference external" href="https://bugs.python.org/issue26814#msg263999">FASTCALL compared to Python 3.6 (default branch)</a></li>
<li><a class="reference external" href="https://bugs.python.org/issue26814#msg264003">2.7 / 3.4 / 3.5 / 3.6 / 3.6 FASTCALL comparison</a></li>
</ul>
<p>On the <a class="reference external" href="https://bugs.python.org/issue26814#msg266359">CPython benchmark suite</a>, I also saw many faster
benchmarks:</p>
<ul class="simple">
<li>pickle_list: <strong>1.29x faster</strong></li>
<li>etree_generate: <strong>1.22x faster</strong></li>
<li>pickle_dict: <strong>1.19x faster</strong></li>
<li>etree_process: <strong>1.16x faster</strong></li>
<li>mako_v2: <strong>1.13x faster</strong></li>
<li>telco: <strong>1.09x faster</strong></li>
<li>...</li>
</ul>
</div>
<div class="section" id="replies-to-my-email">
<h2>Replies to my email</h2>
<p>I got two very positive replies, so I understood that it was ok.</p>
<p>Brett Cannon:</p>
<blockquote>
I just wanted to say I'm excited about this and I'm glad someone is taking
advantage of what Argument Clinic allows for and what I know Larry had
initially hoped AC would make happen!</blockquote>
<p>Yury Selivanov:</p>
<blockquote>
Exceptional results, congrats Victor. Will be happy to help with code
review.</blockquote>
</div>
<div class="section" id="real-start">
<h2>Real start</h2>
<p>That's how the FASTCALL project began for real! I started to push a long series
of patches adding new private functions, and then modified code to call these
new functions.</p>
</div>
My contributions to CPython during 2016 Q42017-02-16T11:00:00+01:002017-02-16T11:00:00+01:00Victor Stinnertag:vstinner.github.io,2017-02-16:/contrib-cpython-2016q4.html<p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2016 Q4
(October, November, December):</p>
<pre class="literal-block">
hg log -r 'date("2016-10-01"):date("2016-12-31")' --no-merges -u Stinner
</pre>
<p>Statistics: 105 non-merge commits + 31 merge commits (total: 136 commits).</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q3.html">My contributions to CPython during 2016 Q3</a>. Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q1.html">My contributions to
CPython during 2017 Q1</a>.</p>
<p>Table of …</p><p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2016 Q4
(October, November, December):</p>
<pre class="literal-block">
hg log -r 'date("2016-10-01"):date("2016-12-31")' --no-merges -u Stinner
</pre>
<p>Statistics: 105 non-merge commits + 31 merge commits (total: 136 commits).</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q3.html">My contributions to CPython during 2016 Q3</a>. Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2017q1.html">My contributions to
CPython during 2017 Q1</a>.</p>
<p>Table of Contents:</p>
<ul class="simple">
<li>Python startup performance regression</li>
<li>Optimizations</li>
<li>Code placement and __attribute__((hot))</li>
<li>Interesting bug: duplicated filters when tests reload the warnings module</li>
<li>Contributions</li>
<li>regrtest</li>
<li>Other changes</li>
</ul>
<div class="section" id="python-startup-performance-regression">
<h2>Python startup performance regression</h2>
<div class="section" id="regresion">
<h3>Regression</h3>
<p>My work on tracking Python performance started to become useful :-) I
identified a performance slowdown in the <tt class="docutils literal">bm_python_startup</tt> benchmark
(average time to start Python).</p>
<p>Before September 2016, startup took around <strong>17.9 ms</strong>. On September 15,
after the <a class="reference external" href="https://vstinner.github.io/cpython-sprint-2016.html">CPython sprint</a>, it was
better: <strong>13.4 ms</strong>. But suddenly, on September 19, it became much worse:
<strong>22.8 ms</strong>. What happened?</p>
<p>Timeline of Python startup performance on speed.python.org:</p>
<a class="reference external image-reference" href="https://speed.python.org/timeline/#/?exe=5&ben=python_startup&env=1&revs=50&equid=off&quarts=on&extr=on"><img alt="Timeline of Python startup performance" src="https://vstinner.github.io/images/python_startup_regression.png" /></a>
<p>I looked at commits between September 15 and September 19, and I quickly
identified the commit of the <a class="reference external" href="http://bugs.python.org/issue28082">convert re flags to (much
friendlier) IntFlag constants (issue #28082)</a>. The <tt class="docutils literal">re</tt> module now imports the
<tt class="docutils literal">enum</tt> module to get a better representation for their flags. Example:</p>
<pre class="literal-block">
$ ./python
Python 3.7.0a0
>>> import re; re.M
<RegexFlag.MULTILINE: 8>
</pre>
</div>
<div class="section" id="revert">
<h3>Revert</h3>
<p>On November 7, I opened issue #28637 proposing to revert the commit to get
back better Python startup performance. The revert was approved by Guido van
Rossum, so I pushed it.</p>
</div>
<div class="section" id="better-fix">
<h3>Better fix</h3>
<p>I also noticed that the <tt class="docutils literal">re</tt> module is not imported by default if Python is
installed or if Python is run from its source code directory. The <tt class="docutils literal">re</tt> module
is only imported by default if Python is installed in a virtual environment.</p>
<p><strong>Serhiy Storchaka</strong> proposed a change to no longer import <tt class="docutils literal">re</tt> in the
<tt class="docutils literal">site</tt> module when Python runs in a virtual environment. Since the change was
simple and the benefit obvious (one less import at startup), it was quickly merged.</p>
</div>
<div class="section" id="restore-reverted-enum-change">
<h3>Restore reverted enum change</h3>
<p>Since using <tt class="docutils literal">enum</tt> in <tt class="docutils literal">re</tt> no longer has an impact on Python startup
performance by default, the <tt class="docutils literal">enum</tt> change was restored on November 14.</p>
<p>Sadly, the <tt class="docutils literal">enum</tt> change still has an impact on performance:
<tt class="docutils literal">re.compile()</tt> became 1.2x slower (312 ms => 376 ms: +20%).</p>
<a class="reference external image-reference" href="https://speed.python.org/timeline/#/?exe=5&ben=regex_compile&env=1&revs=50&equid=off&quarts=on&extr=on"><img alt="Timeline of re.compile() performance" src="https://vstinner.github.io/images/regex_compile_perf.png" /></a>
<p>I think that it's ok since it is very easy to use precompiled regular
expressions in an application: store and reuse the result of <tt class="docutils literal">re.compile()</tt>
instead of calling <tt class="docutils literal">re.match()</tt> directly, for example.</p>
</div>
</div>
<div class="section" id="optimizations">
<h2>Optimizations</h2>
<div class="section" id="fastcall">
<h3>FASTCALL</h3>
<p>Same as in 2016 Q3: I pushed a <em>lot</em> of changes for FASTCALL optimizations, but
I will write a dedicated article later.</p>
</div>
<div class="section" id="no-int-int-micro-optimization-thank-you">
<h3>No int+int micro-optimization, thank you</h3>
<p>After 2 years of benchmarking and a huge effort to make Python benchmarks more
reliable and stable, I decided to close the issue #21955 "ceval.c: implement
fast path for integers with a single digit" as REJECTED. It became clear to me
that such a micro-optimization has no effect on non-trivial code, but only on
specially crafted micro-benchmarks. I added a comment in the C code to prevent
further optimization attempts:</p>
<pre class="literal-block">
/* NOTE(haypo): Please don't try to micro-optimize int+int on
CPython using bytecode, it is simply worthless.
See http://bugs.python.org/issue21955 and
http://bugs.python.org/issue10044 for the discussion. In short,
no patch shown any impact on a realistic benchmark, only a minor
speedup on microbenchmarks. */
</pre>
</div>
<div class="section" id="timeit">
<h3>timeit</h3>
<p>I enhanced the <tt class="docutils literal">timeit</tt> benchmark module to make it more reliable (issue
#28240):</p>
<ul class="simple">
<li>Autorange now starts with a single loop iteration instead of 10. For example,
<tt class="docutils literal">python3 <span class="pre">-m</span> timeit <span class="pre">-s</span> 'import time' 'time.sleep(1)'</tt> now only takes 4
seconds instead of 40 seconds.</li>
<li>Repeat the benchmarks 5 times by default, instead of only 3, to make
benchmarks more reliable.</li>
<li>Remove <tt class="docutils literal"><span class="pre">-c/--clock</span></tt> and <tt class="docutils literal"><span class="pre">-t/--time</span></tt> command line options which were
deprecated since Python 3.3.</li>
<li>Add a <tt class="docutils literal">nsec</tt> (nanosecond) unit to format timings.</li>
<li>Enhance formatting of raw timings in verbose mode. Add newlines to the output
for readability.</li>
</ul>
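<p>These new defaults are also visible from the Python API; a sketch of the Python 3.7 behavior:</p>

```python
import timeit

timer = timeit.Timer("sum(range(100))")

# autorange() starts with a single loop iteration and multiplies the
# loop count until one run takes at least 0.2 seconds.
number, elapsed = timer.autorange()

# repeat() now runs the benchmark 5 times by default (it was 3 before).
results = timer.repeat(number=number)
print(len(results), min(results) / number)
```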
</div>
<div class="section" id="micro-optimizations">
<h3>Micro-optimizations</h3>
<p>I also pushed two minor micro-optimizations:</p>
<ul class="simple">
<li>Use <tt class="docutils literal">PyThreadState_GET()</tt> macro in performance critical code.
<tt class="docutils literal">_PyThreadState_UncheckedGet()</tt> calls are not inlined as expected, even
when using <tt class="docutils literal">gcc <span class="pre">-O3</span></tt>.</li>
<li>Modify <tt class="docutils literal">type_setattro()</tt> to call directly
<tt class="docutils literal">_PyObject_GenericSetAttrWithDict()</tt> instead of
<tt class="docutils literal">PyObject_GenericSetAttr()</tt>. <tt class="docutils literal">PyObject_GenericSetAttr()</tt> is a thin
wrapper to <tt class="docutils literal">_PyObject_GenericSetAttrWithDict()</tt>.</li>
</ul>
</div>
</div>
<div class="section" id="code-placement-and-attribute-hot">
<h2>Code placement and __attribute__((hot))</h2>
<p>On <a class="reference external" href="https://speed.python.org/">speed.python.org</a>, I still noticed random
performance slowdowns on the evil <tt class="docutils literal">call_simple</tt> benchmark. This benchmark is
a <em>micro</em>-benchmark measuring the performance of a single Python function call:
it is CPU-bound and very small, and so very sensitive to CPU caches. I was
bitten again by a significant performance slowdown caused only by code placement.</p>
<p>It wasn't possible to use <em>Profiled Guided Optimization</em> (PGO) on the benchmark
runner, since it used Ubuntu 14.04 and GCC crashed with an "internal error".</p>
<p>So I tried something different: mark "hot functions" with
<tt class="docutils literal"><span class="pre">__attribute__((hot))</span></tt>. It's a GCC and Clang attribute helping code
placement: "hot functions" are moved to a dedicated ELF section and so are
closer in memory, and the compiler tries to optimize these functions even more.</p>
<p>The following functions are considered as hot according to statistics collected
by Linux <tt class="docutils literal">perf record</tt> and <tt class="docutils literal">perf report</tt> commands:</p>
<ul class="simple">
<li>_PyEval_EvalFrameDefault()</li>
<li>call_function()</li>
<li>_PyFunction_FastCall()</li>
<li>PyFrame_New()</li>
<li>frame_dealloc()</li>
<li>PyErr_Occurred()</li>
</ul>
<p>I added a <tt class="docutils literal">_Py_HOT_FUNCTION</tt> macro which uses <tt class="docutils literal"><span class="pre">__attribute__((hot))</span></tt> and
used <tt class="docutils literal">_Py_HOT_FUNCTION</tt> on these functions (issue #28618).</p>
<p>Read also my previous blog article <a class="reference external" href="https://vstinner.github.io/analysis-python-performance-issue.html">Analysis of a Python performance issue</a> for a deeper analysis.</p>
<p>Sadly, after I wrote this blog post and after more analysis of the <tt class="docutils literal">call_simple</tt>
benchmark results, I saw that <tt class="docutils literal"><span class="pre">__attribute__((hot))</span></tt> wasn't enough. I still
had random major performance slowdowns.</p>
<p>I decided to upgrade the performance runner to Ubuntu 16.04. It was risky
because nobody has access to the physical server, so it might have taken weeks to
repair it if I had made a mistake. Fortunately, the upgrade went smoothly and I was
able to run all benchmarks using PGO again. As expected, using PGO+LTO, the
benchmark results are more stable!</p>
</div>
<div class="section" id="interesting-bug-duplicated-filters-when-tests-reload-the-warnings-module">
<h2>Interesting bug: duplicated filters when tests reload the warnings module</h2>
<p>Python test suite has an old bug: the issue #18383 opened in July 2013.
Sometimes, the test suite emits the following warning:</p>
<pre class="literal-block">
[247/375] test_warnings
Warning -- warnings.filters was modified by test_warnings
</pre>
<p>Since it's only a warning and it only occurs in the Python test suite, it was
low priority and took 3 years to fix! It also took time to find the right
design to fix the root cause.</p>
<div class="section" id="duplicated-filters">
<h3>Duplicated filters</h3>
<p>test_warnings imports the <tt class="docutils literal">warnings</tt> module 3 times:</p>
<pre class="literal-block">
import warnings as original_warnings # Python
py_warnings = support.import_fresh_module('warnings', blocked=['_warnings']) # Python
c_warnings = support.import_fresh_module('warnings', fresh=['_warnings']) # C
</pre>
<p>The Python <tt class="docutils literal">warnings</tt> module (<tt class="docutils literal">Lib/warnings.py</tt>) installs warning filters
when the module is loaded:</p>
<pre class="literal-block">
_processoptions(sys.warnoptions)
</pre>
<p>where <tt class="docutils literal">sys.warnoptions</tt> contains the value of the <tt class="docutils literal"><span class="pre">-W</span></tt> command line option.</p>
<p>If the Python module is loaded more than once, filters are duplicated.</p>
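<p>The end result of the fixes described below can be checked directly: since
Python 3.6, adding the very same filter twice no longer duplicates it (a
sketch, assuming Python 3.6 or newer):</p>

```python
import warnings

with warnings.catch_warnings():
    warnings.resetwarnings()
    warnings.simplefilter("ignore", DeprecationWarning)
    warnings.simplefilter("ignore", DeprecationWarning)  # identical filter
    # The duplicate is removed before the filter is re-inserted at the
    # front of the list, so only one entry remains.
    print(len(warnings.filters))
```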
</div>
<div class="section" id="first-fix-use-the-right-module">
<h3>First fix: use the right module</h3>
<p>I pushed a first fix in September 2015:</p>
<p>Fix test_warnings: don't modify warnings.filters. BaseTest now ensures that
unittest.TestCase.assertWarns() uses the same warnings module as
warnings.catch_warnings(). Otherwise, warnings.catch_warnings() would be unable
to remove the added filter.</p>
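<p>The behaviour the fix relies on can be sketched in a few lines:
<tt class="docutils literal">warnings.catch_warnings()</tt> can only undo filters added through the same
warnings module object that it saved on entry:</p>

```python
import warnings

saved = list(warnings.filters)
with warnings.catch_warnings():
    # The filter is added to the same module that catch_warnings() saved...
    warnings.simplefilter("error", UserWarning)
# ...so it is removed again when the block exits and filters are restored.
assert warnings.filters == saved
```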
</div>
<div class="section" id="second-fix-don-t-add-duplicated-filters">
<h3>Second fix: don't add duplicated filters</h3>
<p>Issue #18383: the first patch was proposed by <strong>Florent Xicluna</strong> in 2013: save
the length of filters, and remove newly added filters after the <tt class="docutils literal">warnings</tt>
modules are reloaded by <tt class="docutils literal">test_warnings</tt>. In December 2014, <strong>Serhiy Storchaka</strong>
reviewed the patch: he didn't like this <em>workaround</em>, he wanted to fix the
<em>root cause</em>.</p>
<p>In March 2015, <strong>Alex Shkop</strong> proposed a patch which avoids adding duplicate
filters.</p>
<p>In September 2015, <strong>Martin Panter</strong> proposed to try to save/restore filters on
the C warnings module. I proposed something similar in issue #26742. But
this solution has the same flaw as Florent's idea: it's only a workaround.</p>
<p>Martin also proposed adding a private flag indicating that filters were already
set, to avoid adding the same filters again.</p>
<p>Finally, in May 2016, Martin updated Alex's patch avoiding duplicate filters
and pushed it.</p>
</div>
<div class="section" id="third-fix">
<h3>Third fix</h3>
<p>The filter comparison wasn't perfect: a filter can contain a precompiled
regular expression, and these objects didn't implement comparison.</p>
<p>In November 2016, I opened issue #28727 proposing to implement rich
comparison for <tt class="docutils literal">_sre.SRE_Pattern</tt>.</p>
<p>My first patch didn't implement <tt class="docutils literal">hash()</tt> and had various bugs. It took me
almost one week and 6 versions to write complete unit tests and handle all
cases: support bytes and Unicode, and handle regular expression flags.</p>
<p><strong>Serhiy Storchaka</strong> found bugs and helped me write the implementation.</p>
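<p>The result shipped in Python 3.7: compiled patterns implement equality and
<tt class="docutils literal">hash()</tt>. A quick check, assuming Python 3.7 or newer; <tt class="docutils literal">re.purge()</tt> is used
to bypass the internal pattern cache, which would otherwise return the same
object twice:</p>

```python
import re

p1 = re.compile("ab", re.IGNORECASE)
re.purge()  # drop the internal cache so the next compile builds a new object
p2 = re.compile("ab", re.IGNORECASE)

assert p1 is not p2                       # two distinct pattern objects...
assert p1 == p2 and hash(p1) == hash(p2)  # ...but equal, with matching hashes
assert p1 != re.compile("ab")             # flags take part in the comparison
```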
</div>
</div>
<div class="section" id="contributions">
<h2>Contributions</h2>
<p>As usual, I reviewed and pushed changes written by other contributors:</p>
<ul>
<li><p class="first">Issue #27896: Allow passing sphinx options to Doc/Makefile. Patch written by
<strong>Julien Palard</strong>.</p>
</li>
<li><p class="first">Issue #28476: Reuse math.factorial() in test_random.
Patch written by <strong>Francisco Couzo</strong>.</p>
</li>
<li><p class="first">Issue #28479: Fix reST syntax in windows.rst. Patch written by <strong>Julien Palard</strong>.</p>
</li>
<li><p class="first">Issue #26273: Add new constants: <tt class="docutils literal">socket.TCP_CONGESTION</tt> (Linux 2.6.13) and
<tt class="docutils literal">socket.TCP_USER_TIMEOUT</tt> (Linux 2.6.37).
Patch written by <strong>Omar Sandoval</strong>.</p>
</li>
<li><p class="first">Issue #28979: Fix What's New in Python 3.6: compact dict is not faster, but
only more compact. Patch written by <strong>Brendan Donegan</strong>.</p>
</li>
<li><p class="first">Issue #28147: Fix a memory leak in split-table dictionaries: <tt class="docutils literal">setattr()</tt>
must not convert combined table into split table.
Patch written by <strong>INADA Naoki</strong>.</p>
</li>
<li><p class="first">Issue #29109: Enhance tracemalloc documentation:</p>
<ul class="simple">
<li>Wrong parameter name, 'group_by' instead of 'key_type'</li>
<li>Don't round up numbers when explaining the examples. If they exactly match
what can be read in the script output, it is easier to understand
(4.8 MiB vs 4855 KiB)</li>
<li>Fix incorrect method link that was pointing to another module</li>
</ul>
<p>Patch written by <strong>Loic Pefferkorn</strong>.</p>
</li>
</ul>
</div>
<div class="section" id="regrtest">
<h2>regrtest</h2>
<ul class="simple">
<li>regrtest <tt class="docutils literal"><span class="pre">--fromfile</span></tt> now accepts a list of filenames, not only a list of
<em>test</em> names.</li>
<li>Issue #28409: regrtest: fix the parser of command line arguments.</li>
</ul>
</div>
<div class="section" id="other-changes">
<h2>Other changes</h2>
<ul class="simple">
<li>Fix <tt class="docutils literal">_Py_normalize_encoding()</tt> function: it was not exactly the same as
Python's <tt class="docutils literal">encodings.normalize_encoding()</tt>: the C function now also converts
to lowercase.</li>
<li>Issue #28256: Cleanup <tt class="docutils literal">_math.c</tt>: only define fallback implementations when
needed. It avoids producing dead code when the system provides the required math
functions, and so enhances the code coverage.</li>
<li>_csv: use <tt class="docutils literal">_PyLong_AsInt()</tt> to simplify the code, the function checks for
the limits of the C <tt class="docutils literal">int</tt> type.</li>
<li>Issue #28544: Fix <tt class="docutils literal">_asynciomodule.c</tt> on Windows. <tt class="docutils literal">PyType_Ready()</tt> sets
the reference to <tt class="docutils literal">&PyType_Type</tt>. <tt class="docutils literal">&PyType_Type</tt> address cannot be
resolved at compilation time (not on Windows?).</li>
<li>Issue #28082: Add basic unit tests on the new <tt class="docutils literal">re</tt> enums.</li>
<li>Issue #28691: Fix <tt class="docutils literal">warn_invalid_escape_sequence()</tt>: handle correctly
<tt class="docutils literal">DeprecationWarning</tt> raised as an exception. First clear the current
exception to replace the <tt class="docutils literal">DeprecationWarning</tt> exception with a
<tt class="docutils literal">SyntaxError</tt> exception. Unit test written by <strong>Serhiy Storchaka</strong>.</li>
<li>Issue #28023: Fix python-gdb.py on old GDB versions. Replace
<tt class="docutils literal"><span class="pre">int(value.address)+offset</span></tt> with <tt class="docutils literal">value.cast(unsigned <span class="pre">char*)+offset</span></tt>.
It seems like <tt class="docutils literal">int(value.address)</tt> fails on old GDB versions.</li>
<li>Issue #28765: <tt class="docutils literal">_sre.compile()</tt> now checks the type of <tt class="docutils literal">groupindex</tt> and
<tt class="docutils literal">indexgroup</tt> arguments. <tt class="docutils literal">groupindex</tt> must be a dictionary and <tt class="docutils literal">indexgroup</tt>
must be a tuple. Previously, <tt class="docutils literal">indexgroup</tt> was a list. Use a tuple to
reduce the memory usage.</li>
<li>Issue #28782: Fix a bug in the implementation of <tt class="docutils literal">yield from</tt>
(fix <tt class="docutils literal">_PyGen_yf()</tt> function). Fix the test checking if the next instruction
is <tt class="docutils literal">YIELD_FROM</tt>. Regression introduced by the new "WordCode" bytecode
(issue #26647). Fix reviewed by <strong>Serhiy Storchaka</strong> and <strong>Yury Selivanov</strong>.</li>
<li>Issue #28792: Remove aliases from <tt class="docutils literal">_bisect</tt>. Remove aliases from the C
module. Always implement <tt class="docutils literal">bisect()</tt> and <tt class="docutils literal">insort()</tt> aliases in
<tt class="docutils literal">bisect.py</tt>. Remove also the <tt class="docutils literal"># backward compatibility</tt> comment: there
is no plan to deprecate nor remove these aliases. When keys are equal, it
makes sense to use <tt class="docutils literal">bisect.bisect()</tt> and <tt class="docutils literal">bisect.insort()</tt>.</li>
<li>Fix a <tt class="docutils literal">ResourceWarning</tt> in <tt class="docutils literal">generate_opcode_h.py</tt>. Use a context manager
to close the Python file. Replace also <tt class="docutils literal">open()</tt> with <tt class="docutils literal">tokenize.open()</tt> to
handle coding cookie of <tt class="docutils literal">Lib/opcode.py</tt>.</li>
<li>Issue #28740: Add <tt class="docutils literal">sys.getandroidapilevel()</tt> function: return the build
time API version of Android as an integer. Function only available on
Android. The availability of this function can be tested to check if Python
is running on Android.</li>
<li>Issue #28152: Fix <tt class="docutils literal"><span class="pre">-Wunreachable-code</span></tt> warnings on Clang.<ul>
<li>Don't declare dead code when the code is compiled with Clang.</li>
<li>Replace C <tt class="docutils literal">if()</tt> with precompiler <tt class="docutils literal">#if</tt> to fix a warning on dead code
when using Clang.</li>
<li>Replace <tt class="docutils literal">0</tt> with <tt class="docutils literal">(0)</tt> to ignore a compiler warning about dead code on
<tt class="docutils literal"><span class="pre">((int)(SEM_VALUE_MAX)</span> < 0)</tt>: <tt class="docutils literal">SEM_VALUE_MAX</tt> is not negative on Linux.</li>
</ul>
</li>
<li>Issue #28835: Fix a regression introduced in <tt class="docutils literal">warnings.catch_warnings()</tt>:
call <tt class="docutils literal">warnings.showwarning()</tt> if it was overridden inside the context
manager.</li>
<li>Issue #28915: Replace <tt class="docutils literal">int</tt> with <tt class="docutils literal">Py_ssize_t</tt> in <tt class="docutils literal">modsupport</tt>.
<tt class="docutils literal">Py_ssize_t</tt> type is better for indexes. The compiler might emit more
efficient code for <tt class="docutils literal">i++</tt>. <tt class="docutils literal">Py_ssize_t</tt> is the type of a PyTuple index for
example. Replace also <tt class="docutils literal">int endchar</tt> with <tt class="docutils literal">char endchar</tt>.</li>
<li>Initialize variables to fix compiler warnings. Warnings seen on the "AMD64
Debian PGO 3.x" buildbot. The warnings are false positives, but variable
initialization should not harm performance.</li>
<li>Remove useless variable initialization. Don't initialize variables which are
not used before they are assigned.</li>
<li>Issue #28838: Cleanup <tt class="docutils literal">abstract.h</tt>. Rewrite all comments to use the same style
than other Python header files: comment functions <em>before</em> their declaration,
no newline between the comment and the declaration. Reformat some comments,
add newlines, to make them easier to read. Quote argument like 'arg' to
mention an argument in a comment.</li>
<li>Issue #28838: <tt class="docutils literal">abstract.h</tt>: remove long outdated comment. The documentation
of the Python C API is more complete and more up to date than this old
comment. Removal suggested by <strong>Antoine Pitrou</strong>.</li>
<li>python-gdb.py: catch <tt class="docutils literal">gdb.error</tt> on <tt class="docutils literal">gdb.selected_frame()</tt>.</li>
<li>Issue #28383: the <tt class="docutils literal">__hash__</tt> documentation recommended a naive XOR to
combine hashes, but this is suboptimal. Update the documentation to suggest
reusing the <tt class="docutils literal">hash()</tt> function on a tuple, with an example.</li>
</ul>
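<p>The recommended pattern looks like this (a minimal sketch): delegate to
<tt class="docutils literal">hash()</tt> on a tuple of the fields that <tt class="docutils literal">__eq__()</tt> compares. A naive XOR
would give <tt class="docutils literal">Point(1, 2)</tt> and <tt class="docutils literal">Point(2, 1)</tt> the same hash, whereas the tuple
hash is sensitive to order:</p>

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Delegate to the tuple hash instead of XORing field hashes:
        # hash((x, y)) != hash((y, x)) in general, unlike x ^ y.
        return hash((self.x, self.y))

print(len({Point(1, 2), Point(1, 2), Point(2, 1)}))  # 2
```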
</div>
<h1>My contributions to CPython during 2016 Q3</h1>
<p>Victor Stinner, 2017-02-14</p>
<p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2016 Q3
(july, august, september):</p>
<pre class="literal-block">
hg log -r 'date("2016-07-01"):date("2016-09-30")' --no-merges -u Stinner
</pre>
<p>Statistics: 161 non-merge commits + 29 merge commits (total: 190 commits).</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q2.html">My contributions to CPython during 2016 Q2</a>. Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q4.html">My contributions to
CPython during 2016 Q4</a>.</p>
<p>Table of Contents:</p>
<ul class="simple">
<li>Two new core developers</li>
<li>CPython sprint, September, in California</li>
<li>PEP 524: Make os.urandom() blocking on Linux</li>
<li>PEP 509: private dictionary version</li>
<li>FASTCALL: optimization avoiding temporary tuple to call functions</li>
<li>More efficient CALL_FUNCTION bytecode</li>
<li>Work on optimization</li>
<li>Interesting bug: hidden resource warnings</li>
<li>Contributions</li>
<li>Bugfixes</li>
<li>regrtest changes</li>
<li>Tests changes</li>
<li>Other changes</li>
</ul>
<div class="section" id="two-new-core-developers">
<h2>Two new core developers</h2>
<p>Two new core developers are the result of a productive third quarter of 2016.</p>
<p>On September 25, 2016, Yury Selivanov proposed to give <a class="reference external" href="https://mail.python.org/pipermail/python-committers/2016-September/004013.html">commit privileges for
INADA Naoki</a>.
Naoki became a core developer the day after!</p>
<p>On November 14, 2016, I proposed to <a class="reference external" href="https://mail.python.org/pipermail/python-committers/2016-November/004045.html">promote Xiang Zhang as a core developer</a>.
One week later, he also became a core developer! I mentored him for one
month, and later let him push changes directly.</p>
<p>Most Python core developers are men coming from North America and Europe.
INADA Naoki comes from Japan and Xiang Zhang comes from China: with more core
developers from Asia, we increased the diversity of Python core developers!</p>
</div>
<div class="section" id="cpython-sprint-september-in-california">
<h2>CPython sprint, September, in California</h2>
<p>I was invited to my first CPython sprint in September! Five days, September
5-9, at the Instagram office in California, USA. I reviewed a lot of changes and
pushed many new features! Read my previous blog post: <a class="reference external" href="https://vstinner.github.io/cpython-sprint-2016.html">CPython sprint,
september 2016</a>.</p>
</div>
<div class="section" id="pep-524-make-os-urandom-blocking-on-linux">
<h2>PEP 524: Make os.urandom() blocking on Linux</h2>
<p>I pushed the implementation of my PEP 524: read my previous blog post: <a class="reference external" href="https://vstinner.github.io/pep-524-os-urandom-blocking.html">PEP 524:
os.urandom() now blocks on Linux in Python 3.6</a>.</p>
</div>
<div class="section" id="pep-509-private-dictionary-version">
<h2>PEP 509: private dictionary version</h2>
<p>Another enhancement from my <a class="reference external" href="http://faster-cpython.readthedocs.io/fat_python.html">FAT Python</a> project: my <a class="reference external" href="https://www.python.org/dev/peps/pep-0509/">PEP 509:
Add a private version to dict</a> was
approved at the CPython sprint by Guido van Rossum.</p>
<p>The dictionary version is used by FAT Python to check quickly if a variable was
modified in a Python namespace. Technically, a Python namespace is a regular
dictionary.</p>
<p>Using the feedback from the python-ideas mailing list on the first version of
my PEP, I made further changes:</p>
<ul class="simple">
<li>Use 64-bit unsigned integers on 32-bit system: "A risk of an integer overflow
every 584 years is acceptable." Using 32-bit, an overflow occurs every 4
seconds!</li>
<li>Don't expose the version at the Python level, to prevent users from writing
optimizations based on it in Python. Reading the dictionary version in Python
is as slow as a dictionary lookup, whereas the version is usually used to
avoid a "slow" dictionary lookup. The version is only accessible at the C
level.</li>
</ul>
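<p>The quoted numbers follow from simple arithmetic, assuming a deliberately
pessimistic rate of one dictionary modification per nanosecond:</p>

```python
rate = 10**9  # hypothetical worst case: one dict modification per nanosecond

seconds_to_overflow_32 = 2**32 / rate
years_to_overflow_64 = 2**64 / rate / (365.25 * 24 * 3600)

print(seconds_to_overflow_32)  # roughly 4.3 seconds
print(years_to_overflow_64)    # roughly 584 years
```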
<p>While my experimental FAT Python static optimizer didn't convince Guido, Yury
Selivanov wrote yet another cache for global variables using the dictionary
version: <a class="reference external" href="http://bugs.python.org/issue28158">Implement LOAD_GLOBAL opcode cache</a> (sadly, not merged yet).</p>
<p>I added the private version to the builtin dict type with the issue #26058. The
global dictionary version is incremented at each dictionary creation and at
each dictionary change, and each dictionary has its own version as well.</p>
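<p>The semantics can be sketched in pure Python. This is only a toy model: the
real version tag lives in the C structure, is invisible from Python code, and
the class and attribute names below are made up for illustration:</p>

```python
class VersionedDict(dict):
    """Toy model of PEP 509: a global counter bumped on creation and change.

    Only __setitem__ and __delitem__ are instrumented here; the real C
    implementation covers every mutating operation.
    """
    _global_version = 0

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._bump()

    def _bump(self):
        VersionedDict._global_version += 1
        self.version = VersionedDict._global_version

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._bump()

    def __delitem__(self, key):
        super().__delitem__(key)
        self._bump()

ns = VersionedDict(x=1)
cached = (ns.version, ns["x"])  # guard: remember the version with the value
ns["x"] = 2                     # any change bumps the version
assert ns.version != cached[0]  # the cheap guard detects the change
```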
</div>
<div class="section" id="fastcall-optimization-avoiding-temporary-tuple-to-call-functions">
<h2>FASTCALL: optimization avoiding temporary tuple to call functions</h2>
<p>Thanks to my work on making Python benchmarks more stable, I confirmed that my
FASTCALL patches don't introduce performance regressions, and make Python
faster in some specific cases.</p>
<p>I started to push FASTCALL changes. It will take me 6 months to push most
of the changes, to fully enable FASTCALL "everywhere" in the code base and to
finish the implementation.</p>
<p>Following blog posts will describe FASTCALL changes, its history and
performance enhancements. Spoiler: Python 3.6 is fast!</p>
</div>
<div class="section" id="more-efficient-call-function-bytecode">
<h2>More efficient CALL_FUNCTION bytecode</h2>
<p>I reviewed and merged Demur Rumed's patch to make the CALL_FUNCTION opcodes
more efficient. Demur implemented the design proposed by Serhiy Storchaka.
Serhiy Storchaka also reviewed the implementation with me.</p>
<p>Issue #27213: Rework CALL_FUNCTION* opcodes to produce shorter and more
efficient bytecode:</p>
<ul class="simple">
<li><tt class="docutils literal">CALL_FUNCTION</tt> now only accepts positional arguments</li>
<li><tt class="docutils literal">CALL_FUNCTION_KW</tt> accepts positional arguments and keyword arguments,
keys of keyword arguments are packed into a constant tuple.</li>
<li><tt class="docutils literal">CALL_FUNCTION_EX</tt> is the most generic opcode: it expects a tuple and a
dict for positional and keyword arguments.</li>
</ul>
<p><tt class="docutils literal">CALL_FUNCTION_VAR</tt> and <tt class="docutils literal">CALL_FUNCTION_VAR_KW</tt> opcodes have been removed.</p>
<p>Demur Rumed also implemented "Wordcode", a new bytecode format using fixed
units of 16-bit: 8-bit opcode with 8-bit argument. Wordcode was merged in May
2016, see <a class="reference external" href="http://bugs.python.org/issue26647">issue #26647: ceval: use Wordcode, 16-bit bytecode</a>.</p>
<p>All instructions now have an argument: opcodes without an argument use the
argument <tt class="docutils literal">0</tt>. This allowed removing the following conditional code in the very
hot code of <tt class="docutils literal">Python/ceval.c</tt>:</p>
<pre class="literal-block">
if (HAS_ARG(opcode))
oparg = NEXTARG();
</pre>
<p>The bytecode is now fetched using 16-bit words, instead of loading one or two
8-bit words per instruction.</p>
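<p>The fixed-width layout is easy to see with the <tt class="docutils literal">dis</tt> module on any
Python 3.6 or newer:</p>

```python
import dis

def f(x):
    return x + 1

code = f.__code__.co_code
assert len(code) % 2 == 0  # every instruction is a 16-bit unit
for offset in range(0, len(code), 2):
    # One opcode byte followed by one oparg byte (0 when the opcode takes none)
    print(dis.opname[code[offset]], code[offset + 1])
```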
</div>
<div class="section" id="work-on-optimization">
<h2>Work on optimization</h2>
<p>I continued with work on the <a class="reference external" href="https://github.com/python/performance">performance</a> Python benchmark suite. The suite
works on CPython and PyPy, but it's maybe not fine tuned for PyPy yet.</p>
<ul class="simple">
<li>Issue #27938: Add a fast-path for us-ascii encoding</li>
<li>Issue #15369: Remove the (old version of) pybench microbenchmark. Please use
the new "performance" benchmark suite which includes a more recent version of
pybench.</li>
<li>Issue #15369. Remove old and unreliable pystone microbenchmark. Please use
the new "performance" benchmark suite which is much more reliable.</li>
</ul>
</div>
<div class="section" id="interesting-bug-hidden-resource-warnings">
<h2>Interesting bug: hidden resource warnings</h2>
<p>On 2016-08-22, I started to investigate why "Warning -- xxx was modified by
test_xxx" warnings were not logged on some buildbots (issue #27829).</p>
<p>I modified the code logging the warning to immediately flush stderr:
<tt class="docutils literal"><span class="pre">print(...,</span> flush=True)</tt>.</p>
<p>19 days later, I tried to remove a quiet flag <tt class="docutils literal"><span class="pre">-q</span></tt> on the Windows build...
but it was a mistake, this flag doesn't mean quiet in the modified batch script
:-)</p>
<p>13 days later, I finally understood that the <tt class="docutils literal"><span class="pre">-W</span></tt> option of regrtest was
eating stderr if the test passed but the environment was modified.</p>
<p>I fixed regrtest to log stderr in all cases, except when the test passes! It should
now be easier to fix "environment changed" warnings emitted by regrtest.</p>
</div>
<div class="section" id="contributions">
<h2>Contributions</h2>
<p>As usual, I reviewed and pushed changes written by other contributors:</p>
<ul class="simple">
<li>Issue #27350: I reviewed and pushed the implementation of compact
dictionaries preserving insertion order. This resulted in dictionaries using
20% to 25% less memory when compared to Python 3.5. The implementation was
written by <strong>INADA Naoki</strong>, based on the PyPy implementation, with a design
by Raymond Hettinger.</li>
<li>"make tags": remove the <tt class="docutils literal"><span class="pre">-t</span></tt> option of <tt class="docutils literal">ctags</tt>. The option was kept for
backward compatibility, but it was completely removed recently. Patch written
by <strong>Stéphane Wirtel</strong>.</li>
<li>Issue #27558: Fix a <tt class="docutils literal">SystemError</tt> in the implementation of "raise" statement.
In a brand new thread, raise a RuntimeError since there is no active
exception to reraise. Patch written by <strong>Xiang Zhang</strong>.</li>
<li>Issue #28120: Fix <tt class="docutils literal">dict.pop()</tt> for a split dictionary when trying to remove a
"pending key": a key not yet inserted in the split table. Patch by <strong>Xiang
Zhang</strong>.</li>
</ul>
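<p>A visible side effect of the compact dictionary implementation is that
dictionaries preserve insertion order (an implementation detail in Python 3.6,
and a language guarantee since Python 3.7):</p>

```python
d = {}
d["banana"] = 1
d["apple"] = 2
d["cherry"] = 3
# Iteration follows insertion order, not hash order
print(list(d))  # ['banana', 'apple', 'cherry']
```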
</div>
<div class="section" id="bugfixes">
<h2>Bugfixes</h2>
<ul>
<li><p class="first">socket: Fix <tt class="docutils literal">internal_select()</tt> function. Bug found by <strong>Pavel Belikov</strong>
("Fragment N1"): <a class="reference external" href="http://www.viva64.com/en/b/0414/#ID0ECDAE">http://www.viva64.com/en/b/0414/#ID0ECDAE</a></p>
</li>
<li><p class="first">socket: use INVALID_SOCKET.</p>
<ul class="simple">
<li>Replace <tt class="docutils literal">fd = <span class="pre">-1</span></tt> with <tt class="docutils literal">fd = INVALID_SOCKET</tt></li>
<li>Replace <tt class="docutils literal">fd < 0</tt> with <tt class="docutils literal">fd == INVALID_SOCKET</tt>:
SOCKET_T is unsigned on Windows</li>
</ul>
<p>Bug found by Pavel Belikov ("Fragment N1"):
<a class="reference external" href="http://www.viva64.com/en/b/0414/#ID0ECDAE">http://www.viva64.com/en/b/0414/#ID0ECDAE</a></p>
</li>
<li><p class="first">Issue #11048: ctypes, fix <tt class="docutils literal">CThunkObject_new()</tt></p>
<ul class="simple">
<li>Initialize restype and flags fields to fix a crash when Python runs on a
read-only file system</li>
<li>Use <tt class="docutils literal">Py_ssize_t</tt> type rather than <tt class="docutils literal">int</tt> for the <tt class="docutils literal">i</tt> iterator variable</li>
<li>Reorder assignments to be able to more easily check if all fields are
initialized</li>
</ul>
<p>Initial patch written by <strong>Marcin Bachry</strong>.</p>
</li>
<li><p class="first">Issue #27744: socket: Fix memory leak in <tt class="docutils literal">sendmsg()</tt> and
<tt class="docutils literal">sendmsg_afalg()</tt>. Release <tt class="docutils literal">msg.msg_iov</tt> memory block. Release memory
on <tt class="docutils literal">PyMem_Malloc(controllen)</tt> failure</p>
</li>
<li><p class="first">Issue #27866: ssl: Fix refleak in <tt class="docutils literal">cipher_to_dict()</tt>.</p>
</li>
<li><p class="first">Issue #28077: Fix dict type, <tt class="docutils literal">find_empty_slot()</tt> only supports combined
dictionaries.</p>
</li>
<li><p class="first">Issue #28200: Fix memory leak in <tt class="docutils literal">path_converter()</tt>. Replace
<tt class="docutils literal">PyUnicode_AsWideCharString()</tt> with <tt class="docutils literal">PyUnicode_AsUnicodeAndSize()</tt>.</p>
</li>
<li><p class="first">Issue #27955: Catch permission error (<tt class="docutils literal">EPERM</tt>) in <tt class="docutils literal">py_getrandom()</tt>.
Fallback on reading from the <tt class="docutils literal">/dev/urandom</tt> device when the <tt class="docutils literal">getrandom()</tt>
syscall fails with <tt class="docutils literal">EPERM</tt>, for example if blocked by SECCOMP.</p>
</li>
<li><p class="first">Issue #27778: Fix a memory leak in <tt class="docutils literal">os.getrandom()</tt> when the
<tt class="docutils literal">getrandom()</tt> is interrupted by a signal and a signal handler raises a
Python exception.</p>
</li>
<li><p class="first">Issue #28233: Fix <tt class="docutils literal">PyUnicode_FromFormatV()</tt> error handling. Fix a memory
leak if the format string contains a non-ASCII character: destroy the unicode
writer.</p>
</li>
</ul>
</div>
<div class="section" id="regrtest-changes">
<h2>regrtest changes</h2>
<ul class="simple">
<li>regrtest: rename <tt class="docutils literal"><span class="pre">--slow</span></tt> option to <tt class="docutils literal"><span class="pre">--slowest</span></tt> (to get same option name
than the <tt class="docutils literal">testr</tt> tool). Thanks to optparse, --slow syntax still works ;-)
Add --slowest option to buildbots. Display the top 10 slowest tests.</li>
<li>regrtest: nicer output for durations. Use milliseconds and minutes units, not
only seconds.</li>
<li>regrtest: Add a summary of the tests at the end of tests output:
"Tests result: xxx". It was sometimes hard to check quickly if tests
succeeded, failed or something bad happened.</li>
<li>regrtest: accept options after test names. For example, <tt class="docutils literal">./python <span class="pre">-m</span> test
test_os <span class="pre">-v</span></tt> runs <tt class="docutils literal">test_os</tt> in verbose mode. Before, regrtest tried to run
a test called "-v"!</li>
<li>Issue #28195: Fix <tt class="docutils literal">test_huntrleaks_fd_leak()</tt> of test_regrtest. Don't expect
the fd leak message to be on a specific line number, just make sure that the
line is present in the output.</li>
</ul>
<p>Example of a recent (2017-02-15) successful test run, truncated output:</p>
<pre class="literal-block">
...
0:08:20 [403/404] test_codecs passed
0:08:21 [404/404] test_threading passed
391 tests OK.
10 slowest tests:
- test_multiprocessing_spawn: 1 min 24 sec
- test_concurrent_futures: 1 min 3 sec
- test_multiprocessing_forkserver: 60 sec
...
13 tests skipped:
test_devpoll test_ioctl test_kqueue ...
Total duration: 8 min 22 sec
Tests result: SUCCESS
</pre>
</div>
<div class="section" id="tests-changes">
<h2>Tests changes</h2>
<ul>
<li><p class="first">script_helper: kill the subprocess on error. If Popen.communicate() raises an
exception, kill the child process to not leave a running child process in the
background and maybe create a zombie process. This change fixes a
ResourceWarning in Python 3.6 when unit tests are interrupted by CTRL+c.</p>
</li>
<li><p class="first">Issue #27181: Skip test_statistics tests known to fail until a fix is found.</p>
</li>
<li><p class="first">Issue #18401: Fix test_pdb if $HOME is not set. HOME is not set on Windows
for example.</p>
</li>
<li><p class="first">test_eintr: Fix <tt class="docutils literal">ResourceWarning</tt> warnings</p>
</li>
<li><p class="first">Buildbot: give 20 minute per test file. It seems like at least 2 buildbots
need more than 15 minutes per test file. Example with "AMD64 Snow Leop 3.x":</p>
<pre class="literal-block">
10 slowest tests:
- test_tools: 14 min 40 sec
- test_tokenize: 11 min 57 sec
- test_datetime: 11 min 25 sec
- ...
</pre>
</li>
<li><p class="first">Issue #28176: test_asyncio: fix test_sock_connect_sock_write_race(), increase
the timeout from 10 seconds to 60 seconds.</p>
</li>
</ul>
</div>
<div class="section" id="other-changes">
<h2>Other changes</h2>
<ul class="simple">
<li>Issue #22624: Python 3 now requires the <tt class="docutils literal">clock()</tt> function to build to
simplify the C code.</li>
<li>Issue #27404: tag security related changes with the "[Security]" prefix in
the changelog Misc/NEWS.</li>
<li>Issue #27776: <tt class="docutils literal">dev_urandom(raise=0)</tt> now closes the file descriptor on error</li>
<li>Issue #27128, #18295: Use <tt class="docutils literal">Py_ssize_t</tt> in <tt class="docutils literal">_PyEval_EvalCodeWithName()</tt>.
Replace <tt class="docutils literal">int</tt> type with <tt class="docutils literal">Py_ssize_t</tt> for index variables used for
positional arguments. It should help to avoid integer overflow and help to
emit better machine code for <tt class="docutils literal">i++</tt> (no trap needed for overflow). Make also
the <tt class="docutils literal">total_args</tt> variable constant.</li>
<li>Fix "make tags": set the locale to C to call sort. vim expects the tags file
to be sorted using English collation, so it fails if the locale is French, for
example. Use LC_ALL=C to force the English sorting order. Issue #27726.</li>
<li>Issue #27698: Add <tt class="docutils literal">socketpair</tt> function to <tt class="docutils literal">socket.__all__</tt> on Windows</li>
<li>Issue #27786: Simplify (optimize?) PyLongObject private function <tt class="docutils literal">x_sub()</tt>:
the <tt class="docutils literal">z</tt> variable is known to be a new object which cannot be shared,
<tt class="docutils literal">Py_SIZE()</tt> can be used directly to negate the number.</li>
<li>Fix a clang warning in grammar.c. Clang is smarter than GCC and emits a
warning for dead code on a function declared with
<tt class="docutils literal"><span class="pre">__attribute__((__noreturn__))</span></tt> (the <tt class="docutils literal">Py_FatalError()</tt> function in this
case).</li>
<li>Issue #28114: Add unit tests on <tt class="docutils literal"><span class="pre">os.spawn*()</span></tt> to prepare to fix a crash
with bytes environment.</li>
<li>Issue #28127: Add <tt class="docutils literal">_PyDict_CheckConsistency()</tt>: a function checking that a
dictionary remains consistent after any change. By default, only basic
attributes are tested; the table content is not checked because the impact on
Python performance would be too high. <tt class="docutils literal">DEBUG_PYDICT</tt> must be defined (ex:
<tt class="docutils literal">gcc <span class="pre">-D</span> DEBUG_PYDICT</tt>) to also check the dictionary content.</li>
</ul>
</div>
CPython sprint, september 20162017-02-14T18:00:00+01:002017-02-14T18:00:00+01:00Victor Stinnertag:vstinner.github.io,2017-02-14:/cpython-sprint-2016.html<p>I was invited to my first CPython sprint in September! Five days, September
5-9, at Instagram office in California, USA. The sprint was sponsored by
Instagram, Microsoft, and the PSF.</p>
<p><strong>First little game:</strong> Many happy faces, but <em>Where is Victor?</em></p>
<a class="reference external image-reference" href="http://blog.python.org/2016/09/python-core-development-sprint-2016-36.html"><img alt="CPython developers at the Facebook sprint" src="https://vstinner.github.io/images/cpython_sprint_2016_photo.jpg" /></a>
<p>IMHO it was the most productive CPython week ever :-) Having …</p><p>I was invited at my first CPython sprint in September! Five days, September
5-9, at Instagram office in California, USA. The sprint was sponsored by
Instagram, Microsoft, and the PSF.</p>
<p><strong>First little game:</strong> Many happy faces, but <em>Where is Victor?</em></p>
<a class="reference external image-reference" href="http://blog.python.org/2016/09/python-core-development-sprint-2016-36.html"><img alt="CPython developers at the Facebook sprint" src="https://vstinner.github.io/images/cpython_sprint_2016_photo.jpg" /></a>
<p>IMHO it was the most productive CPython week ever :-) Having Guido van Rossum
in a room helped to get many PEPs accepted. Having a lot of highly skilled
reviewers in the same room helped to get many new features and many PEP
implementations merged much faster than usual.</p>
<p><strong>Second little game:</strong> try to spot the sprint on the CPython commit statistics of
the last 12 months (Feb, 2016-Feb, 2017) ;-)</p>
<a class="reference external image-reference" href="https://github.com/python/cpython/graphs/commit-activity"><img alt="CPython commits statistics" src="https://vstinner.github.io/images/cpython_sprint_2016_commits.png" /></a>
<div class="section" id="compact-dict">
<h2>Compact dict</h2>
<p>Issue #27350: I reviewed and pushed the "compact dict" implementation which
makes Python dictionaries ordered (by insertion order) by default. It reduces
the memory usage of dictionaries by 20% to 25%.</p>
<p>The implementation was written by INADA Naoki, based on the PyPy
implementation, with a design by Raymond Hettinger.</p>
</div>
<div class="section" id="fastcall">
<h2>FASTCALL</h2>
<p>"Fast calls": Python 3.6 has a new private C API and a new METH_FASTCALL
calling convention which avoids a temporary tuple for positional arguments and
a temporary dictionary for keyword arguments. Changes:</p>
<ul class="simple">
<li>Add a new C calling convention: METH_FASTCALL</li>
<li>Add _PyArg_ParseStack() function</li>
<li>Add _PyCFunction_FastCallKeywords() function: issue #27810</li>
<li>Add _PyObject_FastCallKeywords() function: issue #27830</li>
</ul>
</div>
<div class="section" id="more-efficient-call-function-bytecode">
<h2>More efficient CALL_FUNCTION bytecode</h2>
<p>I reviewed and pushed: "Rework CALL_FUNCTION* opcodes to produce shorter and
more efficient bytecode" (issue #27213).</p>
<p>Patch written by Demur Rumed, designed by Serhiy Storchaka, reviewed by Serhiy
Storchaka and me.</p>
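<p>The effect of the rework can be inspected with the <tt class="docutils literal">dis</tt> module. The exact opcode names depend on the Python version (CALL_FUNCTION in 3.6, CALL in recent releases), so this sketch only looks for a CALL-family opcode:</p>

```python
import dis

# Disassemble a simple call: the positional arguments are pushed on the
# stack and consumed by a single CALL-family opcode.
code = compile("f(1, 2)", "<demo>", "eval")
opnames = [instr.opname for instr in dis.Bytecode(code)]
```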
</div>
<div class="section" id="pep-509-add-a-private-version-to-dict">
<h2>PEP 509: Add a private version to dict</h2>
<p>Guido approved my PEP 509 "Add a new private version to the builtin dict type".</p>
<p>I pushed the implementation.</p>
</div>
<div class="section" id="pep-524-make-os-urandom-blocking-on-linux">
<h2>PEP 524: Make os.urandom() blocking on Linux</h2>
<p>I pushed the implementation of my PEP 524: "Make os.urandom() blocking on
Linux".</p>
<p>Issue #27776: The os.urandom() function now blocks on Linux 3.17 and newer
until the system urandom entropy pool is initialized, to increase security.</p>
<p>Read my previous blog post for the painful story behind the PEP:
<a class="reference external" href="https://vstinner.github.io/pep-524-os-urandom-blocking.html">PEP 524: os.urandom() now blocks on Linux</a>.</p>
</div>
<div class="section" id="asynchronous-pep-525-and-530">
<h2>Asynchronous PEP 525 and 530</h2>
<p>Guido van Rossum approved two PEPs of Yury Selivanov:</p>
<ul class="simple">
<li>PEP 525: Asynchronous Generators</li>
<li>PEP 530: Asynchronous Comprehensions</li>
</ul>
<p>I reviewed the huge C implementation with Yury on my side :-)</p>
</div>
<div class="section" id="unicode-escape-codec-optimization">
<h2>unicode_escape codec optimization</h2>
<p>I reviewed and pushed "Optimize unicode_escape and raw_unicode_escape"
(issue #16334), a patch written by Serhiy Storchaka.</p>
</div>
<div class="section" id="python-3-6-bugfixes">
<h2>Python 3.6 bugfixes</h2>
<p>I happily found many issues including a major one: regular list
comprehensions were completely broken :-)</p>
<p>Another minor issue: SyntaxError didn't report the correct line number in a
specific case.</p>
<p>Don't worry, Yury fixed both ;-)</p>
</div>
<div class="section" id="official-sprint-report">
<h2>Official sprint report</h2>
<p>Read also the official report: <a class="reference external" href="http://blog.python.org/2016/09/python-core-development-sprint-2016-36.html">Python Core Development Sprint 2016: 3.6 and
beyond!</a>.</p>
</div>
PEP 524: os.urandom() now blocks on Linux in Python 3.62017-02-14T12:00:00+01:002017-02-14T12:00:00+01:00Victor Stinnertag:vstinner.github.io,2017-02-14:/pep-524-os-urandom-blocking.html<div class="section" id="getrandom-avoids-file-descriptors">
<h2>getrandom() avoids file descriptors</h2>
<p>Last years, I'm making sometimes enhancements in the Python code used to
generate random numbers, the C implementation of <tt class="docutils literal">os.urandom()</tt>. My main two
changes were to use the new <tt class="docutils literal">getentropy()</tt> and <tt class="docutils literal">getrandom()</tt> functions when
available on Linux, Solaris, OpenBSD, etc.</p>
<p>In 2013, <tt class="docutils literal">os.urandom()</tt> opened …</p></div><div class="section" id="getrandom-avoids-file-descriptors">
<h2>getrandom() avoids file descriptors</h2>
<p>In recent years, I have occasionally made enhancements to the Python code used to
generate random numbers, the C implementation of <tt class="docutils literal">os.urandom()</tt>. My two main
changes were to use the new <tt class="docutils literal">getentropy()</tt> and <tt class="docutils literal">getrandom()</tt> functions when
available on Linux, Solaris, OpenBSD, etc.</p>
<p>In 2013, <tt class="docutils literal">os.urandom()</tt> opened a file descriptor to read from
<tt class="docutils literal">/dev/urandom</tt> and then closed it. It was decided to use a single private
file descriptor and keep it open to prevent <tt class="docutils literal">EMFILE</tt> or <tt class="docutils literal">ENFILE</tt> errors
(too many open files) under high system loads with many threads: see the issue
#18756.</p>
<p>The private file descriptor introduced a backward incompatible change for badly
written programs. The code was modified to call <tt class="docutils literal">fstat()</tt> to detect whether the
file descriptor was closed and then replaced with a different file descriptor
(with the same number): it checks whether the <tt class="docutils literal">st_dev</tt> or <tt class="docutils literal">st_ino</tt> attributes changed.</p>
<p>In 2014, the new Linux kernel 3.17 added a new <tt class="docutils literal">getrandom()</tt> syscall which
gives access to random bytes without having to handle a file descriptor. I
modified <tt class="docutils literal">os.urandom()</tt> to call <tt class="docutils literal">getrandom()</tt> to avoid file descriptors,
but a different issue appeared.</p>
</div>
<div class="section" id="getrandom-hangs-at-system-startup">
<h2>getrandom() hangs at system startup</h2>
<p>On embedded devices and virtual machines, Python 3.5 started to hang at
startup.</p>
<p>On Debian, a systemd script used Python to compute a MD5 checksum, but Python
was blocked during its initialization. Other users reported that Python blocked
on importing the <tt class="docutils literal">random</tt> module, sometimes imported indirectly by a
different module.</p>
<p>Python was blocked on the <tt class="docutils literal">getrandom(0)</tt> syscall, waiting until the system
collected enough entropy to initialize the urandom pool. It took longer than 90
seconds, so systemd killed the service with a timeout. As a consequence, the
system boot took longer than 90 seconds, or could even fail!</p>
</div>
<div class="section" id="fix-python-startup">
<h2>Fix Python startup</h2>
<p>The fix was obvious: call <tt class="docutils literal">getrandom(GRND_NONBLOCK)</tt> which fails immediately
if the call would block, and fall back on reading from <tt class="docutils literal">/dev/urandom</tt> which
doesn't block even if the entropy pool is not initialized yet.</p>
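<p>A rough Python-level sketch of that fallback logic (hypothetical helper; the real code is C, and <tt class="docutils literal">os.getrandom()</tt> only exists on Linux with Python 3.6 and newer):</p>

```python
import os

def best_effort_urandom(nbytes):
    # Try the non-blocking syscall first; if the entropy pool is not
    # initialized yet, fall back on /dev/urandom, which never blocks.
    if hasattr(os, "getrandom"):
        try:
            return os.getrandom(nbytes, os.GRND_NONBLOCK)
        except BlockingIOError:
            pass
    with open("/dev/urandom", "rb") as fp:
        return fp.read(nbytes)

data = best_effort_urandom(16)
```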
<p>Quickly, our security experts complained that falling back on <tt class="docutils literal">/dev/urandom</tt>
makes Python less secure. When the fallback path is taken, <tt class="docutils literal">/dev/urandom</tt>
returns random numbers not suitable for security purposes (initialized with low
entropy), whereas the <a class="reference external" href="https://docs.python.org/dev/library/os.html#os.urandom">os.urandom() documentation</a> says: "The returned
data should be unpredictable enough for cryptographic applications" (and
"though its exact quality depends on the OS implementation.").</p>
<p>Calling <tt class="docutils literal">getrandom()</tt> in blocking mode for <tt class="docutils literal">os.urandom()</tt> makes Python more
secure, but it doesn't fix the startup bug.</p>
</div>
<div class="section" id="discussion-storm">
<h2>Discussion storm</h2>
<p>The proposed change started a huge rain of messages. More than 200 messages,
maybe even more than 500 messages, on the bug tracker and python-dev mailing
list. Everyone became a security expert and wanted to give his/her very
important opinion, without listening to other arguments.</p>
<p>Two Python security experts left the discussion.</p>
<p>I also ignored new messages. I simply did not have enough time to read all of
them, and the discussion tone made me angry.</p>
</div>
<div class="section" id="new-mailing-list-and-two-new-peps">
<h2>New mailing list and two new PEPs</h2>
<p>A new <tt class="docutils literal"><span class="pre">security-sig</span></tt> mailing list, subtitled "os.urandom rehab clinic", was
created just to take a decision on <tt class="docutils literal">os.urandom()</tt>!</p>
<p>Nick Coghlan wrote the <a class="reference external" href="https://www.python.org/dev/peps/pep-0522/">PEP 522: Allow BlockingIOError in security sensitive
APIs</a>. Basically: he considers
that there is no good default behaviour when <tt class="docutils literal">os.urandom()</tt> would block, so
raise an exception to let users decide.</p>
<p>I wrote <a class="reference external" href="https://www.python.org/dev/peps/pep-0524/">PEP 524: Make os.urandom() blocking on Linux</a>. My PEP proposes to make
<tt class="docutils literal">os.urandom()</tt> blocking, <em>but</em> also to modify Python startup to fall back on a
non-blocking RNG to initialize the secret hash seed and the <tt class="docutils literal">random</tt> module
(which is <em>not</em> security sensitive, except for <tt class="docutils literal">random.SystemRandom</tt>).</p>
<p>Nick's PEP describes an important use case: being able to check whether
<tt class="docutils literal">os.urandom()</tt> would block. Instead of adding a flag to <tt class="docutils literal">os.urandom()</tt>,
I chose to expose the low-level C
<tt class="docutils literal">getrandom()</tt> function as a new Python <tt class="docutils literal">os.getrandom()</tt> function. Calling
<tt class="docutils literal">os.getrandom(1, os.GRND_NONBLOCK)</tt> raises a <tt class="docutils literal">BlockingIOError</tt> exception,
as Nick proposed for <tt class="docutils literal">os.urandom()</tt>, so it's possible to decide what to do in
this case.</p>
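<p>With <tt class="docutils literal">os.getrandom()</tt> exposed, the check Nick's PEP cared about fits in a few lines (hypothetical helper name):</p>

```python
import os

def urandom_would_block():
    # A BlockingIOError from getrandom(GRND_NONBLOCK) means the urandom
    # entropy pool is not initialized yet, i.e. os.urandom() would block.
    if not hasattr(os, "getrandom"):
        return False  # platforms without getrandom() never block
    try:
        os.getrandom(1, os.GRND_NONBLOCK)
        return False
    except BlockingIOError:
        return True

blocked = urandom_would_block()
```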
<p>While both PEPs are valid, IMHO my PEP was <em>less</em> backward incompatible,
simpler and maybe closer to what users <em>expect</em>. The "os.urandom() would block"
case is a special case with my PEP, but my PEP lets you decide what to do in
that case (thanks to <tt class="docutils literal">os.getrandom()</tt>).</p>
<p>Guido van Rossum approved my PEP and rejected Nick's PEP. I worked with Nick to
implement my PEP.</p>
</div>
<div class="section" id="python-3-6-changes">
<h2>Python 3.6 changes</h2>
<p>I added a new <tt class="docutils literal">os.getrandom()</tt> function: expose the Linux
<tt class="docutils literal">getrandom()</tt> syscall (issue #27778). I also added the two getrandom() flags:
<tt class="docutils literal">os.GRND_NONBLOCK</tt> and <tt class="docutils literal">os.GRND_RANDOM</tt>.</p>
<p>I modified <tt class="docutils literal">os.urandom()</tt> to block on Linux: call <tt class="docutils literal">getrandom(0)</tt>
instead of <tt class="docutils literal">getrandom(GRND_NONBLOCK)</tt> (issue #27776).</p>
<p>I also added a private <tt class="docutils literal">_PyOS_URandomNonblock()</tt> function used to initialize
the hash secret and used by <tt class="docutils literal">random.Random.seed()</tt> (used to initialize the
<tt class="docutils literal">random</tt> module).</p>
<p>The <tt class="docutils literal">os.urandom()</tt> function now blocks in Python 3.6 on Linux 3.17 and newer
until the system urandom entropy pool is initialized, to increase security.</p>
</div>
<div class="section" id="read-also-lwn-articles">
<h2>Read also LWN articles</h2>
<ul class="simple">
<li><a class="reference external" href="https://lwn.net/Articles/606141/">A system call for random numbers: getrandom()</a> (July 2014)</li>
<li><a class="reference external" href="https://lwn.net/Articles/693189/">Python's os.urandom() in the absence of entropy</a> (July 2016) -- this story</li>
<li><a class="reference external" href="https://lwn.net/Articles/711013/">The long road to getrandom() in glibc</a> (January 2017)</li>
</ul>
</div>
My contributions to CPython during 2016 Q22017-02-12T18:00:00+01:002017-02-12T18:00:00+01:00Victor Stinnertag:vstinner.github.io,2017-02-12:/contrib-cpython-2016q2.html<p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2016 Q2
(april, may, june):</p>
<pre class="literal-block">
hg log -r 'date("2016-04-01"):date("2016-06-30")' --no-merges -u Stinner
</pre>
<p>Statistics: 52 non-merge commits + 22 merge commits (total: 74 commits).</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q1.html">My contributions to CPython during 2016 Q1</a>. Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q3.html">My contributions to
CPython during 2016 Q3</a>.</p>
<div class="section" id="start-of-my-work-on-optimization">
<h2>Start of …</h2></div><p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2016 Q2
(april, may, june):</p>
<pre class="literal-block">
hg log -r 'date("2016-04-01"):date("2016-06-30")' --no-merges -u Stinner
</pre>
<p>Statistics: 52 non-merge commits + 22 merge commits (total: 74 commits).</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q1.html">My contributions to CPython during 2016 Q1</a>. Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q3.html">My contributions to
CPython during 2016 Q3</a>.</p>
<div class="section" id="start-of-my-work-on-optimization">
<h2>Start of my work on optimization</h2>
<p>During 2016 Q2, I started to spend more time on optimizing CPython.</p>
<p>I experimented with a change on CPython: a new FASTCALL calling convention to avoid
the creation of a temporary tuple to pass positional arguments: <a class="reference external" href="http://bugs.python.org/issue26814">issue26814</a>. Early results were really good: calling
builtin functions became between 20% and 50% faster!</p>
<p>Quickly, my optimization work was blocked by unreliable benchmarks. I spent the
rest of the year 2016 analyzing benchmarks and making benchmarks more stable.</p>
</div>
<div class="section" id="subprocess-now-emits-resourcewarning">
<h2>subprocess now emits ResourceWarning</h2>
<p>subprocess.Popen destructor now emits a ResourceWarning warning if the child
process is still running (issue #26741). The warning helps to track and fix
zombie processes. I updated asyncio to prevent a false ResourceWarning (a warning
emitted although the child process completed): asyncio now copies the child process
exit status to the internal Popen object.</p>
<p>I also fixed the POSIX implementation of subprocess.Popen._execute_child(): it
now sets the returncode attribute from the child process exit status when exec
failed.</p>
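<p>The warning is easy to trigger on purpose. A sketch (ResourceWarning is ignored by default, hence the "always" filter; the leaked child is killed afterwards):</p>

```python
import os
import signal
import subprocess
import sys
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    proc = subprocess.Popen([sys.executable, "-c",
                             "import time; time.sleep(60)"])
    pid = proc.pid
    # Dropping the last reference while the child is still running makes
    # the Popen destructor emit a ResourceWarning.
    del proc

got_warning = any(issubclass(w.category, ResourceWarning) for w in caught)

# Clean up the child process that was deliberately leaked above.
os.kill(pid, signal.SIGTERM)
if os.name == "posix":
    os.waitpid(pid, 0)
```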
</div>
<div class="section" id="security-fix-potential-shell-injections-in-ctypes-util">
<h2>Security: fix potential shell injections in ctypes.util</h2>
<p>I rewrote methods of the ctypes.util module using <tt class="docutils literal">os.popen()</tt>. I replaced
<tt class="docutils literal">os.popen()</tt> with <tt class="docutils literal">subprocess.Popen</tt> without a shell (issue #22636) to fix a
class of security vulnerability, "shell injection" (injecting arbitrary shell
commands to take control of a computer).</p>
<p>The <tt class="docutils literal">os.popen()</tt> function uses a shell, so there is a risk if the command
line arguments are not properly escaped for the shell. Using <tt class="docutils literal">subprocess.Popen</tt>
without a shell completely removes the risk.</p>
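<p>The difference is easy to demonstrate with a hypothetical hostile "library name" (echo stands in for the real command run by ctypes.util):</p>

```python
import subprocess

# A payload that a shell would interpret as two commands.
payload = "libfoo.so; rm -rf /tmp/important"

# With an argument list and no shell, the payload reaches the program as
# one literal argument: the ";" is never parsed as a command separator.
proc = subprocess.Popen(["echo", payload],
                        stdout=subprocess.PIPE,
                        universal_newlines=True)
out, _ = proc.communicate()
```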
<p>Note: the <tt class="docutils literal">ctypes</tt> module is generally not considered "safe", but it
doesn't hurt to make it more secure ;-)</p>
</div>
<div class="section" id="optimization-pymem-malloc-now-uses-pymalloc">
<h2>Optimization: PyMem_Malloc() now uses pymalloc</h2>
<p>PyMem_Malloc() now uses the fast Python "pymalloc" memory allocator which is
optimized for small objects with a short lifetime (issue #26249). The change
makes some benchmarks up to 4% faster.</p>
<p>This change was possible thanks to the whole preparation work I did in the 2016
Q1, especially the new GIL check in memory allocator debug hooks and the new
<tt class="docutils literal">PYTHONMALLOC=debug</tt> environment variable enabling these hooks on a Python
compiled in released mode.</p>
<p>I tested lxml, Pillow, cryptography and numpy before pushing the change,
as asked by Marc-Andre Lemburg. All these projects work with the change, except
for numpy. I wrote a fix for numpy: <a class="reference external" href="https://github.com/numpy/numpy/pull/7404">Use PyMem_RawMalloc on Python 3.4 and newer</a>, merged one month later (my first
contribution to numpy!).</p>
<p>The change indirectly helped to identify and fix a memory leak in the
<tt class="docutils literal">formatfloat()</tt> function used to format bytes strings: <tt class="docutils literal"><span class="pre">b"%f"</span> % 1.2</tt> (issue
#25349, #26249).</p>
</div>
<div class="section" id="optimization">
<h2>Optimization</h2>
<p>Issue #27056: Optimize pickle.load() and pickle.loads(), up to 10% faster when
deserializing a lot of small objects. I found this optimization using Linux perf
on Python compiled with PGO. My change implements the optimization manually when
Python is not compiled with PGO.</p>
<p>Issue #26770: When <tt class="docutils literal">set_inheritable()</tt> is implemented with <tt class="docutils literal">fcntl()</tt>, don't
call <tt class="docutils literal">fcntl()</tt> twice if the <tt class="docutils literal">FD_CLOEXEC</tt> flag is already set to the
requested value. Linux uses <tt class="docutils literal">ioctl()</tt> and so always needs only a single
syscall.</p>
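<p>The pickle case above can be exercised with a toy workload of many small objects:</p>

```python
import pickle

# Issue #27056 targets exactly this shape of data: deserializing a large
# number of small objects.
data = pickle.dumps([(i, str(i)) for i in range(10_000)])
objs = pickle.loads(data)
```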
</div>
<div class="section" id="changes">
<h2>Changes</h2>
<ul>
<li><p class="first">Issue #26716: Replace IOError with OSError in fcntl documentation, IOError is
a deprecated alias to OSError since Python 3.3.</p>
</li>
<li><p class="first">Issue #26639: Replace the deprecated <tt class="docutils literal">imp</tt> module with the <tt class="docutils literal">importlib</tt>
module in <tt class="docutils literal">Tools/i18n/pygettext.py</tt>. Remove <tt class="docutils literal">_get_modpkg_path()</tt>,
replaced with <tt class="docutils literal">importlib.util.find_spec()</tt>.</p>
</li>
<li><p class="first">Issue #26735: Fix os.urandom() on Solaris 11.3 and newer when reading more
than 1024 bytes: call getrandom() multiple times with a limit of 1024 bytes
per call.</p>
</li>
<li><p class="first">configure: fix <tt class="docutils literal">HAVE_GETRANDOM_SYSCALL</tt> check, syscall() function requires
<tt class="docutils literal">#include <unistd.h></tt>.</p>
</li>
<li><p class="first">Issue #26766: Fix _PyBytesWriter_Finish(). Return a bytearray object when
bytearray is requested and when the small buffer is used. Fix also
test_bytes: bytearray%args must return a bytearray type.</p>
</li>
<li><p class="first">Issue #26777: Fix random failure of test_asyncio.test_timeout_disable() on
the "AMD64 FreeBSD 9.x 3.5" buildbot:</p>
<pre class="literal-block">
File ".../Lib/test/test_asyncio/test_tasks.py", line 2398, in go
self.assertTrue(0.09 < dt < 0.11, dt)
AssertionError: False is not true : 0.11902812402695417
</pre>
<p>Replace <tt class="docutils literal">< 0.11</tt> with <tt class="docutils literal">< 0.15</tt>.</p>
</li>
<li><p class="first">Backport test_gdb fix for s390x buildbots to Python 3.5.</p>
</li>
<li><p class="first">Cleanup import.c: replace <tt class="docutils literal">PyUnicode_RPartition()</tt> with
<tt class="docutils literal">PyUnicode_FindChar()</tt> and <tt class="docutils literal">PyUnicode_Substring()</tt> to avoid the creation
of a temporary tuple. Use <tt class="docutils literal">PyUnicode_FromFormat()</tt> to build a string and
avoid the single_dot ('.') singleton.</p>
</li>
<li><p class="first">regrtest now uses subprocesses when the <tt class="docutils literal"><span class="pre">-j1</span></tt> command line option is used:
each test file runs in a fresh child process. Before, the -j1 option was
ignored. <tt class="docutils literal">Tools/buildbot/test.bat</tt> script now uses -j1 by default to run
each test file in fresh child process.</p>
</li>
<li><p class="first">regrtest: display test result (passed, failed, ...) after each test
completion. In multiprocessing mode: always display the result. In sequential
mode: only display the result if the test did not pass.</p>
</li>
<li><p class="first">Issue #27278: Fix <tt class="docutils literal">os.urandom()</tt> implementation using <tt class="docutils literal">getrandom()</tt> on
Linux. Truncate the size to <tt class="docutils literal">INT_MAX</tt> and loop until enough random bytes have
been collected, instead of directly casting a <tt class="docutils literal">Py_ssize_t</tt> to <tt class="docutils literal">int</tt>.</p>
</li>
</ul>
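<p>The chunking logic from the getrandom() fixes above can be sketched in pure Python (illustrative helper; the real code is C):</p>

```python
import os

def urandom_chunked(size, chunk=1024):
    # Collect random bytes at most `chunk` bytes per call, looping until
    # the requested size has been gathered (mirrors issues #26735/#27278).
    parts = []
    remaining = size
    while remaining > 0:
        data = os.urandom(min(remaining, chunk))
        parts.append(data)
        remaining -= len(data)
    return b"".join(parts)

blob = urandom_chunked(5000)
```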
</div>
<div class="section" id="contributions">
<h2>Contributions</h2>
<p>I also pushed a few changes written by other contributors.</p>
<p>Issue #26839: <tt class="docutils literal">os.urandom()</tt> doesn't block on Linux anymore. On Linux,
<tt class="docutils literal">os.urandom()</tt> now calls getrandom() with <tt class="docutils literal">GRND_NONBLOCK</tt> to fall back on
reading <tt class="docutils literal">/dev/urandom</tt> if the urandom entropy pool is not initialized yet.
Patch written by <strong>Colm Buckley</strong>. This issue started a huge annoying discussion
around random number generation on the bug tracker and the python-dev mailing
list. I later wrote the <a class="reference external" href="https://www.python.org/dev/peps/pep-0524/">PEP 524: Make os.urandom() blocking on Linux</a> to fix the issue!</p>
<p>Other changes:</p>
<ul class="simple">
<li>Issue #26647: Cleanup opcode: simplify code to build <tt class="docutils literal">opcode.opname</tt>. Patch
written by <strong>Demur Rumed</strong>.</li>
<li>Issue #26647: Cleanup modulefinder: use <tt class="docutils literal">dis.opmap[name]</tt> rather than
<tt class="docutils literal">dis.opname.index(name)</tt>. Patch written by <strong>Demur Rumed</strong>.</li>
<li>Issue #26801: Fix error handling in <tt class="docutils literal">shutil.get_terminal_size()</tt>: catch
AttributeError instead of NameError. Skip the functional test of test_shutil
using the <tt class="docutils literal">stty size</tt> command if the <tt class="docutils literal">os.get_terminal_size()</tt> function is
missing. Patch written by <strong>Emanuel Barry</strong>.</li>
<li>Issue #26802: Optimize function calls only using unpacking like
<tt class="docutils literal"><span class="pre">func(*tuple)</span></tt> (no other positional argument, no keyword argument): avoid
copying the tuple. Patch written by <strong>Joe Jevnik</strong>.</li>
<li>Issue #21668: Add missing libm dependency in setup.py: link audioop,
_datetime, _ctypes_test modules to libm, except on Mac OS X. Patch written by
<strong>Chi Hsuan Yen</strong>.</li>
<li>Issue #26799: Fix python-gdb.py: don't get C types at startup, only on
demand. The C types can change if python-gdb.py is loaded before loading the
Python executable in gdb. Patch written by <strong>Thomas Ilsche</strong>.</li>
<li>Issue #27057: Fix os.set_inheritable() on Android, ioctl() is blocked by
SELinux and fails with EACCESS. The function now falls back to fcntl(). Patch
written by <strong>Michał Bednarski</strong>.</li>
<li>Issue #26647: Fix typo in test_grammar. Patch written by <strong>Demur Rumed</strong>.</li>
</ul>
</div>
My contributions to CPython during 2016 Q12017-02-09T17:00:00+01:002017-02-09T17:00:00+01:00Victor Stinnertag:vstinner.github.io,2017-02-09:/contrib-cpython-2016q1.html<p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2016 Q1
(january, february, march):</p>
<pre class="literal-block">
hg log -r 'date("2016-01-01"):date("2016-03-31")' --no-merges -u Stinner
</pre>
<p>Statistics: 196 non-merge commits + 33 merge commits (total: 229 commits).</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2015q4.html">My contributions to CPython during 2015 Q4</a>. Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q2.html">My contributions to
CPython during 2016 Q2</a>.</p>
<div class="section" id="summary">
<h2>Summary</h2>
<p>Since …</p></div><p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2016 Q1
(january, february, march):</p>
<pre class="literal-block">
hg log -r 'date("2016-01-01"):date("2016-03-31")' --no-merges -u Stinner
</pre>
<p>Statistics: 196 non-merge commits + 33 merge commits (total: 229 commits).</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2015q4.html">My contributions to CPython during 2015 Q4</a>. Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q2.html">My contributions to
CPython during 2016 Q2</a>.</p>
<div class="section" id="summary">
<h2>Summary</h2>
<p>Since this report is much longer than I expected, here are the highlights:</p>
<ul class="simple">
<li>Python 8: no pep8, no chocolate!</li>
<li>AST enhancements coming from FAT Python</li>
<li>faulthandler now catches Windows fatal exceptions</li>
<li>New PYTHONMALLOC environment variable</li>
<li>tracemalloc: new C API and support for multiple address spaces</li>
<li>ResourceWarning warnings now come with a traceback</li>
<li>PyMem_Malloc() now fails if the GIL is not held</li>
<li>Interesting bug: reentrant flag in tracemalloc</li>
</ul>
</div>
<div class="section" id="python-8-no-pep8-no-chocolate">
<h2>Python 8: no pep8, no chocolate!</h2>
<p>I prepared an April Fool: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2016-March/143603.html">[Python-Dev] The next major Python version will be
Python 8</a> :-)</p>
<p>I increased Python version to 8, added the <tt class="docutils literal">pep8</tt> module and modified
<tt class="docutils literal">importlib</tt> to raise an <tt class="docutils literal">ImportError</tt> if a module is not PEP8-compliant!</p>
</div>
<div class="section" id="ast-enhancements-coming-from-fat-python">
<h2>AST enhancements coming from FAT Python</h2>
<p>Changes coming from my <a class="reference external" href="http://faster-cpython.readthedocs.io/fat_python.html">FAT Python</a> (AST optimizer, run
ahead of time):</p>
<p>The compiler now ignores constant statements like <tt class="docutils literal">b'bytes'</tt> (issue #26204).
I had to replace constant statements with expressions to prepare the change (ex:
replace <tt class="docutils literal">b'bytes'</tt> with <tt class="docutils literal">x = b'bytes'</tt>). First, the compiler emitted a
<tt class="docutils literal">SyntaxWarning</tt>, but it was quickly decided to let linters emit such
warnings to avoid annoying users: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2016-February/143163.html">read the thread on python-dev</a>.</p>
<p>Example, Python 3.5:</p>
<pre class="literal-block">
>>> def f():
... b'bytes'
...
>>> import dis; dis.dis(f)
2 0 LOAD_CONST 1 (b'bytes')
3 POP_TOP
4 LOAD_CONST 0 (None)
7 RETURN_VALUE
</pre>
<p>Python 3.6:</p>
<pre class="literal-block">
>>> def f():
... b'bytes'
...
>>> import dis; dis.dis(f)
1 0 LOAD_CONST 0 (None)
2 RETURN_VALUE
</pre>
<p>Other changes:</p>
<ul class="simple">
<li>Issue #26107: The format of the co_lnotab attribute of code objects changed
to support negative line number deltas. This allows AST optimizers to move
instructions without breaking Python tracebacks, a change needed by the loop
unrolling optimization of FAT Python.</li>
<li>Issue #26146: Add a new kind of AST node: <tt class="docutils literal">ast.Constant</tt>. It can be used by
external AST optimizers like FAT Python, but the compiler does not emit
directly such node. Update code to accept ast.Constant instead of ast.Num
and/or ast.Str.</li>
<li>Issue #26146: <tt class="docutils literal">marshal.loads()</tt> now uses the empty frozenset singleton. It
fixes a test failure in FAT Python and reduces the memory footprint.</li>
</ul>
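<p>As a rough sketch of what <tt class="docutils literal">ast.Constant</tt> enables, an external optimizer can
rewrite a parsed tree with such nodes and feed it to <tt class="docutils literal">compile()</tt>; the constant
folding below is a hypothetical example, not FAT Python's actual code:</p>

```python
import ast

# Hypothetical sketch: an external AST optimizer (like FAT Python) can
# replace an expression with an ast.Constant node; compile() accepts
# such nodes even though the Python 3.6 parser never emitted them itself.
tree = ast.parse("x = 1 + 1")
tree.body[0].value = ast.Constant(2)   # fold the BinOp into a constant
ast.fix_missing_locations(tree)        # the new node needs line numbers

ns = {}
exec(compile(tree, "<example>", "exec"), ns)
```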
</div>
<div class="section" id="faulthandler-now-catchs-windows-fatal-exceptions">
<h2>faulthandler now catches Windows fatal exceptions</h2>
<p>I enhanced the faulthandler.enable() function on Windows to set a
handler for Windows fatal exceptions using <tt class="docutils literal">AddVectoredExceptionHandler()</tt>
(issue #23848).</p>
<p>Windows exceptions are the native way to handle fatal errors on Windows,
whereas UNIX signals SIGSEGV, SIGFPE and SIGABRT are "emulated" on top of that.</p>
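<p>On the Python side, nothing changes in how the handler is enabled; a minimal
sketch:</p>

```python
import faulthandler

# faulthandler.enable() installs handlers for SIGSEGV, SIGFPE, SIGABRT
# and friends; on Windows it now also registers a fatal exception
# handler via AddVectoredExceptionHandler().
faulthandler.enable()
print(faulthandler.is_enabled())
```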
</div>
<div class="section" id="new-pythonmalloc-environment-variable">
<h2>New PYTHONMALLOC environment variable</h2>
<p>I added a new <tt class="docutils literal">PYTHONMALLOC</tt> environment variable (issue #26516) to set the
Python memory allocators.</p>
<p><tt class="docutils literal">PYTHONMALLOC=debug</tt> enables debug hooks on a Python compiled in release
mode, whereas Python 3.5 required recompiling Python in debug mode. These
hooks implement various checks:</p>
<ul class="simple">
<li>Detect <strong>buffer underflow</strong>: write before the start of the buffer</li>
<li>Detect <strong>buffer overflow</strong>: write after the end of the buffer</li>
<li>Detect API violations, ex: <tt class="docutils literal">PyObject_Free()</tt> called on a buffer
allocated by <tt class="docutils literal">PyMem_Malloc()</tt></li>
<li>Check if the GIL is held when allocator functions of PYMEM_DOMAIN_OBJ (ex:
<tt class="docutils literal">PyObject_Malloc()</tt>) and PYMEM_DOMAIN_MEM (ex: <tt class="docutils literal">PyMem_Malloc()</tt>) domains
are called</li>
</ul>
<p>Moreover, logging a fatal memory error now uses the tracemalloc module to get
the traceback where a memory block was allocated. Example of a buffer overflow
using <tt class="docutils literal">python3.6 <span class="pre">-X</span> tracemalloc=5</tt> (store 5 frames in traces):</p>
<pre class="literal-block">
Debug memory block at address p=0x7fbcd41666f8: API 'o'
    4 bytes originally requested
    The 7 pad bytes at p-7 are FORBIDDENBYTE, as expected.
    The 8 pad bytes at tail=0x7fbcd41666fc are not all FORBIDDENBYTE (0xfb):
        at tail+0: 0x02 *** OUCH
        at tail+1: 0xfb
        at tail+2: 0xfb
        ...
    The block was made by call #1233329 to debug malloc/realloc.
    Data at p: 1a 2b 30 00
Memory block allocated at (most recent call first):
  File "test/test_bytes.py", line 323
  File "unittest/case.py", line 600
  ...
Fatal Python error: bad trailing pad byte
Current thread 0x00007fbcdbd32700 (most recent call first):
  File "test/test_bytes.py", line 323 in test_hex
  File "unittest/case.py", line 600 in run
  ...
</pre>
<p><tt class="docutils literal">PYTHONMALLOC=malloc</tt> forces the usage of the system <tt class="docutils literal">malloc()</tt> allocator.
This option can be used with Valgrind. Without this option, Valgrind emits tons
of false alarms in the Python <tt class="docutils literal">pymalloc</tt> memory allocator.</p>
</div>
<div class="section" id="tracemalloc-new-c-api-and-support-multiple-address-spaces">
<h2>tracemalloc: new C API and support multiple address spaces</h2>
<p>Antoine Pitrou and Nathaniel Smith asked me to enhance the tracemalloc module:</p>
<ul class="simple">
<li>Add a C API to be able to manually track/untrack memory blocks, to track
the memory allocated by custom memory allocators. For example, numpy uses
allocators with a specific memory alignment for SIMD instructions.</li>
<li>Support tracking memory of different address spaces. For example, central
(CPU) memory and GPU memory for numpy.</li>
</ul>
<div class="section" id="support-multiple-address-spaces">
<h3>Support multiple address spaces</h3>
<p>I made deep changes in the <tt class="docutils literal">hashtable.c</tt> code (a simple C implementation of a
hash table used by <tt class="docutils literal">_tracemalloc</tt>) to support keys of variable size (issue
#26588), instead of a hardcoded <tt class="docutils literal">void *</tt> size. This makes it possible to support
keys larger than <tt class="docutils literal">sizeof(void*)</tt>, but also to use <em>less</em> memory for keys
smaller than <tt class="docutils literal">sizeof(void*)</tt> (ex: <tt class="docutils literal">int</tt> keys).</p>
<p>Then I extended the C <tt class="docutils literal">_tracemalloc</tt> module and the Python <tt class="docutils literal">tracemalloc</tt>
module to add a new <tt class="docutils literal">domain</tt> attribute to traces: I added a <tt class="docutils literal">Trace.domain</tt>
attribute and a <tt class="docutils literal">tracemalloc.DomainFilter</tt> class.</p>
<p>The final step was to optimize the memory footprint of _tracemalloc: start with
compact keys (<tt class="docutils literal">Py_uintptr_t</tt> type) and only switch to <tt class="docutils literal">pointer_t</tt> keys when
the first memory block with a non-zero domain is tracked (when more than one
address space is used). So the <tt class="docutils literal">_tracemalloc</tt> memory usage doesn't change by
default in Python 3.6!</p>
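<p>From Python, the new domain support is visible through <tt class="docutils literal">Trace.domain</tt> and
<tt class="docutils literal">tracemalloc.DomainFilter</tt>; a minimal sketch (pure-Python allocations all live
in domain 0, only C code using the tracking API can record other domains):</p>

```python
import tracemalloc

tracemalloc.start()
data = [bytes(1000) for _ in range(100)]   # keep the blocks alive
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# Keep only traces of the default address space (domain 0).
snapshot = snapshot.filter_traces(
    [tracemalloc.DomainFilter(inclusive=True, domain=0)])
stats = snapshot.statistics("lineno")
```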
</div>
<div class="section" id="c-api">
<h3>C API</h3>
<p>I added a private C API (issue #26530):</p>
<pre class="literal-block">
int _PyTraceMalloc_Track(_PyTraceMalloc_domain_t domain, Py_uintptr_t ptr, size_t size);
int _PyTraceMalloc_Untrack(_PyTraceMalloc_domain_t domain, Py_uintptr_t ptr);
</pre>
<p>I waited for feedback from Antoine and Nathaniel on this API, but it remains
private in Python 3.6 since no one reviewed it.</p>
</div>
</div>
<div class="section" id="resourcewarning-warnings-now-come-with-a-traceback">
<h2>ResourceWarning warnings now come with a traceback</h2>
<div class="section" id="final-result">
<h3>Final result</h3>
<p>Before going to explain the long development of the feature, let's see an
example of the final result! Example with the script <tt class="docutils literal">example.py</tt>:</p>
<pre class="literal-block">
import warnings

def func():
    return open(__file__)

f = func()
f = None
</pre>
<p>Output of the command <tt class="docutils literal">python3.6 <span class="pre">-Wd</span> <span class="pre">-X</span> tracemalloc=5 example.py</tt>:</p>
<pre class="literal-block">
example.py:7: ResourceWarning: unclosed file <_io.TextIOWrapper name='example.py' mode='r' encoding='UTF-8'>
  f = None
Object allocated at (most recent call first):
  File "example.py", lineno 4
    return open(__file__)
  File "example.py", lineno 6
    f = func()
</pre>
<p>The <tt class="docutils literal">Object allocated at <span class="pre">(...)</span></tt> part is the new feature ;-)</p>
</div>
<div class="section" id="add-source-parameter-to-warnings">
<h3>Add source parameter to warnings</h3>
<p>Python 3 logs <tt class="docutils literal">ResourceWarning</tt> warnings when a resource is not closed
properly, to help developers handle resources correctly. The problem is that
the warning is only logged when the object is destroyed, which can occur far
from the object creation, and can occur on a line unrelated to the object
because of the garbage collector.</p>
<p>I added the <tt class="docutils literal">tracemalloc</tt> module to Python 3.4; it has an interesting
<tt class="docutils literal">tracemalloc.get_object_traceback()</tt> function. If tracemalloc traced the
allocation of an object, it can later provide the traceback where the
object was allocated.</p>
<p>I wanted to modify the <tt class="docutils literal">warnings</tt> module to call
<tt class="docutils literal">get_object_traceback()</tt>, but I noticed that it wasn't possible
to easily extend the <tt class="docutils literal">warnings</tt> API, because this module allows overriding
the <tt class="docutils literal">showwarning()</tt> and <tt class="docutils literal">formatwarning()</tt> functions, and these
functions have a fixed number of parameters. Example:</p>
<pre class="literal-block">
def showwarning(message, category, filename, lineno, file=None, line=None):
    ...
</pre>
<p>With the issue #26568, I added new <tt class="docutils literal">_showwarnmsg()</tt> and <tt class="docutils literal">_formatwarnmsg()</tt>
functions to the warnings module which get a <tt class="docutils literal">warnings.WarningMessage</tt> object
instead of a list of parameters:</p>
<pre class="literal-block">
def _showwarnmsg(msg):
    ...
</pre>
<p>I added a <tt class="docutils literal">source</tt> attribute to <tt class="docutils literal">warnings.WarningMessage</tt> (issue #26567)
and a new optional <tt class="docutils literal">source</tt> parameter to <tt class="docutils literal">warnings.warn()</tt> (issue #26604):
the leaked resource object. I modified <tt class="docutils literal">_formatwarnmsg()</tt> to log the
traceback where the resource was allocated, if available.</p>
<p>The tricky part was to fix corner cases when the following functions of the
<tt class="docutils literal">warnings</tt> module are overridden:</p>
<ul class="simple">
<li><tt class="docutils literal">formatwarning()</tt>, <tt class="docutils literal">showwarning()</tt></li>
<li><tt class="docutils literal">_formatwarnmsg()</tt>, <tt class="docutils literal">_showwarnmsg()</tt></li>
</ul>
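<p>Putting the pieces together, a minimal sketch of the new <tt class="docutils literal">source</tt> parameter
(the <tt class="docutils literal">Resource</tt> class is made up for the example):</p>

```python
import tracemalloc
import warnings

tracemalloc.start(5)

class Resource:
    """Hypothetical leaked resource."""
    def __init__(self):
        self.payload = bytearray(1000)

res = Resource()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # source= attaches the leaked object to the warning, so
    # _formatwarnmsg() can ask tracemalloc where it was allocated.
    warnings.warn("unclosed resource", ResourceWarning, source=res)

msg = caught[0]
```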
</div>
<div class="section" id="set-the-source-parameter">
<h3>Set the source parameter</h3>
<p>I started to modify modules to set the source parameter when logging
<tt class="docutils literal">ResourceWarning</tt> warnings.</p>
<p>The easy part was to modify the <tt class="docutils literal">asyncore</tt>, <tt class="docutils literal">asyncio</tt> and <tt class="docutils literal">_pyio</tt> modules to
set the <tt class="docutils literal">source</tt> parameter. These modules are implemented in Python; the
change was just to add <tt class="docutils literal">source=self</tt>. Example of an <tt class="docutils literal">asyncio</tt> destructor:</p>
<pre class="literal-block">
def __del__(self):
    if not self.is_closed():
        warnings.warn("unclosed event loop %r" % self, ResourceWarning,
                      source=self)
        if not self.is_running():
            self.close()
</pre>
<p>Note: The warning is logged before the resource is closed to provide more
information in <tt class="docutils literal">repr()</tt>. Many objects clear most information in their
<tt class="docutils literal">close()</tt> method.</p>
<p>Modifying C modules was trickier than expected. I had to implement
"finalizers" (<a class="reference external" href="https://www.python.org/dev/peps/pep-0442/">PEP 442: Safe object finalization</a>) for the <tt class="docutils literal">_socket.socket</tt> type
(issue #26590) and for the <tt class="docutils literal">os.scandir()</tt> iterator (issue #26603).</p>
</div>
<div class="section" id="more-reliable-warnings">
<h3>More reliable warnings</h3>
<p>The Python shutdown process is complex, and some Python functions are broken
during shutdown. I enhanced the warnings module to handle these failures
gracefully and try to log warnings anyway.</p>
<p>I modified <tt class="docutils literal">warnings.formatwarning()</tt> to catch <tt class="docutils literal">linecache.getline()</tt>
failures when formatting the traceback.</p>
<p>Logging the resource traceback is complex, so I only implemented it in Python.
Python tries to use the Python <tt class="docutils literal">warnings</tt> module if it was imported, or falls
back on the C <tt class="docutils literal">_warnings</tt> module. To get the resource traceback at Python
shutdown, I modified the C module: <tt class="docutils literal">_warnings.warn_explicit()</tt> now tries to
import the Python warnings module if the source parameter is set, to be able to
log the traceback where the source was allocated (issue #26592).</p>
</div>
<div class="section" id="fix-resourcewarning-warnings">
<h3>Fix ResourceWarning warnings</h3>
<p>Since it became easy to debug these warnings, I fixed some of them in the
Python test suite:</p>
<ul class="simple">
<li>Issue #26620: Fix ResourceWarning in test_urllib2_localnet. Use context
manager on urllib objects and use self.addCleanup() to cleanup resources even
if a test is interrupted with CTRL+c</li>
<li>Issue #25654: multiprocessing: open file with <tt class="docutils literal">closefd=False</tt> to avoid
ResourceWarning. _test_multiprocessing: open file with <tt class="docutils literal">O_EXCL</tt> to detect
bugs in tests (if a previous test forgot to remove TESTFN).
<tt class="docutils literal">test_sys_exit()</tt>: remove TESTFN after each loop iteration</li>
<li>Fix <tt class="docutils literal">ResourceWarning</tt> in test_unittest when interrupted</li>
</ul>
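<p>The <tt class="docutils literal">addCleanup()</tt> pattern mentioned above looks like this (a made-up test,
not the actual test_urllib2_localnet code):</p>

```python
import os
import tempfile
import unittest

class ExampleTest(unittest.TestCase):
    def test_read(self):
        # addCleanup() callbacks run in LIFO order even if the test is
        # interrupted, so the file is closed before its path is removed,
        # and no ResourceWarning is emitted.
        fd, path = tempfile.mkstemp()
        self.addCleanup(os.remove, path)
        f = open(fd, "w")
        self.addCleanup(f.close)
        f.write("data")

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest))
```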
</div>
</div>
<div class="section" id="pymem-malloc-now-fails-if-the-gil-is-not-held">
<h2>PyMem_Malloc() now fails if the GIL is not held</h2>
<p>Since using the small object allocator (<tt class="docutils literal">pymalloc</tt>) for dictionary key
storage showed a speedup for the dict type (issue #23601), I proposed to
generalize the change and use <tt class="docutils literal">pymalloc</tt> for <tt class="docutils literal">PyMem_Malloc()</tt>: <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2016-February/143084.html">[Python-Dev]
Modify PyMem_Malloc to use pymalloc for performance</a>.</p>
<p>The main issue was that with this change, <tt class="docutils literal">PyMem_Malloc()</tt> now requires
the GIL to be held, whereas it didn't before since it called
<tt class="docutils literal">malloc()</tt> directly.</p>
<div class="section" id="check-if-the-gil-is-held">
<h3>Check if the GIL is held</h3>
<p>CPython has a <tt class="docutils literal">PyGILState_Check()</tt> function to check if the GIL is held.
Problem: the function doesn't work with subinterpreters: see issues #10915 and
#15751.</p>
<p>I added an internal flag to <tt class="docutils literal">PyGILState_Check()</tt> (issue #26558) to skip the
test. The flag value is false at startup, set to true once the GIL is fully
initialized (Python initialization), set to false again when the GIL is
destroyed (Python finalization). The flag is also set to false when the first
subinterpreter is created.</p>
<p>This hack works around the <tt class="docutils literal">PyGILState_Check()</tt> limitations, allowing
<tt class="docutils literal">PyGILState_Check()</tt> to be called anytime to catch more bugs earlier.</p>
<p><tt class="docutils literal">_Py_dup()</tt>, <tt class="docutils literal">_Py_fstat()</tt>, <tt class="docutils literal">_Py_read()</tt> and <tt class="docutils literal">_Py_write()</tt> are
low-level helper functions for system functions, but these functions require
the GIL to be held. Thanks to the <tt class="docutils literal">PyGILState_Check()</tt> enhancement, it
became possible to check the GIL using an assertion.</p>
</div>
<div class="section" id="pymem-malloc-and-gil">
<h3>PyMem_Malloc() and GIL</h3>
<p>Issue #26563: Debug hooks on Python memory allocators now raise a fatal error
if memory allocator functions like PyMem_Malloc() and PyObject_Malloc() are
called without holding the GIL.</p>
<p>The change spotted two bugs which I fixed:</p>
<ul class="simple">
<li>Issue #26563: Replace PyMem_Malloc() with PyMem_RawMalloc() in the Windows
implementation of os.stat(), the code is called without holding the
GIL.</li>
<li>Issue #26563: Fix usage of PyMem_Malloc() in overlapped.c. Replace
PyMem_Malloc()/PyMem_Free() with PyMem_RawMalloc()/PyMem_RawFree() since
PostToQueueCallback() frees the buffer in a new C thread which doesn't hold
the GIL.</li>
</ul>
<p>I wasn't able to switch <tt class="docutils literal">PyMem_Malloc()</tt> to <tt class="docutils literal">pymalloc</tt> this quarter,
since implementing the requested checks and testing third party modules took a
lot of time.</p>
</div>
<div class="section" id="fatal-error-and-faulthandler">
<h3>Fatal error and faulthandler</h3>
<p>I enhanced the faulthandler module to work in non-Python threads (issue
#26563). I fixed <tt class="docutils literal">Py_FatalError()</tt> when called without holding the GIL: it no
longer tries to print the current exception or to flush stdout and stderr; it
only dumps the traceback of Python threads.</p>
</div>
</div>
<div class="section" id="interesting-bug-reentrant-flag-in-tracemalloc">
<h2>Interesting bug: reentrant flag in tracemalloc</h2>
<p>A bug annoyed me a lot: a random assertion error related to a reentrant flag in
the _tracemalloc module.</p>
<p>Story starting in the <a class="reference external" href="http://bugs.python.org/issue26588#msg262125">middle of the issue #26588 (2016-03-21)</a>. While working on issue #26588,
"_tracemalloc: add support for multiple address spaces (domains)", I noticed an
assertion failure in set_reentrant(), a helper function to set a <em>Thread Local
Storage</em> (TLS) variable, on a buildbot:</p>
<pre class="literal-block">
python: ./Modules/_tracemalloc.c:195: set_reentrant:
Assertion `PyThread_get_key_value(tracemalloc_reentrant_key) == ((PyObject *) &_Py_TrueStruct)' failed.
</pre>
<p>I was unable to reproduce the bug on my Fedora 23 (AMD64). After changes to my
patch, I pushed it the day after, but the assertion failed again. I added
assertions and debug information. More failures followed, including an
interesting one on Windows, which uses a single process.</p>
<p>I added an assertion in tracemalloc_init() to ensure that the reentrant flag
is set at the end of the function. The reentrant flag was no longer set at
tracemalloc_start() entry, for an unknown reason. I changed the module
initialization to not call tracemalloc_init() anymore; it's only called by
tracemalloc.start().</p>
<p>"The bug was seen on 5 buildbots yet: PPC Fedora, AMD64 Debian, s390x RHEL,
AMD64 Windows, x86 Ubuntu."</p>
<p>I finally understood and fixed the bug with the <a class="reference external" href="https://hg.python.org/cpython/rev/af1c1149784a">change af1c1149784a</a>: tracemalloc_start() and
tracemalloc_stop() don't clear/set the reentrant flag anymore.</p>
<p>The problem was that I expected the tracemalloc_init() and tracemalloc_start()
functions to always be called in the same thread, whereas it turned out that
tracemalloc_init() was called in thread A when the tracemalloc module was
imported, while tracemalloc_start() was called in thread B.</p>
</div>
<div class="section" id="other-commits">
<h2>Other commits</h2>
<div class="section" id="enhancements">
<h3>Enhancements</h3>
<p>The developers of the <tt class="docutils literal">vmprof</tt> profiler asked me to expose the atomic
variable <tt class="docutils literal">_PyThreadState_Current</tt>. The private variable was removed from the
Python 3.5.1 API because the implementation of atomic variables depends on the
compiler, compiler options, etc., and so caused compilation issues. I added a
new private <tt class="docutils literal">_PyThreadState_UncheckedGet()</tt> function (issue #26154) which
gets the value of the variable without exposing its implementation.</p>
<p>Other enhancements:</p>
<ul class="simple">
<li>Issue #26099: The site module now writes an error to stderr if the
sitecustomize module can be imported but executing it raises an
ImportError. Same change for usercustomize.</li>
<li>Issue #26516: Enhance Python memory allocators documentation. Add link to
PYTHONMALLOCSTATS environment variable. Add parameters to PyMem macros like
PyMem_MALLOC().</li>
<li>Issue #26569: Fix pyclbr.readmodule() and pyclbr.readmodule_ex() to support
importing packages.</li>
<li>Issue #26564, #26516, #26563: Enhance documentation on memory allocator debug
hooks.</li>
<li>doctest now supports packages. Issue #26641: doctest.DocFileTest and
doctest.testfile() now support packages (a module split into multiple
directories) for the package parameter.</li>
</ul>
</div>
<div class="section" id="bugfixes">
<h3>Bugfixes</h3>
<p>Issue #25843: When compiling code, don't merge constants if they are equal but
have different types. For example, <tt class="docutils literal">f1, f2 = lambda: 1, lambda: 1.0</tt> is now
correctly compiled to two different functions: <tt class="docutils literal">f1()</tt> returns <tt class="docutils literal">1</tt> (int) and
<tt class="docutils literal">f2()</tt> returns <tt class="docutils literal">1.0</tt> (float), even if 1 and 1.0 are equal.</p>
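<p>This can be checked directly:</p>

```python
# Each lambda keeps its own, correctly typed constant after the fix.
f1, f2 = lambda: 1, lambda: 1.0
print(type(f1()).__name__, type(f2()).__name__)  # → int float
```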
<p>Other fixes:</p>
<ul class="simple">
<li>Issue #26101: Fix test_compilepath() of test_compileall. Exclude Lib/test/
from sys.path in test_compilepath(). The directory contains invalid Python
files like Lib/test/badsyntax_pep3120.py, whereas the test ensures that all
files can be compiled.</li>
<li>Issue #24520: Replace fpgetmask() with fedisableexcept(). On FreeBSD,
fpgetmask() was deprecated a long time ago; fedisableexcept() is now
preferred.</li>
<li>Issue #26161: Use Py_uintptr_t instead of void* for atomic pointers in
pyatomic.h. Use atomic_uintptr_t when &lt;stdatomic.h&gt; is used. Using void*
causes compilation warnings depending on which implementation of atomic types
is used.</li>
<li>Issue #26637: The importlib module now raises an ImportError rather than a
TypeError if __import__() is called during the Python shutdown process when
sys.path has already been cleared (set to None).</li>
<li>doctest: fix _module_relative_path() error message. Write the module name
rather than &lt;module&gt; in the error message, if the module has no __file__
attribute (ex: a package).</li>
</ul>
</div>
<div class="section" id="fix-type-downcasts-on-windows-64-bit">
<h3>Fix type downcasts on Windows 64-bit</h3>
<p>In my spare time, I'm trying to fix a few compiler warnings on Windows 64-bit
where the C <tt class="docutils literal">long</tt> type is only 32-bit, whereas pointers are <tt class="docutils literal"><span class="pre">64-bit</span></tt> long:</p>
<ul class="simple">
<li>posix_getcwd(): limit to INT_MAX on Windows. It's mostly to fix a compiler
warning; I don't think that Windows supports current
working directories larger than 2 GB :-)</li>
<li>_pickle: Fix load_counted_tuple(), use Py_ssize_t for size. Fix a warning on
Windows 64-bit.</li>
<li>getpathp.c: fix compiler warning, wcsnlen_s() result type is size_t.</li>
<li>compiler.c: fix compiler warnings on Windows</li>
<li>_msi.c: try to fix compiler warnings</li>
<li>longobject.c: fix compilation warning on Windows 64-bit. We know that
Py_SIZE(b) is -1 or 1 and so fits into the sdigit type.</li>
<li>On Windows, socket.setsockopt() now raises an OverflowError if the socket
option is larger than INT_MAX bytes.</li>
</ul>
</div>
<div class="section" id="unicode-bugfixes">
<h3>Unicode bugfixes</h3>
<ul class="simple">
<li>Issue #26227: On Windows, getnameinfo(), gethostbyaddr() and
gethostbyname_ex() functions of the socket module now decode the hostname
from the ANSI code page rather than UTF-8.</li>
<li>Issue #26217: Unicode resize_compact() must set wstr_length to 0 after
freeing the wstr string. Otherwise, an assertion fails in
_PyUnicode_CheckConsistency().</li>
<li>Issue #26464: Fix str.translate() when string is ASCII and first replacements
removes characters, but next replacements use a non-ASCII character or a
string longer than 1 character. Regression introduced in Python 3.5.0.</li>
</ul>
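<p>The fixed <tt class="docutils literal">str.translate()</tt> case can be reproduced with a small example:</p>

```python
# ASCII input where the first mapping deletes "a" and a later mapping
# replaces "b" with a non-ASCII character: the case fixed here.
table = {ord("a"): None, ord("b"): "é"}
print("abc".translate(table))  # → éc
```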
</div>
<div class="section" id="buildbot-tests">
<h3>Buildbot, tests</h3>
<p>Just to give you an idea of the work required to keep a working CI, here is the
list of changes I made in a single quarter to make tests and Python buildbots
more reliable.</p>
<ul class="simple">
<li>Issue #26610: Skip test_venv.test_with_pip() if ctypes is missing</li>
<li>test_asyncio: fix test_timeout_time(). Accept time delta up to 0.12 second,
instead of 0.11, for the "AMD64 FreeBSD 9.x" buildbot slave.</li>
<li>Issue #13305: Always test datetime.datetime.strftime("%4Y") for years < 1900.
Change quickly reverted, strftime("%4Y") fails on most platforms.</li>
<li>Issue #17758: Skip test_site if site.USER_SITE directory doesn't exist and
cannot be created.</li>
<li>Fix test_venv on FreeBSD buildbot. Ignore pip warning in
test_venv.test_with_venv().</li>
<li>Issue #26566: Rewrite test_signal.InterProcessSignalTests. Don't use
os.fork() with a subprocess to not inherit existing signal handlers or
threads: start from a fresh process. Use a timeout of 10 seconds to wait for
the signal instead of 1 second</li>
<li>Issue #26538: regrtest: Fix module.__path__. libregrtest: Fix setup_tests()
to keep module.__path__ type (_NamespacePath), don't convert to a list.
Add _NamespacePath.__setitem__() method to importlib._bootstrap_external.</li>
<li>regrtest: add time to output. Timestamps should help to debug slow buildbots,
and timeout and hang on buildbots.</li>
<li>regrtest: add timeout to main process when using -jN. libregrtest: add a
watchdog to run_tests_multiprocess() using faulthandler.dump_traceback_later().</li>
<li>Makefile: change default value of TESTTIMEOUT from 1 hour to 15 min.
The whole test suite takes 6 minutes on my laptop. It takes less than 30
minutes on most buildbots. The TESTTIMEOUT is the timeout for a single test
file.</li>
<li>Buildbots: change also Windows timeout from 1 hour to 15 min</li>
<li>regrtest: display test duration in sequential mode. Only display duration if
a test takes more than 30 seconds.</li>
<li>Issue #18787: Try to fix test_spwd on OpenIndiana. Try to get the "root"
entry which should exist on all UNIX instead of "bin" which doesn't exist on
OpenIndiana.</li>
<li>regrtest: fix --fromfile feature. Update code for the new regrtest output
format. Also enhance the test_regrtest test on --fromfile</li>
<li>regrtest: mention if tests run sequentially or in parallel</li>
<li>regrtest: when parallel tests are interrupted, display progress</li>
<li>support.temp_dir(): call support.rmtree() instead of shutil.rmtree(). Try
harder to remove directories on Windows.</li>
<li>rt.bat: use -m test instead of Lib\test\regrtest.py</li>
<li>Refactor regrtest.</li>
<li>Fix test_warnings.test_improper_option(). test_warnings: only run
test_improper_option() and test_warnings_bootstrap() once. The unit test
doesn't depend on self.module.</li>
<li>Fix test_os.test_symlink(): remove created symlink.</li>
<li>Issue #26643: Add missing shutil resources to regrtest.py</li>
<li>test_urllibnet: set timeout on test_fileno(). Use the default timeout of 30
seconds to avoid blocking forever.</li>
<li>Issue #26295: When using "python3 -m test --testdir=TESTDIR", regrtest
doesn't add "test." prefix to test module names. regrtest also prepends
testdir to sys.path.</li>
<li>Issue #26295: test_regrtest now uses a temporary directory</li>
</ul>
</div>
<div class="section" id="contributions">
<h3>Contributions</h3>
<p>I also pushed a few changes written by other contributors:</p>
<ul class="simple">
<li>Issue #25907: Use {% trans %} tags in HTML templates to ease the translation
of the documentation. The tag comes from Jinja templating system, used by
Sphinx. Patch written by <strong>Julien Palard</strong>.</li>
<li>Issue #26248: Enhance os.scandir() doc. Patch written by <strong>Ben Hoyt</strong>.</li>
<li>Fix error message in asyncio.selector_events. Patch written by <strong>Carlo
Beccarini</strong>.</li>
<li>Issue #16851: Fix inspect.ismethod() doc, return also True if object is an
unbound method. Patch written by <strong>Anna Koroliuk</strong>.</li>
<li>Issue #26574: Optimize bytes.replace(b'', b'.') and bytearray.replace(b'', b'.'):
up to 80% faster. Patch written by <strong>Josh Snider</strong>.</li>
</ul>
</div>
</div>
Analysis of a Python performance issue2016-11-19T00:30:00+01:002016-11-19T00:30:00+01:00Victor Stinnertag:vstinner.github.io,2016-11-19:/analysis-python-performance-issue.html<p>I am working on the CPython benchmark suite (<a class="reference external" href="https://github.com/python/performance">performance</a>) and I run the benchmark suite to
upload results to <a class="reference external" href="http://speed.python.org/">speed.python.org</a>. While
analyzing results, I noticed a temporary peak in the <tt class="docutils literal">call_method</tt>
benchmark on October 19th:</p>
<img alt="call_method microbenchmark" src="https://vstinner.github.io/images/call_method.png" />
<p>The graphic shows the performance of the <tt class="docutils literal">call_method</tt> microbenchmark between
Feb 29, 2016 …</p><p>I am working on the CPython benchmark suite (<a class="reference external" href="https://github.com/python/performance">performance</a>) and I run the benchmark suite to
upload results to <a class="reference external" href="http://speed.python.org/">speed.python.org</a>. While
analyzing results, I noticed a temporary peak in the <tt class="docutils literal">call_method</tt>
benchmark on October 19th:</p>
<img alt="call_method microbenchmark" src="https://vstinner.github.io/images/call_method.png" />
<p>The graphic shows the performance of the <tt class="docutils literal">call_method</tt> microbenchmark between
Feb 29, 2016 and November 17, 2016 on the <tt class="docutils literal">default</tt> branch of CPython. The average
is around 17.2 ms, whereas the peak is at 29.0 ms: <strong>68% slower</strong>!</p>
<p>The server has two "Intel(R) Xeon(R) CPU X5680 @ 3.33GHz" CPUs, total: 24
logical cores (12 physical cores with HyperThreading). This CPU was launched in
2010 and based on the <a class="reference external" href="https://en.wikipedia.org/wiki/Gulftown">Westmere-EP microarchitecture</a>. Westmere-EP is based on Westmere,
which is the 32 nm shrink of the Nehalem microarchitecture.</p>
<div class="section" id="reproduce-results">
<h2>Reproduce results</h2>
<p>Before going too far, the first step is to validate that the results are
reproducible: reboot the computer, recompile Python, run the benchmark again.</p>
<p>Instead of running the full benchmark suite and installing Python, we will run
the benchmark manually, directly using the Python binary freshly built in its
source code directory.</p>
<p>Interesting dots on the graphic (can be seen at speed.python.org, not on the
screenshot):</p>
<ul class="simple">
<li>678fe178da0d, Oct 09, 17.0 ms: "Fast"</li>
<li>1ce50f7027c1, Oct 19, 28.9 ms: "Slow"</li>
<li>36af3566b67a, Nov 3, 16.9 ms: Fast again</li>
</ul>
<p>I use the following directories:</p>
<ul class="simple">
<li>~/perf: GitHub haypo/perf project</li>
<li>~/performance: GitHub python/performance project</li>
<li>~/cpython: Mercurial CPython repository</li>
</ul>
<p>Tune the system for benchmarks:</p>
<pre class="literal-block">
sudo python3 -m perf system tune
</pre>
<p>Note: all <tt class="docutils literal">system</tt> commands in this article are optional. They help to reduce
the operating system jitter (making benchmarks more reliable).</p>
<p>Fast:</p>
<pre class="literal-block">
$ hg up -C -r 678fe178da0d
$ ./configure --with-lto -C && make clean && make
$ mv python python-fast
$ PYTHONPATH=~/perf ./python-fast ~/performance/performance/benchmarks/bm_call_method.py --inherit-environ=PYTHONPATH --fast
call_method: Median +- std dev: 17.0 ms +- 0.1 ms
</pre>
<p>Slow:</p>
<pre class="literal-block">
$ hg up -C -r 1ce50f7027c1
$ ./configure --with-lto -C && make clean && make
$ mv python python-slow
$ PYTHONPATH=~/perf ./python-slow ~/performance/performance/benchmarks/bm_call_method.py --inherit-environ=PYTHONPATH --fast
call_method: Median +- std dev: 29.3 ms +- 0.9 ms
</pre>
<p>We reproduced the significant benchmark result: 17 ms => 29 ms.</p>
<p>I use <tt class="docutils literal">./configure</tt> and <tt class="docutils literal">make clean</tt> instead of an incremental compilation
(a bare <tt class="docutils literal">make</tt> command) to avoid compilation errors and potential side
effects of incremental compilation.</p>
</div>
<div class="section" id="analysis-with-the-linux-perf-tool">
<h2>Analysis with the Linux perf tool</h2>
<p>To collect perf events, we will run the benchmark with <tt class="docutils literal"><span class="pre">--worker</span></tt> to run a
single process and with <tt class="docutils literal"><span class="pre">-w0</span> <span class="pre">-n100</span></tt> to run the benchmark long enough: 100
samples means at least 10 seconds (a single sample takes at least 100 ms).</p>
<p>First, reset the system configuration to reset the Linux perf configuration:</p>
<pre class="literal-block">
sudo python3 -m perf system reset
</pre>
<p>Note: <tt class="docutils literal">python3 <span class="pre">-m</span> perf system tune</tt> reduces the sampling rate of Linux perf
to reduce operating system jitter, which is why it must be reset before profiling.</p>
</div>
<div class="section" id="perf-stat">
<h2>perf stat</h2>
<p>Command to get general statistics on the benchmark:</p>
<pre class="literal-block">
$ perf stat ./python-slow ~/performance/performance/benchmarks/bm_call_method.py --inherit-environ=PYTHONPATH --worker -v -w0 -n100
</pre>
<p>"Fast" results:</p>
<pre class="literal-block">
Performance counter stats for ./python-fast:
3773.585194 task-clock (msec) # 0.998 CPUs utilized
369 context-switches # 0.098 K/sec
0 cpu-migrations # 0.000 K/sec
8,300 page-faults # 0.002 M/sec
12,981,234,867 cycles # 3.440 GHz [83.27%]
1,460,980,720 stalled-cycles-frontend # 11.25% frontend cycles idle [83.36%]
435,806,788 stalled-cycles-backend # 3.36% backend cycles idle [66.72%]
29,982,530,201 instructions # 2.31 insns per cycle
# 0.05 stalled cycles per insn [83.40%]
5,613,631,616 branches # 1487.612 M/sec [83.40%]
16,006,564 branch-misses # 0.29% of all branches [83.27%]
3.780064486 seconds time elapsed
</pre>
<p>"Slow" results:</p>
<pre class="literal-block">
Performance counter stats for ./python-slow:
5906.239860 task-clock (msec) # 0.998 CPUs utilized
556 context-switches # 0.094 K/sec
0 cpu-migrations # 0.000 K/sec
8,393 page-faults # 0.001 M/sec
20,651,474,102 cycles # 3.497 GHz [83.36%]
8,480,803,345 stalled-cycles-frontend # 41.07% frontend cycles idle [83.37%]
4,247,826,420 stalled-cycles-backend # 20.57% backend cycles idle [66.64%]
30,011,465,614 instructions # 1.45 insns per cycle
# 0.28 stalled cycles per insn [83.32%]
5,612,485,730 branches # 950.264 M/sec [83.36%]
13,584,136 branch-misses # 0.24% of all branches [83.29%]
5.915402403 seconds time elapsed
</pre>
<p>Significant differences, Fast => Slow:</p>
<ul class="simple">
<li>Instruction per cycle: 2.31 => 1.45</li>
<li>stalled-cycles-frontend: <strong>11.25% => 41.07%</strong></li>
<li>stalled-cycles-backend: <strong>3.36% => 20.57%</strong></li>
</ul>
<p>The increase of stalled cycles is interesting. Since the code is supposed to be
identical, it probably means that fetching instructions is slower. It sounds
like an issue with CPU caches.</p>
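<p>As a sanity check, the figures above can be recomputed from the raw counters. A quick Python sketch (numbers copied from the <tt class="docutils literal">perf stat</tt> output above):</p>

```python
# Recompute instructions per cycle (IPC) and the frontend-stall
# percentage from the perf stat counters shown above.
fast = {"cycles": 12_981_234_867, "insns": 29_982_530_201,
        "stalled_frontend": 1_460_980_720}
slow = {"cycles": 20_651_474_102, "insns": 30_011_465_614,
        "stalled_frontend": 8_480_803_345}

for name, c in (("fast", fast), ("slow", slow)):
    ipc = c["insns"] / c["cycles"]
    stall = c["stalled_frontend"] / c["cycles"] * 100
    print(f"{name}: {ipc:.2f} insns per cycle, "
          f"{stall:.2f}% frontend cycles idle")
```

<p>Both builds execute almost the same number of instructions; only the number of cycles (and so the stalls) differs.</p>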
</div>
<div class="section" id="statistics-on-the-cpu-l1-instruction-cache">
<h2>Statistics on the CPU L1 instruction cache</h2>
<p>The <tt class="docutils literal">perf list</tt> command can be used to get the name of events collecting
statistics on the CPU L1 instruction cache:</p>
<pre class="literal-block">
$ perf list | grep L1
L1-icache-loads [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
(...)
</pre>
<p>Collect statistics on the CPU L1 instruction cache:</p>
<pre class="literal-block">
PYTHONPATH=~/perf perf stat -e L1-icache-loads,L1-icache-load-misses ./python-slow ~/performance/performance/benchmarks/bm_call_method.py --inherit-environ=PYTHONPATH --worker -w0 -n10
</pre>
<p>"Fast" statistics:</p>
<pre class="literal-block">
Performance counter stats for './python-fast (...)':
10,134,106,571 L1-icache-loads
10,917,606 L1-icache-load-misses # 0.11% of all L1-icache hits
3.775067668 seconds time elapsed
</pre>
<p>"Slow" statistics:</p>
<pre class="literal-block">
Performance counter stats for './python-slow (...)':
10,753,371,258 L1-icache-loads
848,511,308 L1-icache-load-misses # 7.89% of all L1-icache hits
6.020490449 seconds time elapsed
</pre>
<p>Cache misses on the L1 instruction cache: <strong>0.11%</strong> (Fast) => <strong>7.89%</strong> (Slow).</p>
<p>The slow Python has a <strong>71.7x higher L1 instruction cache miss rate</strong> than the
fast Python! It can explain the significant performance drop.</p>
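<p>The miss rates can be double-checked from the raw counters. A quick Python sketch (numbers copied from the <tt class="docutils literal">perf stat</tt> output above):</p>

```python
# L1 instruction cache miss rates, computed from the counters above.
fast_loads, fast_misses = 10_134_106_571, 10_917_606
slow_loads, slow_misses = 10_753_371_258, 848_511_308

fast_rate = fast_misses / fast_loads * 100  # ~0.11%
slow_rate = slow_misses / slow_loads * 100  # ~7.89%

print(f"fast: {fast_rate:.2f}% misses, slow: {slow_rate:.2f}% misses")
# Dividing the rounded rates gives the 71.7x figure:
print(f"ratio of rounded rates: {7.89 / 0.11:.1f}x")
```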
<div class="section" id="perf-report">
<h3>perf report</h3>
<p>The <tt class="docutils literal">perf record</tt> command can be used to collect statistics on the functions
where the benchmark spends most of its time. Commands:</p>
<pre class="literal-block">
PYTHONPATH=~/perf perf record ./python ~/performance/performance/benchmarks/bm_call_method.py --inherit-environ=PYTHONPATH --worker -v -w0 -n100
perf report
</pre>
<p>Output:</p>
<pre class="literal-block">
40.27% python python [.] _PyEval_EvalFrameDefault
10.30% python python [.] call_function
10.21% python python [.] PyFrame_New
8.56% python python [.] frame_dealloc
5.51% python python [.] PyObject_GenericGetAttr
(...)
</pre>
<p>Almost 75% of the time is spent in these 5 functions.</p>
</div>
<div class="section" id="system-tune">
<h3>system tune</h3>
<p>To run benchmarks again, re-tune the system for benchmarks:</p>
<pre class="literal-block">
sudo python3 -m perf system tune
</pre>
</div>
</div>
<div class="section" id="hg-bisect">
<h2>hg bisect</h2>
<p>To find the revision which introduces the performance slowdown, we use a
shell script to automate the bisection of the Mercurial history.</p>
<p><tt class="docutils literal">cmd.sh</tt> script checking if a revision is fast or slow:</p>
<pre class="literal-block">
set -e -x
./configure --with-lto -C && make clean && make
rm -f json
PYTHONPATH=~/perf ./python ~/performance/performance/benchmarks/bm_call_method.py --inherit-environ=PYTHONPATH --worker -o json -v
PYTHONPATH=~/perf python3 cmd.py json
</pre>
<p><tt class="docutils literal">cmd.sh</tt> uses the following <tt class="docutils literal">cmd.py</tt> script, which checks if the benchmark
is slow: slow means longer than 23 ms (the average of 17 and 29 ms):</p>
<pre class="literal-block">
import perf, sys
bench = perf.Benchmark.load('json')
bad = (29 + 17) / 2.0
ms = bench.median() * 1e3
if ms >= bad:
    print("BAD! %.1f ms >= %.1f ms" % (ms, bad))
    sys.exit(1)
else:
    print("good: %.1f ms < %.1f ms" % (ms, bad))
</pre>
<p>In the bisection, "good" means "fast" (17 ms), whereas "bad" means "slow" (29
ms). The peak, revision 1ce50f7027c1, is used as the first "bad" revision. The
previous fast revision before the peak is 678fe178da0d, our first "good"
revision.</p>
<p>Commands to identify the first revision which introduced the slowdown:</p>
<pre class="literal-block">
hg bisect --reset
hg bisect -b 1ce50f7027c1
hg bisect -g 678fe178da0d
time hg bisect -c ./cmd.sh
</pre>
<p>3 min 52 sec later:</p>
<pre class="literal-block">
The first bad revision is:
changeset: 104531:83877018ef97
parent: 104528:ce85a1f129e3
parent: 104530:2d352bf2b228
user: Serhiy Storchaka <storchaka@gmail.com>
date: Tue Oct 18 13:27:54 2016 +0300
files: Misc/NEWS
description:
Issue #23782: Fixed possible memory leak in _PyTraceback_Add() and exception
loss in PyTraceBack_Here().
</pre>
<p>Thank you <tt class="docutils literal">hg bisect</tt>! I love this tool.</p>
<p>Even if I trust <tt class="docutils literal">hg bisect</tt>, I don't trust benchmarks, so I recheck manually:</p>
<p>Slow:</p>
<pre class="literal-block">
$ hg up -C -r 83877018ef97
$ ./configure --with-lto -C && make clean && make
$ PYTHONPATH=~/perf ./python ~/performance/performance/benchmarks/bm_call_method.py --inherit-environ=PYTHONPATH --fast
call_method: Median +- std dev: 29.4 ms +- 1.8 ms
</pre>
<p>Use <tt class="docutils literal">hg parents</tt> to get the latest fast revision:</p>
<pre class="literal-block">
$ hg parents -r 83877018ef97
changeset: 104528:ce85a1f129e3
(...)
changeset: 104530:2d352bf2b228
branch: 3.6
(...)
</pre>
<p>Check the parent:</p>
<pre class="literal-block">
$ hg up -C -r ce85a1f129e3
$ ./configure --with-lto -C && make clean && make
$ PYTHONPATH=~/perf ./python ~/performance/performance/benchmarks/bm_call_method.py --inherit-environ=PYTHONPATH --fast
call_method: Median +- std dev: 17.1 ms +- 0.1 ms
</pre>
<p>The revision ce85a1f129e3 is fast and the following revision 83877018ef97 is
slow. <strong>The revision 83877018ef97 introduced the slowdown</strong>. We found it!</p>
</div>
<div class="section" id="analysis-of-the-revision-introducing-the-slowdown">
<h2>Analysis of the revision introducing the slowdown</h2>
<p>The <a class="reference external" href="https://hg.python.org/cpython/rev/83877018ef97/">revision 83877018ef97</a>
changes two files: Misc/NEWS and Python/traceback.c. The NEWS file is only
documentation and so cannot impact performance. Python/traceback.c is part
of the C code and so is more interesting.</p>
<p>The commit only changes two C functions: <tt class="docutils literal">PyTraceBack_Here()</tt> and
<tt class="docutils literal">_PyTraceback_Add()</tt>, but <tt class="docutils literal">perf report</tt> didn't show these functions as "hot".
In fact, these functions are never called by the benchmark.</p>
<p><strong>The commit doesn't touch the C code used in the benchmark.</strong></p>
<p>An unrelated C change impacting performance reminds me of my previous <a class="reference external" href="https://vstinner.github.io/journey-to-stable-benchmark-deadcode.html">deadcode
horror story</a>. The performance
difference is probably caused by <strong>"code placement"</strong>: <tt class="docutils literal">perf stat</tt> showed a
significant increase of the cache miss rate on the L1 instruction cache.</p>
</div>
<div class="section" id="use-gcc-attribute-hot">
<h2>Use GCC __attribute__((hot))</h2>
<p>Using PGO compilation was the solution for deadcode, but PGO doesn't work on
Ubuntu 14.04 (the OS used by the benchmark server, speed-python) and PGO seems
to make benchmarks less reliable.</p>
<p>I wanted to try something else: mark hot functions using the GCC
<tt class="docutils literal"><span class="pre">__attribute__((hot))</span></tt> attribute. PGO compilation does this automatically.</p>
<p>This attribute only has an impact on code placement: where functions are
loaded in memory. The flag puts functions in the <tt class="docutils literal">.text.hot</tt> ELF section
rather than the <tt class="docutils literal">.text</tt> ELF section. Grouping hot functions in the same
section reduces the distance between them and so enhances the usage of CPU
caches.</p>
<p>I wrote and then pushed a patch in the <a class="reference external" href="http://bugs.python.org/issue28618">issue #28618</a>: "Decorate hot functions using
__attribute__((hot)) to optimize Python".</p>
<p>The patch marks 6 functions as hot:</p>
<ul class="simple">
<li><tt class="docutils literal">_PyEval_EvalFrameDefault()</tt></li>
<li><tt class="docutils literal">call_function()</tt></li>
<li><tt class="docutils literal">_PyFunction_FastCall()</tt></li>
<li><tt class="docutils literal">PyFrame_New()</tt></li>
<li><tt class="docutils literal">frame_dealloc()</tt></li>
<li><tt class="docutils literal">PyErr_Occurred()</tt></li>
</ul>
<p>Let's try the patch:</p>
<pre class="literal-block">
$ hg up -C -r 83877018ef97
$ wget https://hg.python.org/cpython/raw-rev/59b91b4e9506 -O patch
$ patch -p1 < patch
$ ./configure --with-lto -C && make clean && make
$ PYTHONPATH=~/perf ./python ~/performance/performance/benchmarks/bm_call_method.py --inherit-environ=PYTHONPATH --fast
call_method: Median +- std dev: 16.7 ms +- 0.3 ms
</pre>
<p>It's easy to make mistakes and benchmarks are always surprising, so let's retry
without the patch:</p>
<pre class="literal-block">
$ hg up -C -r 83877018ef97
$ ./configure --with-lto -C && make clean && make
$ PYTHONPATH=~/perf ./python ~/performance/performance/benchmarks/bm_call_method.py --inherit-environ=PYTHONPATH --fast
call_method: Median +- std dev: 29.3 ms +- 0.6 ms
</pre>
<p>The check confirms that the GCC attribute fixed the issue!</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>On modern Intel CPUs, the code placement can have a major impact on the
performance of microbenchmarks.</p>
<p>The GCC <tt class="docutils literal"><span class="pre">__attribute__((hot))</span></tt> attribute can be used manually to make "hot
functions" close in memory to enhance the usage of CPU caches.</p>
<p>To learn more about the impact of code placement, see the very good talk by Zia
Ansari (Intel) at the LLVM Developers' Meeting 2016: <a class="reference external" href="https://llvmdevelopersmeetingbay2016.sched.org/event/8YzY/causes-of-performance-instability-due-to-code-placement-in-x86">Causes of Performance
Swings Due to Code Placement in IA</a>.
He describes well "performance swings" like the one described in this article,
and explains how CPUs work internally and how code placement impacts CPU
performance.</p>
</div>
Intel CPUs (part 2): Turbo Boost, temperature, frequency and Pstate C0 bug2016-09-23T23:00:00+02:002016-09-23T23:00:00+02:00Victor Stinnertag:vstinner.github.io,2016-09-23:/intel-cpus-part2.html<p class="first last">Intel CPUs (part 2): Turbo Boost, temperature, frequency and Pstate C0 bug</p>
<p>My first article <a class="reference external" href="https://vstinner.github.io/intel-cpus.html">Intel CPUs</a> is a general
introduction on modern CPU technologies having an impact on benchmarks.</p>
<p>This second article is much more practical, with numbers and a concrete bug
having a major impact on benchmarks: a benchmark suddenly becomes 2x faster!</p>
<p>I will tell you how I first noticed the bug, which tests I ran to analyze the
issue, how I found commands to reproduce the bug, and finally how I identified
the bug.</p>
<div class="section" id="glitch-in-benchmarks">
<h2>"Glitch" in benchmarks</h2>
<p>Last week I ran a benchmark to check if enabling Profile Guided Optimization
(PGO) when compiling Python makes benchmark results less stable. I recompiled
Python 5 times, and after each compilation I ran a benchmark. I tested
different commands and options to compile Python. Everything was fine until
the last benchmark of the last compilation. <strong>The benchmark suddenly became 2
times faster.</strong></p>
<p>Fortunately, my perf module collects a lot of metadata, so I was able to
analyze in depth what happened.</p>
<p>The "glitch" occurred in a benchmark having 400 runs (the benchmark runs in 400
different processes), between run 105 (20.3 ms) and run 106
(11.0 ms).</p>
<p>I noticed that the CPU temperature was between 69°C and 72°C until run 105,
and then decreased from 69°C to 58°C.</p>
<p>The system load slowly increased from 1.25 up to 1.62 around run 108 and
then slowly decreased to 1.00.</p>
<p>The system was not idle while the benchmark was running: I was working on the
PC too! But according to timestamps, it seems like the glitch occurred close to
when I stopped working. When I stopped working, I closed all applications
(except the benchmark running in the background) and turned off my two
monitors.</p>
<p>Well, at this point, it's hard to correlate for sure an event with the major
performance change.</p>
<p>So I started to analyze different factors affecting CPUs and benchmarks: Turbo
Boost, CPU temperature and CPU frequency.</p>
</div>
<div class="section" id="impact-of-turbo-boost-on-benchmarks">
<h2>Impact of Turbo Boost on benchmarks</h2>
<p>Without Turbo Boost, the maximum frequency of the "Intel(R) Core(TM) i7-3520M
CPU @ 2.90GHz" of my laptop is 2.9 GHz. With Turbo Boost, the maximum
frequency is 3.6 GHz if only one core is active, or 3.4 GHz otherwise:</p>
<pre class="literal-block">
$ sudo cpupower frequency-info
...
boost state support:
Supported: yes
Active: yes
3400 MHz max turbo 4 active cores
3400 MHz max turbo 3 active cores
3400 MHz max turbo 2 active cores
3600 MHz max turbo 1 active cores
</pre>
<p>I ran the bm_call_simple.py microbenchmark (CPU-bound) of performance 0.2.2.</p>
<p>Turbo Boost disabled:</p>
<ul class="simple">
<li>1 physical CPU active: 2.9 GHz, Median +- std dev: 14.6 ms +- 0.3 ms</li>
<li>2 physical CPU active: 2.9 GHz, Median +- std dev: 14.7 ms +- 0.5 ms</li>
</ul>
<p>Turbo Boost enabled:</p>
<ul class="simple">
<li>1 physical CPU active: 3.6 GHz, Median +- std dev: 11.8 ms +- 0.3 ms</li>
<li>2 physical CPU active: 3.4 GHz, Median +- std dev: 12.4 ms +- 0.1 ms</li>
</ul>
<p><strong>The maximum performance boost is 19% faster</strong> (14.6 ms => 11.8 ms); the
minimum boost is 15% faster (14.6 ms => 12.4 ms).</p>
<p>Hmm, I don't think that Turbo Boost can explain the bug.</p>
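<p>The speedup percentages can be recomputed from the median timings above; a quick Python check:</p>

```python
# Turbo Boost speedups, computed from the median timings above.
base = 14.6          # ms, Turbo Boost disabled
turbo_1cpu = 11.8    # ms, Turbo Boost enabled, 1 physical CPU active
turbo_2cpu = 12.4    # ms, Turbo Boost enabled, 2 physical CPUs active

max_boost = (base - turbo_1cpu) / base * 100  # ~19%
min_boost = (base - turbo_2cpu) / base * 100  # ~15%
print(f"max boost: {max_boost:.0f}%, min boost: {min_boost:.0f}%")
```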
</div>
<div class="section" id="impact-of-the-cpu-temperature-on-benchmarks">
<h2>Impact of the CPU temperature on benchmarks</h2>
<p>The CPU temperature is mentioned in the Intel Turbo Boost documentation as a
factor used to decide which P-state will be used. I always wanted to check how
the CPU temperature impacts its performance.</p>
<div class="section" id="burn-the-cpu-of-my-desktop-pc">
<h3>Burn the CPU of my desktop PC</h3>
<p>CPU of my desktop PC: "Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz".</p>
<p>I used my <a class="reference external" href="https://github.com/vstinner/misc/blob/master/bin/system_load.py">system_load.py script</a> to generate a
system load higher than 10.</p>
<p>When the fan is cooling the CPU correctly, all cores run at 3.4 GHz (Turbo Boost
was disabled) and the CPU temperature is 66°C.</p>
<p>I used a simple sheet of paper to block the fan of my CPU. Yeah, I really
wanted to <a class="reference external" href="https://www.youtube.com/watch?v=Xf0VuRG7MN4">burn my CPU</a>! More
seriously, I checked the CPU temperature every second using the <tt class="docutils literal">sensors</tt>
command and was prepared to unblock the fan if something went wrong.</p>
<img alt="Sheet of paper blocking the CPU fan" src="https://vstinner.github.io/images/paper_blocks_cpu_fan.jpg" />
<p>After one minute, the CPU reached 97°C. I expected a system crash, smoke or
something worse, but I was disappointed. <strong>At 97°C, I was still able to use my
computer as if everything was fine. The CPU automatically slowed down to the
minimum CPU frequency: 1533 MHz</strong> according to turbostat (the minimum frequency
of this CPU is 1.6 GHz).</p>
<p>When I unblocked the fan, the temperature decreased quickly to go back to its
previous state (62°C) and the CPU frequency quickly increased to 3.4 GHz as
well.</p>
<p>My Intel CPU is really impressive! I didn't expect such efficient
protection against overheating!</p>
</div>
<div class="section" id="burn-my-laptop-cpu">
<h3>Burn my laptop CPU</h3>
<p>I used my system_load.py script to get a system load over 200. I also opened 4
tabs in Firefox playing YouTube videos to also stress the GPU, which is
integrated into the CPU (IGP) on such a laptop.</p>
<img alt="Stress test playing Youtube videos in Firefox, CPU at 102°" src="https://vstinner.github.io/images/burn_cpu_firefox.jpg" />
<p>With such crazy stress test, the CPU temperature was "only" 83°C.</p>
<p>Using a simple tissue, I blocked the air hole used by the CPU fan. <strong>When the
CPU temperature increased from 100°C to 101°C, the CPU frequency slowly started
to decrease from 3391 MHz to 3077 MHz</strong> (with steps between 10 MHz and 50 MHz
every second, or something like that).</p>
<p>When pushing the tissue hard and waiting longer than 5 minutes, the CPU
temperature increased up to 102°C, but the CPU frequency only decreased
from 3.4 GHz (Turbo Mode with 4 active logical CPUs) to 3.1 GHz.</p>
<p>The maximum frequency without Turbo Boost is 2.9 GHz. A frequency higher than
2.9 GHz means that Turbo Mode was enabled! It means that <strong>even when
overheating, the CPU is still fine and able to "overclock" itself!</strong></p>
<p>Again, I was disappointed. With a CPU at 102°C, my laptop was still super fast
and responsive. It seems like mobile CPUs handle overheating even better than
desktop CPUs (which is not surprising at all).</p>
</div>
</div>
<div class="section" id="impact-of-the-cpu-frequency-on-benchmarks">
<h2>Impact of the CPU frequency on benchmarks</h2>
<p>I ran the bm_call_simple.py microbenchmark (CPU-bound) of performance 0.2.2
on my desktop PC.</p>
<p>Command to set the frequency of CPU 0 to the minimum frequency (1.6 GHz):</p>
<pre class="literal-block">
$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq|sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
1600000
</pre>
<p>Command to set the frequency of CPU 0 to the maximum frequency (3.4 GHz):</p>
<pre class="literal-block">
$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq|sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
3400000
</pre>
<ul class="simple">
<li>CPU running at 1.6 GHz (min freq): Median +- std dev: 27.7 ms +- 0.7 ms</li>
<li>CPU running at 3.4 GHz (max freq): Median +- std dev: 12.9 ms +- 0.2 ms</li>
</ul>
<p>The impact of the CPU frequency is quite obvious: <strong>when the CPU frequency is
doubled, the performance is also doubled</strong>. The benchmark is 53% faster (27.7
ms => 12.9 ms).</p>
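<p>The scaling claim can be checked numerically: the timing ratio closely tracks the frequency ratio. A quick Python sketch using the numbers above:</p>

```python
# Compare the frequency ratio with the timing ratio from the runs above.
freq_ratio = 3.4 / 1.6                 # ~2.12x higher frequency
time_ratio = 27.7 / 12.9               # ~2.15x lower timing
speedup = (27.7 - 12.9) / 27.7 * 100   # ~53% faster

print(f"frequency: {freq_ratio:.2f}x, timing: {time_ratio:.2f}x, "
      f"speedup: {speedup:.0f}%")
```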
</div>
<div class="section" id="bug-reproduced-and-then-identified-in-the-linux-cpu-driver">
<h2>Bug reproduced and then identified in the Linux CPU driver</h2>
<p>Two days ago, I ran a very simple "timeit" microbenchmark to try to bisect a
performance regression in Python 3.6 on <tt class="docutils literal">functools.partial</tt>. Again, suddenly,
the microbenchmark became 2x faster!</p>
<p>But this time, I found something: I noticed that running or stopping <tt class="docutils literal">cpupower
monitor</tt> and/or <tt class="docutils literal">turbostat</tt> can "enable" or "disable" the bug.</p>
<p>After a lot of tests, I understood that running the benchmark with turbostat
"disables" the bug, whereas running "cpupower monitor" while running a
benchmark enables the bug.</p>
<p>I reported the bug in the Fedora bug tracker, on the component kernel:
<a class="reference external" href="https://bugzilla.redhat.com/show_bug.cgi?id=1378529">intel_pstate C0 bug on isolated CPUs with the performance governor and
NOHZ_FULL</a>.</p>
<p>It seems like the bug is related to CPU isolation and NOHZ_FULL. The NOHZ_FULL
option is able to fully disable the scheduler clock interruption on isolated
CPUs. I understood that the <tt class="docutils literal">intel_pstate</tt> driver uses a callback on the
scheduler to update the P-state of the CPU. According to an Intel engineer, the
<tt class="docutils literal">intel_pstate</tt> driver was never tested with CPU isolation.</p>
<p>The issue is not fully analyzed yet, but at least I managed to write a list
of commands which reproduces it with a success rate of 100% :-) Moreover, the
Intel engineer suggested adding an extra parameter to the Linux kernel command
line (<tt class="docutils literal">rcu_nocbs=3,7</tt>) which works around the issue.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>This article describes how I found and then identified a bug in the Linux
driver of my CPU.</p>
<p>Summary:</p>
<ul class="simple">
<li>The maximum speedup of Turbo Boost is about 20%</li>
<li>Overheating on a desktop PC can decrease the CPU frequency to its minimum
(half of the maximum in my case), which implies a slowdown of 50%</li>
<li>A bug in the Linux CPU driver suddenly changes the CPU frequency from its
minimum to its maximum (or the opposite), which means a speedup of 50%
(or a slowdown of 50%)</li>
</ul>
<p><strong>To get stable benchmarks, the safest fix for all these issues is probably to
set the CPU frequency of the CPUs used by benchmarks to the minimum.</strong>
It seems like nothing can reduce the frequency of a CPU below its minimum.</p>
<p><strong>When running benchmarks, raw timings and CPU performance don't matter. Only
comparisons between benchmark results and stable performances matter.</strong></p>
</div>
Intel CPUs: P-state, C-state, Turbo Boost, CPU frequency, etc.2016-07-15T12:00:00+02:002016-07-15T12:00:00+02:00Victor Stinnertag:vstinner.github.io,2016-07-15:/intel-cpus.html<p class="first last">Intel CPUs: Hyper-threading, Turbo Boost, CPU frequency, etc.</p>
<p>Ten years ago, most computers were desktop computers designed for best
performance, and their CPU frequency was fixed. Nowadays, most devices are
embedded and use <a class="reference external" href="https://en.wikipedia.org/wiki/Low-power_electronics">low power consumption</a> processors like ARM
CPUs. Power consumption now matters more than peak performance.</p>
<p>Intel CPUs evolved from a single core to multiple physical cores in the same
<a class="reference external" href="https://en.wikipedia.org/wiki/CPU_socket">package</a> and got new features:
<a class="reference external" href="https://en.wikipedia.org/wiki/Hyper-threading">Hyper-threading</a> to run two
threads on the same physical core and <a class="reference external" href="https://en.wikipedia.org/wiki/Intel_Turbo_Boost">Turbo Boost</a> to maximize performance.
CPU cores can be temporarily turned off completely (CPU HALT, frequency of 0) to
reduce power consumption, and the frequency of cores changes regularly
depending on many factors like the workload and the temperature. Power
consumption is now an important part of the design of modern CPUs.</p>
<p>Warning! This article is a summary of what I learned over the last weeks from
various articles. It may be full of mistakes: don't hesitate to report them, so
I can enhance the article! It's hard to find simple articles explaining the
performance of modern Intel CPUs, so I tried to write mine.</p>
<div class="section" id="tools-used-in-this-article">
<h2>Tools used in this article</h2>
<p>This article mentions various tools. Commands to install them on Fedora 24:</p>
<p><tt class="docutils literal">dnf install <span class="pre">-y</span> <span class="pre">util-linux</span></tt>:</p>
<ul class="simple">
<li>lscpu</li>
</ul>
<p><tt class="docutils literal">dnf install <span class="pre">-y</span> <span class="pre">kernel-tools</span></tt>:</p>
<ul class="simple">
<li><a class="reference external" href="http://linux.die.net/man/1/cpupower">cpupower</a></li>
<li>turbostat</li>
</ul>
<p><tt class="docutils literal">sudo dnf install <span class="pre">-y</span> <span class="pre">msr-tools</span></tt>:</p>
<ul class="simple">
<li>rdmsr</li>
<li>wrmsr</li>
</ul>
<p>Other interesting tools, not used in this article: i7z (sadly no longer
maintained), lshw, dmidecode, sensors.</p>
<p>The sensors tool is supposed to report the current CPU voltage, but it doesn't
provide this information on my computers. At least, it gives the temperature of
different components, as well as the speed of fans.</p>
</div>
<div class="section" id="example-of-intel-cpus">
<h2>Example of Intel CPUs</h2>
<div class="section" id="my-laptop-cpu-proc-cpuinfo">
<h3>My laptop CPU: /proc/cpuinfo</h3>
<p>On Linux, the most common way to retrieve information on the CPU is to read
<tt class="docutils literal">/proc/cpuinfo</tt>. Example on my laptop:</p>
<pre class="literal-block">
selma$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
model name : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
cpu MHz : 1200.214
...
processor : 1
vendor_id : GenuineIntel
model name : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
cpu MHz : 3299.882
...
</pre>
<p>"i7-3520M" CPU is a model designed for Mobile Platforms (see the "M" suffix).
It was built in 2012 and is the third generation of the Intel i7
microarchitecture: <a class="reference external" href="https://en.wikipedia.org/wiki/Ivy_Bridge_(microarchitecture)">Ivy Bridge</a>.</p>
<p>The CPU has two physical cores, I disabled HyperThreading in the BIOS.</p>
<p>The first strange thing is that the CPU announces "2.90 GHz" but Linux reports
1.2 GHz on the first core, and 3.3 GHz on the second core. 3.3 GHz is greater
than 2.9 GHz!</p>
</div>
<div class="section" id="my-desktop-cpu-cpu-topology-with-lscpu">
<h3>My desktop CPU: CPU topology with lscpu</h3>
<p>cpuinfo:</p>
<pre class="literal-block">
smithers$ cat /proc/cpuinfo
processor : 0
physical id : 0
core id : 0
...
model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
cpu cores : 4
...
processor : 1
physical id : 0
core id : 1
...
(...)
processor : 7
physical id : 0
core id : 3
...
</pre>
<p>The CPU i7-2600 is the 2nd generation: <a class="reference external" href="https://en.wikipedia.org/wiki/Sandy_Bridge">Sandy Bridge microarchitecture</a>. There are 8 logical cores and 4
physical cores (so with Hyper-threading).</p>
<p>The <tt class="docutils literal">lscpu</tt> command renders a short table which helps to understand the CPU topology:</p>
<pre class="literal-block">
smithers$ lscpu -a -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 3800.0000 1600.0000
1 0 0 1 1:1:1:0 yes 3800.0000 1600.0000
2 0 0 2 2:2:2:0 yes 3800.0000 1600.0000
3 0 0 3 3:3:3:0 yes 3800.0000 1600.0000
4 0 0 0 0:0:0:0 yes 3800.0000 1600.0000
5 0 0 1 1:1:1:0 yes 3800.0000 1600.0000
6 0 0 2 2:2:2:0 yes 3800.0000 1600.0000
7 0 0 3 3:3:3:0 yes 3800.0000 1600.0000
</pre>
<p>There are 8 logical CPUs (<tt class="docutils literal">CPU <span class="pre">0..7</span></tt>), all on the same node (<tt class="docutils literal">NODE 0</tt>) and
the same socket (<tt class="docutils literal">SOCKET 0</tt>). There are only 4 physical cores (<tt class="docutils literal">CORE
<span class="pre">0..3</span></tt>). For example, the physical core <tt class="docutils literal">2</tt> is made of the two logical CPUs:
<tt class="docutils literal">2</tt> and <tt class="docutils literal">6</tt>.</p>
<p>Using the <tt class="docutils literal">L1d:L1i:L2:L3</tt> column, we can see that each pair of logical
cores shares the caches of its physical core for levels 1 (L1 data, L1
instruction) and 2 (L2). All physical cores share the same level 3 cache (L3).</p>
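<p>The logical-to-physical mapping can be rebuilt from the table. A minimal Python sketch, using the (CPU, CORE) pairs from the <tt class="docutils literal">lscpu</tt> output above:</p>

```python
# Group logical CPUs by physical core, using the lscpu table above.
pairs = [(0, 0), (1, 1), (2, 2), (3, 3),
         (4, 0), (5, 1), (6, 2), (7, 3)]  # (logical CPU, physical CORE)

cores = {}
for cpu, core in pairs:
    cores.setdefault(core, []).append(cpu)

print(cores)  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

<p>Each physical core hosts two logical CPUs: for example, core 2 is made of logical CPUs 2 and 6, as stated above.</p>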
</div>
</div>
<div class="section" id="p-states">
<h2>P-states</h2>
<p>A new CPU driver, <tt class="docutils literal">intel_pstate</tt>, was added to the Linux kernel 3.9 (April
2013). At first, it only supported Sandy Bridge CPUs (2nd generation); Linux 3.10
extended it to Ivy Bridge generation CPUs (3rd gen), and so on and so forth.</p>
<p>This driver supports recent features and thermal control of modern Intel CPUs.
Its name comes from P-states.</p>
<p>The processor P-state is the capability of running the processor at different
voltage and/or frequency levels. Generally, P0 is the highest state resulting
in maximum performance, while P1, P2, and so on, will save power but at some
penalty to CPU performance.</p>
<p>It is possible to force the legacy CPU driver (<tt class="docutils literal">acpi_cpufreq</tt>) using the
<tt class="docutils literal">intel_pstate=disable</tt> option on the kernel command line.</p>
<p>See also:</p>
<ul class="simple">
<li><a class="reference external" href="https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt">Documentation of the intel-pstate driver</a></li>
<li><a class="reference external" href="https://plus.google.com/+ArjanvandeVen/posts/dLn9T4ehywL">Some basics on CPU P states on Intel processors</a> (2013) by Arjan
van de Ven (Intel)</li>
<li><a class="reference external" href="https://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf">Balancing Power and Performance in the Linux Kernel</a>
talk at LinuxCon Europe 2015 by Kristen Accardi (Intel)</li>
<li><a class="reference external" href="https://software.intel.com/en-us/blogs/2008/05/29/what-exactly-is-a-p-state-pt-1">What exactly is a P-state? (Pt. 1)</a>
(2008) by Taylor K. (Intel)</li>
</ul>
</div>
<div class="section" id="idle-states-c-states">
<h2>Idle states: C-states</h2>
<p>C-states are idle power saving states, in contrast to P-states, which are
execution power saving states.</p>
<p>During a P-state, the processor is still executing instructions, whereas during
a C-state (other than C0), the processor is idle, meaning that nothing is
executing.</p>
<p>C-states:</p>
<ul class="simple">
<li>C0 is the operational state, meaning that the CPU is doing useful work</li>
<li>C1 is the first idle state</li>
<li>C2 is the second idle state: The external I/O Controller Hub blocks
interrupts to the processor.</li>
<li>etc.</li>
</ul>
<p>When a logical processor is idle (in a C-state other than C0), its frequency
is typically 0 (HALT).</p>
<p>The <tt class="docutils literal">cpupower <span class="pre">idle-info</span></tt> command lists supported C-states:</p>
<pre class="literal-block">
selma$ cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu
analyzing CPU 0:
Number of idle states: 6
Available idle states: POLL C1-IVB C1E-IVB C3-IVB C6-IVB C7-IVB
...
</pre>
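<p>A minimal sketch of parsing that output in Python (the function name is mine; the format is assumed from the example above):</p>

```python
def parse_idle_info(output):
    """Parse `cpupower idle-info` text output (format assumed from the
    example above) into driver, governor and the list of C-states."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    states = info.get("Available idle states", "").split()
    return info.get("CPUidle driver"), info.get("CPUidle governor"), states

sample = """\
CPUidle driver: intel_idle
CPUidle governor: menu
analyzing CPU 0:
Number of idle states: 6
Available idle states: POLL C1-IVB C1E-IVB C3-IVB C6-IVB C7-IVB
"""
driver, governor, states = parse_idle_info(sample)
print(driver, governor, states)
```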
<p>The <tt class="docutils literal">cpupower monitor</tt> shows statistics on C-states:</p>
<pre class="literal-block">
smithers$ sudo cpupower monitor -m Idle_Stats
|Idle_Stats
CPU | POLL | C1-S | C1E- | C3-S | C6-S
0| 0,00| 0,19| 0,09| 0,58| 96,23
4| 0,00| 0,00| 0,00| 0,00| 99,90
1| 0,00| 2,34| 0,00| 0,00| 97,63
5| 0,00| 0,00| 0,17| 0,00| 98,02
2| 0,00| 0,00| 0,00| 0,00| 0,00
6| 0,00| 0,00| 0,00| 0,00| 0,00
3| 0,00| 0,00| 0,00| 0,00| 0,00
7| 0,00| 0,00| 0,00| 0,00| 49,97
</pre>
<p>See also: <a class="reference external" href="https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states">Power Management States: P-States, C-States, and Package C-States</a>.</p>
</div>
<div class="section" id="turbo-boost-1">
<h2>Turbo Boost</h2>
<p>In 2005, Intel introduced <a class="reference external" href="https://en.wikipedia.org/wiki/SpeedStep">SpeedStep</a>, a series of dynamic frequency
scaling technologies to reduce the power consumption of laptop CPUs. Turbo
Boost is an enhancement of these technologies, now also used on desktop and
server CPUs.</p>
<p>Turbo Boost allows one or more CPU cores to run at higher P-states than usual.
The maximum P-state is constrained by the following factors:</p>
<ul class="simple">
<li>The number of active cores (in C0 or C1 state)</li>
<li>The estimated current consumption of the processor (Imax)</li>
<li>The estimated power consumption (TDP - Thermal Design Power) of the processor</li>
<li>The temperature of the processor</li>
</ul>
<p>Example on my laptop:</p>
<pre class="literal-block">
selma$ cat /proc/cpuinfo
model name : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
...
selma$ sudo cpupower frequency-info
analyzing CPU 0:
driver: intel_pstate
...
boost state support:
Supported: yes
Active: yes
3400 MHz max turbo 4 active cores
3400 MHz max turbo 3 active cores
3400 MHz max turbo 2 active cores
3600 MHz max turbo 1 active cores
</pre>
<p>The CPU base frequency is 2.9 GHz. If more than one physical core is "active"
(busy), their frequency can be increased up to 3.4 GHz. If only one physical
core is active, its frequency can be increased up to 3.6 GHz.</p>
<p>In this example, Turbo Boost is supported and active.</p>
<p>See also the <a class="reference external" href="https://www.kernel.org/doc/Documentation/cpu-freq/boost.txt">Linux cpu-freq documentation on CPU boost</a>.</p>
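<p>The frequency table above can be expressed as a tiny lookup (values are specific to this i7-3520M, and the function name is mine):</p>

```python
# Max turbo frequency (MHz) for a given number of active cores, from the
# `cpupower frequency-info` output above (values specific to this i7-3520M).
MAX_TURBO_MHZ = {1: 3600, 2: 3400, 3: 3400, 4: 3400}
BASE_MHZ = 2900  # base frequency of this CPU

def max_turbo_mhz(active_cores):
    """Upper P-state limit as a function of the active core count."""
    return MAX_TURBO_MHZ.get(active_cores, BASE_MHZ)

print(max_turbo_mhz(1))  # a single active core can boost the highest
print(max_turbo_mhz(4))
```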
<div class="section" id="turbo-boost-msr">
<h3>Turbo Boost MSR</h3>
<p>Bit 38 of the <a class="reference external" href="https://en.wikipedia.org/wiki/Model-specific_register">Model-specific register
(MSR)</a> <tt class="docutils literal">0x1a0</tt> can
be used to check whether Turbo Boost is enabled:</p>
<pre class="literal-block">
selma$ sudo rdmsr -f 38:38 0x1a0
0
</pre>
<p><tt class="docutils literal">0</tt> means that Turbo Boost is enabled, whereas <tt class="docutils literal">1</tt> means disabled (no
turbo). (The <tt class="docutils literal"><span class="pre">-f</span> 38:38</tt> option asks to display only bit 38.)</p>
<p>If the command doesn't work, you may have to load the <tt class="docutils literal">msr</tt> kernel module:</p>
<pre class="literal-block">
sudo modprobe msr
</pre>
<p>Note: I'm not sure that all Intel CPUs use the same MSR.</p>
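<p>The bit extraction itself is simple arithmetic; here is a sketch in Python (the MSR value must still be read from <tt class="docutils literal">/dev/cpu/N/msr</tt> as root, only the bit math is shown, and the <tt class="docutils literal">0x850089</tt> sample value is made up):</p>

```python
# Sketch of the `rdmsr -f 38:38` bit extraction. The MSR value itself must
# be read from /dev/cpu/N/msr (root only); here we only do the bit math.
TURBO_DISABLE_BIT = 38  # bit 38 of MSR 0x1a0 (IA32_MISC_ENABLE)

def turbo_disable_bit(msr_value):
    """Equivalent of `rdmsr -f 38:38 0x1a0`: 0 = turbo enabled, 1 = disabled."""
    return (msr_value >> TURBO_DISABLE_BIT) & 1

def turbo_boost_enabled(msr_value):
    return turbo_disable_bit(msr_value) == 0

print(turbo_disable_bit(0x850089))             # bit 38 clear: 0 (enabled)
print(turbo_disable_bit(0x850089 | 1 << 38))   # bit 38 set: 1 (disabled)
```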
</div>
<div class="section" id="intel-state-no-turbo">
<h3>intel_pstate/no_turbo</h3>
<p>Turbo Boost can also be disabled at runtime in the <tt class="docutils literal">intel_pstate</tt> driver.</p>
<p>Check if Turbo Boost is enabled:</p>
<pre class="literal-block">
selma$ cat /sys/devices/system/cpu/intel_pstate/no_turbo
0
</pre>
<p>where <tt class="docutils literal">0</tt> means that Turbo Boost is enabled. Disable Turbo Boost:</p>
<pre class="literal-block">
selma$ echo 1|sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
</pre>
</div>
<div class="section" id="cpu-flag-ida">
<h3>CPU flag "ida"</h3>
<p>It looks like the Turbo Boost status (supported or not) can also be read via
CPUID(6): "Thermal/Power Management". It gives access to the flag <a class="reference external" href="https://en.wikipedia.org/wiki/Intel_Dynamic_Acceleration">Intel
Dynamic Acceleration (IDA)</a>.</p>
<p>The <tt class="docutils literal">ida</tt> flag can also be seen in CPU flags of <tt class="docutils literal">/proc/cpuinfo</tt>.</p>
</div>
</div>
<div class="section" id="read-the-cpu-frequency">
<h2>Read the CPU frequency</h2>
<p>General information using <tt class="docutils literal">cpupower <span class="pre">frequency-info</span></tt>:</p>
<pre class="literal-block">
selma$ cpupower -c 0 frequency-info
analyzing CPU 0:
driver: intel_pstate
...
hardware limits: 1.20 GHz - 3.60 GHz
...
</pre>
<p>The frequency of CPUs is between 1.2 GHz and 3.6 GHz (the base frequency is
2.9 GHz on this CPU).</p>
<div class="section" id="get-the-frequency-of-cpus-turbostat">
<h3>Get the frequency of CPUs: turbostat</h3>
<p>It looks like the most reliable way to get a realistic estimate of the CPU
frequency is to use the tool <tt class="docutils literal">turbostat</tt>:</p>
<pre class="literal-block">
selma$ sudo turbostat
CPU Avg_MHz Busy% Bzy_MHz TSC_MHz
- 224 7.80 2878 2893
0 448 15.59 2878 2893
1 0 0.01 2762 2893
CPU Avg_MHz Busy% Bzy_MHz TSC_MHz
- 139 5.65 2469 2893
0 278 11.29 2469 2893
1 0 0.01 2686 2893
...
</pre>
<ul class="simple">
<li><tt class="docutils literal">Avg_MHz</tt>: average frequency, based on APERF</li>
<li><tt class="docutils literal">Busy%</tt>: CPU usage in percent</li>
<li><tt class="docutils literal">Bzy_MHz</tt>: busy frequency, based on MPERF</li>
<li><tt class="docutils literal">TSC_MHz</tt>: fixed frequency, TSC stands for <a class="reference external" href="https://en.wikipedia.org/wiki/Time_Stamp_Counter">Time Stamp Counter</a></li>
</ul>
<p>APERF (actual performance) and MPERF (maximum performance) are MSRs that
provide feedback on the current CPU frequency.</p>
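<p>My understanding of how turbostat derives its columns from the counter deltas can be sketched as follows (the function name and sample numbers are mine, and turbostat's exact formulas may differ slightly):</p>

```python
def turbostat_columns(d_aperf, d_mperf, d_tsc, tsc_mhz):
    """Reconstruct turbostat's columns from counter deltas (my reading of
    the tool's output; turbostat's exact formulas may differ slightly)."""
    busy_pct = 100.0 * d_mperf / d_tsc     # MPERF only ticks while in C0
    bzy_mhz = tsc_mhz * d_aperf / d_mperf  # frequency while busy
    avg_mhz = bzy_mhz * busy_pct / 100.0   # averaged over idle time too
    return busy_pct, bzy_mhz, avg_mhz

# A CPU busy 10% of the interval, running near 2878 MHz while busy:
busy, bzy, avg = turbostat_columns(d_aperf=2878, d_mperf=2893, d_tsc=28930,
                                   tsc_mhz=2893.0)
print(f"Busy% {busy:.2f}  Bzy_MHz {bzy:.0f}  Avg_MHz {avg:.0f}")
```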
</div>
<div class="section" id="other-tools-to-get-the-cpu-frequency">
<h3>Other tools to get the CPU frequency</h3>
<p>It looks like the following tools are less reliable to estimate the CPU
frequency.</p>
<p>cpuinfo:</p>
<pre class="literal-block">
selma$ grep MHz /proc/cpuinfo
cpu MHz : 1372.289
cpu MHz : 3401.042
</pre>
<p>In April 2016, Len Brown proposed a patch modifying cpuinfo to use APERF and
MPERF MSR to estimate the CPU frequency: <a class="reference external" href="https://lkml.org/lkml/2016/4/1/7">x86: Calculate MHz using APERF/MPERF
for cpuinfo and scaling_cur_freq</a>.</p>
<p>The <tt class="docutils literal">tsc</tt> clock source logs the CPU frequency in kernel logs:</p>
<pre class="literal-block">
selma$ dmesg|grep 'MHz processor'
[ 0.000000] tsc: Detected 2893.331 MHz processor
</pre>
<p>cpupower frequency-info:</p>
<pre class="literal-block">
selma$ for core in $(seq 0 1); do sudo cpupower -c $core frequency-info|grep 'current CPU'; done
current CPU frequency: 3.48 GHz (asserted by call to hardware)
current CPU frequency: 3.40 GHz (asserted by call to hardware)
</pre>
<p>cpupower monitor:</p>
<pre class="literal-block">
selma$ sudo cpupower monitor -m 'Mperf'
|Mperf
CPU | C0 | Cx | Freq
0| 4.77| 95.23| 1924
1| 0.01| 99.99| 1751
</pre>
</div>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>Modern Intel CPUs use various technologies to provide the best performance
without excessive power consumption. Monitoring and understanding CPU
performance has become harder than with older CPUs, since performance now
depends on many more factors.</p>
<p>It also becomes common to get an integrated graphics processor (IGP) in the
same package, which makes the exact performance even more complex to predict,
since the IGP produces heat and so has an impact on the CPU P-state.</p>
<p>I should also explain that P-states are "voted" on between CPU cores, but I
didn't fully understand this part. I'm not sure that understanding the exact
algorithm matters much. I tried not to give too much information.</p>
</div>
<div class="section" id="annex-amt-and-the-me-power-management-coprocessor">
<h2>Annex: AMT and the ME (power management coprocessor)</h2>
<p>Computers with Intel vPro technology include <a class="reference external" href="https://en.wikipedia.org/wiki/Intel_Active_Management_Technology">Intel Active Management
Technology (AMT)</a>: "hardware
and firmware technology for remote out-of-band management of personal
computers". AMT has many features which includes power management.</p>
<p><a class="reference external" href="https://en.wikipedia.org/wiki/Intel_Active_Management_Technology#Hardware">Management Engine (ME)</a>
is the hardware part: an isolated and protected coprocessor, embedded as a
non-optional part in all current (as of 2015) Intel chipsets. The coprocessor
is a special 32-bit ARC microprocessor (RISC architecture) that's physically
located inside the PCH chipset (or MCH on older chipsets). The coprocessor can
for example be found on Intel MCH chipsets Q35 and Q45.</p>
<p>See <a class="reference external" href="https://boingboing.net/2016/06/15/intel-x86-processors-ship-with.html">Intel x86s hide another CPU that can take over your machine (you can't
audit it)</a> for
more information on the coprocessor.</p>
<p>More recently, the Intel Xeon Phi CPU (2016) also includes a coprocessor for
power management. I couldn't determine whether it is the same coprocessor.</p>
</div>
Visualize the system noise using perf and CPU isolation2016-06-16T13:30:00+02:002016-06-16T13:30:00+02:00Victor Stinnertag:vstinner.github.io,2016-06-16:/perf-visualize-system-noise-with-cpu-isolation.html<p>I developed a new <a class="reference external" href="http://perf.readthedocs.io/">perf module</a> designed to run
stable benchmarks, give fine control on benchmark parameters and compute
statistics on results. With such a tool, it becomes simple to <em>visualize</em>
sources of noise. The CPU isolation will be used to visualize the system noise.
Running a benchmark on isolated CPUs …</p><p>I developed a new <a class="reference external" href="http://perf.readthedocs.io/">perf module</a> designed to run
stable benchmarks, give fine control on benchmark parameters and compute
statistics on results. With such a tool, it becomes simple to <em>visualize</em>
sources of noise. The CPU isolation will be used to visualize the system noise.
Running a benchmark on isolated CPUs isolates it from the system noise.</p>
<div class="section" id="isolate-cpus">
<h2>Isolate CPUs</h2>
<p>My computer has 4 physical CPU cores. I isolated half of them using the
<tt class="docutils literal">isolcpus=2,3</tt> parameter of the Linux kernel. I manually modified the command
line in GRUB to add this parameter.</p>
<p>Check that CPUs are isolated:</p>
<pre class="literal-block">
$ cat /sys/devices/system/cpu/isolated
2-3
</pre>
<p>The CPU supports HyperThreading, but I disabled it in the BIOS.</p>
</div>
<div class="section" id="run-a-benchmark">
<h2>Run a benchmark</h2>
<p>The <tt class="docutils literal">perf</tt> module automatically detects and uses isolated CPU cores. I will
use the <tt class="docutils literal"><span class="pre">--affinity=0,1</span></tt> option to force running the benchmark on the CPUs
which are not isolated.</p>
<p>Microbenchmark with and without CPU isolation:</p>
<pre class="literal-block">
$ python3 -m perf.timeit --json-file=timeit_isolcpus.json --verbose -s 'x=1; y=2' 'x+y'
Pin process to isolated CPUs: 2-3
.........................
Median +- std dev: 36.6 ns +- 0.1 ns (25 runs x 3 samples x 10^7 loops; 1 warmup)
$ python3 -m perf.timeit --affinity=0,1 --json-file=timeit_no_isolcpus.json --verbose -s 'x=1; y=2' 'x+y'
Pin process to CPUs: 0-1
.........................
Median +- std dev: 36.7 ns +- 1.3 ns (25 runs x 3 samples x 10^7 loops; 1 warmup)
</pre>
<p>My computer was not 100% idle: I was using it while the benchmarks were
running.</p>
<p>The median is almost the same (36.6 ns and 36.7 ns). The first major difference
is the standard deviation: it is much larger without CPU isolation: 0.1 ns =>
1.3 ns (13x larger).</p>
<p>Just in case, check manually CPU affinity in metadata:</p>
<pre class="literal-block">
$ python3 -m perf show timeit_isolcpus.json --metadata | grep cpu
- cpu_affinity: 2-3 (isolated)
- cpu_count: 4
- cpu_model_name: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
$ python3 -m perf show timeit_no_isolcpus.json --metadata | grep cpu_affinity
- cpu_affinity: 0-1
</pre>
</div>
<div class="section" id="statistics">
<h2>Statistics</h2>
<p>The <tt class="docutils literal">perf stats</tt> command computes statistics on the distribution of samples:</p>
<pre class="literal-block">
$ python3 -m perf stats timeit_isolcpus.json
Number of samples: 75
Minimum: 36.5 ns (-0.1%)
Median +- std dev: 36.6 ns +- 0.1 ns (36.5 ns .. 36.7 ns)
Maximum: 36.7 ns (+0.4%)
$ python3 -m perf stats timeit_no_isolcpus.json
Number of samples: 75
Minimum: 36.5 ns (-0.5%)
Median +- std dev: 36.7 ns +- 1.3 ns (35.4 ns .. 38.0 ns)
Maximum: 43.0 ns (+17.0%)
</pre>
<p>The minimum is the same. The second major difference is the maximum: it is much
larger without CPU isolation: 36.7 ns (+0.4%) => 43.0 ns (+17.0%).</p>
<p>The difference between the maximum and the median is 63x larger without CPU
isolation: 0.1 ns (<tt class="docutils literal">36.7 - 36.6</tt>) => 6.3 ns (<tt class="docutils literal">43.0 - 36.7</tt>).</p>
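<p>These comparisons are simple arithmetic on the numbers above; a quick sketch (the function name is mine, and the rounding may differ slightly from what perf prints):</p>

```python
def spread(minimum, median, maximum):
    """Relative distance of min/max from the median, as printed by
    `perf stats` (rounding may differ slightly from perf's output)."""
    return (100.0 * (minimum - median) / median,
            100.0 * (maximum - median) / median)

low, high = spread(36.5, 36.7, 43.0)   # run without CPU isolation
print(f"Minimum: {low:+.1f}%  Maximum: {high:+.1f}%")

# Tail size (maximum - median), with and without isolation:
tail_isol = 36.7 - 36.6        # isolated run: max - median
tail_no_isol = 43.0 - 36.7     # non-isolated run: max - median
print(round(tail_no_isol / tail_isol))  # roughly 63x larger without isolation
```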
<p>Depending on the system load, a single sample of the microbenchmark is up to
17% slower (maximum of 43.0 ns with a median of 36.7 ns) without CPU isolation.
The difference is smaller with CPU isolation: only 0.4% slower (for the
maximum, and 0.1% faster for the minimum).</p>
</div>
<div class="section" id="histogram">
<h2>Histogram</h2>
<p>Another way to analyze the distribution of samples is to render a histogram:</p>
<pre class="literal-block">
$ python3 -m perf hist --bins=8 timeit_isolcpus.json timeit_no_isolcpus.json
[ timeit_isolcpus ]
36.1 ns: 75 ################################################
36.9 ns: 0 |
37.7 ns: 0 |
38.5 ns: 0 |
39.3 ns: 0 |
40.1 ns: 0 |
40.9 ns: 0 |
41.7 ns: 0 |
42.5 ns: 0 |
[ timeit_no_isolcpus ]
36.1 ns: 52 ################################################
36.9 ns: 13 ############
37.7 ns: 1 #
38.5 ns: 4 ####
39.3 ns: 2 ##
40.1 ns: 0 |
40.9 ns: 1 #
41.7 ns: 0 |
42.5 ns: 2 ##
</pre>
<p>I chose the number of bars to get a small histogram and to fit all samples of
the first benchmark into the same bar. With 8 bars, each bar covers a range of
0.8 ns.</p>
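<p>The binning behind such a histogram can be sketched in a few lines (a simplified version of my own, not perf's actual code; the sample timings are made up):</p>

```python
def histogram(samples, bins=8):
    """Bin samples into equal-width buckets, similar in spirit to
    `perf hist --bins=8` (a simplified sketch, not perf's exact code)."""
    low, high = min(samples), max(samples)
    width = (high - low) / bins or 1.0
    counts = [0] * bins
    for x in samples:
        index = min(int((x - low) / width), bins - 1)
        counts[index] += 1
    return [(low + i * width, count) for i, count in enumerate(counts)]

for start, count in histogram([36.6, 36.6, 36.7, 38.5, 43.0], bins=4):
    print(f"{start:5.1f} ns: {count:2d} {'#' * count}")
```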
<p>The last major difference is the shape of these histograms. Without CPU
isolation, there is a "long tail" to the right of the median: <a class="reference external" href="https://en.wikipedia.org/wiki/Outlier">outliers</a> in the range [37.7 ns; 42.5 ns].
The outliers come from the "noise" caused by the multitasking system.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>The <tt class="docutils literal">perf</tt> module provides multiple tools to analyze the distribution of
benchmark samples. Three tools show a major difference without CPU isolation
compared to results with CPU isolation:</p>
<ul class="simple">
<li>Standard deviation: 13x larger without isolation</li>
<li>Maximum: difference to median 63x larger without isolation</li>
<li>Shape of the histogram: long tail to the right of the median</li>
</ul>
<p>It explains why CPU isolation helps to make benchmarks more stable.</p>
</div>
My journey to stable benchmark, part 3 (average)2016-05-23T23:00:00+02:002016-05-23T23:00:00+02:00Victor Stinnertag:vstinner.github.io,2016-05-23:/journey-to-stable-benchmark-average.html<p class="first last">My journey to stable benchmark, part 3 (average)</p>
<a class="reference external image-reference" href="https://www.flickr.com/photos/stanzim/11100202065/"><img alt="Fog" src="https://vstinner.github.io/images/fog.jpg" /></a>
<p><em>Stable benchmarks are so close, but ...</em></p>
<div class="section" id="address-space-layout-randomization">
<h2>Address Space Layout Randomization</h2>
<p>When I started to work on removing the noise of the system, I was told that
disabling <a class="reference external" href="https://en.wikipedia.org/wiki/Address_space_layout_randomization">Address Space Layout Randomization (ASLR)</a> makes
benchmarks more stable.</p>
<p>I followed this advice without trying to understand it. We will see in this
article that it was a bad idea, but I had to hit other issues to really
understand the root issue with disabling ASLR.</p>
<p>Example of command to see the effect of ASLR, the first number of the output is
the start address of the heap memory:</p>
<pre class="literal-block">
$ python -c 'import os; os.system("grep heap /proc/%s/maps" % os.getpid())'
55e6a716c000-55e6a7235000 rw-p 00000000 00:00 0 [heap]
</pre>
<p>Heap address of 3 runs with ASLR enabled (random):</p>
<ul class="simple">
<li>55e6a716c000</li>
<li>561c218eb000</li>
<li>55e6f628f000</li>
</ul>
<p>Disable ASLR:</p>
<pre class="literal-block">
sudo bash -c 'echo 0 >| /proc/sys/kernel/randomize_va_space'
</pre>
<p>Heap addresses of 3 runs with ASLR disabled (all the same):</p>
<ul class="simple">
<li>555555756000</li>
<li>555555756000</li>
<li>555555756000</li>
</ul>
<p>Note: To reenable ASLR, it's better to use the value 2, the value 1 only
partially enables the feature:</p>
<pre class="literal-block">
sudo bash -c 'echo 2 >| /proc/sys/kernel/randomize_va_space'
</pre>
</div>
<div class="section" id="python-randomized-hash-function">
<h2>Python randomized hash function</h2>
<p>With <a class="reference external" href="https://vstinner.github.io/journey-to-stable-benchmark-system.html">system tuning (part 1)</a>, a
<a class="reference external" href="https://vstinner.github.io/journey-to-stable-benchmark-deadcode.html">Python compiled with PGO (part 2)</a>
and ASLR disabled, I still failed to get the same result when manually
running <tt class="docutils literal">bm_call_simple.py</tt>.</p>
<p>On Python 3, the hash function is now randomized by default: <a class="reference external" href="http://bugs.python.org/issue13703">issue #13703</a>. The problem is that for a
microbenchmark, the number of hash collisions of a "hot" dictionary has a
non-negligible impact on performance.</p>
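<p>The randomization is easy to observe by spawning fresh interpreters (the helper name and the hashed string are mine, just to illustrate):</p>

```python
import os
import subprocess
import sys

def hash_with_seed(seed):
    """Hash a string in a fresh interpreter with a fixed PYTHONHASHSEED
    (helper name is mine, just to illustrate the randomization)."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run([sys.executable, "-c", "print(hash('benchmark'))"],
                         env=env, capture_output=True, text=True, check=True)
    return int(out.stdout)

# A fixed seed is reproducible across processes; different seeds almost
# always produce different hash values, hence different collisions.
print(hash_with_seed(1) == hash_with_seed(1))
print(hash_with_seed(1), hash_with_seed(2))
```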
<p>The <tt class="docutils literal">PYTHONHASHSEED</tt> environment variable can be used to get a fixed hash
function. Example with the patch:</p>
<pre class="literal-block">
$ PYTHONHASHSEED=1 taskset -c 1 ./python bm_call_simple.py -n 1
0.198
$ PYTHONHASHSEED=2 taskset -c 1 ./python bm_call_simple.py -n 1
0.201
$ PYTHONHASHSEED=3 taskset -c 1 ./python bm_call_simple.py -n 1
0.207
$ PYTHONHASHSEED=4 taskset -c 1 ./python bm_call_simple.py -n 1
0.187
$ PYTHONHASHSEED=5 taskset -c 1 ./python bm_call_simple.py -n 1
0.180
</pre>
<p>Timings of the reference python:</p>
<pre class="literal-block">
$ PYTHONHASHSEED=1 taskset -c 1 ./ref_python bm_call_simple.py -n 1
0.204
$ PYTHONHASHSEED=2 taskset -c 1 ./ref_python bm_call_simple.py -n 1
0.206
$ PYTHONHASHSEED=3 taskset -c 1 ./ref_python bm_call_simple.py -n 1
0.195
$ PYTHONHASHSEED=4 taskset -c 1 ./ref_python bm_call_simple.py -n 1
0.192
$ PYTHONHASHSEED=5 taskset -c 1 ./ref_python bm_call_simple.py -n 1
0.187
</pre>
<p>The minimum is 187 ms for the reference and 180 ms for the patch. The patched
Python is 4% faster, yeah!</p>
<p>Wait. What if we only test PYTHONHASHSEED from 1 to 3? In this case, the
minimum is 195 ms for the reference and 198 ms for the patch. The patched
Python becomes 2% slower, oh no!</p>
<p>Faster? Slower? Who is right?</p>
<p>Maybe I should write a script to find a <tt class="docutils literal">PYTHONHASHSEED</tt> value for which my
patch is always faster :-)</p>
</div>
<div class="section" id="command-line-and-environment-variables">
<h2>Command line and environment variables</h2>
<p>Well, let's say that we will use a fixed PYTHONHASHSEED value. Anyway, my
patch doesn't touch the hash function. So it doesn't matter.</p>
<p>While running benchmarks, I noticed differences when running the benchmark from
a different directory:</p>
<pre class="literal-block">
$ cd /home/haypo/prog/python/fastcall
$ PYTHONHASHSEED=3 taskset -c 1 pgo/python ../benchmarks/performance/bm_call_simple.py -n 1
0.215
$ cd /home/haypo/prog/python/benchmarks
$ PYTHONHASHSEED=3 taskset -c 1 ../fastcall/pgo/python ../benchmarks/performance/bm_call_simple.py -n 1
0.203
$ cd /home/haypo/prog/python
$ PYTHONHASHSEED=3 taskset -c 1 fastcall/pgo/python benchmarks/performance/bm_call_simple.py -n 1
0.200
</pre>
<p>In fact, a different command line is enough to get different results (added
arguments are ignored):</p>
<pre class="literal-block">
$ PYTHONHASHSEED=3 taskset -c 1 ./python bm_call_simple.py -n 1
0.201
$ PYTHONHASHSEED=3 taskset -c 1 ./python bm_call_simple.py -n 1 arg1
0.198
$ PYTHONHASHSEED=3 taskset -c 1 ./python bm_call_simple.py -n 1 arg1 arg2 arg3
0.203
$ PYTHONHASHSEED=3 taskset -c 1 ./python bm_call_simple.py -n 1 arg1 arg2 arg3 arg4 arg5
0.206
$ PYTHONHASHSEED=3 taskset -c 1 ./python bm_call_simple.py -n 1 arg1 arg2 arg3 arg4 arg5 arg6
0.210
</pre>
<p>I also noticed minor differences when the environment changes (added variables
are ignored):</p>
<pre class="literal-block">
$ taskset -c 1 env -i PYTHONHASHSEED=3 ./python bm_call_simple.py -n 1
0.201
$ taskset -c 1 env -i PYTHONHASHSEED=3 VAR1=1 VAR2=2 VAR3=3 VAR4=4 ./python bm_call_simple.py -n 1
0.202
$ taskset -c 1 env -i PYTHONHASHSEED=3 VAR1=1 VAR2=2 VAR3=3 VAR4=4 VAR5=5 ./python bm_call_simple.py -n 1
0.198
</pre>
<p>Using <tt class="docutils literal">strace</tt> and <tt class="docutils literal">ltrace</tt>, I saw that the memory addresses are different when
something (command line, env var, etc.) changes.</p>
</div>
<div class="section" id="average-and-standard-deviation">
<h2>Average and standard deviation</h2>
<p>Basically, it looks like a lot of "external factors" have an impact on the
exact memory addresses, even if ASLR is disabled and PYTHONHASHSEED is set. I
started to think how to get <em>exactly</em> the same command line, the same
environment (easy), the same current directory (easy), etc. The problem is that
it's just not possible to control all external factors (having an effect on the
exact memory addresses).</p>
<p>Maybe I was plain wrong from the beginning and ASLR must be enabled,
as the default on Linux:</p>
<pre class="literal-block">
$ taskset -c 1 env -i PYTHONHASHSEED=3 ./python bm_call_simple.py
0.198
$ taskset -c 1 env -i PYTHONHASHSEED=3 ./python bm_call_simple.py
0.202
$ taskset -c 1 env -i PYTHONHASHSEED=3 ./python bm_call_simple.py
0.199
$ taskset -c 1 env -i PYTHONHASHSEED=3 ./python bm_call_simple.py
0.207
$ taskset -c 1 env -i PYTHONHASHSEED=3 ./python bm_call_simple.py
0.200
$ taskset -c 1 env -i PYTHONHASHSEED=3 ./python bm_call_simple.py
0.201
</pre>
<p>These results look "random". Yes, they are. It's exactly the purpose of ASLR.</p>
<p>But how can we compare performances if results are random? Take the minimum?</p>
<p>No! You must never (ever again) use the minimum for benchmarking! Compute the
average and some statistics like the standard deviation:</p>
<pre class="literal-block">
$ python3
Python 3.4.3
>>> timings=[0.198, 0.202, 0.199, 0.207, 0.200, 0.201]
>>> import statistics
>>> statistics.mean(timings)
0.2011666666666667
>>> statistics.stdev(timings)
0.0031885210782848245
</pre>
<p>On this example, the average is 201 ms +/- 3 ms. IMHO the standard deviation is
quite small (reliable) which means that my benchmark is stable. To get a good
distribution, it's better to have many samples. It looks like at least 25
processes are needed. Each process tests a different memory layout and a
different hash function.</p>
<p>Result of 5 runs, each run uses 25 processes (ASLR enabled, random hash
function):</p>
<ul class="simple">
<li>Average: 205.2 ms +/- 3.0 ms (min: 201.1 ms, max: 214.9 ms)</li>
<li>Average: 205.6 ms +/- 3.3 ms (min: 201.4 ms, max: 216.5 ms)</li>
<li>Average: 206.0 ms +/- 3.9 ms (min: 201.1 ms, max: 215.3 ms)</li>
<li>Average: 205.7 ms +/- 3.6 ms (min: 201.5 ms, max: 217.8 ms)</li>
<li>Average: 206.4 ms +/- 3.5 ms (min: 201.9 ms, max: 214.9 ms)</li>
</ul>
<p>While memory layout and hash functions are random again, the result looks
<em>less</em> random, and so more reliable, than before!</p>
<p>With ASLR enabled, the effect of the environment variables, command line and
current directory is negligible on the (average) result.</p>
</div>
<div class="section" id="the-average-solves-issues-with-uniform-random-noises">
<h2>The average solves issues with uniform random noises</h2>
<p>The user will run the application with default system settings which means
ASLR enabled and Python hash function randomized. Running a benchmark in one
specific environment is a mistake because it is not representative of the
performance in practice.</p>
<p>Computing the average and standard deviation "fixes" the issue with hash
randomization. It's much better to use random hash functions and compute the
average, than using a fixed hash function (setting <tt class="docutils literal">PYTHONHASHSEED</tt> variable
to a value).</p>
<p>Oh wow, already 3 big articles explaining how to get stable benchmarks. Please
tell me that it was the last one! Nope, more is coming...</p>
</div>
<div class="section" id="annex-why-only-n1">
<h2>Annex: why only -n1?</h2>
<p>In this article, I ran <tt class="docutils literal">bm_call_simple.py</tt> with <tt class="docutils literal"><span class="pre">-n</span> 1</tt>, which runs only one
iteration.</p>
<p>Usually, a single iteration is not reliable at all, at least 50 iterations are
needed. But thanks to system tuning, compilation with PGO, ASLR disabled and
<tt class="docutils literal">PYTHONHASHSEED</tt> set, a single iteration is enough.</p>
<p>Example of 3 runs, each with 3 iterations:</p>
<pre class="literal-block">
$ taskset -c 1 env -i PYTHONHASHSEED=3 ./python bm_call_simple.py -n 3
0.201
0.201
0.201
$ taskset -c 1 env -i PYTHONHASHSEED=3 ./python bm_call_simple.py -n 3
0.201
0.201
0.201
$ taskset -c 1 env -i PYTHONHASHSEED=3 ./python bm_call_simple.py -n 3
0.201
0.201
0.201
</pre>
<p>Always the same timing!</p>
</div>
My journey to stable benchmark, part 2 (deadcode)2016-05-22T22:00:00+02:002016-05-22T22:00:00+02:00Victor Stinnertag:vstinner.github.io,2016-05-22:/journey-to-stable-benchmark-deadcode.html<p class="first last">My journey to stable benchmark, part 2 (deadcode)</p>
<a class="reference external image-reference" href="https://www.flickr.com/photos/uw67/16875152403/"><img alt="Snail" src="https://vstinner.github.io/images/snail.jpg" /></a>
<p>With <a class="reference external" href="https://vstinner.github.io/journey-to-stable-benchmark-system.html">the system tuning (part 1)</a>, I
expected to get very stable benchmarks and so I started to benchmark seriously
my <a class="reference external" href="https://bugs.python.org/issue26814">FASTCALL branch</a> of CPython (a new
calling convention avoiding temporary tuples).</p>
<p>I was disappointed to get many slowdowns in the CPython benchmark suite. I
started to analyze why my change introduced performance regressions.</p>
<p>I took my overall patch and slowly reverted more and more code to check which
changes introduced most of the slowdowns.</p>
<p>I focused on the <tt class="docutils literal">call_simple</tt> benchmark which does only one thing: call
Python functions which do nothing. Making Python function calls slower would
be a big and unacceptable regression in my work.</p>
<div class="section" id="linux-perf">
<h2>Linux perf</h2>
<p>I started to learn how to use the great <a class="reference external" href="https://perf.wiki.kernel.org/index.php/Main_Page">Linux perf</a> tool to analyze why
<tt class="docutils literal">call_simple</tt> was slower. I tried to find a major difference between my
reference python and the patched python.</p>
<p>I analyzed cache misses on L1 instruction and data caches. I analyzed stalled
CPU cycles. I analyzed all memory events, branch events, etc. Basically, I tried
all perf events and spent a lot of time running benchmarks multiple times.</p>
<p>By the way, I strongly suggest running <tt class="docutils literal">perf stat</tt> with the <tt class="docutils literal"><span class="pre">--repeat</span></tt>
command line option to get an average over multiple runs and see the standard
deviation. It helps to get more reliable numbers. I even wrote a Python script
implementing <tt class="docutils literal"><span class="pre">--repeat</span></tt> (run perf multiple times, parse the output), before
seeing that it was already a builtin feature!</p>
<p>Use <tt class="docutils literal">perf list</tt> to list all available (pre-defined) events.</p>
<p>After many days, I decided to give up with perf.</p>
</div>
<div class="section" id="cachegrind">
<h2>Cachegrind</h2>
<a class="reference external image-reference" href="http://valgrind.org/"><img alt="Logo of the Valgrind project" src="https://vstinner.github.io/images/valgrind.png" /></a>
<p><a class="reference external" href="http://valgrind.org/">Valgrind</a> is a great tool known to detect memory
leaks, but it also contains gems like the <a class="reference external" href="http://valgrind.org/docs/manual/cg-manual.html">Cachegrind tool</a> which <em>simulates</em> the
CPU caches.</p>
<p>I used Cachegrind with the nice <a class="reference external" href="http://kcachegrind.sourceforge.net/">Kcachegrind GUI</a>. Sadly, I also failed to see anything
obvious in cache misses between the reference python and the patched python.</p>
</div>
<div class="section" id="strace-and-ltrace">
<h2>strace and ltrace</h2>
<img alt="strace and ltrace" src="https://vstinner.github.io/images/strace_ltrace.png" />
<p>I also tried <tt class="docutils literal">strace</tt> and <tt class="docutils literal">ltrace</tt> tools to try to see a difference in the
execution of the reference and the patched pythons. I saw different memory
addresses, but no major difference which can explain a difference of the
timing.</p>
<p>Moreover, the hot code simply does not call any syscall or library
function. It's pure CPU-bound code.</p>
</div>
<div class="section" id="compiler-options">
<h2>Compiler options</h2>
<a class="reference external image-reference" href="https://gcc.gnu.org/"><img alt="GCC logo" class="align-right" src="https://vstinner.github.io/images/gcc.png" /></a>
<p>I used <a class="reference external" href="https://gcc.gnu.org/">GCC</a> to build the code. Just in case, I tried the
LLVM compiler, but it didn't "fix" the issue.</p>
<p>I also tried different optimization levels: <tt class="docutils literal"><span class="pre">-O0</span></tt>, <tt class="docutils literal"><span class="pre">-O1</span></tt>, <tt class="docutils literal"><span class="pre">-O2</span></tt> and
<tt class="docutils literal"><span class="pre">-O3</span></tt>.</p>
<p>I read that the exact address of functions can have an impact on the CPU L1
cache: <a class="reference external" href="https://stackoverflow.com/questions/19470873/why-does-gcc-generate-15-20-faster-code-if-i-optimize-for-size-instead-of-speed">Why does gcc generate 15-20% faster code if I optimize for size instead
of speed?</a>.
I tried various values of the <tt class="docutils literal"><span class="pre">-falign-functions=N</span></tt> option (1, 2, 6, 12).</p>
<p>I also tried <tt class="docutils literal"><span class="pre">-fno-omit-frame-pointer</span></tt> (keep the frame pointer) to record the call graph with <tt class="docutils literal">perf record</tt>.</p>
<p>I also tried <tt class="docutils literal"><span class="pre">-flto</span></tt>: Link Time Optimization (LTO).</p>
<p>These compiler options didn't fix the issue.</p>
<p>The truth is out there.</p>
<p><strong>UPDATE:</strong> See also <a class="reference external" href="https://lwn.net/Articles/534735/">Rethinking optimization for size</a> article on Linux Weekly News (LWN):
<em>"Such an option has obvious value if one is compiling for a
space-constrained environment like a small device. But it turns out that, in
some situations, optimizing for space can also produce faster code."</em></p>
</div>
<div class="section" id="when-cpython-performance-depends-on-dead-code">
<h2>When CPython performance depends on dead code</h2>
<p>I continued to revert changes. In the end, my giant patch was reduced to a
very few changes which only added code that was never called (at least, I was
sure that it was not called by the <tt class="docutils literal">call_simple</tt> benchmark).</p>
<p>Let me rephrase: <em>adding dead code</em> makes Python slower. What?</p>
<p>A colleague suggested that I remove the body of the added function (replace it
with <tt class="docutils literal">return;</tt>): the code became faster. OK, now I was completely
lost. To be clear, I did not expect adding dead code to have <em>any</em> impact on
performance.</p>
<p>My email <a class="reference external" href="https://mail.python.org/pipermail/speed/2016-April/000341.html">When CPython performance depends on dead code...</a> explains how
to reproduce the issue and contains a lot of information.</p>
</div>
<div class="section" id="solution-pgo">
<h2>Solution: PGO</h2>
<p>The solution is called Profile-Guided Optimization, "PGO". The Python build
system supports it with a single command: <tt class="docutils literal">make <span class="pre">profile-opt</span></tt>, which profiles the
execution of the Python test suite.</p>
<p>Using PGO, adding dead code no longer has any impact on performance.</p>
<p>With system tuning and PGO compilation, benchmarks must be stable this
time, no? ... No, sorry, not yet. We will see more sources of noise in the
following articles ;-)</p>
</div>
My journey to stable benchmark, part 1 (system)2016-05-21T16:50:00+02:002016-05-21T16:50:00+02:00Victor Stinnertag:vstinner.github.io,2016-05-21:/journey-to-stable-benchmark-system.html<p class="first last">My journey to stable benchmark, part 1</p>
<div class="section" id="background">
<h2>Background</h2>
<p>In CPython development, it has become common to require the results of the
<a class="reference external" href="https://hg.python.org/benchmarks">CPython benchmark suite</a> ("The Grand
Unified Python Benchmark Suite") to evaluate the effect of an optimization
patch. The minimum requirement is to not introduce performance regressions.</p>
<p>I used the CPython benchmark suite and had many bad surprises when trying to
analyze (understand) the results. A change expected to be faster makes some
benchmarks slower without any obvious reason. At best, the change is faster on
some specific benchmarks, but has no impact on the other benchmarks. The
slowdown is usually between 5% and 10%. I am not comfortable with any kind of
slowdown.</p>
<p>Many benchmarks look unstable, which makes it hard to trust the overall
report. Some developers started to say that they had learned to ignore some
benchmarks known to be unstable.</p>
<p>It's not the first time that I have been totally disappointed by microbenchmark
results, so I decided to analyze the issue completely and go as deep as
possible to really understand the problem.</p>
</div>
<div class="section" id="how-to-get-stable-benchmarks-on-a-busy-linux-system">
<h2>How to get stable benchmarks on a busy Linux system</h2>
<p>Common advice for getting stable benchmarks is to stay away from the keyboard
("freeze!") and stop all other applications, so that only one application runs:
the benchmark.</p>
<p>Well, I work on a single computer and the full CPython benchmark suite
takes up to 2 hours in rigorous mode. I just cannot stop working for 2 hours
to wait for the result of a benchmark. I like running benchmarks locally: it
is convenient to run benchmarks on the same computer used for development.</p>
<p>The goal here is to "remove the noise of the system": get the same result on a
busy system as on an idle system. My simple <a class="reference external" href="https://github.com/vstinner/misc/blob/master/bin/system_load.py">system_load.py</a> program can be
used to increase the system load. For example, run <tt class="docutils literal">system_load.py 10</tt> in a
terminal to get a system load of at least 10 (busy system) and run the
benchmark in a different terminal. Use CTRL+c to stop <tt class="docutils literal">system_load.py</tt>.</p>
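<p>The <tt class="docutils literal">system_load.py</tt> script itself is not reproduced here, but a minimal sketch of the same idea, spawning a few CPU-bound processes to raise the load, could look like this (the function names are made up for the example):</p>

```python
import multiprocessing
import time

def burn(stop_at):
    # Busy-loop until the deadline to keep one core at 100%.
    while time.monotonic() < stop_at:
        pass

def load_system(nproc, seconds):
    """Spawn nproc CPU-bound processes to raise the system load."""
    stop_at = time.monotonic() + seconds
    procs = [multiprocessing.Process(target=burn, args=(stop_at,))
             for _ in range(nproc)]
    for proc in procs:
        proc.start()
    return procs
```

<p>Run it with a large <tt class="docutils literal">nproc</tt> in one terminal while the benchmark runs in another.</p>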
</div>
<div class="section" id="cpu-isolation">
<h2>CPU isolation</h2>
<p>In 2016, it is common to get a CPU with multiple physical cores. For example,
my Intel CPU has 4 physical cores and 8 logical cores thanks to
<a class="reference external" href="https://en.wikipedia.org/wiki/Hyper-threading">Hyper-Threading</a>. It is
possible to configure the Linux kernel to not schedule processes on some CPUs
using the "CPU isolation" feature. It is the <tt class="docutils literal">isolcpus</tt> parameter of the
Linux command line, the value is a list of CPUs. Example:</p>
<pre class="literal-block">
isolcpus=2,3,6,7
</pre>
<p>Check with:</p>
<pre class="literal-block">
$ cat /sys/devices/system/cpu/isolated
2-3,6-7
</pre>
<p>If you have Hyper-Threading, you must isolate the two logical cores of each
isolated physical core. You can use the <tt class="docutils literal">lscpu <span class="pre">--all</span> <span class="pre">--extended</span></tt> command to
identify physical cores. Example:</p>
<pre class="literal-block">
$ lscpu -a -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 5900,0000 1600,0000
1 0 0 1 1:1:1:0 yes 5900,0000 1600,0000
2 0 0 2 2:2:2:0 yes 5900,0000 1600,0000
3 0 0 3 3:3:3:0 yes 5900,0000 1600,0000
4 0 0 0 0:0:0:0 yes 5900,0000 1600,0000
5 0 0 1 1:1:1:0 yes 5900,0000 1600,0000
6 0 0 2 2:2:2:0 yes 5900,0000 1600,0000
7 0 0 3 3:3:3:0 yes 5900,0000 1600,0000
</pre>
<p>The physical core <tt class="docutils literal">0</tt> (CORE column) is made of two logical cores (CPU
column): <tt class="docutils literal">0</tt> and <tt class="docutils literal">4</tt>.</p>
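<p>To decide which CPUs to isolate together, the logical CPUs of each physical core can be grouped from the CPU and CORE columns. A small Python sketch (the <tt class="docutils literal">pairs</tt> data below is copied from the <tt class="docutils literal">lscpu</tt> output above):</p>

```python
def group_siblings(cpu_core_pairs):
    """Group logical CPU ids by physical core id, as in the
    CPU and CORE columns of `lscpu --all --extended`."""
    cores = {}
    for cpu, core in cpu_core_pairs:
        cores.setdefault(core, []).append(cpu)
    return cores

# (CPU, CORE) pairs from the lscpu output above:
pairs = [(0, 0), (1, 1), (2, 2), (3, 3),
         (4, 0), (5, 1), (6, 2), (7, 3)]
group_siblings(pairs)  # → {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

<p>Isolating physical cores 2 and 3 therefore means isolating logical CPUs 2, 3, 6 and 7, as in the <tt class="docutils literal">isolcpus</tt> example above.</p>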
</div>
<div class="section" id="nohz-mode">
<h2>NOHZ mode</h2>
<p>By default, the Linux kernel uses a scheduling-clock which interrupts the
running application <tt class="docutils literal">HZ</tt> times per second to run the scheduler. <tt class="docutils literal">HZ</tt> is
usually between 100 and 1000: time slice between 1 ms and 10 ms.</p>
<p>Linux supports a <a class="reference external" href="https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt">NOHZ mode</a> which is able to
disable the scheduling-clock when the system is idle to reduce the power
consumption. Linux 3.10 introduces a <a class="reference external" href="https://lwn.net/Articles/549580/">full tickless mode</a>, NOHZ full, which is able to disable the
scheduling-clock when only one application is running on a CPU.</p>
<p>NOHZ full is disabled by default. It can be enabled with the <tt class="docutils literal">nohz_full</tt>
parameter of the Linux command line, the value is a list of CPUs. Example:</p>
<pre class="literal-block">
nohz_full=2,3,6,7
</pre>
<p>Check with:</p>
<pre class="literal-block">
$ cat /sys/devices/system/cpu/nohz_full
2-3,6-7
</pre>
</div>
<div class="section" id="interrupts-irq">
<h2>Interrupts (IRQ)</h2>
<p>The Linux kernel can also be configured to not run <a class="reference external" href="https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29">interrupt (IRQ)</a>
handlers on some CPUs using the <tt class="docutils literal">/proc/irq/default_smp_affinity</tt> and
<tt class="docutils literal"><span class="pre">/proc/irq/<number>/smp_affinity</span></tt> files. The value is not a list of CPUs but
a bitmask.</p>
<p>The <tt class="docutils literal">/proc/interrupts</tt> file can be read to see the number of interrupts
per CPU.</p>
<p>Read the <a class="reference external" href="https://www.kernel.org/doc/Documentation/IRQ-affinity.txt">Linux SMP IRQ affinity</a> documentation.</p>
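<p>A list of CPUs can be converted to the expected hexadecimal bitmask with a few lines of Python (a sketch, not part of any kernel tool):</p>

```python
def cpus_to_smp_affinity(cpus):
    """Convert a list of CPU ids into the hexadecimal bitmask format
    expected by /proc/irq/*/smp_affinity (bit N set = CPU N allowed)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

# Allow IRQs only on the non-isolated CPUs 0, 1, 4 and 5:
cpus_to_smp_affinity([0, 1, 4, 5])  # → "33"
```

<p>For the isolation example above (isolated CPUs 2, 3, 6 and 7 on an 8-CPU machine), allowing IRQs only on CPUs 0, 1, 4 and 5 gives the mask <tt class="docutils literal">33</tt>.</p>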
</div>
<div class="section" id="example-of-effect-of-cpu-isolation-on-a-microbenchmark">
<h2>Example of effect of CPU isolation on a microbenchmark</h2>
<p>Example with Linux parameters:</p>
<pre class="literal-block">
isolcpus=2,3,6,7 nohz_full=2,3,6,7
</pre>
<p>Microbenchmark on an idle system (without CPU isolation):</p>
<pre class="literal-block">
$ python3 -m timeit 'sum(range(10**7))'
10 loops, best of 3: 229 msec per loop
</pre>
<p>Result on a busy system using <tt class="docutils literal">system_load.py 10</tt> and <tt class="docutils literal">find /</tt> commands
running in other terminals:</p>
<pre class="literal-block">
$ python3 -m timeit 'sum(range(10**7))'
10 loops, best of 3: 372 msec per loop
</pre>
<p>The microbenchmark is about 62% slower because of the high system load!</p>
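<p>The slowdown can be computed directly from the two timings (a quick sketch):</p>

```python
def slowdown_percent(reference, measured):
    """Relative slowdown of a measured timing vs a reference timing."""
    return round((measured - reference) / reference * 100)

slowdown_percent(229, 372)  # → 62
```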
<p>Result on the same busy system but using isolated CPUs. The <tt class="docutils literal">taskset</tt> command
allows pinning an application to specific CPUs:</p>
<pre class="literal-block">
$ taskset -c 1,3 python3 -m timeit 'sum(range(10**7))'
10 loops, best of 3: 230 msec per loop
</pre>
<p>Just to check, new run without CPU isolation:</p>
<pre class="literal-block">
$ python3 -m timeit 'sum(range(10**7))'
10 loops, best of 3: 357 msec per loop
</pre>
<p>The result with CPU isolation on a busy system is the same as the result on an
idle system! CPU isolation removes most of the noise of the system.</p>
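<p>From Python itself, the same pinning can be done with <tt class="docutils literal">os.sched_setaffinity()</tt> (available since Python 3.3, Linux only). A small sketch with a made-up helper name:</p>

```python
import os

def run_pinned(cpus, func, *args):
    """Run func pinned to the given CPUs (a Linux-only equivalent of
    `taskset -c`), restoring the previous affinity afterwards."""
    old = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)
    try:
        return func(*args)
    finally:
        os.sched_setaffinity(0, old)
```

<p>For example, <tt class="docutils literal">run_pinned({3}, sum, range(10**7))</tt> runs the benchmark body on CPU 3 only.</p>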
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>Great job Linux!</p>
<p>Ok! Now, the benchmark is super stable, no? ... Sorry, no, it's not stable yet.
I found a lot of other sources of "noise". We will see them in the following
articles ;-)</p>
</div>
Status of Python 3 in OpenStack Mitaka2016-03-02T14:00:00+01:002016-03-02T14:00:00+01:00Victor Stinnertag:vstinner.github.io,2016-03-02:/openstack_mitaka_python3.html<p class="first last">Status of Python 3 in OpenStack Mitaka</p>
<p>Now that most OpenStack services have reached feature freeze for the Mitaka
cycle (November 2015-April 2016), it's time to look back on the progress made
for Python 3 support.</p>
<p>Previous status update: <a class="reference external" href="http://techs.enovance.com/7807/python-3-status-openstack-liberty">Python 3 Status in OpenStack Liberty</a>
(September 2015).</p>
<div class="section" id="services-ported-to-python-3">
<h2>Services ported to Python 3</h2>
<p>13 services were ported to Python 3 during the Mitaka cycle:</p>
<ul class="simple">
<li>Cinder</li>
<li>Congress</li>
<li>Designate</li>
<li>Glance</li>
<li>Heat</li>
<li>Horizon</li>
<li>Manila</li>
<li>Mistral</li>
<li>Octavia</li>
<li>Searchlight</li>
<li>Solum</li>
<li>Watcher</li>
<li>Zaqar</li>
</ul>
<p>Red Hat contributed to the Cinder, Designate, Glance and Horizon service
porting efforts.</p>
<p>"Ported to Python 3" means that all unit tests pass on Python 3.4 which is
verified by a voting gate job. It is not enough to run applications in
production with Python 3. Integration and functional tests are not run on
Python 3 yet. See the section dedicated to these tests below.</p>
<p>See the <a class="reference external" href="https://wiki.openstack.org/wiki/Python3">Python 3 wiki page</a> for the
current status of the OpenStack port to Python 3; especially the list of
services ported to Python 3.</p>
</div>
<div class="section" id="services-not-ported-yet">
<h2>Services not ported yet</h2>
<p>It has become easier to list the services which are not compatible with Python 3
than to list the services already ported to Python 3!</p>
<p>9 services still need to be ported:</p>
<ul class="simple">
<li>Work-in-progress:<ul>
<li>Magnum: 83% (959 unit tests/1,161)</li>
<li>Cue: 81% (208 unit tests/257)</li>
<li>Nova: 74% (10,859 unit tests/14,726)</li>
<li>Barbican: 34% (392 unit tests/1168)</li>
<li>Murano: 29% (133 unit tests/455)</li>
<li>Keystone: 27% (1200 unit tests/4455)</li>
<li>Swift: 0% (3 unit tests/4,435)</li>
<li>Neutron-LBaaS: 0% (1 unit test/806)</li>
</ul>
</li>
<li>Port not started yet:<ul>
<li>Trove: no python34 gate</li>
</ul>
</li>
</ul>
<p>Red Hat contributed Python 3 patches to Cue, Neutron-LBaaS, Swift and Trove
during the Mitaka cycle.</p>
<p>Trove developers are ready to start the port at the beginning of the next cycle
(Newton). The py34 test environment was blocked by the MySQL-Python dependency (it
was not possible to build the test environment), but this dependency is now
skipped on Python 3. Later, it will be <a class="reference external" href="https://review.openstack.org/#/c/225915/">replaced with PyMySQL</a> on Python 2 and Python 3.</p>
</div>
<div class="section" id="python-3-issues-in-eventlet">
<h2>Python 3 issues in Eventlet</h2>
<p>Four Python 3 issues were fixed in Eventlet:</p>
<ul class="simple">
<li><a class="reference external" href="https://github.com/eventlet/eventlet/issues/295">Issue #295: Python 3: wsgi doesn't handle correctly partial write of
socket send() when using writelines()</a></li>
<li>PR #275: <a class="reference external" href="https://github.com/eventlet/eventlet/pull/275">Issue #274: Fix GreenSocket.recv_into()</a>.
Issue: <a class="reference external" href="https://github.com/eventlet/eventlet/issues/274">On Python 3, sock.makefile('rb').readline() doesn't handle blocking
errors correctly</a></li>
<li>PR #257: <a class="reference external" href="https://github.com/eventlet/eventlet/pull/257">Fix GreenFileIO.readall() for regular file</a></li>
<li><a class="reference external" href="https://github.com/eventlet/eventlet/issues/248">Issue #248: eventlet.monkey_patch() on Python 3.4 makes stdout
non-blocking</a>: pull
request <a class="reference external" href="https://github.com/eventlet/eventlet/pull/250">Fix GreenFileIO.write()</a></li>
</ul>
</div>
<div class="section" id="next-milestone-functional-and-integration-tests">
<h2>Next Milestone: Functional and integration tests</h2>
<p>The next major milestone will be to run functional and integration tests on
Python 3.</p>
<ul class="simple">
<li>functional tests are restricted to one component (ex: only Glance)</li>
<li>integration tests, like Tempest, test the integration of multiple components</li>
</ul>
<p>It is now possible to install some packages on Python 3 in DevStack using
<tt class="docutils literal">USE_PYTHON3</tt> and <tt class="docutils literal">PYTHON3_VERSION</tt> variables: <a class="reference external" href="https://review.openstack.org/#/c/181165/">Enable optional Python 3
support</a>. It means that it is
possible to run tests with some services running on Python 3, and the remaining
services on Python 2.</p>
<p>The port to Python 3 of the Glance, Heat and Neutron functional and integration
tests has already started.</p>
<p>For Glance, 159 functional tests already pass on Python 3.4.</p>
<p>Heat:</p>
<ul class="simple">
<li>project-config: <a class="reference external" href="https://review.openstack.org/#/c/228194/">Add python34 integration test job for Heat</a> (WIP)</li>
<li>heat: <a class="reference external" href="https://review.openstack.org/#/c/188033/">py34: integration tests</a>
(WIP)</li>
</ul>
<p>Neutron: the <a class="reference external" href="https://review.openstack.org/#/c/231897/">Add the functional-py34 and dsvm-functional-py34 targets to
tox.ini</a> change was merged, but a
gate job hasn't been added for it yet.</p>
<p>Another pending project is to fix issues specific to Python 3.5, but the gate
doesn’t use Python 3.5 yet. There are some minor issues, probably easy to fix.</p>
</div>
<div class="section" id="how-to-port-remaining-code">
<h2>How to port remaining code?</h2>
<p>The <a class="reference external" href="https://wiki.openstack.org/wiki/Python3">Python 3 wiki page</a> contains
a lot of information about adding Python 3 support to Python 2 code.</p>
<p>Join us in the <tt class="docutils literal"><span class="pre">#openstack-python3</span></tt> IRC channel on Freenode to discuss
Python 3!</p>
</div>
Fast _PyAccu, _PyUnicodeWriter and _PyBytesWriter APIs to produce strings in CPython2016-03-01T16:00:00+01:002016-03-01T16:00:00+01:00Victor Stinnertag:vstinner.github.io,2016-03-01:/pybyteswriter.html<p class="first last">_PyBytesWriter API</p>
<p>This article describes the _PyBytesWriter and _PyUnicodeWriter private APIs of
CPython. These APIs are designed to optimize code producing strings when the
output size is not known in advance.</p>
<p>I created the _PyUnicodeWriter API in response to complaints that Python 3 was much
slower than Python 2, especially with the new Unicode implementation (PEP 393).</p>
<div class="section" id="pyaccu-api">
<h2>_PyAccu API</h2>
<p>Issue #12778: In 2011, Antoine Pitrou found a performance issue in the JSON
serializer when serializing many small objects: it used way too much memory for
temporary objects compared to the final output string.</p>
<p>The JSON serializer used a list of strings and joined all strings at the end to
create the final output string. Pseudocode:</p>
<pre class="literal-block">
def serialize():
    pieces = [serialize(item) for item in self]
    return ''.join(pieces)
</pre>
<p>Antoine introduced an accumulator compacting the temporary list of "small"
strings and putting the result in a second list of "large" strings. At the end, the
list of "large" strings was also compacted to build the final output string.
Pseudo-code:</p>
<pre class="literal-block">
def serialize():
    small = []
    large = []
    for item in self:
        small.append(serialize(item))
        if len(small) > 10000:
            large.append(''.join(small))
            small.clear()
    if small:
        large.append(''.join(small))
    return ''.join(large)
</pre>
<p>The threshold of 10,000 strings is justified by this comment:</p>
<pre class="literal-block">
/* Each item in a list of unicode objects has an overhead (in 64-bit
* builds) of:
* - 8 bytes for the list slot
* - 56 bytes for the header of the unicode object
* that is, 64 bytes. 100000 such objects waste more than 6MB
* compared to a single concatenated string.
*/
</pre>
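<p>The overhead from the comment can be checked with a quick computation (a sketch; the constants are the ones quoted above for 64-bit builds):</p>

```python
LIST_SLOT = 8        # bytes per list slot (64-bit build)
UNICODE_HEADER = 56  # bytes per str object header (64-bit build)

def wasted_bytes(n_items):
    """Memory overhead of keeping n_items separate strings in a list,
    compared to one concatenated string (per the comment above)."""
    return n_items * (LIST_SLOT + UNICODE_HEADER)

wasted_bytes(100_000)  # → 6400000 bytes, i.e. more than 6 MB
```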
<p>Issue #12911: Antoine Pitrou found a similar performance issue in repr(list),
and so proposed to convert its accumulator code into a new private _PyAccu API.
He added the _PyAccu API to Python 2.7.5 and 3.2.3. Title of the repr(list)
change: "Fix memory consumption when calculating the repr() of huge tuples or
lists".</p>
</div>
<div class="section" id="the-pyunicodewriter-api">
<h2>The _PyUnicodeWriter API</h2>
<div class="section" id="inefficient-implementation-of-the-pep-393">
<h3>Inefficient implementation of the PEP 393</h3>
<p>In 2010, the PEP 393 proposed a completely new implementation of the Python type
<tt class="docutils literal">str</tt>, which landed in Python 3.3. The implementation of the PEP was the topic of a
Google Summer of Code 2011 with the student Torsten Becker mentored by Martin
v. Löwis (author of the PEP). The project was successful: the PEP 393 was
implemented, it worked!</p>
<p>The first implementation of the PEP 393 used a lot of 32-bit character buffers
(<tt class="docutils literal">Py_UCS4</tt>) which use a lot of memory and require expensive conversions to
8-bit (<tt class="docutils literal">Py_UCS1</tt>, ASCII and Latin1) or 16-bit (<tt class="docutils literal">Py_UCS2</tt>, BMP) characters.</p>
<p>The new internal structures for Unicode strings are very complex and
require being smart when building a new string to avoid memory copies. I
created the _PyUnicodeWriter API to try to reduce expensive memory copies, and
even completely avoid them in the best cases.</p>
</div>
<div class="section" id="design-of-the-pyunicodewriter-api">
<h3>Design of the _PyUnicodeWriter API</h3>
<p>According to benchmarks, creating a <tt class="docutils literal">Py_UCS1*</tt> buffer and then expanding it
to <tt class="docutils literal">Py_UCS2*</tt> or <tt class="docutils literal">Py_UCS4*</tt> is more efficient, since <tt class="docutils literal">Py_UCS1*</tt> is the
most common format.</p>
<p>The Python <tt class="docutils literal">str</tt> type is used for a wide range of usages. For example, it is used
for variable names in the Python language itself. Variable names
are almost always ASCII.</p>
<p>The worst case for _PyUnicodeWriter is when a long <tt class="docutils literal">Py_UCS1*</tt> buffer must be
converted to <tt class="docutils literal">Py_UCS2*</tt>, and then converted to <tt class="docutils literal">Py_UCS4*</tt>. Each conversion
is expensive: a second memory block must be allocated and the characters converted to
the new format.</p>
<p>_PyUnicodeWriter features:</p>
<ul class="simple">
<li>Optional overallocation: overallocate the buffer by 50% on Windows and 25%
on Linux. The ratio depends on the OS: it is a rough heuristic to get
the best performance out of the <tt class="docutils literal">malloc()</tt> memory allocator.</li>
<li>The buffer can be a shared read-only string if the buffer was only created
from a single string. Micro-optimization for <tt class="docutils literal">"%s" % str</tt>.</li>
</ul>
<p>The API allows disabling overallocation before the last write. For example,
<tt class="docutils literal">"%s%s" % ('abc', 'def')</tt> disables the overallocation before writing
<tt class="docutils literal">'def'</tt>.</p>
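<p>The overallocation heuristic itself is simple; a Python sketch (the function name is made up, the ratios are the ones given above):</p>

```python
def overallocate(length, ratio_percent):
    """Sketch of the overallocation heuristic: reserve extra room so
    that repeated writes do not resize the buffer every time
    (+50% on Windows, +25% on Linux for _PyUnicodeWriter)."""
    return length + length * ratio_percent // 100

overallocate(100, 25)  # → 125
overallocate(100, 50)  # → 150
```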
<p>The _PyUnicodeWriter API was introduced in issue #14716 (change 7be716a47e9d):</p>
<blockquote>
Close #14716: str.format() now uses the new "unicode writer" API instead
of the PyAccu API. For example, it makes str.format() from 25% to 30%
faster on Linux.</blockquote>
</div>
<div class="section" id="fast-path-for-ascii">
<h3>Fast-path for ASCII</h3>
<p>The cool and <em>unexpected</em> side-effect of the _PyUnicodeWriter is that many
intermediate operations got a fast-path for <tt class="docutils literal">Py_UCS1*</tt>, especially for ASCII
strings. For example, padding a number with spaces on <tt class="docutils literal">'%10i' % 123</tt> is
implemented with <tt class="docutils literal">memset()</tt>.</p>
<p>Formatting a floating point number uses the <tt class="docutils literal">PyOS_double_to_string()</tt> function
which creates an ASCII buffer. If the writer buffer uses Py_UCS1, a
<tt class="docutils literal">memcpy()</tt> is enough to copy the formatted number.</p>
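<p>The padding fast-path can be observed from Python: the formatted result below is 10 characters wide, 7 pad spaces followed by 3 digits.</p>

```python
# Right-align 123 in a field of 10 characters: when the writer buffer
# uses Py_UCS1, CPython fills the 7 pad characters with memset().
text = '%10i' % 123
print(repr(text))  # '       123'
```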
</div>
<div class="section" id="avoid-temporary-buffers">
<h3>Avoid temporary buffers</h3>
<p>Since the beginning, I had the idea of avoiding temporary buffers thanks
to a unified API to handle a "Unicode buffer". Slowly, I spread my changes
to all functions producing Unicode strings.</p>
<p>The obvious targets were <tt class="docutils literal">str % args</tt> and <tt class="docutils literal">str.format(args)</tt>. The two
instructions use very different code, but it was possible to share a few
functions, especially the code to format integers in bases 2 (binary), 8
(octal), 10 (decimal) and 16 (hexadecimal).</p>
<p>The function formatting an integer computes the exact size of the output,
requests that number of characters and then writes the characters. The characters are
written directly into the writer buffer. No temporary memory block is needed
anymore, and moreover no Py_UCS conversion is needed: <tt class="docutils literal">_PyLong_Format()</tt> writes
characters directly in the character format (Py_UCS1, Py_UCS2 or Py_UCS4) of
the buffer.</p>
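<p>The "compute the exact size first" step boils down to counting the digits of the number in the target base; a Python sketch (the helper name is made up):</p>

```python
def digits_needed(n, base):
    """Exact number of characters needed to format abs(n) in the
    given base (sketch of computing the output size up front)."""
    if n == 0:
        return 1
    n = abs(n)
    count = 0
    while n:
        n //= base
        count += 1
    return count

digits_needed(255, 16)  # → 2 ("ff")
digits_needed(255, 2)   # → 8 ("11111111")
```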
</div>
<div class="section" id="performance-compared-to-python-2">
<h3>Performance compared to Python 2</h3>
<p>The PEP 393 uses a complex storage for strings, so the exact performance
now depends on the character set used in the benchmark. For benchmarks using
a character set other than ASCII, the results are trickier to understand.</p>
<p>To compare performance with Python 2, I focused my benchmarks on ASCII. I
compared Python 3 str with Python 2 unicode, but also sometimes with Python 2 str
(bytes). On ASCII, Python 3.3 was as fast as Python 2, or even faster in some
very specific cases, but these cases are probably artificial and never seen in
real applications.</p>
<p>In the best case, Python 3 str (Unicode) was faster than Python 2 bytes.</p>
</div>
</div>
<div class="section" id="pybyteswriter-api-first-try-big-fail">
<h2>_PyBytesWriter API: first try, big fail</h2>
<p>Since Python was <em>much</em> faster with _PyUnicodeWriter, I expected to get a good
speedup with a similar API for bytes. The holy grail would be to share code for
bytes and Unicode (Spoiler alert! I reached this goal, but only for a single
function: formatting an integer in decimal).</p>
<p>My first attempt at a _PyBytesWriter API was in 2013: <a class="reference external" href="https://bugs.python.org/issue17742">Issue #17742: Add
_PyBytesWriter API</a>. But quickly, I
noticed with microbenchmarks that my change made Python slower! I spent hours
trying to understand why GCC produced less efficient machine code. When I started to
dig into the "strict aliasing" optimization issue, I realized that I had reached a
dead end.</p>
<p>Extract of the _PyBytesWriter structure:</p>
<pre class="literal-block">
typedef struct {
/* Current position in the buffer */
char *str;
/* Start of the buffer */
char *start;
/* End of the buffer */
char *end;
...
} _PyBytesWriter;
</pre>
<p>The problem is that GCC emitted less efficient machine code for the C code (see
my <a class="reference external" href="https://bugs.python.org/issue17742#msg187595">msg187595</a>):</p>
<pre class="literal-block">
while (collstart++<collend)
*writer.str++ = '?';
</pre>
<p>For the <tt class="docutils literal">writer.str++</tt> instruction, the new pointer value is written
immediately into the structure. The pointer value is read again at each iteration.
So we have 1 LOAD and 1 STORE per iteration.</p>
<p>GCC emits better code for the original C code:</p>
<pre class="literal-block">
while (collstart++<collend)
*str++ = '?';
</pre>
<p>The <tt class="docutils literal">str</tt> variable is stored in a register and the new value of <tt class="docutils literal">str</tt> is
only written <em>once</em>, at the end of the loop (instead of being written at each
iteration). The pointer value is <em>only read once</em>, before the loop. So we have 0
LOAD and 0 STORE (related to the pointer value) in the loop body.</p>
<p>It looks like an aliasing issue, but I didn't find a way to tell GCC that the
new value of <tt class="docutils literal">writer.str</tt> can be written only once, at the end of the loop. I
tried the <tt class="docutils literal">__restrict__</tt> keyword: the LOAD (getting the pointer value) was moved
out of the loop, but the STORE was still in the loop body.</p>
<p>I wrote to gcc-help: <a class="reference external" href="https://gcc.gnu.org/ml/gcc-help/2013-04/msg00192.html">Missed optimization when using a structure</a>, but I didn't get any
reply. I just gave up.</p>
</div>
<div class="section" id="pybyteswriter-api-new-try-the-good-one">
<h2>_PyBytesWriter API: new try, the good one</h2>
<p>In 2015, I created the <a class="reference external" href="https://bugs.python.org/issue25318">Issue #25318: Add _PyBytesWriter API to optimize
Unicode encoders</a>. I redesigned the API
to avoid the aliasing issue.</p>
<p>The new _PyBytesWriter doesn't contain the <tt class="docutils literal">char*</tt> pointers anymore: they are
now local variables in functions. Instead, the functions of the API require two
parameters: the bytes writer and a <tt class="docutils literal">char*</tt> parameter. Example:</p>
<pre class="literal-block">
PyObject * _PyBytesWriter_Finish(_PyBytesWriter *writer, char *str)
</pre>
<p>The idea is to keep <tt class="docutils literal">char*</tt> pointers in functions to keep the most efficient
machine code in loops. The compiler doesn't have to compute complex aliasing
rules to decide if a CPU register can be used or not.</p>
<p>_PyBytesWriter features:</p>
<ul class="simple">
<li>Optional overallocation: overallocate the buffer by 50% on Windows and 25%
on Linux. Same idea as _PyUnicodeWriter.</li>
<li>Support <tt class="docutils literal">bytes</tt> and <tt class="docutils literal">bytearray</tt> type as output format to avoid an expensive
memory copy from <tt class="docutils literal">bytes</tt> to <tt class="docutils literal">bytearray</tt>.</li>
<li>A small buffer of 512 bytes allocated on the stack to avoid the need for a
buffer allocated on the heap, before creating the final
<tt class="docutils literal">bytes</tt>/<tt class="docutils literal">bytearray</tt> object.</li>
</ul>
<p>A _PyBytesWriter structure must always be allocated on the stack (to get fast
memory allocation of the small buffer).</p>
<p>While _PyUnicodeWriter has 5 functions and 1 macro to write a single
character, write strings, write a substring, etc., _PyBytesWriter has a single
_PyBytesWriter_WriteBytes() function to write a string, since all other writes
are done directly with regular C code on <tt class="docutils literal">char*</tt> pointers.</p>
<p>The API itself doesn't make the code faster. Disabling overallocation on the
last write and using the small buffer allocated on the stack are what can make it
faster.</p>
<p>In Python 3.6, I optimized error handlers on various codecs: ASCII, Latin1
and UTF-8. For example, the UTF-8 encoder is now up to 75 times as fast for
error handlers: <tt class="docutils literal">ignore</tt>, <tt class="docutils literal">replace</tt>, <tt class="docutils literal">surrogateescape</tt>,
<tt class="docutils literal">surrogatepass</tt>. The <tt class="docutils literal">bytes % int</tt> instruction became between 30% and 50%
faster on a microbenchmark.</p>
<p>Later, I replaced the <tt class="docutils literal">char*</tt> type with <tt class="docutils literal">void*</tt> to avoid compiler warnings
in functions using <tt class="docutils literal">Py_UCS1*</tt> or <tt class="docutils literal">unsigned char*</tt>, which are unsigned types.</p>
</div>
My contributions to CPython during 2015 Q42016-03-01T15:00:00+01:002016-03-01T15:00:00+01:00Victor Stinnertag:vstinner.github.io,2016-03-01:/contrib-cpython-2015q4.html<p class="first last">My contributions to CPython during 2015 Q4</p>
<p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2015 Q4
(october, november, december):</p>
<pre class="literal-block">
hg log -r 'date("2015-10-01"):date("2015-12-31")' --no-merges -u Stinner
</pre>
<p>Statistics: 100 non-merge commits + 25 merge commits (total: 125 commits).</p>
<p>As usual, I pushed changes from various contributors and helped them polish
their changes.</p>
<p>I fought against a recursion error, a regression introduced by my recent work
on the Python test suite.</p>
<p>I focused on optimizing the bytes type during this quarter. It started with the
issue #24870 opened by <strong>INADA Naoki</strong> who works on PyMySQL: decoding bytes
using the surrogateescape error handler was the bottleneck of this benchmark.
For me, it was an opportunity for a new attempt to implement a fast "bytes
writer API".</p>
<p>I pushed my first change related to <a class="reference external" href="http://faster-cpython.readthedocs.org/fat_python.html">FAT Python</a>! Fix parser and AST:
fill lineno and col_offset of "arg" node when compiling AST from Python
objects.</p>
<p>Previous report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2015q3.html">My contributions to CPython during 2015 Q3</a>. Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2016q1.html">My contributions to
CPython during 2016 Q1</a>.</p>
<div class="section" id="recursion-error">
<h2>Recursion error</h2>
<div class="section" id="the-bug-issue-25274">
<h3>The bug: issue #25274</h3>
<p>During the previous quarter, I refactored the huge Lib/test/regrtest.py file (1,600
lines) into a new Lib/test/libregrtest/ library (8 files). The problem was that
test_sys started to crash with "Fatal Python error: Cannot recover from stack
overflow" in test_recursionlimit_recovery(). The regression was introduced by a
change to regrtest which indirectly added one more Python frame in the code
executing test_sys.</p>
<p>CPython has a limit on the depth of the call stack: <tt class="docutils literal">sys.getrecursionlimit()</tt>,
1000 by default. The limit is a weak protection against overflowing the C
stack: weak because it only counts Python frames, while intermediate C functions
may allocate a lot of memory on the stack.</p>
<p>When we reach the limit, an "overflow" flag is set, but we still allow up to
limit+50 frames, because handling a RecursionError may need a few more frames.
The overflow flag is cleared when the stack level goes below a "low-water
mark".</p>
<p>After the regrtest change, test_recursionlimit_recovery() was called at stack
level 36. Before, it was called at level 35. The test triggers a RecursionError.
The problem is that the stack level never goes below the low-water mark again,
so the overflow flag is never cleared.</p>
</div>
<div class="section" id="the-fix">
<h3>The fix</h3>
<p>Another problem is that the function used to compute the "low-water mark" was
not monotonic:</p>
<pre class="literal-block">
if limit > 100:
    low_water_mark = limit - 50
else:
    low_water_mark = 3 * limit // 4
</pre>
<p>The gap occurs near a limit of 100 frames:</p>
<ul class="simple">
<li>limit = 99 => low_water_mark = 74</li>
<li>limit = 100 => low_water_mark = 75</li>
<li>limit = 101 => low_water_mark = 51</li>
</ul>
<p>The formula was replaced with:</p>
<pre class="literal-block">
if limit > 200:
    low_water_mark = limit - 50
else:
    low_water_mark = 3 * limit // 4
</pre>
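<p>The two formulas can be compared with a few lines of Python (a sketch, not the
actual CPython code; the <tt class="docutils literal">threshold</tt> parameter is introduced here only
to compare the old and new switch points):</p>

```python
def low_water_mark(limit, threshold=100):
    # Sketch of the formula above; `threshold` selects the switch point
    # between the two branches (100 = original, 200 = fixed).
    if limit > threshold:
        return limit - 50
    else:
        return 3 * limit // 4

# Old threshold of 100: the mark *drops* from 75 to 51 when the limit
# crosses 100, so the function is not monotonic.
assert low_water_mark(100) == 75
assert low_water_mark(101) == 51

# Fixed threshold of 200: 3 * 200 // 4 == 150 == 200 - 50, so the two
# branches meet exactly and the function becomes monotonic.
assert low_water_mark(200, threshold=200) == 150
assert low_water_mark(201, threshold=200) == 151
```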
<p>The fix (<a class="reference external" href="https://hg.python.org/cpython/rev/eb0c76442cee">change eb0c76442cee</a>) modified the
<tt class="docutils literal">sys.setrecursionlimit()</tt> function to raise a <tt class="docutils literal">RecursionError</tt> exception if
the new limit is too low depending on the <em>current</em> stack depth.</p>
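<p>The effect of the fix can be observed from pure Python: setting a limit far
below the current stack depth is rejected (a small sketch, not from the original
post; the exact error message varies between versions):</p>

```python
import sys

def deep(n):
    # Recurse n frames, then try to set a recursion limit far below the
    # current stack depth: since the fix, this raises RecursionError.
    if n:
        return deep(n - 1)
    old = sys.getrecursionlimit()
    try:
        sys.setrecursionlimit(50)
        return "accepted"
    except RecursionError:
        return "rejected"
    finally:
        sys.setrecursionlimit(old)

assert deep(100) == "rejected"
```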
</div>
</div>
<div class="section" id="optimizations">
<h2>Optimizations</h2>
<p>As usual for performance work, Serhiy Storchaka was very helpful with reviews,
running independent benchmarks, etc.</p>
<p>Optimizations on the <tt class="docutils literal">bytes</tt> type, ASCII, Latin1 and UTF-8 codecs:</p>
<ul class="simple">
<li>Issue #25318: Add _PyBytesWriter API. Add a new private API to optimize
Unicode encoders. It uses a small buffer of 512 bytes allocated on the stack
and supports configurable overallocation.</li>
<li>Use _PyBytesWriter API for UCS1 (ASCII and Latin1) and UTF-8 encoders. Enable
overallocation for the UTF-8 encoder with error handlers.</li>
<li>unicode_encode_ucs1(): initialize collend to collstart+1 to not check the
current character twice, we already know that it is not ASCII.</li>
<li>Issue #25267: The UTF-8 encoder is now up to 75 times as fast for error
handlers: <tt class="docutils literal">ignore</tt>, <tt class="docutils literal">replace</tt>, <tt class="docutils literal">surrogateescape</tt>, <tt class="docutils literal">surrogatepass</tt>.
Patch co-written with <strong>Serhiy Storchaka</strong>.</li>
<li>Issue #25301: The UTF-8 decoder is now up to 15 times as fast for error
handlers: <tt class="docutils literal">ignore</tt>, <tt class="docutils literal">replace</tt> and <tt class="docutils literal">surrogateescape</tt>.</li>
<li>Issue #25318: Optimize backslashreplace and xmlcharrefreplace error handlers
in UTF-8 encoder. Optimize also backslashreplace error handler for ASCII and
Latin1 encoders.</li>
<li>Issue #25349: Optimize bytes % args using the new private _PyBytesWriter API</li>
<li>Optimize error handlers of ASCII and Latin1 encoders when the replacement
string is pure ASCII: use _PyBytesWriter_WriteBytes(), don't check individual
characters.</li>
<li>Issue #25349: Optimize bytes % int. Formatting is between 30% and 50% faster
on a microbenchmark.</li>
<li>Issue #25357: Add an optional newline parameter to binascii.b2a_base64().
base64.b64encode() uses it to avoid a memory copy.</li>
<li>Issue #25353: Optimize unicode escape and raw unicode escape encoders: use
the new _PyBytesWriter API.</li>
<li>Rewrite PyBytes_FromFormatV() using _PyBytesWriter API</li>
<li>Issue #25399: Optimize bytearray % args. Most formatting operations are now
between 2.5 and 5 times faster.</li>
<li>Issue #25401: Optimize bytes.fromhex() and bytearray.fromhex(): they are now
between 2x and 3.5x faster.</li>
</ul>
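<p>Several of these optimizations target the <tt class="docutils literal">surrogateescape</tt> error
handler. As a reminder of what it does (a quick illustration, unrelated to the
C implementation):</p>

```python
# surrogateescape maps each undecodable byte 0x80-0xFF to a lone
# surrogate U+DC80-U+DCFF, so decoding never fails and encoding with
# the same handler restores the original bytes exactly.
data = b"caf\xe9"  # Latin-1 encoded "café": invalid UTF-8

text = data.decode("utf-8", errors="surrogateescape")
assert text == "caf\udce9"
assert text.encode("utf-8", errors="surrogateescape") == data
```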
</div>
<div class="section" id="changes">
<h2>Changes</h2>
<ul class="simple">
<li>Issue #25003: On Solaris 11.3 and newer, os.urandom() now uses the getrandom()
function instead of the getentropy() function. The getentropy() function
blocks to generate very high-quality entropy; os.urandom() doesn't need
such high-quality entropy.</li>
<li>Issue #22806: Add <tt class="docutils literal">python <span class="pre">-m</span> test <span class="pre">--list-tests</span></tt> command to list tests.</li>
<li>Issue #25670: Remove duplicate getattr() in ast.NodeTransformer</li>
<li>Issue #25557: Refactor _PyDict_LoadGlobal(). Don't fallback to
PyDict_GetItemWithError() if the hash is unknown: compute the hash instead.
Add also comments to explain the _PyDict_LoadGlobal() optimization.</li>
<li>Issue #25868: Try to make test_eintr.test_sigwaitinfo() more reliable
especially on slow buildbots</li>
</ul>
</div>
<div class="section" id="changes-specific-to-python-2-7">
<h2>Changes specific to Python 2.7</h2>
<ul class="simple">
<li>Closes #25742: locale.setlocale() now accepts a Unicode string for its second
parameter.</li>
</ul>
</div>
<div class="section" id="bugfixes">
<h2>Bugfixes</h2>
<ul class="simple">
<li>Fix regrtest --coverage on Windows</li>
<li>Fix pytime on OpenBSD</li>
<li>More fixes for test_eintr on FreeBSD</li>
<li>Close #25373: Fix regrtest --slow with interrupted test</li>
<li>Issue #25555: Fix parser and AST: fill lineno and col_offset of "arg" node
when compiling AST from Python objects. First contribution related
to FAT Python ;-)</li>
<li>Issue #25696: Fix installation of Python on UNIX with make -j9.</li>
</ul>
</div>
My contributions to CPython during 2015 Q32016-02-18T01:00:00+01:002016-02-18T01:00:00+01:00Victor Stinnertag:vstinner.github.io,2016-02-18:/contrib-cpython-2015q3.html<p class="first last">My contributions to CPython during 2015 Q3</p>
<p>A few years ago, someone asked me: "Why do you contribute to CPython? Python is
perfect, there are no more bugs, right?". This article lists most of my
contributions to CPython during 2015 Q3 (July, August, September). It gives an
idea of which areas of Python are not perfect yet :-)</p>
<p>My contributions to <a class="reference external" href="https://www.python.org/">CPython</a> during 2015 Q3
(July, August, September):</p>
<pre class="literal-block">
hg log -r 'date("2015-07-01"):date("2015-09-30")' --no-merges -u Stinner
</pre>
<p>Statistics: 153 non-merge commits + 75 merge commits (total: 228 commits).</p>
<p>The major event in Python of this quarter was the release of Python 3.5.0.</p>
<p>As usual, I helped various contributors to refine their changes and I pushed
their final changes.</p>
<p>Next report: <a class="reference external" href="https://vstinner.github.io/contrib-cpython-2015q4.html">My contributions to CPython during 2015 Q4</a>.</p>
<div class="section" id="freebsd-kernel-bug">
<h2>FreeBSD kernel bug</h2>
<p>It took me a while to polish the implementation of the <a class="reference external" href="https://www.python.org/dev/peps/pep-0475/">PEP 475 (retry syscall
on EINTR)</a> especially its unit
test <tt class="docutils literal">test_eintr</tt>. The unit test is supposed to test Python but, as usual,
it also indirectly tests the operating system.</p>
<p>I spent some days investigating a random hang on the FreeBSD buildbots: <a class="reference external" href="https://bugs.python.org/issue25122">issue
#25122</a>. I quickly found the guilty test
(test_eintr.test_open), but it took me a while to understand that it was a
kernel bug in the FIFO driver. Fortunately, in the end I was able to reproduce
the bug with a short C program in my FreeBSD VM: that is the best way to ask
for a fix upstream.</p>
<p>My <a class="reference external" href="https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=203162">FreeBSD bug report #203162</a> ("when close(fd)
on a fifo fails with EINTR, the file descriptor is not really closed") was
quickly fixed. The FreeBSD team is reactive!</p>
<p>I like free software because it's possible to investigate bugs deep in the
code, and it's usually quick to get a fix.</p>
</div>
<div class="section" id="timestamp-rounding-issue">
<h2>Timestamp rounding issue</h2>
<p>Even though the <a class="reference external" href="http://bugs.python.org/issue23517">issue #23517</a> is well defined
and simple to fix, it took me days (weeks?) to understand exactly how
timestamps are supposed to be rounded and to agree on the "right" rounding
method. Alexander Belopolsky reminded me of the important property:</p>
<pre class="literal-block">
(datetime(1970,1,1) + timedelta(seconds=t)) == datetime.utcfromtimestamp(t)
</pre>
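<p>The property can be checked directly in Python (a quick sanity check, not from
the original post; note that <tt class="docutils literal">utcfromtimestamp()</tt> has been deprecated in
recent Python versions):</p>

```python
from datetime import datetime, timedelta

# Spot-check the identity for a few timestamps, including fractional
# and negative ones: adding a timedelta to the epoch must agree with
# utcfromtimestamp().
for t in (0, 1, 1234567890, 0.5, -1.25):
    assert datetime(1970, 1, 1) + timedelta(seconds=t) == datetime.utcfromtimestamp(t)
```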
<p>Tim Peters helped me understand why Python rounds to nearest with ties going
to the nearest even integer (ROUND_HALF_EVEN) in <tt class="docutils literal">round(float)</tt> and other
functions. At first glance, the rounding method doesn't look natural or
logical:</p>
<pre class="literal-block">
>>> round(0.5)
0
>>> round(1.5)
2
</pre>
<p>See my previous article on the _PyTime API for the long story of rounding
methods between Python 3.2 and Python 3.6: <a class="reference external" href="https://vstinner.github.io/pytime.html">History of the Python private C API
_PyTime</a>.</p>
</div>
<div class="section" id="enhancements">
<h2>Enhancements</h2>
<ul class="simple">
<li>type_call() now detects C bugs in type __new__() and __init__() methods.</li>
<li>Issue #25220: Enhancements of the test runner: add more info when regrtest runs
tests in parallel, fix some features of regrtest, add functional tests to
test_regrtest.</li>
</ul>
</div>
<div class="section" id="optimizations">
<h2>Optimizations</h2>
<ul class="simple">
<li>Issue #25227: Optimize ASCII and latin1 encoders with the <tt class="docutils literal">surrogateescape</tt>
error handler: the encoders are now up to 3 times as fast.</li>
</ul>
</div>
<div class="section" id="changes">
<h2>Changes</h2>
<ul class="simple">
<li>Polish the implementation of the PEP 475 (retry syscall on EINTR)</li>
<li>Work on the "What's New in Python 3.5" document: add my changes
(PEP 475, socket timeout, os.urandom)</li>
<li>Work on asyncio: fix ResourceWarning warnings, fixes specific to Windows</li>
<li>test_time: rewrite rounding tests of the private pytime API</li>
<li>Issue #24707: Remove an assertion in the monotonic clock. No longer check at
runtime that the monotonic clock doesn't go backward. Yes, it happens! It
occurred a few times per month on a Debian buildbot slave running in a VM.</li>
<li>test_eintr: replace os.fork() with subprocess (fork+exec) to make the test
more reliable</li>
</ul>
</div>
<div class="section" id="changes-specific-to-python-2-7">
<h2>Changes specific to Python 2.7</h2>
<ul class="simple">
<li>Backport python-gdb.py changes: enhance py-bt command</li>
<li>Issue #23375: Fix test_py3kwarn for modules implemented in C</li>
</ul>
</div>
<div class="section" id="bug-fixes">
<h2>Bug fixes</h2>
<ul class="simple">
<li>Closes #23247: Fix a crash in the StreamWriter.reset() of CJK codecs</li>
<li>Issue #24732, #23834: Fix sock_accept_impl() on Windows. Regression of the
PEP 475 (retry syscall on EINTR)</li>
<li>test_gdb: fix regex to parse the GDB version and fix ResourceWarning on error</li>
<li>Fix test_warnings: don't modify warnings.filters to fix random failures of
the test.</li>
<li>Issue #24891: Fix a race condition at Python startup if the file descriptor
of stdin (0), stdout (1) or stderr (2) is closed while Python is creating
sys.stdin, sys.stdout and sys.stderr objects.</li>
<li>Issue #24684: socket.socket.getaddrinfo() now calls
PyUnicode_AsEncodedString() instead of calling the encode() method of the
host, to handle correctly custom string with an encode() method which doesn't
return a byte string. The encoder of the IDNA codec is now called directly
instead of calling the encode() method of the string.</li>
<li>Issue #25118: Fix a regression of Python 3.5.0 in os.waitpid() on Windows.
Add an unit test on os.waitpid()</li>
<li>Issue #25122: Fix test_eintr, kill child process on error</li>
<li>Issue #25155: Add _PyTime_AsTimevalTime_t() function to fix a regression:
support again years after 2038.</li>
<li>Issue #25150: Hide the private _Py_atomic_xxx symbols from the public
Python.h header to fix a compilation error with OpenMP. PyThreadState_GET()
becomes an alias to PyThreadState_Get() to avoid ABI incompatibilities.</li>
<li>Issue #25003: On Solaris 11.3 or newer, os.urandom() now uses the getrandom()
function instead of the getentropy() function.</li>
</ul>
</div>
History of the Python private C API _PyTime2016-02-17T22:00:00+01:002016-02-17T22:00:00+01:00Victor Stinnertag:vstinner.github.io,2016-02-17:/pytime.html<p class="first last">History of the Python private C API _PyTime</p>
<p>I added functions to the private "pytime" library to convert timestamps from/to
various formats. I expected to spend a few days on them; in the end, I spent 3
years (2012-2015)!</p>
<div class="section" id="python-3-3">
<h2>Python 3.3</h2>
<p>In 2012, I proposed the <a class="reference external" href="https://www.python.org/dev/peps/pep-0410/">PEP 410 -- Use decimal.Decimal type for timestamps</a> because storing timestamps as
floating point numbers loses precision. The PEP was rejected because it
modified many functions and had a bad API. At least os.stat() got 3 new fields
(st_atime_ns, st_mtime_ns, st_ctime_ns): timestamps as a number of nanoseconds
(<tt class="docutils literal">int</tt>).</p>
<p>My <a class="reference external" href="https://www.python.org/dev/peps/pep-0418/">PEP 418 -- Add monotonic time, performance counter, and process time
functions</a> was accepted, Python
3.3 got a new <tt class="docutils literal">time.monotonic()</tt> function (and a few others). Again, I spent
much more time than I expected on a problem which looked simple at first
glance.</p>
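<p>The key guarantee of <tt class="docutils literal">time.monotonic()</tt> is that consecutive readings never
decrease, unlike <tt class="docutils literal">time.time()</tt>, which can jump backwards when the system
clock is adjusted:</p>

```python
import time

# Two consecutive readings of the monotonic clock: the second one can
# never be smaller than the first, even if the wall clock is changed.
t1 = time.monotonic()
t2 = time.monotonic()
assert t2 >= t1
```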
<p>With the <a class="reference external" href="http://bugs.python.org/issue14180">issue #14180</a>, I added timestamp
conversion functions to the private "pytime" API to factorize the code of
various modules. Timestamps were rounded towards +infinity (ROUND_CEILING), but
it was not a deliberate choice.</p>
</div>
<div class="section" id="python-3-4">
<h2>Python 3.4</h2>
<p>To fix correctly a performance issue in asyncio (<a class="reference external" href="https://bugs.python.org/issue20311">issue20311</a>), I added two rounding modes to the
pytime API: _PyTime_ROUND_DOWN (round towards zero), and _PyTime_ROUND_UP
(round away from zero). Polling for events (ex: using <tt class="docutils literal">select.select()</tt>) with
a non-zero timeout must not call the underlying C function in non-blocking mode.</p>
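<p>The asyncio problem can be sketched in pure Python (illustration only: the
function name is made up, and the real code lives in C):</p>

```python
import math

def timeout_to_ms(seconds, round_up=True):
    # Convert a timeout in seconds to the whole milliseconds expected by
    # poll()-style APIs.
    ms = seconds * 1000.0
    return math.ceil(ms) if round_up else math.floor(ms)

# Rounding a 0.9 ms timeout down yields 0 ms, i.e. a non-blocking call:
# the event loop would busy-loop. Rounding up keeps the call blocking.
assert timeout_to_ms(0.0009, round_up=False) == 0
assert timeout_to_ms(0.0009) == 1
```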
</div>
<div class="section" id="python-3-5">
<h2>Python 3.5</h2>
<p>When working on the <a class="reference external" href="https://bugs.python.org/issue22117">issue #22117</a>, I
noticed that the implementation of rounding methods was buggy for negative
timestamps. I replaced the _PyTime_ROUND_DOWN with _PyTime_ROUND_FLOOR (round
towards minus infinity), and _PyTime_ROUND_UP with _PyTime_ROUND_CEILING (round
towards infinity).</p>
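<p>The distinction matters for negative timestamps, where rounding towards zero
and rounding towards minus infinity disagree (a quick illustration):</p>

```python
import math

t = -1.5  # a timestamp before the epoch, in seconds

# Round towards zero ("down") vs towards minus infinity (FLOOR):
assert math.trunc(t) == -1   # old _PyTime_ROUND_DOWN behavior
assert math.floor(t) == -2   # _PyTime_ROUND_FLOOR
# For positive values, both methods agree:
assert math.trunc(1.5) == math.floor(1.5) == 1
```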
<p>This issue also introduced a new private <tt class="docutils literal">_PyTime_t</tt> type to support
nanosecond resolution. The type is an opaque integer type to store timestamps.
In practice, it's a signed 64-bit integer. Since it's an integer, it's easy and
natural to compute the sum or difference of two timestamps: <tt class="docutils literal">t1 + t2</tt> and
<tt class="docutils literal">t2 - t1</tt>. I added _PyTime_XXX() functions to create a timestamp and
_PyTime_AsXXX() functions to convert a timestamp to a different format.</p>
<p>I had to keep three _PyTime_ObjectToXXX() functions for fromtimestamp() methods
of the datetime module. These methods must support extreme timestamps (year
1..9999), whereas _PyTime_t is "limited" to a delta of +/- 292 years (year
1678..2262).</p>
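<p>The +/- 292 years range follows directly from a signed 64-bit nanosecond
counter (a back-of-the-envelope check):</p>

```python
# A signed 64-bit integer counting nanoseconds overflows after about
# 292 years in either direction.
max_ns = 2**63 - 1
seconds_per_year = 365.25 * 24 * 3600

years = max_ns / 1e9 / seconds_per_year
assert 292 < years < 293
```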
</div>
<div class="section" id="python-3-6">
<h2>Python 3.6</h2>
<p>In 2015, the <a class="reference external" href="http://bugs.python.org/issue23517">issue #23517</a> reported that
Python 2 and Python 3 don't use the same rounding method in
datetime.datetime.fromtimestamp(): there was a difference of 1 microsecond.</p>
<p>After a long discussion, I modified the fromtimestamp() methods of the datetime
module to round to nearest with ties going to the nearest even integer
(ROUND_HALF_EVEN), as done by round() in Python 3.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>It took me three years to stabilize the API and fix all issues. Well, I didn't
spend all my days on it, but it shows that handling time is not a simple issue.</p>
<p>At the Python level, nothing changed: timestamps are still stored as float
(except for the 3 new fields of os.stat()).</p>
<p>Python 3.5 only supports timezones with a fixed offset; it does not handle the
local timezone (with DST rules), for example. Timezones are still a hot topic:
the <a class="reference external" href="https://mail.python.org/mailman/listinfo/datetime-sig">datetime-sig mailing list</a> was created to
enhance timezone support in Python.</p>
</div>
Status of the FAT Python project, January 12, 20162016-01-12T13:42:00+01:002016-01-12T13:42:00+01:00Victor Stinnertag:vstinner.github.io,2016-01-12:/fat-python-status-janv12-2016.html<p class="first last">Status of the FAT Python project, January 12, 2016</p>
<a class="reference external image-reference" href="http://faster-cpython.readthedocs.org/fat_python.html"><img alt="FAT Python project" class="align-right" src="https://vstinner.github.io/images/fat_python.jpg" /></a>
<p>Previous status: <a class="reference external" href="https://vstinner.github.io/fat-python-status-nov26-2015.html">Status of the FAT Python project, November 26, 2015</a>.</p>
<div class="section" id="summary">
<h2>Summary</h2>
<ul class="simple">
<li>New optimizations implemented:<ul>
<li>constant propagation</li>
<li>constant folding</li>
<li>dead code elimination</li>
<li>simplify iterable</li>
<li>replace builtin __debug__ variable with its value</li>
</ul>
</li>
<li>Major API refactoring to make the API more generic and reusable by other
projects, and maybe for different use cases.</li>
<li>Work on 3 different Python Enhancement Proposals (PEP): API for pluggable
static optimizers and function specialization</li>
</ul>
<p>The two previously known major bugs, "Wrong Line Numbers (and Tracebacks)" and
"exec(code, dict)", are now fixed.</p>
</div>
<div class="section" id="python-enhancement-proposals-pep">
<h2>Python Enhancement Proposals (PEP)</h2>
<p>I proposed an API to support function specialization and static optimizers.
I split the changes into 3 different Python Enhancement Proposals (PEP):</p>
<ul class="simple">
<li><a class="reference external" href="https://www.python.org/dev/peps/pep-0509/">PEP 509 - Add a private version to dict</a>: "Add a new private version to
builtin <tt class="docutils literal">dict</tt> type, incremented at each change, to implement fast guards
on namespaces."</li>
<li><a class="reference external" href="https://www.python.org/dev/peps/pep-0510/">PEP 510 - Specialize functions</a>: "Add functions to the Python C
API to specialize pure Python functions: add specialized codes with guards.
It allows to implement static optimizers respecting the Python semantics."</li>
<li><a class="reference external" href="https://www.python.org/dev/peps/pep-0511/">PEP 511 - API for AST transformers</a>: "Propose an API to
support AST transformers."</li>
</ul>
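<p>The guard mechanism of PEP 509 can be modeled in pure Python (a toy sketch;
the real version field lives in the C <tt class="docutils literal">PyDictObject</tt> structure and is not
exposed at the Python level):</p>

```python
class VersionedDict(dict):
    # Toy model of PEP 509: bump a version counter on each change so a
    # guard can detect "did this namespace change?" with one comparison.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.version = 0

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.version += 1

    def __delitem__(self, key):
        super().__delitem__(key)
        self.version += 1

ns = VersionedDict(x=1)
guard = ns.version          # the guard remembers the version...
ns["y"] = 2                 # ...the namespace is modified...
assert ns.version != guard  # ...so the cheap version check fails.
```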
<p>The PEP 509 was sent to the python-ideas mailing list for a first round, and
then to the python-dev mailing list. The PEP 510 was sent to python-ideas for a
first round. The last PEP has not been published yet; I'm still working on it.</p>
</div>
<div class="section" id="major-api-refactor">
<h2>Major API refactor</h2>
<p>The API has been deeply refactored to write the Python Enhancement Proposals.</p>
<p>First set of changes for function specialization (PEP 510):</p>
<ul class="simple">
<li>astoptimizer now adds <tt class="docutils literal">import fat</tt> to optimized code when specialization is
used</li>
<li>Remove the function subtype: add directly the <tt class="docutils literal">specialize()</tt> method to
functions</li>
<li>Add support of any callable object to <tt class="docutils literal">func.specialize()</tt>, not only code
object (bytecode)</li>
<li>Create guard objects:<ul>
<li>fat.Guard</li>
<li>fat.GuardArgType</li>
<li>fat.GuardBuiltins</li>
<li>fat.GuardDict</li>
<li>fat.GuardFunc</li>
</ul>
</li>
<li>Add functions to create guards:<ul>
<li>fat.GuardGlobals</li>
<li>fat.GuardTypeDict</li>
</ul>
</li>
<li>Move code.replace_consts() to fat.replace_consts()</li>
</ul>
<p>Second set of changes for AST transformers (PEP 511):</p>
<ul class="simple">
<li>Add sys.implementation.ast_transformers and sys.implementation.optim_tag</li>
<li>Rename sys.asthook to sys.ast_transformers</li>
<li>Add -X fat command line option to enable the FAT mode: register the
astoptimizer in AST transformers</li>
<li>Replace -F command line option with -o OPTIM_TAG</li>
<li>Remove sys.flags.fat (Python flag) and Py_FatPython (C variable)</li>
<li>Rewrite how an AST transformer is registered</li>
<li>importlib skips .py if optim_tag is not 'opt' and required AST transformers
are missing. Raise ImportError if the .pyc file is missing.</li>
</ul>
<p>Third set of changes for dictionary versioning, updates after the first round
of the PEP 509 on python-ideas:</p>
<ul class="simple">
<li>Remove dict.__version__ read-only property: the version is now only
accessible from the C API</li>
<li>Change the type of the C field <tt class="docutils literal">ma_version</tt> from <tt class="docutils literal">size_t</tt> to <tt class="docutils literal">unsigned
PY_INT64_T</tt> to also use 64-bit unsigned integer on 32-bit platforms. The
risk of missing a change in a guard with a 32-bit version is too high,
whereas the risk with a 64-bit version is very very low.</li>
</ul>
<p>Fourth set of changes for function specialization, updates after the first round
of the PEP 510 on python-ideas:</p>
<ul class="simple">
<li>Remove func.specialize() and func.get_specialized() at the Python level,
replace them with C functions. Expose them again as fat.specialize(func, ...)
and fat.get_specialized(func)</li>
<li>fat.get_specialized() now returns a list of tuples, instead of a list of dict</li>
<li>Make fat.Guard type private: rename it to fat._Guard</li>
<li>Add fat.PyGuard: toy to implement a guard in pure Python</li>
<li>Guard C API: rename first_check to init and support reporting errors</li>
</ul>
</div>
<div class="section" id="change-log">
<h2>Change log</h2>
<p>Detailed changes of the FAT Python between November 24, 2015 and January 12,
2016.</p>
<div class="section" id="end-of-november">
<h3>End of November</h3>
<p>Major change:</p>
<ul class="simple">
<li>Add a __version__ read-only property to dict, remove the verdict subtype of
dict. As a consequence, dictionary guards now hold a strong reference to the
dict value</li>
</ul>
<p>Minor changes:</p>
<ul class="simple">
<li>Dynamically allocate memory for specialized code and guards; don't use
fixed-size arrays anymore</li>
<li>astoptimizer: enhance scope detection</li>
<li>optimize astoptimizer: don't copy a whole AST tree anymore with
copy.deepcopy(), only copy modified nodes.</li>
<li>Add Config.max_constant_size</li>
<li>Reenable checks on cell variables: allow cell variables if they are the same</li>
<li>Reenable optimizations on methods calling super(), but never copy super()
builtin to constants. If super() is replaced with a string, the required free
variable (reference to the current class) is not created by the compiler</li>
<li>Add PureBuiltin config</li>
<li>NodeVisitor now calls generic_visit() before visit_XXX()</li>
<li>Loop unrolling now also optimizes tuple iterators</li>
<li>At the end of Python initialization, create a copy of the builtins dictionary
to be able later to detect if a builtin name was replaced.</li>
<li>Implement collections.UserDict.__version__</li>
</ul>
</div>
<div class="section" id="december-first-half">
<h3>December (first half)</h3>
<p>Major changes:</p>
<ul class="simple">
<li>Implement 4 new optimizations:<ul>
<li>constant propagation</li>
<li>constant folding</li>
<li>replace builtin __debug__ variable with its value</li>
<li>dead code elimination</li>
</ul>
</li>
<li>Add support of per module configuration using an __astoptimizer__ variable</li>
<li>code.co_lnotab now supports negative line number deltas. Change the type of
the line number delta in co_lnotab from unsigned 8-bit integer to signed 8-bit
integer. This change fixes almost all issues with line numbers.</li>
</ul>
<p>Minor changes:</p>
<ul class="simple">
<li>Change .pyc magic number to 3600</li>
<li>Remove unused fat.specialized_method() function</li>
<li>Remove Lib/fat.py, rename Modules/_fat.c to Modules/fat.c: fat module is now
only implemented in C</li>
<li>Fix more tests of the Python test suite</li>
<li>A builtin guard now adds a guard on globals. Ignore also the specialization
if globals()[name] already exists.</li>
<li>Ignore duplicated guards</li>
<li>Implement namespace following the control flow for constant propagation</li>
<li>Config.max_int_bits becomes a simple integer</li>
<li>Fix bytecode compilation for tuple constants. Don't merge (0, 0) and (0.0,
0.0) constants, they are different.</li>
<li>Call more builtin functions</li>
<li>Optimize the optimizer: write a metaclass to discover visitors when the class
is created, not when the class is instantiated</li>
</ul>
</div>
<div class="section" id="december-second-half">
<h3>December (second half)</h3>
<p>Major changes:</p>
<ul class="simple">
<li>Implement "simplify iterable" optimization. The loop unrolling optimization
now relies on it to replace <tt class="docutils literal">range(n)</tt>.</li>
<li>Split the function optimization in two stages: first apply optimizations
which don't require specialization, then apply optimizations which
require specialization.</li>
<li>Replace the builtin __fat__ variable with a new sys.flags.fat flag</li>
</ul>
<p>Minor changes:</p>
<ul class="simple">
<li>Extend optimizations to optimize more cases (more builtins, more loop
unrolling, remove more dead code, etc.)</li>
<li>Add Config.logger attribute. astoptimize logs into sys.stderr when Python is
started in verbose mode (python3 -v)</li>
<li>Move func.patch_constants() to code.replace_consts()</li>
<li>Enhance marshal to fix tests: call frozenset() to get the empty frozenset
singleton</li>
<li>Don't remove code which must raise a SyntaxError. Don't remove code
containing the continue instruction.</li>
<li>Restrict GlobalNonlocalVisitor to the current namespace</li>
<li>Emit logs when optimizations are skipped</li>
<li>Use some maths to avoid optimizing pow() if the result is an integer and
will be larger than the configured limit. For example, don't optimize 2 ** (2**100).</li>
</ul>
</div>
<div class="section" id="january">
<h3>January</h3>
<p>Major changes:</p>
<ul class="simple">
<li>astoptimizer now produces a single builtin guard with all names,
instead of a guard per name.</li>
<li>Major API refactoring detailed in a dedicated section above</li>
</ul>
<p>Minor changes:</p>
<ul class="simple">
<li>Start to write PEPs</li>
<li>Dictionary guards now expect a list of names, instead of a single name, to
reduce the cost of guards.</li>
<li>GuardFunc now uses a strong reference to the function, instead of a weak
reference to simplify the code</li>
<li>Initialize dictionary version to 0</li>
</ul>
</div>
</div>
Status of the FAT Python project, November 26, 20152015-11-26T17:30:00+01:002015-11-26T17:30:00+01:00Victor Stinnertag:vstinner.github.io,2015-11-26:/fat-python-status-nov26-2015.html<p class="first last">Status of the FAT Python project, November 26, 2015</p>
<a class="reference external image-reference" href="http://faster-cpython.readthedocs.org/fat_python.html"><img alt="FAT Python project" class="align-right" src="https://vstinner.github.io/images/fat_python.jpg" /></a>
<p>Previous status: [python-dev] <a class="reference external" href="https://mail.python.org/pipermail/python-dev/2015-November/142113.html">Second milestone of FAT Python</a>
(Nov 4, 2015).</p>
<div class="section" id="documentation">
<h2>Documentation</h2>
<p>I combined the documentation of various optimization projects into a single
documentation: <a class="reference external" href="http://faster-cpython.readthedocs.org/">Faster CPython</a>.
My previous optimizations projects:</p>
<ul class="simple">
<li><a class="reference external" href="http://faster-cpython.readthedocs.org/old_ast_optimizer.html">"old" astoptimizer</a> (now
replaced with a "new" astoptimizer included in the FAT Python)</li>
<li><a class="reference external" href="http://faster-cpython.readthedocs.org/registervm.html">registervm</a></li>
<li><a class="reference external" href="http://faster-cpython.readthedocs.org/readonly.html">read-only Python</a></li>
</ul>
<p>The FAT Python project has its own page: <a class="reference external" href="http://faster-cpython.readthedocs.org/fat_python.html">FAT Python project</a>.</p>
</div>
<div class="section" id="copy-builtins-to-constants-optimization">
<h2>Copy builtins to constants optimization</h2>
<p>The <tt class="docutils literal">LOAD_GLOBAL</tt> instruction is used to load a builtin function. The
instruction requires two dictionary lookups: one in the global namespace (which
almost always fails) and then one in the builtin namespace.</p>
<p>It's rare to replace builtins, so the idea here is to replace the dynamic
<tt class="docutils literal">LOAD_GLOBAL</tt> instruction with a static <tt class="docutils literal">LOAD_CONST</tt> instruction which
loads the function from a C array, a fast O(1) lookup.</p>
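<p>The <tt class="docutils literal">LOAD_GLOBAL</tt> instruction is easy to observe with the
<tt class="docutils literal">dis</tt> module (opcode details vary across Python versions):</p>

```python
import dis

def log(message):
    print(message)

# The name "print" is resolved at run time by LOAD_GLOBAL, which looks
# it up in the global namespace and then in the builtins namespace.
opnames = [instr.opname for instr in dis.get_instructions(log)]
assert "LOAD_GLOBAL" in opnames
```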
<p>It is not possible to inject a builtin function during the compilation. Python
code objects are serialized by the marshal module, which only supports simple
types like integers, strings and tuples, not functions. The trick is to modify
the constants at runtime when the module is loaded. I added a new
<tt class="docutils literal">patch_constants()</tt> method to functions.</p>
<p>Example:</p>
<pre class="literal-block">
def log(message):
    print(message)
</pre>
<p>This function is specialized to:</p>
<pre class="literal-block">
def log(message):
    'LOAD_GLOBAL print'(message)

log.patch_constants({'LOAD_GLOBAL print': print})
</pre>
<p>The specialized bytecode uses two guards on builtin and global namespaces to
disable the optimization if the builtin function is replaced.</p>
<p>See <a class="reference external" href="https://faster-cpython.readthedocs.org/fat_python.html#copy-builtin-functions-to-constants">Copy builtin functions to constants</a>
for more information.</p>
</div>
<div class="section" id="loop-unrolling-optimization">
<h2>Loop unrolling optimization</h2>
<p>A simple optimization is to "unroll" a loop to reduce its cost. The
optimization generates assignment statements (for the loop index variable)
and duplicates the loop body.</p>
<p>Example with a tuple iterator:</p>
<pre class="literal-block">
def func():
    for i in (1, 2, 3):
        print(i)
</pre>
<p>The function is specialized to:</p>
<pre class="literal-block">
def func():
    i = 1
    print(i)
    i = 2
    print(i)
    i = 3
    print(i)
</pre>
<p>If the iterator uses the builtin <tt class="docutils literal">range</tt> function, two guards are
required on the builtin and global namespaces.</p>
<p>With a tuple iterator, as in this example, no guard is needed: the code is
always optimized.</p>
<p>See <a class="reference external" href="https://faster-cpython.readthedocs.org/fat_python.html#loop-unrolling">Loop unrolling</a>
for more information.</p>
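<p>The tuple case can be sketched with a small <tt class="docutils literal">ast.NodeTransformer</tt>.
This is a toy illustration, not FAT Python's implementation: it only handles a
constant-tuple iterator with a simple name target and no <tt class="docutils literal">else</tt>
clause (and would mishandle <tt class="docutils literal">break</tt> and <tt class="docutils literal">continue</tt>).</p>

```python
import ast
import copy

class UnrollTupleLoops(ast.NodeTransformer):
    """Toy unroller: 'for i in (c1, c2, ...)' -> assignments + body copies."""
    def visit_For(self, node):
        self.generic_visit(node)
        if (isinstance(node.iter, ast.Tuple)
                and all(isinstance(e, ast.Constant) for e in node.iter.elts)
                and isinstance(node.target, ast.Name)
                and not node.orelse):
            unrolled = []
            for elt in node.iter.elts:
                # i = <constant>
                target = ast.Name(id=node.target.id, ctx=ast.Store())
                unrolled.append(ast.Assign(targets=[target], value=elt))
                # duplicated loop body
                unrolled.extend(copy.deepcopy(node.body))
            return unrolled
        return node

source = "def func():\n    for i in (1, 2, 3):\n        result.append(i * 2)\n"
tree = ast.fix_missing_locations(UnrollTupleLoops().visit(ast.parse(source)))
namespace = {'result': []}
exec(compile(tree, '<unrolled>', 'exec'), namespace)
namespace['func']()
# namespace['result'] is now [2, 4, 6]
```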
</div>
<div class="section" id="lot-of-enhancements-of-the-ast-optimizer">
<h2>Many enhancements to the AST optimizer</h2>
<p>New optimizations helped to find bugs in the <a class="reference external" href="https://faster-cpython.readthedocs.org/new_ast_optimizer.html">AST optimizer</a>. Many fixes
and various enhancements were done in the AST optimizer.</p>
<p>The number of lines of code more than doubled, from 500 to 1200 lines.</p>
<p>Optimization: <tt class="docutils literal">copy.deepcopy()</tt> is no longer used to duplicate the full
tree. The new <tt class="docutils literal">NodeTransformer</tt> class copies a single node only if at least
one field is modified.</p>
<p>The <tt class="docutils literal">VariableVisitor</tt> class, which detects local and global variables, was
heavily modified. It understands many more kinds of AST nodes: <tt class="docutils literal">For</tt>, <tt class="docutils literal">AugAssign</tt>,
<tt class="docutils literal">AsyncFunctionDef</tt>, <tt class="docutils literal">ClassDef</tt>, etc. It now also detects non-local
variables (the <tt class="docutils literal">nonlocal</tt> keyword). The scope is now limited to the current
function: the visitor doesn't enter nested <tt class="docutils literal">DictComp</tt>, <tt class="docutils literal">FunctionDef</tt> or
<tt class="docutils literal">Lambda</tt> nodes, since these create a new, separate namespace.</p>
<p>The optimizer is now able to optimize a function without guards: this is needed
to unroll a loop using a tuple as iterator.</p>
</div>
<div class="section" id="known-bugs">
<h2>Known bugs</h2>
<p>See the <a class="reference external" href="https://hg.python.org/sandbox/fatpython/file/0d30dba5fa64/TODO.rst">TODO.rst file</a> for
known bugs.</p>
<div class="section" id="wrong-line-numbers-and-tracebacks">
<h3>Wrong Line Numbers (and Tracebacks)</h3>
<p>AST nodes have <tt class="docutils literal">lineno</tt> and <tt class="docutils literal">col_offset</tt> fields, so an AST optimizer is not
"supposed" to break line numbers. In practice, line numbers, and therefore
tracebacks, are completely wrong in FAT mode. The problem is probably that the
AST optimizer can copy and move instructions, so line numbers are no longer
monotonic. CPython probably doesn't handle this case (negative line delta).</p>
<p>It should be possible to fix it, but right now I prefer to focus on new
optimizations and fix other bugs.</p>
</div>
<div class="section" id="exec-code-dict">
<h3>exec(code, dict)</h3>
<p>In FAT mode, some optimizations require guards on the global namespace.
If <tt class="docutils literal">exec()</tt> is called with a Python <tt class="docutils literal">dict</tt> for globals, an exception
is raised because <tt class="docutils literal">func.specialize()</tt> requires a <tt class="docutils literal">fat.verdict</tt> for
globals.</p>
<p>It's not possible to implicitly convert the <tt class="docutils literal">dict</tt> to a <tt class="docutils literal">fat.verdict</tt>,
because the <tt class="docutils literal">dict</tt> is expected to be mutated, and the guards will be on the
<tt class="docutils literal">fat.verdict</tt>, not on the original <tt class="docutils literal">dict</tt>.</p>
<p>I worked around the bug by manually creating a <tt class="docutils literal">fat.verdict</tt> in FAT mode,
instead of a <tt class="docutils literal">dict</tt>.</p>
<p>This bug will go away if the versioning feature is moved directly into
the builtin <tt class="docutils literal">dict</tt> type (and the <tt class="docutils literal">fat.verdict</tt> type is removed).</p>
</div>
</div>
Port your Python 2 applications to Python 3 with sixer2015-06-16T15:00:00+02:002015-06-16T15:00:00+02:00Victor Stinnertag:vstinner.github.io,2015-06-16:/python3-sixer.html<p class="first last">Port your Python 2 applications to Python 3 with sixer</p>
<div class="section" id="from-2to3-to-2to6">
<h2>From 2to3 to 2to6</h2>
<p>When Python 3.0 was released, the official statement was to port your
application using <a class="reference external" href="https://docs.python.org/3.5/library/2to3.html">2to3</a> and
drop Python 2 support. It didn't work because you had to port all libraries
first. If a library drops Python 2 support, existing applications running on
Python 2 cannot use this library anymore.</p>
<p>This chicken-and-egg issue was solved by the creation of the <a class="reference external" href="https://pythonhosted.org/six/">six module</a> by <a class="reference external" href="https://benjamin.pe/">Benjamin Peterson</a>. Thank you so much Benjamin! Using the six module, it
is possible to write a single code base working on Python 2 and Python 3.</p>
<p>2to3 was hacked to create the <a class="reference external" href="http://python-modernize.readthedocs.org/">modernize</a> and <a class="reference external" href="https://github.com/limodou/2to6">2to6</a> projects to <em>add Python 3 support</em> without
losing Python 2 support. Problem solved!</p>
</div>
<div class="section" id="creation-of-the-sixer-tool">
<h2>Creation of the sixer tool</h2>
<p>Problem solved? Well, not for my specific use case. I'm porting the huge
OpenStack project to Python 3. modernize and 2to6 modify a lot of things at
once, add unwanted changes (ex: add <tt class="docutils literal">from __future__ import absolute_import</tt>
at the top of each file), and don't respect the OpenStack coding style
(especially the <a class="reference external" href="http://docs.openstack.org/developer/hacking/#imports">complex rules to sort and group Python imports</a>).</p>
<p>I wrote the <a class="reference external" href="https://pypi.python.org/pypi/sixer">sixer</a> project to
<em>generate</em> patches for OpenStack. The problem is that OpenStack code changes
very quickly, so it's common to have to fix conflicts the day after submitting
a change. At the beginning, it took at least one week to get Python 3 changes
merged, whereas many changes are merged every day, so being able to regenerate
patches helped a lot.</p>
<p>I created the <a class="reference external" href="https://pypi.python.org/pypi/sixer">sixer</a> tool using a list
of regular expressions to replace a pattern with another. For example, it
replaces <tt class="docutils literal">dict.itervalues()</tt> with <tt class="docutils literal">six.itervalues(dict)</tt>. The code was
very simple. The most difficult part was to respect the OpenStack coding
style for Python imports.</p>
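<p>The regular-expression approach can be sketched in a few lines. The pattern
below is a simplified illustration, not sixer's actual code:</p>

```python
import re

# Simplified sketch of sixer's approach: rewrite "obj.itervalues()" into
# "six.itervalues(obj)" with a single regular expression.
ITERVALUES = re.compile(r'(\w+(?:\.\w+)*)\.itervalues\(\)')

def replace_itervalues(source):
    return ITERVALUES.sub(r'six.itervalues(\1)', source)

replace_itervalues("for value in counters.itervalues():")
# → "for value in six.itervalues(counters):"
```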
<p>sixer has been a success since its creation: it helped me fix all the obvious
Python 3 issues: replace <tt class="docutils literal">unicode(x)</tt> with <tt class="docutils literal">six.text_type(x)</tt>, replace
<tt class="docutils literal">dict.itervalues()</tt> with <tt class="docutils literal">six.itervalues(dict)</tt>, etc. These changes are
simple, but it's boring to have to modify manually many files. The OpenStack
Nova project has almost 1500 Python files for example.</p>
<p>The development version of sixer supports the following operations:</p>
<ul class="simple">
<li>all</li>
<li>basestring</li>
<li>dict0</li>
<li>dict_add</li>
<li>iteritems</li>
<li>iterkeys</li>
<li>itertools</li>
<li>itervalues</li>
<li>long</li>
<li>next</li>
<li>raise</li>
<li>six_moves</li>
<li>stringio</li>
<li>unicode</li>
<li>urllib</li>
<li>xrange</li>
</ul>
</div>
<div class="section" id="creation-of-the-sixer-test-suite">
<h2>Creation of the Sixer Test Suite</h2>
<p>Slowly, I added more and more patterns to sixer. The code became too complex
to be able to check regressions manually, so I also started to write unit
tests. Now each operation has at least one unit test. Some complex operations
have four tests or more.</p>
<p>At the beginning, tests called the Python functions directly. It is fast and
convenient, but it failed to catch regressions in the command line program.
So I added tests running sixer as a black box: pass an input file and check
the output file. Then I added specific tests on the code parsing command line
options.</p>
</div>
<div class="section" id="the-new-all-operation">
<h2>The new "all" operation</h2>
<p>At the beginning, I used sixer to generate a patch for a single pattern. For
example, replace <tt class="docutils literal">unicode()</tt> in a whole project.</p>
<p>Later, I started to use it differently: I fixed all Python 3 issues at once,
but only in some selected files. I did that when we reached a minimum set of
tests which pass on Python 3 to have a green py34 check on Jenkins. Then we
ported tests one by one. It's better to write short patches, they are easier
and faster to review. And the review process is the bottleneck of the
OpenStack development process.</p>
<p>To fix all Python 3 issues at once, I added an <tt class="docutils literal">all</tt> operation which simply applies
sequentially each operation. So <tt class="docutils literal">sixer</tt> can now be used as <tt class="docutils literal">modernize</tt> and
<tt class="docutils literal">2to6</tt> to fix all Python 3 issues at once in a whole project.</p>
<p>I also added the ability to pass filenames instead of having to pass a
directory to modify all files in all subdirectories.</p>
</div>
<div class="section" id="new-urllib-six-moves-and-stringio-operations">
<h2>New urllib, six_moves and stringio operations</h2>
<div class="section" id="urllib">
<h3>urllib</h3>
<p>I tried to keep the sixer code simple. But some changes are boring to write,
like replacing <tt class="docutils literal">urllib</tt> imports with <tt class="docutils literal">six.moves.urllib</tt> imports. Python 2 has 3
modules (<tt class="docutils literal">urllib</tt>, <tt class="docutils literal">urllib2</tt>, <tt class="docutils literal">urlparse</tt>), whereas Python 3 uses a
single <tt class="docutils literal">urllib</tt> namespace with submodules (<tt class="docutils literal">urllib.request</tt>,
<tt class="docutils literal">urllib.parse</tt>, <tt class="docutils literal">urllib.error</tt>). Some Python 2 functions moved to one
submodule, whereas others moved to another submodule. It requires knowing both
the old and the new layout well.</p>
<p>After losing many hours writing patches for <tt class="docutils literal">urllib</tt> manually, I decided
to add a <tt class="docutils literal">urllib</tt> operation. In fact, it didn't take long to implement,
compared to the time spent writing patches manually.</p>
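<p>The operation boils down to a mapping from each Python 2 name to its location
under <tt class="docutils literal">six.moves.urllib</tt>. The sketch below shows the idea with only
a handful of entries (sixer's real table is much larger, and its rewriting is more
careful than a plain substitution):</p>

```python
import re

# A few entries of the Python 2 -> six.moves.urllib mapping; the real
# sixer table covers many more names.
URLLIB_MAP = {
    'urllib2.urlopen': 'six.moves.urllib.request.urlopen',
    'urllib2.HTTPError': 'six.moves.urllib.error.HTTPError',
    'urlparse.urlparse': 'six.moves.urllib.parse.urlparse',
    'urllib.quote': 'six.moves.urllib.parse.quote',
}

def replace_urllib(source):
    for old, new in URLLIB_MAP.items():
        source = re.sub(re.escape(old) + r'\b', new, source)
    return source

replace_urllib("parts = urlparse.urlparse(url)")
# → "parts = six.moves.urllib.parse.urlparse(url)"
```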
</div>
<div class="section" id="stringio">
<h3>stringio</h3>
<p>Handling StringIO is also a little bit tricky because <tt class="docutils literal">StringIO.StringIO</tt> and
<tt class="docutils literal">cStringIO.StringIO</tt> don't have the same performance on Python 2. Producing
patches without killing performance requires picking the right module or
symbol from six: <tt class="docutils literal">six.StringIO()</tt> or <tt class="docutils literal">six.moves.cStringIO</tt>, for
example.</p>
</div>
<div class="section" id="six-moves">
<h3>six_moves</h3>
<p>The generic <tt class="docutils literal">six_moves</tt> operation replaces various Python 2 imports with
imports from <tt class="docutils literal">six.moves</tt>:</p>
<ul class="simple">
<li>BaseHTTPServer</li>
<li>ConfigParser</li>
<li>Cookie</li>
<li>HTMLParser</li>
<li>Queue</li>
<li>SimpleHTTPServer</li>
<li>SimpleXMLRPCServer</li>
<li>__builtin__</li>
<li>cPickle</li>
<li>cookielib</li>
<li>htmlentitydefs</li>
<li>httplib</li>
<li>repr</li>
<li>xmlrpclib</li>
</ul>
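<p>The idea behind this operation can be sketched as a small lookup table. The
code below is a simplified illustration, not sixer's implementation: it holds only a
few entries and only rewrites bare <tt class="docutils literal">import</tt> lines, without touching
the usages of the module:</p>

```python
# A few entries of the Python 2 module -> six.moves mapping.
SIX_MOVES = {
    'ConfigParser': 'six.moves.configparser',
    'Queue': 'six.moves.queue',
    'httplib': 'six.moves.http_client',
    'xmlrpclib': 'six.moves.xmlrpc_client',
}

def replace_import(line):
    words = line.split()
    if len(words) == 2 and words[0] == 'import' and words[1] in SIX_MOVES:
        # Import the six.moves name directly, e.g. "from six.moves import queue"
        module = SIX_MOVES[words[1]].rsplit('.', 1)[1]
        return 'from six.moves import {}'.format(module)
    return line

replace_import('import ConfigParser')
# → 'from six.moves import configparser'
```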
</div>
</div>
<div class="section" id="kiss-emit-warnings-instead-of-complex-implementation">
<h2>KISS: emit warnings instead of complex implementation</h2>
<p>As I wrote, I tried to keep sixer simple (KISS principle: Keep It Simple,
Stupid). I'm also lazy: I didn't try to write a perfect tool, and I don't want
to spend hours on the sixer project.</p>
<p>When it is too tricky to make a decision or to implement a pattern, sixer
emits a "warning" instead. For example, a warning is emitted on
<tt class="docutils literal">def next(self):</tt> to remind you that a <tt class="docutils literal">__next__ = next</tt> alias is probably
needed on this class for Python 3.</p>
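<p>Such a warning can be sketched as a simple scan over the source. The function
below is illustrative, not sixer's actual implementation:</p>

```python
import re

# Flag "def next(self):" methods: on Python 3 the iterator protocol calls
# __next__(), so the class probably needs a "__next__ = next" alias.
NEXT_METHOD = re.compile(r'^\s+def next\(self\):', re.MULTILINE)

def check_next(source, filename):
    warnings = []
    for match in NEXT_METHOD.finditer(source):
        line = source.count('\n', 0, match.start()) + 1
        warnings.append('%s:%s: "def next(self):" may need a '
                        '"__next__ = next" alias for Python 3' % (filename, line))
    return warnings
```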
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>The sixer tool is incomplete and generates invalid changes. For example, it
replaces patterns in comments, docstrings and strings, whereas usually these
changes don't make sense. But I'm happy because the tool helped me a lot
to port OpenStack: it saved me hours.</p>
<p>I hope that the tool will now be useful to others! Don't hesitate to give me
feedback.</p>
</div>