Add PyUnicodeWriter C API

In May, I designed a new C API to build a Python str object: the PyUnicodeWriter API. Many people were involved in the design and the discussion was quite long. The C API Working Group helped to design a better and more convenient API. It took me basically a whole month to get the design done and fully implement the API.

Painting: La Danse by Matisse (1910).

Initial API

Building a Python str object in C is not easy. I wrote the private _PyUnicodeWriter C API 9 years ago (see my previous article), but it's not usable outside Python since it's a private API. So I proposed to make it public.

On May 19, I created an issue and a pull request to discuss the API. The initial API was:

typedef struct PyUnicodeWriter PyUnicodeWriter;

PyAPI_FUNC(PyUnicodeWriter*) PyUnicodeWriter_Create(void);
PyAPI_FUNC(void) PyUnicodeWriter_Free(PyUnicodeWriter *writer);
PyAPI_FUNC(PyObject*) PyUnicodeWriter_Finish(PyUnicodeWriter *writer);
PyAPI_FUNC(void) PyUnicodeWriter_SetOverallocate(
    PyUnicodeWriter *writer,
    int overallocate);

PyAPI_FUNC(int) PyUnicodeWriter_WriteChar(
    PyUnicodeWriter *writer,
    Py_UCS4 ch);
PyAPI_FUNC(int) PyUnicodeWriter_WriteStr(
    PyUnicodeWriter *writer,
    PyObject *str);
PyAPI_FUNC(int) PyUnicodeWriter_WriteSubstring(
    PyUnicodeWriter *writer,
    PyObject *str,
    Py_ssize_t start,
    Py_ssize_t stop);
PyAPI_FUNC(int) PyUnicodeWriter_WriteASCIIString(
    PyUnicodeWriter *writer,
    const char *ascii,
    Py_ssize_t len);

API changes

PyUnicodeWriter_WriteUTF8()

My first implementation assumed that the caller would only pass ASCII characters to PyUnicodeWriter_WriteASCIIString(), which is a bold assumption: it would crash if non-ASCII characters were passed by mistake. UTF-8 is more common and Python has a fast UTF-8 decoder. The first change was to replace PyUnicodeWriter_WriteASCIIString() with PyUnicodeWriter_WriteUTF8().
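To make the hazard concrete, here is a minimal sketch of the check that every caller of an ASCII-only function would have had to get right. The helper name is_ascii() is hypothetical and is not part of the Python C API; it only shows that a single byte at or above 0x80 (as in UTF-8 encoded "café") breaks the assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.
   Return 1 if all len bytes of s are ASCII (below 0x80), 0 otherwise. */
int
is_ascii(const char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)s[i] >= 0x80) {
            return 0;  /* non-ASCII byte found */
        }
    }
    return 1;
}
```

With PyUnicodeWriter_WriteUTF8(), the caller no longer needs such a check: non-ASCII input is decoded instead of being trusted blindly.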

PyUnicodeWriter_WriteStr()

I really wanted PyUnicodeWriter_WriteStr() to only accept a Python str object. Others insisted on accepting any Python object and writing str(obj) instead. I changed PyUnicodeWriter_WriteStr() to implement that.

PyUnicodeWriter_WriteRepr()

Since str(obj) was there, repr(obj) became the next question: should we add it? It was decided to add PyUnicodeWriter_WriteRepr(obj) to write repr(obj). It's convenient to use.

PyUnicodeWriter_Format()

During the discussion, it was proposed to add many functions to write various formats. I proposed adding PyUnicodeWriter_FromFormat(format, ...), similar to PyUnicode_FromFormat(). It was decided to add it under the name PyUnicodeWriter_Format(). Its implementation is efficient since multiple formats write directly into the writer, without having to create a temporary string object.

PyUnicodeWriter_Create()

The initial version of PyUnicodeWriter_Create() had no argument. I was asked to add a size parameter to preallocate the internal buffer: PyUnicodeWriter_Create(size).

Remove PyUnicodeWriter_SetOverallocate()

I tried to justify that calling PyUnicodeWriter_SetOverallocate(0) before the last write was a killer feature for performance, but it looked too complicated to others and it was decided to simply remove this API.
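The idea behind that removed function can be sketched with a toy growable buffer. This is only a model under stated assumptions, not the CPython implementation: the ToyWriter type, the toy_* functions and the 25% headroom are all invented for illustration. Overallocation saves reallocations on intermediate writes, but if it is still enabled on the last write, the buffer ends up too large and must be shrunk when finishing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of a growable buffer; not the CPython implementation. */
typedef struct {
    char *buf;
    size_t size;       /* allocated bytes */
    size_t pos;        /* bytes written so far */
    int overallocate;  /* non-zero: reserve extra room when growing */
    int reallocs;      /* realloc() calls, counted for illustration */
} ToyWriter;

int
toy_write(ToyWriter *w, const char *s, size_t len)
{
    if (len == 0) {
        return 0;
    }
    size_t needed = w->pos + len;
    if (needed > w->size) {
        size_t newsize = needed;
        if (w->overallocate) {
            newsize += needed / 4;  /* 25% headroom: an arbitrary choice */
        }
        char *p = realloc(w->buf, newsize);
        if (p == NULL) {
            return -1;
        }
        w->buf = p;
        w->size = newsize;
        w->reallocs++;
    }
    memcpy(w->buf + w->pos, s, len);
    w->pos = needed;
    assert(w->pos <= w->size);
    return 0;
}

/* Shrink the buffer to its exact size if headroom is left over. */
int
toy_finish(ToyWriter *w)
{
    if (w->pos > 0 && w->size > w->pos) {
        char *p = realloc(w->buf, w->pos);
        if (p == NULL) {
            return -1;
        }
        w->buf = p;
        w->size = w->pos;
        w->reallocs++;
    }
    return 0;
}

/* Write two pieces and return the number of realloc() calls (-1 on error).
   If exact_last_write is non-zero, overallocation is disabled before the
   last write, like PyUnicodeWriter_SetOverallocate(0) was meant to do. */
int
toy_build(int exact_last_write)
{
    ToyWriter w = {NULL, 0, 0, 1, 0};
    int count = -1;
    if (toy_write(&w, "Hello, ", 7) < 0) {
        goto done;
    }
    if (exact_last_write) {
        w.overallocate = 0;
    }
    if (toy_write(&w, "world!", 6) < 0) {
        goto done;
    }
    if (toy_finish(&w) < 0) {
        goto done;
    }
    count = w.reallocs;
done:
    free(w.buf);
    return count;
}
```

In this model, toy_build(1) needs one reallocation less than toy_build(0), because the final shrink is avoided. The price is that the caller must know which write is the last one, which is what made the API look too complicated.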

C API Working Group discussion

On May 24, once most of the API was stable, I opened a decision issue for the API with the C API Working Group.

On June 7, the API was approved by a majority vote.

On June 10, Marc-Andre Lemburg reopened the issue since he had concerns about the incomplete UTF-8 Decoder API and the fact that the functions were not atomic: on error, the behavior was undefined.

I modified my implementation to make all functions atomic: either the whole string is written, or nothing is written (restore the writer to its previous state).
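The "all or nothing" behavior can be illustrated with a small self-contained sketch. The FixedWriter type and the fixed_* functions below are hypothetical and unrelated to the real implementation; they only show the pattern of saving the write position and restoring it when a multi-step write fails halfway.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy writer with a fixed-size buffer; not the CPython implementation. */
typedef struct {
    char buf[16];
    size_t pos;  /* bytes written so far */
} FixedWriter;

/* Append s, or fail with -1 if it does not fit. */
int
fixed_write(FixedWriter *w, const char *s)
{
    size_t len = strlen(s);
    assert(w->pos <= sizeof(w->buf));
    if (len > sizeof(w->buf) - w->pos) {
        return -1;
    }
    memcpy(w->buf + w->pos, s, len);
    w->pos += len;
    return 0;
}

/* Write "key=value" atomically: on failure, restore the previous state. */
int
fixed_write_pair(FixedWriter *w, const char *key, const char *value)
{
    size_t saved = w->pos;
    if (fixed_write(w, key) < 0
        || fixed_write(w, "=") < 0
        || fixed_write(w, value) < 0)
    {
        w->pos = saved;  /* roll back any partial output */
        return -1;
    }
    return 0;
}

/* Demo: a failed pair write leaves the earlier content untouched. */
size_t
pair_demo(void)
{
    FixedWriter w = {{0}, 0};
    fixed_write_pair(&w, "name", "abc");            /* fits: pos becomes 8 */
    fixed_write_pair(&w, "key", "too-long-value");  /* fails, rolled back */
    return w.pos;
}
```

Without the rollback, the second call would leave "key=" in the buffer, and the caller could not tell how much had been written.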

I also proposed to extend the PyUnicodeWriter API once we had agreed on a minimal API.

On June 17, the issue was closed again and I merged my implementation.

Extensions

PyUnicodeWriter_WriteWideChar()

I added a function to write wide strings (wchar_t*) which are common on Windows.

PyUnicodeWriter_DecodeUTF8Stateful()

I added a stateful UTF-8 decoder as an answer to Marc-Andre's request. API:

int PyUnicodeWriter_DecodeUTF8Stateful(
    PyUnicodeWriter *writer,
    const char *string,
    Py_ssize_t length,
    const char *errors,
    Py_ssize_t *consumed);
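The consumed parameter is what makes the decoder usable on chunked input: when it is not NULL, a trailing incomplete UTF-8 sequence is not treated as an error, and the caller learns how many bytes were actually decoded. The following sketch models that contract in plain C, without Python. The function utf8_complete_prefix() is hypothetical and assumes otherwise well-formed UTF-8; the real decoder also validates the input and applies the errors handler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.
   Return how many leading bytes of s[0..len) can be decoded without
   cutting a multi-byte UTF-8 sequence at the end of the chunk. */
size_t
utf8_complete_prefix(const char *s, size_t len)
{
    assert(s != NULL || len == 0);
    size_t i = len;
    size_t back = 0;  /* number of bytes inspected from the end */
    while (i > 0 && back < 4) {
        i--;
        back++;
        unsigned char c = (unsigned char)s[i];
        if ((c & 0xC0) == 0x80) {
            continue;  /* continuation byte: keep looking for the lead byte */
        }
        size_t need;  /* total length announced by the lead byte */
        if (c < 0x80) {
            need = 1;
        }
        else if ((c & 0xE0) == 0xC0) {
            need = 2;
        }
        else if ((c & 0xF0) == 0xE0) {
            need = 3;
        }
        else if ((c & 0xF8) == 0xF0) {
            need = 4;
        }
        else {
            return len;  /* invalid lead byte: left to the error handler */
        }
        /* Truncated sequence: stop right before its lead byte. */
        return (back < need) ? i : len;
    }
    return len;
}
```

A caller reading from a socket or a file would decode the returned prefix, keep the remaining bytes, and prepend them to the next chunk.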

PyUnicodeWriter_WriteUCS4()

While less common, UCS-4 strings are convenient for manipulating Unicode code points. I added an API to support this string format natively.

Documentation

Read the PyUnicodeWriter API documentation.

Example of contextvar_tp_repr()

Simplified code:

static PyObject *
contextvar_tp_repr(PyContextVar *self)
{
    // "<ContextVar name='a' at 0x1234567812345678>"
    Py_ssize_t estimate = 43;
    PyUnicodeWriter *writer = PyUnicodeWriter_Create(estimate);
    if (writer == NULL) {
        return NULL;
    }

    if (PyUnicodeWriter_WriteUTF8(writer, "<ContextVar name=", 17) < 0) {
        goto error;
    }
    if (PyUnicodeWriter_WriteRepr(writer, self->var_name) < 0) {
        goto error;
    }
    if (PyUnicodeWriter_Format(writer, " at %p>", self) < 0) {
        goto error;
    }
    return PyUnicodeWriter_Finish(writer);

error:
    PyUnicodeWriter_Discard(writer);
    return NULL;
}

Conclusion

Thanks to these great discussions, the final PyUnicodeWriter API is better: more convenient, less error-prone, and maybe even a little bit more efficient!

Thanks to everyone who was involved in these discussions!

Victor Stinner