Python 3.7 UTF-8 Mode — Victor Stinner blog 3

Since Python 3.0 was released in 2008, each time a user reported an encoding issue, someone showed up and asked why Python does not "simply" always use UTF-8. Well, it's not that easy. UTF-8 is the best encoding in most cases, but it is still not the best encoding in all cases, even in 2018. The locale encoding remains the best default filesystem encoding for Python. I would say that the locale encoding is the least bad filesystem encoding.

This article tells the story of my PEP 540: Add a new UTF-8 Mode, which adds an opt-in option to "use UTF-8 everywhere". Moreover, the UTF-8 Mode is enabled by the POSIX locale: Python 3.7 now uses UTF-8 for the POSIX locale. My PEP 540 is complementary to Nick Coghlan's PEP 538.

When I started to write this article, I wrote something like: "Hey! I added a new option to use UTF-8, enjoy!". Written like that, it seems like using UTF-8 was an obvious choice and that it was really easy to write such a PEP. No. Nothing was obvious, nothing was simple.

It took me one year to design and implement my PEP 540, and to get it accepted. I wrote five articles before this one to show that PEP 540 only came after a long, painful journey, starting with Python 3.0, to choose the best Python encoding. My PEP relies on all the great work done previously.

This article is the sixth and last in a series of articles telling the history and rationale of the Python 3 Unicode model for the operating system.

Fallback to UTF-8 if getting the locale encoding fails?

May 2010, I reported bpo-8610: "Python3/POSIX: errors if file system encoding is None". I asked what the default encoding should be when getting the locale encoding fails. I proposed to fall back to UTF-8. I wrote:

UTF-8 is also an optimist choice: I bet that more and more operating systems will move to UTF-8.

Marc-Andre commented:

Ouch, that was a poor choice. In Python we have a tradition to avoid guessing, if possible. Since we cannot guarantee that the file system will indeed use UTF-8, it would have been safer to use ASCII. Not sure why this reasoning wasn't applied for the file system encoding.

In practice, Python already used UTF-8 when the filesystem encoding was set to None. I pushed the commit b744ba1d into the Python 3.2 development branch to make the default encoding (UTF-8) more obvious. But before Python 3.2 was released, I removed the fallback with my commit e474309b (Oct 2010):

initfsencoding(): get_codeset() failure is now a fatal error

Don't fallback to UTF-8 anymore to avoid mojibake. I never got any error from his function.

The utf8 option proposed for Windows

August 2016, bpo-27781: when Steve Dower was working on changing the filesystem encoding to UTF-8, I was not sure that Windows should use UTF-8 by default. I was more in favor of making the backward incompatible change an opt-in option. I wrote:

If you go in this direction, I would like to follow you for the UNIX/BSD side to make the switch portable. I was thinking about "-X utf8" which avoids to change the command line parser.

If we agree on a plan, I would like to write it down as a PEP since I expect a lot of complains and questions which I would prefer to only answer once (see for example the length of your thread on python-ideas where each people repeated the same things multiple times ;-))

I added:

I mean that python3 -X utf8 should force sys.getfilesystemencoding() to UTF-8 on UNIX/BSD, it would ignore the current locale setting.

Since Steve chose to change the default to UTF-8 on Windows, my -X utf8 option idea was ignored in this issue.

The utf8 option proposed for the POSIX locale

September 2016: Jan Niklas Hasse opened bpo-28180 about Docker images, "sys.getfilesystemencoding() should default to utf-8".

I proposed my option again:

I proposed to add -X utf8 command line option for UNIX to force utf8 encoding. Would it work for you?

Jan Niklas Hasse answered:

Unfortunately no, as this would mean I'll have to change all my python invocations in my scripts and it wouldn't work for executable files with

December 2016, I added:

Usually, when a new option is added to Python, we add a command line option (-X utf8) but also an environment variable: I propose PYTHONUTF8=1.

Use your favorite method to define the env var "system wide" in your docker containers.

Note: Technically, I'm not sure that it's possible to support -E option with PYTHONUTF8, since -E comes from the command line, and we first need to decode command line arguments with an encoding to parse these options.... Chicken-and-egg issue ;-)

Nick Coghlan wrote his PEP 538 "Coercing the C locale to a UTF-8 based locale", which was approved in May 2017 and finally implemented in June 2017.

Again, my utf8 idea was ignored in this issue.

First version of my PEP 540: Add a new UTF-8 Mode

January 2017, as a follow-up of bpo-27781 and bpo-28180, I wrote the PEP 540: Add a new UTF-8 Mode and I posted it to python-ideas for comments.

Abstract:

Add a new UTF-8 mode, opt-in option to use UTF-8 for operating system data instead of the locale encoding. Add -X utf8 command line option and PYTHONUTF8 environment variable.

Ten hours and a few messages later, I wrote a second version:

I modified my PEP: the POSIX locale now enables the UTF-8 mode.

INADA Naoki wrote:

I want UTF-8 mode is enabled by default (opt-out option) even if locale is not POSIX, like PYTHONLEGACYWINDOWSFSENCODING.

Users depends on locale know what locale is and how to configure it. They can understand difference between locale mode and UTF-8 mode and they can opt-out UTF-8 mode.

But many people lives in "UTF-8 everywhere" world, and don't know about locale.

Always ignoring the locale to always use UTF-8 would be a backward incompatible change. I wasn't brave enough to propose it: I only wanted to propose an opt-in option, except for the specific case of the POSIX locale.

Not only did people have different opinions, but most of them had strong opinions on how to handle Unicode and were not ready to compromise.

Third version of my PEP 540

One week and 59 emails later, I implemented my PEP 540 and I wrote a third version of my PEP:

I made multiple changes since the first version of my PEP:

The UTF-8 Strict mode now only uses strict for inputs and outputs: it keeps surrogateescape for operating system data. Read the "Use the strict error handler for operating system data" alternative for the rationale.

The POSIX locale now enables the UTF-8 mode. See the "Don't modify the encoding of the POSIX locale" alternative for the rationale.

Specify the priority between -X utf8, PYTHONUTF8, PYTHONIOENCODING, etc.

The PEP version 3 has a longer rationale with more example. (...)

The new thread also got 19 emails, total: 78 emails in one month. The same month, Nick Coghlan's PEP 538 was also under discussion.

Silence for one year

Because of the tone of the python-ideas threads and because I didn't know how to deal with Nick Coghlan's PEP 538, I decided to do nothing for one year (January to December 2017).

April 2017, Nick proposed INADA Naoki as the BDFL Delegate for his PEP 538 and my PEP 540. Guido agreed to delegate.

May 2017, Naoki approved Nick's PEP 538, and Nick implemented it.

PEP 540 version 3 posted to python-dev

At the end of 2017, when I looked for my contributions to Python 3.7 in the What’s New In Python 3.7 document, I didn't see any significant contribution. I wanted to propose something. Moreover, the deadline for the Python 3.7 feature freeze (first beta version) was getting close, end of January 2018: see the PEP 537: Python 3.7 Release Schedule.

December 2017, I decided to move to the next step: I sent my PEP to the python-dev mailing list.

Guido van Rossum complained about the length of the PEP:

I've been discussing this PEP offline with Victor, but he suggested we should discuss it in public instead.

I am very worried about this long and rambling PEP, and I propose that it not be accepted without a major rewrite to focus on clarity of the specification. The "Unicode just works" summary is more a wish than a proper summary of the PEP.

(...)

So I guess PEP acceptance week is over. :-(

PEP rewritten from scratch

Even if I was not fully convinced myself that my PEP was a good idea, I wanted to get an official vote, to know if my idea should be implemented or abandoned. I decided to rewrite my PEP from scratch:

PEP version 3 (before rewrite): 1,017 lines
PEP version 4 (after rewrite): 263 lines (26% of the previous version)

I reduced the rationale to the strict minimum, to explain key points of the PEP:

Locale encoding and UTF-8
Passthrough undecodable bytes: surrogateescape
Strict UTF-8 for correctness
No change by default for best backward compatibility

Reading JPEG pictures with surrogateescape

December 2017, I sent the shorter PEP version 4 to python-dev.

INADA Naoki, the BDFL-delegate, spotted a design issue:

And I have one worrying point. With UTF-8 mode, open()'s default encoding/error handler is UTF-8/surrogateescape.

(...)

And opening binary file without "b" option is very common mistake of new developers. If default error handler is surrogateescape, they lose a chance to notice their bug.

He gave a concrete example:

With PEP 538 (C.UTF-8 locale), open() uses UTF-8/strict, not UTF-8/surrogateescape.

For example, this code raises UnicodeDecodeError with PEP 538 if the file is JPEG file.
with open(fn) as f:
    f.read()

I replied:

While I'm not strongly convinced that open() error handler must be changed for surrogateescape, first I would like to make sure that it's really a very bad idea before changing it :-)

(...)

Using a JPEG image, the example is obviously wrong.

But using surrogateescape on open() has been chosen to read text files which are mostly correctly encoded to UTF-8, except a few bytes.

I'm not sure how to explain the issue. The Mercurial wiki page has a good example of this issue that they call the "Makefile problem".

Guido van Rossum finally convinced me:

You will quickly get decoding errors, and that is INADA's point. (Unless you use encoding='Latin-1'.) His worry is that the surrogateescape error handler makes it so that you won't get decoding errors, and then the failure mode is much harder to debug.

I wrote a 5th version of my PEP:

I made the following two changes to the PEP 540:

open() error handler remains "strict"

Remove the "Strict UTF8 mode" which doesn't make much sense anymore
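The concern raised in this discussion can be reproduced with a small sketch. It assumes a fake file holding only the first bytes of a JPEG/JFIF header (the constant JPEG_HEADER and the helper read_as_text() are names made up for this illustration) and passes the encoding and error handler explicitly rather than relying on any particular mode:

```python
import os
import tempfile

# First bytes of a typical JPEG/JFIF file: not valid UTF-8.
JPEG_HEADER = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"


def read_as_text(path, errors):
    """Open a file in text mode (no "b" flag) as UTF-8 and read it."""
    with open(path, encoding="utf-8", errors=errors) as f:
        return f.read()


# Write the fake JPEG to a temporary file.
fd, path = tempfile.mkstemp(suffix=".jpg")
with os.fdopen(fd, "wb") as f:
    f.write(JPEG_HEADER)

try:
    # strict: the mistake is reported immediately.
    try:
        read_as_text(path, "strict")
        strict_error = None
    except UnicodeDecodeError as exc:
        strict_error = exc

    # surrogateescape: no error, undecodable bytes become lone surrogates.
    escaped = read_as_text(path, "surrogateescape")
finally:
    os.remove(path)

print("strict:", type(strict_error).__name__)
print("surrogateescape:", ascii(escaped))
```

With strict, the missing "b" flag is noticed on the first read. With surrogateescape, the read succeeds and the bug only shows up later, which is the failure mode INADA Naoki and Guido were worried about.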

Last question on locale.getpreferredencoding()

December 2017, INADA Naoki asked:

Or locale.getpreferredencoding() returns 'UTF-8' in UTF-8 mode too?

Oh, that's a good question! I looked at the code and agreed to return UTF-8:

I checked the stdlib, and I found many places where locale.getpreferredencoding() is used to get the user preferred encoding:

builtin open(): default encoding

cgi.FieldStorage: encode the query string

encoding._alias_mbcs(): check if the requested encoding is the ANSI code page

gettext.GNUTranslations: lgettext() and lngettext() methods

xml.etree.ElementTree: ElementTree.write(encoding='unicode')

In the UTF-8 mode, I would expect that cgi, gettext and xml.etree all use the UTF-8 encoding by default. So locale.getpreferredencoding() should return UTF-8 if the UTF-8 mode is enabled.

I sent a 6th version of my PEP:

locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8 Mode.

Moreover, I also wrote a new, much clearer "Relationship with the locale coercion (PEP 538)" section replacing the "Annex: Differences between PEP 538 and PEP 540" section. Many people who were confused by the relationship between PEP 538 and PEP 540 had asked for such a section.

Finally, one year after the first PEP version, INADA Naoki approved my PEP!

First incomplete implementation

I started to work on the implementation of my PEP 540 in March 2017. Once the PEP was approved, I asked INADA Naoki for a review. He asked me to fix the command line parsing to properly handle the -X utf8 option:

And when -X utf8 option is found, we can decode from char **argv again. Since mbstowcs() doesn't guarantee round tripping, it is better than re-encode wchar_t **argv.

Properly implementing the -X utf8 option was tricky. Parsing the command line was done on wchar_t* C strings (Unicode), which requires decoding the char** argv C array of byte strings (bytes). Python starts by decoding byte strings from the locale encoding. If the utf8 option is detected, argv byte strings must be decoded again, but now from UTF-8. The problem was that the code was not designed for that, and it required refactoring a lot of code in Py_Main().
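The two-pass decoding described above can be modeled in a few lines of Python. This is only a sketch of the idea, not the C code of Py_Main(): the helpers wants_utf8() and decode_argv() are hypothetical, the option scan is simplified (it does not stop at the script name), and the locale encoding is passed in as a parameter.

```python
def wants_utf8(args):
    """Return True if the decoded arguments contain -X utf8 (simplified scan)."""
    for i, arg in enumerate(args):
        if arg in ("-Xutf8", "-Xutf8=1"):
            return True
        if arg == "-X" and i + 1 < len(args) and args[i + 1] in ("utf8", "utf8=1"):
            return True
    return False


def decode_argv(raw_argv, locale_encoding):
    """Decode argv from the locale encoding, then again from UTF-8 if asked."""
    # First pass: the locale encoding is the only thing known at startup.
    # Option names are ASCII, so they survive any ASCII-compatible encoding.
    args = [arg.decode(locale_encoding, "surrogateescape") for arg in raw_argv]
    if wants_utf8(args):
        # Second pass: start again from the original bytes, not from the
        # first decoding, since re-encoding may not round-trip.
        args = [arg.decode("utf-8", "surrogateescape") for arg in raw_argv]
    return args


raw = [b"python3", b"-X", b"utf8", b"caf\xc3\xa9.py"]
print(decode_argv(raw, "latin-1"))
print(decode_argv([b"python3", b"caf\xc3\xa9.py"], "latin-1"))
```

Decoding the original bytes a second time, instead of re-encoding the first result, mirrors the point of the review comment quoted above.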

I replied:

main() and Py_Main() are very complex. With the PEP 432, Nick Coghlan, Eric Snow and me are working on making this code better. See for example bpo-32030.

(...)

For all these reasons, I propose to merge this uncomplete PR and write a different PR for the most complex part, re-encode wchar_t* command line arguments, implement Py_UnixMain() or another even better option?

I wanted to get my code merged as soon as possible to make sure that it would get into the first Python 3.7 beta, to get a longer testing period before Python 3.7 final.

December 2017, bpo-29240, I pushed my commit 91106cd9:

PEP 540: Add a new UTF-8 Mode

Add -X utf8 command line option, PYTHONUTF8 environment variable and a new sys.flags.utf8_mode flag.

locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8 mode. As a side effect, open() now uses the UTF-8 encoding by default in this mode.
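The effect of this commit can be observed from Python itself. The sketch below assumes Python 3.7 or newer and starts child interpreters with either -X utf8 or PYTHONUTF8=1. The helper probe() and the CHECK snippet are made up for this illustration, and encoding names are normalized with codecs.lookup() because their spelling varies between Python versions.

```python
import codecs
import os
import subprocess
import sys

# Code run in the child interpreter: report the flag, the preferred encoding
# and the default encoding picked by open().
CHECK = (
    "import locale, os, sys\n"
    "with open(os.devnull) as f:\n"
    "    print(sys.flags.utf8_mode, locale.getpreferredencoding(False), f.encoding)\n"
)


def probe(args=(), env_overrides=None):
    """Run a child Python and return (utf8_mode, preferred, open() encoding)."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    env.update(env_overrides or {})
    result = subprocess.run(
        [sys.executable, *args, "-c", CHECK],
        env=env, capture_output=True, text=True, check=True,
    )
    flag, preferred, open_encoding = result.stdout.split()
    return int(flag), codecs.lookup(preferred).name, codecs.lookup(open_encoding).name


print(probe(["-X", "utf8"]))
print(probe(env_overrides={"PYTHONUTF8": "1"}))
```

Both calls are expected to report the mode as enabled and UTF-8 for both encodings, whatever the locale of the parent process.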

Split Py_Main() into subfunctions

November 2017, I created bpo-32030 to split the big Py_Main() function into smaller subfunctions. My motivation was to be able to properly implement my PEP 540.

It took me 3 months of work and 45 commits to completely clean up Py_Main() and put almost all Python configuration options into the private C _PyCoreConfig structure.

Parse again the command line when -X utf8 is used

December 2017, bpo-32030, thanks to the Py_Main() refactoring, I was able to finish the implementation of my PEP.

I pushed my commit 9454060e:

Py_Main() re-reads config if encoding changes

If the encoding change (C locale coerced or UTF-8 Mode changed), Py_Main() now reads again the configuration with the new encoding.

If the encoding changed after reading the Python configuration, cleanup the configuration and read again the configuration with the new encoding. The key feature here allowed by the refactoring is to be able to cleanup properly all the configuration.

UTF-8 Mode and the locale encoding

January 2018, while working on bpo-31900 "localeconv() should decode numeric fields from LC_NUMERIC encoding, not from LC_CTYPE encoding", I tested various combinations of locales and encodings. I found bugs with the UTF-8 mode.

When the UTF-8 mode is enabled explicitly by -X utf8, the intent is to use UTF-8 "everywhere". Right. But there are some places where the current locale encoding is really the correct encoding, like the time.strftime() function.

bpo-29240: I pushed a first fix, commit cb3ae558:

Ignore UTF-8 Mode in the time module

time.strftime() must use the current LC_CTYPE encoding, not UTF-8 if the UTF-8 mode is enabled.

I tested more cases and found... more bugs. More functions must really use the current locale encoding, rather than UTF-8 if the UTF-8 Mode is enabled.

I pushed a second fix, commit 7ed7aead:

Fix locale encodings in UTF-8 Mode

Modify locale.localeconv(), time.tzname, os.strerror() and other functions to ignore the UTF-8 Mode: always use the current locale encoding.

The second fix documented the encoding used by the public C functions Py_DecodeLocale() and Py_EncodeLocale():

Encoding, highest priority to lowest priority:

UTF-8 on macOS and Android;

UTF-8 if the Python UTF-8 mode is enabled;

ASCII if the LC_CTYPE locale is "C", nl_langinfo(CODESET) returns the ASCII encoding (or an alias), and mbstowcs() and wcstombs() functions uses the ISO-8859-1 encoding.

the current locale encoding.

The fix was complex to write because I had to extend Py_DecodeLocale() and Py_EncodeLocale() to support internally the strict error handler. I also extended the API to report an error message (called "reason") on failure.

For example, Py_DecodeLocale() has the prototype:

wchar_t*
Py_DecodeLocale(const char* arg, size_t *wlen)

whereas the new extended and more generic _Py_DecodeLocaleEx() has a much more complex prototype:

int
_Py_DecodeLocaleEx(const char* arg, wchar_t **wstr, size_t *wlen,
                   const char **reason,
                   int current_locale, int surrogateescape)

To decode, there are two main use cases:

(FILENAME) Use UTF-8 if the UTF-8 Mode is enabled, or the locale encoding otherwise. See the Py_DecodeLocale() documentation for the exact encoding used: the truth is more complex.
(LOCALE) Always use the current locale encoding

(FILENAME) examples:

Py_DecodeLocale(), PyUnicode_DecodeFSDefaultAndSize(): use the surrogateescape error handler
os.fsdecode()
os.listdir()
os.environ
sys.argv
etc.

(LOCALE) examples:

PyUnicode_DecodeLocale(): the error handler is passed as an argument and must be strict or surrogateescape
time.strftime()
locale.localeconv()
time.tzname
os.strerror()
readline module: internal decode() function
etc.
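The (FILENAME) use case can be tried from Python with os.fsdecode() and os.fsencode(). The sketch below assumes a POSIX platform, where the filesystem error handler is surrogateescape (Windows uses a different error handler). The file name raw_name and the helper fs_roundtrip() are made up for this illustration.

```python
import os
import sys

# A file name containing a byte that is not valid UTF-8.
raw_name = b"report-\xff.txt"


def fs_roundtrip(raw):
    """Decode bytes the way file names are decoded, then encode them back."""
    decoded = os.fsdecode(raw)
    return decoded, os.fsencode(decoded)


decoded, encoded = fs_roundtrip(raw_name)
print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
print(ascii(decoded), encoded == raw_name)

# What the same decoding gives when the encoding is UTF-8, as in the UTF-8 Mode:
print(ascii(raw_name.decode("utf-8", "surrogateescape")))
```

Thanks to surrogateescape, the undecodable byte is preserved and the original bytes are recovered on encoding, so a file with such a name can still be listed and opened.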

Summary of PEP 540 history

Version 1: first version sent to python-ideas
Version 2: the POSIX locale now enables the UTF-8 mode
Version 3: the UTF-8 Strict mode now only uses the strict error handler for inputs and outputs
Version 4: PEP rewritten from scratch to be shorter
Version 5: open() error handler remains strict, and the "Strict UTF8 mode" has been removed
Version 6: locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8 Mode.

Abstract of the final approved PEP:

Add a new "UTF-8 Mode" to enhance Python's use of UTF-8. When UTF-8 Mode is active, Python will:

use the utf-8 encoding, regardless of the locale currently set by the current platform, and

change the stdin and stdout error handlers to surrogateescape.

This mode is off by default, but is automatically activated when using the "POSIX" locale.

Add the -X utf8 command line option and PYTHONUTF8 environment variable to control UTF-8 Mode.
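The second point of the abstract, the stdin and stdout error handlers, can be checked with a short sketch. It assumes Python 3.7 or newer and clears PYTHONIOENCODING, which would otherwise take over. The helper stdio_error_handlers() is a name made up for this illustration.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: report the stdin and stdout error handlers.
CHECK = "import sys; print(sys.stdin.errors, sys.stdout.errors)"


def stdio_error_handlers(args=()):
    """Run a child Python and return its (stdin, stdout) error handlers."""
    env = dict(os.environ)
    for name in ("PYTHONUTF8", "PYTHONIOENCODING"):
        env.pop(name, None)  # avoid settings that override the mode
    result = subprocess.run(
        [sys.executable, *args, "-c", CHECK],
        env=env, stdin=subprocess.DEVNULL,
        capture_output=True, text=True, check=True,
    )
    return tuple(result.stdout.split())


print(stdio_error_handlers(["-X", "utf8"]))
```

With -X utf8, both handlers are expected to be surrogateescape, so undecodable bytes read from stdin can be written back unchanged to stdout.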

Conclusion

It's now time for a well deserved nap... until the next major Unicode issue in Python.

(I love tigers: my favorite animals!)