Published: Tue 27 March 2018
Since Python 3.0 was released in 2008, each time a user reported an encoding
issue, someone showed up and asked why Python does not "simply" always use UTF-8.
Well, it's not that easy.
UTF-8 is the best encoding in most cases, but it is
still not the best encoding in all cases, even in 2018. The locale encoding
remains the best default filesystem encoding for Python. I would say that the
locale encoding is the least bad filesystem encoding.
This article tells the story of my
PEP 540: Add a new UTF-8 Mode, which adds an opt-in option to
"use UTF-8 everywhere". Moreover, the UTF-8 Mode is enabled by the POSIX
locale: Python 3.7 now uses UTF-8 for the POSIX locale. My
PEP 540 is complementary to Nick Coghlan's PEP 538.
When I started to write this article, I wrote something like: "Hey! I added a
new option to use UTF-8, enjoy!". Written like that, it seems like using UTF-8
was an obvious choice and that it was really easy to write such a PEP. No.
Nothing was obvious, nothing was simple.
It took me one year to design and implement my PEP 540, and to get it accepted.
I wrote five articles before this one to show that the PEP 540 only came after
a long painful journey, starting with Python 3.0, to choose the best Python
encoding. My PEP relies on all the great work done previously.
This article is the sixth and last in a series of articles telling the
history and rationale of the Python 3 Unicode model for the operating system.
Fallback to UTF-8 if getting the locale encoding fails?
In May 2010, I reported
"Python3/POSIX: errors if file system encoding is None". I asked what should
be the default encoding when getting the locale encoding fails. I proposed
to fallback to UTF-8. I wrote:
UTF-8 is also an optimist choice: I bet that more and more operating
systems will move to UTF-8.
Ouch, that was a poor choice.
In Python we have a tradition to avoid
guessing, if possible. Since we cannot guarantee that the file system
will indeed use UTF-8, it would have been safer to use ASCII. Not sure why
this reasoning wasn't applied for the file system encoding.
In practice, Python already used UTF-8 when the filesystem encoding was set to
None. I pushed the commit b744ba1d
into the Python 3.2 development branch to make the default encoding (UTF-8)
more obvious. But before Python 3.2 was released, I removed the fallback with
my commit e474309b:
initfsencoding(): get_codeset() failure is now a fatal error
Don't fallback to UTF-8 anymore to avoid mojibake. I never got any error
from his function.
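A small sketch of why silently guessing UTF-8 risks mojibake: the same file name bytes give different text, or an error, depending on the encoding that is assumed. The file name and the choice of ISO-8859-1 are illustrative assumptions, not taken from the bug report.

```python
# Illustrative only: a file name created under an ISO-8859-1 locale.
name = "caf\u00e9.txt"
raw = name.encode("iso-8859-1")            # b'caf\xe9.txt'

# Guessing UTF-8 for these bytes fails with the strict error handler...
try:
    raw.decode("utf-8")
    guessed_ok = True
except UnicodeDecodeError:
    guessed_ok = False

# ...while UTF-8 bytes decoded under an ISO-8859-1 assumption never fail,
# they just turn into mojibake.
mojibake = name.encode("utf-8").decode("iso-8859-1")

print(guessed_ok)        # False
print(ascii(mojibake))   # 'caf\xc3\xa9.txt'
```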
The utf8 option proposed for Windows
bpo-27781: when Steve
Dower was working on changing the filesystem encoding to UTF-8, I was not sure that Windows should use UTF-8
by default. I was more in favor of making the backward incompatible change an
opt-in option. I wrote:
If you go in this direction, I would like to follow you for the UNIX/BSD
side to make the switch portable. I was thinking about "-X utf8" which
avoids to change the command line parser.
If we agree on a plan,
I would like to write it down as a PEP since I
expect a lot of complains and questions which I would prefer to only answer
once (see for example the length of your thread on python-ideas where
each people repeated the same things multiple times ;-))
I mean that python3 -X utf8 should force
sys.getfilesystemencoding() to UTF-8 on UNIX/BSD, it would ignore the
current locale setting.
Since Steve chose to change the default to UTF-8 on Windows, my
-X utf8 option idea was ignored in this issue.
The utf8 option proposed for the POSIX locale
Jan Niklas Hasse opened bpo-28180 about Docker images,
"sys.getfilesystemencoding() should default to utf-8".
I proposed my option again:
I proposed to add -X utf8 command line option for UNIX to force utf8
encoding. Would it work for you?
Jan Niklas Hasse answered:
Unfortunately no, as this would mean I'll have to change all my python
invocations in my scripts and it wouldn't work for executable files with
I replied:
Usually, when a new option is added to Python, we add a command line option
(-X utf8) but also an environment variable:
I propose PYTHONUTF8=1.
Use your favorite method to define the env var "system wide" in your docker
Note: Technically, I'm not sure that it's possible to support -E option
with PYTHONUTF8, since -E comes from the command line, and we first need to
decode command line arguments with an encoding to parse these options....
Chicken-and-egg issue ;-)
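As a rough illustration of the two switches discussed above, the sketch below starts a child interpreter once with -X utf8 and once with PYTHONUTF8=1, then reads back sys.getfilesystemencoding(). It assumes Python 3.7 or newer; the helper name fs_encoding() is made up for this example.

```python
import os
import subprocess
import sys

CODE = "import sys; print(sys.getfilesystemencoding())"

def fs_encoding(extra_args=(), extra_env=None):
    """Run a child interpreter and return its filesystem encoding (lowercase)."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)        # start from a known state
    env.update(extra_env or {})
    cmd = [sys.executable, *extra_args, "-c", CODE]
    proc = subprocess.run(cmd, env=env, check=True,
                          stdout=subprocess.PIPE, universal_newlines=True)
    return proc.stdout.strip().lower()

via_option = fs_encoding(extra_args=["-X", "utf8"])
via_env = fs_encoding(extra_env={"PYTHONUTF8": "1"})
print(via_option, via_env)
```

The environment variable is what makes the option usable "system wide", for example in a container image, without touching every python invocation.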
Nick Coghlan wrote his PEP 538 "Coercing the C locale to a UTF-8 based
locale" which was approved in May 2017
and finally implemented in June 2017.
Again, my utf8 idea was ignored in this issue.
First version of my PEP 540: Add a new UTF-8 Mode
In January 2017, as a follow-up to
bpo-27781 and bpo-28180, I wrote the PEP 540: Add a new UTF-8
Mode and I posted it to
python-ideas for comments.
Add a new UTF-8 mode, opt-in option to use UTF-8 for operating system
data instead of the locale encoding. Add -X utf8 command line option
and PYTHONUTF8 environment variable.
Ten hours and a few messages later, I
wrote a second version:
I modified my PEP:
the POSIX locale now enables the UTF-8 mode.
INADA Naoki wrote:
I want UTF-8 mode is
enabled by default (opt-out option) even if locale
is not POSIX, like PYTHONLEGACYWINDOWSFSENCODING.
Users depends on locale know what locale is and how to configure it. They
can understand difference between locale mode and UTF-8 mode and they can
opt-out UTF-8 mode.
But many people lives in "UTF-8 everywhere" world, and don't know about
Always ignoring the locale to always use UTF-8 would be a backward
incompatible change. I wasn't brave enough to propose it. I only
wanted to propose an opt-in option, except for the specific case of the POSIX
locale.
Not only did people have different opinions, but most people had strong opinions on
how to handle Unicode and were not ready to compromise.
Third version of my PEP 540
One week and 59 emails later, I
implemented my PEP 540 and I wrote a third version of my PEP:
I made multiple changes since the first version of my PEP:
UTF-8 Strict mode now only uses strict for inputs and outputs:
it keeps surrogateescape for operating system data. Read the "Use the
strict error handler for operating system data" alternative for the
rationale.
The POSIX locale now enables the UTF-8 mode. See the "Don't modify
the encoding of the POSIX locale" alternative for the rationale.
Specify the priority between -X utf8, PYTHONUTF8, PYTHONIOENCODING, etc.
The PEP version 3 has a longer rationale with more example. (...)
The new thread also got 19 emails, for a total of
78 emails in one month. The same
month, Nick Coghlan's PEP 538 was also under discussion.
Silence during one year
Because of the tone of the python-ideas threads and because I didn't know how
to deal with Nick Coghlan's PEP 538,
I decided to do nothing for one
year (January to December 2017).
In April 2017, Nick proposed
INADA Naoki as the BDFL-Delegate for his PEP 538 and my PEP 540. Guido
agreed to delegate.
In May 2017, Naoki approved Nick's PEP 538, and Nick implemented it.
PEP 540 version 3 posted to python-dev
At the end of 2017, when I looked at my contributions in Python 3.7 in the
What’s New In Python 3.7
document, I didn't see any significant contribution. I wanted to propose
something. Moreover, the deadline for the Python 3.7 feature freeze (first beta
version) was getting close, end of January 2018: see the PEP 537: Python 3.7
Release Schedule.
In December 2017, I decided to move to the next step:
I sent my PEP to the
python-dev mailing list.
Guido van Rossum
complained about the length of the PEP:
I've been discussing this PEP offline with Victor, but he suggested we
should discuss it in public instead.
I am very worried about this long and rambling PEP, and I propose that it
not be accepted without a major rewrite to focus on clarity of the
specification. The "Unicode just works" summary is more a wish than a
proper summary of the PEP.
So I guess PEP acceptance week is over. :-(
PEP rewritten from scratch
I was not fully convinced myself that my PEP was a good idea. I
wanted to get an official vote, to know if my idea should be implemented or
abandoned. I decided to rewrite my PEP from scratch:
I reduced the rationale to the strict minimum, to explain the
key points of the PEP:
Locale encoding and UTF-8
Passthrough undecodable bytes: surrogateescape
Strict UTF-8 for correctness
No change by default for best backward compatibility
Reading JPEG pictures with surrogateescape
In December 2017, I sent the
shorter PEP version 4 to python-dev.
INADA Naoki, the BDFL-delegate,
spotted a design issue:
And I have one worrying point. With UTF-8 mode, open()'s default
encoding/error handler is UTF-8/surrogateescape.
opening binary file without "b" option is very common mistake of
new developers. If default error handler is surrogateescape, they lose a
chance to notice their bug.
He gave a concrete example:
With PEP 538 (C.UTF-8 locale), open() uses UTF-8/strict, not
UTF-8/surrogateescape.
For example, this code raises UnicodeDecodeError with PEP 538 if the
file is JPEG file.
with open(fn) as f:
    f.read()
I replied:
While I'm not strongly convinced that open() error handler must be
changed for surrogateescape, first I would like to make sure that
it's really a very bad idea before changing it :-)
Using a JPEG image, the example is obviously wrong.
But using surrogateescape on open() was chosen to read text files
which are mostly correctly encoded to UTF-8, except for a few bytes.
I'm not sure how to explain the issue. The Mercurial wiki page has a good
example of this issue.
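A minimal sketch of that use case, with made-up file content: a text that is valid UTF-8 except for one stray ISO-8859-1 byte. With the strict error handler the whole text is rejected, while surrogateescape lets a program edit one line and write the undecodable byte back unchanged.

```python
# Mostly valid UTF-8, plus one stray ISO-8859-1 byte (0xE9) in a comment.
data = b"all: build\n# caf\xe9\nbuild:\n\techo ok\n"

try:
    data.decode("utf-8")                   # strict is the default error handler
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape maps the undecodable byte to a lone surrogate (U+DCE9)...
text = data.decode("utf-8", "surrogateescape")
edited = text.replace("echo ok", "echo done")

# ...and encoding with the same error handler restores the original byte.
result = edited.encode("utf-8", "surrogateescape")

print(strict_failed)     # True
print(result)
```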
Guido van Rossum finally convinced me:
You will quickly get decoding errors, and that is INADA's point.
(Unless you use encoding='Latin-1'.) His worry is that the
surrogateescape error handler makes it so that you won't get decoding
errors, and then the failure mode is much harder to debug.
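The failure modes can be compared with a short sketch. It writes the first bytes of a JPEG file to a temporary file and opens it in text mode three ways; the helper read_as_text() and the temporary file are assumptions made for the example.

```python
import os
import tempfile

# First bytes of a JPEG file: SOI marker, then the start of a JFIF APP0 segment.
JPEG_HEADER = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

def read_as_text(path, encoding="utf-8", errors="strict"):
    """Open a file in text mode (the beginner mistake: no "b") and read it."""
    with open(path, encoding=encoding, errors=errors) as f:
        return f.read()

fd, path = tempfile.mkstemp(suffix=".jpg")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(JPEG_HEADER)

    # UTF-8/strict: the bug is reported immediately.
    try:
        read_as_text(path)
        strict_raised = False
    except UnicodeDecodeError:
        strict_raised = True

    # UTF-8/surrogateescape: no error, the text contains lone surrogates.
    escaped = read_as_text(path, errors="surrogateescape")

    # Latin-1: every byte maps to a character, so decoding can never fail.
    latin1 = read_as_text(path, encoding="latin-1")
finally:
    os.remove(path)

print(strict_raised)     # True
print(ascii(escaped))
print(ascii(latin1))
```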
I wrote a 5th version of my PEP:
I made the following two changes to the PEP 540:
open() error handler remains "strict"
Remove the "Strict UTF8 mode" which doesn't make much sense anymore
Last question on locale.getpreferredencoding()
INADA Naoki asked:
locale.getpreferredencoding() returns 'UTF-8' in UTF-8 mode too?
Oh, that's a good question! I
looked at the code and
agreed to return UTF-8:
I checked the stdlib, and I found many places where
locale.getpreferredencoding() is used to get the user preferred
encoding:
open(): default encoding
cgi.FieldStorage: encode the query string
encoding._alias_mbcs(): check if the requested encoding is the ANSI
gettext.GNUTranslations: lgettext() and lngettext() methods
In the UTF-8 mode, I would expect that cgi, gettext and xml.etree all use
the UTF-8 encoding by default. So locale.getpreferredencoding() should
return UTF-8 if the UTF-8 mode is enabled.
I sent a 6th version of my PEP:
locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8 Mode.
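A quick way to observe this behavior, assuming Python 3.7 or newer: run a child interpreter with -X utf8 and print both locale.getpreferredencoding() and the default encoding picked by open(). The names are compared in lowercase because the exact spelling ('UTF-8' or 'utf-8') differs between Python versions.

```python
import subprocess
import sys

CODE = (
    "import locale, os\n"
    "print(locale.getpreferredencoding(False))\n"
    "with open(os.devnull) as f:\n"
    "    print(f.encoding)\n"
)

proc = subprocess.run([sys.executable, "-X", "utf8", "-c", CODE],
                      check=True, stdout=subprocess.PIPE,
                      universal_newlines=True)
preferred, open_default = proc.stdout.lower().split()
print(preferred, open_default)
```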
Moreover, I also wrote a new, much clearer "Relationship with the locale
coercion (PEP 538)" section replacing the "Annex: Differences between
PEP 538 and PEP 540" section. The new section was requested by many people who were
confused by the relationship between PEP 538 and PEP 540.
Finally, one year after the first PEP version, INADA Naoki
approved my PEP!
First incomplete implementation
I started to work on the implementation of my PEP 540 in March 2017. Once the
PEP was approved, I asked INADA Naoki for a review.
He asked me to fix the command line parsing to handle the -X utf8 option
properly:
When -X utf8 option is found, we can decode from char **argv
again. Since mbstowcs() doesn't guarantee round tripping, it is better
than re-encode wchar_t **argv.
Implementing the -X utf8 option properly was tricky. Parsing the command line
was done on wchar_t* C strings (Unicode), which requires decoding the
char** argv C array of byte strings (bytes). Python starts by decoding byte
strings from the locale encoding. If the utf8 option is detected, argv byte
strings must be decoded again, but now from UTF-8. The problem was that the
code was not designed for that, and it required refactoring a lot of code in
Py_Main().
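The "decode, detect the option, decode again" idea can be sketched in Python. This is not the C implementation: decode_argv() is a hypothetical helper, the locale encoding is assumed to be ASCII, and the option detection is deliberately naive (it does not stop at the script name).

```python
def decode_argv(raw_argv, locale_encoding="ascii"):
    """Decode byte-string arguments, switching to UTF-8 if -X utf8 is found."""
    def decode_all(encoding):
        return [arg.decode(encoding, "surrogateescape") for arg in raw_argv]

    # First pass: decode from the locale encoding, as Python does at startup.
    args = decode_all(locale_encoding)

    # Naive detection of "-X utf8" (or "-Xutf8") among the decoded arguments.
    pairs = zip(args, args[1:])
    wants_utf8 = "-Xutf8" in args or any(
        opt == "-X" and value == "utf8" for opt, value in pairs)

    # Second pass: decode the original bytes again, now from UTF-8.
    if wants_utf8:
        args = decode_all("utf-8")
    return args

raw = [b"python3", b"-X", b"utf8", b"caf\xc3\xa9.py"]
print(ascii(decode_argv(raw)))   # ['python3', '-X', 'utf8', 'caf\xe9.py']
```

Decoding the original bytes a second time, rather than re-encoding the already decoded strings, follows the review comment quoted above: in C, mbstowcs() does not guarantee a lossless round trip.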
I replied:
main() and Py_Main() are very complex. With the PEP 432, Nick Coghlan, Eric
Snow and me are working on making this code better. See for example
For all these reasons,
I propose to merge this uncomplete PR and write a
different PR for the most complex part, re-encode wchar_t* command line
arguments, implement Py_UnixMain() or another even better option?
I wanted to get my code merged as soon as possible to make sure that it would
get into the first Python 3.7 beta, to get a longer testing period before
Python 3.7 final.
bpo-29240: I pushed my commit:
PEP 540: Add a new UTF-8 Mode
Add -X utf8 command line option, PYTHONUTF8 environment variable
and a new sys.flags.utf8_mode flag.
locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8
mode. As a side effect, open() now uses the UTF-8 encoding by
default in this mode.
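The new flag can be checked from Python code. The sketch below, assuming Python 3.7 or newer, reads sys.flags.utf8_mode in a child interpreter started with -X utf8 and with -X utf8=0 (which explicitly disables the mode); child_utf8_mode() is a name made up for the example.

```python
import subprocess
import sys

def child_utf8_mode(*options):
    """Return sys.flags.utf8_mode as seen by a child interpreter."""
    cmd = [sys.executable, *options, "-c",
           "import sys; print(sys.flags.utf8_mode)"]
    proc = subprocess.run(cmd, check=True, stdout=subprocess.PIPE,
                          universal_newlines=True)
    return int(proc.stdout)

enabled = child_utf8_mode("-X", "utf8")
disabled = child_utf8_mode("-X", "utf8=0")
print(enabled, disabled)   # 1 0
```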
Split Py_Main() into subfunctions
In November 2017, I created an issue to
split the big Py_Main() function into smaller subfunctions. My motivation
was to be able to properly implement my PEP 540.
It took me
3 months of work and 45 commits to completely clean up
Py_Main() and put almost all Python configuration options into the private
C _PyCoreConfig structure.
Parse again the command line when -X utf8 is used
bpo-32030: thanks to
the Py_Main() refactoring, I was able to finish the implementation of my
PEP 540. I pushed my commit:
Py_Main() re-reads config if encoding changes
If the encoding change (C locale coerced or UTF-8 Mode changed),
Py_Main() now reads again the configuration with the new encoding.
If the encoding changed after reading the Python configuration, the
configuration is cleaned up and read again with the new encoding. The
key feature enabled by the refactoring is the ability to properly clean up
all the configuration.
UTF-8 Mode and the locale encoding
In January 2018, while working on
bpo-31900 "localeconv() should decode numeric
fields from LC_NUMERIC encoding, not from LC_CTYPE encoding", I tested various
combinations of locales and encodings. I found bugs with the UTF-8 mode.
When the UTF-8 mode is enabled explicitly by -X utf8, the intent is to use
UTF-8 "everywhere". Right. But there are some places where the current
locale encoding is really the correct encoding, like the time.strftime()
function.
bpo-29240: I pushed a first fix,
Ignore UTF-8 Mode in the
time.strftime() must use the current LC_CTYPE encoding, not UTF-8
if the UTF-8 mode is enabled.
I tested more cases and found...
more bugs. More functions must really use the
current locale encoding, rather than UTF-8 if the UTF-8 Mode is enabled.
I pushed a second fix,
Fix locale encodings in UTF-8 Mode
Modify locale.localeconv(), time.tzname, os.strerror() and
other functions to ignore the UTF-8 Mode: always use the current locale
encoding.
The second fix documented the encoding used by the public C functions
Py_DecodeLocale() and Py_EncodeLocale():
Encoding, highest priority to lowest priority:
UTF-8 on macOS and Android;
UTF-8 if the Python UTF-8 mode is enabled;
ASCII if the LC_CTYPE locale is "C",
nl_langinfo(CODESET) returns the ASCII encoding (or an alias),
and mbstowcs() and wcstombs() functions uses the
ISO-8859-1 encoding.
the current locale encoding.
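The documented priority order (UTF-8 on macOS and Android, UTF-8 in UTF-8 Mode, ASCII for the "C" locale special case, otherwise the current locale encoding) can be written as a small decision function. This is only a model of the documentation, not the C code: select_encoding() and all of its parameters are hypothetical, and the caller is assumed to provide the platform and libc facts.

```python
import codecs

def is_ascii(codeset):
    """True if the codeset name is ASCII or one of its aliases."""
    try:
        return codecs.lookup(codeset).name == "ascii"
    except LookupError:
        return False

def select_encoding(platform, utf8_mode, lc_ctype, codeset,
                    libc_uses_latin1, locale_encoding):
    """Pick an encoding, from the highest priority rule to the lowest."""
    if platform in ("darwin", "android"):
        return "utf-8"
    if utf8_mode:
        return "utf-8"
    # The "C" locale announces ASCII, but the libc really decodes ISO-8859-1.
    if lc_ctype == "C" and is_ascii(codeset) and libc_uses_latin1:
        return "ascii"
    return locale_encoding

print(select_encoding("linux", True, "fr_FR.ISO-8859-1", "ISO-8859-1",
                      False, "iso-8859-1"))   # utf-8
```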
The fix was complex to write because I had to extend Py_DecodeLocale() and
Py_EncodeLocale() to internally support the
strict error handler. I also
extended the API to report an error message (called "reason") on failure.
Py_DecodeLocale() has the prototype:
wchar_t* Py_DecodeLocale(const char* arg, size_t *wlen)
whereas the new extended and more generic
_Py_DecodeLocaleEx() has a much
more complex prototype:
int _Py_DecodeLocaleEx(const char* arg, wchar_t **wstr, size_t *wlen,
                       const char **reason,
                       int current_locale, int surrogateescape)
To decode, there are two main use cases:
(FILENAME) Use UTF-8 if the UTF-8 Mode is enabled, or the locale encoding
otherwise. See the Py_DecodeLocale() documentation for the exact used
encoding, the truth is more complex.
(LOCALE) Always use the current locale encoding
Py_DecodeLocale(), PyUnicode_DecodeFSDefaultAndSize(): use the
surrogateescape error handler
PyUnicode_DecodeLocale(): the error handler is passed as an argument and
must be strict or surrogateescape
readline module: internal decode() function etc.
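From Python code, the two use cases look roughly like this on Unix. os.fsdecode() and os.fsencode() follow the filesystem encoding and its error handler (FILENAME), while locale.nl_langinfo(locale.CODESET) reports the current locale encoding regardless of the UTF-8 Mode (LOCALE). The file name is made up, and the round trip shown is only expected on POSIX systems, where the error handler is surrogateescape.

```python
import locale
import os
import sys

raw = b"caf\xe9.txt"     # made-up file name, not valid UTF-8

# FILENAME use case: undecodable bytes survive a decode/encode round trip.
name = os.fsdecode(raw)
round_trip = os.fsencode(name)

# LOCALE use case: the encoding of the current LC_CTYPE locale (Unix only).
if hasattr(locale, "nl_langinfo"):
    current_locale_encoding = locale.nl_langinfo(locale.CODESET)
else:
    current_locale_encoding = None

print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
print(current_locale_encoding)
print(round_trip == raw)
```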
Summary of PEP 540 history
Version 1: first version sent to python-ideas
Version 2: the POSIX locale now enables the UTF-8 mode
Version 3: the UTF-8 Strict mode now only uses the strict error handler
for inputs and outputs
Version 4: PEP rewritten from scratch to be shorter
Version 5: open() error handler remains strict, and the "Strict UTF8
mode" has been removed
Version 6: locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8
Mode
Abstract of the final approved PEP:
Add a new "UTF-8 Mode" to enhance Python's use of UTF-8. When UTF-8 Mode
is active, Python will:
use the utf-8 encoding, irregardless of the locale currently set by
the current platform, and
change the stdin and stdout error handlers to surrogateescape.
This mode is off by default, but is automatically activated when using
the "POSIX" locale.
Add the -X utf8 command line option and PYTHONUTF8 environment
variable to control UTF-8 Mode.
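The effect on the standard streams can be checked with a short sketch, assuming Python 3.7 or newer. PYTHONIOENCODING is removed from the child environment because it would override the error handlers.

```python
import os
import subprocess
import sys

CODE = "import sys; print(sys.stdin.errors, sys.stdout.errors)"

env = dict(os.environ)
env.pop("PYTHONIOENCODING", None)   # would take priority over the UTF-8 Mode

proc = subprocess.run([sys.executable, "-X", "utf8", "-c", CODE],
                      env=env, check=True, stdin=subprocess.DEVNULL,
                      stdout=subprocess.PIPE, universal_newlines=True)
stdin_errors, stdout_errors = proc.stdout.split()
print(stdin_errors, stdout_errors)
```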
It's now time for a well-deserved nap... until the next major Unicode issue in Python.
(I love tigers: my favorite animals!)