In my previous article, I wrote that os.listdir(str) ignored silently undecodable filenames in Python 3.0 and that lying on the real content of a directory looks like a very bad idea.
Martin v. Löwis found a very smart solution to this problem: the surrogateescape error handler.
This article is the second in a series of articles telling the history and rationale of the Python 3 Unicode model for the operating system:
First attempt to propose the solution
I'd like to propose yet another approach: make sure that conversion according to the file system encoding always succeeds. If an unconvertable byte is detected, map it into some private-use character. To reduce the chance of conflict with other people's private-use characters, we can use some of the plane 15 private-use characters, e.g. map byte 0xPQ to U+F30PQ (in two-byte Unicode mode, this would result in a surrogate pair).
This would make all file names accessible to all text processing (including glob and friends); UI display would typically either report an encoding error, or arrange for some replacement glyph to be shown.
There are certain variations of the approach possible, in case there is objection to a specific detail.
He amended this proposal:
James Knight points out that UTF-8b can be used to give unambiguous round-tripping of characters in a UTF-8 locale. So I would like to amend my previous proposal:
- for a non-UTF-8 encoding, use private-use characters for roundtripping
- if the locale's charset is UTF-8, use UTF-8b as the file system encoding.
But Martin's smart idea was lost in the middle of long discussion.
File names, environment variables, and command line arguments are defined as being character data in POSIX; the C APIs however allow passing arbitrary bytes - whether these conform to a certain encoding or not.
This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in such a way that allows recreation of the original byte string.
The surrogateescape encoding is based on Markus Kuhn's idea that he called UTF-8b. Undecodable bytes in range 0x80-0xff are mapped as Unicode surrogate characters: range U+DC80 - U+DCFF.
>>> b'nonascii\xff'.decode('ascii') UnicodeDecodeError: 'ascii' codec can't decode byte 0xff (...) >>> b'nonascii\xff'.decode('ascii', 'surrogateescape') 'nonascii\udcff' >>> 'nonascii\udcff'.encode('ascii', 'surrogateescape') b'nonascii\xff'
Using the surrogateescape error handler, decoding cannot fail. For example, os.listdir(str) no longer ignores silently undecodable filenames, since all filenames became decodable with any encoding. Moreover, encoding filenames with surrogateescape returns the original bytes unchanged.
The PEP was accepted by Guido van Rossum in less than one week!
May 2009, Martin v. Löwis opened the bpo-5915 to get a review on his implementation.
Two days later, after Benjamin Peterson and Antoine Pitrou reviews, Martin pushed the commit 011e8420:
Issue #5915: Implement PEP 383, Non-decodable Bytes in System Character Interfaces.
Five days later, Martin renamed his "utf8b" error handler to its final name surrogateescape, commit 43c57785:
Rename utf8b error handler to surrogateescape.
Python 3.1 will be the first release getting the surrogateescape error handler.
In Python 3.0, os.listdir(str) ignored silently undecodable filenames which was not ideal.
Martin v. Löwis proposed to apply Markus Kuhn's idea called UTF-8b in Python as a new surrogateescape error handler.
Martin's PEP was approved in less than one week and implemented a few days later.
Using the surrogateescape error handler, decoding cannot fail: os.listdir(str) no longer ignores silently undecodable filenames. Moreover, encoding filenames with surrogateescape returns the original bytes unchanged.
The surrogateescape error handler fixed a lot of old and very complex Unicode issues on Unix. It is still widely used in Python 3.6 to not annoy users with Unicode errors.