gh-67022: Document bytes/str inconsistency in email.header.decode_header() and add .decode_header_to_string() as a sane alternative #92900

dlenski · 2022-05-17T20:52:48Z

This function's possible return types have been surprising and error-prone
for the entirety of its Python 3.x history. It can return either:

typing.List[typing.Tuple[str, None]], of length exactly 1
or typing.List[typing.Tuple[bytes, typing.Optional[str]]]

This function can't be rewritten to be more consistent in a backwards-compatible way, because some users of this function depend on the existing return type(s).
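The two return shapes described above can be observed directly with the existing standard-library function. The snippet below is a minimal sketch; `normalize_parts` is a hypothetical helper (not part of `email.header` or of this PR) showing how a caller might cope with both shapes today.

```python
from email.header import decode_header

# No RFC 2047 encoded-words: a single (str, None) tuple comes back.
plain = decode_header('hello bar')

# At least one encoded-word: every tuple carries bytes instead.
mixed = decode_header('hello =?utf-8?B?ZsOzbw==?= bar')


def normalize_parts(parts):
    """Hypothetical helper: coerce every first member to bytes.

    decode_header() itself encodes unencoded fragments with
    'raw-unicode-escape', so the same codec is assumed here.
    """
    return [
        (text.encode('raw-unicode-escape') if isinstance(text, str) else text,
         charset)
        for text, charset in parts
    ]


print(type(plain[0][0]).__name__)   # str
print(type(mixed[0][0]).__name__)   # bytes
print(normalize_parts(plain))
```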

This PR addresses the inconsistency as suggested by @JelleZijlstra in #67022 (comment):

we should document the surprising return type at https://docs.python.org/3.10/library/email.header.html.
[create] a new function with a sane return type.

The "sane", Pythonic way to decode an email/MIME message header value is simply to convert the whole header to a str; the details of exactly which parts of that header were encoded in which charsets are not relevant to most users. Fortunately, the email.header module already contains a mechanism to do this, via the __str__ method of email.header.Header, so we can simply create a wrapper function to guide users in the right direction.
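For readers on a Python version without the proposed function, roughly the same result can be obtained from the existing public API. This is a sketch only: `header_to_str` is a hypothetical name, and it is an assumption that the wrapper in this PR behaves equivalently.

```python
from email.header import decode_header, make_header


def header_to_str(value):
    """Hypothetical stand-in for the proposed decode_header_to_string().

    make_header() accepts both shapes returned by decode_header()
    (str or bytes first members), and str() on the resulting Header
    joins the decoded chunks into one string.
    """
    return str(make_header(decode_header(value)))


print(header_to_str('hello =?utf-8?B?ZsOzbw==?= bar'))
print(header_to_str('=?iso-8859-1?q?hello_f=F3o_bar?='))
```

Routing through `make_header()` means the caller never has to inspect the per-chunk charsets or branch on `bytes` versus `str`.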

Example of the old/inconsistent (decode_header) vs. new/sane (decode_header_to_string) functions:

>>> from email.header import decode_header, decode_header_to_string
>>>
>>> # Do most users care about this distinction in (sub)encodings? I think not.
>>> print(decode_header('hello =?utf-8?B?ZsOzbw==?= bar'))
[(b'hello ', None), (b'f\xc3\xb3o', 'utf-8'), (b' bar', None)]
>>> print(decode_header('=?iso-8859-1?q?hello_f=F3o_bar?='))
[(b'hello f\xf3o bar', 'iso-8859-1')]
>>>
>>> # Assuming not, this is a much saner interface
>>> print(decode_header_to_string('hello =?utf-8?B?ZsOzbw==?= bar'))
hello fóo bar
>>> print(decode_header_to_string('=?iso-8859-1?q?hello_f=F3o_bar?='))
hello fóo bar

(Closes #30548 and replaces it.)

cpython-cla-bot · 2022-05-17T20:52:50Z

All commit authors signed the Contributor License Agreement.

…de_header()

This function's possible return types have been surprising and error-prone for the entirety of its Python 3.x history. It can return either:

1. `typing.List[typing.Tuple[bytes, typing.Optional[str]]]`, of length 1 or more
2. or `typing.List[typing.Tuple[str, None]]`, of length exactly 1

This means that any user of this function must be prepared to accept either `bytes` or `str` for the first member of the 2-tuples it returns, which is very surprising behavior in Python 3.x, particularly given that the second member of the tuple is supposed to represent the charset/encoding of the first member.

This patch documents the behavior of this function, and adds test cases to demonstrate it.

As discussed in bpo-22833, this cannot be changed in a backwards-compatible way, and some users of this function depend precisely on the existing behavior.

This function takes an email header, possibly with portions encoded according to RFC2047, and converts it to a standard Python string.

It is intended to provide a sane, Pythonic replacement for `email.header.decode_header()`, which has two major problems:

1. It may return either bytes or str (bpo-22833/pythongh-67022), an inconsistent and error-prone interface.
2. It exposes details of an email header value's encoding which most users will not care about or want to deal with.

Many users likely just want to decode an email header value to a Python string. It turns out that `email.header` already contained most of the code necessary to do this, and providing `decode_header_to_string` as a documented wrapper function points users in the right direction.

dlenski requested a review from as a code owner May 17, 2022

bedevere-bot added the awaiting review label May 17, 2022

dlenski force-pushed the gh60722 branch from 6b35dc0 to 712d83d Compare May 17, 2022

dlenski mentioned this pull request May 17, 2022

bpo-22833: Fix bytes/str inconsistency in email.header.decode_header() #30548

Closed

dlenski added 2 commits May 17, 2022

dlenski force-pushed the gh60722 branch from 712d83d to 9a8a34c Compare May 17, 2022

