The Wayback Machine - https://web.archive.org/web/20220612141648/https://github.com/python/cpython/pull/92900
gh-67022: Document bytes/str inconsistency in email.header.decode_header() and add .decode_header_to_string() as a sane alternative #92900

Open
wants to merge 2 commits into
base: main
from

Conversation

@dlenski dlenski commented May 17, 2022

This function's possible return types have been surprising and error-prone
for the entirety of its Python 3.x history. It can return either:

typing.List[typing.Tuple[str, None]], of length exactly 1
or typing.List[typing.Tuple[bytes, typing.Optional[str]]]

This function can't be rewritten to be more consistent in a backwards-compatible way, because some users of this function depend on the existing return type(s).
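
A minimal sketch of the two return shapes listed above, using only the existing `email.header.decode_header()`. The variable names are illustrative and not part of this PR:

```python
from email.header import decode_header

# A header with no RFC 2047 encoded words comes back as a single
# (str, None) pair.
plain = decode_header('hello world')

# A header containing an encoded word comes back as (bytes, charset) pairs,
# and the unencoded fragments around it are bytes too (with charset None).
mixed = decode_header('hello =?utf-8?B?ZsOzbw==?= bar')

print(plain)
print(mixed)
print([type(text).__name__ for text, _charset in plain + mixed])
```

Code that consumes the result therefore cannot assume the first tuple member has a single type.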

This PR addresses the inconsistency as suggested by @JelleZijlstra in #67022 (comment):

  1. we should document the surprising return type at https://docs.python.org/3.10/library/email.header.html.

  2. [create] a new function with a sane return type.

The "sane", Pythonic way to handle the decoding of an email/MIME message header value is simply to convert the whole header to a str; the details of exactly which parts of that header were encoded in which charsets are not relevant to most users. Fortunately, the email.header module already contains a mechanism to do this, via the __str__ method of email.header.Header, so we can simply create a wrapper function to guide users in the right direction.
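
One way such a wrapper could look with the existing public API is sketched below. The name `header_value_to_str` is hypothetical and this is not necessarily how the PR implements `decode_header_to_string`; it only assumes the existing `decode_header()` and `make_header()` functions and `Header.__str__`:

```python
from email.header import decode_header, make_header


def header_value_to_str(value):
    """Decode an RFC 2047 header value to a plain str (hypothetical helper).

    decode_header() splits the value into (text, charset) pairs,
    make_header() rebuilds a Header object from those pairs, and
    str() on that Header joins the decoded chunks into one string.
    """
    return str(make_header(decode_header(value)))


print(header_value_to_str('hello =?utf-8?B?ZsOzbw==?= bar'))
print(header_value_to_str('=?iso-8859-1?q?hello_f=F3o_bar?='))
```

The caller never sees whether an intermediate chunk was `bytes` or `str`, which is the point of the proposed function.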

Example of the old/inconsistent (decode_header) vs. new/sane (decode_header_to_string) functions:

>>> from email.header import decode_header, decode_header_to_string
>>>
>>> # Do most users care about this distinction in (sub)encodings? I think not.
>>> print(decode_header('hello =?utf-8?B?ZsOzbw==?= bar'))
[(b'hello ', None), (b'f\xc3\xb3o', 'utf-8'), (b' bar', None)]
>>> print(decode_header('=?iso-8859-1?q?hello_f=F3o_bar?='))
[(b'hello f\xf3o bar', 'iso-8859-1')]
>>>
>>> # Assuming not, this is a much saner interface
>>> print(decode_header_to_string('hello =?utf-8?B?ZsOzbw==?= bar'))
hello fóo bar
>>> print(decode_header_to_string('=?iso-8859-1?q?hello_f=F3o_bar?='))
hello fóo bar

(Closes #30548 and replaces it.)

@dlenski dlenski requested a review from as a code owner May 17, 2022
@cpython-cla-bot cpython-cla-bot bot commented May 17, 2022

All commit authors signed the Contributor License Agreement.
CLA signed

dlenski added 2 commits May 17, 2022
…de_header()

This function's possible return types have been surprising and error-prone
for the entirety of its Python 3.x history. It can return either:

1. `typing.List[typing.Tuple[bytes, typing.Optional[str]]]` of length >=1
2. or `typing.List[typing.Tuple[str, None]]`, of length exactly 1

This means that any user of this function must be prepared to accept either
`bytes` or `str` for the first member of the 2-tuples it returns, which is a
very surprising behavior in Python 3.x, particularly given that the second
member of the tuple is supposed to represent the charset/encoding of the
first member.
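
A sketch of the defensive handling this forces on callers of the existing function. The helper name `join_decoded_parts`, the ASCII fallback for fragments without a charset, and `errors='replace'` are assumptions made for illustration:

```python
from email.header import decode_header


def join_decoded_parts(value):
    """Join decode_header() output into one str (hypothetical helper)."""
    pieces = []
    for text, charset in decode_header(value):
        if isinstance(text, bytes):
            # Encoded words and the fragments around them arrive as bytes;
            # fragments with no charset are assumed here to be ASCII.
            text = text.decode(charset or 'ascii', errors='replace')
        # A header without any encoded words arrives as str already.
        pieces.append(text)
    return ''.join(pieces)


print(join_decoded_parts('hello world'))
print(join_decoded_parts('hello =?utf-8?B?ZsOzbw==?= bar'))
```

A charset name that Python does not recognize would still raise `LookupError` in this sketch.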

This patch documents the behavior of this function, and adds test cases
to demonstrate it.

As discussed in bpo-22833, this cannot be changed in a backwards-compatible
way, and some users of this function depend precisely on the existing
behavior.

This function takes an email header, possibly with portions encoded
according to RFC2047, and converts it to a standard Python string.

It is intended to provide a sane, Pythonic replacement for
`email.header.decode_header()`, which has two major problems:

1. May return either bytes or str (bpo-22833/pythongh-67022), an
   inconsistent and error-prone interface
2. Exposes details of an email header value's encoding which
   most users will not care about or want to deal with. Many users
   likely just want to decode an email header value to a Python
   string.

It turns out that `email.header` already contained most of the code
necessary to do this, and providing `decode_header_to_string` as a
documented wrapper function points users in the right direction.