The Wayback Machine - https://web.archive.org/web/20231117153832/https://github.com/python/cpython/issues/112203

codecs.StreamRecoder implements read() but passes read1() to underlying buffer #112203

Open
ali1234 opened this issue Nov 17, 2023 · 0 comments
Labels
type-bug An unexpected behavior, bug, or error


ali1234 commented Nov 17, 2023

Bug report

Bug description:

StreamRecoder implements read() but does not implement read1() - instead, that will be passed through to the underlying buffer via a __getattr__ wrapper:

cpython/Lib/codecs.py

Lines 868 to 873 in 7c50800

    def __getattr__(self, name,
                    getattr=getattr):
        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream, name)

This means if you call read() then you get the recoded data as expected. But if you call read1() then you get the original bytes instead.
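The delegation mechanism can be illustrated in isolation. The sketch below uses a toy class, `UppercasingProxy`, which is a hypothetical stand-in and not CPython code: it overrides read() to transform the data and forwards every other attribute through `__getattr__`, in the same way as the snippet quoted above.

```python
import io


class UppercasingProxy:
    """Toy stand-in (hypothetical, not CPython code).

    read() transforms the data; everything else is forwarded.
    """

    def __init__(self, stream):
        self.stream = stream

    def read(self, size=-1):
        # The only method that applies the transformation.
        return self.stream.read(size).upper()

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, e.g. for read1.
        return getattr(self.stream, name)


via_read = UppercasingProxy(io.BytesIO(b"abc")).read()
via_read1 = UppercasingProxy(io.BytesIO(b"abc")).read1()

print(via_read)   # transformed by the proxy
print(via_read1)  # untouched bytes from the underlying BytesIO
```

Because read1 is not defined on the proxy, the lookup falls through to the wrapped stream and the transformation is silently skipped.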

This becomes a problem if you then wrap the recoder in a TextIOWrapper. If you call read() on the resulting object, it calls read() on the recoder, but if you call read(1000) (i.e. with a size), it calls read1() on the recoder instead. As a result, you can get a mixture of correctly and incorrectly decoded strings.
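One possible workaround, sketched here under assumptions rather than proposed as the fix, is a subclass that defines read1() in terms of the recoding read(). The names `Read1Recoder` and `make_fixed_stream` are hypothetical and not part of the codecs module. Note that the size passed to StreamRecoder.read() is only an approximate byte count for the underlying read, so this read1() may return more or fewer bytes than requested.

```python
import codecs
import io


class Read1Recoder(codecs.StreamRecoder):
    """Hypothetical workaround, not part of the codecs module."""

    def read1(self, size=-1):
        # Route read1() through the recoding read() so it is no longer
        # forwarded to the underlying stream by __getattr__.
        return self.read(size)


def make_fixed_stream(b):
    utf8 = codecs.lookup("utf-8")
    latin1 = codecs.lookup("latin1")
    # Same setup as in the report: decode as utf8, then encode as latin1.
    return Read1Recoder(
        io.BytesIO(b),
        latin1.encode,
        utf8.decode,
        utf8.streamreader,
        latin1.streamwriter,
    )


s = 'Hello \u2013my\u2013 World'
b = s.encode('utf8').decode('latin1').encode('utf8')  # simulated mojibake

recoded_read1 = make_fixed_stream(b).read1()

text = io.TextIOWrapper(make_fixed_stream(b), encoding='utf-8')
head = text.read(5)  # sized read: per the report, uses read1() on the buffer
tail = text.read()   # unsized read: uses read() on the buffer
```

With this subclass both code paths see recoded bytes, so mixing sized and unsized reads on the TextIOWrapper should yield consistently decoded text.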

Found while attempting to use StreamRecoder to fix mojibake.

Example code:

import codecs
import io


def mojibake(s):
    # makes an illegal double-utf encoding, to simulate mojibake
    return s.encode('utf8').decode('latin1').encode('utf8')


def unmojibake(b):
    # decodes the illegal mojibake
    return b.decode('utf8').encode('latin1').decode('utf8')


def make_stream(b):
    i = io.BytesIO(b)
    utf8 = codecs.lookup("utf-8")
    latin1 = codecs.lookup("latin1")
    # set up recoder to decode as utf8 and then encode as latin1
    return codecs.StreamRecoder(
        i,
        latin1.encode,
        utf8.decode,     # won't be used, we do not write
        utf8.streamreader,
        latin1.streamwriter  # won't be used, we do not write
    )

s = 'Hello \u2013my\u2013 World' # 'Hello –my– World'
b = mojibake(s)
assert unmojibake(b) == s

read = make_stream(b).read()
read1 = make_stream(b).read1()

print('original', b)
print('read()  ', read)
print('read1() ', read1)

# expected result: read == read1
# actual result: read1 == b (the original bytes we put into the io)

CPython versions tested on:

3.11

Operating systems tested on:

Linux

@ali1234 ali1234 added the type-bug An unexpected behavior, bug, or error label Nov 17, 2023