StreamRecoder implements read() but does not implement read1(). Instead, read1() is passed through to the underlying stream via the __getattr__ wrapper:

    def __getattr__(self, name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream, name)

This means if you call read() then you get the recoded data as expected. But if you call read1() then you get the original bytes instead.
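The pass-through can be illustrated with a toy wrapper that is unrelated to codecs. This is only a sketch: `UpperReader` is a hypothetical class, and upper-casing stands in for the recoding step.

```python
import io


class UpperReader:
    """Hypothetical wrapper: read() transforms data, everything else is delegated."""

    def __init__(self, stream):
        self.stream = stream

    def read(self, size=-1):
        # The only method that applies the transformation.
        return self.stream.read(size).upper()

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, e.g. for read1().
        return getattr(self.stream, name)


transformed = UpperReader(io.BytesIO(b"abc")).read()
passed_through = UpperReader(io.BytesIO(b"abc")).read1()
print(transformed, passed_through)
```

Because read1() is not defined on the wrapper, it resolves to the underlying stream's method and skips the transformation entirely.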
This becomes a problem if you then wrap the recoder in a TextIOWrapper(). If you call read() on the resulting object, it will call read() on the recoder, but if you call read(1000) (i.e. with a size), it calls read1() on the recoder instead. As a result, you can get a mixture of correctly and incorrectly decoded strings.
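One way to check which method TextIOWrapper calls on its buffer is to record the calls. This is a rough sketch: `SpyBytesIO` is a hypothetical helper, and the call pattern it records is CPython implementation behavior, not a documented guarantee.

```python
import io


class SpyBytesIO(io.BytesIO):
    """Hypothetical helper that records which read method is called."""

    def __init__(self, data):
        super().__init__(data)
        self.calls = []

    def read(self, size=-1):
        self.calls.append("read")
        return super().read(size)

    def read1(self, size=-1):
        self.calls.append("read1")
        return super().read1(size)


# Unsized read on the text layer.
spy_unsized = SpyBytesIO(b"hello world")
text_unsized = io.TextIOWrapper(spy_unsized, encoding="utf-8")
text_unsized.read()

# Sized read on the text layer.
spy_sized = SpyBytesIO(b"hello world")
text_sized = io.TextIOWrapper(spy_sized, encoding="utf-8")
text_sized.read(5)

print("read()  ->", spy_unsized.calls)
print("read(5) ->", spy_sized.calls)
```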
Found while attempting to use StreamRecoder to fix mojibake.
Example code:
    import codecs
    import io

    def mojibake(s):
        # makes an illegal double-utf encoding, to simulate mojibake
        return s.encode('utf8').decode('latin1').encode('utf8')

    def unmojibake(b):
        # decodes the illegal mojibake
        return b.decode('utf8').encode('latin1').decode('utf8')

    def make_stream(b):
        i = io.BytesIO(b)
        utf8 = codecs.lookup("utf-8")
        latin1 = codecs.lookup("latin1")
        # set up recoder to decode as utf8 and then encode as latin1
        return codecs.StreamRecoder(
            i,
            latin1.encode,
            utf8.decode,  # won't be used, we do not write
            utf8.streamreader,
            latin1.streamwriter  # won't be used, we do not write
        )

    s = 'Hello \u2013my\u2013 World'  # 'Hello –my– World'
    b = mojibake(s)
    assert unmojibake(b) == s
    read = make_stream(b).read()
    read1 = make_stream(b).read1()
    print('original', b)
    print('read() ', read)
    print('read1() ', read1)
    # expected result: read == read1
    # actual result: read1 == b (the original bytes we put into the io)

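A possible workaround, not part of the original report, is to subclass StreamRecoder and define read1() so that it goes through the same recoding path as read(). This is a sketch under assumptions: `Read1Recoder` and `make_fixed_stream` are hypothetical names. Delegating read1() to read() also ignores read1()'s usual "at most one raw read" behavior, which is assumed to be acceptable for an in-memory stream like this one.

```python
import codecs
import io


class Read1Recoder(codecs.StreamRecoder):
    """Hypothetical subclass: route read1() through the recoding read()."""

    def read1(self, size=-1):
        # Assumption: treating read1() like read() is acceptable here.
        return self.read(size)


def make_fixed_stream(b):
    utf8 = codecs.lookup("utf-8")
    latin1 = codecs.lookup("latin1")
    return Read1Recoder(
        io.BytesIO(b),
        latin1.encode,
        utf8.decode,
        utf8.streamreader,
        latin1.streamwriter,
    )


s = 'Hello \u2013my\u2013 World'
b = s.encode('utf8').decode('latin1').encode('utf8')  # simulated mojibake

direct = make_fixed_stream(b).read1()

text = io.TextIOWrapper(make_fixed_stream(b), encoding="utf-8")
head = text.read(5)  # sized read on the text layer
tail = text.read()   # unsized read on the text layer
print(direct, repr(head + tail))
```

Since read1 is now found by normal attribute lookup, __getattr__ is never consulted for it, so both the direct call and the TextIOWrapper path receive recoded bytes.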
CPython versions tested on:
3.11
Operating systems tested on:
Linux
The __getattr__ wrapper quoted above is at cpython/Lib/codecs.py, lines 868 to 873 in 7c50800.