The Wayback Machine - https://web.archive.org/web/20230302181750/https://github.com/python/cpython/issues/93518
Assignment of single byte to a bytearray[index] broken #93518

Closed
robomartin opened this issue Jun 5, 2022 · 22 comments

Comments

@robomartin

robomartin commented Jun 5, 2022

Bug report

>>> a = bytearray([1,2,3,4,5])
>>> b = b'\x10'

>>> a[2] = b
Traceback (most recent call last):
  File "C:\Python\Python39\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
TypeError: 'bytes' object cannot be interpreted as an integer

>>> a[2:3] = b    # Trick to make this work; had to use it to get around the problem
>>> a
bytearray(b'\x01\x02\x10\x04\x05')

>>> a[2] = ord(b)    # This also works, but it should not be required b'\x10' is already a single byte

Environment

Python 3.9 on Windows 10
Python 3.8 on WSL
MicroPython 1.18

Notes

The documentation for both bytes and bytearrays clearly indicates that they are both sequences of bytes in the range of 0 to 255, inclusive.

https://docs.python.org/3/library/functions.html#func-bytearray

A byte is the most fundamental form of integer (if we ignore nibbles, which nobody uses these days). At the very least the error message is wrong: an integer in the range from 0 to 255 can, most definitely, be interpreted as an integer, because it is one.

@robomartin robomartin added the type-bug An unexpected behavior, bug, or error label Jun 5, 2022
@robomartin robomartin changed the title from "Assignment single byte to a bytearray[index] broken" to "Assignment of single byte to a bytearray[index] broken" Jun 5, 2022
@ericvsmith
Member

I think there's an implicit "by Python" at the end of the error message, but it could probably be improved.

I'm not sure we can change the behavior of bytes being incompatible with integers. I think you're asking for a special case of bytes of length 1 being compatible with integers. But I think we'd prefer the "bytes assigned to integers" case to always be an error, and not have it dependent on the length of the bytes object.

@robomartin
Author

Not sure I understand given the documentation:

https://docs.python.org/3/library/functions.html#func-bytearray

Quoting:

"class bytearray([source[, encoding[, errors]]])
Return a new array of bytes. The bytearray class is a mutable sequence of integers in the range 0 <= x < 256."

"class bytes([source[, encoding[, errors]]])
Return a new “bytes” object which is an immutable sequence of integers in the range 0 <= x < 256. bytes is an immutable version of bytearray – it has the same non-mutating methods and the same indexing and slicing behavior."

I don't think there is (or should be) a distinction between bytearray and bytes other than with respect to their mutability. Length should not make a difference in interpretation.

I'll come back and add a few more examples of what I think is inconsistent behavior later. I don't have the time to do it right now and want to make sure my examples are clear.

Thanks for your input.

@ericvsmith
Member

If you think:

>>> a = bytearray([1,2,3,4,5])
>>> b = b'\x10'
>>> a[2] = b

should work, what would:

>>> b = b'\x10\x10'
>>> a[2] = b

do?

@ericvsmith
Member

In your original example,

>>> a[2] = b[0]

might be a better way to solve your problem.
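For reference, the three spellings that appear in this thread (indexing, `ord()`, and slice assignment) can be compared side by side. This is only a small sketch; the variable names `a1`, `a2`, and `a3` are made up for the comparison.

```python
b = b'\x10'

# 1. Index the bytes object to get the int it contains.
a1 = bytearray([1, 2, 3, 4, 5])
a1[2] = b[0]

# 2. ord() accepts a length-1 bytes object and returns the same int.
a2 = bytearray([1, 2, 3, 4, 5])
a2[2] = ord(b)

# 3. Slice assignment takes an iterable of ints, so bytes fits as-is.
a3 = bytearray([1, 2, 3, 4, 5])
a3[2:3] = b

print(a1, a2, a3)
```

All three leave the bytearray with the same contents; they differ only in whether the right-hand side is an item (an int) or a sequence of items.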

@rhettinger
Contributor

I can see why this violates your expectations, but it isn't a bug. It is working as designed and tested. For better or for worse, the design concept is that bytearrays hold numbers in the range 0 <= x < 256. Numbers go in and numbers come out, and the numbers are not equivalent to the corresponding bytes object:

>>> ba = bytearray([10, 20, 30, 40, 50])
>>> ba[0]            # Numbers come out
10
>>> b'\n' == 10  # A byte is not equivalent to a number
False
>>> ba[0] = 11   # Numbers go in
>>> list(ba)
[11, 20, 30, 40, 50]

It may help to think of bytes and bytearrays as containers. For a list, we would not expect the assignment of a number to be the same as the assignment of another list:

>>> s = [10, 20, 30, 40, 50]
>>> s[0]              # Number comes out
10
>>> s[0] = 11              # Number goes in
>>> s
[11, 20, 30, 40, 50]
>>> s[0] = [11]    # This is NOT the same!
>>> s
[[11], 20, 30, 40, 50]

@ericvsmith
Member

Since this is working as intended, I’m going to close this.

@robomartin
Author

If you think:

>>> a = bytearray([1,2,3,4,5])
>>> b = b'\x10'
>>> a[2] = b

should work, what would:

>>> b = b'\x10\x10'
>>> a[2] = b

do?

That's a length error.

@ericvsmith
Member

If you think:

>>> a = bytearray([1,2,3,4,5])
>>> b = b'\x10'
>>> a[2] = b

should work, what would:

>>> b = b'\x10\x10'
>>> a[2] = b

do?

That's a length error.

And that's exactly the kind of data-dependent error I think we don't want to have.

Sort of like python 2.7 would give UnicodeDecodeErrors only if non-ascii data was present for some operations.
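To make the "data-dependent" concern concrete, here is a rough sketch of what the requested behaviour would amount to. `assign_single` is a hypothetical helper written for this illustration; nothing like it exists in CPython.

```python
def assign_single(ba: bytearray, index: int, value: bytes) -> None:
    """Hypothetical lenient item assignment: accept bytes only if len == 1."""
    if len(value) != 1:
        # Whether this branch is taken depends on the data, not the type.
        raise ValueError(f"expected exactly 1 byte, got {len(value)}")
    ba[index] = value[0]


ba = bytearray([1, 2, 3, 4, 5])
assign_single(ba, 2, b'\x10')          # succeeds: length happens to be 1

outcomes = []
for candidate in (b'', b'\x10', b'\x10\x10'):
    try:
        assign_single(ba, 2, candidate)
        outcomes.append("ok")
    except ValueError:
        outcomes.append("error")
print(outcomes)
```

The same call with the same argument type succeeds or fails depending only on the length of the value passed in, which is the property being objected to.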

@robomartin
Author

That's a length error.

And that's exactly the kind of data-dependent error I think we don't want to have.

I don't understand what you mean by this. There is no data-dependent error here. It's a bug. You are trying to take a value that is a number in memory and store it in another memory location that contains exactly the same type and size of numbers. There is no conversion needed at all. The error is a bug.

This bug report was closed prematurely.

In case I am not making myself clear, here's some code anyone can run:

from ctypes import string_at
from sys import getsizeof


# data elements limited to [0, 255] range
def test_b_to_ba_assignment(data: list):
    b = bytes([0xFF - n for n in data])
    i = 0

    print(f"ba = {bytearray(data)}")
    print(f"b = {b}")
    print()

    for j in range(1+len(data)):
        operation = f"ba[{i}:{j}] = b[{i}:{j}]"
        print(operation)
        ba = bytearray(data)
        print(f"ba before: {ba}")
        exec(operation)
        print(f"ba after : {ba} - No problem")
        print()


def bytearray_vs_bytes_test():
    print("Bytes and bytearrays are exactly the same thing, except that bytes are not mutable:")
    print("They are not characters. They are bytes. Numbers.")
    # Define both objects before printing them; printing first raises UnboundLocalError.
    ba = bytearray([1, 2, 3, 4])
    b = b'\x01\x02\x03\x04'
    print(ba)
    print(b)
    # Print bytearray details
    print("ba:", end=" ")
    for n in ba:
        print(f"{n}", end=" ")
    print("  types:", end=" ")
    for n in ba:
        print(f"{type(n)}", end=" ")

    # Print bytes details
    print("\nb: ", end=" ")
    for n in b:
        print(f"{n}", end=" ")
    print("  types:", end=" ")
    for n in b:
        print(f"{type(n)}", end=" ")

    # Internal representation
    print("\n\nThe data structure for a bytes type shows what we are working with.")
    print("b = b'\\x01\\x02\\x03\\x04'")
    print(f"{string_at(id(b), getsizeof(b)).hex()}")
    print("                                 ^                              ^^^^^^^^")
    print("                                 length                          data")
    print("A bytearray structure is a bit more complex to unpack, so I'll leave it at that.\n")

    print("The error message I reported was:")
    print("    TypeError: 'bytes' object cannot be interpreted as an integer")

    print("\nNot sure how the above cannot be interpreted as an integer...")

    print("\nDoes it change if it is a single element bytes object?")

    c = b'\x01'
    print("c = b'\\x01'")
    print(f"{string_at(id(c), getsizeof(c)).hex()}")
    print("                                 ^                              ^^")
    print("                                 length                         data")
    print("Nope.  Same thing.  It's an integer.  In memory.")

    print("\nSo...what are the issues with assignment then...?")
    test_b_to_ba_assignment([1,2,3,4])

    print("In all of the above, assigning bytes into a bytearray does not pose any challenges.")
    print("They are bytes, after all, numbers, and the same as in a bytearray.")
    print("Therefore, assignment is straightforward and requires no conversion at all.")
    print("\nThis should work:")
    print("ba = bytearray(b'\\x01\\x02\\x03\\x04')")
    print("b = b'\\xFF'")
    print("ba[2] = b")

    print("\n...it does not")
    try:
        ba[2] = b
    except TypeError as error:
        print(f"TypeError: {error}")

    print("\nWhich makes no sense at all considering the above.")

bytearray_vs_bytes_test()

If you look through the CPython source code that defines both bytes and bytearrays it says so right there. There's a comment, I believe in the bytearrays.h file, that pretty much says, paraphrasing, "these are not characters, they are bytes, numbers".

In fact, the code for bytearrays actually imports bytes. I didn't have the time to take a deep dive into the codebase (I won't for at least a couple of months, until I finish a project) but looked around enough to be reasonably convinced that this error should not be.

Once again, this isn't about my feelings on how bytes and bytearrays should work. I want to re-center the discussion because it is all too easy to go off into the weeds.

The error was:

TypeError: 'bytes' object cannot be interpreted as an integer

This is false, bytes ARE integers. They require no interpretation. This is particularly true in the context of the definition of both bytes and bytearrays, which clearly states they store 8 bit integers. In other words, the values in memory are not open to interpretation at all. All you have to do is fetch the byte from one memory location and store it in another.

The error is wrong. Should this assignment fail for some other reason? I don't think so. Yet that's a different matter. Whatever the case may be, this error is wrong and it does nothing but confuse anyone who doesn't have low level coding (assembler, C, Forth, closer to the hardware) experience when coding Python and using bytes and bytearrays. With the exception of mutability, these things are the same. The data structure overhead (header, reference count, pointers, etc.) is different because of one being mutable. The payload, however, the data, is exactly the same. No interpretation required.

b = bytes([n for n in range(256)])
print("                                   length                       sequence of 256 bytes starts here")
print("                                   |                            |")
print(f"{string_at(id(b), getsizeof(b)).hex()}")

Look at all those bytes in memory.

@ericvsmith
Member

Unfortunately your example is too long for me to read through.

The error was:

TypeError: 'bytes' object cannot be interpreted as an integer

This is false, bytes ARE integers.

Would you understand it better if the error was "Containers that contain multiple 8-bit values cannot be interpreted as a single value in the range 0-255"? Because that's what the error is complaining about. As Raymond explained, the python "bytes" and "bytearray" objects are containers that contain multiple 8-bit values. They are not a single 8-bit value.

@ericvsmith
Member

"Containers that contain multiple 8-bit values cannot be interpreted as a single value in the range 0-255"

Or perhaps better would be "Containers that contain zero, one, or many 8-bit values cannot be interpreted as a single value in the range 0-255".

Note that I'm not actually proposing that we change the error message to this, I'm just trying to break down what it means. And for this explanation I'm specifically trying to avoid the word "bytes", because its meaning is overloaded here: "bytes" refers to both the Python object which is a container, and the 8-bit values that are actually stored in memory.

@robomartin
Author

Unfortunately your example is too long for me to read through.

It's code you copy and run (as I say just prior to the code). One click, paste into IDE, Run. It should make it clearer.

I read what you are saying, but I think both you and Raymond are not talking about the same thing I am.

I understand data structures very well. I understand how these data types are constructed and stored in memory, down to the byte. I have been programming for 40 years in probably a dozen languages. So, yes, I get the internals. And, while I don't have the time to take a deep dive into the CPython codebase, I took a cursory look in order to confirm that my assumptions about what ends up in memory were correct. They are.

It's pretty hard to have a conversation if you don't read what I am writing because of length. I am trying to explain what I am thinking in as much detail as necessary to communicate the issue. Sadly, sometimes one can't reduce things down to one or two sentences.

Please run the code. Click, paste, run.

@robomartin
Author

Or perhaps better would be "Containers that contain zero, one, or many 8-bit values cannot be interpreted as a single value in the range 0-255".

Except that is precisely what is happening in every single assignment example I give in my code. In fact, I will go further in saying that there is no interpretation at all. The data structures store bytes (not Python bytes, real bytes). It says so in the source code where each type is defined. And, if you memory dump the data structure (as I have shown in my code) you can read the bytes right there, no interpretation required.

This is why I think you might be talking about something else. I am talking about the fact that a byte --real byte-- in memory needs to be stored at another location in memory that already contains a byte --real byte-- and Python is throwing an error saying that it cannot interpret a real byte to complete the operation. This is wrong.

@ericvsmith
Member

In C, you're essentially saying:

char ch = "1";

or

char ch = "12";

or

char ch = "";

You're assigning into a location that expects a single byte, but you're passing in a string (which I'm interpreting as "container of zero or more bytes" for this example).

Are there bytes inside the string? Yes. Does C understand what you're doing and allow it to happen? No.

I'm sorry I've failed to convey what's happening here in a way you can understand it. Maybe you'll get a better explanation if you ask on StackOverflow.

@robomartin
Author

robomartin commented Jun 6, 2022

That is not at all what I am saying. Not even close.

Here, I'll try one more time.

Here's a simplified version of the memory footprint of a Python bytes object defined as:

b = b'\x01'

<header><length><...><01><end>

I am using pairs of digits to list an 8 bit hex number. Trying to keep this visually light. So, the "01" you see above would, in Python notation, be written as "0x01"

In the case of a bytearray:

ba = bytearray([5,6,7,8])  

The structure, again, keeping it simple:

<header><length><pointers to data><end>

Since this is a short definition, let's assume there's only one data pointer. At that location we have:

@pointer[0]: <header><length><...><05><06><07><08><end>

Here b has a length of 1 and ba has a length of 4. Of course, ba's structure is more complex because it has to allow for mutability, which might include adding and removing chunks of data and managing a list of pointers to said data.

When you enter:

ba[2] = b 

You are saying, take b --which in this case is a single byte-- and place that at index 2 in ba.

b, at the end of the day, is an 8 bit integer. No interpretation needed other than parsing the data structure to get to the data field.

ba, when all the smoke and bs clears out, stores 8 bit integers.

The statement above says: Take this integer and put it where this other integer used to be.

And Python says:

TypeError: 'bytes' object cannot be interpreted as an integer

Which is nonsense. A bytes object IS a collection of integers. Just like a bytearray. If the length of the assignment is correct, there's nothing to do other than read the byte from b and store it at the appropriate location in ba.

You can call them containers if you want. No problem. However, at the end of the day, in both cases, you have a bunch of bytes in memory, one after the other. All of the assignment variants I presented in my code work just fine. The thing just throws an error when it has to deal with a single byte. This is a bug.

@ericvsmith
Member

ericvsmith commented Jun 6, 2022

I don't know why you refuse to accept that Python treats a bytes object as a container of 0 or more small integers, and never as a single small integer. But since you won't, I don't have anything more to add.

As Raymond said, the existing behavior is documented and tested, so this is not a bug.

@robomartin
Author

robomartin commented Jun 6, 2022

You are saying something I have NEVER said.

The bytes, in both the bytes and bytearray types are contained in a data structure. This is true of almost any OO language I can think of. Raw single bytes in memory as a data type isn't a reality in Python and many other languages.

And yet this is NOT at all what I am discussing. You keep coming back to this idea of single bytes. Something I have never said or implied. At all. Don't put words in my mouth.

This works:

>>> ba = bytearray([5,6,7,8])
>>> b = bytes([1, 2])  # this is EXACTLY the same thing as b = b'\x01\x02'
>>> ba[1:3] = b[0:2]
>>> ba
bytearray(b'\x05\x01\x02\x08')

This works:

>>> ba = bytearray([5,6,7,8])
>>> b = bytes([1,])  # this is EXACTLY the same thing as b = b'\x01'
>>> ba[1] = b[0]
>>> ba
bytearray(b'\x05\x01\x07\x08')

This does not work:

>>> ba = bytearray([5,6,7,8])
>>> b = bytes([1,])  # this is EXACTLY the same thing as b = b'\x01'
>>> ba[1] = b
Traceback (most recent call last):
  File "C:\Python\Python39\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
TypeError: 'bytes' object cannot be interpreted as an integer

Why?
The relevant portions of the data structures --and, more importantly, the data-- for

b = bytes([1,2])
and
b = bytes([1,]) 

Are exactly the same:

b = bytes([1,2])
print(f"{string_at(id(b), getsizeof(b)).hex()}")

b = bytes([1,])
print(f"{string_at(id(b), getsizeof(b)).hex()}")

Which produces:

                                  length                         data
                                 |                               ||||
01000000000000005070e239fb7f00000200000000000000ffffffffffffffff010200
04000000000000005070e239fb7f00000100000000000000fb2e9dfc44b8af560100
                                 |                               ||
                                  length                         data

So, when we say:

ba[1:3] = b[0:2]

The code takes those two bytes in b's data structure, reads them, and writes them into the corresponding bytes in ba's data structure. Done. No problem.

Once we get down to a single byte the behavior becomes incongruent.

If you do this it works fine:

ba[1] = b[0]

If, instead, you do this, it fails:

ba[1] = b

This should not fail. b[0] and b are the same thing. One refers to the first element of b explicitly. The other presents the entire structure. Being that it has a single element, it should make the assignment and move on.

What I think you are saying is: You cannot assign an entire bytes object to a single location in a bytearray because that location holds single 8-bit byte (the real thing) and a bytes object is this large data structure that happens to contain zero or more bytes. If that's what you are saying, you are missing the entire point.

Except this is a single element bytes object being assigned to a single location in the bytearray. Exactly one byte on each side of the assignment operator (the payload, not the data structures).

And now you have to explain this:

ba[1:1] = b

Which works perfectly fine and inserts, in this case, the single byte from b into ba, making it longer.

Wait. Is it appending an entire bytes data structure, not just an 8-bit byte? Let's see:

ba = bytearray([5,6,7,8])
b = bytes([1,])
print(getsizeof(ba))
ba[1:1] = b
print(getsizeof(ba))
ba[1:1] = b
print(getsizeof(ba))
ba[1:1] = b
print(getsizeof(ba))
ba[1:1] = b
print(getsizeof(ba))
ba[1:1] = b
print(getsizeof(ba))
print(ba)
61
61
64
64
64
68
68

Nope. Not doing that. Just gently expanding the allocated storage and adding single bytes.
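The same check can be made with `len()` instead of `getsizeof()`. This is a sketch under the assumption that the element count is what matters here; `getsizeof()` reports allocated storage, which is a CPython detail and can stay flat between insertions. The list name `lengths` is made up for the example.

```python
ba = bytearray([5, 6, 7, 8])
b = bytes([1])

lengths = [len(ba)]
for _ in range(5):
    ba[1:1] = b              # empty slice on the left: a pure insertion
    lengths.append(len(ba))

print(lengths)
print(ba)
```

Each slice assignment adds exactly one element, and every element is still an int in range(256).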

You also have to explain this:

>>> ba = bytearray([5,6,7,8])
>>> b = bytes([1,2])
>>> ba[1:3] = b
>>> ba
bytearray(b'\x05\x01\x02\x08')

Wait. What? So, when I make an assignment of a bytes object with a single element we get a TypeError because it can't convert it, however, when we do the same thing while inserting or with a bytes object that has two or more elements it works just fine? It is literally doing exactly what it should be doing in the single element case.

ba = bytearray([5,6,7,8])
b = bytes([1,2])
ba[1:2] = b    # Works
ba[1] = b      # Fails

Why? These are exactly the same thing.

Explain that. And kindly show me where this is "well documented behavior".

@robomartin
Author

Wait, is it a case of one being an iterator and the other one not?

ba
bytearray(b'\x05\x01\x07\x08')
b
b'\x01'
iter(ba)
<bytearray_iterator object at 0x0000028FF7530850>
iter(b)
<bytes_iterator object at 0x0000028FF75301F0>

Nope.

@TeamSpen210

Looking at the raw memory is a distraction and irrelevant, because it's entirely an implementation detail of CPython. The language doesn't dictate anything about the layout, and it doesn't affect semantics. It would be entirely conforming (but silly) if they stored the bytes as hex character pairs, or anything at all. Single bytes are not a thing in Python, both bytes and bytearray are sequences of integers, restricted to the 0-255 range. They follow the same behaviour as other sequences (like lists, tuples and strings) in regards to indexing - item assignment with items, slice assignment with iterables of the item.

When you do ba[1:3] = b[0:2], conceptually what's occurring is that b is sliced, creating a copy. This is then iterated over, producing individual integers, which are then assigned into the bytearray. As an optimisation CPython likely special cases bytes-bytearray copies, but that's not required by the language. But the requirement holds that the RHS for a slice assignment must be an iterable of integers. Note that ba[1:2] = 4 also fails.

ba[1] = ... is a single item assignment, which requires the RHS to be a single item - so an integer. Python therefore tries to do int(b"\x01") which fails because a bytes object is not an integer, it's a sequence of them (which just happens to be one long). Compare the behaviour to a list (which is also a sequence) - mylist[4] = [4] does not unpack the 4, it puts the other list itself into mylist.
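The item-versus-slice rule described above can be checked directly. This is a small sketch; the `failures` list and the sample values are made up for the illustration.

```python
ba = bytearray([5, 6, 7, 8])

# Slice assignment: the right-hand side is any iterable of ints.
ba[1:3] = [1, 2]                      # a list works
ba[1:3] = (n for n in (3, 4))         # so does a generator
ba[1:3] = b'\x01\x02'                 # and so does a bytes object

failures = []

# Slice assignment with a bare int is rejected.
try:
    ba[1:2] = 4
except TypeError:
    failures.append("slice = int")

# Item assignment with a sequence is rejected.
try:
    ba[1] = b'\x01'
except TypeError:
    failures.append("item = bytes")

# A list follows the same rule, but can hold any object,
# so the inner list is stored rather than rejected.
mylist = [0, 1, 2, 3, 4]
mylist[4] = [4]

print(ba, failures, mylist)
```

The two failures are mirror images: each puts the wrong kind of value (item versus iterable of items) on the right-hand side.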

@robomartin
Author

First, and most important: I really appreciate you taking the time to provide a sensible response that shows you both read and understood what I have been saying.

When you do ba[1:3] = b[0:2], conceptually what's occurring is that b is sliced, creating a copy.

Just so we are clear. Do you think this is what is happening or you know so due to being familiar with the code?

Here's where I think the comparison to lists doesn't quite fit. Yes, of course, I get exactly what you are saying. No issues there. However, lists are sequences of a varied range of objects. You can put anything you want there, even functions (pointers).

In that sense, yes, of course, mylist[4] = [4] should behave exactly as it does.

Bytes and bytearrays are different. They are restricted to having a single byte per element and nothing more. Just 8 bits. Period.

And so, the "intelligent" interpretation of ba[1] = b can't possibly be "What? You want to plug an entire bytes structure into a byte?". What makes sense is: "Oh, since bytearrays only deal in bytes, let's copy that byte over".

In other words, it is impossible to store a sequence in a single bytearray element. By definition. Therefore, the most benevolent interpretation of that assignment might not be throwing an error. In that sense, we have a byte location on the LHS and a byte on the RHS.

I go back to the error message, which is what set me off in this direction:

TypeError: 'bytes' object cannot be interpreted as an integer

This is absolutely not true, particularly of a single element bytes object. True of other types. You can't interpret a single element string as an integer without a lot more information. Single element list? Nope. That single element could be a huge dictionary.

I think it would be silly and confusing to attempt to parse through that and suggest something like: Well, in the case of a string let's interpret a single element string as an ASCII 0-255 character. No, that would be wrong.

In the case of a single element bytes type, this issue simply isn't there at all. There is no question whatsoever that it is an integer. It cannot be anything but that integer. And, therefore, assigning a single element bytes object to a single element within a bytearray is a hand-in-glove condition. You can't store anything but an integer into an element of a bytearray, not so with other data types.

In fact, in my very light look at the source code I thought I saw bytearray actually include bytes. This led me to believe that the array of pointers to data in a bytearray structure actually points to a bunch of structures that are nothing less than bytes objects with a flag thrown to make them mutable only in that context. Again, I did not dive deep; this is nothing more than conjecture on my part.

OK, maybe Python developers disagree and a byte is no longer a byte. That's OK. The error is still wrong. If the behavior is retained, the error should read something like:

TypeError: 'bytes' object cannot be a bytearray element, only integers

or

TypeError: 'bytearray' can only store integers, not 'bytes' objects
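As a rough illustration of how the second wording would read in practice, the sketch below wraps item assignment in a subclass. `ClearByteArray` is a hypothetical name invented here; this is not a proposed patch and not how CPython produces its message.

```python
class ClearByteArray(bytearray):
    """Hypothetical subclass; only the error text for item assignment differs."""

    def __setitem__(self, key, value):
        # Intercept only item (non-slice) assignment of a bytes-like value.
        if not isinstance(key, slice) and isinstance(value, (bytes, bytearray)):
            raise TypeError(
                f"'bytearray' can only store integers, "
                f"not '{type(value).__name__}' objects"
            )
        super().__setitem__(key, value)


ba = ClearByteArray([5, 6, 7, 8])
ba[1] = 1             # integers still go in
ba[2:3] = b'\x02'     # slice assignment is unchanged

try:
    ba[1] = b'\x01'
    error_text = None
except TypeError as exc:
    error_text = str(exc)
print(error_text)
```

The behaviour is identical to a plain bytearray; only the text of the exception changes.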

The behavior might be as intended. Although I disagree, I can accept that. However, the error is wrong and can definitely confuse someone who might be thinking bytes (since we are saying we should not mind the underlying data structures).

I'll drop this here. I have too much to do and, frankly, the experience of making an attempt to contribute has not been pleasant at all (except for this last interaction).

@TeamSpen210

I was speaking in the abstract language sense (i.e. what the docs say the language does), not necessarily about what CPython the implementation is going to do. It certainly would have optimisations for common things like this as long as they don't change observed behaviour. I agree with you that it could be a reasonable behaviour to special case single bytes objects, but it is still a special case.

Perhaps the confusion is because Cannot be interpreted as an integer isn't an error message from bytearray, it's actually produced by the C equivalent to operator.index() and the __index__() magic method. That's used to convert integer-like objects to integers (for instance Numpy scalars), while avoiding accidentally converting say strings and floats. If you try ba[3] = 4.0 you'll get the same exception text.
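The `__index__()` point can be seen with a small experiment. `Small` is a hypothetical integer-like class written for this sketch, standing in for something like a Numpy scalar.

```python
import operator


class Small:
    """Hypothetical integer-like type that defines only __index__()."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __index__(self) -> int:
        return self.n


ba = bytearray(4)

# An object with __index__() is accepted as an item.
ba[0] = Small(200)
converted = operator.index(Small(200))

# Objects without __index__() are rejected, whatever their type.
messages = []
for bad in (4.0, b'\x01', "1"):
    try:
        ba[3] = bad
    except TypeError as exc:
        messages.append(str(exc))

for message in messages:
    print(message)
```

The float, the bytes object, and the string all produce the same wording, with only the type name changing, which suggests the text comes from the generic integer conversion rather than from anything specific to bytes.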

In this context though it is confusing, and might be worth having a unique error message to clarify. There's also additional confusion potentially from the quirk that strings contain 1-long strings, and though bytes don't do that they're otherwise very similar.
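The string quirk mentioned above looks like this in practice (a small sketch; the variable names are made up):

```python
s = "abc"
b = b"abc"

# Indexing a str yields another str of length 1, so it can be indexed again.
str_item = s[0]
str_item_again = s[0][0][0]

# Indexing bytes yields an int; only slicing yields bytes.
bytes_item = b[0]
bytes_slice = b[0:1]

print(type(str_item).__name__, type(bytes_item).__name__, bytes_slice)
```

A str item is itself a sequence, while a bytes item is not, which may be part of why `ba[1] = b'\x01'` looks plausible at first glance.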

@robomartin
Author

I appreciate the insight you provided. Thanks.

@ericvsmith ericvsmith removed the type-bug An unexpected behavior, bug, or error label Jun 7, 2022