X Tutup
The Wayback Machine - https://web.archive.org/web/20240914112802/https://github.com/python/cpython/pull/123738
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking 鈥淪ign up for GitHub鈥, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-119609: Add PyUnicode_Export() function #123738

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Sep 5, 2024

Add PyUnicode_Export(), PyUnicode_GetBufferFormat() and PyUnicode_Import() functions to the limited C API.


馃摎 Documentation preview 馃摎: https://cpython-previews--123738.org.readthedocs.build/

Add PyUnicode_Export(), PyUnicode_GetBufferFormat() and
PyUnicode_Import() functions to the limited C API.
Doc/c-api/unicode.rst Outdated Show resolved Hide resolved
Doc/c-api/unicode.rst Outdated Show resolved Hide resolved
Doc/c-api/unicode.rst Outdated Show resolved Hide resolved
@bedevere-app
Copy link

bedevere-app bot commented Sep 5, 2024

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

Doc/c-api/unicode.rst Outdated Show resolved Hide resolved
Doc/c-api/unicode.rst Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
@vstinner
Copy link
Member Author

vstinner commented Sep 5, 2024

I have made the requested changes; please review again.

@bedevere-app
Copy link

bedevere-app bot commented Sep 5, 2024

Thanks for making the requested changes!

@mdboom: please review the changes made to this pull request.

@bedevere-app bedevere-app bot requested a review from mdboom September 5, 2024 16:56
@vstinner
Copy link
Member Author

vstinner commented Sep 5, 2024

@mdboom @picnixz: Thanks for your reviews. I think that I addressed most, if not all, of them :-)

Copy link
Contributor

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A final nitpick on my side (sorry but I only skimmed through the implementation since I don't have much energy now...).

A bit off-topic, but do we use the PRI* macros in the code base? I saw that you used the %i for formatting a uint32_t value, which usually works, but I wondered whether you prefer using the platform-dependent ones.

Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
@vstinner
Copy link
Member Author

vstinner commented Sep 5, 2024

A side effect of this change is to add the __release_buffer__() method to the built-in str type.

I had to implement collections.UserString.__release_buffer__() to fix test_collections (the UserString simply raises NotImplementedError).

ucs2);
ucs2[len] = 0;

return unicode_export(unicode, view, format,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the cases where there is a conversion, does the buffer need to hold a reference to the unicode object? Shouldn't we instead be passing NULL here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to store unicode as view.obj in all cases. So PyBuffer_Release() is able to find the unicode_releasebuffer() function. Technically, you're right that in UCS2 and UCS4 cases, when we allocate a buffer, we don't need to hold a reference to unicode to make sure that data remains valid.

Doc/c-api/unicode.rst Outdated Show resolved Hide resolved
Doc/c-api/unicode.rst Outdated Show resolved Hide resolved
Doc/c-api/unicode.rst Outdated Show resolved Hide resolved
Doc/c-api/unicode.rst Show resolved Hide resolved
Doc/c-api/unicode.rst Outdated Show resolved Hide resolved
Objects/unicodeobject.c Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Objects/unicodeobject.c Outdated Show resolved Hide resolved
Doc/c-api/unicode.rst Outdated Show resolved Hide resolved
@vstinner
Copy link
Member Author

@serhiy-storchaka: I updated the PR to use _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32(), and address your other comments.

@vstinner
Copy link
Member Author

I had to remove the check "last character in a NUL character" in tests, since _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32() don't write such last NUL character.

@encukou
Copy link
Member

encukou commented Sep 12, 2024

I had to remove the check "last character in a NUL character" in tests, since _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32() don't write such last NUL character.

That's a security vulnerability waiting to happen.

Since the internal buffers do have the terminating NUL, and in most cases we expose those, people will expect the NUL even if we'd explicitly document that it's not guaranteed. IMO, we need to add it.

@vstinner
Copy link
Member Author

@encukou:

Since the internal buffers do have the terminating NUL, and in most cases we expose those, people will expect the NUL even if we'd explicitly document that it's not guaranteed. IMO, we need to add it.

@serhiy-storchaka: Sorry, I reverted the "Use _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32()" change to get back the NUL trailing character.

@vstinner
Copy link
Member Author

I'm not sure if we should guarantee that the exported buffer ends with a NUL character. I'm not sure that all Python implementations will be able to provide such guarantee in an efficient way (without having to allocate a temporary buffer for that).

@encukou
Copy link
Member

encukou commented Sep 12, 2024

We should. As long as the API is used from C, exported strings should be NUL-terminated for safety.
Another implementation can add a function like XPyUnicode_Export_Raw; if it becomes popular CPython can adopt it as an alias of PyUnicode_Export.

@vstinner
Copy link
Member Author

We should. As long as the API is used from C, exported strings should be NUL-terminated for safety.

I suggest to continue this discussion at: capi-workgroup/decisions#33 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
X Tutup