gh-119609: Add PyUnicode_Export() function #123738

vstinner · 2024-09-05T15:23:10Z

Add PyUnicode_Export(), PyUnicode_GetBufferFormat() and PyUnicode_Import() functions to the limited C API.

Issue: [C API] Add PyUnicode_Export() and PyUnicode_Import() functions #119609

📚 Documentation preview 📚: https://cpython-previews--123738.org.readthedocs.build/

bedevere-app · 2024-09-05T15:57:39Z

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

vstinner · 2024-09-05T16:56:01Z

I have made the requested changes; please review again.

bedevere-app · 2024-09-05T16:56:05Z

Thanks for making the requested changes!

@mdboom: please review the changes made to this pull request.

vstinner · 2024-09-05T16:56:26Z

@mdboom @picnixz: Thanks for your reviews. I think that I addressed most, if not all, of them :-)

picnixz

A final nitpick on my side (sorry but I only skimmed through the implementation since I don't have much energy now...).

A bit off-topic, but do we use the PRI* macros in the code base? I saw that you used the %i for formatting a uint32_t value, which usually works, but I wondered whether you prefer using the platform-dependent ones.
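For readers unfamiliar with the PRI* macros mentioned here, a minimal sketch follows. It is not code from this PR: the helper name `describe_format` and the message text are hypothetical, and only standard `<inttypes.h>` and `<stdio.h>` facilities are used. `PRIu32` expands to the conversion specifier that matches `uint32_t` on the current platform, so the format string and the argument type always agree.

```c
#include <assert.h>
#include <inttypes.h>  /* PRIu32; also provides uint32_t via <stdint.h> */
#include <stdio.h>

/* Hypothetical helper (not part of the PR): write a message about an
   unsupported format code into buf and return snprintf()'s result. */
static int describe_format(char *buf, size_t size, uint32_t format)
{
    assert(buf != NULL);
    /* "%" PRIu32 is string concatenation: the macro supplies the
       platform-specific length modifier and conversion for uint32_t. */
    return snprintf(buf, size, "unsupported format: %" PRIu32, format);
}
```

Calling `describe_format(buf, sizeof buf, 4)` with a 64-byte buffer fills it with `unsupported format: 4`.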

Objects/unicodeobject.c

vstinner · 2024-09-05T18:40:40Z

A side effect of this change is to add the __release_buffer__() method to the built-in str type.

I had to implement collections.UserString.__release_buffer__() to fix test_collections (the method simply raises NotImplementedError).

mdboom · 2024-09-09T20:17:07Z

Objects/unicodeobject.c

+                                 ucs2);
+        ucs2[len] = 0;
+
+        return unicode_export(unicode, view, format,


In the cases where there is a conversion, does the buffer need to hold a reference to the unicode object? Shouldn't we instead be passing NULL here?

vstinner · 2024-09-10

I would prefer to store unicode as view.obj in all cases, so that PyBuffer_Release() can find the unicode_releasebuffer() function. Technically, you're right that in the UCS2 and UCS4 cases, when we allocate a buffer, we don't need to hold a reference to unicode to make sure that the data remains valid.
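The allocation case discussed in this thread (a converted copy that does not borrow the string's own memory) can be sketched in plain C. This is only an illustration under stated assumptions: `widen_ucs1_to_ucs2` is a hypothetical name, it uses `malloc()` rather than CPython's allocators, and it is not the PR's implementation. It mirrors the `ucs2[len] = 0;` line in the diff above by always writing a trailing NUL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the conversion step: widen len UCS1 code
   units into a freshly allocated UCS2 buffer and append a NUL.
   Returns NULL on overflow or allocation failure; the caller owns the
   result and must free() it. */
static uint16_t *widen_ucs1_to_ucs2(const uint8_t *src, size_t len)
{
    assert(src != NULL || len == 0);
    /* Guard the (len + 1) * sizeof(uint16_t) computation. */
    if (len > SIZE_MAX / sizeof(uint16_t) - 1) {
        return NULL;
    }
    uint16_t *ucs2 = malloc((len + 1) * sizeof(uint16_t));
    if (ucs2 == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < len; i++) {
        ucs2[i] = src[i];  /* every UCS1 value fits in 16 bits */
    }
    ucs2[len] = 0;  /* trailing NUL, as in the diff above */
    return ucs2;
}
```

Because the copy is independent of the source, the data stays valid without a reference to the original object; something still has to remember to free it, which is the role a release hook plays in the discussion above.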

vstinner · 2024-09-12T10:37:42Z

@serhiy-storchaka: I updated the PR to use _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32(), and address your other comments.

vstinner · 2024-09-12T10:38:27Z

I had to remove the "last character is a NUL character" check from the tests, since _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32() don't write a trailing NUL character.

encukou · 2024-09-12T11:54:57Z

> I had to remove the "last character is a NUL character" check from the tests, since _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32() don't write a trailing NUL character.

That's a security vulnerability waiting to happen.

Since the internal buffers do have the terminating NUL, and in most cases we expose those, people will expect the NUL even if we'd explicitly document that it's not guaranteed. IMO, we need to add it.
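To make the concern concrete, here is a small sketch of the two ways C code typically consumes such a buffer. Both function names are hypothetical and neither is part of the PR. The first scans until it sees a NUL, which is the habit the comment above expects callers to have; the second only trusts an explicit length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts code units up to the first NUL. Only safe when the buffer is
   guaranteed to be NUL-terminated; otherwise it reads out of bounds. */
static size_t ucs2_len_unbounded(const uint16_t *buf)
{
    assert(buf != NULL);
    size_t n = 0;
    while (buf[n] != 0) {
        n++;
    }
    return n;
}

/* Bounded variant: never reads past max_len code units, so it is safe
   even when no terminator was written. */
static size_t ucs2_len_bounded(const uint16_t *buf, size_t max_len)
{
    assert(buf != NULL);
    size_t n = 0;
    while (n < max_len && buf[n] != 0) {
        n++;
    }
    return n;
}
```

If an exported buffer lacked the terminator, the unbounded variant would have undefined behavior, while the bounded one would still stop at `max_len`.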

vstinner · 2024-09-12T13:44:51Z

@encukou:

> Since the internal buffers do have the terminating NUL, and in most cases we expose those, people will expect the NUL even if we'd explicitly document that it's not guaranteed. IMO, we need to add it.

@serhiy-storchaka: Sorry, I reverted the "Use _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32()" change to get the trailing NUL character back.

vstinner · 2024-09-12T13:46:05Z

I'm not sure if we should guarantee that the exported buffer ends with a NUL character. I'm not sure that all Python implementations will be able to provide such a guarantee efficiently (without having to allocate a temporary buffer for that).

encukou · 2024-09-12T14:00:37Z

We should. As long as the API is used from C, exported strings should be NUL-terminated for safety.
Another implementation can add a function like XPyUnicode_Export_Raw; if it becomes popular CPython can adopt it as an alias of PyUnicode_Export.

vstinner · 2024-09-12T14:16:01Z

> We should. As long as the API is used from C, exported strings should be NUL-terminated for safety.

I suggest continuing this discussion at: capi-workgroup/decisions#33 (comment)

Objects/unicodeobject.c

pythongh-119609: Add PyUnicode_Export() function

c84f314

Add PyUnicode_Export(), PyUnicode_GetBufferFormat() and PyUnicode_Import() functions to the limited C API.

vstinner requested review from a team and encukou as code owners September 5, 2024 15:23

bedevere-app bot added the awaiting core review label Sep 5, 2024

mdboom requested changes Sep 5, 2024

Doc/c-api/unicode.rst (outdated, resolved)

Doc/c-api/unicode.rst (outdated, resolved)

Doc/c-api/unicode.rst (outdated, resolved)

bedevere-app bot removed the awaiting core review label Sep 5, 2024

bedevere-app bot added the awaiting changes label Sep 5, 2024

picnixz reviewed Sep 5, 2024

Doc/c-api/unicode.rst (outdated, resolved)

Doc/c-api/unicode.rst (outdated, resolved)

Objects/unicodeobject.c (outdated, resolved)

vstinner added 2 commits September 5, 2024 18:51

Address reviews

d0cdbd1

Exclude from limited C API 3.13 and older

9b33dca

bedevere-app bot added awaiting change review and removed awaiting changes labels Sep 5, 2024

bedevere-app bot requested a review from mdboom September 5, 2024 16:56

vstinner mentioned this pull request Sep 5, 2024

Add PyUnicode_Export() and PyUnicode_Import() to the limited C API capi-workgroup/decisions#33

Open

picnixz approved these changes Sep 5, 2024

Objects/unicodeobject.c (outdated, resolved)

mdboom approved these changes Sep 5, 2024

bedevere-app bot added awaiting merge and removed awaiting change review labels Sep 5, 2024

Replace PyErr_Format() with PyErr_SetString()

cf1f74a

picnixz reviewed Sep 5, 2024

Objects/unicodeobject.c (outdated, resolved)

Fix test_collections: implement UserString.__release_buffer__()

93d4470

vstinner requested a review from rhettinger as a code owner September 5, 2024 18:34

rhettinger removed their request for review September 5, 2024 20:51

vstinner mentioned this pull request Sep 6, 2024

gh-119609: Add PyUnicode_Export() and PyUnicode_Import() functions #119610

Closed

Add format parameter to PyUnicode_Export()

17ad7b9

mdboom reviewed Sep 9, 2024

vstinner added 2 commits September 10, 2024 08:45

Fix memory leak in unicode_releasebuffer()

78a70fa

UCS2 can also copy the buffer.

Remove PyUnicode_GetBufferFormat() documentation

79207f5

encukou reviewed Sep 10, 2024

Doc/c-api/unicode.rst (outdated, resolved)

Doc/c-api/unicode.rst (outdated, resolved)

Doc/c-api/unicode.rst (outdated, resolved)

Doc/c-api/unicode.rst (resolved)

vstinner and others added 4 commits September 10, 2024 15:42

Apply suggestions from code review

bc0fb69

Co-authored-by: Petr Viktorin <encukou@gmail.com>

Set format to 0 on error

2cdbc27

Remove trailing space

b5be22d

Change constant values

2960b25

encukou reviewed Sep 11, 2024

Doc/c-api/unicode.rst (outdated, resolved)

vstinner added 3 commits September 11, 2024 12:03

Update constants value in the doc

bcb41f3

Remove unicode_releasebuffer(); use bytes instead

44cb702

PyUnicode_Export() returns the format

1809d8d

Use signed int32_t for the format.

serhiy-storchaka reviewed Sep 12, 2024

Objects/unicodeobject.c (resolved)

Objects/unicodeobject.c (outdated, resolved)

Objects/unicodeobject.c (outdated, resolved)

Doc/c-api/unicode.rst (outdated, resolved)

vstinner added 2 commits September 12, 2024 12:34

Fix PyUnicode_Export() signature in doc

6707ef4

Use _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32()

abf5c58

Use signed int in C tests

033fc07

vstinner added 2 commits September 12, 2024 15:41

Update stable_abi: remove PyUnicode_GetBufferFormat()

078dfcf

Revert "Use _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32()"

79c6d01

This reverts commit abf5c58.

Allow surrogate characters in UTF-8

5479ab2

serhiy-storchaka reviewed Sep 12, 2024

Objects/unicodeobject.c (outdated, resolved)

vstinner added 3 commits September 14, 2024 00:04

Merge branch 'main' into unicode_view

ab2f9b0

Avoid a second copy in the UTF-8 export

f71f230

UCS-4 export: remove one memory copy

492f10a
