gh-126024: optimize UTF-8 decoder for short non-ASCII string #126025
Merged
methane merged 18 commits into python:main on Nov 29, 2024
5344340 to 9b47c2b
rruuaanng reviewed Oct 27, 2024
picnixz reviewed Oct 27, 2024
Member, Author:
orjson's implementation is still faster.
Member, Author:
Comparing to DuckDB's decoder.
When benchmarking short ASCII strings, performance is unstable because unicode_dealloc is slower than decoding. The speed varies depending on where the object is allocated.
methane commented Oct 28, 2024
800452a to b0ce85c
b0ce85c to 37715b6
This reverts commit c47d574.
Member, Author:
This is the tree I used for the microbenchmarks.
Member, Author:
orjson's benchmark_load result: comparing 0003 vs 0004 vs 0005 on the twitter.json benchmark, this PR improves PyUnicode_FromStringAndSize from 19% slower to 12% slower.
72ed21d to 96c7b19
Mytherin added a commit to duckdb/duckdb that referenced this pull request Nov 19, 2024
DuckDB introduced an optimization for its UTF-8 decoder. It is up to 40% faster in the short non-ASCII case, but it is 4x slower in the long ASCII case.

Python has optimized code to decode ASCII, so decoding UTF-8 that contains long ASCII parts is faster than UTF8Proc::UTF8ToCodepoint. I am also optimizing short non-ASCII handling in CPython.

ref: python/cpython#126025 (comment)

## Background

* Using the PEP 393 based API in third-party code depends heavily on current CPython internals, which makes it difficult to evolve those internals (e.g. to use UTF-8 as the internal representation of Unicode).
* Using PEP 393 slows down Python implementations other than CPython that use UTF-8 string representations, e.g. PyPy.
* PyUnicode_FromStringAndSize is part of the Stable ABI. Moving from non-Stable ABI to Stable ABI makes it possible to build Python modules that work with several Python versions.
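The two cases contrasted above (short non-ASCII input vs. long ASCII input) can be probed from pure Python with a rough timing sketch like the one below. This is not the benchmark used in this PR or in DuckDB. The helper name `time_decode`, the sample strings, and the iteration counts are all assumptions chosen for illustration, and absolute numbers will vary by build and machine.

```python
import timeit

# Hypothetical sample inputs, chosen only for illustration.
SHORT_NON_ASCII = "こんにちは".encode("utf-8")  # 5 code points, 15 bytes
LONG_ASCII = b"a" * 10_000                      # pure ASCII, 10,000 bytes


def time_decode(data: bytes, number: int = 100_000, repeat: int = 5) -> float:
    """Return the best observed per-call time (seconds) of data.decode('utf-8').

    The lambda adds call overhead, so treat the result as a relative
    indicator between interpreter builds, not as an absolute decoder cost.
    """
    timings = timeit.repeat(lambda: data.decode("utf-8"), number=number, repeat=repeat)
    return min(timings) / number


def main() -> None:
    for label, data in (("short non-ASCII", SHORT_NON_ASCII), ("long ASCII", LONG_ASCII)):
        per_call = time_decode(data, number=20_000)
        print(f"{label:>16}: {per_call * 1e9:8.1f} ns per decode ({len(data)} bytes)")


if __name__ == "__main__":
    main()
```

Running the same script on interpreters built before and after a decoder change gives a coarse comparison. Dedicated tools such as pyperf give more stable results than this sketch.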
picnixz reviewed Nov 29, 2024
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Jan 8, 2025
ebonnal pushed a commit to ebonnal/cpython that referenced this pull request Jan 12, 2025
This optimization works only for the strict error handler, because other error handlers may remove or replace invalid UTF-8 sequences.
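A small illustration of why the error handler matters: with any handler other than strict, invalid bytes can be dropped or substituted, so the decoded result no longer corresponds one-to-one to the input. The sample bytes and the helper name `decode_with_handlers` are assumptions made for this sketch. It only demonstrates the observable behaviour of `bytes.decode`, not the C implementation in this PR.

```python
# b"\xff" can never appear in valid UTF-8, so it triggers the error handler.
INVALID = b"caf\xc3\xa9 \xff ok"


def decode_with_handlers(data: bytes) -> dict:
    """Decode data with several error handlers.

    Maps each handler name to the decoded string, or to the
    UnicodeDecodeError instance if the handler raised.
    """
    results = {}
    for handler in ("strict", "ignore", "replace", "backslashreplace"):
        try:
            results[handler] = data.decode("utf-8", errors=handler)
        except UnicodeDecodeError as exc:
            results[handler] = exc
    return results


if __name__ == "__main__":
    for handler, outcome in decode_with_handlers(INVALID).items():
        print(f"{handler:>16}: {outcome!r}")
```

Here strict raises UnicodeDecodeError, ignore drops the invalid byte, replace substitutes U+FFFD, and backslashreplace emits the literal text `\xff`, so the resulting strings differ in both length and content.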
Benchmark
code
Result (with --enable-optimizations --with-lto):