gh-126505: Do not use Unicode case folding in ASCII regexes #126544
Closed
jirkamarsik wants to merge 1 commit into python:main from
When an ASCII regex would use a character range that exceeds the bounds of the basic multilingual plane, it would be compiled into an opcode that performs Unicode case folding. Now, only Unicode regexes can use the Unicode-specific case folding opcode.
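The behavior described above can be probed with a short script. This is only a sketch: the Deseret range is an illustrative choice of non-BMP cased characters and is not taken from the PR's test. The ASCII-mode result depends on whether the interpreter in use already contains a fix, so it is printed rather than asserted.

```python
import re

# Deseret capital letters U+10400..U+10427 lie outside the BMP;
# U+10428 is the lowercase counterpart of U+10400.
# (Illustrative choice of characters, not taken from the PR's test.)
PATTERN = "[\U00010400-\U00010427]"
LOWERCASE_CHAR = "\U00010428"

# Unicode mode (the default for str patterns): the lowercase letter is
# expected to match the uppercase range case-insensitively.
unicode_match = re.fullmatch(PATTERN, LOWERCASE_CHAR, re.IGNORECASE)

# ASCII mode: only ASCII letters should be case-folded, so a match here
# would indicate that Unicode case folding leaked into ASCII mode.
ascii_match = re.fullmatch(PATTERN, LOWERCASE_CHAR, re.IGNORECASE | re.ASCII)

print("Unicode mode matches:", unicode_match is not None)
print("ASCII mode matches:  ", ascii_match is not None)
```

If the second line prints `True`, the interpreter applies Unicode case folding to the non-BMP range even though the ASCII flag is set.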
The following commit authors need to sign the Contributor License Agreement:
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the
ZeroIntensity (Member) left a comment
Thanks for contributing! This needs a NEWS entry, as it's a user-facing bug, and you'll also need to sign the CLA.
Comment on lines +2630 to +2631
    # gh-126505
    # should match in Unicode mode
Member
Suggested change
- # gh-126505
- # should match in Unicode mode
+ # GH-126505: should match in Unicode mode
Member
Author
Closing this Pull Request in favor of @serhiy-storchaka's upcoming fix.
When a pattern is being compiled in `_compiler.py`'s `optimize_charset`, the `RANGE` opcode is translated into the `RANGE_UNI_IGNORE` opcode. This should be done only in regexes which set the Unicode flag; otherwise we get Unicode case folding behavior in regexes which set the ASCII or Locale mode flags.

The correct way to check for Unicode mode in `optimize_charset` would be to check `if fixes:`, because the `fixes` argument is `None` in ASCII and Locale modes and a `dict` in Unicode mode. The code currently uses the condition `if fixup:`, but `fixup` is `None` only in Locale mode and is a function in both ASCII and Unicode modes. This means that the replacement is used in ASCII mode too, and the `RANGE` opcode is translated to a `RANGE_UNI_IGNORE` opcode for character sets which include characters outside of the basic multilingual plane (the second time an `IndexError` is raised in `optimize_charset`).
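The difference between the two conditions can be laid out as a small truth table. The sketch below is a simplified model, not the code in `_compiler.py`: `ascii_lower`, `unicode_lower`, `MODE_ARGS`, and the single entry in the `fixes` dict are hypothetical stand-ins that only reproduce the `None`/function/`dict` pattern described above.

```python
def ascii_lower(code):
    """Stand-in for the ASCII-only lowercasing function (hypothetical)."""
    return code + 32 if ord("A") <= code <= ord("Z") else code


def unicode_lower(code):
    """Rough stand-in for the Unicode lowercasing function (hypothetical)."""
    return ord(chr(code).lower()[0])


# (fixup, fixes) per mode, following the description above:
# fixup is None only in Locale mode; fixes is a dict only in Unicode mode.
# The dict entry is a placeholder, not the real case-fix table.
MODE_ARGS = {
    "LOCALE": (None, None),
    "ASCII": (ascii_lower, None),
    "UNICODE": (unicode_lower, {0x69: (0x131,)}),
}


def translates_with_fixup_check(fixup, fixes):
    """Condition described as currently used: `if fixup:`."""
    return bool(fixup)


def translates_with_fixes_check(fixup, fixes):
    """Condition described as correct: `if fixes:`."""
    return bool(fixes)


for mode, (fixup, fixes) in MODE_ARGS.items():
    print(
        f"{mode:8} if fixup: {translates_with_fixup_check(fixup, fixes)!s:6}"
        f" if fixes: {translates_with_fixes_check(fixup, fixes)}"
    )
```

In this model only the ASCII row differs between the two checks, which is the case the description identifies as the bug.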