Request for property support in Python re lib #56943
Comments
|
Python supports no Unicode properties in its re library, making it unsuitable for work with Unicode. This is therefore a formal request for the Python re library to support Unicode properties. The eleven properties required by Unicode Technical Report #18's RL1.2 are the bare minimum which must be added to make it possible to use Python regular expressions on Unicode. The proposed RL2.7 on Full Properties is even better. That is found at http://unicode.org/reports/tr18/proposed.html#Full_Properties although by the time you read this, it will have been made an official part of tr18. Matthew Barnett's replacement library for re, called regex, supports 67 Unicode properties at last count, including the strongly recommended loose matching. The standard re library needs to be spiffed up to make it suitable for Unicode processing; it is not currently usable for that due to this missing functionality. I quote from the Level 1 conformance requirement of tr18: pass RL1.1 Hex Notation (withdrawn) RL2.1 Canonical Equivalents I won’t even talk about Level 3. ICU, Perl, and Java7 all meet Level One conformance requirements, with several Level 2 requirements also met. It is important for Python to meet the Unicode Standard in this so that people can use Python for regex matching on Unicode text. They currently cannot usefully do so per the requirements of tr18. |
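To make the gap concrete, here is a minimal sketch (not part of the original report). It checks whether the stdlib re module accepts a `\p{Lu}` property escape, shows a workaround built on `unicodedata.category`, and, only if the third-party regex module happens to be installed, runs the same property match there. The helper names `stdlib_re_accepts` and `chars_in_category` and the sample string are made up for illustration; whether re rejects `\p{...}` depends on the Python version in use.

```python
import re
import unicodedata


def stdlib_re_accepts(pattern):
    """Return True if the stdlib re module compiles the pattern without error."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


def chars_in_category(text, prefix):
    """Workaround: filter characters by General_Category prefix (e.g. 'Lu', 'L')."""
    return [ch for ch in text if unicodedata.category(ch).startswith(prefix)]


sample = "Zürich ΑΒΓ abc 123"  # illustrative sample text

# Reports whether this interpreter's re understands \p{Lu}.
print("re accepts \\p{Lu}:", stdlib_re_accepts(r"\p{Lu}"))

# Uppercase letters found without any regex property support.
print(chars_in_category(sample, "Lu"))

# Optional: the third-party regex module supports \p{...} directly.
try:
    import regex  # assumption: may not be installed
except ImportError:
    regex = None

if regex is not None:
    print(regex.findall(r"\p{Lu}", sample))
```

The `unicodedata` workaround only covers General_Category; the other properties RL1.2 asks for (scripts, for example) have no equivalent in the stdlib module, which is the point of the request.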
|
I think the only way re is going to get "spiffed up" is by replacing it with Matthew's library. This is a goal, but I'm not sure where exactly we are in the process. The more Matthew's code gets tested (especially for compatibility with the current re API), the closer we will be to that goal. |
|
I've been doing a lot of testing of Matthew's regex library against UTS#18 issues, but only somewhat incidentally testing re. To use regex, one has to accept that certain things will work differently than they work in re, because he is following Unicode definitions for things like casefolding. But I doubt that is the sort of difference you are talking about. One of the things that Java, Go, and Perl all do is run regression tests against the whole Unicode Character Database to make sure nothing gets hosed, missed, or otherwise out of sync. That might be a sort of regression test you would like to add. |
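A regression test of that kind could look roughly like the sketch below. It walks every code point and compares what a pattern matches against an independent predicate derived from the character database. The function name `find_mismatches` and the choice of `\d` versus General_Category=Nd are illustrative assumptions, not something taken from the Java, Go, or Perl test suites.

```python
import re
import sys
import unicodedata


def find_mismatches(pattern, predicate):
    """Return the code points where `pattern` and `predicate` disagree.

    `pattern` is matched against each single character with fullmatch;
    `predicate` is any callable taking a one-character string.
    """
    compiled = re.compile(pattern)
    mismatches = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        matched = compiled.fullmatch(ch) is not None
        if matched != bool(predicate(ch)):
            mismatches.append(cp)
    return mismatches


def is_decimal_digit(ch):
    """Reference predicate: General_Category is Nd (decimal number)."""
    return unicodedata.category(ch) == "Nd"


# Compare re's \d against the character database for every code point.
bad = find_mismatches(r"\d", is_decimal_digit)
print(f"{len(bad)} code points where \\d disagrees with General_Category=Nd")
```

The same loop works for any engine that exposes a compile/fullmatch interface, so it could be pointed at the regex module and at property escapes once they are available.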
|
This indeed should be "fixed" by replacing 're' with 'regex'. So I would suggest focusing your tests on 'regex' and reporting them there, so that possible bugs get fixed and tested before we include the module in the stdlib. |
|
Sorry I didn't include a test case. Hope this makes up for it. If not, please tell me how to write better test cases. :( Yeah ok, so I'm a bit persnickety or even unorthodox about my vertical alignment, but it really helps to make what is different from one line to the next stand out if the parts that are the same from line to line are at the same column every time. |
|
Oh whoops, that was the wrong ticket. Shall I reupload to the right number? |
|
+1 on adding the feature to 3.3 in whichever way makes sense. |

