gh-120397: improve the speed of str.count, bytes.count et al. for single characters by about 2x.#120398
gh-120397: improve the speed of str.count, bytes.count et al. for single characters by about 2x.#120398vstinner merged 16 commits intopython:mainfrom
Conversation
There was a problem hiding this comment.
- What are the performances where you don't have the character in the sequence?
- What are the performances with larger inputs? can you generate inputs of size 10k and with a lot of occurrences of the character, no occurrence at all, and sparse occurrences?
- The timings that you report are for 4 statements. It would be better if we had single-case benchmarkings (where you perform only one
countcall).
Objects/stringlib/fastsearch.h
Outdated
| while (cursor < end_ptr) { | ||
| if (*cursor == p0) { | ||
| count += 1; | ||
| } | ||
| cursor += 1; | ||
| } |
There was a problem hiding this comment.
Do you need a while loop or can you live with a for-loop here?
Objects/stringlib/fastsearch.h
Outdated
| /* By unrolling in chunks of 32, the compiler can auto vectorize, resulting | ||
| in much better performance. */ | ||
| while (cursor < unroll_end_ptr) { | ||
| for(size_t i=0; i<32; i++) { |
There was a problem hiding this comment.
| for(size_t i=0; i<32; i++) { | |
| for(size_t i = 0; i < 32; i++) { |
Let us keep the same style as before.
Objects/stringlib/fastsearch.h
Outdated
| const STRINGLIB_CHAR *restrict cursor = s; | ||
| const STRINGLIB_CHAR *end_ptr = s + n; | ||
| const STRINGLIB_CHAR *unroll_end_ptr = end_ptr - 31; | ||
| /* By unrolling in chunks of 32, the compiler can auto vectorize, resulting |
There was a problem hiding this comment.
Is 32 optimal for any supported architecture? or is it possible to use 64-bit chunks for 64-bit architecture?
There was a problem hiding this comment.
These are byte-chunks, not bitchunks. ARM64 and x86-64 have 16-byte (128-bit) vectors by default. Clang and GCC are able to use these properly. On other architectures the loop is simply unrolled.
MSVC does not unroll the loop however I see. That might have an impact on performance. I should test that.
There was a problem hiding this comment.
These are byte-chunks, not bitchunks
Oh yes, sorry (well, my question remains the same actually).
There was a problem hiding this comment.
On other architectures the loop is unrolled. Meaning it is going to be 32 compare and add instructions in a row.
This should be more optimal than looping, as the CPU can go on for a while until it hits a jump. Although the assembly does not look very elegant. Of course the performance impact of this can only be evaluated on these architectures, but I suspect it will be minimal.
There was a problem hiding this comment.
Actually, I was wondering whether we could have a macro defining the correct constant to use depending on the architecture. That way, it could be more or less optimized per architecture. But if we do not already have that information, let's keep your 32.
There was a problem hiding this comment.
That's a good idea. Let's see what happens when I get to the windows benchmarking. MSVC does not unroll the inner loop at all, so the performance is potentially going to be very poor. I was also thinking of enclosing the unrolled loop in compile guards and only allow it for architectures that are known to perform better this way. Anyway, I will get to that after some benchmarks in the coming days.
There was a problem hiding this comment.
Alternatively, for MSVC, you could unroll the loop manually actually. While I may understand that it's maybe an overkill, it might be worthwhile I'd say.
Objects/stringlib/fastsearch.h
Outdated
| while (cursor < unroll_end_ptr) { | ||
| for(size_t i=0; i<32; i++) { | ||
| if (cursor[i] == p0) { | ||
| count += 1; |
There was a problem hiding this comment.
If you put the count >= maxcount here, does the runtime increases a lot or not?
There was a problem hiding this comment.
Yes. Then this PR makes no sense any more.
Because the current code tells the compiler that it:
- It can read the next 32 bytes as these are all valid memory
- Only counting of the character needs to be performed
If it needs to abort reading when the count == maxcount is made, it cannot use vectors to do the reading as 32 byte reads are not guaranteed.
So instead count is allowed to overshoot and a count >= maxcount check is placed outside the loop. This will mean that the function will read at most 31 bytes too much.
|
Thank you for your very insightful comments @picnixz ! You are right this needs extensive benchmarks for all possible use cases. I will get back to this another day as I have also other tasks to attend to. |
|
Hmm. I did some further investigation of the code. Compilers can also optimize without all the hints, provided the maxcount check is not done. |
|
There we go. Before: After: @picnixz So sorry for letting you review all the unrolled code. Turns out a simple copy paste and special casing was enough 😅 . I hope I did not waste your time. EDIT: On the upside, an evaluation of all possible platforms is not needed! This code should never run slower on any platform. |
Objects/stringlib/fastsearch.h
Outdated
|
|
||
|
|
||
| static inline Py_ssize_t | ||
| STRINGLIB(count_char_no_maximum)(const STRINGLIB_CHAR *s, Py_ssize_t n, |
There was a problem hiding this comment.
Maybe make that function private (I don't think it should be exposed except in this module).
There was a problem hiding this comment.
It is private, as it is static. The STRINGLIB macro is to prevent name clobbering. This function will be generated for STRINGLIB_CHAR==Py_UCS1, Py_UCS2 and PyUCS4. I think keeping it this way is correct. But I may be wrong of course. How do you suggest making it private?
There was a problem hiding this comment.
Oh just by adding an underscore before its name (I should have been clearer when I said "private", I meant it in the naming but I think we don't care about underscores in C files).
There was a problem hiding this comment.
No, we don't use underscore prefix in Python for static functions. Moreover, the macro adds a prefix such as ucs1lib_.
Misc/NEWS.d/next/Core and Builtins/2024-06-12-13-47-25.gh-issue-120397.n-I_cc.rst
Outdated
Show resolved
Hide resolved
…e-120397.n-I_cc.rst Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Objects/stringlib/fastsearch.h
Outdated
|
|
||
|
|
||
| static inline Py_ssize_t | ||
| STRINGLIB(count_char_no_maximum)(const STRINGLIB_CHAR *s, Py_ssize_t n, |
There was a problem hiding this comment.
No, we don't use underscore prefix in Python for static functions. Moreover, the macro adds a prefix such as ucs1lib_.
Misc/NEWS.d/next/Core and Builtins/2024-06-12-13-47-25.gh-issue-120397.n-I_cc.rst
Outdated
Show resolved
Hide resolved
|
cc @serhiy-storchaka @pitrou: This change looks very promising. |
Misc/NEWS.d/next/Core and Builtins/2024-06-12-13-47-25.gh-issue-120397.n-I_cc.rst
Outdated
Show resolved
Hide resolved
251f77b to
ce9ab9b
Compare
Co-authored-by: Nice Zombies <nineteendo19d0@gmail.com>
Head branch was pushed to by a user without write access
| else if (mode == FAST_RSEARCH) | ||
| return STRINGLIB(rfind_char)(s, n, p[0]); | ||
| else { | ||
| if (maxcount == PY_SSIZE_T_MAX) { |
There was a problem hiding this comment.
| if (maxcount == PY_SSIZE_T_MAX) { | |
| if (maxcount >= n) { |
There was a problem hiding this comment.
Maxcount is only used in the replace function, it is very unlikely that this condition will ever be triggered.
There was a problem hiding this comment.
OK, but there's no reason to check for PY_SSIZE_T_MAX specifically when this works as well.
There was a problem hiding this comment.
Yes you are correct. However, this function needs some refactoring, as this maxcount provision is only there for replace. Replace for single characters is special.cased elsewhere, so maxcount is actually always Pyssize_t_max I think. I want to revisit this at a later point.
|
@vstinner, so I had to botch the automerge. I made a few mistakes when implementing the suggestions and just after the push my attention was required elsewhere. All tests pass now. |
|
Merged, thank you. |
|
Thank you for the review and the merging. It was a pleasant process. I think I will make more of these "making CPython faster, one function at a time" PRs. If it is preferred that I bundle these, please let me know. |
|
I prefer to do it one function per change, as you can see it's already complicated to change a single function. |
Benchmarks using:
This is testing a real use case where the GC content of a DNA sequence is calculated. Other possible usages are counting newlines.
Before:
After: