gh-120397: improve the speed of str.count, bytes.count et al. for single characters by about 2x. by rhpvorderman · Pull Request #120398 · python/cpython

rhpvorderman · 2024-06-12T11:48:11Z

Issue: The count method on strings, bytes, bytearray etc. can be significantly faster #120397

Benchmarks using:

./python -m timeit -s "seq='TTTATGGTTATTTATATTTATTTATTTTTGAGATGGAGTTTTGCTCTTGCTGCCTAGGCTGGAGTGCAATGGCACGATCTCGGCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATT'" "seq.count('A'); seq.count('C'); seq.count('G');  seq.count('T')"

This tests a real use case: calculating the GC content of a DNA sequence. Another possible use is counting newlines.
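For readers unfamiliar with the use case, a minimal Python sketch of a GC-content calculation built on single-character `count` calls might look like this. The function names `gc_content` and `count_lines` are illustrative and do not come from the PR.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    if not seq:
        return 0.0
    # Each call is a single-character count, the case this PR targets.
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq)


def count_lines(text: str) -> int:
    """Count newline characters, the other use case mentioned above."""
    return text.count("\n")
```

Both helpers spend essentially all of their time inside `str.count` with a one-character needle, so they should benefit directly from any speedup of that path.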

Before:

500000 loops, best of 5: 461 nsec per loop

After:

1000000 loops, best of 5: 216 nsec per loop

picnixz

What is the performance when the character is not in the sequence?
What is the performance with larger inputs? Can you generate inputs of size 10k with many occurrences of the character, no occurrences at all, and sparse occurrences?
The timings you report are for 4 statements. It would be better to have single-case benchmarks (where you perform only one count call).
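One way to set up the benchmarks requested here is sketched below. It is only an assumption about what such a harness could look like: the names `make_inputs` and `bench`, the choice of `'A'` and `'T'` as characters, and the 1-in-100 density for the sparse case are all illustrative.

```python
import timeit

SIZE = 10_000  # input size suggested in the review


def make_inputs(size: int = SIZE) -> dict[str, str]:
    """Build inputs with dense, absent and sparse occurrences of 'A'."""
    return {
        "dense": "A" * size,                         # every character matches
        "none": "T" * size,                          # no character matches
        "sparse": ("T" * 99 + "A") * (size // 100),  # 1 match per 100 chars
    }


def bench(number: int = 1000) -> dict[str, float]:
    """Time a single seq.count('A') call per case, in nsec per call."""
    results = {}
    for name, seq in make_inputs().items():
        timings = timeit.repeat(
            "seq.count('A')", globals={"seq": seq}, number=number, repeat=5
        )
        results[name] = min(timings) / number * 1e9
    return results


if __name__ == "__main__":
    for name, nsec in bench().items():
        print(f"{name:>6}: {nsec:.1f} nsec per call")
```

Each case times exactly one `count` call per loop iteration, which addresses the single-statement concern, and the best of five repeats is reported in the same spirit as `python -m timeit`.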

picnixz · 2024-06-12T12:01:17Z

Objects/stringlib/fastsearch.h

+    while (cursor < end_ptr) {
+        if (*cursor == p0) {
+            count += 1;
+        }
+        cursor += 1;
+    }


Do you need a while loop or can you live with a for-loop here?

picnixz · 2024-06-12T12:01:38Z

Objects/stringlib/fastsearch.h

+    /* By unrolling in chunks of 32, the compiler can auto vectorize, resulting
+       in much better performance. */
+    while (cursor < unroll_end_ptr) {
+        for(size_t i=0; i<32; i++) {


Suggested change

-        for(size_t i=0; i<32; i++) {
+        for(size_t i = 0; i < 32; i++) {

Let us keep the same style as before.

picnixz · 2024-06-12T12:03:51Z

Objects/stringlib/fastsearch.h

+    const STRINGLIB_CHAR *restrict cursor = s;
+    const STRINGLIB_CHAR *end_ptr = s + n;
+    const STRINGLIB_CHAR *unroll_end_ptr = end_ptr - 31;
+    /* By unrolling in chunks of 32, the compiler can auto vectorize, resulting


Is 32 optimal for every supported architecture? Or is it possible to use 64-bit chunks on 64-bit architectures?

These are byte chunks, not bit chunks. ARM64 and x86-64 have 16-byte (128-bit) vectors by default. Clang and GCC are able to use these properly. On other architectures the loop is simply unrolled.

I see, however, that MSVC does not unroll the loop. That might have an impact on performance. I should test that.

> These are byte chunks, not bit chunks

Oh yes, sorry (well, my question remains the same actually).

On other architectures the loop is unrolled, meaning it becomes 32 compare-and-add instructions in a row.
This should be more efficient than looping, as the CPU can keep going for a while before it hits a jump, although the assembly does not look very elegant. Of course, the performance impact can only be evaluated on those architectures, but I suspect it will be minimal.

Actually, I was wondering whether we could have a macro defining the correct constant to use depending on the architecture. That way, it could be more or less optimized per architecture. But if we do not already have that information, let's keep your 32.

That's a good idea. Let's see what happens when I get to the Windows benchmarking. MSVC does not unroll the inner loop at all, so the performance is potentially going to be very poor. I was also thinking of enclosing the unrolled loop in compile guards and only allowing it for architectures that are known to perform better this way. Anyway, I will get to that after some benchmarks in the coming days.

Alternatively, for MSVC, you could unroll the loop manually. While I understand that it may be overkill, I'd say it might be worthwhile.

picnixz · 2024-06-12T12:04:31Z

Objects/stringlib/fastsearch.h

+    while (cursor < unroll_end_ptr) {
+        for(size_t i=0; i<32; i++) {
+            if (cursor[i] == p0) {
+                count += 1;


If you put the count >= maxcount check here, does the runtime increase a lot?

Yes. Then this PR makes no sense any more.

Because the current code tells the compiler two things:

- It can read the next 32 bytes, as these are all valid memory.
- Only counting of the character needs to be performed.

If the loop has to stop reading as soon as count == maxcount, the compiler cannot use vectors for the reads, because 32-byte reads are no longer guaranteed.
So instead count is allowed to overshoot, and a count >= maxcount check is placed outside the loop. This means the function will read at most 31 bytes more than strictly necessary.
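The trade-off described above can be modelled in a few lines. The following is a Python sketch of the logic only, not the C code from the PR: the function name `count_char_chunked`, the clamping of the return value to `maxcount`, and the byte-at-a-time tail loop are assumptions made for illustration.

```python
CHUNK = 32  # chunk size used in the unrolled loop under review


def count_char_chunked(s: bytes, ch: int, maxcount: int) -> int:
    """Model of the chunked loop: maxcount is checked once per chunk."""
    count = 0
    pos = 0
    n = len(s)
    # Full chunks: the inner loop has no early exit, so a C compiler is free
    # to load all 32 bytes at once and vectorize the comparisons.
    while pos + CHUNK <= n:
        for i in range(CHUNK):
            if s[pos + i] == ch:
                count += 1
        pos += CHUNK
        if count >= maxcount:  # checked outside the inner loop
            return maxcount    # count may have overshot, so clamp it
    # Tail: fewer than 32 bytes remain, so count them one at a time.
    while pos < n:
        if s[pos] == ch:
            count += 1
            if count >= maxcount:
                return maxcount
        pos += 1
    return count
```

In this model the per-chunk check fires only after a whole chunk has been scanned, which is where the "at most 31 bytes" of extra reading comes from.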

rhpvorderman · 2024-06-12T12:49:24Z

Thank you for your very insightful comments, @picnixz! You are right, this needs extensive benchmarks for all possible use cases. I will get back to this another day, as I also have other tasks to attend to.

rhpvorderman · 2024-06-12T13:32:14Z

Hmm. I did some further investigation of the code. Compilers can also optimize without all the hints, provided the maxcount check is not done.
It turns out the check is only needed when replacing characters. So the code can also be optimized by special-casing it. That would not require an extensive investigation of the performance impact, as the code remains mostly the same.

rhpvorderman · 2024-06-12T13:41:30Z

There we go. Before:

./python -m timeit -s "seq='TTTATGGTTATTTATATTTATTTATTTTTGAGATGGAGTTTTGCTCTTGCTGCCTAGGCTGGAGTGCAATGGCACGATCTCGGCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATT'" "seq.count('A')"
2000000 loops, best of 5: 103 nsec per loop

After:

./python -m timeit -s "seq='TTTATGGTTATTTATATTTATTTATTTTTGAGATGGAGTTTTGCTCTTGCTGCCTAGGCTGGAGTGCAATGGCACGATCTCGGCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATT'" "seq.count('A')"
5000000 loops, best of 5: 55.2 nsec per loop

@picnixz So sorry for letting you review all the unrolled code. Turns out a simple copy-paste and special-casing was enough 😅. I hope I did not waste your time.

EDIT: On the upside, an evaluation of all possible platforms is not needed! This code should never run slower on any platform.
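The special-casing idea can be illustrated with a small Python model. This is a sketch of the control flow only, under the assumption that the caller passes the maximum `Py_ssize_t` value (`sys.maxsize` in Python) when no limit applies. `count_char_no_maximum` is the name that appears in the diff; `count_char` and `count_single` are illustrative names.

```python
import sys


def count_char(s: bytes, ch: int, maxcount: int) -> int:
    """Loop with a limit: stops as soon as maxcount matches are found."""
    count = 0
    for byte in s:
        if byte == ch:
            count += 1
            if count >= maxcount:
                return maxcount
    return count


def count_char_no_maximum(s: bytes, ch: int) -> int:
    """Same loop without the maxcount check, so nothing interrupts it."""
    count = 0
    for byte in s:
        if byte == ch:
            count += 1
    return count


def count_single(s: bytes, ch: int, maxcount: int = sys.maxsize) -> int:
    """Dispatch: use the check-free loop when no real maximum was given."""
    if maxcount == sys.maxsize:
        return count_char_no_maximum(s, ch)
    return count_char(s, ch, maxcount)
```

Because the check-free loop has a single, predictable exit, a C compiler can optimize it on its own, which is consistent with the claim that no per-platform tuning is needed.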

picnixz · 2024-06-12T13:47:40Z

Objects/stringlib/fastsearch.h



+static inline Py_ssize_t
+STRINGLIB(count_char_no_maximum)(const STRINGLIB_CHAR *s, Py_ssize_t n,


Maybe make that function private (I don't think it should be exposed except in this module).

It is private, as it is static. The STRINGLIB macro is to prevent name clobbering. This function will be generated for STRINGLIB_CHAR == Py_UCS1, Py_UCS2 and Py_UCS4. I think keeping it this way is correct. But I may be wrong of course. How do you suggest making it private?

Oh just by adding an underscore before its name (I should have been clearer when I said "private", I meant it in the naming but I think we don't care about underscores in C files).

No, we don't use underscore prefix in Python for static functions. Moreover, the macro adds a prefix such as ucs1lib_.

vstinner · 2024-06-12T15:48:08Z

Objects/stringlib/fastsearch.h



+static inline Py_ssize_t
+STRINGLIB(count_char_no_maximum)(const STRINGLIB_CHAR *s, Py_ssize_t n,


No, we don't use underscore prefix in Python for static functions. Moreover, the macro adds a prefix such as ucs1lib_.

vstinner · 2024-06-12T15:55:49Z

cc @serhiy-storchaka @pitrou: This change looks very promising.

serhiy-storchaka

LGTM.

vstinner

LGTM

nineteendo · 2024-06-13T12:13:37Z

Objects/stringlib/fastsearch.h

        else if (mode == FAST_RSEARCH)
            return STRINGLIB(rfind_char)(s, n, p[0]);
        else {
+            if (maxcount == PY_SSIZE_T_MAX) {


Suggested change

-            if (maxcount == PY_SSIZE_T_MAX) {
+            if (maxcount >= n) {

Maxcount is only used in the replace function, it is very unlikely that this condition will ever be triggered.

OK, but there's no reason to check for PY_SSIZE_T_MAX specifically when this works as well.

Yes, you are correct. However, this function needs some refactoring, as this maxcount provision is only there for replace. Replace for single characters is special-cased elsewhere, so maxcount is actually always PY_SSIZE_T_MAX, I think. I want to revisit this at a later point.
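The reasoning behind the `maxcount >= n` suggestion is that a string of length `n` can contain at most `n` matches, so a cap of `n` or more can never take effect. A small Python check of that argument is sketched below. The helper names are illustrative, and the exhaustive test over short strings is a sanity check rather than a proof.

```python
import sys
from itertools import product


def clamped_count(s: str, ch: str, maxcount: int) -> int:
    """Reference semantics: number of matches, capped at maxcount."""
    return min(s.count(ch), maxcount)


def use_fast_path(n: int, maxcount: int) -> bool:
    """Suggested condition: the cap cannot matter once maxcount >= n."""
    return maxcount >= n


def check_equivalence(max_len: int = 6) -> bool:
    """Exhaustively confirm the fast path is safe for short 'AT' strings."""
    for n in range(max_len + 1):
        for chars in product("AT", repeat=n):
            s = "".join(chars)
            for maxcount in list(range(n + 3)) + [sys.maxsize]:
                if use_fast_path(n, maxcount):
                    # The uncapped count must equal the capped one.
                    if s.count("A") != clamped_count(s, "A", maxcount):
                        return False
    return True
```

The condition `maxcount == PY_SSIZE_T_MAX` is a special case of `maxcount >= n`, so the broader test covers every input the narrower one does.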

rhpvorderman · 2024-06-13T12:20:16Z

@vstinner, so I had to botch the automerge. I made a few mistakes when implementing the suggestions and just after the push my attention was required elsewhere. All tests pass now.

vstinner · 2024-06-13T14:29:08Z

Merged, thank you.

rhpvorderman · 2024-06-14T04:39:13Z

Thank you for the review and the merging. It was a pleasant process. I think I will make more of these "making CPython faster, one function at a time" PRs. If it is preferred that I bundle these, please let me know.

vstinner · 2024-06-14T08:09:43Z

I prefer to do it one function per change, as you can see it's already complicated to change a single function.

rhpvorderman added 2 commits June 12, 2024 13:31

Faster counting of characters due to autovectorization

b2f9fb5

Add blurb entry for faster count method

cb56443

bedevere-app bot mentioned this pull request Jun 12, 2024

The count method on strings, bytes, bytearray etc. can be significantly faster #120397

Closed

bedevere-app bot added the awaiting review label Jun 12, 2024

rhpvorderman changed the title ~~gh-120397:~~ gh-120397: improve the speed of str.count, bytes.count et al. for single characters. Jun 12, 2024

picnixz reviewed Jun 12, 2024

rhpvorderman added 3 commits June 12, 2024 14:28

Rewrite count function as a for loop

fff610b

Do the max count check in the inner loop

1359366

Remove extraneous newline

5837e41

Add an explanatory comment

c6f4068

rhpvorderman added 2 commits June 12, 2024 15:32

Revert changes

0eaccd5

Use a no maximum count function

a85202c

picnixz reviewed Jun 12, 2024

rhpvorderman and others added 2 commits June 12, 2024 16:24

Update Misc/NEWS.d/next/Core and Builtins/2024-06-12-13-47-25.gh-issu…

807706d

…e-120397.n-I_cc.rst Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

Formatting

fb83c6a

vstinner reviewed Jun 12, 2024

serhiy-storchaka approved these changes Jun 12, 2024

bedevere-app bot added awaiting merge and removed awaiting review labels Jun 12, 2024

rhpvorderman added 2 commits June 13, 2024 07:10

Give a speed indication

05a1fc2

Update comment

2fab99b

rhpvorderman changed the title ~~gh-120397: improve the speed of str.count, bytes.count et al. for single characters.~~ gh-120397: improve the speed of str.count, bytes.count et al. for single characters by about 2x. Jun 13, 2024

rename function

ce9ab9b

rhpvorderman force-pushed the fastercount branch from 251f77b to ce9ab9b Compare June 13, 2024 05:19

nineteendo reviewed Jun 13, 2024

Update Objects/stringlib/fastsearch.h

0cc9369

Co-authored-by: Nice Zombies <nineteendo19d0@gmail.com>

vstinner approved these changes Jun 13, 2024

vstinner enabled auto-merge (squash) June 13, 2024 09:40

nineteendo reviewed Jun 13, 2024

rhpvorderman added 2 commits June 13, 2024 14:06

Fix changing oversights

30a65a7

Merge branch 'fastercount' of github.com:rhpvorderman/cpython into fa…

a16c7ec

…stercount

auto-merge was automatically disabled June 13, 2024 12:06
Head branch was pushed to by a user without write access

nineteendo reviewed Jun 13, 2024

vstinner merged commit 2078eb4 into python:main Jun 13, 2024

bedevere-app bot removed the awaiting merge label Jun 13, 2024

mrahtz pushed a commit to mrahtz/cpython that referenced this pull request Jun 30, 2024

pythongh-120397: Optimize str.count() for single characters (python#1…

b3ffb45

…20398)

noahbkim pushed a commit to hudson-trading/cpython that referenced this pull request Jul 11, 2024

pythongh-120397: Optimize str.count() for single characters (python#1…

75d4b56

…20398)

estyxx pushed a commit to estyxx/cpython that referenced this pull request Jul 17, 2024

pythongh-120397: Optimize str.count() for single characters (python#1…

54789e1

…20398)

rhpvorderman mentioned this pull request Sep 11, 2024

gh-120196: Faster ascii_decode and find_max_char implementations #120212

Closed

rhpvorderman deleted the fastercount branch June 2, 2025 10:51
