The Wayback Machine - https://web.archive.org/web/20250108012905/https://github.com/python/cpython/issues/128563

A new tail-calling interpreter for significantly better interpreter performance #128563

Open
Fidget-Spinner opened this issue Jan 6, 2025 · 16 comments
Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), type-feature (A feature request or enhancement)

Comments

Fidget-Spinner (Member) commented Jan 6, 2025

Feature or enhancement

Proposal

Experimental branch: main...Fidget-Spinner:cpython:tail-call

Prior discussion at: faster-cpython/ideas#642

I propose adding a tail-calling interpreter to CPython for significantly better performance on compilers that support it.

This idea is not new, and has been implemented by:

  1. Protobuf https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html
  2. Lua (Deegen) https://sillycross.github.io/2022/11/22/2022-11-22/

CPython currently has a few interpreters:

  1. A switch-case interpreter (MSVC)
  2. A computed-goto interpreter (Clang, GCC)
  3. A uop interpreter (all compilers)

The tail-calling interpreter will be the 4th that coexists with the rest. This means no compatibility concerns.

Performance

My preliminary benchmarks suggest excellent performance improvements: a 10% geometric-mean speedup on pyperformance, with up to 40% on Python-heavy benchmarks: https://gist.github.com/Fidget-Spinner/497c664eef389622d146d632990b0d21. The benchmarks compared main against my branch, both built with clang-19, ThinLTO, and PGO, on AMD64 Ubuntu 22.04. Based on my testing, PGO seems especially crucial for the speedups. For those outside of CPython development: a 10% speedup is roughly two minor CPython releases' worth of improvements. CPython 3.12, for example, sped up by roughly 5%.

The speedup is so significant that if accepted, the new interpreter will be faster than the current JIT compiler.

Drawbacks

  1. Maintainability (this will introduce more code)
  2. Portability

I will address maintainability by using the interpreter generator that was introduced in CPython 3.12. The generator lets us automatically produce most of the infrastructure needed for this change. Preliminary estimates suggest the new generator will be only around 200 lines of Python code, most of it conceptually shared with the other generators.

Portability should resolve itself over time (see the next section).

Portability and Precedent

At the moment, this is only fully supported by clang-19 on AArch64 and AMD64, with partial support on clang-18 and gcc-next, where performance is likely to be poor. The reason is that we need both the __attribute__((musttail)) and __attribute__((preserve_none)) attributes for good performance; GCC only has gnu::musttail, not preserve_none.

There is prior precedent for adding compiler-specific optimizations to CPython. See, for example, the original computed-goto issue by Antoine Pitrou: https://bugs.python.org/issue4753. At the time, computed gotos were a GCC-only feature, not available in Clang, but we added them anyway, and a few years later Clang introduced the feature too. The key point is that GCC will likely catch up and add these features eventually.

EDIT: Softened the claim of bad performance on GCC to "likely". Per https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328, I have not tested on GCC trunk, so the bad performance there is pure speculation on my part. I can try GCC after the PR lands and we can test it from there. However, testing clang with just musttail and no preserve_none, the performance was quite bad.

Implementation plan

  1. Refactor _PyEval_EvalFrameDefault to use function calls corresponding to its common existing labels. This will need careful benchmarking. E.g.

         error:
            ....

     becomes

         error:
            call_error_handler(...)

  2. Implement the rest of it. Add a configure option that auto-detects the feature in the configure script and turns it on only for compilers that support it.
  3. Mention it in What's New.

Worries about new bugs

Computed gotos are well-tested, so it is fair to worry that a new interpreter will introduce bugs.

I doubt logic bugs will be the primary concern, because we are using the interpreter generator: the base interpreter and the new one share common code, so if the new one has a logic bug, the base interpreter likely has it too.

The other concern is compiler bugs. To allay such fears: the GHC calling convention (the mechanism behind preserve_none) has been around for about 5 years, and musttail for almost 4 years.

cc @pitrou as the original implementer of computed gotos, and @markshannon

Future Use

Kumar Aditya pointed out this could be used in regex and pickle as well. Likewise, Neil Schemenauer pointed out marshal and pickle might benefit from this for faster Python startup.

Has this already been discussed elsewhere?

https://discuss.python.org/t/a-new-tail-calling-interpreter-for-significantly-better-interpreter-performance/76315

Links to previous discussion of this feature:

No response

@Fidget-Spinner Fidget-Spinner added type-feature A feature request or enhancement performance Performance or resource usage labels Jan 6, 2025
@Fidget-Spinner Fidget-Spinner self-assigned this Jan 6, 2025
@picnixz picnixz added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jan 6, 2025
pitrou (Member) commented Jan 7, 2025

Can you show what a typical tail-calling sequence looks like? Does it combine tail-calling with a computed goto as in this example from the protobuf blog post?

    MUSTTAIL return op_table[op](ARGS);

Fidget-Spinner (Member, Author) commented

Mark gives a pretty good example here faster-cpython/ideas#642 (comment)

pitrou (Member) commented Jan 7, 2025

Neat, thank you!

diegorusso (Contributor) commented

How much performance is attributed to __attribute__((preserve_none)) alone?

Fidget-Spinner (Member, Author) commented

@diegorusso I did some experiments on WASI and Emscripten (which do not support preserve_none). There was no speedup at all on those platforms (in fact, up to a 2x slowdown), so tail calls are a net negative without it.

WolframAlph (Contributor) commented

FYI from Clang docs

preserve_none’s ABI is still unstable, and may be changed in the future.

Fidget-Spinner (Member, Author) commented Jan 7, 2025

@WolframAlph I think that doesn't matter because we're only using this on _PyEval_EvalFrameDefault's opcode handlers which are not external-facing. This is an internal implementation detail.

diegorusso (Contributor) commented

Anyway, I pinged the GCC team at Arm, and a ticket has been created to implement preserve_none in GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328

diegorusso (Contributor) commented

@diegorusso I did some experiments on WASI and Emscripten (which do not support preserve_none). There was no speedup at all on those platforms (in fact, up to a 2x slowdown), so tail calls are a net negative without it.

This is what I was expecting. The tail call by itself is not enough (actually, I was expecting performance similar to computed goto); you need preserve_none (as for the JIT) to make the difference.

Have you tested it on AArch64 as well?

Fidget-Spinner (Member, Author) commented

Have you tested it on AArch64 as well?

Not yet. I want to test it on the Faster CPython build bot for macOS (it has clang-19, so it's a fair comparison), but I do not have access to it. If you could run some benchmarks for this I would be really grateful! For a quick-and-dirty check that it's working, try just running the pystones benchmark. I got a 25% speedup with tail calls plus LTO and PGO (make sure you enable those, because they contribute about half the perf win for some reason). https://gist.github.com/Fidget-Spinner/e7bf204bf605680b0fc1540fe3777acf

And pass CC=clang-19 and all that.

WolframAlph (Contributor) commented

@Fidget-Spinner If I understand correctly, the whole trick is:

  1. musttail guarantees a tail jump to the next subroutine (provided the signatures match), eliminating call overhead (prologue, epilogue, stack-space reservation, etc.).
  2. preserve_none puts the burden of preserving registers on the caller. Since every subsequent tail call targets a function with the preserve_none attribute, none of the opcode handlers spill registers (because we instructed so, and because nothing is live after a tail call); they simply forward the function arguments in registers through the whole chain of calls.

Do I get it right?

Fidget-Spinner (Member, Author) commented

@WolframAlph

Do I get it right?

Yeah. I also suspect most of the speedup comes from the current interpreter loop being too big to optimize properly, so all the pre-existing compilers do a poor job on it.

For example, PGO gives this roughly another 10% speedup over just -O3, and LTO roughly another 10% on top of PGO and -O3. If we compare fairly, PGO and LTO should optimize the old and new interpreters similarly, but I guess the new interpreter is easier to optimize, so it produces better-quality code.

WolframAlph (Contributor) commented

Also I suspect most of the speedup is there because the current interpreter loop is too big to optimize properly, so all the pre-existing compilers perform not-so-well for it.

Makes sense. By splitting the cases into functions, the compiler can optimize each of them individually rather than one giant chunk, I assume. The same was mentioned in the protobuf article you linked.

corona10 (Member) commented Jan 7, 2025

Here are the results of cross-checking bm_pystones with @Fidget-Spinner on macOS AArch64. The only difference between the tail-calling build and the baseline is the dispatch technique; I pinned the compiler version and compiler optimization policies.

Baseline: 2228e92

Configure:

    CC=clang-19 ./configure --enable-optimizations --with-lto=thin

Result:

    ➜  cpython git:(2228e92da31) ✗ ./python.exe bm_pystones.py
    Pystone(1.1) time for 50000 passes = 0.0763571
    This machine benchmarks at 654818 pystones/second

Tail-calling: https://github.com/Fidget-Spinner/cpython/tree/tail-call

Configure:

    CC=clang-19 ./configure --enable-optimizations --with-lto=thin --enable-tail-call-interp

Result:

    ➜  cpython git:(tail-call) ✗ ./python.exe bm_pystones.py
    Pystone(1.1) time for 50000 passes = 0.0548847
    This machine benchmarks at 911001 pystones/second

cc @diegorusso

Fidget-Spinner (Member, Author) commented Jan 7, 2025

According to Donghee's comment above, Pystones is roughly 40% faster on macOS AArch64 (654818 → 911001 pystones/second) vs 25% faster on my Ubuntu AMD64 machine. I suspect it might be because AArch64 has more registers. However, who knows at this point :)?

Fidget-Spinner (Member, Author) commented Jan 7, 2025

Reading https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118328: I have not tested on GCC trunk, so the claim that performance is bad there is pure speculation on my part. I can try GCC after the PR lands and we can test it from there. However, testing clang with just musttail and no preserve_none, the performance was quite bad.


6 participants