GH-115802: Optimize JIT stencils for size (#136393)
Conversation
f"-I{CPYTHON / 'Python'}",
f"-I{CPYTHON / 'Tools' / 'jit'}",
"-O3",
# -O2 and -O3 include some optimizations that make sense for
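The change under discussion swaps the optimization level passed to the stencil compiler. As a rough, hypothetical sketch (the function name and the extra flag are illustrative assumptions, not CPython's actual build code), the choice might look like:

```python
# Hypothetical sketch of picking an optimization level for JIT stencils.
# The function name and -fno-stack-protector are illustrative assumptions;
# they are not taken from CPython's real Tools/jit code.
def stencil_cflags(optimize_for_size: bool = True) -> list[str]:
    level = "-Os" if optimize_for_size else "-O3"
    return [level, "-fno-stack-protector"]
```

With `optimize_for_size=True` this yields the PR's choice of `-Os`; passing `False` reproduces the old `-O3` behavior.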
Did you investigate -Oz as well? The clang docs are fairly vague, but they say it reduces code size even further, so I'm curious if it's worth investigating as well.
Nice idea! I'm definitely down to try benchmarking it after this lands.
I suspect it may be quite a bit slower, though. My understanding is that -Os does all of the meaningful performance optimizations except those that increase size, while -Oz will actually hurt performance in pursuit of the smallest possible machine code. Our goal is to be fast, of course, but in this particular case -Os is also just giving us better code (as a side effect of not aligning jumps, not duplicating tails, etc.). So smaller isn't always better.
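One of the size costs mentioned above, jump-target alignment, is easy to quantify. A toy sketch (not CPython code; the 16-byte alignment is an assumption about what an -O2/-O3 build might choose):

```python
def align_up(offset: int, alignment: int = 16) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment

# A branch target at byte 0x22 would need 14 bytes of nop padding
# to land on a 16-byte boundary; -Os simply skips that padding.
padding = align_up(0x22) - 0x22
assert padding == 14
```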
Yeah, I'm not sure this is going to be a win. It basically turns off inlining for functions called more than once. For instance, _POP_TWO turns from this on -Os:
// 0000000000000000 <_JIT_ENTRY>:
// 0: 50 pushq %rax
// 1: 49 8d 45 f8 leaq -0x8(%r13), %rax
// 5: 49 8b 5d f0 movq -0x10(%r13), %rbx
// 9: 49 8b 7d f8 movq -0x8(%r13), %rdi
// d: 49 89 44 24 40 movq %rax, 0x40(%r12)
// 12: 40 f6 c7 01 testb $0x1, %dil
// 16: 75 0a jne 0x22 <_JIT_ENTRY+0x22>
// 18: ff 0f decl (%rdi)
// 1a: 75 06 jne 0x22 <_JIT_ENTRY+0x22>
// 1c: ff 15 00 00 00 00 callq *(%rip) # 0x22 <_JIT_ENTRY+0x22>
// 000000000000001e: R_X86_64_GOTPCRELX _Py_Dealloc-0x4
// 22: 49 83 44 24 40 f8 addq $-0x8, 0x40(%r12)
// 28: f6 c3 01 testb $0x1, %bl
// 2b: 75 0d jne 0x3a <_JIT_ENTRY+0x3a>
// 2d: ff 0b decl (%rbx)
// 2f: 75 09 jne 0x3a <_JIT_ENTRY+0x3a>
// 31: 48 89 df movq %rbx, %rdi
// 34: ff 15 00 00 00 00 callq *(%rip) # 0x3a <_JIT_ENTRY+0x3a>
// 0000000000000036: R_X86_64_GOTPCRELX _Py_Dealloc-0x4
// 3a: 4d 8b 6c 24 40 movq 0x40(%r12), %r13
// 3f: 58 popq %rax
Into this on -Oz (outlining PyStackRef_CLOSE makes it 2 bytes shorter, but adds up to three additional jumps):
// 0000000000000000 <_JIT_ENTRY>:
// 0: 50 pushq %rax
// 1: 49 8d 45 f8 leaq -0x8(%r13), %rax
// 5: 49 8b 5d f0 movq -0x10(%r13), %rbx
// 9: 49 8b 7d f8 movq -0x8(%r13), %rdi
// d: 49 89 44 24 40 movq %rax, 0x40(%r12)
// 12: e8 16 00 00 00 callq 0x2d <PyStackRef_CLOSE>
// 17: 49 83 44 24 40 f8 addq $-0x8, 0x40(%r12)
// 1d: 48 89 df movq %rbx, %rdi
// 20: e8 08 00 00 00 callq 0x2d <PyStackRef_CLOSE>
// 25: 4d 8b 6c 24 40 movq 0x40(%r12), %r13
// 2a: 58 popq %rax
// 2b: eb 11 jmp 0x3e <_JIT_CONTINUE>
//
// 000000000000002d <PyStackRef_CLOSE>:
// 2d: 40 f6 c7 01 testb $0x1, %dil
// 31: 75 04 jne 0x37 <PyStackRef_CLOSE+0xa>
// 33: ff 0f decl (%rdi)
// 35: 74 01 je 0x38 <PyStackRef_CLOSE+0xb>
// 37: c3 retq
// 38: ff 25 00 00 00 00 jmpq *(%rip) # 0x3e <_JIT_CONTINUE>
// 000000000000003a: R_X86_64_GOTPCRELX _Py_Dealloc-0x4
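The two-byte difference noted above can be checked directly from the listings: each body ends at the offset of its final instruction plus that instruction's encoded length.

```python
# Sizes read off the two disassembly listings above.
os_size = 0x3f + 1   # -Os: last instruction is the 1-byte popq at 0x3f
oz_size = 0x38 + 6   # -Oz: last instruction is the 6-byte jmpq at 0x38
assert os_size - oz_size == 2   # matches "2 bytes shorter"
```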
I'll still try benchmarking it though. But I'll land this PR in the meantime since it's just a one-character change.
Yep, -Oz is about 1-2% slower across the board.
As the new comment says, upon manual review of -O3, -O2, and -Os, it seems that -Os generates the best code for the JIT's use-case. Perf impact is close to noise, but slightly positive on x86-64 Linux and AArch64 macOS, neutral on AArch64 Linux, and slightly negative on x86-64 Windows. According to the stats, the size of JIT code is down by about 1-2%: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20250628-3.15.0a0-33054dd-JIT/README.md

Here's an example of how skipping tail-duplication removes an extra jump and a duplicate instruction from _POP_TOP (also reducing its size by 19%).

Full diff for the stencils here:
https://gist.github.com/brandtbucher/7340be56f2d2cf7061b5c9bf1c87939c