Allow Linux perf profiler to see Python calls #96123

pablogsal · 2022-08-19T13:22:56Z

⚠️ ⚠️ Note for reviewers, hackers and fellow systems/low-level/compiler engineers ⚠️ ⚠️

If you have a lot of experience with this kind of shenanigans and want to improve the first version, please make a PR against my branch or reach out by email. I prefer not to have long discussions on the approach in the PR so we can keep it productive.

Result in perf:

Objects/codeobject.c

pablogsal · 2022-08-19T13:29:38Z

Python/ceval.c

@@ -4812,7 +4812,7 @@ _PyEval_EvalFrameDefault(PyThreadState *tstate, _PyInterpreterFrame *frame, int
            function = PEEK(total_args + 1);
            int positional_args = total_args - KWNAMES_LEN();
            // Check if the call can be inlined or not
-            if (Py_TYPE(function) == &PyFunction_Type && tstate->interp->eval_frame == NULL) {
+            if (0 && Py_TYPE(function) == &PyFunction_Type && tstate->interp->eval_frame == NULL) {


This will likely need to be controlled by configuration so that it is only active when profiling.

If you use py_trampoline_evaluator, then there is no need for the additional test and this can be enabled all the time.

markshannon

I have no idea if that assembly code does what you say, but I've reviewed the rest.

Definitely an intriguing idea.

markshannon · 2022-08-19T14:01:14Z

Include/cpython/code.h

@@ -88,6 +89,7 @@ typedef uint16_t _Py_CODEUNIT;
    PyObject *co_filename;        /* unicode (where it was loaded from) */     \
    PyObject *co_name;            /* unicode (name, for reference) */          \
    PyObject *co_qualname;        /* unicode (qualname, for reference) */      \
+    void* co_trampoline;                                               \


Code objects really need fewer fields, not more.
Would it be feasible to cram this into co_extra?

Yeah, this is just for the prototype. I have dedicated no time whatsoever to making this maintainable.

+1. An internal prototype doesn't modify the interpreter; instead it uses _PyCode_GetExtra and _PyCode_SetExtra, called from the logic that decides whether to trampoline, inside a custom eval frame function installed via _PyInterpreterState_SetEvalFrameFunc.

The long-term prognosis of this C stack trampoline approach is interesting: we know Python-to-Python calls shouldn't need to use the C stack, so having an eval frame hook at all could wind up on the chopping block for performance reasons.

It's also a little unfortunate to see trampolines consume more C stack space, as that alters the application's stack consumption profile. That works from a "practicality beats purity" standpoint but is not ideal.

markshannon · 2022-08-19T14:01:14Z

Include/internal/pycore_ceval.h

@@ -71,6 +74,11 @@ _PyEval_EvalFrame(PyThreadState *tstate, struct _PyInterpreterFrame *frame, int
 {
    EVAL_CALL_STAT_INC(EVAL_CALL_TOTAL);
    if (tstate->interp->eval_frame == NULL) {
+        PyCodeObject *co = frame->f_code;


Could this be a py_evaluator? It would save the extra check.

    PyObject*
    py_trampoline_evaluator(PyThreadState *ts, _PyInterpreterFrame *frame, int throw)
    {
        PyCodeObject *co = frame->f_code;
        py_trampoline f = (py_trampoline)(co->co_trampoline);
        assert (f != NULL);
        return f(_PyEval_EvalFrameDefault, ts, frame, throw);
    }

    _PyInterpreterState_SetEvalFrameFunc(is, py_trampoline_evaluator);

markshannon · 2022-08-19T14:01:14Z

Objects/codeobject.c

 #include "clinic/codeobject.c.h"

+#include <stdio.h>


This is a lot of #includes that don't seem relevant to code objects.
Maybe put this in its own file.

markshannon · 2022-08-19T14:01:14Z

Objects/codeobject.c

+
+typedef PyObject* (*py_evaluator)(PyThreadState *, _PyInterpreterFrame *, int throwflag);
+
+PyObject* the_trampoline(py_evaluator eval, PyThreadState* t, _PyInterpreterFrame* f, int p) {


This is unused

Internally at Google we've had a few people work on the trampoline function approach for getting sampled profile stack traces to show up in perf and similar collection mechanisms. The one that's been in production since ~2016 was a pile of build-time-compiled, code-analysis-generated .c trampolines that were inserted for covered functions. (Yes, yuck, but it worked and saved YouTube a lot of $resources.)

For runtime-generated trampolines, the experiments I'm staring at internally have used hand-pasted x86_64 machine code, have considered inline asm, and include a recent attempt that uses reinterpret_cast<uint8_t *>(&template_function). Our compiler team's LLVM/C++ folks say the pure C/C++ idea of copying a template function as data is not good: you cannot guarantee the compiler generates position-independent code.

The middle ground of copying from an inline asm function is tempting if it works (only if the compiler is guaranteed not to add any boilerplate before or after the inline asm in a function whose body is solely that inline asm), as it avoids the minor hurdle of separate .S files. It's unclear that it'd buy much, though.

OTOH this is very static, small trampoline code. Even writing a .S file per arch, assembling that, and pasting it into a uint8_t array per arch as data is "fine" from a maintenance perspective. A unittest could confirm that it matches what the .S assembles to.

I think this is by far the most maintainable version for us (at least until we have heavier machinery). The assembly is super tiny, adding more archs is very easy, and it fits very well into our build system.
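To make the "paste it into a uint8_t array" idea concrete, here is a minimal sketch, not the code in this PR. It assumes Linux on x86-64 with the System V calling convention; the byte array is a hand encoding of the register-shuffling template posted further down in this thread. The names trampoline_template, new_trampoline, toy_evaluator and demo_call are hypothetical, and the evaluator signature is simplified to plain pointers and an int.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the evaluator signature. */
typedef long (*evaluator_fn)(void *ts, void *frame, int throwflag);
/* The trampoline receives the evaluator as an extra first argument. */
typedef long (*trampoline_fn)(evaluator_fn eval, void *ts, void *frame,
                              int throwflag);

/* x86-64 machine code for the template (assumption: SysV ABI). */
static const uint8_t trampoline_template[] = {
    0x55,             /* push %rbp      */
    0x48, 0x89, 0xe5, /* mov  %rsp,%rbp */
    0x48, 0x89, 0xf8, /* mov  %rdi,%rax */
    0x48, 0x89, 0xf7, /* mov  %rsi,%rdi */
    0x48, 0x89, 0xd6, /* mov  %rdx,%rsi */
    0x89, 0xca,       /* mov  %ecx,%edx */
    0xff, 0xd0,       /* call *%rax     */
    0x5d,             /* pop  %rbp      */
    0xc3,             /* ret            */
};

/* Copy the template into fresh memory, then flip it to read+execute.
   Returns NULL on failure. */
static trampoline_fn new_trampoline(void)
{
    size_t len = sizeof(trampoline_template);
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return NULL;
    }
    memcpy(mem, trampoline_template, len);
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return NULL;
    }
    return (trampoline_fn)mem;
}

/* Toy evaluator that encodes its arguments so forwarding is observable. */
static long toy_evaluator(void *ts, void *frame, int throwflag)
{
    return (long)(intptr_t)ts * 100 + (long)(intptr_t)frame * 10 + throwflag;
}

/* Forwards (1, 2, 3) through a fresh trampoline; returns 123. */
static long demo_call(void)
{
    trampoline_fn t = new_trampoline();
    assert(t != NULL);
    return t(toy_evaluator, (void *)1, (void *)2, 3);
}
```

Because the template only uses a register-indirect call, the copied bytes work at any address, which is what makes a plain memcpy sufficient here. Each additional architecture would need its own array.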

markshannon · 2022-08-19T14:01:14Z

Objects/codeobject.c

@@ -300,6 +409,13 @@ init_code(PyCodeObject *co, struct _PyCodeConstructor *con)
    co->co_name = con->name;
    Py_INCREF(con->qualname);
    co->co_qualname = con->qualname;
+
+    py_trampoline f = compile_blech();


Can this be done lazily in a setup trampoline?

co->co_trampoline = setup_trampoline

As in, it is done the first time it is called?
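The lazy scheme discussed above can be sketched in plain C with a function pointer that replaces itself on first use. This is only an illustration under simplified assumptions: struct code stands in for PyCodeObject, real_trampoline stands in for whatever the per-code-object trampoline would be, and setup_calls exists purely so the behaviour can be observed. None of these names come from the PR apart from co_trampoline and setup_trampoline.

```c
#include <assert.h>
#include <stddef.h>

struct code;  /* hypothetical stand-in for PyCodeObject */
typedef int (*trampoline_fn)(struct code *co, int arg);

struct code {
    trampoline_fn co_trampoline;  /* starts out pointing at setup_trampoline */
    int setup_calls;              /* demo only: how many times setup ran */
};

/* Stand-in for the real trampoline; here it just doubles its argument. */
static int real_trampoline(struct code *co, int arg)
{
    (void)co;
    return arg * 2;
}

/* The first call lands here: do the expensive setup once, swap the
   pointer, then forward the call so the caller sees no difference. */
static int setup_trampoline(struct code *co, int arg)
{
    co->setup_calls++;
    co->co_trampoline = real_trampoline;
    return co->co_trampoline(co, arg);
}

/* What code-object initialisation would do in this scheme:
   a cheap pointer store instead of building the trampoline eagerly. */
static void init_code_sketch(struct code *co)
{
    assert(co != NULL);
    co->co_trampoline = setup_trampoline;
    co->setup_calls = 0;
}
```

With this shape, code objects that are never called never pay for a trampoline, and the call site stays a single indirect call with no extra NULL check.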

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>

tiran · 2022-08-19T19:33:01Z

You can use preprocessor macros if you name the file Objects/asm_trampoline.S or .sx instead of Objects/asm_trampoline.s. The .sx form needs a Makefile rule.

pablogsal · 2022-08-19T21:14:25Z

You can use preprocessor macros if you name the file Objects/asm_trampoline.S or .sx instead of Objects/asm_trampoline.s. The .sx form needs a Makefile rule.

Hummm, not sure I follow, could you maybe show me an example of what we can achieve with this?

tiran · 2022-08-19T21:31:37Z

You can have multiple implementations in the same file:

    .text
    .globl	_Py_trampoline_func_start
_Py_trampoline_func_start:
#ifdef __x86_64__
    push   %rbp
    mov    %rsp,%rbp
    mov    %rdi,%rax
    mov    %rsi,%rdi
    mov    %rdx,%rsi
    mov    %ecx,%edx
    call   *%rax
    pop    %rbp
    ret
#endif // __x86_64__
#ifdef __aarch64__
    TODO
#endif
    .globl	_Py_trampoline_func_end
_Py_trampoline_func_end:

gpshead · 2022-08-19T23:34:48Z

Include/internal/pycore_ceval.h

+        PyCodeObject *co = frame->f_code;
+        py_trampoline f = (py_trampoline)(co->co_trampoline);
+        if (f) {
+            return f(_PyEval_EvalFrameDefault, tstate, frame, throwflag);


Rather than pass the address of _PyEval_EvalFrameDefault in, you can make a lighter-weight trampoline: since you're generating the code, inline the address as an immediate in the generated code, either as an immediate constant load or as a jump or call target, whatever the architecture and desired code require.

That way your function signature can be identical to that of _PyEval_EvalFrameDefault.
With that in mind, can your trampoline skip its own call and just be an immediate jump directly to the function? That would shrink the required trampoline code and save stack space. _PyEval_EvalFrameDefault would then not show up in the stack at all; the name of your trampoline would show instead. (I don't know if that result is desirable.)

Rather than pass the address of _PyEval_EvalFrameDefault in, you can make a lighter-weight trampoline: since you're generating the code, inline the address as an immediate in the generated code, either as an immediate constant load or as a jump or call target, whatever the architecture and desired code require.

That would require relocation of the address, wouldn't it? For the first version, I prefer to not have to deal with that as there are other challenges we need to solve first. We can improve this once the feature is implemented.

With that in mind, can your trampoline skip its own call and just be an immediate jump directly to the function?

Maybe. It's also an interesting suggestion, but I'd prefer not to deal with that in the first version :)

There are some things to think about regarding how unwinders will treat that, because DWARF will point to the address range of _PyEval_EvalFrameDefault regardless of whether we jump or not, and perf will only resolve PCs in our mmapped region to the Python code. In any case, that's likely a future improvement.

That would require relocation of the address, wouldn't it?

I don't think so. &_PyEval_EvalFrameDefault is an absolute address. I'm not aware of any important architecture being unable to jump or call to absolute addresses.

An implementation/iteration of this demonstrated internally filled in the address in the code template while populating a page of trampolines but is otherwise basically the same. (good validation of the idea if nothing else)

I'm pretty happy that we're all on the same page (pun intended) with how and why this concept works and is useful.

A jump rather than the call idea is me thinking out loud, it probably has consequences. Just something to ponder. Agreed: future.
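For readers following along, the "fill in the address while populating the trampolines" variant might look roughly like the sketch below. It is an assumption-laden illustration, not the internal implementation mentioned above: it targets Linux on x86-64 only, uses a movabs/jmp pair so the absolute 64-bit address is an immediate, and the names jump_template, IMM64_OFFSET, new_jump_trampoline and toy_evaluator are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the evaluator signature. */
typedef long (*evaluator_fn)(void *ts, void *frame, int throwflag);

/* movabs $imm64,%rax ; jmp *%rax. The eight zero bytes are patched
   with the absolute target address for each copy. */
static const uint8_t jump_template[] = {
    0x48, 0xb8, 0, 0, 0, 0, 0, 0, 0, 0, /* movabs $0,%rax */
    0xff, 0xe0,                         /* jmp    *%rax   */
};
enum { IMM64_OFFSET = 2 };

/* Build a trampoline with the same signature as `target` that jumps
   straight to it. Returns NULL on failure. */
static evaluator_fn new_jump_trampoline(evaluator_fn target)
{
    size_t len = sizeof(jump_template);
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return NULL;
    }
    memcpy(mem, jump_template, len);

    /* Patch the absolute address into the immediate operand. */
    uint64_t addr = (uint64_t)(uintptr_t)target;
    memcpy((uint8_t *)mem + IMM64_OFFSET, &addr, sizeof(addr));

    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return NULL;
    }
    return (evaluator_fn)mem;
}

/* Toy evaluator that encodes its arguments so forwarding is observable. */
static long toy_evaluator(void *ts, void *frame, int throwflag)
{
    return (long)(intptr_t)ts * 100 + (long)(intptr_t)frame * 10 + throwflag;
}
```

Since the address is absolute and written at runtime, no PC-relative fixup is needed per copy. The trade-off raised in the thread still applies: with a jump, the trampoline does not leave its own frame on the C stack, so how unwinders and perf attribute samples would need checking.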

I don't think so. &_PyEval_EvalFrameDefault is an absolute address. I'm not aware of any important architecture being unable to jump or call to absolute addresses.

It is, but it requires a relocation because its address is not known until the linker starts to assemble the final binary. The compiler (or us) should place 0 in the call target and emit a relocation so the linker knows that it needs to place the final address of _PyEval_EvalFrameDefault there. This should be happening automatically, but for some reason I'm getting segfaults :(

Note that we are using a pre-compiled .S file, so we are not compiling anything at runtime, just copying the function data. At compile time we don't know the address of the function, so we cannot just paste it there.

OK, I gave this a go and found the problem with the segfault. As I predicted, it is quite complicated because it involves a relocation.

The problem is that when we compile the ASM, the compiler generates one relocation for _PyEval_EvalFrameDefault, of type R_X86_64_PLT32. The compiler leaves space for the address of the function in the generated assembly:

    0:   55                      push   rbp
    1:   48 89 e5                mov    rbp,rsp
    4:   e8 00 00 00 00          call   9 <_Py_trampoline_func_start+0x9>
                         5: R_X86_64_PLT32       _PyEval_EvalFrameDefault-0x4
    9:   5d                      pop    rbp
    a:   c3                      ret

That 00 00 00 00 will be filled in by the linker with the appropriate address. But because the call opcode takes an offset from the instruction pointer, the offset placed here is unique and is always relative to the location where our template ends up in memory. For example:

    (gdb) x/11bx &_Py_trampoline_func_start
    0x5555556d97e0 <_Py_trampoline_func_start>:    0x55 0x48 0x89 0xe5 0xe8 0x06 0x79 0x03
    0x5555556d97e8 <_Py_trampoline_func_start+8>:  0x00 0x5d 0xc3

That 0x06 0x79 0x03 0x00 is the offset (0x00037906, taking endianness into account). But that's an offset relative to the location of _Py_trampoline_func_start plus the offset to the call instruction:

    (gdb) info symbol (0x00037906+0x5555556d97e0+8*6)
    _PyEval_EvalFrameDefault + 39 in section .text of /home/pablogsal/github/python/main/_bootstrap_python

As we are making multiple mmap regions, the instruction pointer changes every time (because the address of the mmap segment changes), and therefore the offset is wrong and would need to be updated relative to the start of our mmap region. That is nasty, and I don't think I want to deal with it in the first version.

So for the first version, I will stick to the current version, which takes the address of the final call as a pointer that is already relocated and uses a register-indirect call. We can improve it in a future PR. That's why I insisted on the danger of scope creep and on sticking to the simpler version first: it is very easy to spend a lot of time in a rabbit hole here for small gains.
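The fixup that each copied trampoline would need can be written down as a small calculation. The sketch below assumes the template layout from the disassembly above (the call opcode at offset 4, five bytes long) and relies on the x86-64 rule that a call rel32 displacement is measured from the address of the instruction following the call. The names call_rel32, CALL_OFFSET and CALL_LENGTH are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Layout of the template shown above: `call rel32` starts at byte 4
   and is 5 bytes long (opcode e8 plus a 32-bit displacement). */
enum { CALL_OFFSET = 4, CALL_LENGTH = 5 };

/* Displacement the call needs when a copy of the template starts at
   copy_start and must reach target. It is relative to the instruction
   that follows the call, so it differs for every copy location. */
static int32_t call_rel32(uint64_t copy_start, uint64_t target)
{
    uint64_t next_insn = copy_start + CALL_OFFSET + CALL_LENGTH;
    int64_t delta = (int64_t)(target - next_insn);
    /* A rel32 call cannot reach further than +/- 2 GiB. */
    assert(delta >= INT32_MIN && delta <= INT32_MAX);
    return (int32_t)delta;
}
```

Plugging in the start address from the gdb session (0x5555556d97e0) and the target those bytes imply (0x5555557110ef) gives back 0x37906, while the same template copied 0x1000 bytes higher would need 0x36906. That per-copy patching, plus the 2 GiB reach limit for pages mmapped far from the interpreter's text, is the extra work the register-indirect version avoids.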

Add autoconf magic

WIP

8748cbe

pablogsal added the DO-NOT-MERGE label Aug 19, 2022

bedevere-bot added the awaiting core review label Aug 19, 2022

pablogsal reviewed Aug 19, 2022

pablogsal reviewed Aug 19, 2022

markshannon reviewed Aug 19, 2022

pablogsal force-pushed the perf branch from 7457aa8 to 9b009f2 Compare Aug 19, 2022

Use .s file

439ef28

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>

pablogsal force-pushed the perf branch from 9b009f2 to 439ef28 Compare Aug 19, 2022

gpshead reviewed Aug 19, 2022

gpshead added type-feature interpreter-core labels Aug 20, 2022

tiran and others added 2 commits Aug 20, 2022

Add autoconf magic

5a70f52

Merge pull request #33 from tiran/perf-ac

78987af

Add autoconf magic
