New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow Linux perf profiler to see Python calls #96123
base: main
Are you sure you want to change the base?
Conversation
| @@ -4812,7 +4812,7 @@ _PyEval_EvalFrameDefault(PyThreadState *tstate, _PyInterpreterFrame *frame, int | |||
| function = PEEK(total_args + 1); | |||
| int positional_args = total_args - KWNAMES_LEN(); | |||
| // Check if the call can be inlined or not | |||
| if (Py_TYPE(function) == &PyFunction_Type && tstate->interp->eval_frame == NULL) { | |||
| if (0 && Py_TYPE(function) == &PyFunction_Type && tstate->interp->eval_frame == NULL) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to be controlled likely by configuration so is only active in profiling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use py_trampoline_evaluator, then there is no need for the additional test and this can be enabled all the time
I have no idea if that assembly code does what you say, but I've reviewed the rest.
Definitely an intriguing idea.
| @@ -88,6 +89,7 @@ typedef uint16_t _Py_CODEUNIT; | |||
| PyObject *co_filename; /* unicode (where it was loaded from) */ \ | |||
| PyObject *co_name; /* unicode (name, for reference) */ \ | |||
| PyObject *co_qualname; /* unicode (qualname, for reference) */ \ | |||
| void* co_trampoline; \ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code objects really need fewer fields, not more.
Would it be feasible to cram this into co_extra?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is just for the prototype, I have dedicated no time whatsoever to make this maintainable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, an internal prototype doesn't modify the interpreter so it uses _PyCode_GetExtra and _PyCode_SetExtra from logic to trampoline or not within a custom eval frame func via _PyInterpreterState_SetEvalFrameFunc.
The long term prognosis of this C stack trampoline approach is interesting in that we know python to python calls shouldn't need to use the C stack so having an eval frame hook at all could wind up as a feature on the chopping block for performance reasons.
It's also a little unfortunate to see trampolines consume more C stack space as that alters the applications stack consumption profile which works from a practicality beats purity standpoint but is not ideal.
| @@ -71,6 +74,11 @@ _PyEval_EvalFrame(PyThreadState *tstate, struct _PyInterpreterFrame *frame, int | |||
| { | |||
| EVAL_CALL_STAT_INC(EVAL_CALL_TOTAL); | |||
| if (tstate->interp->eval_frame == NULL) { | |||
| PyCodeObject *co = frame->f_code; | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be a py_evaluator? it would save the extra check.
PyObject* py_trampoline_evaluator(PyThreadState *ts, _PyInterpreterFrame *frame, int throw) {
PyCodeObject *co = frame->f_code;
py_trampoline f = (py_trampoline)(co->co_trampoline);
assert (f != NULL);
return f(_PyEval_EvalFrameDefault, ts, frame, throw);
}
_PyInterpreterState_SetEvalFrameFunc(is, py_trampoline_evaluator);
| #include "clinic/codeobject.c.h" | ||
|
|
||
| #include <stdio.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of #includes that don't seem relevant to code objects.
Maybe put this in its own file.
Objects/codeobject.c
Outdated
|
|
||
| typedef PyObject* (*py_evaluator)(PyThreadState *, _PyInterpreterFrame *, int throwflag); | ||
|
|
||
| PyObject* the_trampoline(py_evaluator eval, PyThreadState* t, _PyInterpreterFrame* f, int p) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Internally at Google we've had a a few people work on the trampoline function approach for sampled profile stack traces showing up in perf and similar collection mechanisms. The one that's been in production since ~2016ish was a build time compiled pile of code analysis based generated .c trampolines that were inserted for covered functions. (yes yuck, but it worked and saved YouTube a lot of $resources).
For runtime generated trampolines, the experiments I'm staring at internally have used hand pasted in x86_64 machine code, considering using inline asm, and a recent attempt that uses reinterpret_cast<uint8_t *>&template_function. Our compiler team LLVM/C++ish folks say the pure C/C++ copy a template function as data idea is not good. You cannot guarantee the compiler generates position independent code.
The middle ground of copying from an inline asm function is tempting if it works (only if the compiler is guaranteed to not be adding any additional boilerplate to the function before and after the inline asm for a function who's body is solely that inline asm) as it avoids the minor hurdle of separate .S files. Unclear that it'd buy much though.
OTOH this is very static small trampoline code. even writing a .S file per arch, assembling that, and pasting it into a uint8_t array per arch as data is "fine" from a maintenance perspective. a unittest could confirm that it matches what the .S assembles to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OTOH this is very static small trampoline code. even writing a .S file per arch, assembling that, and pasting it into a uint8_t array per arch as data is "fine" from a maintenance perspective. a unittest could confirm that it matches what the .S assembles to.
I think this is by far the most maintainable version for us (at least until we have heavier machinery). The assembly is super tiny and adding more archs is very easy and it fits very well into our build system
Objects/codeobject.c
Outdated
| @@ -300,6 +409,13 @@ init_code(PyCodeObject *co, struct _PyCodeConstructor *con) | |||
| co->co_name = con->name; | |||
| Py_INCREF(con->qualname); | |||
| co->co_qualname = con->qualname; | |||
|
|
|||
| py_trampoline f = compile_blech(); | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be done lazily in a setup trampoline?
co->co_trampoline = setup_trampoline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, is done the first time is called?
| @@ -4812,7 +4812,7 @@ _PyEval_EvalFrameDefault(PyThreadState *tstate, _PyInterpreterFrame *frame, int | |||
| function = PEEK(total_args + 1); | |||
| int positional_args = total_args - KWNAMES_LEN(); | |||
| // Check if the call can be inlined or not | |||
| if (Py_TYPE(function) == &PyFunction_Type && tstate->interp->eval_frame == NULL) { | |||
| if (0 && Py_TYPE(function) == &PyFunction_Type && tstate->interp->eval_frame == NULL) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use py_trampoline_evaluator, then there is no need for the additional test and this can be enabled all the time
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
|
You can use preprocesor macros if you name the file |
Hummm, not sure I follow, could you maybe show me an example of what we can achieve with this? |
|
You can have multiple implementations in the same file: |
| PyCodeObject *co = frame->f_code; | ||
| py_trampoline f = (py_trampoline)(co->co_trampoline); | ||
| if (f) { | ||
| return f(_PyEval_EvalFrameDefault, tstate, frame, throwflag); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than pass the address of _PyEval_EvalFrameDefault in, you can make a lighter weight trampoline: you're generating the code, inline the address of this as an immediate into the generated code. either an immediate constant load or a jump or call target - whatever the architecture and desired code requires.
that way your function signature can be identical to that of _PyEval_EvalFrameDefault.
With that in mind, can your trampoline skip its own call and just be an immediate jump directly to the function? shrinking the required trampoline code and saving stack space - _PyEval_EvalFrameDefault would then not show up in the stack at all, instead showing the name of your trampoline. (I don't know if that result is desirable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than pass the address of _PyEval_EvalFrameDefault in, you can make a lighter weight trampoline: you're generating the code, inline the address of this as an immediate into the generated code. either an immediate constant load or a jump or call target - whatever the architecture and desired code requires.
That would require relocation of the address, wouldn't it? For the first version, I prefer to not have to deal with that as there are other challenges we need to solve first. We can improve this once the feature is implemented.
With that in mind, can your trampoline skip its own call and just be an immediate jump directly to the function?
Maybe, is also an interesting suggestion but I prefer not having to deal with that in the first version :)
There are some things to think about regarding how unwinders will treat that because DWARF will point to the pointer range of _PyEval_EvalFrameDefault regardless if we jump or not and perf will only resolve PCs on our mmaped region to the Python code. In any case, that's likely for a future improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would require relocation of the address, wouldn't it?
I don't think so. &_PyEval_EvalFrameDefault is an absolute address. I'm not aware of any important architecture being unable to jump or call to absolute addresses.
An implementation/iteration of this demonstrated internally filled in the address in the code template while populating a page of trampolines but is otherwise basically the same. (good validation of the idea if nothing else)
I'm pretty happy that we're all on the same page (pun intended) with how and why this concept works and is useful.
A jump rather than the call idea is me thinking out loud, it probably has consequences. Just something to ponder. Agreed: future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. &_PyEval_EvalFrameDefault is an absolute address. I'm not aware of any important architecture being unable to jump or call to absolute addresses.
It is, but it requires a relocation because it's address is not know until the linker starts to assemble the final binary. The compiler (or us) should place 0 in the call target and emit a relocation so the linker knows that needs to place the final address of _PyEval_EvalFrameDefault. This should be happening automatically but for some reason I'm getting segfaults :(
Note that we are using a -compiled .S file so we are not compiling anything at runtime, just copying the function data. At compile time we don't know the address of the function so we cannot just paste it there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I gave this a go and found the problem with the segfault and as I predicted is quite complicated because it involves a relocation.
The problem is that when we compile the ASM, then the compiler generates one relocation for _PyEval_EvalFrameDefault of type R_X86_64_PLT32. The compiler leaves space for the address of the function in the generated assembly:
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: e8 00 00 00 00 call 9 <_Py_trampoline_func_start+0x9> 5: R_X86_64_PLT32 _PyEval_EvalFrameDefault-0x4
9: 5d pop rbp
a: c3 ret
That 00 00 00 00 will be filled by the linker with the appropiate address. But because call opcode takes an offset from the instruction pointer, it means that the offset that will be placed here is unique and is always relative to the location or where our template ends in memory. For example:
(gdb) x/11bx &_Py_trampoline_func_start
0x5555556d97e0 <_Py_trampoline_func_start>: 0x55 0x48 0x89 0xe5 0xe8 0x06 0x79 0x03
0x5555556d97e8 <_Py_trampoline_func_start+8>: 0x00 0x5d 0xc3
That 0x06 0x79 0x03 0x00 is the offset (0x00037906 into account). But that's an offset relative to the location of _Py_trampoline_func_start + offset to the call instruction:
info symbol (0x00037906+0x5555556d97e0+8*6)
_PyEval_EvalFrameDefault + 39 in section .text of /home/pablogsal/github/python/main/_bootstrap_python
As we are making multiple mmap places, the instruction pointer changes every time (because the address of the mmap segment chages and therefore the offset is wrong and should be updated to the start of our mmap region, which is nasty and I don't think I want to deal with this in the first version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for the first version, I will stick to the current version that takes the address of the final call as a pointer that is already relocated and uses a register-indirect call. We can improve it in a future PR. That's why I insisted on the danger of scope creep and sticking to the simpler version first: is very easy to spend a lot of time in a rabbit hole here for small gains.
Add autoconf magic

Formed in 2009, the Archive Team (not to be confused with the archive.org Archive-It Team) is a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage. The group is 100% composed of volunteers and interested parties, and has expanded into a large amount of related projects for saving online and digital history.

If you have a lot of experience with this kind of shenanigans and want to improve the first version, please make a PR against my branch or reach out by email. I prefer not to have long discussions on the approach in the PR so we can keep it productive.
Result in perf:

