Rewrite TestChildProcessCleanup with socket-based deterministic liveness probe#2265
Draft
Rewrite TestChildProcessCleanup with socket-based deterministic liveness probe#2265
Conversation
…c in finally The three TestChildProcessCleanup tests were flaky on Windows/macOS CI because they used fixed anyio.sleep() durations (0.5s + 0.3s = 800ms) to wait for nested Python interpreter chains to start writing to marker files. On loaded CI runners, two Python startups + subprocess.Popen + open() + first write() can exceed 800ms, causing assert 0 > 0. Secondary bug: when the assertion failed, proc was never terminated (finally only cleaned up tempfiles). The leaked subprocess was GC'd during a later test, triggering PytestUnraisableExceptionWarning and causing knock-on failures in unrelated tests. Changes: - Added _wait_for_file_size() helper: polls until getsize(path) exceeds a threshold, bounded by anyio.fail_after(10). Replaces all startup sleep+assert chains. Raises TimeoutError with a clear message instead of a confusing assert 0 > 0. - Added proc cleanup to each finally block. Uses proc=None tracking after successful in-test termination to avoid redundant double-calls, and anyio.move_on_after() so cleanup timeout never masks the real failure. - Removed the parent_marker check from test_basic_child_process_cleanup. NamedTemporaryFile(delete=False) already creates the file, so the os.path.exists() assertion was a no-op that never verified anything. Child writing proves parent started. - Simplified shutdown verification: terminate_posix_process_tree already polls for process-group death, so the first post-termination sleep(0.5) was redundant. Reduced to a single 0.3s stability check (3x child write interval). Result: 3 tests drop from ~7s to ~1.9s locally, 30/30 pass under 4-worker parallel load (flake-finder). Closes #1775 Github-Issue: #1775
…ess probe
Replaces the file-growth-polling approach with a fundamentally different
design: each subprocess in the tree connects a TCP socket back to a listener
owned by the test. Liveness and death are then detected via blocking I/O
operations that unblock on kernel-level events — zero sleeps, zero polling:
1. await listener.accept() blocks until the subprocess connects
(proves it started).
2. After _terminate_process_tree(), await stream.receive(1) raises
EndOfStream because the kernel closes all file descriptors of a
terminated process. This is direct, OS-guaranteed proof that the
child is dead.
Compared to the original file-watching approach:
- No tempfiles, no file I/O, no path escaping
- No fixed sleep() durations (the root cause of flakiness)
- No polling loops
- Failure modes are clear: accept times out = child never started;
receive times out = child survived termination (actual bug)
Also fixes the process leak on assertion failure by adding proc cleanup
to each finally block (guarded by proc=None after successful in-test
termination to avoid noisy double-terminate).
Result: 3 tests drop from ~7.1s to ~0.5s, 30/30 pass under 4-worker
parallel load. Removes 80 net lines and the tempfile/path-escape deps.
Closes #1775
Github-Issue: #1775
On Windows, TerminateJobObject causes an abrupt TCP RST (not clean FIN), so anyio raises BrokenResourceError instead of EndOfStream when reading from a socket held by the terminated child. Both exceptions indicate the peer is dead — catch both. Github-Issue: #1775
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #1775. Supersedes #2071.
The three
TestChildProcessCleanuptests intests/client/test_stdio.pyfailed intermittently on Windows/macOS CI. The root cause was a fundamentally timing-dependent test design: watching a file grow on disk as a proxy for "is the child process alive?", with fixedanyio.sleep()durations to "wait long enough" for nested Python interpreters to start. On loaded CI, "long enough" wasn't.A secondary bug compounded this: when the flaky assertion failed,
procwas never terminated — thefinallyblocks only cleaned up tempfiles. The leaked subprocess was GC'd during a later test, triggeringPytestUnraisableExceptionWarningthere (observed:test_call_tool— an in-memory test that never touches subprocesses — failing withResourceWarning: subprocess 5592 is still running).New design — socket-based, zero sleeps, zero polling
Each subprocess in the tree connects a TCP socket back to a listener owned by the test. Two kernel-guaranteed blocking-I/O signals then tell us everything:
await listener.accept()blocks until the subprocess connects → proves it started. No polling for "did the file grow yet?".await stream.receive(1)after_terminate_process_tree()raisesanyio.EndOfStreambecause the kernel closes all file descriptors — including sockets — when a process terminates. This is a direct, OS-level proof that the child is dead. No polling for "did the file stop growing?".Every synchronization point is a blocking read that unblocks on a kernel event. The
anyio.fail_after()wrappers are upper bounds for catastrophic failure (subprocess never started / survived termination), not timing assumptions.Secondary fix: process cleanup in finally
Each test now terminates
procinfinallyif the test body's own termination didn't run (guarded byproc = Noneafter successful in-test termination to avoid noisy double-terminate).anyio.move_on_after(5)ensures cleanup can't mask the original failure.Also removed
tempfileimport and all file I/Oescape_path_for_pythonimport (no longer need path escaping for embedded scripts — scripts are now injected via{code!r}repr() literals, which eliminates all nested-quote escaping)textwrap.dedentblocks with triply-nested f-string scriptsparent_markercheck from test 1, which was a no-op:NamedTemporaryFile(delete=False)creates the file, soos.path.exists()was alwaysTrueResults
./scripts/testanyio.sleep()calls in TestChildProcessCleanupAI Disclaimer