MCP Gateway: avoid blocking list calls on startup#298040
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request optimizes the MCP Gateway performance by avoiding blocking list calls during server startup, addressing the 1min+ TTFT issue reported in #297780. The core change is to return cached data immediately when available and trigger server refreshes in the background, rather than waiting for all servers to be ready before responding.
Changes:
- Removed blocking
await Promise.all()call that waited for all servers to be ready before listing tools - Added conditional cache state checks that trigger background server refreshes for Unknown/Outdated states
- Modified three list methods to return immediately with cached/live data while refreshing stale servers asynchronously
| if (cacheState === McpServerCacheState.Unknown || cacheState === McpServerCacheState.Outdated) { | ||
| // Avoid blocking the tool list on slow server startup; refresh in the background. | ||
| void this._ensureServerReady(server); | ||
| } | ||
| if (cacheState !== McpServerCacheState.Live && cacheState !== McpServerCacheState.Cached && cacheState !== McpServerCacheState.RefreshingFromCached) { | ||
| continue; |
There was a problem hiding this comment.
The logic for checking cache state and conditionally refreshing servers in the background is duplicated across three methods (_listTools, _listResources, _listResourceTemplates). Consider extracting this logic into a private helper method to improve maintainability and ensure consistency when the logic needs to change in the future.
There was a problem hiding this comment.
Addressed — the duplicated logic has been extracted into _waitForStartup and _shouldUseCachedData, which are shared by _listTools, _listResources, and _listResourceTemplates.
| if (cacheState === McpServerCacheState.Unknown || cacheState === McpServerCacheState.Outdated) { | ||
| // Avoid blocking the tool list on slow server startup; refresh in the background. | ||
| void this._ensureServerReady(server); | ||
| } | ||
| if (cacheState !== McpServerCacheState.Live && cacheState !== McpServerCacheState.Cached && cacheState !== McpServerCacheState.RefreshingFromCached) { | ||
| continue; |
There was a problem hiding this comment.
With the changes in this PR, the test "starts server when cache state is unknown" on line 123 may no longer be accurate. The server startup is now triggered asynchronously in the background (void this._ensureServerReady) and won't block the listTools call. The test should be updated to verify that: 1) the background refresh is initiated, and 2) the call returns immediately without waiting for the server to be ready. Consider adding an async wait or verification that the server starts eventually in the background.
There was a problem hiding this comment.
Updated. The test now asserts that listTools returns the tools (not an empty array) after the grace period, and verifies startCalls === 1. A second test (returns empty tools and does not re-wait if server does not start within grace period) was added to cover the timeout path using a server whose start() never resolves and a short grace period.
|
This is nice but I think it's not practical atm for scenarios since Claude SDK and Copilot CLI's both don't support 'dynamic' tools that change after the connection is established. Wonder what you think @TylerLeonhardt |
|
@connor4312 it actually may work in the claude case because we react to when there are new mcp tools added (we essentially restart the session). |
|
The downside is just that the first message sent may lack appropriate tools. Perhaps we have some reasonable timeout here instead. Like 5 seconds -- but it should per server, so each new message should not wait 5s if a server takes forever to start, we should only give it that grace period when we first ask to start it. |
Instead of immediately returning empty results for Unknown-state servers and refreshing in the background, wait up to 5 seconds for the server to become live on the first list call. Subsequent calls find the already- resolved promise in _startupGrace and return immediately, so only the first message pays the startup cost. - _waitForStartup: races _ensureServerReady against the configurable grace period (default 5 000 ms); result is cached per server - _shouldUseCachedData: awaits the grace period for Unknown servers, falls back to background-refresh for Outdated servers - _listTools: refactored to Promise.all to parallelise per-server waits - Tests: updated 'starts server when cache state is unknown' to assert tools are returned after the grace period; added new test for the timeout path using createNeverStartingServer
|
Thanks for the discussion @connor4312 @TylerLeonhardt — updated with the per-server grace period approach. The new behaviour:
The timeout is injected via the constructor (default 5 000 ms) so it can be overridden in tests without fake timers. |
| if (cacheState === McpServerCacheState.Outdated) { | ||
| // Has cached data — refresh in the background without blocking. | ||
| void this._ensureServerReady(server); | ||
| } |
There was a problem hiding this comment.
I think we should actually take the same codepath for outdated servers as we do for unknown servers. Just because it started up once before quickly doesn't mean it'll do so again.
There was a problem hiding this comment.
Good point — updated. Outdated now takes the same codepath as Unknown: both race server startup against the 5s per-server grace period. The separate background-fire path for Outdated is removed.
Per review feedback: Outdated servers should use the same per-server startup grace period as Unknown servers. A prior fast startup does not guarantee a fast restart, so both states now race startup against the 5s timeout before returning results.
connor4312
left a comment
There was a problem hiding this comment.
nice work, thanks for the contribution!
|
@RajeshKumar11 looks like there is a compile error |
5e0dc2f
Fixes #297780
Summary
Rationale
These list endpoints were awaiting server startup, causing large TTFT when the gateway is enabled. Returning cached data immediately and refreshing in the background restores the feature without the startup delay.