X Tutup
Skip to content

MCP Gateway: avoid blocking list calls on startup#298040

Merged
9 commits merged intomicrosoft:mainfrom
RajeshKumar11:fix/mcp-gateway-nonblocking-297780-v2
Mar 4, 2026
Merged

MCP Gateway: avoid blocking list calls on startup#298040
9 commits merged intomicrosoft:mainfrom
RajeshKumar11:fix/mcp-gateway-nonblocking-297780-v2

Conversation

@RajeshKumar11
Copy link
Contributor

Fixes #297780

Summary

  • Avoid blocking MCP gateway list calls (tools/list, resources/list, resources/templates/list) on slow server startup
  • Refresh servers in the background when cache state is Unknown/Outdated

Rationale

These list endpoints were awaiting server startup, causing large TTFT when the gateway is enabled. Returning cached data immediately and refreshing in the background restores the feature without the startup delay.

Copilot AI review requested due to automatic review settings February 26, 2026 17:13
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request optimizes the MCP Gateway performance by avoiding blocking list calls during server startup, addressing the 1min+ TTFT issue reported in #297780. The core change is to return cached data immediately when available and trigger server refreshes in the background, rather than waiting for all servers to be ready before responding.

Changes:

  • Removed blocking await Promise.all() call that waited for all servers to be ready before listing tools
  • Added conditional cache state checks that trigger background server refreshes for Unknown/Outdated states
  • Modified three list methods to return immediately with cached/live data while refreshing stale servers asynchronously

Comment on lines 130 to 135
if (cacheState === McpServerCacheState.Unknown || cacheState === McpServerCacheState.Outdated) {
// Avoid blocking the tool list on slow server startup; refresh in the background.
void this._ensureServerReady(server);
}
if (cacheState !== McpServerCacheState.Live && cacheState !== McpServerCacheState.Cached && cacheState !== McpServerCacheState.RefreshingFromCached) {
continue;
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for checking cache state and conditionally refreshing servers in the background is duplicated across three methods (_listTools, _listResources, _listResourceTemplates). Consider extracting this logic into a private helper method to improve maintainability and ensure consistency when the logic needs to change in the future.

Copilot generated this review using guidance from repository custom instructions.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — the duplicated logic has been extracted into _waitForStartup and _shouldUseCachedData, which are shared by _listTools, _listResources, and _listResourceTemplates.

Comment on lines 130 to 135
if (cacheState === McpServerCacheState.Unknown || cacheState === McpServerCacheState.Outdated) {
// Avoid blocking the tool list on slow server startup; refresh in the background.
void this._ensureServerReady(server);
}
if (cacheState !== McpServerCacheState.Live && cacheState !== McpServerCacheState.Cached && cacheState !== McpServerCacheState.RefreshingFromCached) {
continue;
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the changes in this PR, the test "starts server when cache state is unknown" on line 123 may no longer be accurate. The server startup is now triggered asynchronously in the background (void this._ensureServerReady) and won't block the listTools call. The test should be updated to verify that: 1) the background refresh is initiated, and 2) the call returns immediately without waiting for the server to be ready. Consider adding an async wait or verification that the server starts eventually in the background.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. The test now asserts that listTools returns the tools (not an empty array) after the grace period, and verifies startCalls === 1. A second test (returns empty tools and does not re-wait if server does not start within grace period) was added to cover the timeout path using a server whose start() never resolves and a short grace period.

@connor4312
Copy link
Member

connor4312 commented Feb 26, 2026

This is nice but I think it's not practical atm for scenarios since Claude SDK and Copilot CLI's both don't support 'dynamic' tools that change after the connection is established. Wonder what you think @TylerLeonhardt

@TylerLeonhardt
Copy link
Member

@connor4312 it actually may work in the claude case because we react to when there are new mcp tools added (we essentially restart the session).

@connor4312
Copy link
Member

The downside is just that the first message sent may lack appropriate tools. Perhaps we have some reasonable timeout here instead. Like 5 seconds -- but it should per server, so each new message should not wait 5s if a server takes forever to start, we should only give it that grace period when we first ask to start it.

Instead of immediately returning empty results for Unknown-state servers
and refreshing in the background, wait up to 5 seconds for the server to
become live on the first list call. Subsequent calls find the already-
resolved promise in _startupGrace and return immediately, so only the
first message pays the startup cost.

- _waitForStartup: races _ensureServerReady against the configurable
  grace period (default 5 000 ms); result is cached per server
- _shouldUseCachedData: awaits the grace period for Unknown servers,
  falls back to background-refresh for Outdated servers
- _listTools: refactored to Promise.all to parallelise per-server waits
- Tests: updated 'starts server when cache state is unknown' to assert
  tools are returned after the grace period; added new test for the
  timeout path using createNeverStartingServer
@RajeshKumar11
Copy link
Contributor Author

Thanks for the discussion @connor4312 @TylerLeonhardt — updated with the per-server grace period approach.

The new behaviour:

  • On the first //\ call for an Unknown-state server, we race server startup against a 5 s timeout. If the server comes up in time, its tools are included in that response.
  • On subsequent calls the stored promise is already resolved, so no wait occurs — the server either contributes its tools (if it started) or is skipped.
  • Outdated servers continue to refresh in the background and return cached data immediately (no grace period needed since cached data is available).

The timeout is injected via the constructor (default 5 000 ms) so it can be overridden in tests without fake timers.

Comment on lines +114 to +117
if (cacheState === McpServerCacheState.Outdated) {
// Has cached data — refresh in the background without blocking.
void this._ensureServerReady(server);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should actually take the same codepath for outdated servers as we do for unknown servers. Just because it started up once before quickly doesn't mean it'll do so again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point — updated. Outdated now takes the same codepath as Unknown: both race server startup against the 5s per-server grace period. The separate background-fire path for Outdated is removed.

Per review feedback: Outdated servers should use the same per-server
startup grace period as Unknown servers. A prior fast startup does not
guarantee a fast restart, so both states now race startup against the
5s timeout before returning results.
Copy link
Member

@connor4312 connor4312 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work, thanks for the contribution!

@connor4312 connor4312 enabled auto-merge (squash) March 2, 2026 17:29
@vs-code-engineering vs-code-engineering bot added this to the March 2026 milestone Mar 2, 2026
@connor4312
Copy link
Member

@RajeshKumar11 looks like there is a compile error

Error: [core-ci                  ] src/vs/workbench/contrib/mcp/test/common/mcpGatewayToolBrokerChannel.test.ts(245,3): error TS2322: Type '() => Promise<{ state: McpConnectionState.Kind; }>' is not assignable to type '(opts?: IMcpServerStartOpts | undefined) => Promise<McpConnectionState>'.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MCP Gateway performance needs improvement

4 participants

X Tutup