diagnostics_channel: add tracing channel #44943

Qard · 2022-10-10T01:20:27Z

This adds a helper to present tracing functionality through a group of channels and a shared context object. The shared context object can be used to communicate meta information about the action being traced.

No effort is made to link traces together; this only provides the basics to express a span for a single sync or async task. It's left up to the user to track and link span data through something like AsyncLocalStorage.
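
To make the description above concrete, here is a minimal sketch of tracing one synchronous call. It assumes the `tracingChannel()` factory and the `traceSync(fn, context, thisArg, ...args)` signature that later shipped in Node.js, which may differ from this draft. The channel name `example.add` and the `add` function are made up for illustration.

```javascript
const dc = require('node:diagnostics_channel');

// Hypothetical channel name; publisher and subscribers only need to agree on it.
const tracing = dc.tracingChannel('example.add');
const events = [];

tracing.subscribe({
  start(ctx) { events.push(`start ${ctx.a}+${ctx.b}`); },
  end(ctx) { events.push(`end result=${ctx.result}`); },
  asyncStart() {},
  asyncEnd() {},
  error(ctx) { events.push(`error ${ctx.error.message}`); },
});

function add(a, b) {
  return a + b;
}

// The same ctx object is handed to every event of this one traced call.
const ctx = { a: 1, b: 2 };
const sum = tracing.traceSync(add, ctx, null, 1, 2);
```

Linking this call to a parent or child is deliberately out of scope here; a subscriber would have to do that itself, for example by keeping the current ctx in an AsyncLocalStorage.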

Similar to #44894, I'm starting this as a draft and skipping docs for the moment to get feedback on the API design. If we settle on this design satisfying our needs I'll write up some proper docs for it.

cc @nodejs/diagnostics

lib/diagnostics_channel.js

Flarna · 2022-10-12T14:13:29Z

It might be helpful to allow/support something like update to provide some more data between start and end, e.g. if some streaming use case is traced.

Qard · 2022-10-12T22:12:11Z

The idea was to just modify the ctx object whenever updates occur between the events. Open to adding additional events though if we have a need to capture immediate data for whatever reason.
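
A rough sketch of that idea, assuming the `tracingChannel()` and `traceSync()` API as it later shipped in Node.js. The channel name `example.stream` and the `bytes` property are hypothetical. The traced function mutates the shared object, and subscribers only observe the change at the next event.

```javascript
const dc = require('node:diagnostics_channel');

const tracing = dc.tracingChannel('example.stream'); // hypothetical name
const seen = [];

tracing.subscribe({
  start(ctx) { seen.push(['start', ctx.bytes]); },
  end(ctx) { seen.push(['end', ctx.bytes]); },
  asyncStart() {},
  asyncEnd() {},
  error() {},
});

const ctx = { bytes: 0 };

// The traced function updates ctx as chunks arrive; no extra event is published.
function consume(chunks) {
  for (const chunk of chunks) ctx.bytes += chunk.length;
  return ctx.bytes;
}

tracing.traceSync(consume, ctx, null, ['ab', 'cde']);
```

The trade-off is visible here: the intermediate values are never published, so a consumer that needs them immediately would require an additional event.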

Flarna · 2022-10-13T07:00:38Z

> The idea was to just modify the ctx object whenever updates occur between the events. Open to adding additional events though if we have a need to capture immediate data for whatever reason.

That should be fine. The main issue I see with the context object is that some property names are already taken (error and result as of now), which could easily result in conflicts as various channels add more and more such properties to transport more data.

tony-go

Great 🙌

You probably will add documentation, but I'd like to see a full use case with an HTTP server there ^^

Qard · 2022-10-18T03:02:02Z

Made some changes to split up the logic for sync, callback, and promise functions. What do people think of this design? If this seems reasonable I can move forward with writing docs for it.

Flarna · 2022-10-19T06:35:31Z

I think the split API is better because usually you know if you trace a sync or async function. We might add a "cb or promise" variant later if needed. And there are always special cases which don't fit (e.g. the returned promise is actually a mixin with an event emitter,...).

How is a subscriber intended to know whether an operation is sync or async? We could transport the API used on the given context object. Or should this be part of the channel documentation?

For the sync part we have start/end, which could be used for calling enter/exit on AsyncLocalStorage and more (see #44894 (comment)). But we don't have this for the callback. Is the assumption that this is not needed and that the traced library is at least using AsyncResource correctly internally?

Qard · 2022-10-20T01:02:12Z

Yes, we're only caring about the trace up to when the callback runs. If the user wants more they can use async_hooks.

As for detecting sync vs async, we could have the start or end events include some flag like sync: true or sync: false depending on which version was used.

doc/api/diagnostics_channel.md

Co-authored-by: Colin Ihrig <cjihrig@gmail.com>

Qard · 2022-12-16T18:44:49Z

BTW, for anyone reviewing I have also rebased #44340 to use this interface to demonstrate the value and how it would be used in a relatively complex real-world scenario.

RafaelGSS · 2022-12-19T17:56:06Z

lib/diagnostics_channel.js

+function decRef(channel) {
+  channel._weak.decRef();
+  if (channel._weak.getRef() === 0) {
+    delete channels[channel.name];


Suggested change

-    delete channels[channel.name];
+    channels[channel.name] = undefined;

Otherwise, you will delete the shapes and the property access will be slower.

True, but the memory use will be greater, and performance of getting channels shouldn't be much of a priority as it's intended for it to be acquired at the file top-level rather than dynamically. It's a situation of deciding which trade-offs to go for. 🤷🏻

Maybe @mcollina has some thoughts here? Not a clear win either way to me.

How frequently would decRef be called in a real APM? I feel deleting the key would do more harm than a small memory increase (assuming few calls).

In ideal conditions, basically never. But the whole reason for the weak behavior in the first place is to protect against abusing it with excessively dynamic use. If someone is continuously generating and subscribing to new, uniquely-named channels it will be constantly expanding the shape without ever cleaning it up. Though we technically already have that with GC if a channel gets collected without a subscriber ever being added. 🤷🏻

I think this is unrealistic. Technically, if someone is subscribing excessively to unique channels, they will only see a memory increase after tons of calls, right?

Also, might be good to include a benchmark to validate our thoughts

We have a test around that at the moment. https://github.com/nodejs/node/blob/main/test/parallel/test-diagnostics-channel-memory-leak.js

Without the delete, that test will fail because the shape of channels keeps expanding and fills up with empty WeakReference instances. The referenced object can get cleaned up, but the container itself cannot, and those containers add up quickly if people are, for example, generating channels uniquely named after ids of things. In my opinion that's bad practice, but there's no reasonable way to stop users from doing it, so we opted to do the least damage to the environment that we can if the API is used badly.

I think we have to use delete here to avoid a leak if we stick with an object. Maybe we should use a Map instead?

Lookup/remove performance is, in my opinion, not that important, as the intended use is to create a Channel once during startup and then use it.

Yeah, might be worth considering. 🤔
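
A minimal sketch of the Map idea, under stated assumptions: the standard `WeakRef` stands in for the internal `WeakReference` helper, a strong `held` field mimics its incRef/decRef behaviour, and the `Registry` class and the channel names are hypothetical.

```javascript
// Sketch only: not the actual lib/diagnostics_channel.js implementation.
class Registry {
  #entries = new Map();

  get(name) {
    const entry = this.#entries.get(name);
    const existing = entry && (entry.held ?? entry.weak.deref());
    if (existing) return existing;
    const channel = { name };
    this.#entries.set(name, { weak: new WeakRef(channel), held: null, refs: 0 });
    return channel;
  }

  incRef(channel) {
    let entry = this.#entries.get(channel.name);
    if (!entry) {
      entry = { weak: new WeakRef(channel), held: null, refs: 0 };
      this.#entries.set(channel.name, entry);
    }
    // Hold the channel strongly while at least one subscriber exists.
    if (++entry.refs === 1) entry.held = channel;
  }

  decRef(channel) {
    const entry = this.#entries.get(channel.name);
    if (!entry) return;
    // Map.delete drops the container without touching any object shape.
    if (--entry.refs === 0) this.#entries.delete(channel.name);
  }

  get size() { return this.#entries.size; }
}

const registry = new Registry();
for (let i = 0; i < 1000; i++) {
  const channel = registry.get(`request:${i}`); // the "uniquely named" case
  registry.incRef(channel); // subscribe
  registry.decRef(channel); // unsubscribe
}
```

After the loop the registry is empty again, which is the property the memory leak test checks for. Entries created by `get()` that never see a subscriber would still need separate cleanup in this sketch.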

Flarna · 2022-12-19T21:48:55Z

doc/api/diagnostics_channel.md

+> Stability: 1 - Experimental
+
+The class `TracingChannel` is a collection of [TracingChannel Channels][] which
+together express a trace. It is used to formalize and simplify the process of


Suggested change

-together express a trace. It is used to formalize and simplify the process of
+together express a span. It is used to formalize and simplify the process of

Or maybe operation. The term trace is usually a tree of linked spans/operations and a single tracing channel forms one entry of a trace not the whole trace.

I don't think "span" or "operation" are great terms here either as there's zero agreement on terminology for what those units are called between APMs. It also means additional terminology that likely needs more explaining. This feature is not intended exclusively for APM and what can be considered a "trace" can be a single unit just as much as a collection of units. I think given the naming of TracingChannel it makes sense to call it a trace. If an expression of a collection of these units is described in this doc somewhere we could just call it a sub-trace. Many APMs think of it that way anyway, where there's not really a distinction between spans and traces, they're all just nested elements.

Yes, using (internal) terms from APM vendors will not help here.
I thought using Span would be quite good because it's also used in OpenTelemetry, which has become quite popular and is open.
In some other places you used "single traceable action". That matches better but is quite long.

Anyhow, feel free to keep it as it is. APM-aware people will likely be able to understand this, and others are likely lost in more places anyway.

The single traceable action terminology might be better here. It is long, but it's also unambiguous and not so much based on loaded terminology like "span" or whatever else an APM might know the concept as. What do you think of changing it to that?

sounds good.

doc/api/diagnostics_channel.md

Co-authored-by: Gerhard Stöbich <deb2001-github@yahoo.de> Co-authored-by: Chengzhong Wu <legendecas@gmail.com>

Flarna · 2022-12-20T09:10:15Z

lib/diagnostics_channel.js

@@ -105,27 +188,17 @@ function channel(name) {
  }

  channel = new Channel(name);
-  channels[name] = new WeakReference(channel);
+  channel._weak = new WeakReference(channel);
+  channels[name] = channel._weak;


Could we move these two lines into the Channel constructor?

Probably? I'll have to make sure the GC characteristics are still correct that way. I think it should be fine though.

There should be no difference. In the end you add two properties to the new channel instance. Not sure why we need two properties referring to the same WeakReference.

Ah, don't actually need _weak anymore. That's leftover from something I had before to hold channels open through TracingChannel as I was constructing them directly before, but I changed how that works so that's not relevant anymore.

Just noticed that this is not resulting in two properties on channel as it is channels[name]. I guess that is still needed to allow a setup where a consumer subscribes on a channel but doesn't keep a reference to the channel instance.

It can go back to how it was before, where the WeakReference is stored directly in channels[name] rather than also needing to exist on the channel itself. That was just there because I had some weird weak reference forwarding thing for TracingChannel which turned out to be a bad idea, so I deleted it. 😅

Flarna · 2022-12-20T09:10:15Z

lib/diagnostics_channel.js

@@ -105,27 +188,17 @@ function channel(name) {
  }


Not directly related to this PR but shouldn't we move the type check to the beginning of the function?

Flarna · 2022-12-20T09:10:15Z

lib/diagnostics_channel.js

+}
+
+function maybeMarkInactive(channel) {
+  // When there are no more active subscribers, restore to fast prototype.


Suggested change

-  // When there are no more active subscribers, restore to fast prototype.
+  // When there are no more active subscribers and no bound stores, restore to fast prototype.

Is it still correct that ActiveChannel.hasSubscribers always returns true?
There can be a bound ALS but no subscriber.

I would think of a bound ALS as a type of subscriber, but maybe we need a discussion of what specifically hasSubscribers means? My intent for it was to decide if something needs to be published at all, which is still kind of the case but with a distinction between publish and runStores. Maybe we need another getter for if runStores needs to be called, which is any of subscribers or stores. 🤔

TracingChannel has no such API and in the traceXXX functions no check is done. Also your real world sample in loaders doesn't check this.

But I agree that from that point of view hasSubscribers does the right thing.

TracingChannel actually did have it originally, but I removed it in favour of doing something like tracingChannel.start.hasSubscribers instead.
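
For illustration, a guard of that form might look like the sketch below. It assumes the `tracingChannel()` and `traceSync()` API as later shipped in Node.js; the channel name `example.query` and the `runQuery` function are hypothetical.

```javascript
const dc = require('node:diagnostics_channel');

const tracing = dc.tracingChannel('example.query'); // hypothetical name

function runQuery(sql) {
  return `rows for ${sql}`;
}

function tracedQuery(sql) {
  // Skip building the context object when nobody listens on the start channel.
  if (!tracing.start.hasSubscribers) return runQuery(sql);
  return tracing.traceSync(runQuery, { sql }, null, sql);
}

const before = tracing.start.hasSubscribers;
const starts = [];
tracing.subscribe({
  start(ctx) { starts.push(ctx.sql); },
  end() {},
  asyncStart() {},
  asyncEnd() {},
  error() {},
});
const result = tracedQuery('SELECT 1');
```

Whether a bound store without any subscriber should also make this check pass is exactly the open question in this thread.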

Flarna · 2022-12-20T09:10:15Z

doc/api/diagnostics_channel.md

+* `thisArg` {any} The receiver to be used for the function call.
+* `...args` {any} Optional arguments to pass to the function.
+
+Publishes data to the channel and applies it to any AsyncLocalStorage instances


I think we should document the sequence: publish happens first, and then the store is entered.

Could do that. Can you think of any examples when that ordering would matter? Do we need to say anything beyond just swapping that and for a then?

It should be clear what one gets if als.getStore() is called inside the onMessage function. The same applies to the transform function, where the sample relies on the fact that it runs before als.run() is called to get the parent.

I think it's important to tell users that publish and transform happen before.

Fair. I'll make some improvements to that soon. :)
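
As a reference point for that documentation, here is a small sketch of `bindStore()` with a transform plus `runStores()`, assuming the signatures that later shipped in Node.js. The channel name and data shape are hypothetical. It intentionally avoids reading the store from inside the subscriber, since the relative order of publish and store entry is the part under discussion.

```javascript
const dc = require('node:diagnostics_channel');
const { AsyncLocalStorage } = require('node:async_hooks');

const channel = dc.channel('example.request'); // hypothetical name
const store = new AsyncLocalStorage();

// The transform turns the published data into the value held by the store.
channel.bindStore(store, (data) => ({ requestId: data.id }));

const published = [];
channel.subscribe((data) => published.push(data.id));

const thisArg = { tag: 'receiver' };
const inside = channel.runStores({ id: 7 }, function(suffix) {
  // Inside fn the store holds the transformed data; thisArg and args are forwarded.
  return `${this.tag}:${store.getStore().requestId}:${suffix}`;
}, thisArg, 'ok');

// Outside of runStores the store is no longer entered.
const outside = store.getStore();
```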

Flarna · 2022-12-20T09:10:15Z

lib/diagnostics_channel.js

+}
+
+function wrapStoreRun(store, data, next, transform = defaultTransform) {
+  return () => store.run(transform(data), next);


Should we somehow handle a throwing transform function?

We could handle it similarly to subscribers where failures get deferred to the next tick so it can finish the loop.

Either this or silently discard the exception and skip this subscriber. But I think we should not swallow such exceptions.
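
One possible shape for the deferred variant, as a sketch rather than the final implementation. The `report` parameter is a hypothetical hook added so the example can observe the error; core code would presumably raise an uncaught exception there instead.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

const defaultTransform = (data) => data;

// `report` is hypothetical and not part of the PR.
function wrapStoreRun(store, data, next, transform = defaultTransform,
                      report = console.error) {
  return () => {
    let context;
    try {
      context = transform(data);
    } catch (err) {
      // Defer the failure so the remaining stores and fn still run.
      process.nextTick(report, err);
      return next();
    }
    return store.run(context, next);
  };
}

const good = new AsyncLocalStorage();
const bad = new AsyncLocalStorage();
const reported = [];

let run = () => [good.getStore(), bad.getStore()];
run = wrapStoreRun(good, { id: 1 }, run);
run = wrapStoreRun(bad, { id: 1 }, run,
                   () => { throw new Error('boom'); },
                   (err) => reported.push(err.message));

// The store with the throwing transform is skipped; the other one is entered.
const stores = run();
```

The error is not swallowed: `reported` receives it on the next tick, after `run()` has already returned.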

Flarna · 2022-12-20T10:16:28Z

lib/diagnostics_channel.js

  unsubscribe,
-  Channel
+  Channel,
+  TracingChannel


Is there a reason to export TracingChannel?

Theoretically one could sub-class it. Not sure why they'd want to though. 🤷🏻

As with Channel, we export a factory method instead of telling users to use new TracingChannel().
If we don't want construction via new, allowing sub-classing sounds wrong.
Well, one could use it for instanceof.
ActiveChannel is not exported, and that is good to my understanding.
Channel should also not be exported in my opinion, but that ship has sailed.

Channels will pass instanceof even when active due to the SymbolHasInstance method on Channel. Also, ideally I'd like to make the factory unnecessary. I have some ideas for sub-classing TracingChannel which I'd like to enable if at all possible.
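
For readers unfamiliar with the mechanism, a simplified sketch of how a `Symbol.hasInstance` method keeps `instanceof` working across a prototype swap. The two classes here are stripped-down stand-ins, not the real ones.

```javascript
class Channel {
  // Treat both prototypes as "a Channel" regardless of which is currently set.
  static [Symbol.hasInstance](instance) {
    if (instance === null || instance === undefined) return false;
    const proto = Object.getPrototypeOf(instance);
    return proto === Channel.prototype || proto === ActiveChannel.prototype;
  }

  get hasSubscribers() { return false; }
}

class ActiveChannel {
  get hasSubscribers() { return true; }
}

const channel = new Channel();
const inactive = channel instanceof Channel;

// Simulate the first subscriber arriving: swap to the active prototype.
Object.setPrototypeOf(channel, ActiveChannel.prototype);
const active = channel instanceof Channel;
```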

Maybe remove the factory tracingChannel? I see no reason to support both variants.

Just went for that for consistency with Channel. We'll see how my sub-classing experiments go though. I'm not totally convinced of the utility of exposing both. I'll be taking another pass at cleaning stuff up soon, just busy with settling other work things before the holidays so might not get a chance to update this PR this week.

That's fine. My main point is that it is always easier to add an export compared to removing it - even if it is experimental.

Fair, I think I can just omit it then. It's easy to add later if I have a specific case for it.

Flarna · 2022-12-20T10:16:28Z

lib/diagnostics_channel.js

+    this.publish(data);
+
+    // Bind base fn first due to AsyncLocalStorage.run not having thisArg
+    fn = FunctionPrototypeBind(fn, thisArg, ...args);


If this is considered a hot function, it might make sense to avoid the bind when this._stores.size === 0.

The inactive channel prototype handles that for the most part. A bit of a weird case with subscribers and stores competing for active status. Ideally this should be a no-op when there's no stores. 🤔

Not a no-op; the function still needs to be called. But this can be a simple Reflect.apply instead of creating a bound function and calling it.

I mean no-op in the form of doesn't need to check anything. Basically it should do what Channel.prototype.runStores does even if there are subscribers but no stores.
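
A sketch of that fast path as a standalone function. The `stores` parameter is assumed to be a Map of AsyncLocalStorage to transform, mirroring `this._stores`; publishing is left out to keep the focus on the call itself.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

function runStores(stores, data, fn, thisArg, ...args) {
  // Fast path: nothing to enter, so call fn directly without a bound function.
  if (stores.size === 0) return Reflect.apply(fn, thisArg, args);

  // Otherwise nest one store.run() per bound store around the call.
  let run = () => Reflect.apply(fn, thisArg, args);
  for (const [store, transform] of stores) {
    const next = run;
    run = () => store.run(transform(data), next);
  }
  return run();
}

function greet(name) {
  return `${this.greeting}, ${name}`;
}
const receiver = { greeting: 'hi' };
const plain = runStores(new Map(), {}, greet, receiver, 'a');

const als = new AsyncLocalStorage();
const withStore = runStores(
  new Map([[als, (data) => data.user]]),
  { user: 'b' },
  function() { return `${this.greeting}, ${als.getStore()}`; },
  receiver,
);
```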

Flarna · 2022-12-20T10:16:28Z

doc/api/diagnostics_channel.md

+* `data` {Object} Shared object to correlate trace events through
+* `thisArg` {any} The receiver to be used for the function call
+* `...args` {any} Optional arguments to pass to the function
+* Returns: {Promise} Promise returned by the given function


It's a chained promise, not the original one

That's an implementation detail which ideally shouldn't be visible to the user. Do you think this is really relevant?

For plain, native promises, no. But there are still a lot of frameworks that use thenables, like bluebird. I have even seen thenables that are EventEmitters, to let the caller decide between awaiting and something else.

But maybe we should just disallow non-native promises. Then it doesn't matter.

Flarna · 2022-12-20T10:16:28Z

lib/diagnostics_channel.js

+
+    try {
+      const promise = start.runStores(ctx, fn, thisArg, ...args);
+      return PromisePrototypeThen(promise, resolve, reject);


I assume this is restricted to native promises and won't work for thenables such as a bluebird promise.
Should we add some check for this to avoid a TypeError being thrown from here?

Could just wrap promise in a Promise.resolve(promise) if it's not already a native promise.

The problem with this is that it removes any API the original thenable had which likely breaks user code. So maybe best to restrict this to native promises. (refs: other comment above).

If we're going to document it anyway, I think I'd rather just document that it will convert thenables to native promises rather than documenting that it doesn't support thenables at all. 🤔
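
The behaviours being weighed here can be shown with plain JavaScript, independent of this PR. The `thenable` object and its `cancel` method are made up to represent a library promise with extra API.

```javascript
// A minimal thenable that also exposes extra API, as some libraries do.
const thenable = {
  cancel() {},
  then(onFulfilled) { onFulfilled(42); },
};

let threw = false;
try {
  // Promise.prototype.then requires a native promise as its receiver.
  Promise.prototype.then.call(thenable, () => {});
} catch (err) {
  threw = err instanceof TypeError;
}

// Promise.resolve adopts the thenable but returns a new native promise,
// so any extra methods on the original are not available on the result.
const converted = Promise.resolve(thenable);
const lostApi = typeof converted.cancel === 'undefined';

// Native promises pass through Promise.resolve unchanged.
const native = Promise.resolve(1);
const same = Promise.resolve(native) === native;
```

So converting avoids the TypeError at the cost of hiding the thenable's own API from the caller, which is the breakage concern raised above.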

Flarna · 2022-12-20T10:16:28Z

lib/diagnostics_channel.js

+    }
+
+    const callback = ArrayPrototypeAt(args, position);
+    ArrayPrototypeSplice(args, position, 1, wrappedCallback);


Maybe check that we really replace a function here
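
Such a check could look roughly like this. `wrapCallbackAt` and `wrap` are hypothetical helpers written with public array methods instead of the primordials used in core.

```javascript
// Swap args[position] for a wrapped callback, validating it first.
function wrapCallbackAt(args, position, wrap) {
  const callback = args.at(position);
  if (typeof callback !== 'function') {
    throw new TypeError(`Expected a function at position ${position}`);
  }
  args.splice(position, 1, wrap(callback));
  return args;
}

const calls = [];

// Stand-in for the asyncStart/asyncEnd publishing around the callback.
const wrap = (cb) => (...cbArgs) => {
  calls.push('asyncStart');
  try {
    return cb(...cbArgs);
  } finally {
    calls.push('asyncEnd');
  }
};

const args = wrapCallbackAt(
  ['file.txt', (err, data) => calls.push(data)],
  -1,
  wrap,
);
args[1](null, 'contents');
```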

Flarna · 2022-12-20T11:47:13Z

test/parallel/test-diagnostics-channel-bind-store.js

+// Bind a store with transformation of published data
+const store2 = new AsyncLocalStorage();
+channel.bindStore(store2, common.mustCall((data) => {
+  assert.deepStrictEqual(data, inputs[n]);


Suggested change

-  assert.deepStrictEqual(data, inputs[n]);
+  assert.strictEqual(data, inputs[n]);

I guess this should be the same object instance. Similarly a few lines below.
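
The difference between the two assertions, shown with a hypothetical `input` object and a shallow copy of it:

```javascript
const { deepStrictEqual, strictEqual, AssertionError } = require('node:assert');

const input = { foo: 'bar' };
const copy = { ...input };

// deepStrictEqual only compares structure, so a copy passes.
deepStrictEqual(copy, input);

let rejectedCopy = false;
try {
  // strictEqual compares identity, so the copy is rejected.
  strictEqual(copy, input);
} catch (err) {
  rejectedCopy = err instanceof AssertionError;
}

// The same instance passes the identity check.
strictEqual(input, input);
```

With `strictEqual`, the test would catch an implementation that accidentally clones the published data instead of passing the original object through.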

Flarna · 2022-12-20T11:47:13Z

test/parallel/test-diagnostics-channel-bind-store.js

+
+channel.runStores(inputs[n], common.mustCall(function(a, b) {
+  // Verify this and argument forwarding
+  assert.deepStrictEqual(this, thisArg);


Suggested change

-  assert.deepStrictEqual(this, thisArg);
+  assert.strictEqual(this, thisArg);

Flarna · 2022-12-20T11:47:13Z

test/parallel/test-diagnostics-channel-tracing-channel-sync.js

+assert.strictEqual(channel.start.hasSubscribers, false);
+channel.subscribe(handlers);
+assert.strictEqual(channel.start.hasSubscribers, true);
+channel.traceSync(() => {


Maybe assert the return value and the passed arguments.

Flarna · 2022-12-20T11:47:13Z

test/parallel/test-diagnostics-channel-tracing-channel-sync.js

+const input = { foo: 'bar' };
+
+function check(found) {
+  assert.deepStrictEqual(found, input);


Suggested change

-  assert.deepStrictEqual(found, input);
+  assert.strictEqual(found, input);

similar in other tests

Flarna · 2022-12-23T11:58:10Z

What is a bit confusing here is the reuse of channels as elements of TracingChannel.
The actual data published through a channel differs depending on whether it is done via e.g. TracingChannel#traceSync or directly via channel#publish or channel#runStores.
In the TracingChannel use case it's wrapped within the ctx object.

For channels created specifically by TracingChannel (the constructor getting a string) this is quite OK, as the names already give some hint.
But with the constructor variant that takes an object, "any" set of channels could be combined. This could be confusing for consumers if these channels are used both directly and via TracingChannel.

Qard added the diagnostics_channel Issues and PRs related to diagnostics channel label Oct 10, 2022

nodejs-github-bot added the needs-ci PRs that need a full CI run. label Oct 10, 2022

Flarna mentioned this pull request Oct 12, 2022

diagnostics_channel: add storage channel #44894

Closed

Flarna reviewed Oct 12, 2022

tony-go reviewed Oct 13, 2022

rochdev mentioned this pull request Oct 13, 2022

Support for undici tracing DataDog/dd-trace-js#1615

Open

Qard force-pushed the diagnostics-channel-tracing-channel branch 2 times, most recently from 95f3198 to 0438150 Compare Oct 18, 2022

Qard mentioned this pull request Oct 18, 2022

lib: add diagnostics_channel events to module loading #44340

Open

Qard mentioned this pull request Nov 2, 2022

async_hooks: AsyncLocalStorage to diagnostics_channel integration #45277

Closed

Trott force-pushed the main branch from 2d76238 to ca3ed36 Compare Nov 12, 2022

rochdev mentioned this pull request Nov 22, 2022

DNS with less AsyncResources DataDog/dd-trace-js#2494

Open

Qard force-pushed the diagnostics-channel-tracing-channel branch 2 times, most recently from ac655d8 to f3afea1 Compare Dec 3, 2022

Qard force-pushed the diagnostics-channel-tracing-channel branch from f3afea1 to 8765e7d Compare Dec 14, 2022

Qard marked this pull request as ready for review Dec 14, 2022

Qard force-pushed the diagnostics-channel-tracing-channel branch from 8765e7d to db4c8c9 Compare Dec 15, 2022

Qard added the request-ci Add this label to start a Jenkins CI on a PR. label Dec 15, 2022

Qard force-pushed the diagnostics-channel-tracing-channel branch from db4c8c9 to 9128df4 Compare Dec 15, 2022

github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Dec 15, 2022

Qard force-pushed the diagnostics-channel-tracing-channel branch 3 times, most recently from 1d8b89e to 1215d64 Compare Dec 15, 2022

cjihrig reviewed Dec 16, 2022

Qard and others added 2 commits Dec 16, 2022

doc: improve TracingChannel docs

d74f944

Co-authored-by: Colin Ihrig <cjihrig@gmail.com>

doc: add experimental status and fix line lengths

6b38fcb

github-actions bot mentioned this pull request Dec 17, 2022

CI Reliability 2022-12-17 nodejs/reliability#460

Open

13 tasks

Qard mentioned this pull request Dec 17, 2022

AsyncLocalStorage: inconsistent propagation of nested async context when resolving outer promise #45848

Open

This was referenced Dec 18, 2022

CI Reliability 2022-12-18 nodejs/reliability#461

Open

CI Reliability 2022-12-19 nodejs/reliability#462

Open

legendecas reviewed Dec 19, 2022

RafaelGSS reviewed Dec 19, 2022

Flarna reviewed Dec 19, 2022

diagnostics_channel: Apply suggestions from code review

8219d7f

Co-authored-by: Gerhard Stöbich <deb2001-github@yahoo.de> Co-authored-by: Chengzhong Wu <legendecas@gmail.com>

Qard force-pushed the diagnostics-channel-tracing-channel branch 2 times, most recently from 7fd627f to 0544ff2 Compare Dec 19, 2022

Qard added 2 commits Dec 19, 2022

doc: avoid using context terminology

0d9f3fa

src: eliminate getRef in favour of incRef and decRef returning values

bc67924

Qard force-pushed the diagnostics-channel-tracing-channel branch from 0544ff2 to bc67924 Compare Dec 19, 2022

benjamingr approved these changes Dec 20, 2022

Flarna reviewed Dec 20, 2022

Flarna added the semver-minor PRs that contain new features and should be released in the next minor version. label Dec 20, 2022
