X Tutup
The Wayback Machine - https://web.archive.org/web/20250523134220/https://github.com/github/codeql/pull/7349
Skip to content

Dataflow: Add support for flow state #7349

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jan 14, 2022

Conversation

aschackmull
Copy link
Contributor

Flow state aka flow labels.

The type DataFlow::FlowState is added as an abstract class extending string, and 4 new predicates are added to Configurations:

  predicate isSource(Node source, FlowState state)
  predicate isSink(Node source, FlowState state)
  predicate isBarrier(Node node, FlowState state)
  predicate isAdditionalFlowStep(Node node1, FlowState state1, Node node2, FlowState state2)

The existing isSource and isSink use the default state "", and the existing steps become state-preserving and applies to all states.

Furthermore, the state can be inspected after calculating data flow using PathNode.getState().

To consider:

  • Should we change the existing isSource and isSink from abstract to { none() }?
  • Should the state be part of PathNode.toString such that it shows up in path explanations? If so, what format do we prefer?

Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial (design) comments (I haven't looked at the code yet).

@@ -3,6 +3,25 @@ private import DataFlowImplSpecific::Public
import Cached

module DataFlowImplCommonPublic {
/** A state value to track during data flow. */
abstract class FlowState extends string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a potential foot gun: For data flow configurations in libraries like

private class FieldConfiguration extends Configuration {

if someone were to add new states to a query (which I assume would be a valid thing to do), then such library analyses would have to be recomputed.

Can't FlowState just be an alias for string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. There might be one place in DataFlowImpl where this change will require a refactor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this has an obvious answer, but isn't codeql-javascript having the same potential foot gun?
Or is the plan to (later) integrate this PR with the javascript data-flow library?

/**
* A label describing the kind of information tracked by a flow configuration.
*
* There are two standard labels "data" and "taint".
* - "data" only propagates along value-preserving data flow, such as assignments
* and parameter-passing, and is the default flow source for a `DataFlow::Configuration`.
* - "taint" additionally permits flow through transformations such as string operations,
* and is the default flow source for a `TaintTracking::Configuration`.
*/
abstract class FlowLabel extends string {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to let @github/codeql-javascript answer that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the javascript data-flow library supports usage in library code - it only exists in one copy, so that should be reserved for queries. It's a good observation and a good question, though, and certainly something that needs to be taken into account if the javascript data-flow library is to allow in-library use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the javascript data-flow library supports usage in library code - it only exists in one copy, so that should be reserved for queries.

Yep. We only create new FlowLabels in query specific code, and never in library code.
We also have a test that fails if we every accidentally add more default flow labels.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is not whether you create flow labels in library code, but whether you use global data flow in library code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh. We don't use global dataflow in library code, we use type-tracking instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm supportive of the idea of using an alias class FlowState = string instead of the abstract class.

In practice you want to define a predicate for accessing the flow state constant. Once you have that predicate, you don't gain much by forcing users to declare a subclass as well. In JS this has led to various foot-gun situations with awkward solutions.

* into account in the analysis. This step is only applicable in `state1` and
* updates the flow state to `state2`.
*/
predicate isAdditionalFlowStep(Node node1, FlowState state1, Node node2, FlowState state2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't we need something like this for cases where we previously chained together configurations?

/**
 * Holds if reaching `node1` updates the state from `state1` to `state2`.
 */
predicate updatesState(Node node, FlowState state1, FlowState state2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simulate that fairly well by using isAdditionalStep(node, state1, node, state2). So I guess we could add it as a convenience, but it shouldn't be strictly necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except that will only work for taint steps and not value preserving steps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't planning on allowing state-dependent or state-changing value steps. Do you think we'll have strong use-cases for that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but I do know that we have previously chained together configurations to model the case where flow must go through a certain node. Whether the step into that node is alway a taint step I don't know, but it seems a bit restrictive to assume so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've found a use case for state-changing value steps when there's a multi-part barrier (in this case, two different methods that need to be called to secure an XML parser against XXE vulnerabilities).

A size 1 DataFlowType causes misoptimisations.
Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach LGTM. I have a few comments, and I assume (didn't check) that stages 2-4 remain in sync.

or
preservesValue = false and
additionalLocalFlowStepNodeCand1(node1, node2, config)
additionalLocalStateStep(node1, state1, node2, state2, config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't node1 and node2 be restricted via stage 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where the localFlowStep and additionalLocalFlowStep relations tend to be quite large, and therefore benefits from this restriction, I don't think it matters for additionalLocalStateStep.

@@ -994,15 +1111,23 @@ private module Stage2 {
bindingset[node, cc, config]
private LocalCc getLocalCc(NodeEx node, Cc cc, Configuration config) { any() }

bindingset[node1, state1, config]
bindingset[node2, state2, config]
private predicate localStep(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we perhaps have a similar jumpStep predicate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that would benefit readability, since stage 2 has to have a different set of bindingsets for localStep compared to stages 3 and 4, which means that localStep will be inlined in stage 2. The same sort of complexity would apply to a unified jumpStep predicate, so I think it's simpler to keep them apart. We've only unified the local steps in a single predicate in stage 2 in order to share code with stages 3 and 4.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually more thinking about the fact that the various fwdFlow/revFlow predicates could be simplified, as we could collapse three disjuncts into one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know. I was indeed thinking about readability in terms of those predicates - maybe I should have said maintainability instead. Reading a predicate in which another complex predicate is being inlined and reasoning about its expected behaviour is somewhat tricky, I think. Note also that even localStep is being duplicated in revFlow currently, so the disjuncts aren't all collapsed in that case either.
It is the fact that these 3 disjuncts have different binding patterns that makes them difficult to collapse in a simple way.

ap = ap0
or
localStep(mid, node, false, ap, config, localCc) and
localStep(mid, state0, node, state, false, ap, config, localCc) and
ap0 instanceof ApNil
)
or
exists(NodeEx mid |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had a jumpStep predicate, as suggested above, we could combine the three disjuncts below.

private newtype TDataFlowType = TTodoDataFlowType()
private newtype TDataFlowType =
TTodoDataFlowType() or
TTodoDataFlowType2()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deserves a comment. I would expect that this is also a problem for Python, where DataFlowType is also a singleton.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we shouldn't add a second constructor for Python as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double-checked the last dca run, and it appears to have successfully gathered timing-data for all the projects (even though apparently some errors were reported afterwards). Perf looks fine there, so I figured I'd leave it be.

@aschackmull
Copy link
Contributor Author

All comments should be addressed now.

@aschackmull aschackmull merged commit 0b24af9 into github:main Jan 14, 2022
@aschackmull aschackmull deleted the dataflow/state branch January 14, 2022 08:12
RasmusWL added a commit to RasmusWL/codeql that referenced this pull request Dec 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants
X Tutup