-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Dataflow: Add support for flow state #7349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial (design) comments (I haven't looked at the code yet).
| @@ -3,6 +3,25 @@ private import DataFlowImplSpecific::Public | |||
| import Cached | |||
|
|
|||
| module DataFlowImplCommonPublic { | |||
| /** A state value to track during data flow. */ | |||
| abstract class FlowState extends string { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a potential foot gun: For data flow configurations in libraries like
| private class FieldConfiguration extends Configuration { |
if someone were to add new states to a query (which I assume would be a valid thing to do), then such library analyses would have to be recomputed.
Can't FlowState just be an alias for string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. There might be one place in DataFlowImpl where this change will require a refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this has an obvious answer, but isn't codeql-javascript having the same potential foot gun?
Or is the plan to (later) integrate this PR with the javascript data-flow library?
codeql/javascript/ql/lib/semmle/javascript/dataflow/Configuration.qll
Lines 284 to 293 in 15caaa7
| /** | |
| * A label describing the kind of information tracked by a flow configuration. | |
| * | |
| * There are two standard labels "data" and "taint". | |
| * - "data" only propagates along value-preserving data flow, such as assignments | |
| * and parameter-passing, and is the default flow source for a `DataFlow::Configuration`. | |
| * - "taint" additionally permits flow through transformations such as string operations, | |
| * and is the default flow source for a `TaintTracking::Configuration`. | |
| */ | |
| abstract class FlowLabel extends string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to let @github/codeql-javascript answer that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the javascript data-flow library supports usage in library code - it only exists in one copy, so that should be reserved for queries. It's a good observation and a good question, though, and certainly something that needs to be taken into account if the javascript data-flow library is to allow in-library use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the javascript data-flow library supports usage in library code - it only exists in one copy, so that should be reserved for queries.
Yep. We only create new FlowLabels in query specific code, and never in library code.
We also have a test that fails if we every accidentally add more default flow labels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is not whether you create flow labels in library code, but whether you use global data flow in library code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh. We don't use global dataflow in library code, we use type-tracking instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm supportive of the idea of using an alias class FlowState = string instead of the abstract class.
In practice you want to define a predicate for accessing the flow state constant. Once you have that predicate, you don't gain much by forcing users to declare a subclass as well. In JS this has led to various foot-gun situations with awkward solutions.
| * into account in the analysis. This step is only applicable in `state1` and | ||
| * updates the flow state to `state2`. | ||
| */ | ||
| predicate isAdditionalFlowStep(Node node1, FlowState state1, Node node2, FlowState state2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't we need something like this for cases where we previously chained together configurations?
/**
* Holds if reaching `node1` updates the state from `state1` to `state2`.
*/
predicate updatesState(Node node, FlowState state1, FlowState state2)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simulate that fairly well by using isAdditionalStep(node, state1, node, state2). So I guess we could add it as a convenience, but it shouldn't be strictly necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except that will only work for taint steps and not value preserving steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't planning on allowing state-dependent or state-changing value steps. Do you think we'll have strong use-cases for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, but I do know that we have previously chained together configurations to model the case where flow must go through a certain node. Whether the step into that node is alway a taint step I don't know, but it seems a bit restrictive to assume so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've found a use case for state-changing value steps when there's a multi-part barrier (in this case, two different methods that need to be called to secure an XML parser against XXE vulnerabilities).
2f6e5b7 to
ef714f7
Compare
A size 1 DataFlowType causes misoptimisations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach LGTM. I have a few comments, and I assume (didn't check) that stages 2-4 remain in sync.
| or | ||
| preservesValue = false and | ||
| additionalLocalFlowStepNodeCand1(node1, node2, config) | ||
| additionalLocalStateStep(node1, state1, node2, state2, config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't node1 and node2 be restricted via stage 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where the localFlowStep and additionalLocalFlowStep relations tend to be quite large, and therefore benefits from this restriction, I don't think it matters for additionalLocalStateStep.
| @@ -994,15 +1111,23 @@ private module Stage2 { | |||
| bindingset[node, cc, config] | |||
| private LocalCc getLocalCc(NodeEx node, Cc cc, Configuration config) { any() } | |||
|
|
|||
| bindingset[node1, state1, config] | |||
| bindingset[node2, state2, config] | |||
| private predicate localStep( | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we perhaps have a similar jumpStep predicate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that would benefit readability, since stage 2 has to have a different set of bindingsets for localStep compared to stages 3 and 4, which means that localStep will be inlined in stage 2. The same sort of complexity would apply to a unified jumpStep predicate, so I think it's simpler to keep them apart. We've only unified the local steps in a single predicate in stage 2 in order to share code with stages 3 and 4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually more thinking about the fact that the various fwdFlow/revFlow predicates could be simplified, as we could collapse three disjuncts into one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know. I was indeed thinking about readability in terms of those predicates - maybe I should have said maintainability instead. Reading a predicate in which another complex predicate is being inlined and reasoning about its expected behaviour is somewhat tricky, I think. Note also that even localStep is being duplicated in revFlow currently, so the disjuncts aren't all collapsed in that case either.
It is the fact that these 3 disjuncts have different binding patterns that makes them difficult to collapse in a simple way.
| ap = ap0 | ||
| or | ||
| localStep(mid, node, false, ap, config, localCc) and | ||
| localStep(mid, state0, node, state, false, ap, config, localCc) and | ||
| ap0 instanceof ApNil | ||
| ) | ||
| or | ||
| exists(NodeEx mid | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we had a jumpStep predicate, as suggested above, we could combine the three disjuncts below.
| private newtype TDataFlowType = TTodoDataFlowType() | ||
| private newtype TDataFlowType = | ||
| TTodoDataFlowType() or | ||
| TTodoDataFlowType2() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This deserves a comment. I would expect that this is also a problem for Python, where DataFlowType is also a singleton.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure we shouldn't add a second constructor for Python as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double-checked the last dca run, and it appears to have successfully gathered timing-data for all the projects (even though apparently some errors were reported afterwards). Perf looks fine there, so I figured I'd leave it be.
|
All comments should be addressed now. |
See discussion at github#7349 (comment)


Flow state aka flow labels.
The type
DataFlow::FlowStateis added as an abstract class extending string, and 4 new predicates are added toConfigurations:The existing
isSourceandisSinkuse the default state"", and the existing steps become state-preserving and applies to all states.Furthermore, the state can be inspected after calculating data flow using
PathNode.getState().To consider:
isSourceandisSinkfromabstractto{ none() }?PathNode.toStringsuch that it shows up in path explanations? If so, what format do we prefer?