Python: Shared dataflow: Content flow by yoff · Pull Request #4013 · github/codeql

yoff · 2020-08-04T12:00:47Z

This PR uses the Content concept of the shared dataflow library to model flow into and out of sequences and collections. It is currently a proof of concept, but the achieved flow is quite encouraging (see diff for coverage\test.py).

The last few commits experiment with recording the indices for tuples and the keys for dictionaries. The hope is that sources will be sparse, so the extra precision will not hurt performance.
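As a rough illustration of the idea (a toy model, not the shared dataflow library or its API), storing a value under a specific content and reading it back could be sketched like this. The names `TupleElement`, `DictElement` and `Tracked` are hypothetical and only loosely mirror the content kinds discussed in this PR.

```python
from dataclasses import dataclass


# Hypothetical content kinds, loosely mirroring the ones named in this PR.
@dataclass(frozen=True)
class TupleElement:
    index: int


@dataclass(frozen=True)
class DictElement:
    key: str


class Tracked:
    """Toy summary of a container: maps a content to the label stored under it."""

    def __init__(self):
        self.contents = {}

    def store(self, content, label):
        # "Store step": a value flows into the container under `content`.
        self.contents[content] = label

    def read(self, content):
        # "Read step": only a matching content yields the stored label.
        return self.contents.get(content)


t = Tracked()
t.store(TupleElement(0), "tainted")
t.store(TupleElement(1), "clean")
t.store(DictElement("a"), "tainted")
```

Reading `TupleElement(0)` gives back the tainted label, while reading an index that was never stored gives nothing. That is the extra precision gained by recording indices and keys.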

Performance in general will have to be evaluated for this approach.

Some issues have also surfaced, such as the extraction of comprehensions.


hvitved

A few random comments :-)

hvitved · 2020-08-13T08:49:33Z

python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll

+}
+
+/** This should not be necessary */
+predicate colocated(AstNode n1, AstNode n2) {


Remove (not used)?

Absolutely! I thought I was rid of that...

hvitved · 2020-08-13T08:58:11Z

python/ql/src/experimental/dataflow/internal/DataFlowPublic.qll

+  /** An element of a tuple at a specifik index. */
+  TTupleElementContent(int index) { exists(IntegerLiteral lit | lit.getValue() = index) } or
+  /** An element of a dictionary under a specific key. */
+  TDictionaryElementContent(string key) { exists(StrConst s | s.getS() = key) } or


If you want, you can replace it with key = any(KeyValuePair kvp).getKey().(StrConst).getS() to make it smaller.

Neat, done.

hvitved · 2020-08-13T09:04:02Z

python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll

 * content `c`.
 */
-predicate storeStep(Node node1, Content c, Node node2) { none() }
+predicate storeStep(Node nodeFrom, Content c, Node nodeTo) {


At some point you probably also want to implement the clearsContent predicate; for example, dict.clear() will clear dictionary contents. However, that is most easily implemented if you have use-use flow.

Yes, whether we should make an effort to switch to use-use flow is still an open question.
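A toy sketch of the effect such a clearing step would model, assuming a straight-line sequence of operations on a single dictionary. The function name and the operation encoding are made up for illustration and are not part of the library.

```python
def reads_seeing_taint(ops):
    """Return positions of reads that still see a tainted key.

    `ops` is a list of operations in program order:
    ("store", key), ("clear",) or ("read", key).
    """
    tainted_keys = set()
    hits = []
    for position, op in enumerate(ops):
        kind = op[0]
        if kind == "store":
            tainted_keys.add(op[1])
        elif kind == "clear":
            # Models d.clear(): nothing stored earlier survives this point.
            tainted_keys.clear()
        elif kind == "read" and op[1] in tainted_keys:
            hits.append(position)
    return hits


program = [("store", "a"), ("read", "a"), ("clear",), ("read", "a")]
hits = reads_seeing_taint(program)
```

Only the read before the clear is reported. Getting this right depends on knowing the order in which the uses of the dictionary occur.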

hvitved · 2020-08-13T09:14:26Z

python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll

+  nodeTo.(CfgNode).getNode().getNode().(DictComp).getElt() = nodeFrom.(CfgNode).getNode().getNode() and
+  c instanceof DictionaryElementAnyContent


An alternative here is to introduce synthetic nodes that record the key that was read, so you get more precision. So something like this will work:

dict["a"] = "taint"
dict["b"] = "no taint"
dict = {k:v+"" for (k,v) in dict.items()}
Sink(dict["b"]) // not tainted

You will then add a readStep from dict with content TDictionaryElementContent(key) to TSynthesizedDictCompNode(key, v), a local step from TSynthesizedDictCompNode(key, v) to TSynthesizedDictCompNode(key, v+"") and a storeStep with content TDictionaryElementContent(key) from TSynthesizedDictCompNode(key, v+"") to the entire comprehension.

This is an interesting use of synthetic nodes that we should keep in mind. It sparked thoughts that I wrote down here. It is probably too much precision to strive for right now, but perhaps we should add some test cases to record the ambition :-)
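To record the ambition in miniature, here is a toy comparison (not QL, and not how the library represents nodes) of tracking taint per key through a dictionary comprehension versus collapsing everything into "any element". Both function names are hypothetical.

```python
def comprehension_per_key(taint_by_key):
    # Precise model: each key keeps its own taint status through the comprehension.
    return {key: tainted for key, tainted in taint_by_key.items()}


def comprehension_any_content(taint_by_key):
    # Coarse model: if any element is tainted, every element of the result is.
    merged = any(taint_by_key.values())
    return {key: merged for key in taint_by_key}


before = {"a": True, "b": False}  # d["a"] = "taint"; d["b"] = "no taint"
precise = comprehension_per_key(before)
coarse = comprehension_any_content(before)
```

Under the precise model `"b"` stays untainted. Under the coarse model it is reported as tainted, which is the false positive the synthetic nodes would avoid.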

hvitved · 2020-08-13T09:17:10Z

python/ql/src/experimental/dataflow/internal/DataFlowPublic.qll

+  /** An element of a set. */
+  TSetElementContent() or
+  /** An element of a tuple at a specifik index. */
+  TTupleElementContent(int index) { exists(IntegerLiteral lit | lit.getValue() = index) } or


Can be replaced with exists(any(TupleNode tn).getElement(index)) to make it smaller.

Neat, done.

tausbn

I have a few comments, but otherwise I think this looks good. I'm impressed with what content flow manages to give us "for free".

tausbn · 2020-08-12T13:34:35Z

python/ql/src/experimental/dataflow/internal/DataFlowPublic.qll

+  /** An element of a set. */
+  TSetElementContent() or
+  /** An element of a tuple at a specifik index. */
+  TTupleElementContent(int index) { exists(IntegerLiteral lit | lit.getValue() = index) } or


I think we can probably limit these to only values that are actually used for indexing. This is probably best done by defining a separate class (e.g. named IntegerIndex) the instances of which are literals that appear in a subscripting operation. (I imagine this will not capture things like i = 5; y = list[i], but I don't know that this works in the present version anyway.)

I use the expression suggested by Tom. Indeed it will not catch your example and indeed that is not handled at present either.
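The restriction to literals that are actually used for indexing can be illustrated with Python's own `ast` module. This is only an analogy to the QL restriction, and `literal_int_indices` is a made-up helper.

```python
import ast


def literal_int_indices(source):
    """Collect integer literals that appear directly as a subscript index."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Subscript):
            index = node.slice
            # Python 3.8 wraps the index in ast.Index; later versions do not.
            if index.__class__.__name__ == "Index":
                index = index.value
            if (
                isinstance(index, ast.Constant)
                and isinstance(index.value, int)
                and not isinstance(index.value, bool)
            ):
                found.add(index.value)
    return found


code = """
t = (10, 20, 30)
a = t[0]
i = 5
y = t[i]
unused = 42
"""
indices = literal_int_indices(code)
```

The literal `0` is collected, while `5` and `42` are not: `5` reaches the subscript only through the variable `i`, which matches the limitation noted in the comment above.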

tausbn · 2020-08-12T13:36:14Z

python/ql/src/experimental/dataflow/internal/DataFlowPublic.qll

+  /** An element of a tuple at a specifik index. */
+  TTupleElementContent(int index) { exists(IntegerLiteral lit | lit.getValue() = index) } or
+  /** An element of a dictionary under a specific key. */
+  TDictionaryElementContent(string key) { exists(StrConst s | s.getS() = key) } or


Again, I think this can be usefully restricted (and also extended). I would disregard all string constants that are not used in subscripting or dictionary-building operations, but also potentially add anything that appears as the name of a keyword argument, to capture things like d = dict(x=1, y=2).

I use the expression suggested by Tom augmented to include names of keyword arguments.
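A similar sketch for dictionary keys, again using Python's `ast` module as a stand-in for the QL classes. The helper `candidate_dict_keys` is hypothetical. It collects string constants used as keys in dictionary displays or as subscripts, plus the names of keyword arguments.

```python
import ast


def candidate_dict_keys(source):
    """Collect strings that plausibly act as dictionary keys."""
    keys = set()

    def add_if_str(expr):
        if isinstance(expr, ast.Constant) and isinstance(expr.value, str):
            keys.add(expr.value)

    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            for key in node.keys:  # a key is None for ** unpacking
                add_if_str(key)
        elif isinstance(node, ast.Subscript):
            index = node.slice
            # Python 3.8 wraps the index in ast.Index; later versions do not.
            if index.__class__.__name__ == "Index":
                index = index.value
            add_if_str(index)
        elif isinstance(node, ast.keyword) and node.arg is not None:
            # Keyword argument names, to cover things like dict(x=1, y=2).
            keys.add(node.arg)
    return keys


code = '''
d1 = {"a": 1}
d2 = dict(x=1, y=2)
v = d1["b"]
message = "unrelated string"
'''
keys = candidate_dict_keys(code)
```

The unrelated string constant is ignored. Note that this sketch takes keyword names from every call, not only calls to `dict`, so it over-approximates in that direction.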


tausbn · 2020-08-13T10:40:19Z

python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll

+  //   c denotes list or set
+  exists(CallNode call, AttrNode a |
+    call.getFunction() = a and
+    a.getName() = "pop" and // TODO: Should be made more robust, like Value::named("set.pop").getACall()


I think doing it this way is probably plenty robust.

The way I see it, if dataflow has managed to track a list/set to this point, then I would feel pretty safe in assuming the pop method does what we imply here. Either way, I wouldn't want to introduce a dependence on Value, though it should be pretty safe in this case.

I have updated the comment(s).


RasmusWL

If I understood correctly, here are a few doc changes that will make things clearer :)


RasmusWL

Besides my question about comprehensionReadStep, LGTM.

Minor nitpick: subscriptionReadStep sounds more like a monthly subscription, than indexing into an array with a subscript 😄


RasmusWL · 2020-08-20T13:18:39Z

python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll

+  //   c denotes element of list or set
+  exists(For f, Comp comp |
+    f = getCompFor(comp) and
+    nodeFrom.getNode().(SequenceNode).getNode() = getCompIter(comp) and


This will only work for x+1 for x in <sequence literal>, such as x+1 for x in [1,2,3], and not for

xs = [1,2,3]
ys = [x+1 for x in xs]

Is that on purpose? (I can see the tests only use sequence literals, so the tests won't tell you anything about this.)

Thanks, no, that is not on purpose. I meant to change this once I had a more reliable implementation of getCompIter (which I hope we do now, but I would love input on how to do it properly, that still feels like it might need extractor changes).
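The difference between the two cases is visible in Python's own syntax tree. This is not the CodeQL extractor's representation, and `comprehension_iterables` is a made-up helper.

```python
import ast


def comprehension_iterables(source):
    """Return the AST class name of each list comprehension's first iterable."""
    kinds = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ListComp):
            kinds.append(type(node.generators[0].iter).__name__)
    return kinds


code = """
ys1 = [x + 1 for x in [1, 2, 3]]
xs = [1, 2, 3]
ys2 = [x + 1 for x in xs]
"""
kinds = comprehension_iterables(code)
```

One iterable is a `List` display and the other is a `Name`. A step that only matches sequence displays sees the first comprehension and misses the second.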

I looked into how list comprehensions are exposed through QL. If we use the ListComp class directly instead of Comp, we can get the information we want (instead of your helper methods)

import python

from ListComp lc
select lc, lc.getIterable(), lc.getIterationVariable(0).getAStore()

Basically, what we need is Comp.getIterable(). The method is available on all subclasses of Comp, so it just needs to be exposed in the same way as Comp.getFunction().

I didn't look into how nested comprehensions work, so I would test that out before blindly accepting that we should always use 0 for the index 😄

P.S. The getCompFor functionality is already implemented in Comp.getNthInnerLoop, which is just a private member predicate, but I'm not sure you would even need that with the functionality above.

Thanks for digging this out. As discussed, nested comprehensions will be dealt with in a separate PR.
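For a sense of why index 0 deserves a test, consider how Python's `ast` module (again, not the extractor) represents a comprehension with several `for` clauses. The helper name is hypothetical, and the sketch assumes simple variable names for both targets and iterables.

```python
import ast


def generator_pairs(source):
    """List (target, iterable) name pairs of the first list comprehension found."""
    comp = next(
        node for node in ast.walk(ast.parse(source)) if isinstance(node, ast.ListComp)
    )
    # Assumes every target and iterable is a plain name, as in the example below.
    return [(gen.target.id, gen.iter.id) for gen in comp.generators]


pairs = generator_pairs("zs = [x * y for x in xs for y in ys]")
```

Here the clauses are kept in order, with index 0 being the outermost loop. Whether the QL classes expose inner loops the same way is exactly what would need checking.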


RasmusWL

As discussed, nested comprehensions will be dealt with in a separate PR.

With this in mind, this LGTM.


RasmusWL · 2020-08-25T13:10:49Z

@tausbn since you have requested changes, this PR is blocked right now. Just an FYI, since I recall that the merging policy has been more relaxed at some point.

tausbn

Looks good to me!

yoff added 6 commits July 31, 2020 15:45

Python: Field flow for sequence elements

b21da86

only from displays so far

Merge branch 'master' of github.com:github/codeql into SharedDataflow…

6debc48

…_SequenceFlow

Python: update test expectations and annotations

4a8d532

Python: Simplyfy sequence stores

f21777c

Python: Comprehension stores

9d09b4c

Python: More easy-to-get content flow

9312b42

There are some things that should be rewritten, though, but it may involve the extractor

yoff added the Python label Aug 4, 2020

yoff added 5 commits August 5, 2020 14:16

Python: format ql

2639e68

Python: More precise dataflow for tuples

7c23559

(and dictionaries, but that is not fleshed out)

Python: Track dictionary keys

e77ceaf

Also, less hacky comprehension, but I think we still want to fix the extractor

Python: format ql

ce86a8b

Python: update test expectation

6dfa2ea

yoff marked this pull request as ready for review August 13, 2020 08:34

yoff requested a review from a team as a code owner August 13, 2020 08:34

hvitved reviewed Aug 13, 2020


tausbn requested changes Aug 14, 2020


adityasharad changed the base branch from master to main August 14, 2020 18:33

RasmusWL requested changes Aug 17, 2020


yoff and others added 7 commits August 19, 2020 08:01

Update python/ql/src/experimental/dataflow/internal/DataFlowPublic.qll

43a5e74

Co-authored-by: Taus <tausbn@github.com>

Update python/ql/src/experimental/dataflow/internal/DataFlowPublic.qll

1c3b945

Co-authored-by: Taus <tausbn@github.com>

Update python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll

8fbb447

Co-authored-by: Taus <tausbn@github.com>

Update python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll

06bd436

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

Update python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll

5e84754

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

Python: Address review comments

176aa06

Merge branch 'main' of github.com:github/codeql into SharedDataflow_S…

bd53a71

…equenceFlow

yoff requested review from RasmusWL, hvitved and tausbn August 19, 2020 10:23

RasmusWL reviewed Aug 20, 2020


Apply suggestions from code review

bfd9c08

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

yoff added 3 commits August 21, 2020 10:04

Python: Fix bug pointed out by reviewer

f9b1c5e

Python: Test flow into conprehension

ccff84d

Python: Support set literals.

e1343c7

yoff requested a review from RasmusWL August 21, 2020 09:15

RasmusWL mentioned this pull request Aug 24, 2020

Python: Additional taint steps for string methods #4124

Merged

RasmusWL approved these changes Aug 24, 2020


Merge branch 'main' of github.com:github/codeql into SharedDataflow_S…

2608509

…equenceFlow

tausbn approved these changes Aug 25, 2020


tausbn merged commit 000fa33 into github:main Aug 25, 2020
