X Tutup
The Wayback Machine - https://web.archive.org/web/20260304220959/https://github.com/github/codeql/pull/4013
Skip to content

Python: Shared dataflow: Content flow#4013

Merged
tausbn merged 23 commits intogithub:mainfrom
yoff:SharedDataflow_SequenceFlow
Aug 25, 2020
Merged

Python: Shared dataflow: Content flow#4013
tausbn merged 23 commits intogithub:mainfrom
yoff:SharedDataflow_SequenceFlow

Conversation

@yoff
Copy link
Contributor

@yoff yoff commented Aug 4, 2020

This PR uses the Content concept of the shared dataflow library to model flow into and out of sequences and collections. It is currently a proof of concept, but the achieved flow is quite encouraging (see diff for coverage\test.py).

The last few commits experiments with recording the indices for tuples and the keys for dictionaries. The hope being that sources will be sparse and so the extra precision will not hurt performance.

Perfomance in general will have to be evaluated for this approach.

Some issues have also surfaced, such as the extraction of comprehensions.

@yoff yoff added the Python label Aug 4, 2020
yoff added 5 commits August 5, 2020 14:16
(and dictionaries, but that is not fleshed out)
Also, less hacky comprehension,
but I think we still want to fix the extractor
@yoff yoff marked this pull request as ready for review August 13, 2020 08:34
@yoff yoff requested a review from a team as a code owner August 13, 2020 08:34
Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few random comments :-)

}

/** This should not be necessary */
predicate colocated(AstNode n1, AstNode n2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove (not used)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely! I thought I was rid of that...

/** An element of a tuple at a specifik index. */
TTupleElementContent(int index) { exists(IntegerLiteral lit | lit.getValue() = index) } or
/** An element of a dictionary under a specific key. */
TDictionaryElementContent(string key) { exists(StrConst s | s.getS() = key) } or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want, you can replace it with key = any(KeyValuePair kvp).getKey().(StrConst).getS() to make it smaller.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat, done.

* content `c`.
*/
predicate storeStep(Node node1, Content c, Node node2) { none() }
predicate storeStep(Node nodeFrom, Content c, Node nodeTo) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point you probably also want to implement the clearsContent predicate, for example dict.clear() will clear dictionary contents. However, that is easiest implemented if you have use-use flow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is still outstanding if we should make an effort to switch to use-use flow.

Comment on lines +268 to +269
nodeTo.(CfgNode).getNode().getNode().(DictComp).getElt() = nodeFrom.(CfgNode).getNode().getNode() and
c instanceof DictionaryElementAnyContent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative here is to introduce synthetic nodes that record the key that was read, so you get more precision. So something like this will work:

dict["a"] = "taint"
dict["b"] = "no taint"
dict = {k:v+"" for (k,v) in dict.items()}
Sink(dict["b"]) // not tainted

You will then add a readStep from dict with content TDictionaryElementContent(key) to TSynthesizedDictCompNode(key, v), a local step from TSynthesizedDictCompNode(key, v) to TSynthesizedDictCompNode(key, v+"") and a storeStep with content TDictionaryElementContent(key) from TSynthesizedDictCompNode(key, v+"") to the entire comprehension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting use of synthetic nodes that we should keep in mind. It sparked thoughts that I wrote down here. It is probably too much precision to strive for right now, but perhaps we should add some test cases to record the ambition :-)

/** An element of a set. */
TSetElementContent() or
/** An element of a tuple at a specifik index. */
TTupleElementContent(int index) { exists(IntegerLiteral lit | lit.getValue() = index) } or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be replaced with exists(any(TupleNode tn).getElement(index)) to make it smaller.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat, done.

Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few comments, but otherwise I think this looks good. I'm impressed with what content flow manages to give us "for free".

/** An element of a set. */
TSetElementContent() or
/** An element of a tuple at a specifik index. */
TTupleElementContent(int index) { exists(IntegerLiteral lit | lit.getValue() = index) } or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can probably limit these to only values that are actually used for indexing. This is probably best done by defining a separate class (e.g. named IntegerIndex) the instances of which are literals that appear in a subscripting operation. (I imagine this will not capture things like i = 5; y = list[i], but I don't know that this works in the present version anyway.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use the expression suggested by Tom. Indeed it will not catch your example and indeed that is not handled at present either.

/** An element of a tuple at a specifik index. */
TTupleElementContent(int index) { exists(IntegerLiteral lit | lit.getValue() = index) } or
/** An element of a dictionary under a specific key. */
TDictionaryElementContent(string key) { exists(StrConst s | s.getS() = key) } or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think this can be usefully restricted (and also extended). I would disregard all string constants that are not used in subscripting or dictionary-building operations, but also potentially add anything that appears as the name of a keyword argument, to capture things like d = dict(x=1, y=2).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use the expression suggested by Tom augmented to include names of keyword arguments.

// c denotes list or set
exists(CallNode call, AttrNode a |
call.getFunction() = a and
a.getName() = "pop" and // TODO: Should be made more robust, like Value::named("set.pop").getACall()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think doing it this way is probably plenty robust.

The way I see it, if dataflow has managed to track a list/set to this point, then I would feel pretty safe in assuming the pop method does what we imply here. Either way, I wouldn't want to introduce a dependence on Value, though it should be pretty safe in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the comment(s).

@adityasharad adityasharad changed the base branch from master to main August 14, 2020 18:33
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood correctly, here's a few doc changes that will make things clearer :)

@yoff yoff requested review from RasmusWL, hvitved and tausbn August 19, 2020 10:23
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides my question about comprehensionReadStep, LGTM.

Minor nitpick: subscriptionReadStep sounds more like a monthly subscription, than indexing into an array with a subscript 😄

// c denotes element of list or set
exists(For f, Comp comp |
f = getCompFor(comp) and
nodeFrom.getNode().(SequenceNode).getNode() = getCompIter(comp) and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only work for
x+1 for x in <sequence literal>, such as x+1 for x in [1,2,3] and not for

xs = [1,2,3]
ys = [x+1 for x in xs]

is that by purpose? (I can see the tests only use sequence literals, so tests won't tell you anything about this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, no, that is not on purpose. I meant to change this once I had a more reliable implementation of getCompIter (which I hope we do now, but I would love input on how to do it properly, that still feels like it might need extractor changes).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into how list comprehensions are exposed through QL. If we use the ListComp class directly instead of Comp, we can get the information we want (instead of your helper methods)

from ListComp lc
select lc, lc.getIterable(), lc.getIterationVariable(0).getAStore()

Basically, what we need is Comp.getIterable(). The method is available on all subclasses of Comp, so it just needs to be exposed in the same way as Comp.getFunction().

I didn't look into how nested comprehensions work, so I would test that out before blindly accepting that we should always use 0 for the index 😄

P.S. The getCompFor functionality is already implemented in Comp.getNthInnerLoop, which is just a private member predicate, but I'm not sure you would even need that with the functionality above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging this out. As discussed, nested comprehensions will be dealt with in a separate PR.

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, nested comprehensions will be dealt with in a separate PR.

With this in mind, this LGTM.

@RasmusWL
Copy link
Member

@tausbn since you have requested changes, this PR is blocked right now. Just a FYI, since I recall that merging policy has been more relaxed at some point.

Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@tausbn tausbn merged commit 000fa33 into github:main Aug 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

X Tutup