Python: Shared dataflow: Content flow#4013
Conversation
only from displays so far
There are some things that should be rewritten, though, but it may involve the extractor
(and dictionaries, but that is not fleshed out)
Also, less hacky comprehension, but I think we still want to fix the extractor
hvitved
left a comment
There was a problem hiding this comment.
A few random comments :-)
| } | ||
|
|
||
| /** This should not be necessary */ | ||
| predicate colocated(AstNode n1, AstNode n2) { |
There was a problem hiding this comment.
Absolutely! I thought I was rid of that...
| /** An element of a tuple at a specifik index. */ | ||
| TTupleElementContent(int index) { exists(IntegerLiteral lit | lit.getValue() = index) } or | ||
| /** An element of a dictionary under a specific key. */ | ||
| TDictionaryElementContent(string key) { exists(StrConst s | s.getS() = key) } or |
There was a problem hiding this comment.
If you want, you can replace it with key = any(KeyValuePair kvp).getKey().(StrConst).getS() to make it smaller.
| * content `c`. | ||
| */ | ||
| predicate storeStep(Node node1, Content c, Node node2) { none() } | ||
| predicate storeStep(Node nodeFrom, Content c, Node nodeTo) { |
There was a problem hiding this comment.
At some point you probably also want to implement the clearsContent predicate, for example dict.clear() will clear dictionary contents. However, that is easiest implemented if you have use-use flow.
There was a problem hiding this comment.
Yes, it is still outstanding if we should make an effort to switch to use-use flow.
| nodeTo.(CfgNode).getNode().getNode().(DictComp).getElt() = nodeFrom.(CfgNode).getNode().getNode() and | ||
| c instanceof DictionaryElementAnyContent |
There was a problem hiding this comment.
An alternative here is to introduce synthetic nodes that record the key that was read, so you get more precision. So something like this will work:
dict["a"] = "taint"
dict["b"] = "no taint"
dict = {k:v+"" for (k,v) in dict.items()}
Sink(dict["b"]) // not tainted
You will then add a readStep from dict with content TDictionaryElementContent(key) to TSynthesizedDictCompNode(key, v), a local step from TSynthesizedDictCompNode(key, v) to TSynthesizedDictCompNode(key, v+"") and a storeStep with content TDictionaryElementContent(key) from TSynthesizedDictCompNode(key, v+"") to the entire comprehension.
There was a problem hiding this comment.
This is an interesting use of synthetic nodes that we should keep in mind. It sparked thoughts that I wrote down here. It is probably too much precision to strive for right now, but perhaps we should add some test cases to record the ambition :-)
| /** An element of a set. */ | ||
| TSetElementContent() or | ||
| /** An element of a tuple at a specifik index. */ | ||
| TTupleElementContent(int index) { exists(IntegerLiteral lit | lit.getValue() = index) } or |
There was a problem hiding this comment.
Can be replaced with exists(any(TupleNode tn).getElement(index)) to make it smaller.
tausbn
left a comment
There was a problem hiding this comment.
I have a few comments, but otherwise I think this looks good. I'm impressed with what content flow manages to give us "for free".
| /** An element of a set. */ | ||
| TSetElementContent() or | ||
| /** An element of a tuple at a specifik index. */ | ||
| TTupleElementContent(int index) { exists(IntegerLiteral lit | lit.getValue() = index) } or |
There was a problem hiding this comment.
I think we can probably limit these to only values that are actually used for indexing. This is probably best done by defining a separate class (e.g. named IntegerIndex) the instances of which are literals that appear in a subscripting operation. (I imagine this will not capture things like i = 5; y = list[i], but I don't know that this works in the present version anyway.)
There was a problem hiding this comment.
I use the expression suggested by Tom. Indeed it will not catch your example and indeed that is not handled at present either.
| /** An element of a tuple at a specifik index. */ | ||
| TTupleElementContent(int index) { exists(IntegerLiteral lit | lit.getValue() = index) } or | ||
| /** An element of a dictionary under a specific key. */ | ||
| TDictionaryElementContent(string key) { exists(StrConst s | s.getS() = key) } or |
There was a problem hiding this comment.
Again, I think this can be usefully restricted (and also extended). I would disregard all string constants that are not used in subscripting or dictionary-building operations, but also potentially add anything that appears as the name of a keyword argument, to capture things like d = dict(x=1, y=2).
There was a problem hiding this comment.
I use the expression suggested by Tom augmented to include names of keyword arguments.
python/ql/src/experimental/dataflow/internal/DataFlowPublic.qll
Outdated
Show resolved
Hide resolved
python/ql/src/experimental/dataflow/internal/DataFlowPublic.qll
Outdated
Show resolved
Hide resolved
| // c denotes list or set | ||
| exists(CallNode call, AttrNode a | | ||
| call.getFunction() = a and | ||
| a.getName() = "pop" and // TODO: Should be made more robust, like Value::named("set.pop").getACall() |
There was a problem hiding this comment.
I think doing it this way is probably plenty robust.
The way I see it, if dataflow has managed to track a list/set to this point, then I would feel pretty safe in assuming the pop method does what we imply here. Either way, I wouldn't want to introduce a dependence on Value, though it should be pretty safe in this case.
There was a problem hiding this comment.
I have updated the comment(s).
python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
RasmusWL
left a comment
There was a problem hiding this comment.
If I understood correctly, here's a few doc changes that will make things clearer :)
python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
Co-authored-by: Taus <tausbn@github.com>
Co-authored-by: Taus <tausbn@github.com>
Co-authored-by: Taus <tausbn@github.com>
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
RasmusWL
left a comment
There was a problem hiding this comment.
Besides my question about comprehensionReadStep, LGTM.
Minor nitpick: subscriptionReadStep sounds more like a monthly subscription, than indexing into an array with a subscript 😄
python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/src/experimental/dataflow/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
| // c denotes element of list or set | ||
| exists(For f, Comp comp | | ||
| f = getCompFor(comp) and | ||
| nodeFrom.getNode().(SequenceNode).getNode() = getCompIter(comp) and |
There was a problem hiding this comment.
This will only work for
x+1 for x in <sequence literal>, such as x+1 for x in [1,2,3] and not for
xs = [1,2,3]
ys = [x+1 for x in xs]
is that by purpose? (I can see the tests only use sequence literals, so tests won't tell you anything about this)
There was a problem hiding this comment.
Thanks, no, that is not on purpose. I meant to change this once I had a more reliable implementation of getCompIter (which I hope we do now, but I would love input on how to do it properly, that still feels like it might need extractor changes).
There was a problem hiding this comment.
I looked into how list comprehensions are exposed through QL. If we use the ListComp class directly instead of Comp, we can get the information we want (instead of your helper methods)
from ListComp lc
select lc, lc.getIterable(), lc.getIterationVariable(0).getAStore()Basically, what we need is Comp.getIterable(). The method is available on all subclasses of Comp, so it just needs to be exposed in the same way as Comp.getFunction().
I didn't look into how nested comprehensions work, so I would test that out before blindly accepting that we should always use 0 for the index 😄
P.S. The getCompFor functionality is already implemented in Comp.getNthInnerLoop, which is just a private member predicate, but I'm not sure you would even need that with the functionality above.
There was a problem hiding this comment.
Thanks for digging this out. As discussed, nested comprehensions will be dealt with in a separate PR.
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
RasmusWL
left a comment
There was a problem hiding this comment.
As discussed, nested comprehensions will be dealt with in a separate PR.
With this in mind, this LGTM.
|
@tausbn since you have requested changes, this PR is blocked right now. Just a FYI, since I recall that merging policy has been more relaxed at some point. |


This PR uses the
Contentconcept of the shared dataflow library to model flow into and out of sequences and collections. It is currently a proof of concept, but the achieved flow is quite encouraging (see diff forcoverage\test.py).The last few commits experiments with recording the indices for tuples and the keys for dictionaries. The hope being that sources will be sparse and so the extra precision will not hurt performance.
Perfomance in general will have to be evaluated for this approach.
Some issues have also surfaced, such as the extraction of comprehensions.