Coercion warning feedback for unary distinct outside aggregate function#13162
Coercion warning feedback for unary distinct outside aggregate function#13162bekapono wants to merge 1 commit intosqlalchemy:mainfrom
Conversation
…tion. Fixes issue#11526
Warning feedback for unary distinct outside aggregate functionOverviewAddress misuse of distinct functionality, and how that can result in runtime errors or unintended SQL statement results. The solution proposed focuses on a coercion time check that will issue a warning if misuse is detected. ContextUser (tboddyspargo) reported #11526 where the SQL statement produced with MSSQL via pyodbc was incorrect, causing a runtime error. Maintainer (zzzeek) notified the user that they misused column-level Goals & Non-goalsGoals:
Non-goals:
Proposed SolutionThe warning check is implemented in To detect column-level This logic is dependent on visitors.iterate() to traverse top-down. Doing so, we are checking Warning issued:
Alternatives ConsideredTrigger coercion check in
Top level
Creating a new role
Creating a new subclass of
|
zzzeek
left a comment
There was a problem hiding this comment.
hi -
thanks for this PR! I think this is a good start, however I think we need to change the approach to use the compiler, rather than coercion. the former should be much simpler and not have any performance issues.
| # test-case 1 of 2 for mssql dialect - issue#11526 | ||
| def test_unary_distinct_with_limit(self): | ||
| t = table("sometable", column("somecolumn")) | ||
| with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"): |
There was a problem hiding this comment.
these tests should all be modified to show correct use of column.distinct(), rather than adding a warning catch
| ) | ||
| eq_([(7,), (8,), (9,), (10,)], q.all()) | ||
| with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"): | ||
| q = fixture_session().query(distinct(User.id)).order_by(User.id) |
There was a problem hiding this comment.
fix SQL to be correct, remove expect_warnings
| def test_standalone_w_subquery(self): | ||
| User = self.classes.User | ||
| q = fixture_session().query(distinct(User.id)) | ||
| with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"): |
There was a problem hiding this comment.
fix SQL to be correct, remove expect_warnings
| "SELECT DISTINCT mytable.myid FROM mytable", | ||
| ) | ||
| with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"): | ||
| self.assert_compile( |
There was a problem hiding this comment.
fix SQL to be correct, remove expect_warnings
| "SELECT DISTINCT mytable.myid FROM mytable", | ||
| ) | ||
| with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"): | ||
| self.assert_compile( |
There was a problem hiding this comment.
fix SQL to be correct, remove expect_warnings
| eq_(r._mapping[users.c.user_name], "john") | ||
| eq_(r.user_name, "john") | ||
| with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"): | ||
| r = connection.execute( |
| def test_distinct(self, connection): | ||
| test_table = self.tables.test | ||
| with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"): | ||
| s = select(distinct(test_table.c.avalue)) |
| s = select(test_table.c.avalue.distinct()) | ||
| eq_(connection.execute(s).scalar(), 25) | ||
| with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"): | ||
| s = select(test_table.c.avalue.distinct()) |
| " use select().distinct() for full-statement DISTINCT." | ||
| ) | ||
|
|
||
| def _valid_unary_hierarchy(self, ent: ColumnElement[_T], op_name: str) -> bool: |
There was a problem hiding this comment.
I havent checked closely, but it looks like _valid_unary_hierarchy would be called for every element of a select() that is coercing on ColumnsClauseImpl, that is:
select(a, b, c)
would call _valid_unary_hierarchy three times, no matter what a, b, c were, and would then call into at least two isinstance calls, as well as a whole visitors.iterate() call, etc.
unfortunately this approach cannot be taken here, this would add an order of magnitude of performance overhead to virtually all SQLAlchemy applications.
I have to apologize because I wasn't really thinking through a complete solution here when I said this:
does "distinct()" only go inside of a SQL function ? That's all we'd have to look for. we can put "distinct()" into a special coercion category based on the roles, like FunctionDistinctRole.
Sorry, I think I was incorrect there, since the main thing we need to catch is when distinct() is used outside of a function and warn for that, and to do this at coercion would mean distinct() would have to be taken outside of the whole UnaryExpressionElement hierarchy so that it could be caught directly. So I apologize if this caused a lot of time to be spent on the wrong approach.
When we need to warn for aspects of a SQL construct that depend on traversal of the hierarchy in context, this has to happen in the compiler, since that's where we're already traversing. we also have a little bit more performance wiggle room there since compilations are cached (coercions are not).
That said here's a simple patch to get this behavior (needs polish):
diff --git a/lib/sqlalchemy/sql/compiler.py b/lib/sqlalchemy/sql/compiler.py
index f0e608e74..09cf3d196 100644
--- a/lib/sqlalchemy/sql/compiler.py
+++ b/lib/sqlalchemy/sql/compiler.py
@@ -3111,6 +3111,8 @@ class SQLCompiler(Compiled):
text: str
+ kwargs["allow_distinct_op"] = True
+
if disp:
text = disp(func, **kwargs)
else:
@@ -3258,6 +3260,13 @@ class SQLCompiler(Compiled):
disp = self._get_operator_dispatch(
unary.operator, "unary", "operator"
)
+ if unary.operator is operators.distinct_op and not kw.get(
+ "allow_distinct_op", False
+ ):
+ util.warn(
+ "can't use distinct() outside of an aggregate function "
+ "TODO improve this message"
+ )
if disp:
return disp(unary, unary.operator, **kw)
else:Im not sure if I want to make this "allow_distinct_op" or "inside_aggregate_function", but this is a starting point
|
|
||
| _aggregate_functions = { | ||
| # Basic Aggregate function listed from GenericFunctions | ||
| 'count', 'sum', 'avg', |
There was a problem hiding this comment.
we should allow DISTINCT inside of all functions for now, we can't predict what kinds of aggregate functions may exist on all platforms, and requiring functions to indicate "aggregate OK" is overkill. we are just trying to prevent the simple case of select(distinct(col)).
Description
Adds a warning when column-level
distinct()is used outside an aggregatefunction. Misuse can cause runtime errors on some dialects (e.g. MSSQL via
pyodbc) or silently produce incorrect SQL on others.
The warning check is implemented in
ColumnsClauseImpl._post_coercion, whichwalks the expression tree via
visitors.iterate()to detectUnaryExpression+distinct_opappearing outside an aggregate. This also coverscases like
select(distinct("x") + 1).This logic is dependent on visitors.iterate() to traverse top-down. Doing so, we are checking
if an aggregate function has been seen prior to arriving to
UnaryExpression+distinct_opWarning issued: "column-expression-level unary DISTINCT clause used outside
aggregate; use select().distinct() for full-statement DISTINCT."
Fixes: #11526
Checklist
This pull request is:
must include a complete example of the issue. one line code fixes without an
issue and demonstration will not be accepted.
Fixes: #<issue number>in the commit messageinclude a complete example of how the feature would look.
Fixes: #<issue number>in the commit messageHave a nice day!