X Tutup
Skip to content

Coercion warning feedback for unary distinct outside aggregate function#13162

Open
bekapono wants to merge 1 commit intosqlalchemy:mainfrom
bekapono:distinct-coercion
Open

Coercion warning feedback for unary distinct outside aggregate function#13162
bekapono wants to merge 1 commit intosqlalchemy:mainfrom
bekapono:distinct-coercion

Conversation

@bekapono
Copy link

Description

Adds a warning when column-level distinct() is used outside an aggregate
function. Misuse can cause runtime errors on some dialects (e.g. MSSQL via
pyodbc) or silently produce incorrect SQL on others.

The warning check is implemented in ColumnsClauseImpl._post_coercion, which
walks the expression tree via visitors.iterate() to detect
UnaryExpression+distinct_op appearing outside an aggregate. This also covers
cases like select(distinct("x") + 1).

This logic is dependent on visitors.iterate() to traverse top-down. Doing so, we are checking
if an aggregate function has been seen prior to arriving to UnaryExpression+distinct_op

Warning issued: "column-expression-level unary DISTINCT clause used outside
aggregate; use select().distinct() for full-statement DISTINCT."

Fixes: #11526

Checklist

This pull request is:

  • A documentation / typographical / small typing error fix
    • Good to go, no issue or tests are needed
  • A short code fix
    • please include the issue number, and create an issue if none exists, which
      must include a complete example of the issue. one line code fixes without an
      issue and demonstration will not be accepted.
    • Please include: Fixes: #<issue number> in the commit message
    • please include tests. one line code fixes without tests will not be accepted.
  • A new feature implementation
    • please include the issue number, and create an issue if none exists, which must
      include a complete example of how the feature would look.
    • Please include: Fixes: #<issue number> in the commit message
    • please include tests.

Have a nice day!

@bekapono
Copy link
Author

Warning feedback for unary distinct outside aggregate function

Overview

Address misuse of distinct functionality, and how that can result in runtime errors or unintended SQL statement results. The solution proposed focuses on a coercion time check that will issue a warning if misuse is detected.

Context

User (tboddyspargo) reported #11526 where the SQL statement produced with MSSQL via pyodbc was incorrect, causing a runtime error. Maintainer (zzzeek) notified the user that they misused column-level distinct() and should instead use statement-level distinct(). Misuse of distinct() can cause runtime errors on certain dialects, or produce silent incorrect SQL statements on others. In #11518, two approaches were considered: a compile-time check and a coercion-time check. Discussion concluded with CaselIT opening #11526 for tracking. Column-level distinct is intended to be wrapped within an aggregate function, whereas statement-level distinct() is intended to be used for the whole statement i.e. SELECT DISTINCT …

Goals & Non-goals

Goals:

  • Detection of column-level distinct() used outside an aggregate function
  • Feedback when users use column-level distinct() incorrectly

Non-goals:

  • Existing behavior will not be broken — warning is advisory and does not raise an exception
  • Will not address other UnaryExpression misuse patterns

Proposed Solution

The warning check is implemented in ColumnsClauseImpl._post_coercion. By the time _post_coercion is triggered, the full column expression tree is constructed bottom-up.

To detect column-level distinct() used outside of an aggregate, visitors.iterate(ent) walks the expression tree top to bottom, checking for UnaryExpression+distinct_op and an aggregate function. If an aggregate function is not seen while UnaryExpression+distinct_op is present, a warning is issued. This also covers cases like select(distinct("x") + 1) where UnaryExpression is not the top level expression.

This logic is dependent on visitors.iterate() to traverse top-down. Doing so, we are checking
if an aggregate function has been seen prior to arriving to UnaryExpression+distinct_op

Warning issued:

"column-expression-level unary DISTINCT clause used outside aggregate; use select().distinct() for full-statement DISTINCT."

Alternatives Considered

Trigger coercion check in implicit_coercion vs post_coercion:

  • Cons: Would require introducing a new role purposefully excluded from the MRO — felt like a hack that could interfere with existing coercion implementations
  • Pros: I felt it would of followed SQLAlchemy culture more closely based on other implicit_coercion patterns

Top level UnaryExpression+distinct_op coercion check (best alternative):

  • Con: Does not validate that column-level distinct() is wrapped within an aggregate function
  • Pro: Faster — no need to traverse the full expression tree
  • Pro: Satisfies the original issue of column-level distinct() used in place of statement-level distinct()

Creating a new role FunctionDistinctRole:

  • Con: Expression trees are constructed bottom-up, so parent context hasn't been constructed yet when FunctionDistinctImpl is triggered, making aggregate detection impossible

Creating a new subclass of UnaryExpression:

  • Con: Without full knowledge of the codebase and future intentions for distinct(), this felt too invasive
  • Pro: Would simplify detection by checking class type directly rather than inspecting UnaryExpression for its operator

Copy link
Member

@zzzeek zzzeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi -

thanks for this PR! I think this is a good start, however I think we need to change the approach to use the compiler, rather than coercion. the former should be much simpler and not have any performance issues.

# test-case 1 of 2 for mssql dialect - issue#11526
def test_unary_distinct_with_limit(self):
t = table("sometable", column("somecolumn"))
with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests should all be modified to show correct use of column.distinct(), rather than adding a warning catch

)
eq_([(7,), (8,), (9,), (10,)], q.all())
with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"):
q = fixture_session().query(distinct(User.id)).order_by(User.id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix SQL to be correct, remove expect_warnings

def test_standalone_w_subquery(self):
User = self.classes.User
q = fixture_session().query(distinct(User.id))
with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix SQL to be correct, remove expect_warnings

"SELECT DISTINCT mytable.myid FROM mytable",
)
with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"):
self.assert_compile(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix SQL to be correct, remove expect_warnings

"SELECT DISTINCT mytable.myid FROM mytable",
)
with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"):
self.assert_compile(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix SQL to be correct, remove expect_warnings

eq_(r._mapping[users.c.user_name], "john")
eq_(r.user_name, "john")
with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"):
r = connection.execute(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use valid SQL

def test_distinct(self, connection):
test_table = self.tables.test
with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"):
s = select(distinct(test_table.c.avalue))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use valid SQL

s = select(test_table.c.avalue.distinct())
eq_(connection.execute(s).scalar(), 25)
with testing.expect_warnings("column-expression-level unary.*DISTINCT.*outside aggregate"):
s = select(test_table.c.avalue.distinct())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use valid SQL

" use select().distinct() for full-statement DISTINCT."
)

def _valid_unary_hierarchy(self, ent: ColumnElement[_T], op_name: str) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I havent checked closely, but it looks like _valid_unary_hierarchy would be called for every element of a select() that is coercing on ColumnsClauseImpl, that is:

select(a, b, c)

would call _valid_unary_hierarchy three times, no matter what a, b, c were, and would then call into at least two isinstance calls, as well as a whole visitors.iterate() call, etc.

unfortunately this approach cannot be taken here, this would add an order of magnitude of performance overhead to virtually all SQLAlchemy applications.

I have to apologize because I wasn't really thinking through a complete solution here when I said this:

does "distinct()" only go inside of a SQL function ? That's all we'd have to look for. we can put "distinct()" into a special coercion category based on the roles, like FunctionDistinctRole.

Sorry, I think I was incorrect there, since the main thing we need to catch is when distinct() is used outside of a function and warn for that, and to do this at coercion would mean distinct() would have to be taken outside of the whole UnaryExpressionElement hierarchy so that it could be caught directly. So I apologize if this caused a lot of time to be spent on the wrong approach.

When we need to warn for aspects of a SQL construct that depend on traversal of the hierarchy in context, this has to happen in the compiler, since that's where we're already traversing. we also have a little bit more performance wiggle room there since compilations are cached (coercions are not).

That said here's a simple patch to get this behavior (needs polish):

diff --git a/lib/sqlalchemy/sql/compiler.py b/lib/sqlalchemy/sql/compiler.py
index f0e608e74..09cf3d196 100644
--- a/lib/sqlalchemy/sql/compiler.py
+++ b/lib/sqlalchemy/sql/compiler.py
@@ -3111,6 +3111,8 @@ class SQLCompiler(Compiled):
 
         text: str
 
+        kwargs["allow_distinct_op"] = True
+
         if disp:
             text = disp(func, **kwargs)
         else:
@@ -3258,6 +3260,13 @@ class SQLCompiler(Compiled):
             disp = self._get_operator_dispatch(
                 unary.operator, "unary", "operator"
             )
+            if unary.operator is operators.distinct_op and not kw.get(
+                "allow_distinct_op", False
+            ):
+                util.warn(
+                    "can't use distinct() outside of an aggregate function "
+                    "TODO improve this message"
+                )
             if disp:
                 return disp(unary, unary.operator, **kw)
             else:

Im not sure if I want to make this "allow_distinct_op" or "inside_aggregate_function", but this is a starting point


_aggregate_functions = {
# Basic Aggregate function listed from GenericFunctions
'count', 'sum', 'avg',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should allow DISTINCT inside of all functions for now, we can't predict what kinds of aggregate functions may exist on all platforms, and requiring functions to indicate "aggregate OK" is overkill. we are just trying to prevent the simple case of select(distinct(col)).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Warn when distinct() is used outside of aggregated functions

2 participants

X Tutup