X Tutup
The Wayback Machine - https://web.archive.org/web/20241008073548/https://github.com/python/cpython/issues/76960
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urljoining an empty query string doesn't clear query string #76960

Open
thetorpedodog mannequin opened this issue Feb 6, 2018 · 12 comments
Open

urljoining an empty query string doesn't clear query string #76960

thetorpedodog mannequin opened this issue Feb 6, 2018 · 12 comments
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@thetorpedodog
Copy link
Mannequin

thetorpedodog mannequin commented Feb 6, 2018

BPO 32779
Nosy @orsenthil, @asvetlov, @thetorpedodog, @iritkatriel
PRs
  • gh-76960: Fix urljoining with an empty query string. #5645
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2018-02-06.04:48:49.412>
    labels = ['type-bug', 'library', '3.9', '3.10', '3.11']
    title = "urljoining an empty query string doesn't clear query string"
    updated_at = <Date 2021-05-28.14:36:47.327>
    user = 'https://github.com/thetorpedodog'

    bugs.python.org fields:

    activity = <Date 2021-05-28.14:36:47.327>
    actor = 'pfish'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2018-02-06.04:48:49.412>
    creator = 'pfish'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 32779
    keywords = ['patch']
    message_count = 7.0
    messages = ['311704', '311937', '312201', '312223', '394648', '394649', '394664']
    nosy_count = 4.0
    nosy_names = ['orsenthil', 'asvetlov', 'pfish', 'iritkatriel']
    pr_nums = ['5645']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue32779'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    Linked PRs

    @thetorpedodog
    Copy link
    Mannequin Author

    thetorpedodog mannequin commented Feb 6, 2018

    urljoining with '?' will not clear a query string:

    ACTUAL:
    >>> import urllib.parse
    >>> urllib.parse.urljoin('http://a/b/c?d=e', '?')
    'http://a/b/c?d=e'

    EXPECTED:
    'http://a/b/c' (optionally, with a ? at the end)

    WhatWG's URL standard expects a relative URL consisting of only a ? to replace a query string:

    https://url.spec.whatwg.org/#relative-state

    Seen in versions 3.6 and 2.7, but probably also affects later versions.

    @thetorpedodog thetorpedodog mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Feb 6, 2018
    @thetorpedodog
    Copy link
    Mannequin Author

    thetorpedodog mannequin commented Feb 10, 2018

    I'm working on a patch for this and can have one up in the next week or so, once I get the CLA signed and other boxes ticked. I'm new to the Github process but hopefully it will be a good start for the discussion.

    @asvetlov
    Copy link
    Contributor

    Python follows not WhatWG but RFC.
    https://tools.ietf.org/html/rfc3986#section-5.2.2 is proper definition for url joining algorithm.

    @thetorpedodog
    Copy link
    Mannequin Author

    thetorpedodog mannequin commented Feb 15, 2018

    In this case, the RFC is mismatched from the actual behaviour of browsers (as described and codified by WhatWG). It was surprising to me that urljoin() didn't do what I percieved as "the right thing" (and I expect other users would too).

    I would personally expect urljoin to do "the thing that everybody else does". Is there a sensible way to reduce this mismatch?

    For reference, Java's stdlib does what I would expect here:

    URI base = URI.create("https://example.com/?a=b");
    URI rel = base.resolve("?");
    System.out.println(rel);
    

    https://example.com/?

    @iritkatriel
    Copy link
    Member

    The relevant part in the RFC pseudo code is

               if defined(R.query) then
                  T.query = R.query;
               else
                  T.query = Base.query;
               endif;
    

    which is implemented in urljoin as:

            if not query:
                query = bquery

    Is this correct? Should the code not say "if query is not None"?
    (I can't see in the RFC a definition of defined()).

    @iritkatriel iritkatriel added 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes labels May 28, 2021
    @iritkatriel
    Copy link
    Member

    Sorry, urlparse returns '' rather than None when there is no query.
    So we indeed need to check something like
    if '?' not in url:
    or what's in Paul's patch.

    However, my main point was to question whether fixing this is actually in contradiction with the RFC.

    @thetorpedodog
    Copy link
    Mannequin Author

    thetorpedodog mannequin commented May 28, 2021

    Reading more into this, from section 5.2,1:

    A component is undefined if its associated delimiter does not appear in the URI reference

    So you could say that since there is a '?', the query component is *defined*, but *empty*. This would mean that assigning the target query to be '' has the desired effect as implemented by browsers and other languages' standard libraries.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @orsenthil
    Copy link
    Member

    Yes, this a bug with urljoin in Python.

    I compared the behavior against Ruby, and Golang.

    require 'uri'
    
    base_url = 'https://www.example.com/?a=b'
    relative_url = '?'
    
    url = URI.join(base_url, relative_url).to_s
    
    puts url

    Output:

    https://www.example.com/?

    And go's https://pkg.go.dev/net/url@go1.19beta1#URL.ResolveReference, ResolveReference resolves a URI reference to an absolute URI from an absolute base URI u, per RFC 3986 Section 5.2.

    package main
    
    import (
    	"fmt"
    	"net/url"
    )
    
    func main() {
    	base, _ := url.Parse("https://example.com/?a=b")
    	u, _ := url.Parse("?")
    	fmt.Println(base.ResolveReference(u))
    }

    https://example.com/?

    In python

    Python 3.12.0a7+ (heads/main:4898415df7, Apr 20 2023, 16:46:44) [GCC 11.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib.parse
    >>> urllib.parse.urljoin("http://www.example.com/?a/b","?")
    'http://www.example.com/?a/b'
    

    We can treat this as bug and fix it. However, there are very high chances that people are relying on the current no-op behavior, and this change in behavior can cause frameworks to break. This sounds a bit costly to me.

    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Aug 23, 2024
    …ponents
    
    * urljoin() with relative reference "?" sets empty query and removes fragment.
    * Preserve empty components (authority, params, query, fragment) in urljoin().
    * Preserve empty components (authority, params, query) in urldefrag().
    
    Also refactor the code and get rid of double _coerce_args() and
    _coerce_result() calls in urljoin(), urldefrag(), urlparse() and
    urlunparse().
    @serhiy-storchaka
    Copy link
    Member

    This is not only about "?". For example, "?#z" also should set an empty query. There may be also issues with empty authority ("//"), but this is less clear.

    Currently urllib.parse does not distinguish between undefined (no delimiter) and empty (there is a delimiter) components, both are represented by empty strings (see #99962). To make the behavior RFC3986 compatible we need a new parser. So I refactored the code in #123273, made internal helper functions more RFC3986 compatible. Public functions are now wrappers around them. Later I will add options to make public parsing/unparsing API also optionally RFC3986 compatible.

    It seems that there is an error in RFC3986. According to its algorithm, an the relative reference should always override fragment. And since an empty string does not have fragment, it urljoin() with empty string should remove fragment from the base URI. But older RFCs explicitly said that the original base URI should be returned. RFC3986 does not contain such statement explicitly and there is no such case in examples.

    As for "http:?" etc, RFC3986 allows to ignore the scheme if it is the same as the scheme in the base URI for compatibility. Current urllib.parse parser is not strict according to RFC3986. We may add an option for strict mode, but this is a different issue.

    @orsenthil
    Copy link
    Member

    Currently urllib.parse does not distinguish between undefined (no delimiter) and empty (there is a delimiter) components, both are represented by empty strings (see #99962). To make the behavior RFC3986 compatible we need a new parser. So I refactored the code in #123273, made internal helper functions more RFC3986 compatible.

    Thanks for explaining this clearly and taking this approach, @serhiy-storchaka .

    My one feedback is, we make this change early in the cycle of release so that we give ample time for many tools that rely on this underlying parser to detect the behavior change (if they relied on a previous one).

    @picnixz picnixz added 3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes and removed 3.11 only security fixes 3.10 only security fixes 3.9 only security fixes labels Aug 30, 2024
    serhiy-storchaka added a commit that referenced this issue Aug 31, 2024
    GH-123273)
    
    * urljoin() with relative reference "?" sets empty query and removes fragment.
    * Preserve empty components (authority, params, query, fragment) in urljoin().
    * Preserve empty components (authority, params, query) in urldefrag().
    
    Also refactor the code and get rid of double _coerce_args() and
    _coerce_result() calls in urljoin(), urldefrag(), urlparse() and
    urlunparse().
    @serhiy-storchaka
    Copy link
    Member

    Yet one bug (and perhaps more serious one) -- removing trailing semicolon from the path.

    >>> urljoin('http:/a/b/c/d;#f', '#z')
    'http:///a/b/c/d#z'
    >>> urldefrag('http:/a/b/c/d;#f')
    DefragResult(url='http:///a/b/c/d', fragment='f')

    RFC3986 does not define "params". The URI consists of 5 components: scheme, authority, path, query, fragment. '/b/c/d;' is the path, and the urllib functions mutilate it.

    This is why I think that it is worth to backport the solution.

    @orsenthil
    Copy link
    Member

    orsenthil commented Sep 1, 2024

    RFC3986 does not define "params".

    True. And when I dug a bit into this topic yesterday. Using params is highly discouraged now, and I think, very old Sun Servers and FTPs used a params for storing session state

    The backport is fine. But it carries the risk of breaking some applications. Calling it out and early backport is alright IMO.

    tryton-mirror-keeper pushed a commit to tryton/tryton that referenced this issue Sep 28, 2024
    Since python/cpython#76960 if the path of the URI
    does not start with a '/', the netloc is removed.
    
    Closes #13543
    tryton-mirror-keeper pushed a commit to tryton/tryton that referenced this issue Oct 1, 2024
    Since python/cpython#76960 if the path of the URI
    does not start with a '/', the netloc is removed.
    
    Closes #13543
    (grafted from 6d420b9f4bffc77dc599abff4b7f8e17a061542d)
    
    --HG--
    branch : 6.0
    tryton-mirror-keeper pushed a commit to tryton/tryton that referenced this issue Oct 1, 2024
    Since python/cpython#76960 if the path of the URI
    does not start with a '/', the netloc is removed.
    
    Closes #13543
    (grafted from 6d420b9f4bffc77dc599abff4b7f8e17a061542d)
    
    --HG--
    branch : 7.2
    tryton-mirror-keeper pushed a commit to tryton/tryton that referenced this issue Oct 1, 2024
    Since python/cpython#76960 if the path of the URI
    does not start with a '/', the netloc is removed.
    
    Closes #13543
    (grafted from 6d420b9f4bffc77dc599abff4b7f8e17a061542d)
    
    --HG--
    branch : 7.0
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants
    X Tutup