urljoining an empty query string doesn't clear query string #76960
|
urljoining with '?' will not clear a query string:
ACTUAL:
>>> import urllib.parse
>>> urllib.parse.urljoin('http://a/b/c?d=e', '?')
'http://a/b/c?d=e'
EXPECTED: WhatWG's URL standard expects a relative URL consisting of only a ? to replace a query string: https://url.spec.whatwg.org/#relative-state
Seen in versions 3.6 and 2.7, but probably also affects later versions. |
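For callers who only need the query string gone, regardless of how urljoin() treats '?', a minimal sketch is to split the URL and reassemble it without the query. The helper name `clear_query` is hypothetical and not part of urllib.parse; note that the result carries no trailing '?', so it is not byte-for-byte what a WhatWG-style join would produce.

```python
from urllib.parse import urlsplit


def clear_query(url: str) -> str:
    """Return url with its query string and fragment removed (hypothetical helper)."""
    parts = urlsplit(url)
    # _replace() is the namedtuple method; geturl() reassembles the components.
    return parts._replace(query="", fragment="").geturl()


print(clear_query("http://a/b/c?d=e"))  # http://a/b/c
```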
|
I'm working on a patch for this and can have one up in the next week or so, once I get the CLA signed and other boxes ticked. I'm new to the GitHub process, but hopefully it will be a good start for the discussion. |
|
Python follows the RFC, not WhatWG. |
|
In this case, the RFC is mismatched from the actual behaviour of browsers (as described and codified by WhatWG). It was surprising to me that urljoin() didn't do what I perceived as "the right thing" (and I expect other users would be surprised too). I would personally expect urljoin to do "the thing that everybody else does". Is there a sensible way to reduce this mismatch? For reference, Java's stdlib does what I would expect here: |
|
The relevant part in the RFC pseudo code is which is implemented in urljoin as:
if not query:
    query = bquery
Is this correct? Should the code not say "if query is not None"? |
|
Sorry, urlparse returns '' rather than None when there is no query. However, my main point was to question whether fixing this is actually in contradiction with the RFC. |
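To make the "if not query" versus "if query is not None" distinction concrete, here is a small sketch. It assumes a representation in which None stands for an undefined query and '' for a defined but empty one; urlparse does not expose that distinction, and both function names are hypothetical.

```python
from typing import Optional


def merge_query_truthy(ref_query: Optional[str], base_query: Optional[str]) -> Optional[str]:
    # Mirrors the "if not query" test: an empty reference query
    # falls back to the base query.
    if not ref_query:
        return base_query
    return ref_query


def merge_query_defined(ref_query: Optional[str], base_query: Optional[str]) -> Optional[str]:
    # Mirrors "if query is not None": only an undefined (None)
    # reference query falls back to the base query.
    if ref_query is not None:
        return ref_query
    return base_query


print(repr(merge_query_truthy("", "d=e")))   # 'd=e'
print(repr(merge_query_defined("", "d=e")))  # ''
```

The two versions agree whenever the reference query is None or non-empty; they differ only for the empty string, which is the case this issue is about.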
|
Reading more into this, from section 5.2.1:
So you could say that since there is a '?', the query component is *defined*, but *empty*. This would mean that assigning the target query to be '' has the desired effect as implemented by browsers and other languages' standard libraries. |
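One way to tell "defined but empty" apart from "undefined" without changing the parser is to look for the '?' delimiter directly. This is only a sketch: `has_defined_query` is a hypothetical helper, and it assumes that a '?' counts as a query delimiter only when it appears before the first '#'.

```python
from urllib.parse import urlsplit


def has_defined_query(reference: str) -> bool:
    """True if the reference contains a '?' delimiter before any fragment."""
    before_fragment, _, _ = reference.partition("#")
    return "?" in before_fragment


# urlsplit() reports '' in both cases, so the delimiter check adds information.
for ref in ("http://a/b", "http://a/b?"):
    print(repr(urlsplit(ref).query), has_defined_query(ref))
```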
|
Yes, this is a bug with urljoin in Python. I compared the behavior against Ruby and Go.
require 'uri'
base_url = 'https://www.example.com/?a=b'
relative_url = '?'
url = URI.join(base_url, relative_url).to_s
puts url
Output:
And Go's https://pkg.go.dev/net/url@go1.19beta1#URL.ResolveReference: ResolveReference resolves a URI reference to an absolute URI from an absolute base URI u, per RFC 3986 Section 5.2.
package main

import (
	"fmt"
	"net/url"
)

func main() {
	base, _ := url.Parse("https://example.com/?a=b")
	u, _ := url.Parse("?")
	fmt.Println(base.ResolveReference(u))
}
In python
We can treat this as a bug and fix it. However, there is a very high chance that people are relying on the current no-op behavior, and this change in behavior can cause frameworks to break. This sounds a bit costly to me. |
…ponents * urljoin() with relative reference "?" sets empty query and removes fragment. * Preserve empty components (authority, params, query, fragment) in urljoin(). * Preserve empty components (authority, params, query) in urldefrag(). Also refactor the code and get rid of double _coerce_args() and _coerce_result() calls in urljoin(), urldefrag(), urlparse() and urlunparse().
|
This is not only about "?". For example, "?#z" should also set an empty query. There may also be issues with an empty authority ("//"), but this is less clear. Currently
It seems that there is an error in RFC3986. According to its algorithm, the relative reference should always override the fragment. And since an empty string does not have a fragment, urljoin() with an empty string should remove the fragment from the base URI. But older RFCs explicitly said that the original base URI should be returned. RFC3986 does not contain such a statement explicitly, and there is no such case in the examples. As for "http:?" etc., RFC3986 allows ignoring the scheme if it is the same as the scheme in the base URI, for compatibility. Current |
Thanks for explaining this clearly and taking this approach, @serhiy-storchaka. My one piece of feedback: we should make this change early in the release cycle, so that the many tools that rely on this underlying parser have ample time to detect the behavior change (if they relied on the previous behavior). |
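A tool that wants to detect which behavior the running interpreter has could probe urljoin() directly rather than compare version numbers. A minimal sketch, with a hypothetical function name; it reports only whether joining '?' changes the base URL and makes no claim about which Python releases do so.

```python
from urllib.parse import urljoin


def urljoin_clears_query() -> bool:
    """Return True if urljoin(base, '?') alters a base that has a query."""
    base = "http://a/b/c?d=e"
    # Under the old no-op behavior the base comes back unchanged.
    return urljoin(base, "?") != base


if urljoin_clears_query():
    print("urljoin() treats '?' as an empty query")
else:
    print("urljoin() ignores a bare '?'")
```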
GH-123273) * urljoin() with relative reference "?" sets empty query and removes fragment. * Preserve empty components (authority, params, query, fragment) in urljoin(). * Preserve empty components (authority, params, query) in urldefrag(). Also refactor the code and get rid of double _coerce_args() and _coerce_result() calls in urljoin(), urldefrag(), urlparse() and urlunparse().
|
Yet another bug (and perhaps a more serious one): removing the trailing semicolon from the path.
>>> urljoin('http:/a/b/c/d;#f', '#z')
'http:///a/b/c/d#z'
>>> urldefrag('http:/a/b/c/d;#f')
DefragResult(url='http:///a/b/c/d', fragment='f')
RFC3986 does not define "params". The URI consists of 5 components: scheme, authority, path, query, fragment. This is why I think that it is worth backporting the solution. |
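Where the exact text before the fragment has to survive, including a trailing semicolon, one option is to split on the first '#' without parsing the rest of the URL. This is a sketch with a hypothetical helper name, not a drop-in replacement for urldefrag(); it does no validation and returns a plain tuple.

```python
from typing import Tuple


def split_fragment(url: str) -> Tuple[str, str]:
    """Split url at the first '#', leaving everything before it untouched."""
    head, _, fragment = url.partition("#")
    return head, fragment


print(split_fragment("http:/a/b/c/d;#f"))  # ('http:/a/b/c/d;', 'f')
```

Because nothing before the '#' is parsed and reassembled, the path, including any ';', comes back exactly as written.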
True. I dug a bit into this topic yesterday. Using params is highly discouraged now, and I think very old Sun servers and FTP servers used params for storing session state. The backport is fine, but it carries the risk of breaking some applications. Calling it out and backporting early is alright IMO. |
Since python/cpython#76960 if the path of the URI does not start with a '/', the netloc is removed. Closes #13543
Since python/cpython#76960 if the path of the URI does not start with a '/', the netloc is removed. Closes #13543 (grafted from 6d420b9f4bffc77dc599abff4b7f8e17a061542d) --HG-- branch : 6.0
Since python/cpython#76960 if the path of the URI does not start with a '/', the netloc is removed. Closes #13543 (grafted from 6d420b9f4bffc77dc599abff4b7f8e17a061542d) --HG-- branch : 7.2
Since python/cpython#76960 if the path of the URI does not start with a '/', the netloc is removed. Closes #13543 (grafted from 6d420b9f4bffc77dc599abff4b7f8e17a061542d) --HG-- branch : 7.0
