urljoining an empty query string doesn't clear query string #76960

thetorpedodog · 2018-02-06T04:48:49Z

BPO	32779
Nosy	@orsenthil, @asvetlov, @thetorpedodog, @iritkatriel
PRs	gh-76960: Fix urljoining with an empty query string. #5645

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2018-02-06.04:48:49.412>
labels = ['type-bug', 'library', '3.9', '3.10', '3.11']
title = "urljoining an empty query string doesn't clear query string"
updated_at = <Date 2021-05-28.14:36:47.327>
user = 'https://github.com/thetorpedodog'

bugs.python.org fields:

activity = <Date 2021-05-28.14:36:47.327>
actor = 'pfish'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2018-02-06.04:48:49.412>
creator = 'pfish'
dependencies = []
files = []
hgrepos = []
issue_num = 32779
keywords = ['patch']
message_count = 7.0
messages = ['311704', '311937', '312201', '312223', '394648', '394649', '394664']
nosy_count = 4.0
nosy_names = ['orsenthil', 'asvetlov', 'pfish', 'iritkatriel']
pr_nums = ['5645']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue32779'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

Linked PRs

gh-76960: Fix urljoin() and urldefrag() for URIs with empty components #123273

thetorpedodog · 2018-02-06T04:48:49Z

urljoining with '?' will not clear a query string:

ACTUAL:
>>> import urllib.parse
>>> urllib.parse.urljoin('http://a/b/c?d=e', '?')
'http://a/b/c?d=e'

EXPECTED:
'http://a/b/c' (optionally, with a ? at the end)

WhatWG's URL standard expects a relative URL consisting of only a ? to replace a query string:

https://url.spec.whatwg.org/#relative-state

Seen in versions 3.6 and 2.7, but probably also affects later versions.

thetorpedodog · 2018-02-10T06:05:15Z

I'm working on a patch for this and can have one up in the next week or so, once I get the CLA signed and other boxes ticked. I'm new to the Github process but hopefully it will be a good start for the discussion.

asvetlov · 2018-02-15T11:04:20Z

Python follows not WhatWG but RFC.
https://tools.ietf.org/html/rfc3986#section-5.2.2 is proper definition for url joining algorithm.

thetorpedodog · 2018-02-15T20:28:52Z

In this case, the RFC is mismatched from the actual behaviour of browsers (as described and codified by WhatWG). It was surprising to me that urljoin() didn't do what I percieved as "the right thing" (and I expect other users would too).

I would personally expect urljoin to do "the thing that everybody else does". Is there a sensible way to reduce this mismatch?

For reference, Java's stdlib does what I would expect here:

URI base = URI.create("https://example.com/?a=b");
URI rel = base.resolve("?");
System.out.println(rel);

https://example.com/?

iritkatriel · 2021-05-28T09:52:37Z

The relevant part in the RFC pseudo code is

           if defined(R.query) then
              T.query = R.query;
           else
              T.query = Base.query;
           endif;

which is implemented in urljoin as:

        if not query:
            query = bquery

Is this correct? Should the code not say "if query is not None"?
(I can't see in the RFC a definition of defined()).

iritkatriel · 2021-05-28T10:00:16Z

Sorry, urlparse returns '' rather than None when there is no query.
So we indeed need to check something like
if '?' not in url:
or what's in Paul's patch.

However, my main point was to question whether fixing this is actually in contradiction with the RFC.

thetorpedodog · 2021-05-28T14:36:47Z

Reading more into this, from section 5.2,1:

A component is undefined if its associated delimiter does not appear in the URI reference

So you could say that since there is a '?', the query component is *defined*, but *empty*. This would mean that assigning the target query to be '' has the desired effect as implemented by browsers and other languages' standard libraries.

orsenthil · 2023-04-21T00:11:21Z

Yes, this a bug with urljoin in Python.

I compared the behavior against Ruby, and Golang.

require 'uri'

base_url = 'https://www.example.com/?a=b'
relative_url = '?'

url = URI.join(base_url, relative_url).to_s

puts url

Output:

https://www.example.com/?

And go's https://pkg.go.dev/net/url@go1.19beta1#URL.ResolveReference, ResolveReference resolves a URI reference to an absolute URI from an absolute base URI u, per RFC 3986 Section 5.2.

package main

import (
	"fmt"
	"net/url"
)

func main() {
	base, _ := url.Parse("https://example.com/?a=b")
	u, _ := url.Parse("?")
	fmt.Println(base.ResolveReference(u))
}

https://example.com/?

In python

Python 3.12.0a7+ (heads/main:4898415df7, Apr 20 2023, 16:46:44) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.parse
>>> urllib.parse.urljoin("http://www.example.com/?a/b","?")
'http://www.example.com/?a/b'

We can treat this as bug and fix it. However, there are very high chances that people are relying on the current no-op behavior, and this change in behavior can cause frameworks to break. This sounds a bit costly to me.

…ponents * urljoin() with relative reference "?" sets empty query and removes fragment. * Preserve empty components (authority, params, query, fragment) in urljoin(). * Preserve empty components (authority, params, query) in urldefrag(). Also refactor the code and get rid of double _coerce_args() and _coerce_result() calls in urljoin(), urldefrag(), urlparse() and urlunparse().

serhiy-storchaka · 2024-08-23T21:09:50Z

This is not only about "?". For example, "?#z" also should set an empty query. There may be also issues with empty authority ("//"), but this is less clear.

Currently urllib.parse does not distinguish between undefined (no delimiter) and empty (there is a delimiter) components, both are represented by empty strings (see #99962). To make the behavior RFC3986 compatible we need a new parser. So I refactored the code in #123273, made internal helper functions more RFC3986 compatible. Public functions are now wrappers around them. Later I will add options to make public parsing/unparsing API also optionally RFC3986 compatible.

It seems that there is an error in RFC3986. According to its algorithm, an the relative reference should always override fragment. And since an empty string does not have fragment, it urljoin() with empty string should remove fragment from the base URI. But older RFCs explicitly said that the original base URI should be returned. RFC3986 does not contain such statement explicitly and there is no such case in examples.

As for "http:?" etc, RFC3986 allows to ignore the scheme if it is the same as the scheme in the base URI for compatibility. Current urllib.parse parser is not strict according to RFC3986. We may add an option for strict mode, but this is a different issue.

orsenthil · 2024-08-30T15:38:37Z

Currently urllib.parse does not distinguish between undefined (no delimiter) and empty (there is a delimiter) components, both are represented by empty strings (see #99962). To make the behavior RFC3986 compatible we need a new parser. So I refactored the code in #123273, made internal helper functions more RFC3986 compatible.

Thanks for explaining this clearly and taking this approach, @serhiy-storchaka .

My one feedback is, we make this change early in the cycle of release so that we give ample time for many tools that rely on this underlying parser to detect the behavior change (if they relied on a previous one).

GH-123273) * urljoin() with relative reference "?" sets empty query and removes fragment. * Preserve empty components (authority, params, query, fragment) in urljoin(). * Preserve empty components (authority, params, query) in urldefrag(). Also refactor the code and get rid of double _coerce_args() and _coerce_result() calls in urljoin(), urldefrag(), urlparse() and urlunparse().

serhiy-storchaka · 2024-08-31T09:51:43Z

Yet one bug (and perhaps more serious one) -- removing trailing semicolon from the path.

>>> urljoin('http:/a/b/c/d;#f', '#z')
'http:///a/b/c/d#z'
>>> urldefrag('http:/a/b/c/d;#f')
DefragResult(url='http:///a/b/c/d', fragment='f')

RFC3986 does not define "params". The URI consists of 5 components: scheme, authority, path, query, fragment. '/b/c/d;' is the path, and the urllib functions mutilate it.

This is why I think that it is worth to backport the solution.

orsenthil · 2024-09-01T14:02:14Z

RFC3986 does not define "params".

True. And when I dug a bit into this topic yesterday. Using params is highly discouraged now, and I think, very old Sun Servers and FTPs used a params for storing session state

The backport is fine. But it carries the risk of breaking some applications. Calling it out and early backport is alright IMO.

Since python/cpython#76960 if the path of the URI does not start with a '/', the netloc is removed. Closes #13543

Since python/cpython#76960 if the path of the URI does not start with a '/', the netloc is removed. Closes #13543 (grafted from 6d420b9f4bffc77dc599abff4b7f8e17a061542d) --HG-- branch : 6.0

Since python/cpython#76960 if the path of the URI does not start with a '/', the netloc is removed. Closes #13543 (grafted from 6d420b9f4bffc77dc599abff4b7f8e17a061542d) --HG-- branch : 7.2

Since python/cpython#76960 if the path of the URI does not start with a '/', the netloc is removed. Closes #13543 (grafted from 6d420b9f4bffc77dc599abff4b7f8e17a061542d) --HG-- branch : 7.0

thetorpedodog mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Feb 6, 2018

iritkatriel added 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes labels May 28, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022

bedevere-bot mentioned this issue May 9, 2023

gh-76960: Fix urljoining with an empty query string. #5645

Open

pirate mentioned this issue Apr 25, 2024

a bug of urllib.parse.urljoin ArchiveBox/ArchiveBox#1411

Closed

bedevere-app bot mentioned this issue Aug 23, 2024

gh-76960: Fix urljoin() and urldefrag() for URIs with empty components #123273

Merged

picnixz added 3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes and removed 3.11 only security fixes 3.10 only security fixes 3.9 only security fixes labels Aug 30, 2024

tryton-mirror-keeper pushed a commit to tryton/tryton that referenced this issue Sep 28, 2024

Start PostgreSQL URI path with a '/'

3d8fcb9

Since python/cpython#76960 if the path of the URI does not start with a '/', the netloc is removed. Closes #13543

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

urljoining an empty query string doesn't clear query string #76960

urljoining an empty query string doesn't clear query string #76960

thetorpedodog mannequin commented Feb 6, 2018 •

edited by bedevere-app bot

Loading

thetorpedodog mannequin commented Feb 6, 2018

thetorpedodog mannequin commented Feb 10, 2018

asvetlov commented Feb 15, 2018

thetorpedodog mannequin commented Feb 15, 2018

iritkatriel commented May 28, 2021

iritkatriel commented May 28, 2021

thetorpedodog mannequin commented May 28, 2021

orsenthil commented Apr 21, 2023

serhiy-storchaka commented Aug 23, 2024

orsenthil commented Aug 30, 2024

serhiy-storchaka commented Aug 31, 2024

orsenthil commented Sep 1, 2024 •

edited

Loading

Sep	OCT	Nov
	08
2023	2024	2025

urljoining an empty query string doesn't clear query string #76960

urljoining an empty query string doesn't clear query string #76960

Comments

thetorpedodog mannequin commented Feb 6, 2018 • edited by bedevere-app bot Loading

Linked PRs

thetorpedodog mannequin commented Feb 6, 2018

thetorpedodog mannequin commented Feb 10, 2018

asvetlov commented Feb 15, 2018

thetorpedodog mannequin commented Feb 15, 2018

iritkatriel commented May 28, 2021

iritkatriel commented May 28, 2021

thetorpedodog mannequin commented May 28, 2021

orsenthil commented Apr 21, 2023

serhiy-storchaka commented Aug 23, 2024

orsenthil commented Aug 30, 2024

serhiy-storchaka commented Aug 31, 2024

orsenthil commented Sep 1, 2024 • edited Loading

thetorpedodog mannequin commented Feb 6, 2018 •

edited by bedevere-app bot

Loading

orsenthil commented Sep 1, 2024 •

edited

Loading