gh-63161: Fix tokenize detect_encoding() for non-ASCII coding #139235
vstinner wants to merge 5 commits into python:main
Conversation
serhiy-storchaka
left a comment
Please add a test for the codec cookie on the second line (with a non-ASCII first line).
Also add a test with a declared ASCII encoding but non-ASCII content that can still be decoded as UTF-8, e.g. '#coding=ascii €'.encode('utf-8'), and a corresponding two-line test.
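A rough sketch of the suggested cases, exercised directly through `tokenize.detect_encoding()` (the byte strings below are illustrative, not taken from the PR's test suite):

```python
import io
import tokenize

# Codec cookie on the second line, ASCII-only first line:
# detect_encoding() honours the cookie and normalises the codec name.
two_line = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n"
encoding, _ = tokenize.detect_encoding(io.BytesIO(two_line).readline)
assert encoding == "iso-8859-1"  # normalised form of "latin-1"

# Declared ASCII encoding, but non-ASCII content that is valid UTF-8
# (the '#coding=ascii €' case). Whether detect_encoding() merely reports
# 'ascii' or rejects the file outright is exactly the behaviour under
# discussion in this thread, so both outcomes are handled here.
mismatched = "#coding=ascii €\n".encode("utf-8")
try:
    encoding, _ = tokenize.detect_encoding(io.BytesIO(mismatched).readline)
    print("detected:", encoding)
except SyntaxError as exc:
    print("rejected:", exc)
```

The real tests in the PR would of course use `unittest` assertions rather than prints.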
@serhiy-storchaka: I added more tests, please review the updated PR. Is it what you wanted?
serhiy-storchaka
left a comment
Thank you for the update. In the two-line cases, please use non-ASCII data in the first line, before the codec cookie. Test that the tokenizer uses the correct encoding to decode comments in the first lines.
It may already be tested elsewhere, but I would also add tests for non-ASCII data in the first and second comment lines when no codec cookie is present (so UTF-8 should be used), for both valid and invalid UTF-8.
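The no-cookie cases can be sketched like this (again illustrative data, assuming plain `tokenize.detect_encoding()` behaviour with UTF-8 as the implicit default per PEP 3120):

```python
import io
import tokenize

# No codec cookie anywhere: UTF-8 is the implicit default, so a valid
# UTF-8 comment on the first line is accepted.
valid = "# café\nx = 1\n".encode("utf-8")
encoding, _ = tokenize.detect_encoding(io.BytesIO(valid).readline)
assert encoding == "utf-8"

# The same comment encoded as latin-1 contains a 0xE9 byte, which is
# invalid UTF-8; with no cookie present, detect_encoding() raises
# SyntaxError instead of guessing an encoding.
invalid = "# café\nx = 1\n".encode("latin-1")
try:
    tokenize.detect_encoding(io.BytesIO(invalid).readline)
except SyntaxError:
    print("rejected as expected")
```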
I expect the tokenizer to correctly decode files that match the explicit or implicit encoding, and to reject files that do not match. The interpreter should behave the same way.
Ok, I added more tests. Please review the updated PR.
@serhiy-storchaka: It seems like you're working on the same area these days and have a more advanced fix. I can abandon this PR, no?
Agree. Sorry, but I already had tests for the core interpreter and a model of how it should work. I only needed to beat the code until it started to pass the tests.