gh-63161: Fix tokenize detect_encoding() for non-ASCII coding #139235
vstinner wants to merge 5 commits into python:main
Conversation
serhiy-storchaka
left a comment
Please add a test for the codec cookie on the second line (with a non-ASCII first line).
Also add a test with a declared ASCII encoding but non-ASCII content that can still be decoded as UTF-8, e.g. '#coding=ascii €'.encode('utf-8'), and a corresponding two-line test.
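A rough sketch of the suggested cases, exercised directly through `tokenize.detect_encoding()` (the byte strings below are illustrative, not taken from the PR's test suite):

```python
import io
import tokenize

# Codec cookie on the second line, ASCII-only first line:
# detect_encoding() honours the cookie and normalises the codec name.
two_line = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n"
encoding, _ = tokenize.detect_encoding(io.BytesIO(two_line).readline)
assert encoding == "iso-8859-1"  # normalised form of "latin-1"

# Declared ASCII encoding, but non-ASCII content that is valid UTF-8
# (the '#coding=ascii €' case). Whether detect_encoding() merely reports
# 'ascii' or rejects the file outright is exactly the behaviour under
# discussion in this thread, so both outcomes are handled here.
mismatched = "#coding=ascii €\n".encode("utf-8")
try:
    encoding, _ = tokenize.detect_encoding(io.BytesIO(mismatched).readline)
    print("detected:", encoding)
except SyntaxError as exc:
    print("rejected:", exc)
```

The real tests in the PR would of course use `unittest` assertions rather than prints.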
@serhiy-storchaka: I added more tests, please review the updated PR. Is it what you wanted?
serhiy-storchaka
left a comment
Thank you for the update. In the two-line cases, please use non-ASCII data in the first line, before the codec cookie. Test that the tokenizer uses the correct encoding to decode comments in the first lines.
It may already be tested elsewhere, but I would also add tests for non-ASCII data in the first and second comment lines when no codec cookie is present (so UTF-8 should be used), for both valid and invalid UTF-8.
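The no-cookie cases can be sketched like this (again illustrative data, assuming plain `tokenize.detect_encoding()` behaviour with UTF-8 as the implicit default per PEP 3120):

```python
import io
import tokenize

# No codec cookie anywhere: UTF-8 is the implicit default, so a valid
# UTF-8 comment on the first line is accepted.
valid = "# café\nx = 1\n".encode("utf-8")
encoding, _ = tokenize.detect_encoding(io.BytesIO(valid).readline)
assert encoding == "utf-8"

# The same comment encoded as latin-1 contains a 0xE9 byte, which is
# invalid UTF-8; with no cookie present, detect_encoding() raises
# SyntaxError instead of guessing an encoding.
invalid = "# café\nx = 1\n".encode("latin-1")
try:
    tokenize.detect_encoding(io.BytesIO(invalid).readline)
except SyntaxError:
    print("rejected as expected")
```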
I expect the tokenizer to correctly decode files that match the explicit or implicit encoding, and to reject files that do not match. The interpreter should behave the same way.
Ok, I added more tests. Please review the updated PR.
@serhiy-storchaka: It seems like you're working on the same area these days and have a more advanced fix. I can abandon this PR, no?
Agree. Sorry, but I already had tests for the core interpreter and a model of how it should work. I only needed to beat the code until it started to pass the tests.