Implement _tokenize and update tokenize from v3.14.3 #7392

youknowone merged 3 commits into RustPython:main from
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration: Path: .coderabbit.yml · Review profile: CHILL · Plan: Pro · Run ID:
⛔ Files ignored due to path filters (4)
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 Walkthrough

The PR adds a new Python-accessible tokenizer module (`_tokenize`).
Sequence Diagram

```mermaid
sequenceDiagram
    participant Python as Python Caller
    participant Iter as PyTokenizerIter
    participant Reader as readline Callback
    participant Parser as Token Parser
    participant TokenMgmt as Token State Manager
    Python->>Iter: __init__(readline, extra_tokens, encoding)
    Iter->>Iter: Initialize state (Reading phase)
    Python->>Iter: __next__()
    Iter->>Iter: Check current phase
    alt Reading Phase
        Iter->>Reader: Call readline callback
        Reader-->>Iter: Return line text
        Iter->>Parser: Parse line → tokens
        Parser-->>Iter: Return tokens
        Iter->>TokenMgmt: Queue pending tokens
        Iter->>Iter: Transition to Yielding phase
    end
    alt Yielding Phase
        Iter->>TokenMgmt: Fetch next token
        Iter->>Iter: Handle special cases (FSTRING_MIDDLE, Dedent, NL, EOF)
        Iter->>Iter: Build token tuple (type, text, start, end, line)
        Iter-->>Python: Return token tuple
    end
    alt Done Phase
        Iter-->>Python: Raise StopIteration
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 5 passed
📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

- [x] lib: cpython/Lib/tabnanny.py dependencies:
  dependent tests: (1 test)
- [x] lib: cpython/Lib/tokenize.py dependencies:
  dependent tests: (132 tests)
should rename to _tokenize, right?
ShaharNaveh left a comment

Way better than what I did
@ShaharNaveh this is obviously based on your work!
Actionable comments posted: 2
🧹 Nitpick comments (2)
crates/stdlib/src/_tokenize.rs (2)
396-419: Extract magic numbers 60 and 63 to named constants.

The function uses the magic numbers 60 (FSTRING_MIDDLE) and 63 (TSTRING_MIDDLE) directly. For consistency with the other token constants defined at the top of the module (lines 25-29), consider adding named constants.

♻️ Suggested refactor

Add to the constants section (after line 29):

```rust
const TOKEN_FSTRING_MIDDLE: u8 = 60;
const TOKEN_TSTRING_MIDDLE: u8 = 63;
```

Then update the function:

```diff
         TokenKind::FStringStart => {
             if depth == 0 {
-                return 60; // FSTRING_MIDDLE
+                return TOKEN_FSTRING_MIDDLE;
             }
             depth -= 1;
         }
         TokenKind::TStringStart => {
             if depth == 0 {
-                return 63; // TSTRING_MIDDLE
+                return TOKEN_TSTRING_MIDDLE;
             }
             depth -= 1;
         }
         _ => {}
     }
 }
-60 // default to FSTRING_MIDDLE
+TOKEN_FSTRING_MIDDLE // default
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/_tokenize.rs` around lines 396 - 419, The function find_fstring_middle_type currently returns magic numbers 60 and 63; define named constants (e.g., TOKEN_FSTRING_MIDDLE and TOKEN_TSTRING_MIDDLE) alongside the other token constants at the top of the module and replace the literal returns (and the final default 60) with those constants so the function returns TOKEN_FSTRING_MIDDLE or TOKEN_TSTRING_MIDDLE instead of raw numbers.
107-149: Consider the memory overhead of cloning TokenizerState.

The clone-modify-write pattern (line 109 reads/clones, line 140 writes back) avoids holding the lock during potentially slow readline() calls. However, in the Yielding phase, this clones the entire source string and token vectors on each next() call, which could be expensive for large files.

A more efficient approach would be to hold the write lock directly in the Yielding phase (since no external calls are made):

♻️ Suggested optimization (optional)

```diff
 impl IterNext for PyTokenizerIter {
     fn next(zelf: &Py<Self>, vm: &VirtualMachine) -> PyResult<PyIterReturn> {
-        let mut state = zelf.state.read().clone();
-        loop {
-            match &mut state.phase {
-                TokenizerPhase::Reading { source } => {
+        // Check phase without holding lock
+        let is_reading = matches!(&zelf.state.read().phase, TokenizerPhase::Reading { .. });
+
+        if is_reading {
+            let mut state = zelf.state.write();
+            if let TokenizerPhase::Reading { source } = &mut state.phase {
                 let line = zelf.readline(vm)?;
                 // ... rest of reading logic
-                }
-                TokenizerPhase::Yielding { .. } => {
-                    let result =
-                        emit_next_token(&mut state, zelf.extra_tokens, vm)?;
-                    *zelf.state.write() = state;
-                    return Ok(result);
             }
+        } else {
+            let mut state = zelf.state.write();
+            return emit_next_token(&mut state, zelf.extra_tokens, vm);
         }
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/_tokenize.rs` around lines 107 - 149, The current PyTokenizerIter::next clones the entire TokenizerState at the start (zelf.state.read().clone()), which causes repeated heavy clones when in TokenizerPhase::Yielding; instead, only clone/read for the Reading branch but for the Yielding branch acquire a write lock on zelf.state (i.e., get a write guard) and call emit_next_token with a mutable reference into that guarded state (TokenizerPhase::Yielding) so you avoid cloning source/tokens on each next(); keep the readline path unchanged (read/clone/modify/write) but switch to holding the write lock inside the TokenizerPhase::Yielding arm before calling emit_next_token and returning its PyIterReturn.
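The trade-off this comment describes can be reduced to a minimal, self-contained sketch using std::sync::RwLock. `State`, `next_token_cloning`, and `next_token_locked` are hypothetical stand-ins for the module's real types, not its actual API.

```rust
use std::sync::RwLock;

#[derive(Clone)]
struct State {
    tokens: Vec<String>,
    pos: usize,
}

// Clone-modify-write: clone under a short read lock so no lock is held
// while a (potentially slow) external callback would run. The cost is
// cloning the whole state on every call.
fn next_token_cloning(state: &RwLock<State>) -> Option<String> {
    let mut local = state.read().unwrap().clone();
    let tok = local.tokens.get(local.pos).cloned();
    local.pos += 1;
    *state.write().unwrap() = local; // write the whole state back
    tok
}

// Direct write lock: safe when no external code is called while the
// guard is held, and avoids cloning tokens and source on each call.
fn next_token_locked(state: &RwLock<State>) -> Option<String> {
    let mut guard = state.write().unwrap();
    let tok = guard.tokens.get(guard.pos).cloned();
    guard.pos += 1;
    tok
}

fn main() {
    let state = RwLock::new(State {
        tokens: vec!["a".into(), "b".into()],
        pos: 0,
    });
    assert_eq!(next_token_cloning(&state), Some("a".to_string()));
    assert_eq!(next_token_locked(&state), Some("b".to_string()));
    assert_eq!(next_token_locked(&state), None);
}
```

The design point is the one the review makes: cloning is only justified while the lock must not be held across external calls (the readline path); the yielding path makes no external calls, so it can hold the write guard directly.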
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/stdlib/src/_tokenize.rs`:
- Around line 742-744: The match currently returns 0 (ENDMARKER) for
TokenKind::Unknown (and leaves IpyEscapeCommand and Question as 0); change these
arms so they map to ERRORTOKEN (numeric 67 or the appropriate ERRORTOKEN
constant) instead of 0: update the match handling for TokenKind::Unknown,
TokenKind::IpyEscapeCommand, and TokenKind::Question in the tokenizer function
to return 67 (or token::ERRORTOKEN) so unrecognized/IPython-specific tokens
produce ERRORTOKEN rather than ENDMARKER.
- Around line 189-199: The call to source.full_line_str(...) is invalid; replace
it with a small helper function (e.g., full_line_at(source: &str, byte_offset:
usize) -> &str or String) and call that from the tokenization code (replace
source.full_line_str(TextSize::from(...)) with full_line_at(source,
offset_min)). Implement full_line_at to locate the start and end byte indices
around byte_offset by scanning source bytes for the previous '\n' (or start) and
next '\n' (or end), trimming a trailing '\r' if present so CRLF lines return the
line without '\r'; keep using TextSize::from(offset.min(source.len()) as u32) to
compute the offset passed from the existing code. Also fix the offset
calculation (where you currently do .map(|l| l.len() + 1)) to sum byte lengths
including the actual newline sequence length for each consumed line (detect
'\r\n' vs '\n') or instead compute the byte offset by scanning source bytes for
the (sl-1)-th newline to avoid assuming single-byte line endings; update all
places calling full_line_str to use the new full_line_at helper.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 51c73337-da3f-40a7-a355-8dedaf682e2d
⛔ Files ignored due to path filters (4)
- Cargo.lock is excluded by !**/*.lock
- Lib/test/test_tabnanny.py is excluded by !Lib/**
- Lib/test/test_tokenize.py is excluded by !Lib/**
- Lib/tokenize.py is excluded by !Lib/**
📒 Files selected for processing (3)
- crates/stdlib/Cargo.toml
- crates/stdlib/src/_tokenize.rs
- crates/stdlib/src/lib.rs
```rust
let offset: usize = source
    .lines()
    .take(sl.saturating_sub(1))
    .map(|l| l.len() + 1)
    .sum();
let full_line =
    source.full_line_str(TextSize::from(offset.min(source.len()) as u32));
return Ok(PyIterReturn::Return(make_token_tuple(
    vm, tok_type, &tok_str, sl, sc as isize, el, ec as isize, full_line,
)));
```
full_line_str method does not exist and will cause compilation failure.
The code calls source.full_line_str(TextSize) at multiple locations (lines 195, 288, 351, 383, 437, 472), but this method is not defined on String or &str types in the codebase. This will prevent compilation. Either implement this method as an extension trait or replace these calls with an alternative approach to extract line content at the given offset.
The secondary concern about CRLF line ending handling in the offset calculation (lines 189-193) also applies: .map(|l| l.len() + 1) assumes single-byte (\n) line endings and will miscalculate for CRLF (\r\n) sources.
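A helper along the lines this review proposes might look like the sketch below. `full_line_at` is a hypothetical name; the real fix would also need the line-offset calculation corrected as noted above. It returns the full line containing a byte offset and tolerates both "\n" and "\r\n" endings.

```rust
// Hypothetical replacement for the missing full_line_str method:
// return the line of `source` that contains `byte_offset`, without
// the line terminator, handling CRLF as well as LF.
fn full_line_at(source: &str, byte_offset: usize) -> &str {
    // Mirror the existing offset.min(source.len()) clamping.
    let offset = byte_offset.min(source.len());
    // Line start: one past the previous '\n', or the start of the source.
    let start = source[..offset].rfind('\n').map_or(0, |i| i + 1);
    // Line end: the next '\n', or the end of the source.
    let end = source[offset..]
        .find('\n')
        .map_or(source.len(), |i| offset + i);
    let line = &source[start..end];
    // Trim a trailing '\r' so CRLF sources yield the line without it.
    line.strip_suffix('\r').unwrap_or(line)
}

fn main() {
    let src = "ab\r\ncd\nef";
    assert_eq!(full_line_at(src, 0), "ab"); // trailing '\r' trimmed
    assert_eq!(full_line_at(src, 5), "cd");
    assert_eq!(full_line_at(src, 8), "ef"); // last line, no terminator
}
```

Note this sketch assumes `byte_offset` lands on a char boundary, which holds when offsets come from summed line byte lengths; an offset mid-codepoint would panic on the slice.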
Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase. Uses ruff_python_parser for tokenization via TokenizerIter.
Replace per-line reparsing with single-pass tokenization:
- Read all lines via readline, parse once, yield tokens
- Fix token type values (COMMENT=65, NL=66, OP=55)
- Fix NEWLINE/NL end positions and implicit newline handling
- Fix DEDENT positions via look-ahead to next non-DEDENT token
- Handle FSTRING_MIDDLE brace unescaping ({{ → {, }} → })
- Emit implicit NL before ENDMARKER when source lacks trailing newline
- Raise IndentationError from lexer errors
- Remove 13 expectedFailure marks for now-passing tests
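The FSTRING_MIDDLE brace-unescaping step listed above ({{ → {, }} → }) is best done in a single left-to-right pass so runs of braces pair up correctly. The sketch below is illustrative, not the PR's actual code; `unescape_braces` is a hypothetical name.

```rust
// Single-pass unescaping of doubled braces in f-string middle text:
// "{{" becomes "{" and "}}" becomes "}", pairing left to right.
fn unescape_braces(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        if (c == '{' || c == '}') && chars.peek() == Some(&c) {
            chars.next(); // consume the second brace of the escaped pair
        }
        out.push(c);
    }
    out
}

fn main() {
    assert_eq!(unescape_braces("a {{b}} c"), "a {b} c");
    assert_eq!(unescape_braces("{{{{"), "{{");
    assert_eq!(unescape_braces("plain"), "plain");
}
```

A single pass avoids the pitfall of chained `str::replace` calls, where an earlier replacement can create a pair that a later replacement then collapses incorrectly.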
Force-pushed 042aed4 to 55737ed (Compare)
close #6240