Implement _tokenize and update tokenize from v3.14.3 #7392

youknowone merged 3 commits into RustPython:main from
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration: Path: .coderabbit.yml · Review profile: CHILL · Plan: Pro · Run ID:
⛔ Files ignored due to path filters (4)
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 Walkthrough

The PR adds a new Python-accessible tokenizer module (`_tokenize`).
Sequence Diagram

```mermaid
sequenceDiagram
    participant Python as Python Caller
    participant Iter as PyTokenizerIter
    participant Reader as readline Callback
    participant Parser as Token Parser
    participant TokenMgmt as Token State Manager
    Python->>Iter: __init__(readline, extra_tokens, encoding)
    Iter->>Iter: Initialize state (Reading phase)
    Python->>Iter: __next__()
    Iter->>Iter: Check current phase
    alt Reading Phase
        Iter->>Reader: Call readline callback
        Reader-->>Iter: Return line text
        Iter->>Parser: Parse line → tokens
        Parser-->>Iter: Return tokens
        Iter->>TokenMgmt: Queue pending tokens
        Iter->>Iter: Transition to Yielding phase
    end
    alt Yielding Phase
        Iter->>TokenMgmt: Fetch next token
        Iter->>Iter: Handle special cases (FSTRING_MIDDLE, Dedent, NL, EOF)
        Iter->>Iter: Build token tuple (type, text, start, end, line)
        Iter-->>Python: Return token tuple
    end
    alt Done Phase
        Iter-->>Python: Raise StopIteration
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 5 passed
📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

- [x] lib: cpython/Lib/tabnanny.py dependencies:
  dependent tests: (1 test)
- [x] lib: cpython/Lib/tokenize.py dependencies:
  dependent tests: (132 tests)
should rename to _tokenize, right?
ShaharNaveh left a comment

Way better than what I did
@ShaharNaveh this is obviously based on your work!
Actionable comments posted: 2
🧹 Nitpick comments (2)
crates/stdlib/src/_tokenize.rs (2)
396-419: Extract magic numbers 60 and 63 to named constants.

The function uses the magic numbers 60 (FSTRING_MIDDLE) and 63 (TSTRING_MIDDLE) directly. For consistency with the other token constants defined at the top of the module (lines 25-29), consider adding named constants.

♻️ Suggested refactor

Add to the constants section (after line 29):

```rust
const TOKEN_FSTRING_MIDDLE: u8 = 60;
const TOKEN_TSTRING_MIDDLE: u8 = 63;
```

Then update the function:

```diff
         TokenKind::FStringStart => {
             if depth == 0 {
-                return 60; // FSTRING_MIDDLE
+                return TOKEN_FSTRING_MIDDLE;
             }
             depth -= 1;
         }
         TokenKind::TStringStart => {
             if depth == 0 {
-                return 63; // TSTRING_MIDDLE
+                return TOKEN_TSTRING_MIDDLE;
             }
             depth -= 1;
         }
         _ => {}
     }
 }
-60 // default to FSTRING_MIDDLE
+TOKEN_FSTRING_MIDDLE // default
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/_tokenize.rs` around lines 396 - 419, The function find_fstring_middle_type currently returns magic numbers 60 and 63; define named constants (e.g., TOKEN_FSTRING_MIDDLE and TOKEN_TSTRING_MIDDLE) alongside the other token constants at the top of the module and replace the literal returns (and the final default 60) with those constants so the function returns TOKEN_FSTRING_MIDDLE or TOKEN_TSTRING_MIDDLE instead of raw numbers.
107-149: Consider the memory overhead of cloning TokenizerState.

The clone-modify-write pattern (line 109 reads/clones, line 140 writes back) avoids holding the lock during potentially slow readline() calls. However, in the Yielding phase, this clones the entire source string and token vectors on each next() call, which could be expensive for large files.

A more efficient approach would be to hold the write lock directly in the Yielding phase (since no external calls are made):

♻️ Suggested optimization (optional)

```diff
 impl IterNext for PyTokenizerIter {
     fn next(zelf: &Py<Self>, vm: &VirtualMachine) -> PyResult<PyIterReturn> {
-        let mut state = zelf.state.read().clone();
-        loop {
-            match &mut state.phase {
-                TokenizerPhase::Reading { source } => {
+        // Check phase without holding lock
+        let is_reading = matches!(&zelf.state.read().phase, TokenizerPhase::Reading { .. });
+
+        if is_reading {
+            let mut state = zelf.state.write();
+            if let TokenizerPhase::Reading { source } = &mut state.phase {
                 let line = zelf.readline(vm)?;
                 // ... rest of reading logic
-                }
-                TokenizerPhase::Yielding { .. } => {
-                    let result =
-                        emit_next_token(&mut state, zelf.extra_tokens, vm)?;
-                    *zelf.state.write() = state;
-                    return Ok(result);
             }
+        } else {
+            let mut state = zelf.state.write();
+            return emit_next_token(&mut state, zelf.extra_tokens, vm);
         }
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/_tokenize.rs` around lines 107 - 149, The current PyTokenizerIter::next clones the entire TokenizerState at the start (zelf.state.read().clone()), which causes repeated heavy clones when in TokenizerPhase::Yielding; instead, only clone/read for the Reading branch but for the Yielding branch acquire a write lock on zelf.state (i.e., get a write guard) and call emit_next_token with a mutable reference into that guarded state (TokenizerPhase::Yielding) so you avoid cloning source/tokens on each next(); keep the readline path unchanged (read/clone/modify/write) but switch to holding the write lock inside the TokenizerPhase::Yielding arm before calling emit_next_token and returning its PyIterReturn.
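The trade-off this comment describes can be reduced to a minimal, self-contained sketch using std::sync::RwLock. `State`, `next_token_cloning`, and `next_token_locked` are hypothetical stand-ins for the module's real types, not its actual API.

```rust
use std::sync::RwLock;

#[derive(Clone)]
struct State {
    tokens: Vec<String>,
    pos: usize,
}

// Clone-modify-write: clone under a short read lock so no lock is held
// while a (potentially slow) external callback would run. The cost is
// cloning the whole state on every call.
fn next_token_cloning(state: &RwLock<State>) -> Option<String> {
    let mut local = state.read().unwrap().clone();
    let tok = local.tokens.get(local.pos).cloned();
    local.pos += 1;
    *state.write().unwrap() = local; // write the whole state back
    tok
}

// Direct write lock: safe when no external code is called while the
// guard is held, and avoids cloning tokens and source on each call.
fn next_token_locked(state: &RwLock<State>) -> Option<String> {
    let mut guard = state.write().unwrap();
    let tok = guard.tokens.get(guard.pos).cloned();
    guard.pos += 1;
    tok
}

fn main() {
    let state = RwLock::new(State {
        tokens: vec!["a".into(), "b".into()],
        pos: 0,
    });
    assert_eq!(next_token_cloning(&state), Some("a".to_string()));
    assert_eq!(next_token_locked(&state), Some("b".to_string()));
    assert_eq!(next_token_locked(&state), None);
}
```

The design point is the one the review makes: cloning is only justified while the lock must not be held across external calls (the readline path); the yielding path makes no external calls, so it can hold the write guard directly.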
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/stdlib/src/_tokenize.rs`:
- Around line 742-744: The match currently returns 0 (ENDMARKER) for
TokenKind::Unknown (and leaves IpyEscapeCommand and Question as 0); change these
arms so they map to ERRORTOKEN (numeric 67 or the appropriate ERRORTOKEN
constant) instead of 0: update the match handling for TokenKind::Unknown,
TokenKind::IpyEscapeCommand, and TokenKind::Question in the tokenizer function
to return 67 (or token::ERRORTOKEN) so unrecognized/IPython-specific tokens
produce ERRORTOKEN rather than ENDMARKER.
- Around line 189-199: The call to source.full_line_str(...) is invalid; replace
it with a small helper function (e.g., full_line_at(source: &str, byte_offset:
usize) -> &str or String) and call that from the tokenization code (replace
source.full_line_str(TextSize::from(...)) with full_line_at(source,
offset_min)). Implement full_line_at to locate the start and end byte indices
around byte_offset by scanning source bytes for the previous '\n' (or start) and
next '\n' (or end), trimming a trailing '\r' if present so CRLF lines return the
line without '\r'; keep using TextSize::from(offset.min(source.len()) as u32) to
compute the offset passed from the existing code. Also fix the offset
calculation (where you currently do .map(|l| l.len() + 1)) to sum byte lengths
including the actual newline sequence length for each consumed line (detect
'\r\n' vs '\n') or instead compute the byte offset by scanning source bytes for
the (sl-1)-th newline to avoid assuming single-byte line endings; update all
places calling full_line_str to use the new full_line_at helper.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 51c73337-da3f-40a7-a355-8dedaf682e2d
⛔ Files ignored due to path filters (4)
- Cargo.lock is excluded by !**/*.lock
- Lib/test/test_tabnanny.py is excluded by !Lib/**
- Lib/test/test_tokenize.py is excluded by !Lib/**
- Lib/tokenize.py is excluded by !Lib/**
📒 Files selected for processing (3)
- crates/stdlib/Cargo.toml
- crates/stdlib/src/_tokenize.rs
- crates/stdlib/src/lib.rs
```rust
let offset: usize = source
    .lines()
    .take(sl.saturating_sub(1))
    .map(|l| l.len() + 1)
    .sum();
let full_line =
    source.full_line_str(TextSize::from(offset.min(source.len()) as u32));
return Ok(PyIterReturn::Return(make_token_tuple(
    vm, tok_type, &tok_str, sl, sc as isize, el, ec as isize, full_line,
)));
```
full_line_str method does not exist and will cause compilation failure.
The code calls source.full_line_str(TextSize) at multiple locations (lines 195, 288, 351, 383, 437, 472), but this method is not defined on String or &str types in the codebase. This will prevent compilation. Either implement this method as an extension trait or replace these calls with an alternative approach to extract line content at the given offset.
The secondary concern about CRLF line ending handling in the offset calculation (lines 189-193) also applies: .map(|l| l.len() + 1) assumes single-byte (\n) line endings and will miscalculate for CRLF (\r\n) sources.
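A helper along the lines this review proposes might look like the sketch below. `full_line_at` is a hypothetical name; the real fix would also need the line-offset calculation corrected as noted above. It returns the full line containing a byte offset and tolerates both "\n" and "\r\n" endings.

```rust
// Hypothetical replacement for the missing full_line_str method:
// return the line of `source` that contains `byte_offset`, without
// the line terminator, handling CRLF as well as LF.
fn full_line_at(source: &str, byte_offset: usize) -> &str {
    // Mirror the existing offset.min(source.len()) clamping.
    let offset = byte_offset.min(source.len());
    // Line start: one past the previous '\n', or the start of the source.
    let start = source[..offset].rfind('\n').map_or(0, |i| i + 1);
    // Line end: the next '\n', or the end of the source.
    let end = source[offset..]
        .find('\n')
        .map_or(source.len(), |i| offset + i);
    let line = &source[start..end];
    // Trim a trailing '\r' so CRLF sources yield the line without it.
    line.strip_suffix('\r').unwrap_or(line)
}

fn main() {
    let src = "ab\r\ncd\nef";
    assert_eq!(full_line_at(src, 0), "ab"); // trailing '\r' trimmed
    assert_eq!(full_line_at(src, 5), "cd");
    assert_eq!(full_line_at(src, 8), "ef"); // last line, no terminator
}
```

Note this sketch assumes `byte_offset` lands on a char boundary, which holds when offsets come from summed line byte lengths; an offset mid-codepoint would panic on the slice.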
Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase. Uses ruff_python_parser for tokenization via TokenizerIter.
Replace per-line reparsing with single-pass tokenization:
- Read all lines via readline, parse once, yield tokens
- Fix token type values (COMMENT=65, NL=66, OP=55)
- Fix NEWLINE/NL end positions and implicit newline handling
- Fix DEDENT positions via look-ahead to next non-DEDENT token
- Handle FSTRING_MIDDLE brace unescaping ({{ → {, }} → })
- Emit implicit NL before ENDMARKER when source lacks trailing newline
- Raise IndentationError from lexer errors
- Remove 13 expectedFailure marks for now-passing tests
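The FSTRING_MIDDLE brace-unescaping step listed above ({{ → {, }} → }) is best done in a single left-to-right pass so runs of braces pair up correctly. The sketch below is illustrative, not the PR's actual code; `unescape_braces` is a hypothetical name.

```rust
// Single-pass unescaping of doubled braces in f-string middle text:
// "{{" becomes "{" and "}}" becomes "}", pairing left to right.
fn unescape_braces(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        if (c == '{' || c == '}') && chars.peek() == Some(&c) {
            chars.next(); // consume the second brace of the escaped pair
        }
        out.push(c);
    }
    out
}

fn main() {
    assert_eq!(unescape_braces("a {{b}} c"), "a {b} c");
    assert_eq!(unescape_braces("{{{{"), "{{");
    assert_eq!(unescape_braces("plain"), "plain");
}
```

A single pass avoids the pitfall of chained `str::replace` calls, where an earlier replacement can create a pair that a later replacement then collapses incorrectly.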
Force-pushed 042aed4 to 55737ed (Compare)
close #6240