
Implement _tokenize and update tokenize from v3.14.3#7392

Merged
youknowone merged 3 commits into RustPython:main from youknowone:update-tokenize
Mar 10, 2026

Conversation

@youknowone
Member

@youknowone youknowone commented Mar 9, 2026

Closes #6240

Summary by CodeRabbit

  • New Features

    • Added a tokenizer accessible from Python that iterates token streams, with f-string handling, encoding awareness, and implicit-newline and indentation error reporting.
  • Chores

    • Added workspace dependencies for Ruff-related crates (parser, AST, text-size, source-file) to support the stdlib build.
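For readers unfamiliar with the token-tuple shape the new module must produce, here is a quick illustration using the documented, stable `tokenize.generate_tokens` API (the pure-Python front end that `_tokenize` backs):

```python
import io
import tokenize

# Tokenize a one-line assignment; each item is a TokenInfo 5-tuple:
# (type, string, start, end, line).
src = "x = 1\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

On a conforming implementation this yields NAME, OP, NUMBER, NEWLINE, and ENDMARKER tokens, in that order.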

@youknowone youknowone requested a review from ShaharNaveh March 9, 2026 13:56
@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 0f32219c-d6cf-4084-8ad8-5c73335a6b68

📥 Commits

Reviewing files that changed from the base of the PR and between 042aed4 and 55737ed.

⛔ Files ignored due to path filters (4)
  • Cargo.lock is excluded by !**/*.lock
  • Lib/test/test_tabnanny.py is excluded by !Lib/**
  • Lib/test/test_tokenize.py is excluded by !Lib/**
  • Lib/tokenize.py is excluded by !Lib/**
📒 Files selected for processing (3)
  • crates/stdlib/Cargo.toml
  • crates/stdlib/src/_tokenize.rs
  • crates/stdlib/src/lib.rs
🚧 Files skipped from review as they are similar to previous changes (3)
  • crates/stdlib/src/lib.rs
  • crates/stdlib/Cargo.toml
  • crates/stdlib/src/_tokenize.rs

📝 Walkthrough

Walkthrough

The PR adds a new Python-accessible tokenizer module (_tokenize) implemented in Rust, registers it in stdlib, and adds four Ruff workspace dependencies to crates/stdlib/Cargo.toml. The tokenizer provides a TokenizerIter exposing Python iterator semantics with f-string, indentation, and EOF handling.

Changes

  • Dependency Management (crates/stdlib/Cargo.toml): Added four workspace dependencies: ruff_python_parser, ruff_python_ast, ruff_text_size, ruff_source_file.
  • Tokenizer Implementation (crates/stdlib/src/_tokenize.rs): New Rust module (~747 lines) adding PyTokenizerIter (Python class TokenizerIter) with the iterator protocol, multi-phase state (Reading, Yielding, Done), f-string splitting/handling, implicit-NL/EOF logic, indentation/syntax error reporting, and token tuple construction.
  • Module Registration (crates/stdlib/src/lib.rs): Declared mod _tokenize; and added _tokenize::module_def(ctx) to the stdlib module registrations.

Sequence Diagram

sequenceDiagram
    participant Python as Python Caller
    participant Iter as PyTokenizerIter
    participant Reader as readline Callback
    participant Parser as Token Parser
    participant TokenMgmt as Token State Manager

    Python->>Iter: __init__(readline, extra_tokens, encoding)
    Iter->>Iter: Initialize state (Reading phase)
    
    Python->>Iter: __next__()
    Iter->>Iter: Check current phase
    
    alt Reading Phase
        Iter->>Reader: Call readline callback
        Reader-->>Iter: Return line text
        Iter->>Parser: Parse line → tokens
        Parser-->>Iter: Return tokens
        Iter->>TokenMgmt: Queue pending tokens
        Iter->>Iter: Transition to Yielding phase
    end
    
    alt Yielding Phase
        Iter->>TokenMgmt: Fetch next token
        Iter->>Iter: Handle special cases (FSTRING_MIDDLE, Dedent, NL, EOF)
        Iter->>Iter: Build token tuple (type, text, start, end, line)
        Iter-->>Python: Return token tuple
    end
    
    alt Done Phase
        Iter-->>Python: Raise StopIteration
    end
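The phase flow in the diagram can be sketched in miniature; the class and method names below are illustrative stand-ins, not the actual Rust types in crates/stdlib/src/_tokenize.rs:

```python
class TokenizerIterSketch:
    """Illustrative Reading -> Yielding -> Done phase machine."""

    def __init__(self, readline):
        self.readline = readline   # user callback returning "" at EOF
        self.pending = []          # queued tokens awaiting emission
        self.phase = "reading"

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            if self.phase == "reading":
                # Drain the readline callback, then parse the source once.
                lines = []
                while line := self.readline():
                    lines.append(line)
                self.pending = self._parse("".join(lines))
                self.phase = "yielding"
            elif self.phase == "yielding":
                if self.pending:
                    return self.pending.pop(0)
                self.phase = "done"
            else:
                raise StopIteration

    def _parse(self, source):
        # Stand-in for the real parser: whitespace-separated words.
        return source.split()
```

Iterating an instance drains the readline callback once, then yields each queued item until exhaustion, after which StopIteration is raised, mirroring the three phases above.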

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 I hopped through code with tiny feet,
A tokenizer tune I learned to beat,
F-strings twirled and tokens spun,
Readline sang until the job was done.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly describes the main changes: implementing the _tokenize module and updating tokenize from v3.14.3.
  • Linked Issues Check: ✅ Passed. All coding requirements from #6240 are met: the new _tokenize module is implemented with a Tokenizer iterator yielding token tuples, accepting a readline callable and optional encoding, and internal dependencies are updated.
  • Out of Scope Changes Check: ✅ Passed. All changes directly support the implementation of the _tokenize module and its dependencies as required by #6240; no unrelated changes were detected.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions bot commented Mar 9, 2026

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] lib: cpython/Lib/tabnanny.py
[ ] test: cpython/Lib/test/test_tabnanny.py (TODO: 4)

dependencies:

  • tabnanny

dependent tests: (1 tests)

  • tabnanny: test_tabnanny

[x] lib: cpython/Lib/tokenize.py
[x] test: cpython/Lib/test/test_tokenize.py (TODO: 7)

dependencies:

  • tokenize

dependent tests: (132 tests)

  • tokenize: test_inspect test_linecache test_tabnanny test_tokenize test_unparse
    • importlib._bootstrap_external: test_importlib test_unittest
      • py_compile: test_argparse test_cmd_line_script test_compileall test_importlib test_multiprocessing_main_handling test_py_compile test_pydoc test_runpy
      • pydoc: test_enum
    • inspect: test_abc test_asyncgen test_buffer test_builtin test_code test_collections test_coroutines test_decimal test_functools test_generators test_grammar test_monitoring test_ntpath test_operator test_patma test_posixpath test_signal test_sqlite3 test_traceback test_type_annotations test_types test_typing test_unittest test_yield_from test_zipimport test_zoneinfo
      • ast: test_ast test_compile test_compiler_codegen test_dis test_fstring test_future_stmt test_site test_ssl test_type_comments test_ucn
      • bdb: test_bdb
      • cmd: test_cmd
      • dataclasses: test__colorize test_copy test_ctypes test_genericalias test_pprint test_regrtest
      • pkgutil: test_pkgutil
      • rlcompleter: test_rlcompleter
      • trace: test_trace
      • xmlrpc.server: test_docxmlrpc test_xmlrpc
    • linecache:
      • timeit: test_timeit
      • traceback: test_asyncio test_code_module test_contextlib test_contextlib_async test_dictcomps test_exceptions test_http_cookiejar test_importlib test_iter test_listcomps test_pyexpat test_setcomps test_socket test_subprocess test_sys test_threadedtempfile test_threading test_unittest test_with
    • traceback:
      • concurrent.futures.process: test_concurrent_futures
      • http.cookiejar: test_urllib2
      • logging: test_asyncio test_hashlib test_logging test_support test_urllib2net
      • multiprocessing: test_asyncio test_concurrent_futures test_fcntl test_memoryview test_re
      • socketserver: test_imaplib test_socketserver test_wsgiref
      • threading: test_android test_asyncio test_bz2 test_concurrent_futures test_ctypes test_email test_fork1 test_frame test_ftplib test_gc test_httplib test_httpservers test_importlib test_io test_itertools test_largefile test_opcache test_pathlib test_poll test_queue test_robotparser test_sched test_smtplib test_super test_syslog test_termios test_threading_local test_time test_urllib2_localnet test_weakref test_winreg test_zstd

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated


should rename to _tokenize, right?


@ShaharNaveh ShaharNaveh left a comment


Way better than what I did

@youknowone
Member Author

@ShaharNaveh this is obviously based on your work!

@youknowone youknowone marked this pull request as ready for review March 10, 2026 02:14
@youknowone youknowone changed the title from "Implement _tokenize" to "Implement _tokenize and update tokenize from v3.14.3" on Mar 10, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
crates/stdlib/src/_tokenize.rs (2)

396-419: Extract magic numbers 60 and 63 to named constants.

The function uses magic numbers 60 (FSTRING_MIDDLE) and 63 (TSTRING_MIDDLE) directly. For consistency with the other token constants defined at the top of the module (lines 25-29), consider adding named constants.

♻️ Suggested refactor

Add to the constants section (after line 29):

const TOKEN_FSTRING_MIDDLE: u8 = 60;
const TOKEN_TSTRING_MIDDLE: u8 = 63;

Then update the function:

                 TokenKind::FStringStart => {
                     if depth == 0 {
-                        return 60; // FSTRING_MIDDLE
+                        return TOKEN_FSTRING_MIDDLE;
                     }
                     depth -= 1;
                 }
                 TokenKind::TStringStart => {
                     if depth == 0 {
-                        return 63; // TSTRING_MIDDLE
+                        return TOKEN_TSTRING_MIDDLE;
                     }
                     depth -= 1;
                 }
                 _ => {}
             }
         }
-        60 // default to FSTRING_MIDDLE
+        TOKEN_FSTRING_MIDDLE // default
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/_tokenize.rs` around lines 396 - 419, The function
find_fstring_middle_type currently returns magic numbers 60 and 63; define named
constants (e.g., TOKEN_FSTRING_MIDDLE and TOKEN_TSTRING_MIDDLE) alongside the
other token constants at the top of the module and replace the literal returns
(and the final default 60) with those constants so the function returns
TOKEN_FSTRING_MIDDLE or TOKEN_TSTRING_MIDDLE instead of raw numbers.

107-149: Consider the memory overhead of cloning TokenizerState.

The clone-modify-write pattern (line 109 reads/clones, line 140 writes back) avoids holding the lock during potentially slow readline() calls. However, in the Yielding phase, this clones the entire source string and token vectors on each next() call, which could be expensive for large files.

A more efficient approach would be to hold the write lock directly in the Yielding phase (since no external calls are made):

♻️ Suggested optimization (optional)
     impl IterNext for PyTokenizerIter {
         fn next(zelf: &Py<Self>, vm: &VirtualMachine) -> PyResult<PyIterReturn> {
-            let mut state = zelf.state.read().clone();
-
             loop {
-                match &mut state.phase {
-                    TokenizerPhase::Reading { source } => {
+                // Check phase without holding lock
+                let is_reading = matches!(&zelf.state.read().phase, TokenizerPhase::Reading { .. });
+                
+                if is_reading {
+                    let mut state = zelf.state.write();
+                    if let TokenizerPhase::Reading { source } = &mut state.phase {
                         let line = zelf.readline(vm)?;
                         // ... rest of reading logic
-                    }
-                    TokenizerPhase::Yielding { .. } => {
-                        let result =
-                            emit_next_token(&mut state, zelf.extra_tokens, vm)?;
-                        *zelf.state.write() = state;
-                        return Ok(result);
                     }
+                } else {
+                    let mut state = zelf.state.write();
+                    return emit_next_token(&mut state, zelf.extra_tokens, vm);
                 }
             }
         }
     }
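The hazard the clone-modify-write pattern avoids can be shown in miniature in Python: holding a non-reentrant lock across an arbitrary user callback invites deadlock if the callback re-enters the tokenizer. This is an illustrative sketch, not the RwLock code from the PR:

```python
import threading

lock = threading.Lock()

def next_token_holding_lock(readline):
    # Anti-pattern: the lock stays held across a user-supplied callback.
    with lock:
        return readline()

def reentrant_callback():
    # Simulates a callback that re-enters the tokenizer; a blocking
    # acquire here would deadlock, so probe non-blockingly instead.
    if lock.acquire(blocking=False):
        lock.release()
        return "no contention\n"
    return "would deadlock\n"

print(next_token_holding_lock(reentrant_callback))  # the lock is held, so the probe fails
```

Cloning the state before calling readline (as the PR does) keeps the lock out of the callback's dynamic extent; the suggested optimization only takes the write lock in the Yielding phase, where no external call occurs.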
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/_tokenize.rs` around lines 107 - 149, The current
PyTokenizerIter::next clones the entire TokenizerState at the start
(zelf.state.read().clone()), which causes repeated heavy clones when in
TokenizerPhase::Yielding; instead, only clone/read for the Reading branch but
for the Yielding branch acquire a write lock on zelf.state (i.e., get a write
guard) and call emit_next_token with a mutable reference into that guarded state
(TokenizerPhase::Yielding) so you avoid cloning source/tokens on each next();
keep the readline path unchanged (read/clone/modify/write) but switch to holding
the write lock inside the TokenizerPhase::Yielding arm before calling
emit_next_token and returning its PyIterReturn.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/stdlib/src/_tokenize.rs`:
- Around line 742-744: The match currently returns 0 (ENDMARKER) for
TokenKind::Unknown (and leaves IpyEscapeCommand and Question as 0); change these
arms so they map to ERRORTOKEN (numeric 67 or the appropriate ERRORTOKEN
constant) instead of 0: update the match handling for TokenKind::Unknown,
TokenKind::IpyEscapeCommand, and TokenKind::Question in the tokenizer function
to return 67 (or token::ERRORTOKEN) so unrecognized/IPython-specific tokens
produce ERRORTOKEN rather than ENDMARKER.
- Around line 189-199: The call to source.full_line_str(...) is invalid; replace
it with a small helper function (e.g., full_line_at(source: &str, byte_offset:
usize) -> &str or String) and call that from the tokenization code (replace
source.full_line_str(TextSize::from(...)) with full_line_at(source,
offset_min)). Implement full_line_at to locate the start and end byte indices
around byte_offset by scanning source bytes for the previous '\n' (or start) and
next '\n' (or end), trimming a trailing '\r' if present so CRLF lines return the
line without '\r'; keep using TextSize::from(offset.min(source.len()) as u32) to
compute the offset passed from the existing code. Also fix the offset
calculation (where you currently do .map(|l| l.len() + 1)) to sum byte lengths
including the actual newline sequence length for each consumed line (detect
'\r\n' vs '\n') or instead compute the byte offset by scanning source bytes for
the (sl-1)-th newline to avoid assuming single-byte line endings; update all
places calling full_line_str to use the new full_line_at helper.

---

Nitpick comments:
In `@crates/stdlib/src/_tokenize.rs`:
- Around line 396-419: The function find_fstring_middle_type currently returns
magic numbers 60 and 63; define named constants (e.g., TOKEN_FSTRING_MIDDLE and
TOKEN_TSTRING_MIDDLE) alongside the other token constants at the top of the
module and replace the literal returns (and the final default 60) with those
constants so the function returns TOKEN_FSTRING_MIDDLE or TOKEN_TSTRING_MIDDLE
instead of raw numbers.
- Around line 107-149: The current PyTokenizerIter::next clones the entire
TokenizerState at the start (zelf.state.read().clone()), which causes repeated
heavy clones when in TokenizerPhase::Yielding; instead, only clone/read for the
Reading branch but for the Yielding branch acquire a write lock on zelf.state
(i.e., get a write guard) and call emit_next_token with a mutable reference into
that guarded state (TokenizerPhase::Yielding) so you avoid cloning source/tokens
on each next(); keep the readline path unchanged (read/clone/modify/write) but
switch to holding the write lock inside the TokenizerPhase::Yielding arm before
calling emit_next_token and returning its PyIterReturn.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 51c73337-da3f-40a7-a355-8dedaf682e2d

📥 Commits

Reviewing files that changed from the base of the PR and between 2266ba7 and 042aed4.

⛔ Files ignored due to path filters (4)
  • Cargo.lock is excluded by !**/*.lock
  • Lib/test/test_tabnanny.py is excluded by !Lib/**
  • Lib/test/test_tokenize.py is excluded by !Lib/**
  • Lib/tokenize.py is excluded by !Lib/**
📒 Files selected for processing (3)
  • crates/stdlib/Cargo.toml
  • crates/stdlib/src/_tokenize.rs
  • crates/stdlib/src/lib.rs

Comment on lines +189 to +199
let offset: usize = source
.lines()
.take(sl.saturating_sub(1))
.map(|l| l.len() + 1)
.sum();
let full_line =
source.full_line_str(TextSize::from(offset.min(source.len()) as u32));
return Ok(PyIterReturn::Return(make_token_tuple(
vm, tok_type, &tok_str, sl, sc as isize, el, ec as isize, full_line,
)));
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if line endings are normalized elsewhere in the codebase
rg -n "replace.*\\\\r\\\\n|normalize.*line" --type=rs crates/stdlib/src/ -C2

# Check how other modules handle line counting
ast-grep --pattern $'source.lines().$$$'

Repository: RustPython/RustPython

Roughly forty further scripts were executed along the same lines: searching the module and the wider repository for a full_line_str definition or extension trait, inspecting the surrounding function and struct context, checking the ruff_source_file and ruff_text_size dependencies, reviewing the file's git history, and running cargo check. None of them located a definition of full_line_str.
full_line_str method does not exist and will cause compilation failure.

The code calls source.full_line_str(TextSize) at multiple locations (lines 195, 288, 351, 383, 437, 472), but this method is not defined on String or &str types in the codebase. This will prevent compilation. Either implement this method as an extension trait or replace these calls with an alternative approach to extract line content at the given offset.

The secondary concern about CRLF line ending handling in the offset calculation (lines 189-193) also applies: .map(|l| l.len() + 1) assumes single-byte (\n) line endings and will miscalculate for CRLF (\r\n) sources.
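The CRLF pitfall is easy to demonstrate. Python's splitlines, like Rust's str::lines, strips the line terminator, so adding a fixed 1 per line undercounts CRLF sources:

```python
src = "a = 1\r\nb = 2\r\n"

# Naive byte offset of line 2, assuming a one-byte "\n" terminator:
naive = sum(len(line) + 1 for line in src.splitlines()[:1])

# The real byte offset of line 2:
actual = src.index("b")

print(naive, actual)  # off by one per preceding CRLF line
```

Each preceding CRLF line shifts the computed offset one byte short, which is exactly the miscalculation flagged above.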

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/_tokenize.rs` around lines 189 - 199, The call to
source.full_line_str(...) is invalid; replace it with a small helper function
(e.g., full_line_at(source: &str, byte_offset: usize) -> &str or String) and
call that from the tokenization code (replace
source.full_line_str(TextSize::from(...)) with full_line_at(source,
offset_min)). Implement full_line_at to locate the start and end byte indices
around byte_offset by scanning source bytes for the previous '\n' (or start) and
next '\n' (or end), trimming a trailing '\r' if present so CRLF lines return the
line without '\r'; keep using TextSize::from(offset.min(source.len()) as u32) to
compute the offset passed from the existing code. Also fix the offset
calculation (where you currently do .map(|l| l.len() + 1)) to sum byte lengths
including the actual newline sequence length for each consumed line (detect
'\r\n' vs '\n') or instead compute the byte offset by scanning source bytes for
the (sl-1)-th newline to avoid assuming single-byte line endings; update all
places calling full_line_str to use the new full_line_at helper.

ShaharNaveh and others added 3 commits March 10, 2026 11:28
Port from PR RustPython#6240 by ShaharNaveh, adapted to current codebase.
Uses ruff_python_parser for tokenization via TokenizerIter.
Replace per-line reparsing with single-pass tokenization:
- Read all lines via readline, parse once, yield tokens
- Fix token type values (COMMENT=65, NL=66, OP=55)
- Fix NEWLINE/NL end positions and implicit newline handling
- Fix DEDENT positions via look-ahead to next non-DEDENT token
- Handle FSTRING_MIDDLE brace unescaping ({{ → {, }} → })
- Emit implicit NL before ENDMARKER when source lacks trailing newline
- Raise IndentationError from lexer errors
- Remove 13 expectedFailure marks for now-passing tests
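The brace-unescaping commit above refers to standard f-string escaping, where doubled braces denote literal braces; a minimal check:

```python
# Doubled braces in an f-string are escapes for literal braces,
# which is why FSTRING_MIDDLE text needs {{ -> { and }} -> } handling.
assert f"{{x}}" == "{x}"
assert f"{{{1 + 1}}}" == "{2}"
print("brace escapes behave as expected")
```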
@youknowone youknowone merged commit e645761 into RustPython:main Mar 10, 2026
13 of 14 checks passed
@youknowone youknowone deleted the update-tokenize branch March 10, 2026 03:27