X Tutup
The Wayback Machine - https://web.archive.org/web/20251210232534/https://github.com/python/cpython/issues/109638
Skip to content

csv.Sniffer regular expression has significant backtracing #109638

@sg3-141-592

Description

@sg3-141-592

Bug report

Bug description:

You can pass strings to csv.Sniffer that can generate significant Regex backtracing and processing time. For example

import csv
import time
 
NUM_ITERATIONS=200
# Example test str "","",""0""0
test_str = '"",'*NUM_ITERATIONS + '"'*NUM_ITERATIONS + '0' + '"'*NUM_ITERATIONS + '0'
print(test_str)
 
t0 = time.time()
 
dialect = csv.Sniffer().sniff(test_str)
 
t1 = time.time()
print(f"{t1-t0}")

Some example runs

NUM_ITERATIONS Running Time (seconds)
1              0.0008
10             0.0030
100            254.32

I've checked against different versions of Python and they all return similar results.

For input NUM_ITERATIONS 100 above
Version       : Running Time (Seconds)
Python 3.8.16 : 254
Python 3.9.16 : 250
Python 3.10.9 : 319
Python 3.11.1 : 236

This issue lies in this Regex for finding double quoted format

r"((%(delim)s)|^)\W*%(quote)s[^%(delim)s\n]*%(quote)s[^%(delim)s\n]*%(quote)s\W*((%(delim)s)|$)" % \

I've done some testing and a zero length lookahead assertion (or atomic group) you can get a significant performance improvement

r"((%(delim)s)|^)\W*%(quote)s(?=(?P<zero>[^%(delim)s%(quote)s\n]*))(?P=zero)%(quote)s[^%(delim)s\n]*%(quote)s\W*((%(delim)s)|$)" % \

image

CPython versions tested on:

3.8, 3.9, 3.10, 3.11

Operating systems tested on:

Linux

Linked PRs

Metadata

Metadata

Labels

3.13bugs and security fixesperformancePerformance or resource usagestdlibStandard Library Python modules in the Lib/ directorytype-bugAn unexpected behavior, bug, or error

Projects

Status

No status

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions

    X Tutup