This article is a look into the performance of one of the regular expressions used in the python-markdown2 Python module for converting Markdown syntax to HTML. It was initially written for pure fun, and in celebration of its own pointlessness, but eventually the changes proposed here made it upstream in pull request 207.
Replace tabs with spaces
A detour to Perl
The initial post in that thread was replacing tabs like this:
That code misses one point: if there is any text before a tab, it simply adds four spaces after that text. However, that is not how tabs work. Instead, just enough spaces should be added so that the length of the preceding text plus the added spaces reaches the next multiple of four. So, the suggested substitution in Perl becomes:
There are two flags used there: g applies the substitution to all matches of the left pattern, (.*?)\t. Without that flag, only the first match would be processed. The second flag, e, forces the replacement, $1.(' ' x (4-length($1)%4)), to be evaluated as an expression itself. Without this flag, the second part would be handled as a raw string.
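To make the padding arithmetic concrete, here is a small Python sketch of the same 4 - length % 4 calculation. The names TAB_WIDTH and padding_for are made up for this illustration and do not come from the Perl thread or from python-markdown2.

```python
TAB_WIDTH = 4  # assumed tab stop width, as in the Perl substitution


def padding_for(prefix, tab_width=TAB_WIDTH):
    """Return how many spaces a tab adds after `prefix` to reach the next tab stop."""
    return tab_width - len(prefix) % tab_width


for prefix in ["", "a", "ab", "abc", "abcd"]:
    expanded = prefix + " " * padding_for(prefix)
    print(repr(prefix), "->", padding_for(prefix), "spaces, total length", len(expanded))
```

Note that a prefix whose length is already a multiple of four still gets a full four spaces, which is what a tab does at a tab stop.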
Back to Python
Here is the Python code, cleaned up a little:
The _detab_re object is a compiled regular expression object, built with the same pattern as the one used in the Perl example, and with the multiline flag (re.M). You can test this out at RegExr. The subn() method of that object is called in the last line. It takes two parameters: the _detab_sub() function, and the text to be processed. For every match of the pattern, _detab_sub() is called, and the match object is passed to it for processing. Finally, subn() returns a tuple with the text with the pattern substituted, and the number of substitutions that happened. From that result, only the text is kept, by taking the first element of the tuple, which seems a bit redundant, since the sub() method would return the text directly.
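The description above can be sketched as follows. This is a reconstruction from the text, not a copy of the python-markdown2 source: it assumes a tab width of 4 and uses plain functions, and the names detab_subn and detab_sub are hypothetical. Only _detab_re and _detab_sub are taken from the description.

```python
import re

TAB_WIDTH = 4  # assumption: tab stops every four columns

# Same pattern as in the Perl example: lazily capture everything up to a tab.
# re.M is kept because the text mentions it; it only changes ^ and $,
# which this pattern does not use.
_detab_re = re.compile(r"(.*?)\t", re.M)


def _detab_sub(match):
    """Replace one 'prefix + tab' match with the prefix padded to the next tab stop."""
    prefix = match.group(1)
    return prefix + " " * (TAB_WIDTH - len(prefix) % TAB_WIDTH)


def detab_subn(text):
    # subn() returns (new_text, number_of_substitutions); keep only the text.
    return _detab_re.subn(_detab_sub, text)[0]


def detab_sub(text):
    # sub() returns the new text directly, so no indexing is needed.
    return _detab_re.sub(_detab_sub, text)


sample = "a\tb\n\tindented\tx"
print(repr(detab_subn(sample)))
print(detab_subn(sample) == detab_sub(sample))
```

Both functions give the same text; the only difference is that subn() also reports how many substitutions were made.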
No regular expressions please
Here is a Python snippet that does the same thing as the previous one, without using regular expressions:
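One possible way to write it, as a sketch under the same assumption of a tab width of 4, is to split each line on tabs and pad as the pieces are joined again. The function names detab_line and detab are chosen for this illustration and may differ from what ended up in python-markdown2.

```python
TAB_WIDTH = 4  # assumption: tab stops every four columns


def detab_line(line, tab_width=TAB_WIDTH):
    """Expand the tabs in a single line, without regular expressions."""
    chunks = line.split("\t")
    result = ""
    # Every chunk except the last one was followed by a tab in the input.
    for chunk in chunks[:-1]:
        result += chunk
        result += " " * (tab_width - len(result) % tab_width)
    return result + chunks[-1]


def detab(text, tab_width=TAB_WIDTH):
    """Expand the tabs in a multi-line text, line by line."""
    if "\t" not in text:
        return text  # nothing to do, skip the splitting entirely
    return "\n".join(detab_line(line, tab_width) for line in text.split("\n"))


print(repr(detab("a\tb\n\tindented\tx")))
```

For text that only uses "\n" as a line separator, the built-in str.expandtabs(4) should give the same result, which makes it a handy cross-check.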
In the previous article on regular expressions in python-markdown2 I dismissed the difference between a regular expression substitution and a substring substitution with str.replace() as being negligible, but in this case it seems that it is more substantial. This simple example already indicates some difference:
To test a larger example, I took this version of the source code of bzip2 which uses three spaces for indentation, and made some substitutions in it:
Timing test with this file:
That is a significant difference: not using regular expressions makes the process about 8 times faster.
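A comparison of this kind can be repeated with the timeit module. The sketch below is only a harness: it uses a synthetic input instead of the bzip2 source, both implementations are reconstructions with hypothetical names (detab_regex, detab_split, compare), and the timings it prints will depend on the machine and the Python version.

```python
import re
import timeit

TAB_WIDTH = 4  # assumption: tab stops every four columns
_detab_re = re.compile(r"(.*?)\t", re.M)


def detab_regex(text):
    """Regular expression version: pad each 'prefix + tab' match."""
    def repl(match):
        prefix = match.group(1)
        return prefix + " " * (TAB_WIDTH - len(prefix) % TAB_WIDTH)
    return _detab_re.sub(repl, text)


def detab_split(text):
    """Version without regular expressions: split on tabs, pad while joining."""
    lines = []
    for line in text.split("\n"):
        chunks = line.split("\t")
        out = ""
        for chunk in chunks[:-1]:
            out += chunk
            out += " " * (TAB_WIDTH - len(out) % TAB_WIDTH)
        lines.append(out + chunks[-1])
    return "\n".join(lines)


def compare(text, number=10):
    """Return (regex_seconds, split_seconds) for `number` runs of each version."""
    # Make sure both versions agree before timing them.
    assert detab_regex(text) == detab_split(text)
    t_regex = timeit.timeit(lambda: detab_regex(text), number=number)
    t_split = timeit.timeit(lambda: detab_split(text), number=number)
    return t_regex, t_split


# Synthetic input: a mix of lines with and without tabs.
synthetic = "\tint x;\t/* comment */\nplain line without tabs\n" * 2000

if __name__ == "__main__":
    t_regex, t_split = compare(synthetic)
    print(f"regex: {t_regex:.4f}s  split: {t_split:.4f}s")
```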
Based on this article, and on the previous one, I would prefer to use other methods for substring replacements than regular expressions.