This article looks into the performance of one of the regular expressions used in the python-markdown2 module for converting Markdown syntax to HTML. It was initially written for pure fun, and in celebration of its own pointlessness, but eventually the changes proposed here made their way upstream in pull request 204.
Standardize line endings
This regular expression appears very early in the conversion process:
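A minimal sketch of what that call looks like (the function name is mine; the pattern follows the behavior described next):

```python
import re

def standardize_line_endings(text):
    # Normalize Windows (\r\n) and old-Mac (\r) line endings to Unix (\n).
    # Ordering in the alternation matters: "\r\n" must be tried before the
    # lone "\r", or every Windows ending would become two newlines.
    return re.sub(r"\r\n|\r", "\n", text)
```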
Its use is fairly obvious: it changes all single carriage returns (`\r`) and all carriage returns followed by a newline (`\r\n`) to single newlines (`\n`).
The same effect can be achieved in Python with two `str.replace()` calls, and in fact that would be much faster. The following example uses `%timeit`, which comes with the IPython shell:
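Outside the IPython shell, the standard-library `timeit` module gives a comparable measurement. The sample string below is my own, so absolute numbers will differ from the ones quoted next, but the ratio should be in the same ballpark:

```python
import re
import timeit

text = "first line\rsecond line\r\nthird line\n" * 100

def with_replace():
    # Two chained str.replace() calls: \r\n first, then any remaining \r.
    return text.replace("\r\n", "\n").replace("\r", "\n")

def with_re_sub():
    # One re.sub() call with an alternation pattern.
    return re.sub(r"\r\n|\r", "\n", text)

if __name__ == "__main__":
    # Both approaches must agree before comparing their speed.
    assert with_replace() == with_re_sub()
    for fn in (with_replace, with_re_sub):
        seconds = timeit.timeit(fn, number=10_000)
        print(f"{fn.__name__}: {seconds / 10_000 * 1e9:.0f} ns per call")
```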
So the two runs of `str.replace()` add up to 465 nanoseconds, whereas one run of `re.sub()` takes 2.31 microseconds, that is, 2310 nanoseconds, or about five times slower.
The question is: Does it matter? Well, my copy of *The Hitch Hiker's Guide to the Galaxy*, which includes all five books in the series, is 776 pages long, and each full page has 42 lines (yes, I counted twice, and now I am wondering if it was done on purpose). Following up on the previous calculations, if you had to convert that book from Markdown to HTML (about 32592 lines), it would take you a whole 0.02 seconds to do that with `re.sub()`, or about 0.004 seconds with `str.replace()`. Therefore, the answer to my previous question, *Does it matter?*, is 42.
Now the question becomes: *Does it **really** matter?* Well, if you had to convert all 30 million paperback books that Amazon has for sale (a number found through a search on amazon.com), and assuming each book is as healthy in size as THHGTTG, then it would take you a week to do that with `re.sub()`, but only a day and a half with `str.replace()`. Thus, for the Python developer out there who is pondering converting 30 million books from Markdown to HTML, the answer is: go with `str.replace()`. For the rest of us, it's still 42.
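For the curious, the back-of-the-envelope arithmetic behind the week versus day-and-a-half figures works out like this:

```python
BOOKS = 30_000_000                 # paperbacks on Amazon, per the search above
SECONDS_PER_BOOK_RE = 0.02         # re.sub(), per the estimate above
SECONDS_PER_BOOK_REPLACE = 0.004   # str.replace(), per the estimate above

days_re = BOOKS * SECONDS_PER_BOOK_RE / 86_400        # 86400 seconds per day
days_replace = BOOKS * SECONDS_PER_BOOK_REPLACE / 86_400

print(f"re.sub():      {days_re:.1f} days")       # ~6.9 days, about a week
print(f"str.replace(): {days_replace:.1f} days")  # ~1.4 days
```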