Marios Zindilis

Hexadecimal Primary Key in Django Model

The following snippet of code makes the primary key of a Django model a hexadecimal string of 8 characters instead of an integer.

I wrote this earlier today, and it works fine, but I have a bad feeling about it. I don't know why yet, but something doesn't feel right. Nevertheless, I am putting it here as a note to myself. You should probably not use it. If you do, note that if you exhaust the possible IDs, generate_id() will recurse forever.

import os
from binascii import hexlify
from django.db import models


def generate_id():
    '''Generate an 8-character long hexadecimal ID'''
    # hexlify() returns bytes on Python 3, so decode it to a string
    possible = hexlify(os.urandom(4)).decode('ascii')
    # if this possible ID exists, run again:
    if Person.objects.filter(ID=possible).exists():
        return generate_id()
    return possible


# Create your models here.
class Person(models.Model):
    '''Hold a Person object'''

    ID = models.CharField(
        max_length=8,
        primary_key=True,
        editable=False,
        default=generate_id
    )
    first_name = models.CharField(max_length=240)
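
To put the warning about exhausting the possible IDs into numbers, here is a minimal sketch that needs no Django. The names random_hex_id() and collision_chance() are made up for this illustration and are not part of the model above. It assumes IDs are drawn uniformly from 4 random bytes, as in the snippet.

```python
import os
from binascii import hexlify

ID_BYTES = 4                    # 4 random bytes become 8 hex characters
POSSIBLE_IDS = 256 ** ID_BYTES  # size of the ID space


def random_hex_id():
    '''Return a random 8-character hexadecimal string.'''
    return hexlify(os.urandom(ID_BYTES)).decode('ascii')


def collision_chance(existing_rows):
    '''Chance that one freshly generated ID is already taken.'''
    return existing_rows / POSSIBLE_IDS


print(POSSIBLE_IDS)
print(random_hex_id())
print(collision_chance(1000000))
```

The ID space holds 4294967296 values, so a retry is rare in a small table, but the chance of needing one grows linearly with the number of rows.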

Regular expressions in python-markdown2 (part 2)

This article is a look into the performance of one of the regular expressions used in the python-markdown2 Python module for converting Markdown syntax to HTML. It was initially written for pure fun, and in celebration of its own pointlessness, but eventually the changes proposed here made it upstream in pull request 207.

Replace tabs with spaces

This snippet of code replaces tab characters with a predefined number of spaces. It is a Python port of the Perl code mentioned by Bart Lateur in a post about turning tabs to spaces in Perl.

A detour to Perl

The initial post in that thread was replacing tabs like this:

#!/usr/bin/perl -pi
s/\t/    /;

That code misses one point: if there is any string before a tab, it will simply add four spaces after that string. However, that is not how tabs work. Instead, just enough spaces should be added so that the length of the initial string plus the newly added spaces reaches the next multiple of four. So, the suggested substitution in Perl becomes:

s/(.*?)\t/$1.(' ' x (4-length($1)%4))/ge;

There are two flags used there: g applies the substitution to all matches of the left pattern ((.*?)\t). Without that flag, only the first match would be processed. The second flag, e, forces the replacement ($1.(' ' x (4-length($1)%4))) to be evaluated as an expression. Without this flag, the replacement would be handled as a plain string.
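
The padding arithmetic in that expression can be tried on its own. This is a small Python sketch of the same formula; pad_width() and TAB_LENGTH are names invented here for illustration only.

```python
TAB_LENGTH = 4


def pad_width(prefix):
    '''Number of spaces needed to reach the next multiple of TAB_LENGTH.'''
    return TAB_LENGTH - len(prefix) % TAB_LENGTH


for prefix in ['', 'a', 'ab', 'abc', 'abcd']:
    print(repr(prefix), pad_width(prefix))
```

Note that a prefix whose length is already a multiple of four still gets a full four spaces, which is how a tab behaves at a tab stop.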

Back to Python

Here is the Python code, cleaned up a little:

import re
DEFAULT_TAB_LENGTH = 4

_detab_re = re.compile(r'(.*?)\t', re.M)
def _detab_sub(match):
    g1 = match.group(1)
    return g1 + (' ' * (DEFAULT_TAB_LENGTH - len(g1) % DEFAULT_TAB_LENGTH))

def _detab(text):
    if '\t' not in text:
        return text
    return _detab_re.subn(_detab_sub, text)[0]

Explanation

The _detab_re object is a compiled regular expression object, built with the same pattern as the one used in the Perl example, and with the multiline flag enabled (re.M). You can test this out at RegExr. The subn() method of that object is called in the last line. It takes two parameters: the _detab_sub() function, and the text to be processed. For every match of the pattern, _detab_sub() is called with the match object, and its return value replaces the matched text. Finally, subn() returns a tuple of the substituted text and the number of substitutions made. Only the text is kept, via subn()[0], which seems a bit redundant, since the sub() method returns just the text without requiring the [0] subscript.
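
To see the difference between subn() and sub() side by side, here is a short sketch that reuses the pattern and callback from above. The sample string is made up for the illustration.

```python
import re

DEFAULT_TAB_LENGTH = 4
_detab_re = re.compile(r'(.*?)\t', re.M)


def _detab_sub(match):
    g1 = match.group(1)
    return g1 + (' ' * (DEFAULT_TAB_LENGTH - len(g1) % DEFAULT_TAB_LENGTH))


sample = 'ab\tc\n\tx'

# subn() returns a (new_text, number_of_substitutions) tuple
with_subn = _detab_re.subn(_detab_sub, sample)
# sub() returns only the new text
with_sub = _detab_re.sub(_detab_sub, sample)

print(with_subn)
print(repr(with_sub))
```

Both calls produce the same text, so sub() could replace subn()[0] without changing the result.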

No regular expressions please

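Here is a Python snippet that does the same thing as the previous one, without using regular expressions. One difference to keep in mind: because it uses splitlines() and rejoins the lines with '\n', it drops a trailing newline and turns other line endings into '\n':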

DEFAULT_TAB_LENGTH = 4

def _detab_no_re_sub(l):
    if '\t' not in l:
        return l
    else:
        g1 = l.split('\t', 1)[0]
        output = g1
        output += (' ' * (DEFAULT_TAB_LENGTH - len(g1) % DEFAULT_TAB_LENGTH))
        output += l.split('\t', 1)[1]
        return _detab_no_re_sub(output)

def _detab_no_re(text):
    if '\t' not in text:
        return text
    output = []
    for line in text.splitlines():
        output.append(_detab_no_re_sub(line))
    return '\n'.join(output)

Performance

In the previous article on regular expressions in python-markdown2 I dismissed the difference between a substring substitution with re.sub() versus str.replace() as being negligible, but in this case it seems that it is more substantial. This simple example already indicates some difference:

text = '''
We are
        NOT
in Kansas
        any more!
'''

%timeit _detab(text)
100000 loops, best of 3: 6.14 us per loop

%timeit _detab_no_re(text)
100000 loops, best of 3: 3.82 us per loop

To test a larger example, I took this version of the source code of bzip2 which uses three spaces for indentation, and made some substitutions in it:

# Change some spaces in the beginning of lines with tabs:
sed -i 's/^   /\t/' bzip2.c 
sed -i 's/^\t   /\t\t/' bzip2.c
# Lines with tabs:
grep -c '\t' bzip2.c 
3032
# Total lines:
wc -l bzip2.c 
6998 bzip2.c

Timing test with this file:

text = open('bzip2.c').read()

%timeit _detab(text)
10 loops, best of 3: 90.1 ms per loop

%timeit _detab_no_re(text)
100 loops, best of 3: 11 ms per loop

That is a significant difference: not using regular expressions makes the process about 8 times faster.

Conclusion

Based on this article, and on the previous one, I would prefer to use other methods for substring replacements than regular expressions.

Regular expressions in python-markdown2 (part 1)

This article is a look into the performance of one of the regular expressions used in the python-markdown2 Python module for converting Markdown syntax to HTML. It was initially written for pure fun, and in celebration of its own pointlessness, but eventually the changes proposed here made it upstream in pull request 204.

Standardize line endings

This regular expression appears very early in the conversion process:

text = re.sub("\r\n|\r", "\n", text)

Its use is fairly obvious: it changes all single carriage returns (\r) and all carriage returns followed by a newline (\r\n) to single newlines (\n). The same effect can be achieved in Python with two str.replace() statements and in fact that would be much faster. The following example uses timeit, which comes with the IPython shell:

%timeit 'Apples\r\nOranges\r\nKiwis\rGrapes\r'.replace('\r\n', '\n')
1000000 loops, best of 3: 270 ns per loop

%timeit 'Apples\r\nOranges\r\nKiwis\rGrapes\r'.replace('\r', '\n')
1000000 loops, best of 3: 195 ns per loop

%timeit re.sub(r'\r\n|\r', '\n', 'Apples\r\nOranges\r\nKiwis\rGrapes\r')
100000 loops, best of 3: 2.31 us per loop

So the two runs of str.replace() add up to 465 nanoseconds, whereas one run of re.sub() takes 2.31 microseconds (2310 nanoseconds), which is about five times slower.
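
For completeness, here is a small sketch showing that the two approaches give the same result. The function names are invented for this example. The order of the two replace() calls matters: \r\n has to be handled first, otherwise each \r\n would turn into two newlines.

```python
import re


def normalize_with_re(text):
    return re.sub("\r\n|\r", "\n", text)


def normalize_with_replace(text):
    # Handle \r\n first, then any remaining lone \r
    return text.replace('\r\n', '\n').replace('\r', '\n')


sample = 'Apples\r\nOranges\r\nKiwis\rGrapes\r'
print(repr(normalize_with_re(sample)))
print(repr(normalize_with_replace(sample)))
```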

The question is: Does it matter? Well, my copy of The Hitch Hiker's Guide to the Galaxy that includes all five books in the series, is 776 pages long, and each full page has 42 lines (yes, I counted twice, and now I am wondering if it was done on purpose). Following up on the previous calculations, if you had to convert that book from Markdown to HTML, (about 32592 lines), it would take you a whole 0.02 seconds to do that with re.sub(), or about 0.004 seconds to do that with str.replace(). Therefore, the answer to my previous question: Does it matter? is 42.

Now the question becomes: Does it really matter? Well, if you had to convert all 30 million paperback books that Amazon has for sale (number found through a search on amazon.com), and assuming each book is as healthy in size as THHGTTG, then it would take you a week to do that with re.sub(), but only a day and a half to do it with str.replace(). Thus, for the Python developer out there who is pondering on converting 30 million books from Markdown to HTML, the answer is: Go with str.replace(). For the rest of us it's still 42.
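
The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This sketch rests on two assumptions that are not spelled out in the text: that the timed sample string counts as 4 lines, and that the cost grows linearly with the number of lines.

```python
# Figures quoted in the article
PAGES = 776
LINES_PER_PAGE = 42
BOOKS = 30 * 10**6
RE_SUB_NS = 2310            # one re.sub() run on the sample string
STR_REPLACE_NS = 270 + 195  # two str.replace() runs on the sample string

# Assumption: the timed sample string counts as 4 lines
LINES_PER_SAMPLE = 4
SECONDS_PER_DAY = 86400

book_lines = PAGES * LINES_PER_PAGE
runs_per_book = book_lines / LINES_PER_SAMPLE

# Seconds needed for one book
re_sub_book_s = runs_per_book * RE_SUB_NS / 1e9
replace_book_s = runs_per_book * STR_REPLACE_NS / 1e9

# Days needed for all the books
re_sub_days = BOOKS * re_sub_book_s / SECONDS_PER_DAY
replace_days = BOOKS * replace_book_s / SECONDS_PER_DAY

print(book_lines)
print(re_sub_book_s, replace_book_s)
print(re_sub_days, replace_days)
```

Under those assumptions the totals come to about 6.5 days for re.sub() and about 1.3 days for str.replace(), in line with the rounded figures in the text.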

Notes and links from PyCon 2015

These are a couple of random links from things mentioned in talks during PyCon 2015, which took place in Dublin in October 2015. There were two tracks of talks and two tracks of workshops. One of the non-workshop tracks was almost dedicated to data processing with Python, and the other had various subjects. I followed the latter track.

PyCon 2015 was organized by Python Ireland, and took place on the 24th and 25th of October 2015.

What I Learned While Migrating MySQL On-Premises To Amazon RDS

I miss having the free time to attend Percona Webinars. They are really good.

This is a recording from a webinar presented by a Technical Account Manager at Percona, regarding experiences gained from the migration of a sizeable MySQL installation from onsite to AWS RDS. It's packed with valuable technical information, as are Percona webinars, typically.

Webinar recording from Imperva

This is a recording of a webinar that I watched this week, titled "Balancing Ecommerce Security with Performance". The talks refer to the companies Imperva, American Eagle and Incapsula, and there is some product plugging going on, but there is also some introductory web application security information.

The most useful bit in my opinion was pointing out demo.testfire.net, a test website that is open to SQL Injection for demonstration purposes, and thus can be useful as security awareness training material.

How to run Firefox 3.6 on Ubuntu 15.04

These instructions will allow you to run the ancient 3.6 version of Firefox on a recent Ubuntu installation, namely 15.04, but it could apply to versions of Debian, Ubuntu and Linux Mint released close to 15.04.

The reasons why you might want to run such an old version of Firefox are irrelevant to this post. For me, this solves a problem of very limited scope: having to run some browser tests, written in JavaScript as bookmarklets, that only last executed correctly in Firefox 3.6. Those tests access user information that is not available to the JavaScript engine in versions of Firefox newer than 3.6, since Mozilla tightened its security and no longer exposes the user's visited history.

Now, I suppose I could migrate my tests out of the browser, read the browsing history from some SQLite file in the user's Firefox profile, and simulate the browser with something like Selenium, but I just cannot be bothered.

The guide

  1. Download firefox-3.6.tar.bz2 from ftp.mozilla.org. Decompressing this archive will give you a directory named firefox.

  2. Move the firefox directory into /opt/. The target of these instructions is to get /opt/firefox/firefox to execute without errors.

  3. Trying to run /opt/firefox/firefox now, results in 'library missing' errors for libgtk-2.0-0 and libdbus-glib-1-2. Both these libraries exist in an Ubuntu 15.04 installation, but they are 64bit libraries whereas Firefox 3.6 was only ever released as a 32bit application.

    Both problems are solved by installing the 32bit versions of those libraries:

    sudo apt-get install libgtk-2.0-0:i386
    sudo apt-get install libdbus-glib-1-2:i386
    
  4. Run /opt/firefox/firefox now and you should be able to enjoy the retro experience of times gone by, with no Flash or any other plugin for that matter. A note of caution: running such an old version of a browser is very unsafe. Don't do anything other than testing with it, use a clean profile (run with -P option and create a test profile), and if possible, sandbox the application so that it can't touch anything on your main system.

A note about library paths: Firefox 3.6 looks for libraries in its installation directory (in this case /opt/firefox), in addition to directories in the library path. Therefore, if you hit an issue where the browser can't locate libraries that exist on the system, it is easier and probably safer to create symbolic links to those libraries in /opt/firefox rather than altering your library path just to accommodate the needs of this old application.

Enjoy!

Stale NFS Causes BackupPC fileListReceive Failure

Recently, one of my BackupPC clients running CentOS failed to back up, with the contents of the host log being:

    2015-06-10 01:40:10 incr backup started back to 2015-05-16 08:56:42 (backup #600) for directory /
    2015-06-10 21:40:18 Aborting backup up after signal ALRM
    2015-06-10 21:40:18 Got fatal error during xfer (fileListReceive failed)

...and the last bad XferLOG containing:

    fileListReceive() failed

This happened a couple of times in a row, and the interval between the start time of the backup and the failure was consistently 20 hours. While checking, I noticed that an rsync process started on the client by BackupPC had been running for about a week. I did an strace -p <PID> on the process ID of rsync and noticed that it was trying to stat an old NFS export, mounted from a server that no longer exists.

Although there are other ways to fix this, it was OK for this host to be rebooted at the time, and that solved the problem.
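
As a follow-up idea that was not part of the original troubleshooting, a client's mount table can be checked for NFS entries that point to servers which are gone. The sketch below only parses text in the /proc/mounts format. The function name nfs_mounts() and the sample data are made up, and on a real client you would read /proc/mounts instead.

```python
def nfs_mounts(mounts_text):
    '''Return (device, mount_point) pairs for NFS entries.

    mounts_text is expected in the /proc/mounts format:
    device mount_point fstype options dump pass
    '''
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in ('nfs', 'nfs4'):
            found.append((fields[0], fields[1]))
    return found


# Made-up sample data for illustration only
SAMPLE = '''\
/dev/sda1 / ext4 rw,relatime 0 0
oldserver:/export/data /mnt/data nfs rw,vers=3 0 0
proc /proc proc rw 0 0
'''

for device, mount_point in nfs_mounts(SAMPLE):
    print(device, '->', mount_point)
```

The sketch deliberately does not stat the mount points, since touching a stale NFS mount can hang the calling process in the same way rsync did.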

Features disabled when VMware Evaluation Expires

As a note to myself and a future reference, here is a list of features of VMware ESXi 6 that get disabled when the 60-day evaluation period expires:

List of features disabled when VMware Evaluation Expires

I registered as a Beta tester and installed this evaluation version of ESXi 6 before it went into general availability, so there might be some differences compared to GA versions, but it's not likely.

Security bootstrap on a Raspberry Pi

These are some notes on improving the security of a Raspberry Pi running a fresh installation of Raspbian, before exposing it to the world, either by giving it a public IP, or with some NAT/PAT configuration.

Read: Raspberry Pi Security Bootstrap