Bill Mill web site logo

Reverse File Iterator and Premature Optimization

This afternoon, I had a need for a reverse file iterator in a script I was writing to process a quite large log file. After some unsuccessful googling followed by some quick hacking, I was left with a frustratingly close, but nonfunctional, iterator. Instead of fixing it while I was at work, I hacked up a solution calling "tail" and left it for a while.

When I came back to finish the job this evening, I figured that I ought to go look at the source of tail for some inspiration. Surely the unix hackers had figured this problem out long ago? In reverse.c, I found the inspiration I needed:

    for (; pos >= start; pos--) {
        /* A seek per char isn't a problem with a smart stdio */
        if (fseeko(fp, pos, SEEK_SET) != 0) {
        //snip
            if ((ch = getc(fp)) == '\\n')

I had been reading the file in chunks, splitting the chunk into lines, handling a cache of the lines, and compensating for unfinished lines, all because I had buried deep down in my lizard brain the idea that an fseek per character was "slow". In pure python, for chrissake!

Properly reminded of the fact that premature optimization can sneak up anywhere, I ended up with this code:

import os

class reversefile(object):
    """Iterate backwards through a file. f should be an open file handle"""
    def __init__(self, f):
        self._f = f
        self.end = os.stat(f.name).st_size

    def __iter__(self): return self

    def next(self):
        if self.end == 0:
            raise StopIteration

        pos = self.end-2
        while pos >= 0:
            self._f.seek(pos)
            if self._f.read(1) == '\\n':
                end = self.end
                self.end = pos
                return self._f.read(end - pos - 1)
            pos -= 1

        end = self.end
        self.end = 0
        self._f.seek(0)
        return self._f.read(end).strip("\\n")

You can see the whole thing, with source and some very brief tests I hacked up, here.