<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	<link rel="self" href="/Atom/" />
	<id>http://billmill.org/</id>
	<title>My Name Rhymes</title>
	<subtitle>Bill Mill blogs irregularly</subtitle>
	<updated>2007-10-03T00:30:00Z</updated>
	<author>
		<name>Bill Mill</name>
		<email>bill.mill@gmail.com</email>
		<uri>http://billmill.org/</uri>
	</author>
	<link href="http://billmill.org/" />
	<entry>
		<title>Reverse File Iterator and Premature Optimization</title>
		<link href="http://billmill.org/reversefile.html" />	
		<id>http://billmill.org/reversefile.html</id>
		<updated>2007-10-03T00:30:00Z</updated>
		<summary type="html">This afternoon, I had a need for a reverse file iterator in a script I was 
writing to process a quite large log file. After some unsuccessful googling 
followed by some quick hacking, I was left with a frustratingly close, but 
nonfunctional, iterator. Instead of fixing it while I was at work, I hacked up 
a solution calling "tail" and left it for a while.&lt;p&gt;
When I came back to finish the job this evening, I figured that I ought to go 
look at the &lt;a 
href="http://www.openbsd.org/cgi-bin/cvsweb/src/usr.bin/tail"&gt;source&lt;/a&gt; of tail 
for some inspiration. Surely the unix hackers had figured this problem out long 
ago? In &lt;a 
href="http://www.openbsd.org/cgi-bin/cvsweb/src/usr.bin/tail/reverse.c?rev=1.18&amp;content-type=text/x-cvsweb-markup"&gt;reverse.c&lt;/a&gt;, 
I found the inspiration I needed:&lt;p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;    &lt;span style="font-weight: bold"&gt;for&lt;/span&gt; (; pos &lt;span style="font-weight: bold"&gt;&amp;gt;=&lt;/span&gt; start; pos&lt;span style="font-weight: bold"&gt;--&lt;/span&gt;) {
        &lt;span style="color: #999988; font-style: italic"&gt;/* A seek per char isn&amp;#39;t a problem with a smart stdio */&lt;/span&gt;
        &lt;span style="font-weight: bold"&gt;if&lt;/span&gt; (fseeko(fp, pos, SEEK_SET) &lt;span style="font-weight: bold"&gt;!=&lt;/span&gt; &lt;span style="color: #009999"&gt;0&lt;/span&gt;) {
        &lt;span style="color: #999988; font-style: italic"&gt;//snip&lt;/span&gt;
            &lt;span style="font-weight: bold"&gt;if&lt;/span&gt; ((ch &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; getc(fp)) &lt;span style="font-weight: bold"&gt;==&lt;/span&gt; &lt;span style="color: #a61717; background-color: #e3d2d2"&gt;&amp;#39;\&lt;/span&gt;n&lt;span style="color: #a61717; background-color: #e3d2d2"&gt;&amp;#39;&lt;/span&gt;)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;
I had been reading the file in chunks, splitting the chunk into lines, handling 
a cache of the lines, and compensating for unfinished lines, all because I had 
buried deep down in my lizard brain the idea that an fseek per character was 
"slow". In pure python, for chrissake!&lt;p&gt;
Properly reminded of the fact that premature optimization can sneak up 
anywhere, I ended up with this code:&lt;p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span style="font-weight: bold"&gt;import&lt;/span&gt; &lt;span style="color: #555555"&gt;os&lt;/span&gt;

&lt;span style="font-weight: bold"&gt;class&lt;/span&gt; &lt;span style="color: #445588; font-weight: bold"&gt;reversefile&lt;/span&gt;(&lt;span style="color: #999999"&gt;object&lt;/span&gt;):
    &lt;span style="color: #bb8844"&gt;&amp;quot;&amp;quot;&amp;quot;Iterate backwards through a file. f should be an open file handle&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span style="font-weight: bold"&gt;def&lt;/span&gt; &lt;span style="color: #990000; font-weight: bold"&gt;__init__&lt;/span&gt;(&lt;span style="color: #999999"&gt;self&lt;/span&gt;, f):
        &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;_f &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; f
        &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; os&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;stat(f&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;name)&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;st_size

    &lt;span style="font-weight: bold"&gt;def&lt;/span&gt; &lt;span style="color: #990000; font-weight: bold"&gt;__iter__&lt;/span&gt;(&lt;span style="color: #999999"&gt;self&lt;/span&gt;): &lt;span style="font-weight: bold"&gt;return&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;

    &lt;span style="font-weight: bold"&gt;def&lt;/span&gt; &lt;span style="color: #990000; font-weight: bold"&gt;next&lt;/span&gt;(&lt;span style="color: #999999"&gt;self&lt;/span&gt;):
        &lt;span style="font-weight: bold"&gt;if&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end &lt;span style="font-weight: bold"&gt;==&lt;/span&gt; &lt;span style="color: #009999"&gt;0&lt;/span&gt;:
            &lt;span style="font-weight: bold"&gt;raise&lt;/span&gt; &lt;span style="color: #990000; font-weight: bold"&gt;StopIteration&lt;/span&gt;

        pos &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end&lt;span style="font-weight: bold"&gt;-&lt;/span&gt;&lt;span style="color: #009999"&gt;2&lt;/span&gt;
        &lt;span style="font-weight: bold"&gt;while&lt;/span&gt; pos &lt;span style="font-weight: bold"&gt;&amp;gt;=&lt;/span&gt; &lt;span style="color: #009999"&gt;0&lt;/span&gt;:
            &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;_f&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;seek(pos)
            &lt;span style="font-weight: bold"&gt;if&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;_f&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;read(&lt;span style="color: #009999"&gt;1&lt;/span&gt;) &lt;span style="font-weight: bold"&gt;==&lt;/span&gt; &lt;span style="color: #bb8844"&gt;&amp;#39;\n&amp;#39;&lt;/span&gt;:
                end &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end
                &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; pos
                &lt;span style="font-weight: bold"&gt;return&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;_f&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;read(end &lt;span style="font-weight: bold"&gt;-&lt;/span&gt; pos &lt;span style="font-weight: bold"&gt;-&lt;/span&gt; &lt;span style="color: #009999"&gt;1&lt;/span&gt;)
            pos &lt;span style="font-weight: bold"&gt;-=&lt;/span&gt; &lt;span style="color: #009999"&gt;1&lt;/span&gt;

        end &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end
        &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; &lt;span style="color: #009999"&gt;0&lt;/span&gt;
        &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;_f&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;seek(&lt;span style="color: #009999"&gt;0&lt;/span&gt;)
        &lt;span style="font-weight: bold"&gt;return&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;_f&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;read(end)&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;strip(&lt;span style="color: #bb8844"&gt;&amp;quot;\n&amp;quot;&lt;/span&gt;)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;
You can see the whole thing, with source and some very brief tests I hacked up, 
&lt;a 
href="http://billmill.org:9561/personal_code/browser/python/reversefile/reversefile.py?rev=57"&gt;here&lt;/a&gt;.

&lt;h2&gt;update:&lt;/h2&gt;
Reddit user Brian &lt;a href="http://reddit.com/r/Python/info/6hj75/comments/c03vms4"&gt;posts a solution&lt;/a&gt; that goes much faster on his platform (I've
done no performance testing of my code, it wasn't important for me) and handles
platform linebreaks better. In other words, it's better for all purposes. Thanks
Brian, I wish I'd found that when I was googling!
</summary>
		<content type="html">This afternoon, I had a need for a reverse file iterator in a script I was 
writing to process a quite large log file. After some unsuccessful googling 
followed by some quick hacking, I was left with a frustratingly close, but 
nonfunctional, iterator. Instead of fixing it while I was at work, I hacked up 
a solution calling "tail" and left it for a while.&lt;p&gt;
When I came back to finish the job this evening, I figured that I ought to go 
look at the &lt;a 
href="http://www.openbsd.org/cgi-bin/cvsweb/src/usr.bin/tail"&gt;source&lt;/a&gt; of tail 
for some inspiration. Surely the unix hackers had figured this problem out long 
ago? In &lt;a 
href="http://www.openbsd.org/cgi-bin/cvsweb/src/usr.bin/tail/reverse.c?rev=1.18&amp;content-type=text/x-cvsweb-markup"&gt;reverse.c&lt;/a&gt;, 
I found the inspiration I needed:&lt;p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;    &lt;span style="font-weight: bold"&gt;for&lt;/span&gt; (; pos &lt;span style="font-weight: bold"&gt;&amp;gt;=&lt;/span&gt; start; pos&lt;span style="font-weight: bold"&gt;--&lt;/span&gt;) {
        &lt;span style="color: #999988; font-style: italic"&gt;/* A seek per char isn&amp;#39;t a problem with a smart stdio */&lt;/span&gt;
        &lt;span style="font-weight: bold"&gt;if&lt;/span&gt; (fseeko(fp, pos, SEEK_SET) &lt;span style="font-weight: bold"&gt;!=&lt;/span&gt; &lt;span style="color: #009999"&gt;0&lt;/span&gt;) {
        &lt;span style="color: #999988; font-style: italic"&gt;//snip&lt;/span&gt;
            &lt;span style="font-weight: bold"&gt;if&lt;/span&gt; ((ch &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; getc(fp)) &lt;span style="font-weight: bold"&gt;==&lt;/span&gt; &lt;span style="color: #a61717; background-color: #e3d2d2"&gt;&amp;#39;\&lt;/span&gt;n&lt;span style="color: #a61717; background-color: #e3d2d2"&gt;&amp;#39;&lt;/span&gt;)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;
I had been reading the file in chunks, splitting the chunk into lines, handling 
a cache of the lines, and compensating for unfinished lines, all because I had 
buried deep down in my lizard brain the idea that an fseek per character was 
"slow". In pure python, for chrissake!&lt;p&gt;
Properly reminded of the fact that premature optimization can sneak up 
anywhere, I ended up with this code:&lt;p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span style="font-weight: bold"&gt;import&lt;/span&gt; &lt;span style="color: #555555"&gt;os&lt;/span&gt;

&lt;span style="font-weight: bold"&gt;class&lt;/span&gt; &lt;span style="color: #445588; font-weight: bold"&gt;reversefile&lt;/span&gt;(&lt;span style="color: #999999"&gt;object&lt;/span&gt;):
    &lt;span style="color: #bb8844"&gt;&amp;quot;&amp;quot;&amp;quot;Iterate backwards through a file. f should be an open file handle&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span style="font-weight: bold"&gt;def&lt;/span&gt; &lt;span style="color: #990000; font-weight: bold"&gt;__init__&lt;/span&gt;(&lt;span style="color: #999999"&gt;self&lt;/span&gt;, f):
        &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;_f &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; f
        &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; os&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;stat(f&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;name)&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;st_size

    &lt;span style="font-weight: bold"&gt;def&lt;/span&gt; &lt;span style="color: #990000; font-weight: bold"&gt;__iter__&lt;/span&gt;(&lt;span style="color: #999999"&gt;self&lt;/span&gt;): &lt;span style="font-weight: bold"&gt;return&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;

    &lt;span style="font-weight: bold"&gt;def&lt;/span&gt; &lt;span style="color: #990000; font-weight: bold"&gt;next&lt;/span&gt;(&lt;span style="color: #999999"&gt;self&lt;/span&gt;):
        &lt;span style="font-weight: bold"&gt;if&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end &lt;span style="font-weight: bold"&gt;==&lt;/span&gt; &lt;span style="color: #009999"&gt;0&lt;/span&gt;:
            &lt;span style="font-weight: bold"&gt;raise&lt;/span&gt; &lt;span style="color: #990000; font-weight: bold"&gt;StopIteration&lt;/span&gt;

        pos &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end&lt;span style="font-weight: bold"&gt;-&lt;/span&gt;&lt;span style="color: #009999"&gt;2&lt;/span&gt;
        &lt;span style="font-weight: bold"&gt;while&lt;/span&gt; pos &lt;span style="font-weight: bold"&gt;&amp;gt;=&lt;/span&gt; &lt;span style="color: #009999"&gt;0&lt;/span&gt;:
            &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;_f&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;seek(pos)
            &lt;span style="font-weight: bold"&gt;if&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;_f&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;read(&lt;span style="color: #009999"&gt;1&lt;/span&gt;) &lt;span style="font-weight: bold"&gt;==&lt;/span&gt; &lt;span style="color: #bb8844"&gt;&amp;#39;\n&amp;#39;&lt;/span&gt;:
                end &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end
                &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; pos
                &lt;span style="font-weight: bold"&gt;return&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;_f&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;read(end &lt;span style="font-weight: bold"&gt;-&lt;/span&gt; pos &lt;span style="font-weight: bold"&gt;-&lt;/span&gt; &lt;span style="color: #009999"&gt;1&lt;/span&gt;)
            pos &lt;span style="font-weight: bold"&gt;-=&lt;/span&gt; &lt;span style="color: #009999"&gt;1&lt;/span&gt;

        end &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end
        &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;end &lt;span style="font-weight: bold"&gt;=&lt;/span&gt; &lt;span style="color: #009999"&gt;0&lt;/span&gt;
        &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;_f&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;seek(&lt;span style="color: #009999"&gt;0&lt;/span&gt;)
        &lt;span style="font-weight: bold"&gt;return&lt;/span&gt; &lt;span style="color: #999999"&gt;self&lt;/span&gt;&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;_f&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;read(end)&lt;span style="font-weight: bold"&gt;.&lt;/span&gt;strip(&lt;span style="color: #bb8844"&gt;&amp;quot;\n&amp;quot;&lt;/span&gt;)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;
You can see the whole thing, with source and some very brief tests I hacked up, 
&lt;a 
href="http://billmill.org:9561/personal_code/browser/python/reversefile/reversefile.py?rev=57"&gt;here&lt;/a&gt;.

&lt;h2&gt;update:&lt;/h2&gt;
Reddit user Brian &lt;a href="http://reddit.com/r/Python/info/6hj75/comments/c03vms4"&gt;posts a solution&lt;/a&gt; that goes much faster on his platform (I've
done no performance testing of my code, it wasn't important for me) and handles
platform linebreaks better. In other words, it's better for all purposes. Thanks
Brian, I wish I'd found that when I was googling!
</content>
	</entry>
</feed>
