The last time I blogged about a new course I’m fol­low­ing at my uni­ver­sity. This course, held by Pasquale Lops and Gio­vanni Semer­aro, is very inter­est­ing at the point that I’ll be devel­op­ing a custom infor­ma­tion retrieval engine as part of my intern­ship project. I can’t tell much more at this point since the intern­ship haven’t started yet and I’m not sure I can release more details about this project (we’re still in the process of decid­ing if and how the whole thing will be released to the world).

In the mean­time, I’ve been doing sev­eral exper­i­ments on this topic mostly about the memory usage and the per­for­mances of such system on lim­ited hard­ware. This prac­ti­cally means imple­ment­ing the algo­rithms you’ll be using and mea­sur­ing the com­pu­ta­tional time they require.

One of the most common thing that our infor­ma­tion retrieval engine have to do is to take a doc­u­ment and com­press it, but con­sid­er­ing that:

  • this is a fundamental piece of this IR engine
  • it will be used very often
  • it’s not rare to process very large documents

You’ll get that this oper­a­tion should be as effi­cient as possible.

I chosen to go down with zlib as my com­pres­sion library for mainly two reasons:

  • it’s already included in Python (this is not really a strong point since better compression algorithms are included in Python too)
  • offers the best compromise in speed/compression ratio

Given the above con­sid­er­a­tions, let start coding our com­pres­sion system.

We will use as our doc­u­ment exam­ple the PDF spec­i­fi­ca­tions, avail­able at the Adobe Devel­op­ment Center (this is the file) that are 8.6Mb large.

So let start doing the things the basic way:

#!/usr/bin/env python
# compress1.py
import zlib

def compress(input_path, output_path, compression_level=6):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    cobj = zlib.compressobj(compression_level)
    out = ''
    for line in input_fd:
        out += cobj.compress(line)
    out += cobj.flush()

    output_fd.write(out)

    input_fd.close()
    output_fd.close()

def decompress(input_path, output_path):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    dobj = zlib.decompressobj()
    out = ''
    for line in input_fd:
        out += dobj.decompress(line)
    out += dobj.flush()

    output_fd.write(out)

    input_fd.close()
    output_fd.close()

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]

    options = { 'compress': compress,
                'decompress': decompress,
    }

    input_path, output_path = args[1], args[2]

    try:
        options[args[0]](input_path, output_path)
    except (KeyError, IndexError):
        print("Invalid arguments")

By run­ning this pro­gram and per­form­ing a very basic pro­fil­ing we get some indications:

kratorius@becks:~/compress$ time ./compress1.py compress PDF32000_2008.pdf compr.zlib
real    0m2.517s
user    0m1.496s
sys     0m0.060s

kratorius@becks:~/compress$ time ./compress1.py decompress compr.zlib decompr.pdf
real    0m0.640s
user    0m0.537s
sys     0m0.085s

We need 2.5 secs in order to com­press a file smaller than 10Mb. This is quite unac­cept­able, since it means that we’re pro­cess­ing about 3.5Mb per second; so we need to under­stand what we’re doing wrong. I can spot at least two big errors in this script:

  1. we’re reading the input file line by line that isn’t very efficient since in this way we’re accessing the disk multiple times (not counting that we are also processing the compression stuff line by line, that it’s not efficient and hasn’t so much sense in a binary file like our PDF)
  2. we keep our compressed object in memory until we finish the compression, and this means that if the script would run faster, we’d still have a very high memory usage that is not optimal

So here it is the new ver­sion of our com­pres­sion script that address the issues above:

#!/usr/bin/env python
# compress2.py
import zlib

def compress(input_path, output_path, compression_level=6):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    out = zlib.compress(input_fd.read(), compression_level)
    output_fd.write(out)

    input_fd.close()
    output_fd.close()

def decompress(input_path, output_path):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    out = zlib.decompress(input_fd.read())
    output_fd.write(out)

    input_fd.close()
    output_fd.close()

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]

    options = { 'compress': compress,
                'decompress': decompress,
    }

    input_path, output_path = args[1], args[2]

    try:
        options[args[0]](input_path, output_path)
    except (KeyError, IndexError):
        print("Invalid arguments")

Let per­form our basic pro­fil­ing again:

kratorius@becks:~/compress$ time ./compress2.py compress PDF32000_2008.pdf compr.zlib
real    0m1.668s
user    0m1.337s
sys     0m0.079s

kratorius@becks:~/compress$ time ./compress2.py decompress compr.zlib decompr.pdf
real    0m0.561s
user    0m0.394s
sys     0m0.086s

We are now read­ing the whole input file in memory (min­i­miz­ing the disk accesses), com­press­ing every­thing in memory and writ­ing the com­pressed file to the output in a single shot. We got a high speedup in this way but we have just increased our memory usage since now we’re keep­ing in memory both the input and the com­pressed file. This could be opti­mal if we’re pro­cess­ing small files, but since we need to have a gen­er­al­ized approach, this solu­tion is not that good.

We can do better. And we’ll do better in the third try:

#!/usr/bin/env python
# compress2.py
import zlib

READ_BYTES = 2097152 # 2Mb

def compress(input_path, output_path, compression_level=6):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    cobj = zlib.compressobj(compression_level)
    done = False
    while not done:
        rd = input_fd.read(READ_BYTES)
        done = rd == ''

        output_fd.write(cobj.compress(rd))

    output_fd.write(cobj.flush())

    input_fd.close()
    output_fd.close()

def decompress(input_path, output_path):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    dobj = zlib.decompressobj()
    done = False
    while not done:
        rd = input_fd.read(READ_BYTES)
        done = rd == ''

        output_fd.write(dobj.decompress(rd))

    output_fd.write(dobj.flush())

    input_fd.close()
    output_fd.close()

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]

    options = { 'compress': compress,
                'decompress': decompress,
    }

    input_path, output_path = args[1], args[2]

    try:
        options[args[0]](input_path, output_path)
    except (KeyError, IndexError):
        print("Invalid arguments")

And we finally reached our goal:

kratorius@becks:~/compress$ time ./compress3.py compress PDF32000_2008.pdf compr.zlib
real    0m1.325s
user    0m1.226s
sys     0m0.070s

kratorius@becks:~/compress$ time ./compress3.py decompress compr.zlib decompr.pdf
real    0m0.534s
user    0m0.404s
sys     0m0.119s

This last try works because we’re still min­i­miz­ing the disk accesses for small files (we’re read­ing 2Mb chunks per time) and this time we’re reduc­ing the memory usage since:

  • we read a 2Mb block from our input file
  • we compress the read input
  • we write it directly to our output file

I’m sure there’s still room for improve­ment but at this point we can be quite happy of our achieve­ment. You can find the final script that per­forms error check­ing and file lock­ing here (file lock­ing works only on UNIX sys­tems though, on Win­dows you should just com­ment the fcntl lines out). As always, sug­ges­tions are welcome.