Optimize your programs
December 2nd, 2008
The last time I blogged about a new course I’m following at my university. This course, held by Pasquale Lops and Giovanni Semeraro, is very interesting at the point that I’ll be developing a custom information retrieval engine as part of my internship project. I can’t tell much more at this point since the internship haven’t started yet and I’m not sure I can release more details about this project (we’re still in the process of deciding if and how the whole thing will be released to the world).
In the meantime, I’ve been doing several experiments on this topic mostly about the memory usage and the performances of such system on limited hardware. This practically means implementing the algorithms you’ll be using and measuring the computational time they require.
One of the most common thing that our information retrieval engine have to do is to take a document and compress it, but considering that:
- this is a fundamental piece of this IR engine
- it will be used very often
- it’s not rare to process very large documents
You’ll get that this operation should be as efficient as possible.
I chosen to go down with zlib as my compression library for mainly two reasons:
- it’s already included in Python (this is not really a strong point since better compression algorithms are included in Python too)
- offers the best compromise in speed/compression ratio
Given the above considerations, let start coding our compression system.
We will use as our document example the PDF specifications, available at the Adobe Development Center (this is the file) that are 8.6Mb large.
So let start doing the things the basic way:
#!/usr/bin/env python
# compress1.py
import zlib
def compress(input_path, output_path, compression_level=6):
input_fd = open(input_path, 'rb')
output_fd = open(output_path, 'wb')
cobj = zlib.compressobj(compression_level)
out = ''
for line in input_fd:
out += cobj.compress(line)
out += cobj.flush()
output_fd.write(out)
input_fd.close()
output_fd.close()
def decompress(input_path, output_path):
input_fd = open(input_path, 'rb')
output_fd = open(output_path, 'wb')
dobj = zlib.decompressobj()
out = ''
for line in input_fd:
out += dobj.decompress(line)
out += dobj.flush()
output_fd.write(out)
input_fd.close()
output_fd.close()
if __name__ == '__main__':
import sys
args = sys.argv[1:]
options = { 'compress': compress,
'decompress': decompress,
}
input_path, output_path = args[1], args[2]
try:
options[args[0]](input_path, output_path)
except (KeyError, IndexError):
print("Invalid arguments")
By running this program and performing a very basic profiling we get some indications:
kratorius@becks:~/compress$ time ./compress1.py compress PDF32000_2008.pdf compr.zlib real 0m2.517s user 0m1.496s sys 0m0.060s kratorius@becks:~/compress$ time ./compress1.py decompress compr.zlib decompr.pdf real 0m0.640s user 0m0.537s sys 0m0.085s
We need 2.5 secs in order to compress a file smaller than 10Mb. This is quite unacceptable, since it means that we’re processing about 3.5Mb per second; so we need to understand what we’re doing wrong. I can spot at least two big errors in this script:
- we’re reading the input file line by line that isn’t very efficient since in this way we’re accessing the disk multiple times (not counting that we are also processing the compression stuff line by line, that it’s not efficient and hasn’t so much sense in a binary file like our PDF)
- we keep our compressed object in memory until we finish the compression, and this means that if the script would run faster, we’d still have a very high memory usage that is not optimal
So here it is the new version of our compression script that address the issues above:
#!/usr/bin/env python
# compress2.py
import zlib
def compress(input_path, output_path, compression_level=6):
input_fd = open(input_path, 'rb')
output_fd = open(output_path, 'wb')
out = zlib.compress(input_fd.read(), compression_level)
output_fd.write(out)
input_fd.close()
output_fd.close()
def decompress(input_path, output_path):
input_fd = open(input_path, 'rb')
output_fd = open(output_path, 'wb')
out = zlib.decompress(input_fd.read())
output_fd.write(out)
input_fd.close()
output_fd.close()
if __name__ == '__main__':
import sys
args = sys.argv[1:]
options = { 'compress': compress,
'decompress': decompress,
}
input_path, output_path = args[1], args[2]
try:
options[args[0]](input_path, output_path)
except (KeyError, IndexError):
print("Invalid arguments")
Let perform our basic profiling again:
kratorius@becks:~/compress$ time ./compress2.py compress PDF32000_2008.pdf compr.zlib real 0m1.668s user 0m1.337s sys 0m0.079s kratorius@becks:~/compress$ time ./compress2.py decompress compr.zlib decompr.pdf real 0m0.561s user 0m0.394s sys 0m0.086s
We are now reading the whole input file in memory (minimizing the disk accesses), compressing everything in memory and writing the compressed file to the output in a single shot. We got a high speedup in this way but we have just increased our memory usage since now we’re keeping in memory both the input and the compressed file. This could be optimal if we’re processing small files, but since we need to have a generalized approach, this solution is not that good.
We can do better. And we’ll do better in the third try:
#!/usr/bin/env python
# compress2.py
import zlib
READ_BYTES = 2097152 # 2Mb
def compress(input_path, output_path, compression_level=6):
input_fd = open(input_path, 'rb')
output_fd = open(output_path, 'wb')
cobj = zlib.compressobj(compression_level)
done = False
while not done:
rd = input_fd.read(READ_BYTES)
done = rd == ''
output_fd.write(cobj.compress(rd))
output_fd.write(cobj.flush())
input_fd.close()
output_fd.close()
def decompress(input_path, output_path):
input_fd = open(input_path, 'rb')
output_fd = open(output_path, 'wb')
dobj = zlib.decompressobj()
done = False
while not done:
rd = input_fd.read(READ_BYTES)
done = rd == ''
output_fd.write(dobj.decompress(rd))
output_fd.write(dobj.flush())
input_fd.close()
output_fd.close()
if __name__ == '__main__':
import sys
args = sys.argv[1:]
options = { 'compress': compress,
'decompress': decompress,
}
input_path, output_path = args[1], args[2]
try:
options[args[0]](input_path, output_path)
except (KeyError, IndexError):
print("Invalid arguments")
And we finally reached our goal:
kratorius@becks:~/compress$ time ./compress3.py compress PDF32000_2008.pdf compr.zlib real 0m1.325s user 0m1.226s sys 0m0.070s kratorius@becks:~/compress$ time ./compress3.py decompress compr.zlib decompr.pdf real 0m0.534s user 0m0.404s sys 0m0.119s
This last try works because we’re still minimizing the disk accesses for small files (we’re reading 2Mb chunks per time) and this time we’re reducing the memory usage since:
- we read a 2Mb block from our input file
- we compress the read input
- we write it directly to our output file
I’m sure there’s still room for improvement but at this point we can be quite happy of our achievement. You can find the final script that performs error checking and file locking here (file locking works only on UNIX systems though, on Windows you should just comment the fcntl lines out). As always, suggestions are welcome.