June 19, 2009

Faulty character decoding as the last line of anti-spam defense

I receive spam every day. Filtering is in place and everything, but occasionally some garbage gets through. I may then glance at it for less than a second before I hit "Delete", but the eye is fast enough to read and understand more than I'd want to. You could say that such a kamikaze message has succeeded after all.

Much of the spam I receive is in Russian. As a side note, Russian text can be encoded in several ways - Windows-1251, KOI8-R, CP866, ISO-8859-5 and the universal UTF-8 come to mind. This means that the mail client has to identify the encoding correctly and decode the message accordingly before it can be displayed.
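A minimal sketch of what goes wrong when the client guesses the encoding incorrectly, assuming Python 3. The sample word is made up for illustration; the codec names are the ones Python's standard library uses for these encodings.

```python
text = "Привет"              # sample word, chosen for illustration
raw = text.encode("koi8_r")  # the bytes as they would travel in the message

# Decode the same bytes under several single-byte Cyrillic encodings.
# Only the codec that matches the sender's encoding gives the text back.
for codec in ("koi8_r", "cp1251", "cp866", "iso8859_5"):
    print(codec, "->", raw.decode(codec))
```

All four decodes succeed without an error, which is exactly the problem: the bytes are valid in every one of these encodings, so the client cannot tell from the bytes alone that it picked the wrong one.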

I use Thunderbird, and it is just awful at decoding Russian messages. I have no idea why that is, but I have to specify the encoding manually for every last message, because they always appear garbled.


But then, the bug becomes an unexpected feature - the spam messages look just as undecipherable as the legitimate ones, and even though I look at them, nothing is imprinted in my mind, and I just hit "Delete".

March 30, 2009

Software architecture

is what you explain to somebody else so that he understands the matter.

February 05, 2009

This is Python: context managers and their use

Python allows the developer to override the behavior of pretty much everything. For example, as I explained before, the ability to override the "dot" operator makes all sorts of magic possible.

The topic of this post is a similar magic enabler - "context managers", defined in PEP 343. I will also demonstrate one idiosyncratic context manager example.

To begin with, it is important to note that Python reasonably suggests that when a developer modifies the behavior (i.e. the semantics) of something, it is still done somewhat in line with the original syntax. The syntax therefore implies a certain direction in which a particular behavior could be shifted.

For instance, it would be rather awkward to override the dot operator on some class in such a way that it throws an exception upon attribute access:
class Awkward:
    def __getattr__(self, n):
        raise Exception(n)

Awkward().foo  # throws Exception("foo")
It is a possible but very unusual way of interpreting the meaning of a "dot", which is originally a lookup of an instance attribute.
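For contrast, here is a sketch of an override that stays close to the original meaning of the dot: the lookup still returns a value associated with the name, only the place where it is looked up changes. The Record class and its fields are made up for illustration.

```python
class Record:
    """Exposes the entries of a dict as read-only attributes (hypothetical example)."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def __getattr__(self, name):
        # Called only when the normal attribute lookup has failed.
        try:
            return self._fields[name]
        except KeyError:
            # Keep the usual contract: a missing attribute is an AttributeError.
            raise AttributeError(name)


point = Record(x=1, y=2)
print(point.x + point.y)  # prints 3
```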

Having this in mind we proceed to the context managers. They originate from the typical resource-accessing syntactical pattern:
r = allocate_resource(...)
try:
    r.use()
finally:
    r.deallocate()
Such code is encountered so often that it was indeed a good idea to wrap it into a simpler syntactical primitive. A context manager in Python is an object whose responsibility is to deallocate the resource when execution leaves its scope (or context). The developer should only be concerned with allocating a resource and using it:
with ResourceAllocator(...) as r:
    r.use()
In simple terms, the above translates to:
ctx_mgr = ResourceAllocator(...)
r = ctx_mgr.__enter__()
try:
    r.use()
finally:
    # simplified: __exit__ also receives the exception details, if any
    ctx_mgr.__exit__(...)
I note a few obvious things first:
  1. A context manager is any instance that supports the __enter__ and __exit__ methods (aka the context manager protocol).
  2. A specific ResourceAllocator must be defined for each particular kind of resource. The syntactical simplification does not come for free.
  3. Context managers are typically one-time objects, which are created and disposed of as wrappers around the resource instances they protect.
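The points above can be sketched as follows. This is an illustration only: the Resource class, the events list and the body of ResourceAllocator are my assumptions, made up to show when each method of the protocol is called.

```python
events = []  # records the order of calls, for illustration only


class Resource:
    """Hypothetical resource with the use/deallocate methods from the text."""

    def use(self):
        events.append("use")

    def deallocate(self):
        events.append("deallocate")


class ResourceAllocator:
    """Context manager that owns one Resource for the duration of a with block."""

    def __enter__(self):
        events.append("allocate")
        self.resource = Resource()
        return self.resource  # this is what "as r" receives

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs both on normal exit and when the block raises.
        self.resource.deallocate()
        return False  # do not swallow exceptions raised inside the block


with ResourceAllocator() as r:
    r.use()

try:
    with ResourceAllocator() as r:
        raise RuntimeError("failure inside the block")
except RuntimeError:
    events.append("error propagated")

print(events)
```

Returning a false value from __exit__ lets the exception continue to propagate after the cleanup, which matches what the try/finally version does.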
What is less obvious is that a class can be a context manager for its own instances; there need not be a separate class for that. For example, instances of threading.Lock are their own context managers. They provide the necessary methods and can be used like this:
import threading

lock = threading.Lock()
with lock:
    pass  # do something while the lock is acquired
which is identical to
lock = threading.Lock()
lock.acquire()
try:
    pass  # do something while the lock is acquired
finally:
    lock.release()
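A quick way to convince yourself that the two forms behave the same is to ask the lock about its state. This sketch uses only threading.Lock and its locked() method; the variable names are mine.

```python
import threading

lock = threading.Lock()

with lock:
    held_inside = lock.locked()  # the lock is held within the block

released_after = not lock.locked()  # and released once the block ends

try:
    with lock:
        raise ValueError("boom")
except ValueError:
    pass

# The lock is released even though the block raised.
released_after_error = not lock.locked()

print(held_inside, released_after, released_after_error)  # True True True
```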
Finally, I proceed to an example of my own.

See, I tend to write a lot of self-tests and I love Python for forcing me to. And some of the tests require that you check for a failure. Long ago I used to write code like this:
try:
    test_specific_failure_condition()
except SpecificError as e:
    assert str(e) == "error message"
else:
    assert False, "should have thrown SpecificError"
which made my test code very noisy. I even posted a suggestion that a syntactical primitive be introduced to the language just for that. It was rejected (duh!).

And then I wrote a simple "expected" context manager which does exactly the same thing for me every day now:
with expected(SpecificError("error message")):
    test_specific_failure_condition()
See how much noise has been eliminated? How much clearer the test code becomes? It is not a particularly "resource-protecting" kind of thing, but it is still in line with the original syntax, just like I said above.

The "expected" context manager source code is available here, please feel free to use it if you like.

To be continued...