February 20, 2007

Never underestimate the power of randomness

I've just returned from a deep testing and debugging session, and all I can say is, again: wow! Never underestimate the power of randomness!

The system I was testing is a complex of network services built on top of the Pythomnic platform, with multiple Python processes scattered across multiple servers and intertwined in a redundant, fault-tolerant fashion. When it goes live, it is going to be the billing hub service for the bank where I work. It has to deal with all sorts of payments to all sorts of providers, and so my job is to build a system into which modules for specific providers will be plugged later. It also transfers money, so it'd better be reliable.

Someday I'm going to describe the design of that system as a case study for Pythomnic and publish it on its web site. That will come later, but for now, here is my recipe for the best testing:

Stress + failure injection + randomness

Stress: don't spare the system you are testing. The users will not. Give it as high a load as it can handle, and then some. It's no problem if it breaks now, and the number of problems (not always bugs) revealed under unbearable load is surprising.
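For illustration, here is roughly what a crude load driver could look like; submit_payment, the thread count and the request count are placeholders for whatever your client side really is:

from threading import Thread

def submit_payment():
    # stand-in for one real client request into the system under test
    pass

def hammer(n_requests):
    # a single worker firing requests back to back, as fast as it can
    for _ in range(n_requests):
        try:
            submit_payment()
        except Exception:
            pass # failures are expected here, the point is the load

# start more concurrent workers than the system is sized for, and then some
workers = [Thread(target = hammer, args = (10000, )) for _ in range(50)]
for w in workers:
    w.start()
for w in workers:
    w.join()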

Failure injection: don't expect the problems to happen just because you are testing. Make them happen. Break stuff. Insert something like:

from random import random
from time import sleep

# injected failure before the external call
if random() < 0.01:
    raise Exception("failure before provider request")

# injected hang: the external call never returns
if random() < 0.001:
    sleep(3600)
    raise Exception("provider request hangs")

result = provider_request()

# injected failure after the external call has succeeded
if random() < 0.01:
    raise Exception("failure after provider request")


Insert it all over the place. Well, there is no point inserting failures between every two statements, it quickly gets cumbersome, but you should decorate each "external" call with such an injected failure frame, be it a database request, a specific API call, etc.
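One way to keep this from getting cumbersome is to pull the whole frame into a reusable wrapper, for instance a context manager. This is only a sketch: failure_frame and its probabilities are made up for illustration, and the provider_request stub stands in for the real external call:

from contextlib import contextmanager
from random import random
from time import sleep

def provider_request():
    # stand-in for the real external call
    return "OK"

@contextmanager
def failure_frame(name, p_fail = 0.01, p_hang = 0.001):
    # injected failure before the external call
    if random() < p_fail:
        raise Exception("failure before " + name)
    # injected hang: the external call never returns
    if random() < p_hang:
        sleep(3600)
        raise Exception(name + " hangs")
    yield
    # injected failure after the external call has returned
    if random() < p_fail:
        raise Exception("failure after " + name)

# wrap each "external" call in a frame
with failure_frame("provider request"):
    result = provider_request()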

Randomness: that's my favourite part. In testing, you can't beat randomness. You would never make up such combinations of failures as random() will. Make sure your random switches cover all the major code paths and let it run for a while. If it succeeds, you can be pretty certain the system is working. To be sure, such random testing may not catch all of the special border cases in each of the modules, but for load testing it's invaluable.
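The test driver itself can then be dead simple: keep pushing randomized transactions through for hours and count what comes back. Again, just a sketch; random_transaction and submit_payment here stand for whatever mix of operations your system actually supports:

from random import choice, randint

def submit_payment(provider, amount):
    # stand-in for one real client request into the system under test
    pass

def random_transaction():
    # pick a random provider and a random amount every time
    provider = choice(["provider_a", "provider_b", "provider_c"])
    amount = randint(1, 10000)
    submit_payment(provider, amount)

succeeded, failed = 0, 0
for _ in range(100000):
    try:
        random_transaction()
        succeeded += 1
    except Exception:
        failed += 1

print("succeeded: %d, failed: %d" % (succeeded, failed))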

2 comments:

Anonymous said...

Maybe a place for a decorator?

from random import random

def fail(method):
    def wrapper(*args, **kwargs):
        if random() < 0.01:
            raise Exception("random failure")
        return method(*args, **kwargs)
    return wrapper

@fail
def my_method():
    ...

Dmitry Dvoinikov said...

Sure, whatever suits you. The decorator will restrict the failures to the method's borders though.

Another way of using it could be

def fail():
    if random() < 0.01:
        raise Exception("random failure")

...
fail()
...
fail()
...
fail()
...