Tuesday 26 March 2013

Missing the A in Asynchronous - Part 1

Few things are more irritating in software development than when people use a kludge rather than taking the time to implement a properly designed and robust solution. The peeve I am exercising today relates to the code snippet that follows:


    private BOTransfer getTransferInefficiently(Trade trade) throws RemoteException {
  // There is some delay between when a trade and corresponsing transfer is created
  BOTransfer transfer = null;
int cnt = 0;
do {
           try {
Thread.sleep(500l);
          } catch (Exception e) {
return null;
          }
          transfer = getFirstTransfer(trade);
} while (cnt++ < 10 && transfer == null);
return transfer;
    }

The code is pretty simple. Wait 500ms and try and load a transfer object from the trade, repeat until the transfer object is loaded or the number of load attempts reaches 10. In a single threaded or single user environment this code might be excusable. In a multi-user production trading system processing hundreds of transactions a second this implementation is totally unacceptable. Some well meaning soul has added a comment further explaining the method:

    // This implementation is fundamentally broken in many ways. The transfer creation is
    // asynchronous and this logic should be triggered off the transfer creation event (instead of
    // looping and sleeping). This approach prevents this engine thread from processing any other
    // events and will not find the transfer in the case where the TransferEngine is under heavy
    // load. useInefficientWaitForTransfer method added to the interface to allow correct
    // implementation on new classes

This person at least understood that the code was a problem and tried to mitigate the risks by extracting it into a separate method and adding the explanatory comment. Unfortunately for whatever reason (time constraints, lack of automated regression testing, indifference, ...) they only raised the visibility of the problem rather than fixing it entirely.

Issues

Increased system load - the call to getFirstTransfer actually does a remote call and a database query each time it is executed. This means that in the worst case (when the system is probably already overloaded) 10 database queries are executed at half a second intervals for 5 seconds (per thread), a great way to further degrade performance.

Poor thread pool usage - the getTransferInefficiently method is run from a static sized thread pool responsible for handling all requests queued for this process (typically between a few hundred and a few thousand per second). For each thread looping in this method there are numerous other requests that could have been processed instead. In the worst case if all threads are handling requests requiring this method then for a few seconds potentially thousands of requests will be queued unnecessarily.

Poor transparency - under heavy load most of the invocations of the method will return null as the transfer object will not have been created and consequently can't be loaded in the timeout (despite adding significant load to the already strained system). There is also no mechanism for measuring the average time taken for the object to be created and no backoff policy to try and reduce server load.

Solution

The correct implementation to achieve the desired behaviour asynchronously is relatively simple. A separate process should register/subscribe to BOTransfer events, on successful transfer creation perform the processing currently performed on the object returned by the getTransferInefficiently method.