Wednesday 13 July 2011

The importance of choosing the right data structure

A data structure is used for storing and organising data in system. Selecting the correct data structure for a particular problem is an important part of software design and will influence efficiency and maintainability of the code. This post demonstrates the importance of data structure selection with a contrived problem posed in a technical interview and a real world application.

The problem posed is as follows. Imagine you have a list of unsorted positive integers greater than zero. The list can be of arbitrary size but your algorithm must cater for extremely large lists (think millions or hundreds of millions of items). The list consists of items such that the number of occurrences of any particular integer is always even. One (and only one) integer in the list has an odd number of occurrences, how would you find that integer?

So for example, the list consisting of the following integers

5, 73, 8, 24, 24, 5, 73, 1, 24, 24, 5, 1, 8

has 5 as the number occurring an odd number of times (3 in this case but could be any odd number).

This is a great interview question as it allows the interviewer to get a feel for how an applicant approaches a problem and opens up various other branches of investigation. For example the applicants knowledge of algorithm efficiency (Big O notation) and sorting algorithms can be explored.

The key point in solving the problem revolves around how many times each of the items needs to be accessed to find a solution:
  • Every item needs to be visited at least once, no conclusion can be drawn until the last number is known.
  • There is no benefit to sorting the items, after sorting the problem is not solved and it will have incurred the additional cost of sorting
  • Using a map is not efficient for extremely large inputs and still requires an additional step for iterating over each of the unique values to check oddness

The ideal data structure to solve the problem is actually a Set. For each item add it to the set if it is not already present, if it is present the remove it, after processing all the items the set will contain the integer occurring an odd number of times. Ironically (for a post about data structures) there is an even simpler algebraic solution to the problem, performing a bitwise XOR (the ^ operator in Java) on all the input numbers results in the same solution without the use of a data structure. As stated above this example is contrived specifically for use in technical interviews, a real world application of data structure selection is as follows.

A functional requirement for a critical component of the system is that no two threads try and update an object (with a unique identifier) simultaneously. The system is multi-threaded and event driven, with certain events triggering the object updates. Simultaneous updates would result in a version exception and require manual intervention to correct. An initial (optimistic) solution is to add the synchronized keyword to the method responsible for performing the update, thus ensuring sequential execution of updates. While this approach should work as desired it incurs the cost of class level locking and prevents any kind of concurrency, even for update requests to distinct objects which could actually be executed in parallel. This raises the non-functional requirement of the component namely performance.

The component is responsible for cash movements in the system and at peak times can have thousands of events generated every second. While there is no real-time requirement, the delay introduced by the synchronized method could lead to the last events being processed minutes after they were received, a situation that was not acceptable to the business. After closer investigation it was identified that in the worst case only a fraction of the updates would ever attempt simultaneous updates. To restate the problem, how can duplicate requests be blocked while allowing distinct requests to process concurrently?

public class LockTest {
  protected static Set<Integer> ID_LOCK = Collections.synchronizedSet(new TreeSet<Integer>()); 

    public void performUpdate(int id) {
        if (!ID_LOCK.contains(id)) {
            boolean addedNew = ID_LOCK.add(id);
            if (addedNew) {
                try {
                    SomeObject object = loadObject(id);
                    if (!object.isUpdated())) {
                        object.update();
                        object.save();
                    }
                } finally {
                    ID_LOCK.remove(id);
                }
            }
        }
    }

    private SomeObject loadObject(int id) {
        // Load the object (from database/remote call)
        return someObject;
    }
}

The code above is a fair representation of the production code used to solve this problem and works as desired. A few points before looking at the implementation details. The loadObject() and object.update() calls are quite expensive and involve remote calls and over-the-wire object serialisation. It is also entirely possible for the object to be changed without the local copy being updated or notified of the change. I raise these points because they are significant for the (slightly off topic) synchronisation discussion below.

Lets start by looking at the flow for a single threaded (non-duplicate) call to the performUpdate method.

  1. !ID_LOCK.contains(id) condition (line 5) passes as the set is empty
  2. id is added and the addedNew variable is set to true (line 6)
  3. addedNew condition is met and the object is loaded (lines 7 - 9)
  4. !object.isUpdated() condition is met and the object is updated and saved (lines 10 - 12)
  5. id is removed from the set in the finally clause (line 15)
This is the simplest scenario. Extending it to two (or n) concurrent calls to the performUpdate method with distinct IDs follows exactly the same flow just interleaved with the same operations potentially at different stages of execution. So for example, two concurrent calls for id1 and id2 could look as follows:
  1. !ID_LOCK.contains(id1) condition (line 5) passes as the set is empty
  2. id1 is added and the addedNew variable is set to true (line 6)
  3. !ID_LOCK.contains(id2) condition (line 5) passes as the set contains id1
  4. id1 addedNew condition is met and the object is loaded (lines 7 - 9)
  5. id2 is added and the addedNew variable is set to true (line 6)
  6. id2 addedNew condition is met and the object is loaded (lines 7 - 9)
  7. id2 !object.isUpdated() condition is met and the object is updated and saved (lines 10 - 12)
  8. id2 is removed from the set in the finally clause (line 15)
  9. id1 !object.isUpdated() condition is met and the object is updated and saved (lines 10 - 12)
  10. id1 is removed from the set in the finally clause (line 15)
The really interesting stuff happens when two concurrent calls occur with the same ID. Fortunately the code maintains the critical section condition for all permutations:
  • Thread 1 is at line 7 (set contains id), thread 2 fails the condition at line 5
  • Both threads execute line 6, only one of them will return true , the other thread will fail at line 7
  • Thread 1 is at line 16 (id has been removed from the set), thread 2 will fail on line 10 (object already updated)
A few notes about the implementation:
  • The try/finally block is very important to ensure the lock is always released
  • The object.isUpdated() check is only required if the update method can't be called on an object that has already been updated
  • The synchronisation of the set could probably be optimised (instead of using the Collections synchronisation) but it would complicate the implementation
  • TreeSet is used for the ordering of the values, this is not strictly needed as the underlying implementation on a Set is generally a Map which will have a similar complexity for the look up as that of an insert into a SortedMap
  • As always, this is just one solution to the problem. I am sure others exist, please feel free to comment if you have a more elegant solution
As you may have guessed (if you made it this far in the post), I am quite a fan of the Sets and lament the fact that they are not used more widely, particularly when they are well suited to solving a problem.