Troublesome Locks

At my company, we develop a large software with fairly high concurrency. In general, asynchronous message passing is the choice for thread communication for its simplicity. When performance matters, we use locks and mutexes to share data among threads.

As the software grows, subtle bugs from dead-locks and race conditions are appearing around code that was designed with locks and mutexes.

I can say confidently that almost all of these bugs are caused by poor interface design or poor locking granularity. But I also feel that a well-designed thread-safe interface is extremely difficult to achieve.

Always Chasing The Perfect Granularity

To have a thread-safe interface, it requires far more than just putting a lock on every member function. Consider a supposedly thread-safe queue below, is this class really useful in a multi-threaded environment?

// thread-safe queue that has a scoped_lock guarding every member function
template <typename T>
class ts_queue()
{
public:
	T& front();
	void push_front(T const &);
	void pop_front();
	bool empty() const;
};

Unfortunately no. Here’s a simple code below where it falls apart. Say f() is called by multiple threads, if the the empty() check returns false, there is no guarantee that front() and pop_front() will succeed.

Even if front() succeeds, there is no guarantee that pop_front() will succeed. The ts_queue class interface is subject to race condition, and there is nothing you can do about it without an interface change.

//f may be called by multiple threads
void f(ts_queue<int> const &tsq)
{
	if(tsq.empty() == false)
	{
		int i = tsq.front(); // this might fail even if empty() was false
		tsq.pop_front(); // this might fail even if front() succeed
	}
}

The problem is that mutexes are not composable. You can’t just call a thread-safe function one after another to achieve thread-safety. In the ts_queue example, the interface granularity is too small to perform the task required by f().

On the other hand, it would be easy to use some global mutex to lock down the entire ts_queue class. But when the granularity is too large, you lose performance by excessive locking. In that case, the multi-core processors will never be fully utilized.

Some programmers try to solve it by exposing the internal mutex to the interface, and pun the granularity problem to their users. But if the designers can’t solve the problem, is it fair to expect their users to be able to solve it? Also, do you really want to expose concurrency?

So in the process of designing a thread-safe class, I feel that I am spending majority of the time chasing the perfect granularity.

Full of Guidelines

In addition to the problem of granularity, you need to follow many guidelines to maintain a well designed thread-safe interface.

  • You can’t pass reference/pointers of member variables out to external interface, whether it is through return value, or through an out parameter. Anytime a member variable is passed to an external interface, it could be cached and therefore lose the mutex protection.
  • You can’t pass reference/pointers member variables into functions that you don’t control. The reason is same as before, except this is far more tricky to achieve.
  • You can’t call any unknown code while holding onto a lock. That includes library functions, plug-ins, callbacks, virtual functions, and more.
  • You must always remember to lock mutexes in a fixed order to avoid the ABBA deadlocks.
  • There are more…

And if you are programming in C++, you need to watch out for all of the above while wrestling with the most engineered programming language in the world. 🙂

Final Thoughts

Locks and mutexes are just too complex for mortals to master.

And until C++0x defines a memory model for C++, lock-free programming is not even worth trying.

I will stop complaining now.

Test-And-Test-And-Set Lock

After I learned that Compare-And-Swap (CAS) has an infinite consensus number, I felt like I can rule the world with it. Well, maybe not quite, but I felt pretty darn good.

I got the urge to utilize the power of CAS, so I spent some time looking into its usage in the real-world of mutual exclusion.

Just so happens, I bumped into a chapter in the textbook that describes a spin lock algorithm called the Test-And-Test-And-Set (TTAS) lock.

The algorithm is succinct and it is elegantly solved by CAS.

TTASLock

In a mutual exclusion protocol, only one thread can acquire the lock.

If it acquired the lock, it may access the critical region, get its job done, and get out of critical region.

If it can’t acquire the lock, there are two choices. It can repeatedly test the lock – spinning, or it yields its slice of time to another thread and come back later – blocking.

If the contention is low, and spin locks might make sense.

In the test-and-test-and-set algorithm, the mutex is first tested to see if it is free. If it appears to be free, we perform test-and-set (CAS) on mutex.

The two tests idea is similar to the double-checked locking optimization to reduce the CAS overhead.

Here is my implementation of the TTAS lock in C++. It is a scoped lock that utilizes the Win32 Interlocked Variable Access API to perform atomic synchronization.

// atomic read is just a volatile read in C
inline LONG atomicRead(volatile LONG *mem) { return *mem; }

class CTtasLock
{
public:
	// mutex class that holds the lock state
	class CTtasMutex
	{
	public:
		CTtasMutex() : m_state(0) {};
		LONG m_state;
		LONG const static Locked = 1;
		LONG const static NotLocked = 0;
	};

	// RAII constructor that acquire lock upon construction
	explicit CTtasLock(CTtasMutex&m) : m_mutex(m) { lock(); }

	// RAII destructor that performs the "scoped" lock behavior
	~CTtasLock() { unlock(); }

	// spin lock that performs the test-test-and-set algorithm
	void lock()
	{
		while(true)
		{
			// test until state is cleared
			while(atomicRead(&m_mutex.m_state) == CTtasMutex::Locked){};

			// test-and-set
			if(InterlockedCompareExchange(
				&m_mutex.m_state,
				CTtasMutex::Locked,
				CTtasMutex::NotLocked) == CTtasMutex::NotLocked)
			{
				return;
			}
		}
	}
	// unlock the mutex with atomic set
	void unlock()
	{
		InterlockedExchange(&m_mutex.m_state, CTtasMutex::NotLocked);
	}
private:
	CTtasMutex &m_mutex;
};

Performance Under Contention

So how well does TTASLock work under contention?

Unfortunately, not very well at all. 😦

Multiple thread race to increment a shared integer value to one million. As the number of thread increases, contention increases, and performance decreases.

As the number of threads increases, there are more contentions and more cycles are wasted spinning. And since the TTASLock spins on a shared mutex, each spinning thread encounters a cache miss whenever a thread acquires the lock and writes to the mutex. Thus, the increase of contention also leads to the increase in cache misses.

And worse yet, the first test in TTAS has almost no benefit in practice because the performance is dominated by cache misses.

[Update: See ttas-spinlock-with-yield for a more scalable solution]

Thoughts and Conclusions

Although TTAS lock has performance issues in practice, I enjoy its simplicity.

The cache misses issue can be relieved if each thread observes a different counter, and ensure that the counter are far enough that they are on different cache lines.

Source

The source and the data sheet can be downloaded here.

Tools: Visual Studio 2008, Boost 1.41, Window 7

Prefer Simple Mutex Over Recursive Mutex

For those of us who work in the multi-threaded environment, the concept of mutex (“mutual exclusion”) should be no stranger. There are two common categories of mutex – simple and recursive.

Simple mutex can only be locked once, and if the same thread tries to lock the same mutex again, it will deadlock.

Recursive mutex can be locked by the same thread numerous times, and the same thread must unlock the same number of times before the critical region can be accessed by another thread.

Recursive mutex is common. In fact, mutexes are recursive by default in Windows.

With all that said, I believe recursive mutex should be avoid as much as possible.

Obvious Overhead

Since recursive mutex tracks more information than a simple mutex, it should be no surprise that it is more expensive.

Here is a sample test case with multiple threads accessing a critical region. The mutex implementation is from Boost.Thread library.

recursive_mutex_benchmark_small

Red dotted line indicates the cost of a single simple mutex. Blue line indicates the cost of multiple recursive mutexes.

A single recursive mutex is quite a bit more expensive than a simple mutex. If recursive mutex is used, it is probably going to be locked more than once by the same thread (otherwise, why would you use it?). The graph also demonstrates the cost of recursively locking the mutex again, up to five times. As you can see, it is linear, and it is definitely not free.

The term “recursive mutex” might sound fancy, but they are nothing but redundant locks. After the critical region is locked, each additional recursive mutex adds nothing but overhead.

Some might argue that this cost is negligible. This might be true depending on the application, but this is just the surface of the problem.

The Real Problem

The truth is that locks and mutexes are difficult to use correctly. Until we have STM, there is no silver bullet.

Recursive mutex makes locking easy by taking away the programmer’s responsibility of tracking mutexes. Programmers can add recursive mutex to existing functions and member functions easily without refactoring code. This would not have been possible with simple mutex. However, this advantage is merely an illusion.

Don’t believe it? Here’s the thought of David Butenhof, the person who introduced recursive mutex in POSIX thread.

Here’s some of his powerful arguments:

1. Recursive mutex allows you to lose track of your locking scheme and scope. As your call stack becomes deeper, you will have less sense on how long you’ve been locking the critical region.

2. Recursive mutex are reference counted, and releasing it does not imply that the critical region is freed. So how would you know how to unlock?

I’ve learned a lot from his post. His argument addresses the fundamental design issue in software that use recursive mutex.

To avoid using recursive mutex, we have to consider refactoring.

Refactoring

In functional programming, a typical refactoring technique to break up recursive locking is by splitting a function into two components– outer and inner. The outer function holds the lock, and invoke the inner functions to perform its task. The inner functions is an internal function that is not exposed as an API.

In Object Oriented programming, this technique can also be applied in form of Pimpl idiom or private functions.

Consider this example:

Class A has two public functions – One and Two. One calls Two. Both methods are thread-safe, they both lock the same recursive mutex, which is a private member of Class A. Below demonstrates that the functional programming refactoring technique can be applied in OO programming.

recursive_class_bad

A::One calls A::Two, they both lock on the same mutex member variable.

recursive_class_good

Refactor the implemention of A::Two into A::Two_Impl. A::One and A::Two nows calls A::Two_Impl. Recursive mutex becomes unnecessary, and can be refactored into a simple mutex.

Refactoring – A More Complicated Scenario

When more than one classes are involved, things gets more complicated. The technique above will not hold.

Here’s a more complicated (yet realistic) example:

Class A has two public functions – One and Three. Both functions are thread-safe, and lock the same recursive mutex, which is a private member variable of A. Class B has one public function – Two.

A::One calls B::Two, and B::Two calls A::Three.

recursive_class_complicated

A more complicated scenario that involves more than one class.

Now we have a mess. The possible call paths grows exponentially as the call stack gets deeper.

In the example, there are three possible paths.

1. A::One, B::Two, A::Three

2. B::Two, A::Three

3. A::Three

If we apply the previous technique, we would refactor A::Three into A::Three and A::Three_Impl. That won’t work here because class B expects A::Three_Impl to be a public interface.

There is no fix-all solution here, but here’s some suggestion.

1. Refactor A and B such that they do not circular reference each other. In other word, A can reference B, or B can reference A, but not at the same time.

2. Merge A and B in one class, or move B::Two into class A.

Use Recursive Mutex If It Is Absolutely Positively Necessary

Unfortunately, in the real world, things get even more complicated. The complicated example above only involves two classes. But what if more classes are involved, the number of call paths grows exponentially, and becomes unrealistic to refactor?

As a discipline of computer science, I would argue that for every solution with recursive mutex, there must be a solution for simple mutex.

As a software engineer who is under the gun with a deadline… I say just hack it and use recursive mutex. We are shipping tomorrow! 🙂

But as David Butenhof said, recursive mutex is a hack. Instead of relying on recursive mutex up front, we should avoid it until it is absolutely necessary.

Conclusion

Recursive mutex is dangerous because you lose sense of locking scope. It costs more than a simple mutex.

Consider to refactor the code to avoid the use of recursive mutex.

Use recursive mutex if it can’t be refactored economically.

Source

The source and the spreadsheet can be downloaded here.

Tools: Visual Studio 2008, Boost 1.39, Window XP SP3 32bit