Test-And-Test-And-Set Lock

After I learned that Compare-And-Swap (CAS) has an infinite consensus number, I felt like I could rule the world with it. Well, maybe not quite, but I felt pretty darn good.

I got the urge to utilize the power of CAS, so I spent some time looking into its usage in the real world of mutual exclusion.

It just so happens that I bumped into a chapter in the textbook that describes a spin lock algorithm called the Test-And-Test-And-Set (TTAS) lock.

The algorithm is succinct and it is elegantly solved by CAS.

TTASLock

In a mutual exclusion protocol, only one thread can hold the lock at a time.

If it acquires the lock, it may enter the critical region, get its job done, and get out of the critical region.

If it can’t acquire the lock, there are two choices. It can repeatedly test the lock – spinning – or it can yield its time slice to another thread and come back later – blocking.

If contention is low, spin locks might make sense.

In the test-and-test-and-set algorithm, the mutex is first tested to see if it is free. If it appears to be free, we perform a test-and-set (CAS) on the mutex.

The two tests idea is similar to the double-checked locking optimization to reduce the CAS overhead.

Here is my implementation of the TTAS lock in C++. It is a scoped lock that utilizes the Win32 Interlocked Variable Access API to perform atomic synchronization.

// on Win32/x86, an aligned LONG read is atomic, so a volatile read suffices
inline LONG atomicRead(volatile LONG *mem) { return *mem; }

class CTtasLock
{
public:
	// mutex class that holds the lock state
	class CTtasMutex
	{
	public:
		CTtasMutex() : m_state(0) {};
		LONG m_state;
		LONG const static Locked = 1;
		LONG const static NotLocked = 0;
	};

	// RAII constructor that acquires the lock upon construction
	explicit CTtasLock(CTtasMutex &m) : m_mutex(m) { lock(); }

	// RAII destructor that performs the "scoped" lock behavior
	~CTtasLock() { unlock(); }

	// spin lock that performs the test-and-test-and-set algorithm
	void lock()
	{
		while(true)
		{
			// test until state is cleared
			while(atomicRead(&m_mutex.m_state) == CTtasMutex::Locked){};

			// test-and-set
			if(InterlockedCompareExchange(
				&m_mutex.m_state,
				CTtasMutex::Locked,
				CTtasMutex::NotLocked) == CTtasMutex::NotLocked)
			{
				return;
			}
		}
	}
	// unlock the mutex with atomic set
	void unlock()
	{
		InterlockedExchange(&m_mutex.m_state, CTtasMutex::NotLocked);
	}
private:
	CTtasMutex &m_mutex;
};

Performance Under Contention

So how well does TTASLock work under contention?

Unfortunately, not very well at all. 😦

Multiple threads race to increment a shared integer up to one million. As the number of threads increases, contention increases, and performance decreases.

As the number of threads increases, contention grows and more cycles are wasted spinning. And since the TTASLock spins on a shared mutex, every spinning thread takes a cache miss whenever some thread acquires the lock and writes to the mutex. Thus, an increase in contention also leads to an increase in cache misses.

And worse yet, the first test in TTAS has almost no benefit in practice because the performance is dominated by cache misses.

[Update: See ttas-spinlock-with-yield for a more scalable solution]

Thoughts and Conclusions

Although the TTAS lock has performance issues in practice, I enjoy its simplicity.

The cache miss issue can be relieved if each thread spins on a different location, and the locations are kept far enough apart that they fall on different cache lines.

Source

The source and the data sheet can be downloaded here.

Tools: Visual Studio 2008, Boost 1.41, Windows 7

Pondering on Consensus

I have been reading The Art of Multiprocessor Programming, and have been pondering on the foundation of lock-free and wait-free algorithms for the last few weeks.

The book introduces the concept of registers and consensus. The ideas are fairly abstract, but the proofs and results are very rewarding.

It gives me good insight into how lock-free and wait-free algorithms are constructed.

Consensus Definitions

Consensus is a distributed computing problem on how multiple threads or tasks can agree upon some value based on their inputs. Written as an object, a consensus object has a single method called decide().

public interface Consensus<T> {
	T decide(T value);
}

The decide() method has to meet the following two conditions:

  1. consistent: all threads decide the same value
  2. valid: the common decision value is some thread’s input

A consensus protocol is a class that implements consensus in a wait-free manner.

The consensus number of a class is the largest number of threads for which the class can solve the consensus problem. In an n-thread system, a class with a consensus number >= n is considered universal. If a class can solve consensus for any number of threads, its consensus number is infinite.

Atomic Registers

At the hardware level, threads communicate with each other through shared memory. But from a theoretical standpoint, we can simplify this by saying that threads communicate through shared concurrent objects.

Consider an object called AtomicRegister that holds some value and has a read() and a write() method. It is an abstraction over hardware primitives such as load/store instructions, and it is used for inter-thread communication.

public interface AtomicRegister<T> {
	T read();
	void write(T v);
}

AtomicRegister is wait-free and linearizable. This implies that if a method call completes before another begins, the earlier call must have taken effect before the later call.

Here comes two striking conclusions.

  1. AtomicRegister has a consensus number of 1.
  2. It is impossible to construct a wait-free or lock-free implementation of an object with consensus number n from an object with a lower consensus number.

Therefore, in order to implement any lock-free or wait-free data structure, modern processors must provide more than just loads and stores.

The Power of Compare-And-Swap

Modern processors provide some powerful synchronization mechanisms, such as Compare-And-Swap. In x86, the instruction is called CMPXCHG.

The instruction takes two arguments: an expected value and an updated value. If the current value of the memory location is equal to the expected value, it is replaced by the updated value. Otherwise, the value is left untouched.

In the Java concurrency package, this functionality is provided by the compareAndSet() method of the AtomicInteger class.

It is not difficult to see that compareAndSet() has an infinite consensus number.

With compareAndSet(), we can implement a consensus protocol with an infinite consensus number.

class CASConsensus<T> {
	private static final int FIRST = -1;

	private final AtomicInteger r = new AtomicInteger(FIRST);

	// N-thread system where each thread has its own proposed value.
	// The array index corresponds to the thread id.
	protected final T[] proposed = (T[]) new Object[N];

	// Record this thread's proposed value. It might not be chosen.
	void propose(T value) {
		proposed[ThreadID.get()] = value;
	}

	// Take in a value, and obtain consensus among all threads
	// to see which value is decided.
	public T decide(T value) {
		propose(value);
		int i = ThreadID.get();
		if (r.compareAndSet(FIRST, i))
			return proposed[i];
		else
			return proposed[r.get()];
	}
}

In the CAS consensus protocol, compareAndSet() ensures that only the first thread’s proposed value will be accepted; every other thread returns the winner’s value. This is simple, yet very powerful.

Conclusion

Without modern hardware support, it is impossible to implement any lock-free or wait-free object.

Compare-And-Swap has an infinite consensus number, and it is a key operation for a wait-free consensus protocol.

Side Note

Apparently, with a wait-free consensus protocol, there exist algorithms to convert any object into a lock-free or wait-free concurrent object. We will save that for next time. 🙂

Pitfalls of Observer Pattern

The observer pattern is a powerful software design pattern that allows observers to subscribe for event updates from a subject.

Following the Hollywood Principle, it offers low coupling and high cohesion between the subject and its observers.

Everyone talks about the good things, but what about its pitfalls?

Observer Equality

In observer pattern, it is typical to have multiple observers observing a subject.

Here is a simple observer pattern setup.

Observer pattern where a subject with three observers listening for notifications.

Among the observers of a subject, all observers should be treated equally. This is very important, and here is why.

The moment a subject provides preferential treatment to a particular observer, it becomes highly coupled with that observer. And this defeats the purpose of the observer pattern.

Preferential treatment could mean several things.

  1. Order of notification (e.g. Observer A must be notified before Observer B and C)
  2. Conditional notification (e.g. Observer A only likes event X and Observer B only likes event Y).
  3. Specialized notification (e.g. Create an event specialized just for Observer A)

In my experience, the observer pattern usually starts out with a good implementation. But as events and interactions among objects become more complex, the assumption of observer equality becomes increasingly difficult to uphold. It is very tempting to just “hack” a few lines to get something working. Unfortunately, complexity keeps increasing, and a few lines of hack morph into nasty spaghetti code.

Cycle In Event Flow

Within the observer pattern, another pitfall to watch out for is a cycle in the event flow.

Observer Pattern with event flow cycle

With a cycle in the event flow, undesired recursive behavior is likely. In the worst case, the cycle involves the subject, and may cause infinite recursion.

This may seem obvious at a glance. After all, who would purposely design a cycle into an observer pattern? But remember, software has no shape or form. It is easy to introduce unintentional cycles in a complex architecture that you are unfamiliar with.

Other Thoughts

Observer pattern has been somewhat standardized into libraries in modern programming languages.

Java Beans has a powerful utility class called PropertyChangeSupport that standardizes event notification and listener interfaces. Ruby has observer.rb, which is similar to Java’s, and it is simple and powerful. I have had good experience with both of them.

And as usual, C++ has an overengineered solution called Boost.Signals. The difference here is that it is tailored toward functional programming. Notification callbacks are done in the form of boost::function objects (wrappers around actual function pointers). I dislike it because the observer pattern is hidden from the class interface, and it creates call stacks that make your eyes bleed. The original design of Boost.Signals was so bad that they had to make a Boost.Signals2 library. 😆

C++ for_each Hurts My Head

I admit that this is my personal pet peeve. I hate C++ for_each.

The foreach statement is syntactic sugar for traversing a collection of items.

It offers better clarity than C-style for loops because it does not require an index counter or an iterator.

foreach item in collection
	do stuff to item

This feature is available in most modern programming languages, and it is usually very well done.

C++ doesn’t support foreach natively. Instead, it is done with a template function, and it hurts my brain.

foreach – the C++ Way

Let’s take a look at the VC90 implementation of for_each in STL.

// TEMPLATE FUNCTION for_each
template<class _InIt,
	class _Fn1> inline
	_Fn1 for_each(_InIt _First, _InIt _Last, _Fn1 _Func)
	{	// perform function for each element
	_DEBUG_RANGE(_First, _Last);
	_DEBUG_POINTER(_Func);
	_CHECKED_BASE_TYPE(_InIt) _ChkFirst(_CHECKED_BASE(_First));
	_CHECKED_BASE_TYPE(_InIt) _ChkLast(_CHECKED_BASE(_Last));
	for (; _ChkFirst != _ChkLast; ++_ChkFirst)
		_Func(*_ChkFirst);
	return (_Func);
	}

[Update: The for_each code has been modified to match the VC90 implementation exactly. I didn’t do a good job before, and it was pointed out by my colleague.]

C++ for_each is a template function that takes two iterators that define the first and last elements of a sequence, and a function pointer or function object whose operator() is applied to each item of the sequence.

In Scott Meyers’ Effective STL, he argues that for_each makes code more direct and straightforward.

So here is an example directly from the book.

// function object that performs an action on an integer
// adaptable because it inherits from unary_function
class DoSomething : public unary_function<int,void>
{
public:
	void operator()(int x) {...}
	...
};

typedef deque<int>::iterator DequeIntIter;

int main()
{
	deque<int> di; // a deque of integers
	...
	DoSomething d; // function object

	// oh god, what's this
	for_each<DequeIntIter, DoSomething&>(
		di.begin(),
		di.end(),
		d);
}

No matter how many times I look at this, it is not intuitive to me. The template syntax is obscure, and I don’t get a sense that there is a loop in the code.

I tried to get used to it, I really did. But after all these years, I have finally given up.

Strangely, some of my colleagues feel the same way.

Maybe it is not just me …

Going back to one of my CS courses in my M.S., I remember reading a study on how programmers review code.

The idea is to analyze a programmer’s eye movement during code review. This eye movement is called the scan pattern.

The results were very interesting. In general, programmers tend to spend a third of the time scanning most of the function. Then they spend the remaining time going back and forth within loops.

In hindsight, this is not surprising. Loops often hold the most complex logic in a function. To understand the logic, programmers need to spend time “unrolling” the loop mentally to see the side effects of each iteration.

If the study is accurate, then it certainly explains why C++ for_each is difficult to read.

Conclusion

I will avoid C++ for_each at all costs.