TTAS Spinlock with Yield

Last week, I implemented the Test-And-Test-And-Set (TTAS) spinlock in C++. The algorithm is simple and elegant, but it wasn’t very efficient under high contention.

Ideally, we would like a spinlock algorithm to have zero overhead when an additional thread is added. To do this, we need a throttling mechanism that decreases the amount of contention.

Yield Under Contention

Recall that CTtasLock performs two steps.

1. It repeatedly tries to test the state of the mutex to see if it is unlocked.

2. If the mutex is unlocked, it performs a compare-and-swap (CAS) to acquire it. If the CAS fails, it goes back to step 1.

It is not difficult to see that adding more threads only leads to more contention with this algorithm.
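For reference, the two steps above can be sketched with C++11 std::atomic as a portable stand-in (an assumption for illustration; the implementations in these posts use the Win32 Interlocked API):

```cpp
#include <atomic>

// Minimal TTAS acquire loop. A state of 1 means locked, 0 means unlocked.
void ttas_lock(std::atomic<int> &state)
{
	for (;;)
	{
		// Step 1: repeatedly test with plain loads until the lock looks free.
		while (state.load(std::memory_order_relaxed) == 1)
		{
		}

		// Step 2: attempt the CAS; on failure, go back to step 1.
		int expected = 0;
		if (state.compare_exchange_weak(expected, 1,
		                                std::memory_order_acquire))
		{
			return;
		}
	}
}
```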

Now, what if we implemented a simple backoff algorithm that allows a thread to yield under high contention? The amount of yielding would be based on the number of times the thread has failed to acquire the lock.

// This algorithm is based on the boost smart_ptr's yield_k
// algorithm.
void yield( unsigned int k )
{
	if( k < 2)
	{
	}
	else if( k < 16)
	{
		Sleep( 0 );
	}
	else
	{
		Sleep( 1 );
	}
}

This algorithm is based on boost smart_ptr's yield_k. The numbers here were chosen somewhat arbitrarily based on my tests, but the idea is clear: if the thread fails to acquire the lock more than twice, we perform Sleep(0). If it still fails, we perform Sleep(1) for more aggressive yielding, in case we are running into priority-induced starvation.
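A portable sketch of the same backoff policy, using std::this_thread as a stand-in for Win32 Sleep (an assumption for illustration; the thresholds mirror the snippet above):

```cpp
#include <chrono>
#include <thread>

// Backoff policy keyed on k, the number of failed acquisition attempts.
void backoff(unsigned int k)
{
	if (k < 2)
	{
		// Busy-wait: the contention is probably momentary.
	}
	else if (k < 16)
	{
		// Give up the rest of the time slice, roughly like Sleep(0).
		std::this_thread::yield();
	}
	else
	{
		// Sleep briefly, roughly like Sleep(1), to let lower-priority
		// threads run and avoid priority-induced starvation.
		std::this_thread::sleep_for(std::chrono::milliseconds(1));
	}
}
```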

TTAS with Yield Implementation

Here’s the implementation of CTTASLock with yield.

// atomic read is just a volatile read in C
inline LONG atomicRead(volatile LONG *mem) { return *mem; }

class CTtasLock
{
public:
	// mutex class that holds the lock state
	class CTtasMutex
	{
	public:
		CTtasMutex() : m_state(0) {};
		LONG m_state;
		LONG const static Locked = 1;
		LONG const static NotLocked = 0;
	};

	// RAII constructor that acquire lock upon construction
	explicit CTtasLock(CTtasMutex&m) : m_mutex(m) { lock(); }

	// RAII destructor that performs the "scoped" lock behavior
	~CTtasLock() { unlock(); }

	// spin lock that performs the test-test-and-set algorithm
	void lock()
	{
		while(true)
		{
			// test until state is cleared
			unsigned int k = 0;
			while(atomicRead(&m_mutex.m_state) == CTtasMutex::Locked)
			{
				++k;
				yield(k); // yield algorithm to avoid high contention
			}

			// test-and-set
			if(InterlockedCompareExchange(
				&m_mutex.m_state,
				CTtasMutex::Locked,
				CTtasMutex::NotLocked) == CTtasMutex::NotLocked)
			{
				return;
			}
		}
	}
	// unlock the mutex with atomic set
	void unlock()
	{
		InterlockedExchange(&m_mutex.m_state, CTtasMutex::NotLocked);
	}

private:
	void yield( unsigned int k )
	{
		if( k < 2 ) { }
		else if( k < 8 ) { Sleep( 0 ); }
		else { Sleep( 1 ); }
	}
private:
	CTtasMutex &m_mutex;
};

Performance Under Contention

So how does this implementation perform under contention?

Fortunately, much better. 🙂

When yield is introduced, the TTAS lock scales much better.

Final Thoughts

When threads yield under high contention, the TTAS lock scales well.

The only drawback is that the TTAS lock has no sense of fairness. A thread could be starved simply out of bad luck. But then again, fairness might be overrated.

Source

The source and the data sheet can be downloaded here.

Tools: Visual Studio 2008, Boost 1.41, Windows 7

Test-And-Test-And-Set Lock

After I learned that Compare-And-Swap (CAS) has an infinite consensus number, I felt like I could rule the world with it. Well, maybe not quite, but I felt pretty darn good.

I got the urge to utilize the power of CAS, so I spent some time looking into its real-world use in mutual exclusion.

As it happens, I bumped into a chapter in the textbook that describes a spin lock algorithm called the Test-And-Test-And-Set (TTAS) lock.

The algorithm is succinct, and it is elegantly expressed with CAS.

TTASLock

In a mutual exclusion protocol, only one thread can acquire the lock.

If it acquires the lock, it may access the critical region, get its job done, and get out of the critical region.

If it can’t acquire the lock, there are two choices. It can repeatedly test the lock (spinning), or it can yield its time slice to another thread and come back later (blocking).

If the contention is low, spin locks might make sense.

In the test-and-test-and-set algorithm, the mutex is first tested to see if it is free. If it appears to be free, we perform test-and-set (CAS) on the mutex.

The two-test idea is similar to the double-checked locking optimization: the cheap first test reduces the CAS overhead.

Here is my implementation of the TTAS lock in C++. It is a scoped lock that utilizes the Win32 Interlocked Variable Access API to perform atomic synchronization.

// atomic read is just a volatile read in C
inline LONG atomicRead(volatile LONG *mem) { return *mem; }

class CTtasLock
{
public:
	// mutex class that holds the lock state
	class CTtasMutex
	{
	public:
		CTtasMutex() : m_state(0) {};
		LONG m_state;
		LONG const static Locked = 1;
		LONG const static NotLocked = 0;
	};

	// RAII constructor that acquire lock upon construction
	explicit CTtasLock(CTtasMutex&m) : m_mutex(m) { lock(); }

	// RAII destructor that performs the "scoped" lock behavior
	~CTtasLock() { unlock(); }

	// spin lock that performs the test-test-and-set algorithm
	void lock()
	{
		while(true)
		{
			// test until state is cleared
			while(atomicRead(&m_mutex.m_state) == CTtasMutex::Locked) { }

			// test-and-set
			if(InterlockedCompareExchange(
				&m_mutex.m_state,
				CTtasMutex::Locked,
				CTtasMutex::NotLocked) == CTtasMutex::NotLocked)
			{
				return;
			}
		}
	}
	// unlock the mutex with atomic set
	void unlock()
	{
		InterlockedExchange(&m_mutex.m_state, CTtasMutex::NotLocked);
	}
private:
	CTtasMutex &m_mutex;
};

Performance Under Contention

So how well does TTASLock work under contention?

Unfortunately, not very well at all. 😦

Multiple threads race to increment a shared integer value to one million. As the number of threads increases, contention increases, and performance decreases.
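A hedged sketch of this contention test, using a std::atomic-based TTAS loop as a portable stand-in for the Win32 implementation above (the target and thread count here are illustrative, not the original benchmark parameters):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Shared lock state: 1 means locked, 0 means unlocked.
std::atomic<int> g_state(0);

void lock_ttas()
{
	for (;;)
	{
		// Test: spin with plain loads until the lock looks free.
		while (g_state.load(std::memory_order_relaxed) == 1) { }

		// Test-and-set: attempt the CAS; retry from the top on failure.
		int expected = 0;
		if (g_state.compare_exchange_weak(expected, 1,
		                                  std::memory_order_acquire))
			return;
	}
}

void unlock_ttas() { g_state.store(0, std::memory_order_release); }

// Each thread repeatedly acquires the lock and increments the shared
// counter until it reaches the target.
long long race_to(long long target, unsigned int threads)
{
	long long counter = 0;
	std::vector<std::thread> pool;
	for (unsigned int i = 0; i < threads; ++i)
	{
		pool.emplace_back([&] {
			for (;;)
			{
				lock_ttas();
				if (counter >= target) { unlock_ttas(); return; }
				++counter;
				unlock_ttas();
			}
		});
	}
	for (auto &t : pool) t.join();
	return counter;
}
```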

As the number of threads increases, there is more contention and more cycles are wasted spinning. And since the TTASLock spins on a shared mutex, each spinning thread suffers a cache miss whenever a thread acquires the lock and writes to the mutex. Thus, the increase in contention also leads to an increase in cache misses.

And worse yet, the first test in TTAS has almost no benefit in practice because the performance is dominated by cache misses.

[Update: See ttas-spinlock-with-yield for a more scalable solution]

Thoughts and Conclusions

Although TTAS lock has performance issues in practice, I enjoy its simplicity.

The cache miss issue can be relieved if each thread observes a different counter, with the counters spaced far enough apart that they land on different cache lines.
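A minimal sketch of that idea: give each thread its own flag, padded so that no two flags share a cache line (64 bytes is an assumed line size; the real value is platform-dependent):

```cpp
#include <atomic>

// One flag per thread, aligned and padded to a full (assumed 64-byte)
// cache line so that a write to one flag does not invalidate the cache
// line that another thread is spinning on.
struct alignas(64) PaddedFlag
{
	std::atomic<int> flag{0};
};

// One slot per thread; each spinning thread watches only its own slot.
PaddedFlag g_flags[8];
```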

Source

The source and the data sheet can be downloaded here.

Tools: Visual Studio 2008, Boost 1.41, Windows 7