The Lost-Wakeup Problem

In multi-threaded programming, the lost-wakeup problem is a subtle problem that causes a thread to miss a wake-up due to race condition.

In the obvious case, the thread that misses a wake-up would sleep forever and no work is performed.

But often, a thread that missed a wake-up could be recovered by a subsequent wake-up. To the outside world, the post-condition looks identical, and the lost-wakeup problem stays under the radar.

This is scary because the worst kind of software bug you can have is the kind that exists, and you know nothing about it, including its existence.

A Simple Locked Queue

Consider a simple thread-safe queue that supports push() and pop().

The function pop() guarantees to return a item. If the queue is empty, it would block and wait on a “not empty” condition until an item exists.

The function push() guarantees that the item is inserted in the front of the queue. It notifies a thread the “not empty” condition when the queue goes from “empty” to “not empty”.

Guess what? This queue suffers from the lost-wakeup problem.


// generic thread-safe queue that supports push() and pop()
template <typename T>
class locked_queue : noncopyable
{
public:
	// push an item into the end of the queue
	void push(T const &item)
	{
		mutex::scoped_lock lock(m_mutex);
		m_queue.push(item);
		// queue went from empty to one, notify a thread
		if(m_queue.size() == 1)
		{
			m_notEmptyCondition.notify_one();
		}
	}
	// pop the oldest item out of the queue
	// blocks until there are item to be popped
	T pop()
	{
		mutex::scoped_lock lock(m_mutex);
		while(m_queue.empty() == true)
			m_notEmptyCondition.wait(lock);

		T item = m_queue.front();
		m_queue.pop();
		return item;
	}
private:
	queue<T> m_queue;
	mutex m_mutex;
	condition_variable m_notEmptyCondition;
};

The Test Case

To prove that locked_queue suffers the lost-wakeup problem, we can write a small test case to simulate a race condition.

The test case involves a producer thread, and two consumer threads. Two consumers are up and waiting for items. The producer thread then pushes two items in the queue. If the queue works properly, since two items are pushed into the queue, both consumer thread should wakeup and completes its function.

// producer that pushes two items to a queue
void producer(locked_queue_t *q)
{
	q->push(1);
	q->push(2);
}
// consumer that pops an item from the queue
void consumer(locked_queue_t *q)
{
	int i = q->pop();
	std::cout << i << std::endl;
}
int main()
{
	locked_queue_t lq;
	// set up two consumers to wait for items
	thread c1(bind(&consumer, &lq));
	thread c2(bind(&consumer, &lq));
	// sleep for a second to ensure c1 and c2 are ready
	// not the best code, but good enough for a test case
	Sleep(1000);
	// create the producer to push in two items
	thread p(bind(&producer, &lq));
	c1.join();
	c2.join();
	p.join();

	return 0;
}

Here’s the output of the test case on my machine.

1
//wait forever and does not exit

Test Case Step-By-Step Replay

Here’s the step-by-step replay on exactly what happened in our test case.

1. First, we create two consumer threads C1 and C2. Since the queue is initially empty, C1 and C2 must block and wait on the “not empty” condition.

Step 1

2. Then producer thread P starts. P immediately pushes an item into the queue, and then notifies a thread.

Step 2

3. This step is the key to the lost-wakeup. After P notifies a thread with the “not empty” condition, it beats the consumer thread and immediately add another item to the queue. Notice that the queue size is now 2, and because the queue was not empty, the “not empty” condition was not fired. At this point, two threads are waiting for data, and only one would be woken up.

Step 3

4. At last, C1 was woken up. It acquires the lock, pops the queue, and completes its operation. However, C2 will stay asleep forever.

Step 4

The Solution

The root of the problem is from a premature optimization where the “not empty” condition is only notified when the queue size goes from 0 to 1. We can simply change push() to always call notify_all(), and this problem goes away.

	void push(T const &item)
	{
		mutex::scoped_lock lock(m_mutex);
		m_queue.push(item);
		// always notify regardless of the queue size
		m_notEmptyCondition.notify_all();
	}

Final Thoughts

The original locked_queue works perfectly if there is only one producer and one consumer. If there is more than one producer or consumer, lost-wakeup problem exists.

What I’ve demonstrated here is only one variant of the lost-wakeup problem. In practice, there can be far more complex and subtle.

To avoid lost-wakeup, here’s two best practices to follow.

  • Its best to always signal all threads that are waiting for a condition. It might appear to be wasteful, but it is a lot better than losing a wake-up.
  • A timed wait is an alternative way to solve this problem. This would force a thread to wait up periodically to check for the condition, regardless whether or not it has been signaled.

Source

The source and the data sheet can be downloaded here.

Tools: Visual Studio 2008, Boost 1.41, Window 7, Intel I5-750 (quad core)

Volatile, Another Abused C++ Keyword

Volatile is a confusing keyword. In the C++ standard, volatile is described as a type qualifier, and it is often paired up with const.

But unlike const, volatile actually has side effects. I was told that this keyword is used to disable compiler optimizations, and to replace locks in multi-threading programming.

It sounds like a lot of voodoo involved, but I am rarely given a satisfying explanation. I don’t believe in magic, so I spent some time researching it. In the end, I reached a frightening conclusion.

If you use volatile for multi-threaded programming, you are doing it wrong!

Standard’s Definition

In C++ standard, it says this.

In general, the semantics of volatile are intended to be the same in C++ as they are in C.

That certainly wasn’t informative. So I dug up the C99 standard.

This is the C99 definition of volatile – section 6.7.3 Item 6.

An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine… Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine…

And for the footnote, it adds this.

A volatile declaration may be used to describe an object corresponding to a memory-mapped input/output port or an object accessed by an asynchronously interrupting function. Actions on objects so declared shall not be “optimized out” by an implementation or reordered except as permitted by the rules for evaluating expressions.

Ideally, to program in a multi-threaded environment,  we need a concurrency memory model that addresses atomicity, visibility and memory ordering. In C99 standard, volatile is required to address the issue of visibility, but does nothing to address atomicity and memory ordering.

Instead, volatile is intended for memory-mapped I/O ports.

It’s Not News

It’s funny that once you know what to look for, answers will magically appear out of nowhere.

There have been numerous articles on why volatile is not intended for multi-threaded programming. I just happened to discover it 10 years too late.

More Confusion

Since VS8.0, Microsoft decided to go beyond the standard and gave volatile acquire and release semantics. This way, it is actually works like a memory barrier and is useful for multi-threaded programming. This behavior is not portable.

Final Thoughts

Until C++0x defines a concurrency memory model, use a memory barrier instead of volatile for multi-threaded programming.

C++ pitfalls never cease to amaze me. I think C++FQA needs a new section on volatile.

TTAS Spinlock with Yield

Last week, I implemented the Test-And-Test-And-Set (TTAS) spinlock in C++. The algorithm is simple and elegant, but it wasn’t very efficient under high contention.

Ideally, we would like a spinlock algorithm to have zero overhead when an additional thread is added. To do this, we need a throttling mechanism that decrease the amount of contention.

Yield Under Contention

Recall that CTtaslock performs two steps.

1. It repeatedly tries to test the state of the mutex to see if it is unlocked.

2. If the mutex is unlocked, perform a compare-and-swap (CAS) to commit to the variable. If the commit fails, it goes back to step 1.

It is not difficult to see that more threads only leads to more contention with this algorithm.

Now what if we implement a simple backoff algorithm to allow a thread to yield under high contention. The amount to yield would be based on the number of times a thread failed to acquire a lock.

// This algorithm is based on the boost smart_ptr's yield_k
// algorithm.
void yield( unsigned int k )
{
	if( k < 2)
	{
	}
	else if( k < 16)
	{
		Sleep( 0 );
	}
	else
	{
		Sleep( 1 );
	}
}

This algorithm is used in boost smart_ptr yield_k. The number here are chosen arbitrarily based on my test, but the idea is clear. If the thread fails to acquire the lock more than twice, we perform Sleep(0). If we still fail, then we perform Sleep(1) for more aggressive yielding in case we are running into priority induced starvation.

TTAS with Yield Implementation

Here’s the implementation of CTTASLock with yield.

// atomic read is just a volatile read in C
inline LONG atomicRead(volatile LONG *mem) { return *mem; }

class CTtasLock
{
public:
	// mutex class that holds the lock state
	class CTtasMutex
	{
	public:
		CTtasMutex() : m_state(0) {};
		LONG m_state;
		LONG const static Locked = 1;
		LONG const static NotLocked = 0;
	};

	// RAII constructor that acquire lock upon construction
	explicit CTtasLock(CTtasMutex&m) : m_mutex(m) { lock(); }

	// RAII destructor that performs the "scoped" lock behavior
	~CTtasLock() { unlock(); }

	// spin lock that performs the test-test-and-set algorithm
	void lock()
	{
		while(true)
		{
			// test until state is cleared
			unsigned int k = 0;
			while(atomicRead(&m_mutex.m_state) == CTtasMutex::Locked)
			{
				++k;
				yield(k); // yield algorithm to avoid high contention
			};

			// test-and-set
			if(InterlockedCompareExchange(
				&m_mutex.m_state,
				CTtasMutex::Locked,
				CTtasMutex::NotLocked) == CTtasMutex::NotLocked)
			{
				return;
			}
		}
	}
	// unlock the mutex with atomic set
	void unlock()
	{
		InterlockedExchange(&m_mutex.m_state, CTtasMutex::NotLocked);
	}

private:
	void yield( unsigned int k )
	{
		if( k < 2 ) { }
		else if( k < 8 ) { Sleep( 0 ); }
		else { Sleep( 1 ); }
	}
private:
	CTtasMutex &m_mutex;
};

Performance Under Contention

So how does this implementation perform under contention?

Fortunately, much better. 🙂

When yield is introduced, the TTAS lock scales much better.

Final Thoughts

When threads yield under high contention, the TTAS lock scales well.

The only drawback is that TTAS lock has no sense of fairness. A thread could be starved because of luck. But then again, fairness could be overrated.

Source

The source and the data sheet can be downloaded here.

Tools: Visual Studio 2008, Boost 1.41, Window 7