Troublesome Locks

At my company, we develop a large piece of software with fairly high concurrency. In general, asynchronous message passing is our choice for thread communication because of its simplicity. When performance matters, we use locks and mutexes to share data among threads.

As the software grows, subtle bugs from deadlocks and race conditions keep appearing around code that was designed with locks and mutexes.

I can say confidently that almost all of these bugs are caused by poor interface design or poor locking granularity. But I also feel that a well-designed thread-safe interface is extremely difficult to achieve.

Always Chasing The Perfect Granularity

A thread-safe interface requires far more than just putting a lock on every member function. Consider the supposedly thread-safe queue below: is this class really useful in a multi-threaded environment?

// thread-safe queue that has a scoped_lock guarding every member function
template <typename T>
class ts_queue
{
public:
	T& front();
	void push_front(T const &);
	void pop_front();
	bool empty() const;
};

Unfortunately, no. Below is a simple piece of code where it falls apart. Say f() is called by multiple threads: even if the empty() check returns false, there is no guarantee that front() and pop_front() will succeed.

Even if front() succeeds, there is no guarantee that pop_front() will succeed. The ts_queue class interface is subject to race conditions, and there is nothing you can do about it without an interface change.

// f may be called by multiple threads
// (the queue is taken by non-const reference, since pop_front() mutates it)
void f(ts_queue<int> &tsq)
{
	if(tsq.empty() == false)
	{
		int i = tsq.front(); // might fail even if empty() returned false
		tsq.pop_front();     // might fail even if front() succeeded
	}
}

The problem is that mutexes are not composable. You can’t just call one thread-safe function after another and expect the result to be thread-safe. In the ts_queue example, the interface granularity is too fine to perform the task required by f().
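One way out is to coarsen the interface so that the check and the removal happen under a single lock. Below is a minimal sketch of that idea; the name try_pop and the std::deque storage are my own illustration, not part of the original ts_queue.

// sketch: merge empty(), front() and pop_front() into one atomic call
template <typename T>
class ts_queue2
{
public:
	// returns false instead of blocking when the queue is empty
	bool try_pop(T &out)
	{
		mutex::scoped_lock lock(m_mutex);
		if(m_queue.empty()) { return false; }
		out = m_queue.front();
		m_queue.pop_front();
		return true;
	}
private:
	std::deque<T> m_queue;
	mutex m_mutex;
};

With this shape, f() collapses into a single try_pop() call, and the race between the check and the pop disappears.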

On the other hand, it would be easy to lock down the entire ts_queue class with a single global mutex. But when the granularity is too coarse, you lose performance to excessive locking, and multi-core processors will never be fully utilized.

Some programmers try to solve this by exposing the internal mutex in the interface, punting the granularity problem to their users. But if the designers can’t solve the problem, is it fair to expect their users to solve it? And do you really want to expose concurrency in the interface?

So when designing a thread-safe class, I feel that I spend the majority of my time chasing the perfect granularity.

Full of Guidelines

In addition to the problem of granularity, you need to follow many guidelines to maintain a well-designed thread-safe interface.

  • You can’t pass references or pointers to member variables out through the external interface, whether through a return value or an out parameter. Once a member variable escapes, it can be cached and thereby lose the protection of the mutex.
  • You can’t pass references or pointers to member variables into functions that you don’t control. The reason is the same as above, except this is far trickier to get right.
  • You can’t call any unknown code while holding onto a lock. That includes library functions, plug-ins, callbacks, virtual functions, and more.
  • You must always remember to lock mutexes in a fixed order to avoid ABBA deadlocks (see the sketch after this list).
  • There are more…
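For the lock-ordering rule in particular, here is a minimal sketch of one common technique: acquiring locks in a fixed global order, in this case by address. The account type is hypothetical, invented for the illustration.

// hypothetical type: each account is guarded by its own mutex
struct account
{
	mutex m_mutex;
	double m_balance;
};

// lock both mutexes in a fixed global order (by address), so that
// concurrent transfer(a,b) and transfer(b,a) calls cannot deadlock
void transfer(account &from, account &to, double amount)
{
	account *first = &from < &to ? &from : &to;
	account *second = &from < &to ? &to : &from;
	mutex::scoped_lock lock1(first->m_mutex);
	mutex::scoped_lock lock2(second->m_mutex);
	from.m_balance -= amount;
	to.m_balance += amount;
}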

And if you are programming in C++, you need to watch out for all of the above while wrestling with the most engineered programming language in the world. 🙂

Final Thoughts

Locks and mutexes are just too complex for mortals to master.

And until C++0x defines a memory model for C++, lock-free programming is not even worth trying.

I will stop complaining now.

Spin-Wait vs. Condition Variable

Recently at work, I ran into an interesting threading problem.

I have two threads – A and B, running concurrently. Thread A needs a piece of data from Thread B. Normally, Thread B would just pass a message to Thread A. But in this scenario, every millisecond counts. Thread A needs the data A.S.A.P!

In theory, Thread B should have the result within a couple of milliseconds. Since the wait is so short, the synchronous approach is feasible: I would block Thread A until Thread B gets the data. But should I use a condition variable, or should I just spin-wait (assuming that I have a multi-core processor)?

This is very much a real-life threading problem that even expert programmers encounter. (And I am no expert.)

Speed Comparison

I expect spin-wait to have lower latency than a condition variable. But by how much?

It has been a while since I ran any performance tests. So I cooked up some test code to get a sense of the latency.

Here’s the spin-wait code.

long global_spin_var = 0;
void spin_thread()
{
	while(1)
	{
		// compiler barrier: forces global_spin_var to be re-read from memory
		_ReadWriteBarrier();

		// spin wait
		if(global_spin_var == 1){ break; }
	}
}
void time_spin_thread()
{
	// ... set up timing code here

	// reset the flag with an atomic exchange
	InterlockedExchange(&global_spin_var,0);

	thread t1(bind(&spin_thread));
	// wait for a second for thread start, sloppy but good enough
	Sleep(1000);

	// release the spinning thread (this is the timed event)
	InterlockedExchange(&global_spin_var,1);
	t1.join();
}

Here’s the condition variable code.

long global_condition_wait_var = 0;
condition_variable global_condition;
mutex global_mutex;

void condition_wait_thread()
{
	mutex::scoped_lock lock(global_mutex);
	while(global_condition_wait_var == 0)
		global_condition.wait(lock);
}

void time_condition_wait_thread()
{
	// ... set up timing code here

	// reset the flag with an atomic exchange
	InterlockedExchange(&global_condition_wait_var,0);

	thread t2(bind(&condition_wait_thread));

	// wait for a second for thread start, sloppy but good enough
	Sleep(1000);

	// set the flag under the lock, then notify (this is the timed event)
	{
		mutex::scoped_lock lock(global_mutex);
		global_condition_wait_var = 1;
	}
	global_condition.notify_all();
	t2.join();
}

It turns out that the latency of the condition variable is higher, but not absurdly so.

Spin-wait’s latency is about half that of the condition variable. That is a reasonable trade-off for spending a bunch of CPU cycles spinning.

[Figure: latency comparison between spin-wait and condition variable]
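If neither extreme fits, a common middle ground is to spin briefly and then fall back to the condition variable. Here is a rough sketch reusing the globals from the tests above; the spin budget of 4000 iterations is an arbitrary assumption.

// sketch: spin for the low-latency case, then block to spare the core
void hybrid_wait_thread()
{
	for(int i = 0; i < 4000; ++i) // arbitrary spin budget
	{
		_ReadWriteBarrier(); // compiler barrier: re-read the flag
		if(global_condition_wait_var == 1) { return; }
	}
	// not signaled during the spin; fall back to the condition variable
	mutex::scoped_lock lock(global_mutex);
	while(global_condition_wait_var == 0)
		global_condition.wait(lock);
}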

Final Thoughts

So which approach did I end up using? Neither.

Apparently Thread A doesn’t really need the “right” answer from Thread B. It just needs “some” answer to make things look good.

Another lesson from the real world – it doesn’t matter if it works, as long as it looks good. 🙄

Source

The source and the data sheet can be downloaded here.

Tools: Visual Studio 2008, Boost 1.41, Windows 7, Intel Core i5-750 (quad core)

Const Reference To Temporary Is Useless

C++ is always full of surprises. This week, I learned that C++ has (yet another) special feature that allows the lifetime of a temporary (rvalue) to be extended when it is bound to a reference-to-const.

It means that the following example is legal C++. In this example, the temporary returned by f() has its lifetime extended to match the reference it is bound to.

int f()
{
	int i = 0;
	return i;
}
int main()
{
	int const &cr = f(); // lifetime extended to match cr
	int &r = f();        // does not compile: f() does not return an lvalue
}

This is supposed to minimize the number of copies performed.

More Ammo To Blow Your Leg Away

You would think that this feature prevents the temporary from being copied. Apparently not so.

If you run the following program on VC9.0 and VC10.0, the obj destructor is called twice, which implies that the temporary was copied. [sidenote: gcc 4.3 did not perform the copy]

#include <iostream>
class obj
{
public:
	~obj() { std::cout <<"destroyed" << std::endl; }
};
int main()
{
	obj const &rco = obj();
}

This is because the standard allows the temporary to be copied first, which renders the reference-to-const declaration useless.

Here’s the C++ standard’s wording from section 8.5.3.

If the initializer expression is an rvalue, with T2 a class type, and “cv1 T1” is reference-compatible with “cv2 T2,” the reference is bound in one of the following ways (the choice is implementation-defined):

  • The reference is bound to the object represented by the rvalue or to a sub-object within that object.

  • A temporary of type “cv1 T2” is created, and a constructor is called to copy the entire rvalue object into the temporary. The reference is bound to the temporary or to a sub-object within the temporary.

No Better Than RVO

Dave Abrahams’ copy elision example demonstrates how smart modern compilers have become. Return Value Optimization works well with both named and unnamed temporaries, and composes nicely.

So why bother adding reference-to-const if it works just as well without it?
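As a rough illustration, reusing the obj class from above: with optimizations enabled, most compilers construct the returned value in place, so returning by value costs no more than binding the temporary to a reference-to-const. (This is compiler behavior, not a guarantee.)

obj make()
{
	return obj(); // unnamed temporary: a candidate for RVO
}
void g()
{
	obj o = make();        // with RVO, constructed in place, no copy
	obj const &r = make(); // lifetime extended, but a copy is still permitted
}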

Final Thoughts

C++0x (C++0a?) is supposed to address this problem by requiring the reference-to-const to bind to the rvalue itself, instead of to a copy. But if a copy can be elided, I expect RVO would have done it as well.

The Lost-Wakeup Problem

In multi-threaded programming, the lost-wakeup problem is a subtle bug that causes a thread to miss a wake-up due to a race condition.

In the obvious case, the thread that misses a wake-up would sleep forever and no work is performed.

But often, a thread that missed a wake-up is rescued by a subsequent wake-up. To the outside world, the post-condition looks identical, and the lost-wakeup problem stays under the radar.

This is scary, because the worst kind of software bug is the one whose existence you don’t even know about.

A Simple Locked Queue

Consider a simple thread-safe queue that supports push() and pop().

The function pop() is guaranteed to return an item. If the queue is empty, it blocks and waits on a “not empty” condition until an item exists.

The function push() guarantees that the item is inserted at the end of the queue. It notifies one waiting thread of the “not empty” condition when the queue goes from empty to non-empty.

Guess what? This queue suffers from the lost-wakeup problem.


// generic thread-safe queue that supports push() and pop()
template <typename T>
class locked_queue : noncopyable
{
public:
	// push an item into the end of the queue
	void push(T const &item)
	{
		mutex::scoped_lock lock(m_mutex);
		m_queue.push(item);
		// queue went from empty to one, notify a thread
		if(m_queue.size() == 1)
		{
			m_notEmptyCondition.notify_one();
		}
	}
	// pop the oldest item out of the queue
	// blocks until there are item to be popped
	T pop()
	{
		mutex::scoped_lock lock(m_mutex);
		while(m_queue.empty() == true)
			m_notEmptyCondition.wait(lock);

		T item = m_queue.front();
		m_queue.pop();
		return item;
	}
private:
	queue<T> m_queue;
	mutex m_mutex;
	condition_variable m_notEmptyCondition;
};

The Test Case

To prove that locked_queue suffers from the lost-wakeup problem, we can write a small test case that simulates the race condition.

The test case involves one producer thread and two consumer threads. The two consumers start first and wait for items. The producer thread then pushes two items into the queue. If the queue works properly, both consumer threads should wake up and complete, since two items were pushed into the queue.

// the queue type under test
typedef locked_queue<int> locked_queue_t;

// producer that pushes two items to a queue
void producer(locked_queue_t *q)
{
	q->push(1);
	q->push(2);
}
// consumer that pops an item from the queue
void consumer(locked_queue_t *q)
{
	int i = q->pop();
	std::cout << i << std::endl;
}
int main()
{
	locked_queue_t lq;
	// set up two consumers to wait for items
	thread c1(bind(&consumer, &lq));
	thread c2(bind(&consumer, &lq));
	// sleep for a second to ensure c1 and c2 are ready
	// not the best code, but good enough for a test case
	Sleep(1000);
	// create the producer to push in two items
	thread p(bind(&producer, &lq));
	c1.join();
	c2.join();
	p.join();

	return 0;
}

Here’s the output of the test case on my machine.

1
// waits forever and the program does not exit

Test Case Step-By-Step Replay

Here’s a step-by-step replay of exactly what happened in our test case.

1. First, we create two consumer threads, C1 and C2. Since the queue is initially empty, C1 and C2 must block and wait on the “not empty” condition.

2. Then the producer thread P starts. P immediately pushes an item into the queue, and then notifies one waiting thread.

3. This step is the key to the lost wake-up. After P signals the “not empty” condition, it beats the consumer threads to the lock and immediately adds a second item to the queue. Notice that the queue size is now 2, but because the queue was already not empty, no second notification is sent. At this point, two threads are waiting for data, and only one will ever be woken up.

4. At last, C1 wakes up. It acquires the lock, pops the queue, and completes its operation. C2, however, stays asleep forever.

The Solution

The root of the problem is a premature optimization: the “not empty” condition is only signaled when the queue size goes from 0 to 1. We can simply change push() to always call notify_all(), and the problem goes away.

	void push(T const &item)
	{
		mutex::scoped_lock lock(m_mutex);
		m_queue.push(item);
		// always notify regardless of the queue size
		m_notEmptyCondition.notify_all();
	}

Final Thoughts

The original locked_queue works perfectly if there is exactly one producer and one consumer. With more than one producer or consumer, the lost-wakeup problem appears.

What I’ve demonstrated here is only one variant of the lost-wakeup problem. In practice, they can be far more complex and subtle.

To avoid lost wake-ups, here are two best practices to follow.

  • It’s best to always signal all threads that are waiting on a condition. It might appear wasteful, but it is a lot better than losing a wake-up.
  • A timed wait is an alternative way to solve this problem. It forces a thread to wake up periodically and re-check the condition, regardless of whether or not it has been signaled (see the sketch below).
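Here is a sketch of that second practice, assuming Boost.Thread’s timed_wait (the 100ms interval is an arbitrary choice). Even if a notification is lost, the consumer wakes up periodically and re-checks the predicate instead of sleeping forever.

	// defensive pop(): a timed wait bounds the damage of a lost wake-up
	T pop()
	{
		mutex::scoped_lock lock(m_mutex);
		while(m_queue.empty())
		{
			// wakes on notify or after 100ms; re-checks the predicate either way
			m_notEmptyCondition.timed_wait(lock,
				boost::posix_time::milliseconds(100));
		}
		T item = m_queue.front();
		m_queue.pop();
		return item;
	}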

Source

The source and the data sheet can be downloaded here.

Tools: Visual Studio 2008, Boost 1.41, Windows 7, Intel Core i5-750 (quad core)

Volatile, Another Abused C++ Keyword

Volatile is a confusing keyword. In the C++ standard, volatile is described as a type qualifier, and it is often paired up with const.

But unlike const, volatile actually affects the generated code. I was told that this keyword is used to disable compiler optimizations, and to replace locks in multi-threaded programming.

It sounds like there is a lot of voodoo involved, but I was rarely given a satisfying explanation. I don’t believe in magic, so I spent some time researching it. In the end, I reached a frightening conclusion.

If you use volatile for multi-threaded programming, you are doing it wrong!

Standard’s Definition

The C++ standard says this.

In general, the semantics of volatile are intended to be the same in C++ as they are in C.

That certainly wasn’t informative. So I dug up the C99 standard.

Here is the C99 definition of volatile – section 6.7.3, item 6.

An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine… Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine…

And the footnote adds this.

A volatile declaration may be used to describe an object corresponding to a memory-mapped input/output port or an object accessed by an asynchronously interrupting function. Actions on objects so declared shall not be “optimized out” by an implementation or reordered except as permitted by the rules for evaluating expressions.

Ideally, to program in a multi-threaded environment, we need a concurrency memory model that addresses atomicity, visibility and memory ordering. In the C99 standard, volatile addresses the issue of visibility, but it does nothing about atomicity or memory ordering.

Instead, volatile is intended for memory-mapped I/O ports.
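To see what volatile does not buy you, here is a sketch of a classic data race (the counts are arbitrary): two threads increment a volatile counter, yet the final value usually falls short of 2,000,000, because the increment is a read-modify-write that volatile does not make atomic.

volatile long g_counter = 0;

// run this on two threads and join both; printing g_counter afterwards
// usually shows lost updates, because volatile provides no atomicity
void increment_thread()
{
	for(int i = 0; i < 1000000; ++i)
		g_counter++; // three steps: load, add, store
}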

It’s Not News

It’s funny that once you know what to look for, answers will magically appear out of nowhere.

There have been numerous articles on why volatile is not intended for multi-threaded programming. I just happened to discover it 10 years too late.

More Confusion

Since VS8.0 (Visual Studio 2005), Microsoft has gone beyond the standard and given volatile acquire and release semantics. This way, it actually works like a memory barrier and is useful for multi-threaded programming. But this behavior is not portable.

Final Thoughts

Until C++0x defines a concurrency memory model, use a memory barrier instead of volatile for multi-threaded programming.
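On the Win32 platform used throughout these posts, that means reaching for explicit barriers or the Interlocked family (which carry full-barrier semantics) rather than volatile. A sketch, with a hypothetical flag:

long g_flag = 0; // hypothetical shared flag; note: no volatile needed

void publisher()
{
	// ... produce data ...
	InterlockedExchange(&g_flag, 1); // atomic store with a full barrier
}

void consumer()
{
	// atomic read with a full barrier: a compare-exchange that changes nothing
	while(InterlockedCompareExchange(&g_flag, 0, 0) == 0)
	{
		// spin until the flag is published
	}
	// ... everything written before the store is now visible ...
}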

C++ pitfalls never cease to amaze me. I think C++FQA needs a new section on volatile.