STL Performance Comparison (round 2): VC9, VC10, STLPort


Last year, I did a performance comparison of VC7, VC9 (with and without Secure SCL), and STLPort. Now that VC10 is out, I wonder whether it is worth the upgrade.

So I dusted off the benchmark code from last year and upgraded the solution to VC10. This time, I would like to see how VC9, VC10, and STLPort 5.2.1 compare.

VC8 and VC9’s Secure SCL “feature” was disastrous for many C++ programmers who care about performance, so this test is done with Secure SCL disabled.

With all the C++0x language upgrades and performance claims in VC10, I expect improvements.

The Results

Recall: The stress test I wrote last year benchmarks 1. performance as the container size grows, and 2. performance when running a large number of operations while keeping the container size constant.

Recall:  The test for vector involves three operations – insertion, iterator traversal, and copy.

VC10 actually got a bit slower compared to VC9. Oops.

The performance of vector in STLPort is still leading by a mile.

Recall: The test for string involves three operations – string copy, substring search, and concatenation.

VC10 performs as well as STLPort on large strings.

VC10’s small strings are now better optimized than STLPort’s. Very impressive!

Recall: The test for map involves insertion, search, and deletion.

It appears that the performance of map in VC9 and VC10 is identical.

Same as above, nothing has changed here.

Recall: The test for deque comes with a twist. The deque is used as a priority queue through make_heap(), push_heap() and pop_heap(). Random items are inserted into and removed from the queue on each iteration.
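For reference, here is a minimal sketch of what that deque test looks like in spirit. The size and iteration counts below are placeholders I made up, not the actual benchmark parameters.

// sketch of the deque stress test: the deque acts as a priority queue
// through the standard heap algorithms
#include <algorithm>
#include <cstdlib>
#include <deque>

void deque_heap_test(std::size_t size, std::size_t iterations)
{
	std::deque<int> q;
	for (std::size_t i = 0; i < size; ++i)
		q.push_back(std::rand());
	std::make_heap(q.begin(), q.end());

	for (std::size_t i = 0; i < iterations; ++i)
	{
		q.push_back(std::rand());           // insert a random item
		std::push_heap(q.begin(), q.end()); // restore the heap property
		std::pop_heap(q.begin(), q.end());  // move the largest item to the back
		q.pop_back();                       // remove it
	}
}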

VC10 is leading in deque performance.

STLPort is still leading at small deque sizes. However, VC10 shows improvement over VC9.

Conclusion

The STL implementation in VC10 definitely shows some improvement over its predecessor. It has shrunk the gap against STLPort, but it still has a way to go.

Comparing STLPort compiled with VC9 against STLPort compiled with VC10, there is an average improvement of 2.5%. So upgrading to VC10 should provide a performance gain from the compiler alone, even for those who don’t use Microsoft’s STL.

I was neither disappointed nor impressed by the improvements, so I guess they were within my expectations.

Source and data sheet can be downloaded here.

Tools: Visual Studio 2008 (VC9), Visual Studio 2010 (VC10), STLport 5.2.1

Machine Specification: Intel i5-750 with 4GB of RAM, Windows 7.

Troublesome Locks

At my company, we develop a large piece of software with fairly high concurrency. In general, asynchronous message passing is our choice for thread communication because of its simplicity. When performance matters, we use locks and mutexes to share data among threads.

As the software grows, subtle bugs from deadlocks and race conditions are appearing around code that was designed with locks and mutexes.

I can say confidently that almost all of these bugs are caused by poor interface design or poor locking granularity. But I also feel that a well-designed thread-safe interface is extremely difficult to achieve.

Always Chasing The Perfect Granularity

A thread-safe interface requires far more than putting a lock on every member function. Consider the supposedly thread-safe queue below: is this class really useful in a multi-threaded environment?

// thread-safe queue that has a scoped_lock guarding every member function
template <typename T>
class ts_queue
{
public:
	T& front();
	void push_front(T const &);
	void pop_front();
	bool empty() const;
};

Unfortunately, no. Here’s a simple piece of code where it falls apart. Say f() is called by multiple threads: even if the empty() check returns false, there is no guarantee that front() and pop_front() will succeed.

Even if front() succeeds, there is no guarantee that pop_front() will succeed. The ts_queue interface is subject to race conditions, and there is nothing you can do about it without an interface change.

// f may be called by multiple threads
void f(ts_queue<int> &tsq)
{
	if(tsq.empty() == false)
	{
		int i = tsq.front(); // this might fail even though empty() returned false
		tsq.pop_front(); // this might fail even if front() succeeded
	}
}

The problem is that mutexes are not composable. You can’t just call one thread-safe function after another and expect the result to be thread-safe. In the ts_queue example, the interface granularity is too fine to perform the task required by f().
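One common way out, and only a sketch under my own assumptions, is to widen the interface so the check and the removal happen under a single lock. The try_pop name below is mine, and I am using C++11’s std::mutex purely to keep the sketch self-contained.

#include <deque>
#include <mutex>

// sketch of a revised ts_queue: the empty-check and the pop are combined
// into one operation, so callers never see the intermediate state
template <typename T>
class ts_queue2
{
public:
	void push_front(T const &item)
	{
		std::lock_guard<std::mutex> lock(m_);
		q_.push_front(item);
	}

	// returns false if the queue was empty; otherwise copies the front
	// element into out and removes it, all under one lock
	bool try_pop(T &out)
	{
		std::lock_guard<std::mutex> lock(m_);
		if (q_.empty())
			return false;
		out = q_.front();
		q_.pop_front();
		return true;
	}

private:
	std::mutex m_;
	std::deque<T> q_;
};

With this shape, f() simply calls try_pop() and checks the return value; the race window between empty(), front() and pop_front() is gone.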

On the other hand, it would be easy to use one global mutex to lock down the entire ts_queue class. But when the granularity is too coarse, you lose performance to excessive locking, and multi-core processors will never be fully utilized.

Some programmers try to solve it by exposing the internal mutex in the interface, punting the granularity problem to their users. But if the designers can’t solve the problem, is it fair to expect their users to solve it? And do you really want to expose concurrency details?

So in the process of designing a thread-safe class, I feel that I spend the majority of my time chasing the perfect granularity.

Full of Guidelines

In addition to the granularity problem, you need to follow many guidelines to maintain a well-designed thread-safe interface.

  • You can’t pass references/pointers to member variables out through the external interface, whether through a return value or an out parameter. Any time a member variable is handed to external code, it could be cached and therefore lose the mutex protection.
  • You can’t pass references/pointers to member variables into functions you don’t control. The reason is the same as above, except this one is far trickier to enforce.
  • You can’t call any unknown code while holding a lock. That includes library functions, plug-ins, callbacks, virtual functions, and more.
  • You must always lock mutexes in a fixed order to avoid ABBA deadlocks (see the sketch after this list).
  • There are more…
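Regarding the lock-ordering rule, here is a minimal sketch of the usual fixed-order discipline. The account type and transfer() function are made up for illustration, and again I am using C++11 mutexes only to keep the sketch self-contained.

#include <functional>
#include <mutex>

struct account
{
	std::mutex m;
	double balance;
};

// transfer between two accounts without risking an ABBA deadlock:
// both mutexes are always acquired in the same (address-based) order,
// no matter which argument order the caller uses
void transfer(account &from, account &to, double amount)
{
	if (&from == &to)
		return; // never lock the same mutex twice

	account *first  = std::less<account*>()(&from, &to) ? &from : &to;
	account *second = (first == &from) ? &to : &from;

	std::lock_guard<std::mutex> l1(first->m);
	std::lock_guard<std::mutex> l2(second->m);

	from.balance -= amount;
	to.balance   += amount;
}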

And if you are programming in C++, you need to watch out for all of the above while wrestling with the most engineered programming language in the world. 🙂

Final Thoughts

Locks and mutexes are just too complex for mortals to master.

And until C++0x defines a memory model for C++, lock-free programming is not even worth trying.

I will stop complaining now.

Const Reference To Temporary Is Useless

C++ is always full of surprises. This week, I learned that C++ has (yet another) special feature that allows the lifetime of a temporary (rvalue) on the stack to be lengthened if it is bound to a reference-to-const.

It means that the following example is legal C++. In this example, the lifetime of the temporary returned by f() is extended beyond the expression that created it.

int f()
{
	int i = 0;
	return i;
}
int main()
{
	int const &cr = f(); // legal: the temporary's lifetime is extended
	int &r = f(); // error: a non-const reference cannot bind to an rvalue
}

This is supposed to minimize the number of copies performed.

More Ammo To Blow Your Leg Away

You would think that this feature prevents the temporary from being copied. Apparently not so.

If you run the following program on VC9.0 and VC10.0, obj’s destructor is called twice, implying that the object was copied. [sidenote: gcc 4.3 did not perform the copy]

#include <iostream>
class obj
{
public:
	obj() {}
	obj(obj const &) { std::cout << "copied" << std::endl; }
	~obj() { std::cout << "destroyed" << std::endl; }
};
int main()
{
	obj const &rco = obj(); // on VC9/VC10, "destroyed" prints twice
}

This is because the standard allows the temporary to be fully copied, which renders the reference-to-const declaration useless.

Here’s the C++ standard’s guideline in section 8.5.3.

If the initializer expression is an rvalue, with T2 a class type, and “cv1 T1” is reference-compatible with “cv2 T2,” the reference is bound in one of the following ways (the choice is implementation defined):

  • The reference is bound to the object represented by the rvalue or to a sub-object within that object.
  • A temporary of type “cv1 T2” is created, and a constructor is called to copy the entire rvalue object into the temporary. The reference is bound to the temporary or to a sub-object within the temporary.

No Better Than RVO

Dave Abrahams’ copy elision example demonstrates how smart modern compilers have become. Return Value Optimization works well with both named and unnamed temporaries, and composes nicely.
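As a rough illustration of what RVO already buys you, here is the kind of case where it typically kicks in. The tracer class is made up for demonstration; whether the copy is actually elided depends on the compiler and optimization settings.

#include <iostream>

struct tracer
{
	tracer() { std::cout << "constructed" << std::endl; }
	tracer(tracer const &) { std::cout << "copied" << std::endl; }
	~tracer() { std::cout << "destroyed" << std::endl; }
};

tracer make_tracer()
{
	tracer t; // named local
	return t; // candidate for (named) return value optimization
}

int main()
{
	// with RVO/NRVO the copy is elided and "copied" never prints,
	// even though we return and initialize by value
	tracer t = make_tracer();
}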

So why bother adding reference-to-const if it works just as well without it?

Final Thoughts

C++0x (C++0a?) is supposed to address this problem by requiring the reference-to-const to bind to the rvalue itself, instead of to a copy. But if a copy can be elided, I expect RVO would have done it anyway.

Volatile, Another Abused C++ Keyword

Volatile is a confusing keyword. In the C++ standard, volatile is described as a type qualifier, and it is often paired up with const.

But unlike const, volatile actually affects code generation. I was told that this keyword is used to disable compiler optimizations, and to replace locks in multi-threaded programming.

It sounds like a lot of voodoo involved, but I am rarely given a satisfying explanation. I don’t believe in magic, so I spent some time researching it. In the end, I reached a frightening conclusion.

If you use volatile for multi-threaded programming, you are doing it wrong!

Standard’s Definition

The C++ standard says this:

In general, the semantics of volatile are intended to be the same in C++ as they are in C.

That certainly wasn’t informative. So I dug up the C99 standard.

This is the C99 definition of volatile – section 6.7.3 Item 6.

An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine… Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine…

And in a footnote, it adds this:

A volatile declaration may be used to describe an object corresponding to a memory-mapped input/output port or an object accessed by an asynchronously interrupting function. Actions on objects so declared shall not be “optimized out” by an implementation or reordered except as permitted by the rules for evaluating expressions.

Ideally, to program in a multi-threaded environment, we need a concurrency memory model that addresses atomicity, visibility, and memory ordering. In the C99 standard, volatile only addresses the issue of visibility; it does nothing for atomicity or memory ordering.

Instead, volatile is intended for memory-mapped I/O ports.
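To make the atomicity gap concrete: even when a counter is declared volatile, an increment is still a separate read, add, and write, so two threads can lose updates. A small sketch (using C++11’s std::thread purely so the example is self-contained):

#include <iostream>
#include <thread>

volatile int counter = 0; // volatile does NOT make ++ atomic

void hammer()
{
	for (int i = 0; i < 1000000; ++i)
		++counter; // read, add, write: another thread can interleave here
}

int main()
{
	std::thread a(hammer);
	std::thread b(hammer);
	a.join();
	b.join();

	// frequently prints less than 2000000 because updates were lost
	std::cout << counter << std::endl;
}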

It’s Not News

It’s funny that once you know what to look for, answers will magically appear out of nowhere.

There have been numerous articles on why volatile is not intended for multi-threaded programming. I just happened to discover it 10 years too late.

More Confusion

Since VS8.0, Microsoft decided to go beyond the standard and gave volatile acquire and release semantics. This way, it actually works like a memory barrier and is useful for multi-threaded programming. This behavior is not portable, however.

Final Thoughts

Until C++0x defines a concurrency memory model, use a memory barrier instead of volatile for multi-threaded programming.
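As an illustration of where C++0x is heading, here is roughly what the flag-style hand-off that people reach for volatile to do looks like with the proposed std::atomic. This is only a sketch and assumes a C++0x-capable compiler; the default atomic load/store provides the visibility and ordering guarantees that volatile lacks.

#include <atomic>
#include <thread>

std::atomic<bool> ready(false);
int payload = 0;

void producer()
{
	payload = 42;      // write the data first
	ready.store(true); // the flag write is ordered after the payload write
}

void consumer()
{
	while (!ready.load())
		; // spin until the flag is set (fine for a sketch, not for production)

	// payload is guaranteed to be 42 here
}

int main()
{
	std::thread c(consumer);
	std::thread p(producer);
	p.join();
	c.join();
}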

C++ pitfalls never cease to amaze me. I think C++FQA needs a new section on volatile.

C++ for_each Hurts My Head

I admit that this is my personal pet peeve. I hate C++ for_each.

The foreach statement is syntactic sugar for traversing a collection of items.

It offers better clarity than C-style for loops because it does not require an index counter or an iterator.

foreach item in collection
	do stuff to item

This feature is available in most modern programming languages, and it is usually very well done.

C++ doesn’t support foreach natively. Instead, it is provided as a template function, and it hurts my brain.

foreach – the C++ Way

Let’s take a look at the VC90 implementation of for_each in STL.

// TEMPLATE FUNCTION for_each
template<class _InIt,
	class _Fn1> inline
	_Fn1 for_each(_InIt _First, _InIt _Last, _Fn1 _Func)
	{	// perform function for each element
	_DEBUG_RANGE(_First, _Last);
	_DEBUG_POINTER(_Func);
	_CHECKED_BASE_TYPE(_InIt) _ChkFirst(_CHECKED_BASE(_First));
	_CHECKED_BASE_TYPE(_InIt) _ChkLast(_CHECKED_BASE(_Last));
	for (; _ChkFirst != _ChkLast; ++_ChkFirst)
		_Func(*_ChkFirst);
	return (_Func);
	}

[Update: The for_each code is modified to match completely with VC90. I didn’t do a good job before, and it was pointed out by my colleague]

C++ for_each is a template function that takes two iterators defining the first and last of a sequence, and a function pointer or function object whose operator() is applied to each item of the sequence.

In Scott Meyers’ Effective STL, he argues that for_each makes code more direct and straightforward.

So here is an example taken directly from the book.

#include <algorithm>
#include <deque>
#include <functional>
using namespace std;

typedef deque<int>::iterator DequeIntIter; // iterator type used below

// function object that performs an action on an integer,
// adaptable because it derives from unary_function
class DoSomething : public unary_function<int,void>
{
public:
	void operator()(int x) {...}
	...
};
int main()
{
	deque<int> di; // a deque of integers
	...
	DoSomething d; // function object

	// oh god, what's this
	for_each<DequeIntIter, DoSomething&>(
		di.begin(),
		di.end(),
		d);
}

No matter how many times I look at this, it is not intuitive to me. The template syntax is obscure, and I don’t get the sense that there is a loop in the code.
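For contrast, here is the same traversal written as an explicit loop, continuing the deque<int> di and DoSomething d from the example above. Whether or not it is better style, at least the loop is visible at a glance.

// the same work as the for_each call, written as a plain loop
for (deque<int>::iterator it = di.begin(); it != di.end(); ++it)
{
	d(*it); // apply the function object to each element
}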

I tried to get used to it, I really did. But after all these years, I have finally given up.

Strangely, some of my colleagues feel the same way.

Maybe it is not just me …

Going back to one of my CS courses during my M.S., I remember reading a study on how programmers review code.

The idea is to analyze a programmer’s eye movement during code review. This eye movement is called the scan pattern.

The results were very interesting. In general, programmers tend to spend a third of the time scanning most of the function. And then, they will spend the remaining time going back and forth within a loop.

In hindsight, this is not surprising. Loops often hold the most complex logic in a function. To understand the logic, programmers need to spend time “unrolling” the loop mentally to see the side effects of each iteration.

If the study is accurate, then it certainly explains why C++ for_each is difficult to read.

Conclusion

I will avoid C++ for_each at all cost.