Rvalue Reference Speeds Up Your Code

One of the most exciting features in C++0x, to me, is the Rvalue Reference. It is designed to eliminate spurious copying of objects and to solve the Forwarding Problem.

Rvalue Reference is a small technical extension to C++, but it can seriously improve the performance of existing C++ applications for free or with minimal changes.

GCC 4.3 and beyond have Rvalue Reference support. All it takes is the compiler option “-std=gnu++0x”, which enables all the supported experimental features.

The C++ committee has a nice introduction to Rvalue Reference.

Too Many Copies

Rvalue Reference by itself isn’t too useful to the daily C++ programmer. Chances are the library writers will do the bulk of the work optimizing their containers, and it will be mostly invisible to the end user. Nevertheless, the end user needs to supply two functions to take advantage of this optimization: the Move Constructor and the Move Assignment Operator.

Consider a POD-style class that holds an index value used for sorting and a buffer holding some data. It has an overloaded copy constructor and assignment operator that count the number of copies made of the object. Since it is designed to be used in a standard container, I will also supply a sorting predicate.

// global static variable to keep track of the number
// of times Pod has been copied.
static int Copies = 0;
// Plain Old Data
class Pod
{
public:
 	// constructor
	Pod() : m_index(0)
	{
		m_buffer.resize(1000); // holds 1k of data
	}
	Pod(Pod const & pod)
		: m_index(pod.m_index)
		, m_buffer(pod.m_buffer)
	{
		++Copies;
	}
	// not the best operator=, for demo purposes
	Pod &operator=(Pod const &pod)
	{
		m_index = pod.m_index;
		m_buffer = pod.m_buffer;
		++Copies;
		return *this;
	}
	int m_index; // index used for sorting
	std::vector<unsigned char> m_buffer; // some buffer
};
// Sorting Predicate for the POD
struct PodGreaterThan
{
	bool operator() (Pod const & lhs, Pod const & rhs) const
	{
		return lhs.m_index > rhs.m_index;
	}
};

Now, we would like to put this data in a vector and sort it.

std::vector<Pod> manyPods(1000000); // a million pods
std::tr1::mt19937 eng;
std::tr1::uniform_int<int> unif(1, 0x7fffffff);

 // assign a random integer to each POD m_index
for (std::size_t i = 0; i < manyPods.size(); ++i)
	manyPods[i].m_index = unif(eng);

// sort the million pods
std::sort(manyPods.begin(), manyPods.end(), PodGreaterThan());

If you run this code under GCC 4.3.4, you will find that Pod has been deep-copied 17,362,106 times during the course of std::sort. That’s quite a few copies! Let’s see what Rvalue Reference can do for us.

Move Constructor and Move Assignment Operator

Many of the copies in the previous example are likely to be of temporary objects. Whatever the sort algorithm, it will spend much of its time swapping items, and the standard swap logic involves a temporary copy with a very short lifespan. Rvalue Reference can be used to minimize this overhead.

Following Dave Abrahams’ guideline on move semantics:

A move assignment operator “steals” the value of its argument, leaving that argument in a destructible and assignable state, and preserves any user-visible side-effects on the left-hand side.

We want to utilize the move semantics on temporary objects because they are short lived and side-effects are not a concern.

So let’s supply a Move Constructor and a Move Assignment Operator to the class Pod.

// move constructor
Pod(Pod && pod)
	: m_index(pod.m_index)
	// tell the library to "move" the vector instead of deep copying it
	, m_buffer(std::move(pod.m_buffer))
{
}
// move assignment operator
Pod & operator=(Pod && pod)
{
	m_index = pod.m_index;
	// tell the library to "move" the vector instead of deep copying it
	m_buffer = std::move(pod.m_buffer);
	return *this;
}

These methods are similar to the typical deep copy, except that we invoke std::move on the buffer. std::move might look like magic, but it works just like a static_cast from std::vector& to std::vector&&. This is the crucial point of Rvalue Reference: by casting the vector to vector&&, the vector’s Move functions are invoked automatically, if they exist. If the Move functions do not exist, the call falls back to the regular non-move functions (deep copy). This is the beauty of the Rvalue Reference overloading rules.

GCC 4.3.4’s vector is “Move Ready”. Here’s its Move Assignment Operator.

#ifdef __GXX_EXPERIMENTAL_CXX0X__
vector&
operator=(vector&& __x)
{
	// NB: DR 675.
	this->clear();
	this->swap(__x);
	return *this;
}
#endif

*On a side note, the this->clear() call is very interesting because it works around a bug (DR 675) caused by delayed destruction of the temporary object.

Speed Up after the Move Upgrade

Now we run the same code again after adding the Move Constructor and Move Assignment Operator to Pod.

The number of copies drops by two-thirds.

std::sort runs roughly twice as fast as before.

By simply supplying a Move Constructor and Move Assignment Operator to Pod, the sort routine runs twice as fast. That also implies we were previously wasting half the time in the sort routine copying temporary objects around.

Conclusion

Rvalue Reference will provide a nice performance gain to users of STL containers.

For the end user, all that’s required is a Move Constructor and a Move Assignment Operator.

Thoughts

It would be even nicer if compilers could supply default Move functions for all objects. It appears that this has been proposed. I look forward to seeing its progress.

Source

The source can be downloaded here.

Tools: GCC 4.3.4, Cygwin 1.7, Windows 7 64-bit

Memory Alignment Problems

Memory alignment is the way data types are arranged and accessed in memory. There are numerous tutorials on the internet that cover the topic in depth. This is not one of them. This is just a post that gathers my thoughts on how little I knew about the topic. 🙂

I read Write Portable Code a while ago, and there are some recommended practices from it that I follow to avoid alignment issues. Among them: avoid using memory overlays and bit-fields.

By avoiding those features, I don’t deal with memory alignment issues too often. But at the same time, I avoided ever understanding them in the first place.

In the past several weeks, I have been working with low-level programmers who use those features. After working through the alignment bugs we encountered, I feel like I need to take Programming 101 again.

Bus Error on Solaris

My co-worker was developing cross-platform software in C that receives and routes data from the software I developed (vague on purpose). We integrated our code, and it worked fine on the Linux system. Then he ported the code to a Solaris machine, and it quickly gave him a “Bus Error”.

// code has been drastically simplified for demonstration purposes

// some byte array
char p[5000];

//...many lines later

// extract a 4 byte integer at an odd offset, and get a Bus Error from Solaris
int *intData = (int *)(p + 1);
int fourByteInteger = *intData;

It smells like an alignment issue: dereferencing intData probably caused a misaligned memory access. We didn’t know the exact details, but my co-worker changed it to memcpy (one of the recommended practices from Write Portable Code), and everything was happy.

So it was an alignment issue. But this leaves a bigger question: why did it work on the Linux system?

Unaligned Memory Access on Intel Architecture

The Linux system uses an Intel processor, and the Solaris system uses a SPARC processor.

It turns out that the Intel architecture allows unaligned memory access, with a small performance penalty. I’ve been sheltered by the Intel architecture for so long that I took the unaligned memory access “feature” for granted.

So this leads to two more questions: how big is the penalty, and how do you detect unaligned memory access?

Finding out the penalty isn’t difficult. You can force an unaligned access by casting a byte pointer to a pointer of a larger type. Here is some sample code.

// loop through the entire array, treating the void
// pointer as an array of type T
template<typename T>
void LoopThroughData( void *data, uint32_t size )
{
	T *dataT = (T*) data;
	T *dataTEnd = dataT + size/sizeof(T);

	while( dataT != dataTEnd )
	{
		(*dataT) *= 2;
		dataT++;
	}
}
...
char buffer[2000];
char *bufferWithOffset = buffer + someOffset; // someOffset in [0, sizeof(int))
// loop through the array as if it were an integer array
LoopThroughData<int>((void *)bufferWithOffset, 2000 - sizeof(int));

Here are some plots showing the penalty of unaligned access on the Intel architecture.

Looping over a byte array as a 32-bit integer array with different offset values.

Looping over a byte array as a 64-bit integer array with different offset values.

The plots show that there is a 2% performance penalty for unaligned 32-bit integer access and a 3.6% penalty for unaligned 64-bit integer access.
I am using an Intel i5-750 processor; the penalty ratio is likely to differ across the Intel processor family.

Defense Against Unaligned Memory Access

Regardless of architecture, we should avoid unaligned memory access. If you can’t use memcpy for some performance reason, there are compiler-specific operators that can help detect unaligned access.

In Visual Studio, __alignof returns the alignment requirement of a given type. In GCC, the equivalent is __alignof__. With this tool, I wrote a small C++ routine that determines whether a given pointer meets its alignment requirement.

template <typename T>
bool CheckIfDataIsAligned(T *p)
{
	// uintptr_t comes from <stdint.h> / <cstdint>
	return ((uintptr_t)p % __alignof(T)) == 0;
}

If your compiler does not support any variant of alignof, there is a clever solution that implements it in terms of offsetof.

Bit-field Padding Problem

Another problem I encountered recently is data-structure padding. My co-worker defined a bit-field to extract messages from devices.

// code simplified for demonstration purposes.
// method 1
struct BitFields
{
	uint32_t a : 16;
	uint32_t b : 16;
	uint8_t c;
	uint8_t d;
};
// method 2
struct BitFields2
{
	uint16_t a;
	uint16_t b;
	uint8_t c;
	uint8_t d;
};

I am simplifying the code here; the story is a bit more complicated. There are many messages defined, and he uses the sizes of the messages to determine the offsets to read from a large buffer. He found that if he uses method 1 for his bit-fields, things are completely out of sync. If he uses method 2, everything works.

If you run the sizeof() operator on both objects, the object defined with method 1 will be bigger than the one defined with method 2. This is because compilers tend to pad a structure out to the nearest multiple of its largest member’s alignment. In method 1, the largest member is a uint32_t, which causes 2 bytes of padding at the end of the structure.

Defense Against Bit-field Padding

Regardless of how much understanding I have of bit-fields, mistakes can always be made. I came up with two personal guidelines to follow the next time I define a bit-field.

1. Use unnamed bit-field if padding is intended.

2. Use Static Assert to validate structure sizes to prevent unaware paddings.

struct BitFields
{
	uint32_t a : 16;
	uint32_t b : 16;
	uint8_t c;
	uint8_t d;
	uint8_t : 8; // use unnamed bit-field if it is intentional
	uint8_t : 8; // use unnamed bit-field if it is intentional
};
// static assertion to guard against unaware padding
BOOST_STATIC_ASSERT(sizeof(BitFields) == 8);

struct BitFields2
{
	uint16_t a;
	uint16_t b;
	uint8_t c;
	uint8_t d;
};
// static assertion to guard against unaware padding
BOOST_STATIC_ASSERT(sizeof(BitFields2) == 6);

On a side note, I know that #pragma pack(n) can also remove padding. But #pragma pack(n) only gives the programmer partial control over a structure’s alignment: the compiler can still choose an alignment smaller than n when n is greater than 1.

Source

The source and spreadsheet can be downloaded here.

Compiler: Visual Studio 2008

Machine Specification: Intel i5-750, Windows 7 64-bit.

STL String Crashes When HID = 0

A while ago, we upgraded our compiler to VC90 (Visual Studio 2008) and found that Has Iterator Debugging (HID) and Secure SCL in VC9.0 were severely degrading our product’s performance. So we took the time to rebuild everything with those features disabled.

This is done by defining the preprocessor macros _SECURE_SCL=0 and _HAS_ITERATOR_DEBUGGING=0.

Soon after, we experienced some strange crashes in the Debug build that made no sense to us.

Crashes at std::_Container_base_secure::_Orphan_all()

Here’s one found by my co-worker. [Note: The code was simplified for demonstration purposes]

#define _HAS_ITERATOR_DEBUGGING 0 //turn off Has Iterator Debugging
#include <string>
#include <algorithm>
using namespace std;
int main()
{
	string abc = "abc";

	// Method 1: Crashes upon exit with an access violation
	string dot_abc = "." + abc;

	// Method 2: Works
	//string dot_abc = string(".") + abc;

	string buffer = ".abc";

	// Works without the search call here
	search(buffer.begin(), buffer.end(), dot_abc.begin(), dot_abc.end());

	return 0;
}

If you choose Method 1, you will get an access violation upon the destruction of the string class.

msvcp90d.dll!std::_Container_base_secure::_Orphan_all()  Line 223 + 0x5 bytes    C++
msvcp90d.dll!std::_Container_base_secure::~_Container_base_secure()  Line 115    C++
msvcp90d.dll!std::_String_base::~_String_base()  + 0x11 bytes    C++
msvcp90d.dll!std::_String_val<unsigned short, std::allocator<unsigned short> >::~_String_val<unsigned short,std::allocator<unsigned short> >()  + 0x11 bytes    C++
msvcp90d.dll!std::basic_string<char, std::char_traits<char>,std::allocator<char> >::~basic_string<char,std::char_traits<char>,std::allocator<char> >()  Line 917 + 0xf bytes    C++

However, if you choose Method 2, it exits gracefully. And both methods work under the Release build.

The first alarming thing in the call stack is that we are calling into msvcp90d.dll. std::string, unlike the other STL containers, has been compiled separately into a DLL since VC80.

Remember: to turn off HID and Secure SCL, all binaries linked together must share the same HID and Secure SCL settings. After some online searching, it is clear that msvcp90d.dll is built with HID = 1.

Yikes! Since we can’t rebuild msvcp90d.dll, there isn’t much we can do. But Microsoft isn’t stupid; they have clearly worked around some of the problems, because Method 2 does work.

Stepping In std::string

In C++, the devil is in the details. Although Method 1 and Method 2 appear functionally equivalent, they are vastly different underneath.

// Method 1
string dot_abc = "." + abc;

At a glance, Method 1 should invoke operator+ with const char * as the left argument and std::string as the right argument. Stepping into the call, it invokes an operator+ that constructs a basic_string object.

//string L27 operator +(const char *, std::string)
template<class _Elem,
	class _Traits,
	class _Alloc> inline
	basic_string<_Elem, _Traits, _Alloc> __CLRCALL_OR_CDECL operator+(
		const _Elem *_Left,
		const basic_string<_Elem, _Traits, _Alloc>& _Right)
	{	// return NTCS + string
	return (basic_string<_Elem, _Traits, _Alloc>(_Left) += _Right);
	}

It calls a constructor that takes _Left (which is “.” in this case), and then performs operator+= with _Right (the std::string abc).

// xstring L661 cctor(const char*)
__CLR_OR_THIS_CALL basic_string(const _Elem *_Ptr)
	: _Mybase()
	{	// construct from [_Ptr, <null>)
	_Tidy();
	assign(_Ptr);
	}

In Method 2, the operation is different. First, a constructor is invoked to create a temporary string.

// xstring L798 cctor(const char *)
__CLR_OR_THIS_CALL basic_string(const _Elem *_Ptr, _NO_DEBUG_PLACEHOLDER)
	: _Mybase()
	{	// construct from [_Ptr, <null>)
	this->_Myfirstiter = _IGNORE_MYITERLIST;
	_Tidy();
	assign(_Ptr);
	}

Then it invokes operator+ with std::string as both the left and right argument.

// string L17 operator +(std::string const &, std::string const &)
template<class _Elem,
	class _Traits,
	class _Alloc> inline
	basic_string<_Elem, _Traits, _Alloc> __CLRCALL_OR_CDECL operator+(
		const basic_string<_Elem, _Traits, _Alloc>& _Left,
		const basic_string<_Elem, _Traits, _Alloc>& _Right)
	{	// return string + string
	return (basic_string<_Elem, _Traits, _Alloc>(_Left) += _Right);
	}

Notice anything strange?

For the operation where “.” is converted into a std::string, the constructors invoked by Method 1 and Method 2 are different! In Method 2, the constructor has a different signature, and there is an extra line in its body – this->_Myfirstiter = _IGNORE_MYITERLIST.

This is probably one of Visual Studio’s workarounds to allow programs compiled with HID=0 to safely invoke the std::string library in msvcp90d.dll. Unfortunately, there are loopholes in their patch, and Method 1 falls through one of them.

Conclusion

If you want to turn off HID and Secure SCL for performance reasons, be careful with the string library. There are quite a few bugs in VC9.0 that crash on perfectly legal C++ code. The example above is one of several scenarios we’ve found; we have also seen similar crashes with certain usages of stringstream.

On a side note, a co-worker of mine filed this bug on Microsoft Connect. They closed the bug shortly after, telling him that it had been fixed in VC10 Beta 2. Basically, they are suggesting that we upgrade our compiler to a beta version and pay them more money when it officially comes out. Great customer support, M$.

How to Post Source Code on WordPress.com

Since most of the posts on this blog are programming related, I need to post source code from time to time. I’ve been trying to find a way to post source code on wordpress.com with syntax highlighting and all the goodies.

I tried using the “code” tag in the wordpress editor, but it outputs in an awkward font.


The code button inserts <code> tags around the text. The font chosen by this theme is not legible.


#include <iostream>
int main()
{
std::cout<<"You can't read me anyway"<<std::endl;
return 0;
}

Use the Sourcecode Tag

So after some googling, I finally found a support page on wordpress. Apparently all you have to do is wrap your source code in

[sourcecode language=”???”] [/sourcecode]

Replace ??? with the language of your choice. For example, C++ uses “cpp”. See the support page for the complete list.

#include <iostream>
int main()
{
   std::cout << "Ah, much better" << std::endl;
   return 0;
}


It is quite simple!

Prefer Simple Mutex Over Recursive Mutex

For those of us who work in multi-threaded environments, the concept of a mutex (“mutual exclusion”) should be no stranger. There are two common categories of mutex – simple and recursive.

A simple mutex can only be locked once; if the same thread tries to lock the same mutex again, it will deadlock (or, with some implementations, invoke undefined behavior).

A recursive mutex can be locked by the same thread numerous times, and the thread must unlock it the same number of times before the critical region can be entered by another thread.

Recursive mutexes are common. In fact, mutexes are recursive by default on Windows.

With all that said, I believe recursive mutex should be avoided as much as possible.

Obvious Overhead

Since recursive mutex tracks more information than a simple mutex, it should be no surprise that it is more expensive.

Here is a sample test case with multiple threads accessing a critical region. The mutex implementation is from Boost.Thread library.


The red dotted line indicates the cost of a single simple mutex. The blue line indicates the cost of multiple recursive locks.

A single recursive mutex is quite a bit more expensive than a simple mutex. And if a recursive mutex is used, it is probably going to be locked more than once by the same thread (otherwise, why would you use it?). The graph also shows the cost of recursively locking the mutex again, up to five times. As you can see, the cost grows linearly, and it is definitely not free.

The term “recursive mutex” might sound fancy, but the extra locks are nothing but redundant: after the critical region is locked, each additional recursive lock adds nothing but overhead.

Some might argue that this cost is negligible. This might be true depending on the application, but this is just the surface of the problem.

The Real Problem

The truth is that locks and mutexes are difficult to use correctly. Until we have STM (software transactional memory), there is no silver bullet.

A recursive mutex makes locking easy by taking away the programmer’s responsibility for tracking mutexes. Programmers can add a recursive mutex to existing functions and member functions without refactoring code. This would not be possible with a simple mutex. However, this advantage is merely an illusion.

Don’t believe it? Here are the thoughts of David Butenhof, the person who introduced the recursive mutex into POSIX threads.

Some of his powerful arguments:

1. A recursive mutex allows you to lose track of your locking scheme and scope. As your call stack gets deeper, you have less sense of how long you’ve been holding the lock on the critical region.

2. Recursive mutexes are reference counted, and releasing one does not imply that the critical region is actually freed. So how do you know when it is safe to unlock?

I’ve learned a lot from his post. His arguments address the fundamental design issues in software that uses recursive mutexes.

To avoid using recursive mutex, we have to consider refactoring.

Refactoring

In functional programming, a typical refactoring technique to break up recursive locking is to split a function into two components – outer and inner. The outer function holds the lock and invokes the inner function to perform the task. The inner function is an internal function that is not exposed as an API.

In object-oriented programming, this technique can also be applied in the form of the Pimpl idiom or private member functions.

Consider this example:

Class A has two public functions – One and Two – and One calls Two. Both methods are thread-safe: they lock the same recursive mutex, a private member of class A. Below, the functional-programming refactoring technique is applied in OO form.


A::One calls A::Two; both lock the same mutex member variable.


Refactor the implementation of A::Two into A::Two_Impl. A::One and A::Two now call A::Two_Impl. The recursive mutex becomes unnecessary and can be replaced with a simple mutex.

Refactoring – A More Complicated Scenario

When more than one class is involved, things get more complicated, and the technique above no longer holds.

Here’s a more complicated (yet realistic) example:

Class A has two public functions – One and Three. Both are thread-safe and lock the same recursive mutex, a private member variable of A. Class B has one public function – Two.

A::One calls B::Two, and B::Two calls A::Three.


A more complicated scenario that involves more than one class.

Now we have a mess. The number of possible call paths grows quickly as the call stack gets deeper.

In the example, there are three possible paths.

1. A::One, B::Two, A::Three

2. B::Two, A::Three

3. A::Three

If we apply the previous technique, we would refactor A::Three into A::Three and A::Three_Impl. That won’t work here: when B::Two is called from A::One, the lock is already held, so B would need to call the lock-free A::Three_Impl – but A::Three_Impl is private, and class B can only see the public interface.

There is no fix-all solution here, but here are some suggestions.

1. Refactor A and B so that they do not reference each other circularly. In other words, A can reference B, or B can reference A, but not both at the same time.

2. Merge A and B into one class, or move B::Two into class A.

Use Recursive Mutex If It Is Absolutely Positively Necessary

Unfortunately, in the real world, things get even more complicated. The complicated example above involves only two classes. What if more classes are involved, the number of call paths explodes, and refactoring becomes unrealistic?

As a disciple of computer science, I would argue that for every solution with a recursive mutex, there must be a solution with a simple mutex.

As a software engineer who is under the gun with a deadline… I say just hack it and use the recursive mutex. We are shipping tomorrow! 🙂

But as David Butenhof said, the recursive mutex is a hack. Instead of reaching for it up front, we should avoid it until it is absolutely necessary.

Conclusion

Recursive mutex is dangerous because you lose your sense of locking scope, and it costs more than a simple mutex.

Consider refactoring the code to avoid the use of recursive mutex.

Use a recursive mutex only if the code can’t be refactored economically.

Source

The source and the spreadsheet can be downloaded here.

Tools: Visual Studio 2008, Boost 1.39, Windows XP SP3 32-bit