Enforce Alignment to Avoid False Sharing

I have been working on a C++ TCP server that utilizes Windows IO Completion Ports. So far, the toughest challenge has been maintaining the scalability of the server. Among all the concurrency problems, the one I absolutely try to avoid is false sharing, where one CPU modifies data that happens to sit on the same cache line as unrelated data used by another CPU, invalidating that CPU’s cached copy.

The symptoms of false sharing are extremely difficult to detect. As a preventive measure, I group all shared data carefully into a single object so I can easily visualize the potential contention, and I add padding wherever I think contention exists.

Then I came across a chapter in Windows Via C/C++ that provided a cleaner solution.

Just align them to different cache lines

My TCP server follows the proactor pattern, so I have an I/O thread pool to handle send and receive requests and dispatch events. Naturally, the threads share some data that they read, write, or both.

Here’s just a dummy example.

class CSharedData
{
public:
	CSharedData() : data1(0), data2(0), data3(0) {}
	unsigned int data1; // read write
	unsigned int data2; // read write
	unsigned int data3; // read write
};

Since my processor’s cache line is 64 bytes, the data structure above is definitely going to cause contention: say data1 is updated by one thread while data2 is read by another. To solve this, simply force every read-write data member onto a different cache line through __declspec(align(#)).

class __declspec(align(64)) CSharedData
{
public:
	CSharedData() : data1(0), data2(0), data3(0) {}
	__declspec(align(64))
		unsigned int data1;
	__declspec(align(64))
		unsigned int data2;
	__declspec(align(64))
		unsigned int data3;
};
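For compilers newer than the one I used, the same layout can probably be expressed with the standard C++11 alignas specifier. The following is only a sketch under that assumption: SharedDataAligned, kCacheLineSize, OnDifferentCacheLines and CheckSharedDataLayout are names I made up for illustration, and the 64-byte line size is the same assumption as above.

```cpp
#include <cassert>
#include <cstddef>   // offsetof, std::size_t
#include <cstdint>   // std::uintptr_t

// Assumption: 64-byte cache lines, matching the processor described above.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical C++11 counterpart of CSharedData (the name is made up).
struct SharedDataAligned
{
	SharedDataAligned() : data1(0), data2(0), data3(0) {}
	alignas(kCacheLineSize) unsigned int data1; // read write
	alignas(kCacheLineSize) unsigned int data2; // read write
	alignas(kCacheLineSize) unsigned int data3; // read write
};

// Each member starts its own cache line, so the object spans three lines.
static_assert(alignof(SharedDataAligned) == kCacheLineSize,
	"object must start on a cache line boundary");
static_assert(sizeof(SharedDataAligned) == 3 * kCacheLineSize,
	"one cache line per member");

// Returns true if the two addresses fall on different cache lines.
inline bool OnDifferentCacheLines(const void* a, const void* b)
{
	return reinterpret_cast<std::uintptr_t>(a) / kCacheLineSize
		!= reinterpret_cast<std::uintptr_t>(b) / kCacheLineSize;
}

// Runtime sanity check on a stack instance.
inline void CheckSharedDataLayout()
{
	SharedDataAligned s;
	assert(OnDifferentCacheLines(&s.data1, &s.data2));
	assert(OnDifferentCacheLines(&s.data2, &s.data3));
}
```

The static assertions turn the layout assumption into a compile-time check, so a change to the members that accidentally puts two of them on one line is caught early. The price is size: three 4-byte counters now occupy 192 bytes.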

Thoughts

With __declspec(align(#)), you can even specify the alignment of the data structure itself. It is tempting to use this for putting shared objects in containers like std::vector, but see Okef’s std::vector of Aligned Elements for why this is a bad idea.

It would be nice if the alignment could be changed at runtime based on the processor spec. I know it doesn’t make sense technically, but it is on my wishlist. 🙂

Memory Alignment Problems

Memory alignment is the way data types are arranged and accessed in memory. There are numerous tutorials on the internet that cover the topic in depth. This is not one of them. This is just a post that gathers my thoughts on how little I knew about the topic. 🙂

I read Write Portable Code a while ago, and there are some recommended practices that I follow to avoid alignment issues. Among them is to avoid using memory overlays and bit-fields.

By avoiding those features, I don’t deal with memory alignment issues too often. But at the same time, I also avoided understanding those memory alignment issues in the first place.

In the past several weeks, I have worked with a low-level programmer who uses those features. After working through the alignment bugs we encountered, I feel like I need to take Programming 101 again.

Bus Error on Solaris

My co-worker was developing cross-platform software in C that receives and routes data from the software I developed (vague on purpose). We integrated our code, and it worked fine on the Linux system. Then he ported the code over to the Solaris machine, and it quickly gave him a “Bus Error”.

// code has been drastically simplified for demonstration purposes

// some byte array
char p[5000];

//...many lines later

// extract a 4 byte integer from the buffer, and get a Bus Error from Solaris
int *intData= (int *) ((char *)&p);
int fourByteInteger = *intData;

It smells like an alignment issue. Dereferencing intData probably caused a misaligned memory access. We didn’t know the exact details, but my co-worker changed it to memcpy (one of the recommended practices from Write Portable Code), and everything was happy.
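The memcpy approach can be sketched roughly like this. It is not my co-worker’s actual code: ReadIntAt, WriteIntAt and DemoUnalignedRead are hypothetical helpers I made up for illustration, and the sketch assumes the integer in the buffer is already in the host’s byte order.

```cpp
#include <cassert>
#include <cstddef>   // std::size_t
#include <cstring>   // std::memcpy

// Hypothetical helper: reads an int stored at any byte offset.
// memcpy copies byte by byte as far as the language is concerned,
// so no misaligned int* is ever formed or dereferenced.
inline int ReadIntAt(const char* buffer, std::size_t offset)
{
	int value;
	std::memcpy(&value, buffer + offset, sizeof(value));
	return value;
}

// Hypothetical counterpart that stores an int at any byte offset.
inline void WriteIntAt(char* buffer, std::size_t offset, int value)
{
	std::memcpy(buffer + offset, &value, sizeof(value));
}

// Round trip through an odd offset, which would be a misaligned
// address for a direct int access.
inline void DemoUnalignedRead()
{
	char p[5000] = {0};
	WriteIntAt(p, 1, 123456);
	assert(ReadIntAt(p, 1) == 123456);
}
```

On architectures that tolerate unaligned access, compilers typically turn a fixed-size memcpy like this into a plain load, so the safety usually costs little there.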

So it is an alignment issue. But this leaves a bigger question: why does it work on the Linux system?

Unaligned Memory Access on Intel Architecture

The Linux system uses an Intel processor, and the Solaris system uses a SPARC processor.

It turns out that the Intel architecture allows unaligned memory access, with a small performance penalty. I’ve been sheltered under the Intel architecture for so long that I took the unaligned memory access “feature” for granted.

So this leads to another question: how big is the penalty, and how do you detect unaligned memory access?

Finding out the penalty isn’t a difficult task. You can force an unaligned access by casting a byte pointer to a pointer of a larger type. Here is some pseudo-code.

// loop through the entire array by assuming that the void
// pointer is of type T
template<typename T>
void LoopThroughData( void *data, uint32_t size )
{
	T *dataT = (T*) data;
	T *dataTEnd = dataT + size/sizeof(T);

	while( dataT != dataTEnd )
	{
		(*dataT)*=2;
		dataT++;
	}
}
...
char buffer[2000];
char *bufferWithOffset = buffer + someOffset;
// loop through the array by assuming that it is an integer array;
// shrink the size by the offset so the loop stays inside the buffer
LoopThroughData<int>((void *)bufferWithOffset, sizeof(buffer) - someOffset);

Here are some plots that show the penalty of unaligned access on the Intel architecture.

Accessing a byte array as a 32-bit integer array with different offset values.

Accessing a byte array as a 64-bit integer array with different offset values.

The plots show a 2% performance penalty on unaligned 32-bit integer access and a 3.6% performance penalty on unaligned 64-bit integer access.
I am using an Intel Core i5-750 processor. The penalty ratio is likely to be different across the Intel processor family.

Defense Against Unaligned Memory Access

Regardless of architecture, we should avoid unaligned memory access. If you can’t use memcpy for performance reasons, there is a compiler-specific operator that can help detect unaligned memory.

In Visual Studio, __alignof returns the alignment requirement of a given type. In GCC, the equivalent is __alignof__. With this tool, I wrote a small C++ routine that determines whether a given pointer meets its alignment requirement.

template <typename T>
bool CheckIfDataIsAligned(T *p)
{
	if(((uintptr_t)p % __alignof(T)) == 0)
	{
		return true;
	}
	return false;
}
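With a C++11 compiler, the same check can probably be written with the standard alignof operator instead of the compiler-specific ones. This is a sketch under that assumption, and IsAlignedFor and DemoIsAlignedFor are names I made up. Unlike the routine above, it takes a const void* and names the target type explicitly, so the caller can test a raw byte pointer before casting it.

```cpp
#include <cassert>
#include <cstdint>   // std::uintptr_t, std::uint32_t

// Hypothetical variant of CheckIfDataIsAligned: returns true if the
// address p satisfies the alignment requirement of type T.
template <typename T>
bool IsAlignedFor(const void* p)
{
	return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Checks a few addresses inside an 8-byte-aligned byte buffer.
// Assumes uint32_t needs more than 1-byte alignment, which holds
// on the mainstream platforms I know of.
inline void DemoIsAlignedFor()
{
	alignas(8) char buffer[16] = {0};
	assert(IsAlignedFor<std::uint32_t>(buffer));      // start of the buffer
	assert(!IsAlignedFor<std::uint32_t>(buffer + 1)); // odd address
	assert(IsAlignedFor<char>(buffer + 1));           // char is fine anywhere
}
```

A check like this fits well in a debug-only assertion right before the cast, so the release build pays nothing for it.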

If your compiler does not support any variant of alignof, there is a clever solution that implements it in terms of offsetof.
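My understanding of the offsetof trick is roughly the following: put a single char in front of a member of type T, and the padding the compiler inserts between the two reveals T’s alignment. AlignmentProbe, ALIGNMENT_OF and DemoAlignmentOf are hypothetical names of mine, and the sketch assumes T is a simple type for which offsetof is well defined.

```cpp
#include <cassert>
#include <cstddef>   // offsetof
#include <stdint.h>  // uint16_t, uint32_t

// The compiler must pad after `c` so that `value` is properly aligned.
// The offset of `value` therefore equals the alignment applied to T.
template <typename T>
struct AlignmentProbe
{
	char c;
	T value;
};

// Hypothetical macro name. It breaks if T itself contains a comma,
// because the preprocessor would split the argument.
#define ALIGNMENT_OF(T) offsetof(AlignmentProbe<T>, value)

// Compares the trick against the built-in operator for a few types.
inline void DemoAlignmentOf()
{
	assert(ALIGNMENT_OF(char) == 1);
	assert(ALIGNMENT_OF(uint16_t) == alignof(uint16_t));
	assert(ALIGNMENT_OF(uint32_t) == alignof(uint32_t));
}
```

The demo uses alignof only to cross-check the result; on a compiler without it, the macro alone is the point.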

Bit-field Padding Problem

Another problem I encountered recently is data structure padding. My co-worker defined a bit-field to extract messages from devices.

// code simplified for demonstration purposes.
// method 1
struct BitFields
{
	uint32_t a : 16;
	uint32_t b : 16;
	uint8_t c;
	uint8_t d;
};
// method 2
struct BitFields2
{
	uint16_t a;
	uint16_t b;
	uint8_t c;
	uint8_t d;
};

I am simplifying the code here. The story is a bit more complicated. There are many messages defined, and he is using the size of the messages to determine the offset to read from a large buffer. He found out that if he uses method 1 for his bit-fields, things are completely out of sync. If he uses method 2, everything works.

If you run the sizeof operator on both structures, the one defined with method 1 will be bigger than the one defined with method 2. This is because compilers round a structure’s size up to a multiple of its largest member alignment. In method 1, the largest member type is uint32_t, with 4-byte alignment, so the 6 bytes of data are followed by 2 bytes of padding at the end of the structure. In method 2, the largest member type is uint16_t, so the structure stays at 6 bytes.
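The difference is easy to measure. Here is a small sketch that repeats the two definitions and computes how many bytes each layout adds on top of the 6 bytes of message data. kPayloadBytes, PaddingBytes and CheckPadding are names I made up, and the exact numbers are ultimately up to the compiler and ABI.

```cpp
#include <cassert>
#include <cstddef>   // std::size_t
#include <stdint.h>  // uint8_t, uint16_t, uint32_t

// method 1
struct BitFields
{
	uint32_t a : 16;
	uint32_t b : 16;
	uint8_t c;
	uint8_t d;
};

// method 2
struct BitFields2
{
	uint16_t a;
	uint16_t b;
	uint8_t c;
	uint8_t d;
};

// Bytes of actual message data in either layout: 2 + 2 + 1 + 1.
const std::size_t kPayloadBytes = 6;

// Padding the compiler added on top of the payload.
template <typename T>
std::size_t PaddingBytes()
{
	return sizeof(T) - kPayloadBytes;
}

// The size of a structure is always a multiple of its alignment,
// which is where the trailing padding comes from.
inline void CheckPadding()
{
	assert(sizeof(BitFields) % alignof(BitFields) == 0);
	assert(sizeof(BitFields2) % alignof(BitFields2) == 0);
	assert(PaddingBytes<BitFields2>() == 0);
}
```

With the mainstream x86 compilers, PaddingBytes<BitFields>() comes out as 2 and PaddingBytes<BitFields2>() as 0, which is exactly the 2-byte drift per message that threw the offsets out of sync.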

Defense Against Bit-field Padding

Regardless of how much understanding I have of bit-fields, mistakes can always be made. I came up with two personal guidelines to follow next time I define a bit-field.

1. Use unnamed bit-fields if padding is intended.

2. Use a static assertion to validate structure sizes and guard against unintended padding.

struct BitFields
{
	uint32_t a : 16;
	uint32_t b : 16;
	uint8_t c;
	uint8_t d;
	uint8_t : 8; // use unnamed bit-field if it is intentional
	uint8_t : 8; // use unnamed bit-field if it is intentional
};
// static assertion to guard against unaware padding
BOOST_STATIC_ASSERT(sizeof(BitFields) == 8);

struct BitFields2
{
	uint16_t a;
	uint16_t b;
	uint8_t c;
	uint8_t d;
};
// static assertion to guard against unintended padding
BOOST_STATIC_ASSERT(sizeof(BitFields2) == 6);

On a side note, I know that #pragma pack(n) can also remove padding. But #pragma pack(n) gives the programmer only partial control over a structure’s alignment: n is an upper bound, so the compiler can still align a member to less than n if n is greater than 1.
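To make the side note concrete, here is a sketch of both behaviors, assuming a compiler that supports #pragma pack(push, n) and #pragma pack(pop), as Visual Studio and GCC do. PackedBitFields, Pack4 and CheckPackedLayouts are hypothetical names, and I use the C++11 static_assert in place of BOOST_STATIC_ASSERT.

```cpp
#include <cassert>
#include <stdint.h>  // uint8_t, uint16_t, uint32_t

// pack(1) caps every member's alignment at 1, so the trailing
// padding of method 1 disappears.
#pragma pack(push, 1)
struct PackedBitFields
{
	uint32_t a : 16;
	uint32_t b : 16;
	uint8_t c;
	uint8_t d;
};
#pragma pack(pop)

// pack(4) is only an upper bound: the widest member here needs
// 2-byte alignment, so the structure is aligned to 2, not 4.
#pragma pack(push, 4)
struct Pack4
{
	uint8_t a;
	uint16_t b;
};
#pragma pack(pop)

// Guard the layouts at compile time, following guideline 2.
static_assert(sizeof(PackedBitFields) == 6, "unexpected padding");
static_assert(sizeof(Pack4) == 4, "unexpected padding");

inline void CheckPackedLayouts()
{
	assert(alignof(PackedBitFields) == 1);
	assert(alignof(Pack4) == 2);
}
```

Keep in mind that a packed structure brings back the unaligned access problem from earlier in this post: its members may sit at addresses that a SPARC processor refuses to load from directly.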

Source

The source and spreadsheet can be downloaded here.

Compiler: Visual Studio 2008

Machine Specification: Intel Core i5-750, Windows 7 64-bit.