IOCP Server Library

So I wrote a C++ library that provides a scalable TCP server using Windows I/O Completion Ports (IOCP).

A couple of weeks ago, I started studying IOCP to improve the scalability of a C++ application that may handle thousands of TCP/IP data streams.

It didn’t take long for me to realize why IOCP has the reputation of being difficult to learn. IOCP tutorials online usually fall into one of three categories: difficult to read, overly simplified, or just plain wrong.

Worse yet, Winsock2 contains a confusing mix of APIs that perform very similar functions with subtle differences. I spent a few days just deciding whether I should use WSAAccept, accept or AcceptEx to accept a connection.

Eventually, I stumbled onto two books that helped me out a great deal: Windows via C/C++ and Network Programming for Microsoft Windows.

The Library

The library interface is rather simple. It follows the Proactor design pattern: the user supplies a completion handler, and event notifications flow through that handler asynchronously.

Everyone uses an echo server as a tutorial example. So what the heck, here’s mine. 🙂

#include <iostream>
#include <vector>
#include <boost/shared_ptr.hpp>
// (plus the IOCPServer library headers)

class CEchoHandler : public CIocpHandler
{
public:
    virtual void OnReceiveData(
        uint64_t clientId,
        std::vector<uint8_t> const &data)
    {
        // echo data back directly to the connected client
        std::vector<uint8_t> d(data);
        GetIocpServer().Send(clientId, d);
    }
};

int main()
{
    // create a handler that echoes data back
    // (held via the base class; the template argument is assumed)
    boost::shared_ptr<CIocpHandler> h(new CEchoHandler());
    try
    {
        // bind to port 50000 with the server
        CIocpServer *echoServer = new CIocpServer(50000, h);

        char c;
        std::cin >> c; // enter a key to exit

        delete echoServer;
    }
    // RAII constructor that throws different exceptions upon failure
    catch(...)
    {
    }
    return 0;
}

[10/27/2010 10:34AM EST]
Update: Moved “delete echoServer;” to within the try block per co-worker’s suggestion.
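
To make the Proactor flow a bit more concrete without pulling in Winsock, here is a toy, single-threaded sketch of the idea: completed operations are queued, and a dispatcher hands each one to the user's completion handler. Every name in it (ICompletionHandler, ToyProactor, RecordingHandler, DemoToyProactor) is hypothetical and not part of IOCPServer; a real IOCP server gets its completions from the OS and dispatches them on a thread pool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical stand-in for a completion handler interface.
struct ICompletionHandler {
    virtual ~ICompletionHandler() {}
    virtual void OnReceiveData(std::uint64_t clientId,
                               const std::vector<std::uint8_t>& data) = 0;
};

// Toy proactor: finished operations are queued, then dispatched to the handler.
class ToyProactor {
public:
    explicit ToyProactor(ICompletionHandler& handler) : handler_(handler) {}

    // Simulates the OS reporting that a receive finished for clientId.
    void PostReceiveCompletion(std::uint64_t clientId,
                               std::vector<std::uint8_t> data) {
        completions_.emplace_back(clientId, std::move(data));
    }

    // Dispatches every queued completion; returns how many were delivered.
    std::size_t RunOnce() {
        std::size_t delivered = 0;
        while (!completions_.empty()) {
            Completion c = std::move(completions_.front());
            completions_.pop_front();
            handler_.OnReceiveData(c.first, c.second);
            ++delivered;
        }
        return delivered;
    }

private:
    typedef std::pair<std::uint64_t, std::vector<std::uint8_t> > Completion;
    ICompletionHandler& handler_;
    std::deque<Completion> completions_;
};

// Example handler that records which clients sent data and how many bytes.
struct RecordingHandler : ICompletionHandler {
    std::vector<std::uint64_t> clients;
    std::size_t bytes = 0;
    void OnReceiveData(std::uint64_t clientId,
                       const std::vector<std::uint8_t>& data) override {
        clients.push_back(clientId);
        bytes += data.size();
    }
};

// Usage sketch: two completions are posted, then dispatched in order.
inline void DemoToyProactor() {
    RecordingHandler h;
    ToyProactor p(h);
    p.PostReceiveCompletion(7, {1, 2, 3});
    p.PostReceiveCompletion(9, {4});
    assert(h.bytes == 0);  // nothing is delivered until dispatch runs
    assert(p.RunOnce() == 2);
    assert(h.bytes == 4);
}
```

The point of the pattern is that the handler never asks for data; it is only told when an operation has already finished.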

Focus

Of course, there is more to the IOCP server than the code snippet above.

Here are my areas of focus when designing the library.

  1. Scalability – Keep the number of critical sections in the library to a minimum.
  2. TCP graceful shutdown – Allow the user to perform a graceful TCP shutdown, and simplify handling of the TCP half-closed state.
  3. RAII – A WYSIWYG constructor (it either succeeds or throws) and a lenient destructor that allows you to do an ungraceful shutdown.
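
The RAII point can be sketched in a few lines of portable C++. This is only an illustration under stated assumptions: ToyListener, OpenCount and TryListen are hypothetical names standing in for a listening socket, not the library's classes. The constructor throws when it cannot set itself up, so a constructed object is always usable, and the destructor releases unconditionally and never throws.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical resource standing in for a listening socket.
class ToyListener {
public:
    // Throws on a bad port, so a half-built object never exists.
    explicit ToyListener(int port) : port_(port) {
        if (port <= 0 || port > 65535)
            throw std::invalid_argument("port out of range");
        ++OpenCount();  // "acquire" the resource
    }

    // Lenient destructor: releases unconditionally and never throws.
    ~ToyListener() { --OpenCount(); }

    ToyListener(const ToyListener&) = delete;
    ToyListener& operator=(const ToyListener&) = delete;

    int Port() const { return port_; }

    // Number of listeners currently alive (for illustration only).
    static int& OpenCount() {
        static int count = 0;
        return count;
    }

private:
    int port_;
};

// Returns true if a listener could be constructed on the given port.
inline bool TryListen(int port) {
    try {
        ToyListener listener(port);
        assert(ToyListener::OpenCount() >= 1);  // alive inside this scope
        return true;
    } catch (const std::invalid_argument&) {
        return false;  // constructor threw; the destructor is not run
    }
}
```

Because the failed constructor never incremented the count and the successful one is undone by the destructor, nothing leaks on either path.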

Here is a screenshot of the CPU utilization of the echo server at 300 concurrent loopback file transfers.

[Figure: IOCP Server scalability on an Intel Core i5-750 (quad-core)]

License

IOCPServer is released under the Boost Software License 1.0.

Download

For the latest version, please see the Projects page.

IOCPServer has been tested under the following configurations.

OS: Windows XP, Windows 7.

Compiler: Visual Studio 2009 with Boost 1.40

Build Type: ANSI, Unicode.

Enforce Alignment to Avoid False Sharing

I have been working on a C++ TCP server that utilizes Windows I/O Completion Ports. So far, the toughest challenge has been maintaining the scalability of the server. Among all the concurrency problems, the one I absolutely try to avoid is false sharing, where one CPU modifies a piece of data that sits on the same cache line as unrelated data used by another CPU, invalidating that CPU’s copy of the line.

The symptoms of false sharing are extremely difficult to detect. As a preventive measure, I grouped all shared data carefully into a single object so I could easily visualize the potential contention, and I added padding wherever I thought contention existed.
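
As a rough sketch of what that manual padding looks like, assuming a 64-byte cache line (the names kCacheLine and PaddedCounters are made up for this example):

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache line size in bytes; check your own processor.
const std::size_t kCacheLine = 64;

// Each counter is followed by filler so the next one starts 64 bytes later.
struct PaddedCounters {
    unsigned int data1;
    char pad1[kCacheLine - sizeof(unsigned int)];
    unsigned int data2;
    char pad2[kCacheLine - sizeof(unsigned int)];
    unsigned int data3;
    char pad3[kCacheLine - sizeof(unsigned int)];
};

// The members are exactly one cache line apart.
static_assert(offsetof(PaddedCounters, data2) -
              offsetof(PaddedCounters, data1) == kCacheLine,
              "data1 and data2 must be one cache line apart");
static_assert(offsetof(PaddedCounters, data3) -
              offsetof(PaddedCounters, data2) == kCacheLine,
              "data2 and data3 must be one cache line apart");

// Byte distance between the first and last counter, for a quick sanity check.
inline std::size_t SpanBetweenFirstAndLast() {
    return offsetof(PaddedCounters, data3) - offsetof(PaddedCounters, data1);
}
```

Padding like this keeps the members apart from each other, but it does not force the object itself to start on a cache line boundary, so data1 can still share a line with whatever precedes the object in memory.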

Then I came across a chapter in Windows via C/C++ that provided a cleaner solution.

Just align them to different cache lines

My TCP server follows the Proactor pattern, so I have an I/O thread pool to handle send and receive requests and dispatch events. Naturally, the threads share some pieces of data that they read, write, or both.

Here’s just a dummy example.

class CSharedData
{
public:
	CSharedData() : data1(0), data2(0), data3(0) {}
	unsigned int data1; // read write
	unsigned int data2; // read write
	unsigned int data3; // read write
};

Since my processor’s cache line is 64 bytes, the three 4-byte members above will almost certainly land on the same cache line and cause contention, for example when data1 is updated by one thread while data2 is read by another. To solve this, simply force every read/write data member onto a different cache line with __declspec(align(#)).

class __declspec(align(64)) CSharedData
{
public:
	CSharedData() : data1(0), data2(0), data3(0) {}
	__declspec(align(64))
		unsigned int data1;
	__declspec(align(64))
		unsigned int data2;
	__declspec(align(64))
		unsigned int data3;
};
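
For readers without MSVC, the same layout can be checked with the standard C++11 alignas specifier, which I am using here as a portable stand-in for __declspec(align(64)). The names SharedDataPortable and OnDistinctCacheLines are made up for this sketch, and the 64-byte line size is still an assumption about the processor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable stand-in for the __declspec(align(64)) version above.
struct alignas(64) SharedDataPortable {
    alignas(64) unsigned int data1 = 0;
    alignas(64) unsigned int data2 = 0;
    alignas(64) unsigned int data3 = 0;
};

// The compiler pads the object so each member owns a full 64-byte slot.
static_assert(alignof(SharedDataPortable) == 64, "object starts on a line");
static_assert(sizeof(SharedDataPortable) == 3 * 64, "one line per member");

// Returns true if the three members fall on three different 64-byte lines.
inline bool OnDistinctCacheLines(const SharedDataPortable& s) {
    const std::uintptr_t line1 = reinterpret_cast<std::uintptr_t>(&s.data1) / 64;
    const std::uintptr_t line2 = reinterpret_cast<std::uintptr_t>(&s.data2) / 64;
    const std::uintptr_t line3 = reinterpret_cast<std::uintptr_t>(&s.data3) / 64;
    return line1 != line2 && line2 != line3 && line1 != line3;
}
```

The price is memory: 12 bytes of payload now occupy 192 bytes, which is usually an acceptable trade for a handful of hot shared fields.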

Thoughts

With __declspec(align(#)), you can even specify the alignment of the data structure itself. I originally thought this would be very useful for putting shared objects in containers like std::vector, but see Okef’s std::vector of Aligned Elements for why that is a bad idea.
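
Aligning the whole type is still handy for fixed-size arrays, for example one slot per I/O thread. Here is a minimal sketch, again using C++11 alignas as a portable stand-in and a plain array instead of std::vector; PerThreadCounter, g_counters and AllSlotsAligned are hypothetical names, and the four slots and 64-byte line size are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Whole-type alignment: sizeof is rounded up to 64, so array elements
// never share a 64-byte line with their neighbours.
struct alignas(64) PerThreadCounter {
    unsigned long count;
};

static_assert(sizeof(PerThreadCounter) == 64, "one element per cache line");

// One slot per worker thread (four threads assumed for the example).
static PerThreadCounter g_counters[4];

// Returns true if every slot starts on a 64-byte boundary.
inline bool AllSlotsAligned() {
    for (std::size_t i = 0; i < 4; ++i) {
        if (reinterpret_cast<std::uintptr_t>(&g_counters[i]) % 64 != 0)
            return false;
    }
    return true;
}
```

Each thread can then update its own slot without invalidating the line that holds another thread's slot.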

It would be nice if the alignment could be changed at runtime based on the processor’s spec. I know it doesn’t make sense technically, but it is on my wishlist. 🙂