IOCP Server 1.2 Released

It has almost been 10 months since I published the IOCP server library. This little library has been very useful to me over several C++ projects. It offers impressive scalability on multi-core processors and a simple interface for TCP graceful shutdown (unlike boost ASIO).

I recently found a crash bug where an utility function was copying a non-NULL terminated IP address string. This bug is specific to Unicode build, and has no affect on ANSI build.

Also, I ran CPPCheck against the library. I am proud to say that CPPCheck has reported only one style warning, and this has also been addressed.

Anyway, you may get the latest version of IOCP Server here. Enjoy!

Plotting GetThreadTimes

As a side project, I have been developing an application to monitor the performance of our main product (hence the lack of update). In my experience, our large C++ programs have not worked well under most profilers (either too slow, or too resource intensive). The PGO technique I posted last year works well within the scope of a library, but does not provide performance data in a system wide perspective. ProcExp from SysInternal provides good performance insights with threads CPU cycles and context switches, but the results are not recordable. So I started by duplicating ProcExp’s thread performance output through GetThreadTimes.

Plotting The Data

GetThreadTimes is a very powerful function that provides the cumulative CPU and kernel time consumed by a particular thread. The resolution is known to be coarse, and has a tendency to overlook very short execution (e.g. thread that couldn’t fully consume its quantum). A post online noticed the granularity of measurement, and suggested a 1 second sampling time to get sufficiently accurate result.

Below is a timing report from GetThreadTimes sampling once a second. The target application is Media Player Classic playing a 720p H.264 video.

GetThreadTimes sampled at one second interval.

Sadly, the plot is almost completely indecipherable. There are large fluctuations from reading to reading, and thread times collected are occasionally zero.

So I spent a lot of time optimizing the data. After many trials and errors, I finally decided to smooth on the data points with a 15 second rolling average. The resulting plot looks far more consumable.

Same data as before, but with 15 second rolling average.

Final Thoughts

After digging into Windows Internal, I still couldn’t grasp the output of GetThreadTimes. Without averaging, the data is extremely difficult to consume.

By smoothing out the data with a rolling average, the data plot became very practical. In fact, this method has already uncovered a performance bug during overnight runs.

STL and erase()

The inconsistency of erase() method in STL containers has always bothered me.

Say if you would want to loop through a std::list, print its items, and erase the item from the list, you could do the following.

std::list<int> l(6); // add in 6 zeros
std::list<int>::iterator itr = l.begin();
while(l.end() != itr)
{
	std::cout << *itr << std::endl;
	itr = l.erase(itr);
}

Pretty straight forward.

The erase() method will remove the item pointed by the current iterator, and move the iterator to the next element.

Now what if you want to do the same to a std::set?

std::set<int> s; // add in 6 zeros
//do some initialization here
std::set<int>::iterator s_itr = s.begin();
while(s.end() != s_itr)
{
	std::cout << *s_itr << std::endl;
	s_itr = s.erase(s_itr); // COMPILER ERROR!
}

Accck! This won’t compile under gcc because erase() method in an associative container returns void. To work around the problem, you need to use the post-increment operator of the iterator within the erase() method.

std::set<int> s; // add in 6 zeros
//do some initialization here
std::set<int>::iterator s_itr = s.begin();
while(s.end() != s_itr)
{
	std::cout << *s_itr << std::endl;
	s.erase(s_itr++); // use post-increment
}

In C++, even erasing an item from a STL container is subtle and error-proning.

The Standard

To make matter worse, Visual Studio’s erase() method implementation violates the standard, and is consistent across all containers. This confuses a lot of people.

As far as I can see, this inconsistency has been deemed as a defect by LGW since 1999. Hell, even Josuttis mentioned a few dissatisfactions in his infamous STL Bible.

In the current C++0x standard, it looks like this issue will finally be put to rest. So… just a few more years of pain before the next compiler upgrade.

Shared Bathroom Problem

I came across an interesting synchronization problem called the shared bathroom problem. The shared resource is the bathroom. It needs to be shared among males and females. However, the bathroom resource sharing protocol must be implemented in the following manner:

Mutual exclusion: persons of opposite sex may not occupy the bathroom simultaneously,
Fairness: everyone must wait in line in a civil manner without cutting in line.

Rule #1 lays out similar characteristics compare to the reader/writer problem. But the rules here are slightly relaxed, where there could be multiple male or female occupying the resource at once.

To ensure starvation-freedom, rule #2 enforces fairness such that any person will eventually enter the bathroom.

A Jab at the Problem

I have just started (re)learning Java professionally, so I might as well write it in Java.

Mapping the problem to implementation is fairly straight forward. There are two classes of threads – MaleThread and FemaleThread. Both types of threads will try to enter a resource (class) called Bathroom. Bathroom is consisted of two API – enter(Person p) and leave(Person p), where Person is just a data class that indicates holds the gender and thread information.

So an example of a male thread entering the bathroom could be the following:

Person male = new Person(Thread.currentThread(), Gender.MALE);
bathroom.enter(male);
// do some private business
bathroom.leave(male);

From rule #1, the Bathroom class has the characteristics of a mutex, where pending threads need to be blocked and unblocked efficiently. For this, I chose LockSupport for thread synchronization. For the fairness requirement in rule #2, ConcurrentLinkedQueue is used for the FIFO behavior.

class Bathroom
{
  private AtomicInteger numMale = new AtomicInteger(0);
  private AtomicInteger numFemale = new AtomicInteger(0);
  private Queue waiters = new ConcurrentLinkedQueue();

  public void enter(Person person)
  {
    boolean wasInterrupted = false;

    waiters.add(person);

    // Block while not first in line, or if there is opposite sex in the
    // bathroom.
    while (waiters.peek() != person
        || getOppositeGenderCount(person.getGender()).get() > 0)
    {
      LockSupport.park();

      // ignore interrupts while waiting
      if (Thread.interrupted())
        wasInterrupted = true;
    }

    // Increment the current sex count since this person is first in line
    // and there is no opposite sex in the bathroom.
    getGenderCount(person.getGender()).getAndIncrement();

	// Remove its from the waiting line. This is the linearization point.
    waiters.remove();

	// wake up the next waiter
    Person nextPersonInLine = waiters.peek();
    if (nextPersonInLine != null)
    {
      LockSupport.unpark(nextPersonInLine.getOwnerThread());
    }

    // reassert interrupt status on exit
    if (wasInterrupted)
    {
      person.getOwnerThread().interrupt();
    }
  }

  public void leave(Person person)
  {
	// decrement the gender count, and wake up the next waiter
    getGenderCount(person.getGender()).getAndDecrement();

    Person nextPersonInLine = waiters.peek();
    if (nextPersonInLine != null)
    {
      LockSupport.unpark(nextPersonInLine.getOwnerThread());
    }
  }

  private AtomicInteger getOppositeGenderCount(Gender gender)
  {
    if (gender == Gender.MALE)
      return numFemale;
    else if (gender == Gender.FEMALE)
      return numMale;

    // something is wrong here.
    return numMale;
  }

  private AtomicInteger getGenderCount(Gender gender)
  {
    if (gender == Gender.MALE)
      return numMale;
    else if (gender == Gender.FEMALE)
      return numFemale;
    // something is wrong here.
    return numMale;
  }
}
public class Person
{
  public Person(Thread ownerThread, Gender sex)
  {
    super();
    this.ownerThread = ownerThread;
    this.sex = sex;
  }
  private final Thread ownerThread;
  private final Gender sex;

  public Thread getOwnerThread()
  {
    return ownerThread;
  }

  public Gender getGender()
  {
    return sex;
  }
}

public enum Gender
{
  FEMALE, MALE
}

Some Explanation

The bathroom itself is lock-free. From the high level, every person who tries to enter a bathroom goes through a “test and set” process for two conditions – 1. he/she is the first person in the waiting line, and 2. no opposite sex is in the bathroom. As soon as these two conditions are valid, the person will enter the bathroom by incrementing his gender count, and leave the waiting line. This is the linearization point of the enter() method.

The leave() method simply decrements the gender count and notify the next in line to wake up and re-test the conditions to enter the bathroom.

For suspending and waking up threads, LockSupport’s park() and unpark() method were used. The park() method is simple, where threads are suspended until a “permit” is available through the call to unpark(). However, the unpark() function has a subtle yet beautiful post-condition. It guarantees the following:

Make available the permit for the given thread, if it was not already available. If the thread was blocked on park() then it will unblock. Otherwise, its next call to park() is guaranteed not to block.

Because unpark() guarantees unblocking the next call to park(), the bathroom class does not suffer from lost-wakeup.

Thoughts

To prove the code’s correctness, I was able to identify the linearization point of enter() and leave(). It also passes my unit test. However, I do not claim to have complete confident in the correctness of the algorithm. After all, data race detection is a NP hard problem.

I found two similar blog posts on this problem. This solution is different in the sense that only guarantees starvation-freedom (as oppose to fairness, which is more restrictive). Another solution guarantees fairness (also uses FIFO queue), and is written in a lock-based manner.

Performance was not measured here, as it is only a proof of concept.

The bathroom resource is not recursive (not re-entrant). After all, recursive lock is evil.

The source can be downloaded here. (Written with Java 1.6)

Some thoughts on Java

For the past month, I have been working on a server side web service in Java. Although I have used Java in numerous courses throughout college and grad school, this is the first time I have programmed in it professionally.

Compare to C++, Java is a much cleaner language. Here’s a few things I enjoy so far.

Importing and exporting libraries is fairly trivial (no declspec and name mangling issues).
A memory model that provides visibility and ordering guarantees (unlike C++ volatile)
Safe casting without worrying about memory layout (no memory alignment or padding).
Garbage collection allows for slightly sloppier exit sequence. (vs. memory corruption)

In general, difficult things I deal with in C++ are still difficult in Java (such as concurrency). But Java’s cleaner design provides an invisible guiding hands to prevent me from shooting my foot.

Complaints

Deep copying in Java is an unpleasant experience. Java does not provide a default copy constructor, hence all copy constructor must be hand-crafted. The clone interface counter-intuitively provides a shallow copy, and is avoided by best practices.

Over the years with C++, I have formed a habit of cleanly separating business logic and data. Under such a setup, data is coded in form of POD, the default copy constructor (and = operator) provides deep copying for free. In general, deep copying of POD is a trivial operation, and generally has no-throw guarantee.

Unfortunately, Java does not have a well defined concept of POD. In fact, even the simplest object in Java, java.lang.Object, is non-trivial. Each Object is associated with a monitor object, and deep copying a monitor object under a multi-threaded environment is almost never desirable. Since the internal monitor object is not exposed in the Object interface, you can never truly perform a deep copy of any Java objects.

And without an automated POD copy constructors, the copy-and-swap idiom does not transfer as nicely in Java.

More Thoughts

If the lack of deep copy is the only thing I can complain about Java, it is not a bad experience so far.

ASKLDJD

recording a programmer's thought

Author: Alan Ning