Shared Bathroom Problem

I came across an interesting synchronization problem called the shared bathroom problem. The shared resource is the bathroom. It needs to be shared among males and females. However, the bathroom resource sharing protocol must be implemented in the following manner:

  1. Mutual exclusion: persons of opposite sex may not occupy the bathroom simultaneously,
  2. Fairness: everyone must wait in line in a civil manner without cutting in line.

Rule #1 lays out similar characteristics compare to the reader/writer problem. But the rules here are slightly relaxed, where there could be multiple male or female occupying the resource at once.

To ensure starvation-freedom, rule #2 enforces fairness such that any person will eventually enter the bathroom.

A Jab at the Problem

I have just started (re)learning Java professionally, so I might as well write it in Java.

Mapping the problem to implementation is fairly straight forward. There are two classes of threads – MaleThread and FemaleThread. Both types of threads will try to enter a resource (class) called Bathroom. Bathroom is consisted of two API – enter(Person p) and leave(Person p), where Person is just a data class that indicates holds the gender and thread information.

So an example of a male thread entering the bathroom could be the following:

Person male = new Person(Thread.currentThread(), Gender.MALE);
// do some private business

From rule #1, the Bathroom class has the characteristics of a mutex, where pending threads need to be blocked and unblocked efficiently. For this, I chose LockSupport for thread synchronization. For the fairness requirement in rule #2, ConcurrentLinkedQueue is used for the FIFO behavior.

class Bathroom
  private AtomicInteger numMale = new AtomicInteger(0);
  private AtomicInteger numFemale = new AtomicInteger(0);
  private Queue waiters = new ConcurrentLinkedQueue();

  public void enter(Person person)
    boolean wasInterrupted = false;


    // Block while not first in line, or if there is opposite sex in the
    // bathroom.
    while (waiters.peek() != person
        || getOppositeGenderCount(person.getGender()).get() > 0)

      // ignore interrupts while waiting
      if (Thread.interrupted())
        wasInterrupted = true;

    // Increment the current sex count since this person is first in line
    // and there is no opposite sex in the bathroom.

	// Remove its from the waiting line. This is the linearization point.

	// wake up the next waiter
    Person nextPersonInLine = waiters.peek();
    if (nextPersonInLine != null)

    // reassert interrupt status on exit
    if (wasInterrupted)

  public void leave(Person person)
	// decrement the gender count, and wake up the next waiter

    Person nextPersonInLine = waiters.peek();
    if (nextPersonInLine != null)

  private AtomicInteger getOppositeGenderCount(Gender gender)
    if (gender == Gender.MALE)
      return numFemale;
    else if (gender == Gender.FEMALE)
      return numMale;

    // something is wrong here.
    return numMale;

  private AtomicInteger getGenderCount(Gender gender)
    if (gender == Gender.MALE)
      return numMale;
    else if (gender == Gender.FEMALE)
      return numFemale;
    // something is wrong here.
    return numMale;
public class Person
  public Person(Thread ownerThread, Gender sex)
    this.ownerThread = ownerThread; = sex;
  private final Thread ownerThread;
  private final Gender sex;

  public Thread getOwnerThread()
    return ownerThread;

  public Gender getGender()
    return sex;

public enum Gender

Some Explanation

The bathroom itself is lock-free. From the high level, every person who tries to enter a bathroom goes through a “test and set” process for two conditions – 1. he/she is the first person in the waiting line, and 2. no opposite sex is in the bathroom. As soon as these two conditions are valid, the person will enter the bathroom by incrementing his gender count, and leave the waiting line. This is the linearization point of the enter() method.

The leave() method simply decrements the gender count and notify the next in line to wake up and re-test the conditions to enter the bathroom.

For suspending and waking up threads, LockSupport’s park() and unpark() method were used. The park() method is simple, where threads are suspended until a “permit” is available through the call to unpark(). However, the unpark() function has a subtle yet beautiful post-condition. It guarantees the following:

Make available the permit for the given thread, if it was not already available. If the thread was blocked on park() then it will unblock. Otherwise, its next call to park() is guaranteed not to block.

Because unpark() guarantees unblocking the next call to park(), the bathroom class does not suffer from lost-wakeup.


To prove the code’s correctness, I was able to identify the linearization point of enter() and leave(). It also passes my unit test. However, I do not claim to have complete confident in the correctness of the algorithm. After all, data race detection is a NP hard problem.

I found two similar blog posts on this problem. This solution is different in the sense that only guarantees starvation-freedom (as oppose to fairness, which is more restrictive). Another solution guarantees fairness (also uses FIFO queue), and is written in a lock-based manner.

Performance was not measured here, as it is only a proof of concept.

The bathroom resource is not recursive (not re-entrant). After all, recursive lock is evil.

The source can be downloaded here. (Written with Java 1.6)

Some thoughts on Java

For the past month, I have been working on a server side web service in Java. Although I have used Java in numerous courses throughout college and grad school, this is the first time I have programmed in it professionally.

Compare to C++, Java is a much cleaner language. Here’s a few things I enjoy so far.

  • Importing and exporting libraries is fairly trivial (no declspec and name mangling issues).
  • A memory model that provides visibility and ordering guarantees (unlike C++ volatile)
  • Safe casting without worrying about memory layout (no memory alignment or padding).
  • Garbage collection allows for slightly sloppier exit sequence. (vs. memory corruption)

In general, difficult things I deal with in C++ are still difficult in Java (such as concurrency). But Java’s cleaner design provides an invisible guiding hands to prevent me from shooting my foot.


Deep copying in Java is an unpleasant experience. Java does not provide a default copy constructor, hence all copy constructor must be hand-crafted. The clone interface counter-intuitively provides a shallow copy, and is avoided by best practices.

Over the years with C++, I have formed a habit of cleanly separating business logic and data. Under such a setup, data is coded in form of POD, the default copy constructor (and = operator) provides deep copying for free. In general, deep copying of POD is a trivial operation, and generally has no-throw guarantee.

Unfortunately, Java does not have a well defined concept of POD. In fact, even the simplest object in Java, java.lang.Object, is non-trivial. Each Object is associated with a monitor object, and deep copying a monitor object under a multi-threaded environment is almost never desirable. Since the internal monitor object is not exposed in the Object interface, you can never truly perform a deep copy of any Java objects.

And without an automated POD copy constructors, the copy-and-swap idiom does not transfer as nicely in Java.

More Thoughts

If the lack of deep copy is the only thing I can complain about Java, it is not a bad experience so far.


Heap Performance Counters

I came across an interesting Microsoft Support article on heap performance counters. Apparently there is a registry setting that enables heap counters on Perfmon. This allows users to profile various aspect of heaps in a process.

Perfmon.exe displays these counters when the following registry key is set:

One of the counter that caught my attention is Heap Lock Contention, which is the number of collisions per sec on the heap lock. I learned of heap contention awhile ago from Windows via C/C++, but I have never been able measure it.


In 2009, I wrote some test code to benchmark Low Fragmentation Heap (LFH). Recall that the original test is single-threaded program that randomly allocates and deallocates various size buffers a number of times.

With minor touch-ups, I customized the test code to run with two threads in parallel. So I kicked off the modified test and added a Heap Lock Contention counter on the main process heap.

Create a Heap Lock Contention counter on the main process heap.

The lock contention counter gathered some very interesting results. The test program with default allocator generated about 15 collision per second on the heap lock.

Heap Contention Counter under Performance Monitor

I re-ran the test program to use LFH allocator (switchable through a command line argument). The LFH allocator results in 50% less contention compare to the default allocator in Window XP.

Heap Contention on LFH vs. Default Allocator (lower is better)

Final Thoughts

I could not get this counter to work properly under Window 7. Microsoft mentioned that only Windows Server 2003, Windows Vista, and Windows Server 2008 are enhanced.

If heap lock contention is a problem, Windows via C/C++ recommends to create a separate heap for allocation intensive classes with a custom new/delete operator.

LFH outperforms the default allocator under Window XP. The heap contention counter confirms my original test result in 2009.

Tools: Visual Studio 2008 (VC9), Boost 1.45, Window XP SP3 (32 bit)

The source program can be downloaded here.

Performance Comparison on Reader-Writer Locks

Recently, I have been playing around with reader-writer (RW) locks. I have never encountered RW locks in practice, but I have read that they could be inefficient in practice, and often results in more harm than good.

Recall that traditional mutex ensures that only one thread may enter a critical region. But if the critical region is being written infrequently, it is possible to exploit this concurrency by allowing multiple reader with RW locks.

So when exactly should RW lock be used in place of traditional mutex? To answer this question, I wrote a benchmark program to understand the scalability of RW locks.

Boost shared_mutex Benchmark

Since C++ is my primary programming language at work, I started by picking on shared_mutex of the Boost threading library.

In my benchmark, I focus primarily on two variables – the writer frequency, and the hold time of the mutex.

For implementation, there are 4 worker threads (for my quad-core CPU) working with a critical region that approximate e. At each iteration, one of the threads has a certain probability to become a writer. My goal is to see the performance change as the writing frequency increases.

And to control the hold time of the mutex, each thread will performs a certain number of iterations called E. As E becomes larger, the hold time of the mutex increases.

At E = 1, even when there are zero contention, the overhead completely wipes out any performance gain of the concurrent readers.

E=1 shows the overhead of shared_mutex

At E = 50, the longer hold time pays off slightly under low contention. However, the performance degrades rapidly as contention increases.

E=50, longer hold times allows shared_mutex to scale slightly better.


As you can see, the results are very disappointing. Boost shared_mutex only offers performance gain under extremely low contention with large hold time. The large hold time is unrealistic in practice because most programmers are taught to minimize their critical region.


SRW Lock Benchmark

Since Vista, Microsoft has released a new set of synchronization API called Slim Reader Writer (SRW) Locks. These locks are heavily optimized for performance, but can’t be lock recursively (I hate recursive lock anyway), and is not upgradable.

I was curious to see if SRW performs any better, so I added SRW into my benchmark.


SRW outperforms Boost mutex and shared_mutex even under the shortest hold time.


At longer mutex hold time, SRW degrades similarly to shared_mutex with a lower overhead.

Although SRW offers similar scalability compare to boost shared_mutex, it has lower overhead, outperforms boost shared_mutex in almost all cases.

Final Thoughts

After looking into the implementation of boost shared_mutex, I realize that its lock-free algorithm is complex and tracks many states. This implementation has so much overhead that it is impractical.

SRW offers has far lower overhead, and can be useful under low contention. Unfortunately, it is only available for Vista and beyond.

Neither mutex type offer real performance advantage when contention goes beyond 2%. Somehow, I speculate that Amdahl’s Law is playing a part here. The chart looks very much like the inverse of speedup graph I plotted last year.

The source and datasheet can be download here.

Tools: Visual Studio 2008 (VC9), Boost 1.45

Machine Specification: Intel i5-750 with 4GB of RAM. Window 7 64bit.

Length of a Thread Quantum

In Windows, threads run in units of “quantums”. After a thread completes its quantum, Windows may choose to run another thread based on priority or thread states.

This quantum settings is located in the registry called Win32PrioritySeparation. It is a wacky matrix that is represented in a bitfield.

Window XP and Vista uses short variable quantum settings. Thread owned by a process with a foreground window are assigned 18 quantums, and background window (e.g. services) are assigned 6 quantums. The Window Server edition uses 36 quantums for all threads.

So how long exactly is one quantum?

One Quantum

Although the length of a quantum is not exposed to developers, Windows Internal explained that the value is located in a kernel variable called KiCyclesPerClockQuantum. You can extract the value through Windbg with a command “dd nt!KiCyclesPerClockQuantum  l1“.

Alternatively, the book devised a method to calculate the value manually. Below is a program I wrote following the described algorithm.

	CPdhQuery procInfo_frequency(std::tstring(
		_T("\Processor Information(0,0)\Processor Frequency"))

	// Step 1: Get the CPU speed in MHz
	__int64 cpuSpeedHz = static_cast<__int64>(

	// Step 2: Convert it to Hz
	cpuSpeedHz *= 1000000;

	DWORD timeAdjustment = 0;
	DWORD clockInterval100Ns = 0;
	BOOL timeAdjustmentDisabled = 0;

	// Step 3: Get the frequency of the clock interrupt. This value is
	// dependent on your processor type.

	// Step 4: Get the rate of the clock fires per second.
	double clockIntervalPerSecond =

	// Step 5: Get the number of cycles elapsed per clock interval.
	double cyclesPerClockInterval = cpuSpeedHz * clockIntervalPerSecond;

	// Step 6: A quantum is 1/3 of a clock interval.
	__int64 clockCyclePerQuantum =
		static_cast<__int64>(cyclesPerClockInterval / 3);

	// Step 7: The quantum length in time
	double quantumLengthSec =
		static_cast<double>(clockCyclePerQuantum) /

		<< _T("Clock Cycles Per Quantum = ")
		<< clockCyclePerQuantum
		<< std::endl;

		<< _T("Duration Per Quantum = ")
		<< quantumLengthSec
		<< _T(" second")
		<< std::endl;
catch(CPdhQuery::CException const &e)
	tcout << e.What() << std::endl;
Clock Cycles Per Quantum = 13873688
Duration Per Quantum = 0.00520003 second


The quantum value provides insight on how often a thread may be preempted.

This information can be surprising useful. I recently used it to roughly estimate a thread’s response time, and correctly determined a device driver issue.

The output of my program differs slightly (~3%) from the readings in the kernel. It appears that the processor frequency in performance counter is different from the reading in PRCB.


The full source can be found here.

Tools: Visual Studio 2008, Window 7 64bit, Intel I5-750 (quad core)

shared_ptr and NULL

The interface of shared_ptr is carefully designed such that it has the syntax of a raw C pointer. So naturally, shared_ptr is comparable against NULL.

shared_ptr<SomeClass> sc;
if(sc != NULL)  { } // is it not NULL
if(sc == NULL)  { } // is it NULL

But NULL is really an abused integer. How would you implement such a comparison?

This is C++.  There is always devil in the details.

Obvious, but wrong solution

Attempt #1:

An obvious solution is to implement an operator== and operator != to compare against a pointer type of its type parameter.

template<typename T>
class shared_ptr
{ //...
   bool operator==(T *p) const // compare against T* and check for equality
      if(px_ == p)
         return true;
      return false;
   bool operator!=(T *p) const { /*inverse of == */}
   T* px_;

Why it fails

Yes, this will correctly support the NULL comparison listed above, but there are four other ways in C/C++ to check a pointer for NULL.

The comparison operator fails if the comparison order is reversed, or if implicit boolean conversion is used.

shared_ptr<SomeClass> sc;
if(NULL != sc) {} // no such conversion supported
if(NULL == sc) {}
if(sc) {} // fails the implicit conversion to bool
if(!sc) {}

And it really doesn’t make sense to compare a shared_ptr with a raw pointer.

shared_ptr<SomeClass> sc;
if(rp != sc) {} // doesn't make sense
if(rp == sc) {} // doesn't make sense

So operator== and operator!= provide a poor coverage to this problem. We need something better.

More sophisticated almost solutions

Attempt #2

So what about operator bool? Maybe we can convert the shared_ptr to a boolean type by return false if it is NULL, and return true otherwise.

template<typename T>
class shared_ptr
{ //...
   operator bool() const // conversion to bool
      if(NULL == px_)
         return false; // implicit conversion to false if NULL
      return true; // implicit conversion to true otherwise
   T* px_;

Why it fails

Although this solution supports all six ways of NULL comparison mentioned before, it comes with a bit of baggage.

Thanks to an implicit bool-to-integer promotion, you can now do stuff like this.

shared_ptr<SomeClass> sc;
float f = sc;  // this actually compiles
int i = sc;     // do not want!

Attempt #3

How about operator T*, where shared_ptr implicitly converts to a pointer type of its type parameter?

template<typename T>
class shared_ptr
{ //...
   operator T*() const // conversion to T*
      return px_;
   T* px_;

Why it fails

This solves the problem of implicit integer promotion, but opens a major hole. Now your shared_ptr is actually “leaky” and deletable. This behavior allows shared_ptr to be easily abused and misused.

shared_ptr<SomeClass> sp;
SomeClass *rp;
rp = sp; // uh oh, reference count leak
delete sp; // OMG! heap corruption

The Boost Solution

Here’s the solution chosen by boost library (similar solution also observed in VC10).

template<typename T>
class shared_ptr
{ //...
   typedef T * shared_ptr<T>::*unspecified_bool_type;
   operator unspecified_bool_type() const // never throws
       return px_ == 0? 0: &shared_ptr<T>::px_;
   T* px_;

This solution is very clever. It implicitly converts the shared_ptr into “a pointer to member variable”. Based on the NULLness of the shared_ptr, it will either return 0 or a pointer to member variable of type T*.

With this implementation, shared_ptr manages to support the six ways of checking for NULL, avoids the dangerous comparisons, and has no integer promotion side effects.

Is the boost solution perfect? Of course not. The code is confusing, and you can still do some crazy stuff.

shared_ptr<SomeClass> sp(new SomeClass);

// Grab the shared_ptr's "pointer to its member variable"
shared_ptr<SomeClass>::unspecified_bool_type ubt = sp;

// Extract the shared_ptr's inner pointer member in the most obscure way
SomeClass *innerPointer = sp.*ubt;

Final Thoughts

For such an innocent comparison, the depth of the solution is astonishing. It is amazing to see how far C++ library writers are willing to go to work around the nastiness of the language.

After figuring this out, I later learned that this technique is called the Safe Bool Idiom. (As usual, google is useless if you don’t know what you are looking for).

C++0x will address this mess with the explicit conversion operator.

Convert boost::posix_time::ptime to Windows FILETIME

When writing platform independent libraries at work, I use boost posix_time as the primary mechanism to generate timestamps. But when integrating the platform independent libraries to Windows world, the interface requires everything to be converted Windows FILETIME.

Recall, Windows FILETIME is a 64 bit structure that represents the number of 100-nanosecond intervals since January 1, 1601 (UTC).

Boost posix_time library has an API called from_ftime<ptime>(FILETIME ft), where it can create a ptime object from a Windows FILETIME.

Strangely, it’s counterpart does not exist. In other word, there is no to_ftime.


I really dislike writing this type of basic time conversion routine. It has probably been done before, and I am probably reinventing the wheel (a common disease in my profession).

Believe it or not, I could not find a solution online. At least, I found out I am not the first person to want to do this.

Anyway, here’s one way to do it.

#include &lt;boost/date_time/posix_time/posix_time.hpp&gt;
#include &lt;boost/date_time/gregorian/gregorian.hpp&gt;
#include &lt;windows.h&gt;
#include &lt;boost/cstdint.hpp&gt;

FILETIME PtimeToFileTime(boost::posix_time::ptime const &amp;pt)
	// extract the date from boost::posix_time to SYSTEMTIME
	boost::gregorian::date::ymd_type ymd =;

	st.wYear = ymd.year;
	st.wMonth = ymd.month;
	st.wDay =;
	st.wDayOfWeek =;

	// Now extract the hour/min/second field from time_duration
	boost::posix_time::time_duration td = pt.time_of_day();
	st.wHour = static_cast&lt;WORD&gt;(td.hours());
	st.wMinute = static_cast&lt;WORD&gt;(td.minutes());
	st.wSecond = static_cast&lt;WORD&gt;(td.seconds());

	// Although ptime has a fractional second field, SYSTEMTIME millisecond
	// field is 16 bit, and will not store microsecond. We will treat this
	// field separately later.
	st.wMilliseconds = 0;

	// Convert SYSTEMTIME to FILETIME structure
	SystemTimeToFileTime(&amp;st, &amp;ft);

	// Now we are almost done. The FILETIME has date, and time. It is
	// only missing fractional second.

	// Extract the raw FILETIME into a 64 bit integer.
	boost::uint64_t _100nsSince1601 = ft.dwHighDateTime;
	_100nsSince1601 &lt;&lt;=32;
	_100nsSince1601 |= ft.dwLowDateTime;

	// Add in the fractional second, which is in microsecond * 10 to get
	// 100s of nanosecond
	_100nsSince1601 += td.fractional_seconds()*10;

	// Now put the time back inside filetime.
	ft.dwHighDateTime = _100nsSince1601 &gt;&gt; 32;
	ft.dwLowDateTime = _100nsSince1601 &amp; 0x00000000FFFFFFFF;

	return ft;

And here’s how I verified it.

  1. Create a ptime object, and convert it to FILETIME with the routine above.
  2. Then use from_ftime<ptime>(FILETIME ft) to convert the generated FILETIME into another ptime object.
  3. Verify that the two ptime object is identical.
boost::posix_time::ptime now =

FILETIME ft = PtimeToFileTime(now);

std::cout &lt;&lt; boost::posix_time::to_iso_extended_string(now) &lt;&lt; std::endl;

boost::posix_time::ptime clone =

std::cout &lt;&lt; boost::posix_time::to_iso_extended_string(clone) &lt;&lt; std::endl;


On a side note

The routine PtimeToFileTime does not validate its input.

The year_month_day() routine could contain invalid/uninitialized fields.

SystemTimeToFileTime could fail.

I will leave that as an exercise.

Warning Level per Header File

A common guideline in best coding practices is to never ignore compiler warnings, and always use the highest warning level possible. For Microsoft C/C++ compilers, this mean level 4 (/W4).

But just because your code compiles cleanly under level 4, it doesn’t mean external libraries would too.

Let’s take Boost library as an example. It is arguably the most well written C++ library in the world. Yet it is not cleared of level 4 warning until 1.40.

The truth is that level 4 warning level is harsh (and often silly), and some code are just not fit to make the cut.

Use L3 for the Uglies

For ugly header files that aren’t designed with level 4 warning in mind, just compile them with level 3 with the #pragma warning(push,3) command.

Here’s an example.

#pragma warning( push, 3 ) // boost make_shared has L4 warnings, so use L3.
	#include &lt;boost/make_shared.hpp&gt;
#pragma warning(pop) // resume to original warning level (4)

An often suggested alternative is to use #pragma warning(disable:xyz), where xyz is the warning number.
This solution is clumsy because it requires you to find out every single warning emitted from every external header file, and then disable them one at a time. I am too busy (lazy) for that. 🙂

true != true?

A co-worker was struggling with an urgent bug, and came by my office to ask an odd question.

Is it possible for true != true in C++?

Last time I checked, 1 is equal to 1. So I stopped by her cubical to see this magical event.

Is it true?

She told me that the code has been recompiled from scratch, and both debug and release build exhibit the same behavior.

Variable b is initialized to be true, and Visual Studio run-time checks didn’t catch anything strange.

Stepping through the code in Visual Studio 9, here’s what we saw.

Variable b is true, so it should pass the satisfy the first condition.

The first case failed, and went to the false case instead.

Wow, she’s right. This is quite something.

Diving in

C++ is a language well designed to shoot your foot. In the standard, bool is an integral type that may be 1 or more bytes, and can be either true, false or undefined.

Experience tells me that very likely, b is not true. Visual Studio is not displaying the truth.

To show this, just print out the value of b.

std::cout &lt;&lt; std::hex &lt;&lt; b &lt;&lt; std::endl;

prints 0xcd

Ah ha, so b is an uninitialized variable, and falls under the category of “undefined” in the standard.


Visual Studio does have runtime checks against accessing uninitialized variables, but it can be easily fooled.

Runtime check fails below for VC 8, 9, and 10.

&lt;pre&gt;#include &lt;iostream&gt;

struct SBool {	bool b; };
SBool GetBool()
	SBool s;
	return s;
int main()
	bool b = GetBool().b;
	if(true == b)
		std::cout &lt;&lt; &quot;true&quot;&lt;&lt; std::endl;
		std::cout &lt;&lt; &quot;false&quot; &lt;&lt; std::endl;
	std::cout &lt;&lt; std::hex &lt;&lt; b &lt;&lt; std::endl;

	return 0;

A PDH Helper Class – CPdhQuery

I have been writing a custom profiling tool for a specific Windows application. Windows has various SDK to access available profiling information. Some of the profiling data are available through straightforward APIs in kernel32.dll (e.g. GetThreadTimes). Others can be collected through the PDH interfaces in pdh.lib.

The documentation on the PDH interface has lots of information, but the sparse sample code makes it difficult to put the whole picture together.

Worse yet, some of the sample code are buggy. For example, the PdhGetFormattedCounterArray example can’t handle a context switch query – “\Thread(*)\Context Switches/sec”, and barfs out an error 0xc0000bba.


I wrote a class called CPdhQuery to simplify the PDH interface. The constructor takes in a PDH counter path. The design is RAII, and only has one public function called CollectQueryData. You can call it based on your sampling interval. Any PDH failure will result in an exception with a translated message.

#include &lt;windows.h&gt;
#include &lt;pdh.h&gt;
#include &lt;pdhmsg.h&gt;
#include &lt;string&gt;
#include &lt;map&gt;
#include &lt;sstream&gt;
#include &lt;vector&gt;
#include &lt;tchar.h&gt;
#include &lt;iostream&gt;
#pragma comment(lib, &quot;pdh.lib&quot;)

namespace std
	typedef std::basic_string&lt;TCHAR&gt; tstring;
	typedef std::basic_ostream&lt;TCHAR&gt; tostream;
	typedef std::basic_istream&lt;TCHAR&gt; tistream;
	typedef std::basic_ostringstream&lt;TCHAR&gt; tostringstream;
	typedef std::basic_istringstream&lt;TCHAR&gt; tistringstream;
	typedef std::basic_stringstream&lt;TCHAR&gt; tstringstream;
} // end namespace

#ifdef UNICODE
#define tcout std::wcout
#define tcout std::cout

class CPdhQuery

	// Inner exception class to report error.
	class CException
		CException(std::tstring const &amp; errorMsg) : m_errorMsg(errorMsg)	{}
		std::tstring What() const { return m_errorMsg; }
		std::tstring m_errorMsg;

	//! Constructor
	explicit CPdhQuery(std::tstring const &amp;counterPath)
		: m_pdhQuery(NULL)
		, m_pdhStatus(ERROR_SUCCESS)
		, m_pdhCounter(NULL)
		, m_counterPath(counterPath)
		if (m_pdhStatus = PdhOpenQuery(NULL, 0, &amp;m_pdhQuery))
			throw CException(GetErrorString(m_pdhStatus));

		// Specify a counter object with a wildcard for the instance.
		if (m_pdhStatus = PdhAddCounter(
			throw CException(GetErrorString(m_pdhStatus));

	//! Destructor. The counter and query handle will be closed.
		m_pdhCounter = NULL;
		if (m_pdhQuery)

	//! Collect all the data since the last sampling period.
	std::map&lt;std::tstring, double&gt; CollectQueryData()
		std::map&lt;std::tstring, double&gt; collectedData;

			// Collect the sampling data. This might cause
			// PdhGetFormattedCounterArray to fail because some query type
			// requires two collections (or more?). If such scenario is
			// detected, the while loop will retry.
			if (m_pdhStatus = PdhCollectQueryData(m_pdhQuery))
				throw CException(GetErrorString(m_pdhStatus));

			// Size of the pItems buffer
			DWORD bufferSize= 0;

			// Number of items in the pItems buffer
			DWORD itemCount = 0;


			// Call PdhGetFormattedCounterArray once to retrieve the buffer
			// size and item count. As long as the buffer size is zero, this
			// function should return PDH_MORE_DATA with the appropriate
			// buffer size.
			m_pdhStatus = PdhGetFormattedCounterArray(

			// If the returned value is nto PDH_MORE_DATA, the function
			// has failed.
			if (PDH_MORE_DATA != m_pdhStatus)
				throw CException(GetErrorString(m_pdhStatus));

			std::vector&lt;unsigned char&gt; buffer(bufferSize);
			pdhItems = (PDH_FMT_COUNTERVALUE_ITEM *)(&amp;buffer[0]);

			m_pdhStatus = PdhGetFormattedCounterArray(

			if (ERROR_SUCCESS != m_pdhStatus)

			// Everything is good, mine the data.
			for (DWORD i = 0; i &lt; itemCount; i++)

			pdhItems = NULL;
			bufferSize = itemCount = 0;
		return collectedData;

	//! Helper function that translate the PDH error code into
	//! an useful message.
	std::tstring GetErrorString(PDH_STATUS errorCode)
		HANDLE hPdhLibrary = NULL;
		LPTSTR pMessage = NULL;
		DWORD_PTR pArgs[] = { (DWORD_PTR)m_searchInstance.c_str() };
		std::tstring errorString;

		hPdhLibrary = LoadLibrary(_T(&quot;pdh.dll&quot;));
		if (NULL == hPdhLibrary)
			std::tstringstream ss;
				&lt;&lt; _T(&quot;Format message failed with &quot;)
				&lt;&lt; std::hex &lt;&lt; GetLastError();
			return ss.str();

			std::tstringstream ss;
				&lt;&lt; m_counterPath
				&lt;&lt; _T(&quot; &quot;)
				&lt;&lt; _T(&quot;Format message failed with &quot;)
				&lt;&lt; std::hex
				&lt;&lt; GetLastError()
				&lt;&lt; std::endl;
			errorString = ss.str();
			errorString += m_counterPath;
			errorString += _T(&quot; &quot;);
			errorString += pMessage;

		return errorString;

	PDH_HQUERY m_pdhQuery;
	PDH_STATUS m_pdhStatus;
	PDH_HCOUNTER m_pdhCounter;
	std::tstring m_searchInstance;
	std::tstring m_counterPath;

void DumpMap(std::map&lt;std::tstring, double&gt; const &amp;m)
	std::map&lt;std::tstring, double&gt;::const_iterator itr = m.begin();
	while(m.end() != itr)
		tcout &lt;&lt; itr-&gt;first &lt;&lt; &quot; &quot; &lt;&lt; itr-&gt;second &lt;&lt; std::endl;

void main()
		// uncomment to try different counter paths
		CPdhQuery pdhQuery(
			std::tstring(_T(&quot;\Thread(*)\Context Switches/sec&quot;))
			//std::tstring(_T(&quot;\Thread(firefox/0)\Context Switches/sec&quot;))
			//tstring(L&quot;\Processor(*)\% Processor Time&quot;)
		for(int i=0; i&lt;100; ++i)
	catch (CPdhQuery::CException const &amp;e)
		tcout &lt;&lt; e.What() &lt;&lt; std::endl;


Tested on Window 7 x64, Visual Studio 2008 SP1

Build Type: Unicode and ANSI.