Heap Performance Counters

I came across an interesting Microsoft Support article on heap performance counters. Apparently there is a registry setting that enables heap counters in Perfmon, which allows users to profile various aspects of the heaps in a process.

Perfmon.exe displays these counters when the following registry key is set:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PerfProc\Performance
"DisplayHeapPerfObject"=dword:00000001
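For convenience, the same setting as a .reg file (merge it, then restart Perfmon; this is just the key and value from above in importable form):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PerfProc\Performance]
"DisplayHeapPerfObject"=dword:00000001
```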
One of the counters that caught my attention is Heap Lock Contention, which is the number of collisions per second on the heap lock. I learned about heap contention a while ago from Windows via C/C++, but I had never been able to measure it.

Experiment

In 2009, I wrote some test code to benchmark the Low Fragmentation Heap (LFH). Recall that the original test is a single-threaded program that randomly allocates and deallocates buffers of various sizes a number of times.

With minor touch-ups, I modified the test code to run two threads in parallel. Then I kicked off the modified test and added a Heap Lock Contention counter on the main process heap.
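The modified test looks roughly like this (a sketch using std::thread instead of the era-appropriate Boost threads; the buffer size range and iteration count are placeholders, not the original parameters):

```cpp
#include <cstdlib>
#include <random>
#include <thread>
#include <vector>

// Each worker randomly allocates and frees buffers of various sizes,
// hammering the shared process heap from both threads at once.
void AllocationWorker(unsigned seed, int iterations)
{
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> bufferSize(1, 16384); // placeholder range
    std::vector<void *> live;

    for (int i = 0; i < iterations; ++i)
    {
        if (live.empty() || (rng() & 1))
        {
            live.push_back(std::malloc(bufferSize(rng)));
        }
        else
        {
            std::free(live.back());
            live.pop_back();
        }
    }
    for (void *p : live)
        std::free(p);
}

// Run the allocation workload on two threads in parallel; returns true
// once both threads have joined.
bool RunTwoThreadTest(int iterations)
{
    std::thread t1(AllocationWorker, 1u, iterations);
    std::thread t2(AllocationWorker, 2u, iterations);
    t1.join();
    t2.join();
    return true;
}
```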

Create a Heap Lock Contention counter on the main process heap.


The lock contention counter gathered some very interesting results. With the default allocator, the test program generated about 15 collisions per second on the heap lock.

Heap Contention Counter under Performance Monitor

I re-ran the test program with the LFH allocator (switchable through a command line argument). On Windows XP, the LFH allocator results in 50% less contention compared to the default allocator.

Heap Contention on LFH vs. Default Allocator (lower is better)

Final Thoughts

I could not get this counter to work properly under Windows 7. Microsoft's article mentions that only Windows Server 2003, Windows Vista, and Windows Server 2008 support this enhancement.

If heap lock contention is a problem, Windows via C/C++ recommends creating a separate heap for allocation-intensive classes with custom new/delete operators.

LFH outperforms the default allocator under Windows XP. The heap contention counter confirms my original test results from 2009.

Tools: Visual Studio 2008 (VC9), Boost 1.45, Windows XP SP3 (32-bit)

The source program can be downloaded here.

Performance Comparison on Reader-Writer Locks

Recently, I have been playing around with reader-writer (RW) locks. I have never encountered RW locks in practice, but I have read that they can be inefficient and often do more harm than good.

Recall that a traditional mutex ensures that only one thread may enter a critical region. But if the critical region is written infrequently, an RW lock can exploit this by allowing multiple concurrent readers.
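The pattern looks like this (sketched with C++17's std::shared_mutex, whose interface mirrors Boost's shared_mutex; SharedValue is a made-up example class):

```cpp
#include <mutex>         // std::unique_lock
#include <shared_mutex>  // std::shared_mutex, std::shared_lock

// Many readers may hold the lock at once; a writer excludes everyone.
class SharedValue
{
public:
    int Read() const
    {
        std::shared_lock<std::shared_mutex> lock(mutex_); // reader (shared) lock
        return value_;
    }

    void Write(int v)
    {
        std::unique_lock<std::shared_mutex> lock(mutex_); // writer (exclusive) lock
        value_ = v;
    }

private:
    mutable std::shared_mutex mutex_;
    int value_ = 0;
};
```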

So when exactly should an RW lock be used in place of a traditional mutex? To answer this question, I wrote a benchmark program to understand the scalability of RW locks.

Boost shared_mutex Benchmark

Since C++ is my primary programming language at work, I started by picking on shared_mutex from the Boost threading library.

In my benchmark, I focus primarily on two variables: the writer frequency, and the hold time of the mutex.

For the implementation, there are 4 worker threads (one per core of my quad-core CPU) working with a critical region that approximates e. At each iteration, a thread has a certain probability of becoming a writer. My goal is to see how the performance changes as the writing frequency increases.

To control the hold time of the mutex, each thread performs a certain number of iterations, called E, inside the critical region. As E becomes larger, the hold time of the mutex increases.
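Putting the two knobs together, each worker's inner loop looks roughly like this (a sketch using std::shared_mutex rather than the Boost types; writeProbability and E correspond to the two variables above, and the e-series sum simply stands in for the work inside the critical region):

```cpp
#include <mutex>
#include <random>
#include <shared_mutex>

// Approximate e with E terms of the series 1/0! + 1/1! + ... -- this is
// the "work"; a larger E means a longer hold time on the lock.
double ApproximateE(int E)
{
    double e = 1.0, term = 1.0;
    for (int n = 1; n <= E; ++n)
    {
        term /= n;
        e += term;
    }
    return e;
}

// One benchmark iteration: with probability writeProbability the thread
// takes the exclusive lock and writes; otherwise it reads under a
// shared lock, concurrently with other readers.
void Iterate(std::shared_mutex &m, double &shared, int E,
             double writeProbability, std::mt19937 &rng)
{
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    if (dist(rng) < writeProbability)
    {
        std::unique_lock<std::shared_mutex> lock(m);
        shared = ApproximateE(E);
    }
    else
    {
        std::shared_lock<std::shared_mutex> lock(m);
        volatile double sink = shared + ApproximateE(E);
        (void)sink;
    }
}
```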

At E = 1, even when there is zero contention, the overhead completely wipes out any performance gain from the concurrent readers.

E=1 shows the overhead of shared_mutex

At E = 50, the longer hold time pays off slightly under low contention. However, performance degrades rapidly as contention increases.

E=50: longer hold times allow shared_mutex to scale slightly better.

 

As you can see, the results are very disappointing. Boost shared_mutex only offers a performance gain under extremely low contention with a large hold time. Such a large hold time is unrealistic in practice, because most programmers are taught to minimize their critical regions.

 

SRW Lock Benchmark

Since Vista, Microsoft has offered a new set of synchronization APIs called Slim Reader/Writer (SRW) Locks. These locks are heavily optimized for performance, but they cannot be acquired recursively (I hate recursive locks anyway) and are not upgradable.

I was curious to see if SRW performs any better, so I added SRW into my benchmark.

 

SRW outperforms Boost mutex and shared_mutex even under the shortest hold time.

 

At longer mutex hold times, SRW degrades similarly to shared_mutex, but with lower overhead.

Although SRW offers scalability similar to Boost shared_mutex, its lower overhead lets it outperform shared_mutex in almost all cases.

Final Thoughts

After looking into the implementation of Boost shared_mutex, I realized that its lock-free algorithm is complex and tracks many states. This implementation has so much overhead that it is impractical.

SRW has far lower overhead, and can be useful under low contention. Unfortunately, it is only available on Vista and beyond.

Neither mutex type offers a real performance advantage once contention goes beyond 2%. I speculate that Amdahl's Law is playing a part here; the chart looks very much like the inverse of the speedup graph I plotted last year.
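As a rough sanity check (loosely treating the contended fraction as the serial fraction in Amdahl's Law, which is a simplification), even 2% serialization caps the speedup well below 4x on four cores:

```cpp
// Amdahl's Law: speedup = 1 / (s + (1 - s) / N), where s is the serial
// fraction of the work and N is the number of cores.
double AmdahlSpeedup(double serialFraction, int cores)
{
    return 1.0 / (serialFraction + (1.0 - serialFraction) / cores);
}
// With s = 0.02 and N = 4, the ceiling is about 3.77x of an ideal 4x.
```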

The source and datasheet can be downloaded here.

Tools: Visual Studio 2008 (VC9), Boost 1.45

Machine Specification: Intel i5-750 with 4GB of RAM, Windows 7 64-bit.

Length of a Thread Quantum

In Windows, threads run in units called "quantums". After a thread completes its quantum, Windows may choose to run another thread based on priority or thread state.

This quantum setting is stored in a registry value called Win32PrioritySeparation. It is a wacky matrix represented as a bitfield.

Windows XP and Vista use the short, variable quantum setting. Threads owned by the process with the foreground window are assigned 18 quantum units, and background processes (e.g. services) are assigned 6. The Windows Server editions use 36 quantum units for all threads.
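For reference, the value lives here (I am only noting the location; the DWORD encodes the bitfield and its meaning varies, so consult Windows Internals before changing it):

```
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\PriorityControl
"Win32PrioritySeparation" (REG_DWORD)
```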

So how long exactly is one quantum?

One Quantum

Although the length of a quantum is not exposed to developers, Windows Internals explains that the value is stored in a kernel variable called KiCyclesPerClockQuantum. You can extract the value through WinDbg with the command "dd nt!KiCyclesPerClockQuantum l1".

Alternatively, the book describes a method to calculate the value manually. Below is a program I wrote following that algorithm.

try
{
	CPdhQuery procInfo_frequency(std::tstring(
		_T("\\Processor Information(0,0)\\Processor Frequency"))
		);

	// Step 1: Get the CPU speed in MHz
	__int64 cpuSpeedHz = static_cast<__int64>(
		procInfo_frequency.CollectSingleData()
		);

	// Step 2: Convert it to Hz
	cpuSpeedHz *= 1000000;

	DWORD timeAdjustment = 0;
	DWORD clockInterval100Ns = 0;
	BOOL timeAdjustmentDisabled = 0;

	// Step 3: Get the clock interrupt interval (in 100-ns units). This
	// value is dependent on your processor type.
	GetSystemTimeAdjustment(
		&timeAdjustment,
		&clockInterval100Ns,
		&timeAdjustmentDisabled);

	// Step 4: Convert the clock interval to seconds.
	double clockIntervalSeconds =
		static_cast<double>(clockInterval100Ns)/10000000;

	// Step 5: Get the number of cycles elapsed per clock interval.
	double cyclesPerClockInterval = cpuSpeedHz * clockIntervalSeconds;

	// Step 6: A quantum is 1/3 of a clock interval.
	__int64 clockCyclePerQuantum =
		static_cast<__int64>(cyclesPerClockInterval / 3);

	// Step 7: The quantum length in time
	double quantumLengthSec =
		static_cast<double>(clockCyclePerQuantum) /
			static_cast<double>(cpuSpeedHz);

	tcout
		<< _T("Clock Cycles Per Quantum = ")
		<< clockCyclePerQuantum
		<< std::endl;

	tcout
		<< _T("Duration Per Quantum = ")
		<< quantumLengthSec
		<< _T(" second")
		<< std::endl;
}
catch(CPdhQuery::CException const &e)
{
	tcout << e.What() << std::endl;
}
Clock Cycles Per Quantum = 13873688
Duration Per Quantum = 0.00520003 second
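The numbers can be double-checked with assumed inputs: my counter reports roughly 2.668 GHz, and the common client clock interval is 15.6001 ms (clockInterval100Ns = 156001). Note that the CPU speed actually cancels out of the final duration; the quantum is simply one third of the clock interval:

```cpp
// Reproduce the quantum arithmetic from the program above with assumed
// inputs (a 2.668 GHz CPU and a 156001 x 100-ns clock interval).
double QuantumLengthSec(double cpuSpeedHz, unsigned clockInterval100Ns)
{
    double clockIntervalSec = clockInterval100Ns / 10000000.0;
    double cyclesPerClockInterval = cpuSpeedHz * clockIntervalSec;
    double cyclesPerQuantum = cyclesPerClockInterval / 3.0;
    return cyclesPerQuantum / cpuSpeedHz; // cpuSpeedHz cancels: interval / 3
}
```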

Thoughts

The quantum value provides insight into how often a thread may be preempted.

This information can be surprisingly useful. I recently used it to roughly estimate a thread's response time, and correctly diagnosed a device driver issue.

The output of my program differs slightly (~3%) from the reading in the kernel. It appears that the processor frequency reported by the performance counter differs from the reading in the PRCB.

Download

The full source can be found here.

Tools: Visual Studio 2008, Windows 7 64-bit, Intel i5-750 (quad core)

shared_ptr and NULL

The interface of shared_ptr is carefully designed such that it has the syntax of a raw C pointer. So naturally, shared_ptr is comparable against NULL.

shared_ptr<SomeClass> sc;
//...
if(sc != NULL)  { } // is it not NULL
if(sc == NULL)  { } // is it NULL

But NULL is really an abused integer. How would you implement such a comparison?

This is C++; the devil is always in the details.

Obvious, but wrong solution

Attempt #1:

An obvious solution is to implement operator== and operator!= to compare against a pointer to its type parameter.

template<typename T>
class shared_ptr
{ //...
   bool operator==(T *p) const // compare px_ against p for equality
   {
      return px_ == p;
   }
   bool operator!=(T *p) const { return px_ != p; }
   T* px_;
};

Why it fails

Yes, this correctly supports the NULL comparisons listed above, but there are four other ways in C/C++ to check a pointer for NULL.

The comparison operator fails if the comparison order is reversed, or if implicit boolean conversion is used.

shared_ptr<SomeClass> sc;
//...
if(NULL != sc) {} // no such conversion supported
if(NULL == sc) {}
if(sc) {} // fails the implicit conversion to bool
if(!sc) {}

And it really doesn’t make sense to compare a shared_ptr with a raw pointer.

shared_ptr<SomeClass> sc;
SomeClass *rp;
//...
if(sc != rp) {} // compiles, but doesn't make sense
if(sc == rp) {} // compiles, but doesn't make sense

So operator== and operator!= provide poor coverage of the problem. We need something better.

More sophisticated almost solutions

Attempt #2

So what about operator bool? Maybe we can convert the shared_ptr to a boolean by returning false if it is NULL, and true otherwise.

template<typename T>
class shared_ptr
{ //...
   operator bool() const // conversion to bool
   {
      if(NULL == px_)
         return false; // implicit conversion to false if NULL
      return true; // implicit conversion to true otherwise
   }
   T* px_;
};

Why it fails

Although this solution supports all six ways of NULL comparison mentioned above, it comes with some baggage.

Thanks to implicit bool-to-integer promotion, you can now do stuff like this.

shared_ptr<SomeClass> sc;
float f = sc;  // this actually compiles
int i = sc;     // do not want!

Attempt #3

How about operator T*, where shared_ptr implicitly converts to a pointer to its type parameter?

template<typename T>
class shared_ptr
{ //...
   operator T*() const // conversion to T*
   {
      return px_;
   }
   T* px_;
};

Why it fails

This solves the problem of implicit integer promotion, but opens a major hole: the shared_ptr is now "leaky" and deletable. This behavior allows shared_ptr to be easily abused and misused.

shared_ptr<SomeClass> sp;
SomeClass *rp;
rp = sp; // uh oh, reference count leak
delete sp; // OMG! heap corruption

The Boost Solution

Here's the solution chosen by the Boost library (a similar solution is also found in VC10).

template<typename T>
class shared_ptr
{ //...
   typedef T * shared_ptr<T>::*unspecified_bool_type;
   operator unspecified_bool_type() const // never throws
   {
       return px_ == 0? 0: &shared_ptr<T>::px_;
   }
   T* px_;
};

This solution is very clever. It implicitly converts the shared_ptr into a "pointer to member variable". Based on the NULLness of the shared_ptr, it returns either 0 or a pointer to its px_ member of type T*.

With this implementation, shared_ptr manages to support the six ways of checking for NULL, avoids the dangerous comparisons, and has no integer promotion side effects.

Is the boost solution perfect? Of course not. The code is confusing, and you can still do some crazy stuff.

shared_ptr<SomeClass> sp(new SomeClass);

// Grab the shared_ptr's "pointer to its member variable"
shared_ptr<SomeClass>::unspecified_bool_type ubt = sp;

// Extract the shared_ptr's inner pointer member in the most obscure way
SomeClass *innerPointer = sp.*ubt;

Final Thoughts

For such an innocent comparison, the depth of the solution is astonishing. It is amazing to see how far C++ library writers are willing to go to work around the nastiness of the language.

After figuring this out, I later learned that this technique is called the Safe Bool Idiom. (As usual, Google is useless if you don't know what you are looking for.)

C++0x will address this mess with the explicit conversion operator.
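Here is a sketch of that fix (the toy_ptr wrapper below is a made-up, non-owning illustration; std::shared_ptr pairs an explicit operator bool with separate nullptr comparison overloads):

```cpp
// A toy non-owning wrapper demonstrating the explicit conversion operator.
template <typename T>
class toy_ptr
{
public:
    explicit toy_ptr(T *p = 0) : px_(p) {}

    // 'explicit' still allows if(p) and if(!p) -- boolean contexts use a
    // contextual conversion -- but it blocks the implicit promotions that
    // plagued the earlier attempts:
    //   int i = p;    // does not compile
    //   float f = p;  // does not compile
    explicit operator bool() const { return px_ != 0; }

private:
    T *px_;
};
```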

Convert boost::posix_time::ptime to Windows FILETIME

When writing platform-independent libraries at work, I use boost posix_time as the primary mechanism for generating timestamps. But when integrating these platform-independent libraries into the Windows world, the interface requires everything to be converted to Windows FILETIME.

Recall that Windows FILETIME is a 64-bit structure that represents the number of 100-nanosecond intervals since January 1, 1601 (UTC).
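As an aside on that epoch: the gap between the FILETIME epoch (1601-01-01) and the Unix epoch (1970-01-01) is a fixed constant of 116444736000000000 100-ns intervals, and it can be derived straight from the Gregorian leap-year rules. This sketch is not part of the conversion routine, just a sanity check on the epoch arithmetic:

```cpp
// Derive the offset between the FILETIME epoch (1601-01-01) and the
// Unix epoch (1970-01-01) from the Gregorian leap-year rules.
long long FiletimeUnixEpochOffset100Ns()
{
    long long days = 0;
    for (int year = 1601; year < 1970; ++year)
    {
        bool leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
        days += leap ? 366 : 365;
    }
    return days * 86400LL * 10000000LL; // days -> seconds -> 100-ns units
}
```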

The Boost posix_time library has an API called from_ftime<ptime>(FILETIME ft), which creates a ptime object from a Windows FILETIME.

Strangely, its counterpart does not exist. In other words, there is no to_ftime.

Code

I really dislike writing this type of basic time-conversion routine. It has probably been done before, and I am probably reinventing the wheel (a common disease in my profession).

Believe it or not, I could not find a solution online. At least, I found out I am not the first person to want to do this.

Anyway, here’s one way to do it.

#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/date_time/gregorian/gregorian.hpp>
#include <windows.h>
#include <boost/cstdint.hpp>

FILETIME PtimeToFileTime(boost::posix_time::ptime const &pt)
{
	// extract the date from boost::posix_time to SYSTEMTIME
	SYSTEMTIME st;
	boost::gregorian::date::ymd_type ymd = pt.date().year_month_day();

	st.wYear = ymd.year;
	st.wMonth = ymd.month;
	st.wDay = ymd.day;
	st.wDayOfWeek = pt.date().day_of_week();

	// Now extract the hour/min/second field from time_duration
	boost::posix_time::time_duration td = pt.time_of_day();
	st.wHour = static_cast<WORD>(td.hours());
	st.wMinute = static_cast<WORD>(td.minutes());
	st.wSecond = static_cast<WORD>(td.seconds());

	// Although ptime has a fractional second field, SYSTEMTIME millisecond
	// field is 16 bit, and will not store microsecond. We will treat this
	// field separately later.
	st.wMilliseconds = 0;

	// Convert SYSTEMTIME to FILETIME structure
	FILETIME ft;
	SystemTimeToFileTime(&st, &ft);

	// Now we are almost done. The FILETIME has date, and time. It is
	// only missing fractional second.

	// Extract the raw FILETIME into a 64 bit integer.
	boost::uint64_t _100nsSince1601 = ft.dwHighDateTime;
	_100nsSince1601 <<=32;
	_100nsSince1601 |= ft.dwLowDateTime;

	// Add in the fractional second, which is in microsecond * 10 to get
	// 100s of nanosecond
	_100nsSince1601 += td.fractional_seconds()*10;

	// Now put the time back inside filetime.
	ft.dwHighDateTime = _100nsSince1601 >> 32;
	ft.dwLowDateTime = _100nsSince1601 & 0x00000000FFFFFFFF;

	return ft;
}

And here’s how I verified it.

  1. Create a ptime object, and convert it to FILETIME with the routine above.
  2. Then use from_ftime<ptime>(FILETIME ft) to convert the generated FILETIME into another ptime object.
  3. Verify that the two ptime objects are identical.

boost::posix_time::ptime now =
	boost::posix_time::microsec_clock::universal_time();

FILETIME ft = PtimeToFileTime(now);

std::cout << boost::posix_time::to_iso_extended_string(now) << std::endl;

boost::posix_time::ptime clone =
	boost::posix_time::from_ftime<boost::posix_time::ptime>(ft);

std::cout << boost::posix_time::to_iso_extended_string(clone) << std::endl;

Output:
2011-02-04T06:09:30.723805
2011-02-04T06:09:30.723805

On a side note

The routine PtimeToFileTime does not validate its input.

The year_month_day() routine could contain invalid/uninitialized fields.

SystemTimeToFileTime could fail.

I will leave that as an exercise.