Vector, Map, and the Constant k

Here is what I was taught in my data structure class.

  • Vector runs in O(n) for find, insert, and erase. O(1) if you only insert and erase at the end. O(1) for random access.
  • Map uses a balanced tree structure and runs in O(log n) for find, insert, and erase.
  • There is a constant k in front of the big-O notation. Since k is a constant, O(log n) will beat O(n) after some value of n.

But is it really that simple?

Not So Constant

In Cliff Click’s video on modern hardware, he showed that latency from cache misses is the dominating factor in today’s software performance.

To show this, we can write up a simple function that performs linear search on different sizes of arrays. To keep it fair, we will loop back to the beginning of the array to normalize the number of iterations.

struct OneKBChunk // 1KB, no padding
{
	int key;
	char buffer[1020];
};
bool LinearFind(unsigned int kBytes)
{
	OneKBChunk * chunks = new OneKBChunk[kBytes]; //array of 1KB chunks
	bool found = false;
	int index = 0;
	for(unsigned int i=0; i<10000000; ++i) //arbitrary value
	{
		// prevent compiler from optimizing away the loop
		if(chunks[index].key == 0xdeadbeef)
			found = true;
		// loop back to the beginning of the array
		++index;
		if(index >= kBytes)
			index = 0;
	}
	delete [] chunks;
	return found;
}

Here are the timing on various array sizes.

Red dots mark the different layers of cache.

The processor on my machine is an Intel I5-750. It is a quad-core processor with 32KB of L1, 256 KB of L2, and 8MB of L3.

The graph demonstrates the penalty on different level of cache misses.

We know that linear search runs in k*O(n). In reality, k really depends on n.

Map and Cache Misses

As Raymond Chen said, vector is about the most cache friendly data structure out there. Trees like data structure uses pointers heavily, and have horrible cache locality.

To see the impact of poor cache locality, here is an example that searches the map.

The program is run twice, once without padding, and another with 2KB padding between each insertion. The padding is used to create some distance between each item inserted into the map.

void MapFind(unsigned int kBytes, int timesToRun)
{
	std::map<int, OneKBChunk *> chunks;
	//std::vector<OneKBChunk*> padding;

	for(unsigned int n=0; n<kBytes; ++n)
	{
		chunks.insert(std::make_pair(n, new OneKBChunk));

		//2KB padding in between each insert to the tree
		//for(unsigned int p=0; p<2; ++p)
		//	padding.push_back(new OneKBChunk);
	}
	// run find() arbitrary number of times
	for(unsigned int i=0; i<timesToRun; ++i)
		chunks.find(0xdeadbeef);

	// .. delete padding here
}

The elements of the maps in both run are identical, and were run in same number of iterations.

The graph shows the impact of cache misses on map.

Again, similar to the example before, the constant k isn’t really constant.

In reality, the cache misses with map are far more unpredictable than my simple example. As the program runs longer, the items in the map could end up very far apart in memory.

Conclusion

The constant k includes cache effects, which is not nearly as constant as as how I was taught in my data structure class.

On a side note, my Computer Architecture professor once said “The understanding of cache locality will determine whether you are a $50K programmer or a $100K programmer (2002 dollar)”.

Source

The source and the spreadsheet can be downloaded here.

Const-Overcorrectness

Const-correctness is popular among C++ programmers. Most believe that it is a good thing.

The general idea is simple. Things that are immutable should be declared so. This way, you can prevent error during compile-time.

But C++ is full of pitfalls. There are little guidance on how const-correctness should be achieved.

Recently, I have run into code that are over-corrected.

Just Put It Everywhere

To some, the idea of const-correctness seems to be achieved by simply adding const everywhere.

Here is an example. (rewritten for demonstration purposes)

// simple structure that holds location information
class CLocationData
{
public:
	const float GetAltitude() const;
	const float GetLatitude() const;
	const float GetLongitude() const;
	const std::string GetName() const;

	// Comparison predicate. No complain on it being member func.
	bool operator()(
		CLocationData const *const lhs,
		CLocationData const *const rhs) const;
};

There are 13 use of const here, and 6 of them are misused in my opinion.

Const-Uselessness

Const provides restrictions. It tells you whether or not mutation may occurs.

Consider the member function CLocationData::GetAltitude() from before.

const float GetAltitude() const;

The const on the right is a qualifier that applies to CLocationData::GetAltitude().

In design-by-contract perspective, it provides a powerful promiseĀ  – by calling GetAltitude(), the instance of CLocationData pointed by the this pointer will not be mutated.*

This is a great use of const.

The const on the left is a qualifier that applies to float that is returned by value. This implies the float that is copied and returned is immutable.

In the perspective of the caller, it provides absolutely no value because the float was copied, and can not cause any mutation in the internals of CLocationData.

// user can still do this
float a = CLocationData().GetAltitude();

This usage of const completely useless except for some obscure scenario, and only leads of confusion.

* Assuming mutable is not used.

Oversharing of Implementation Details

Here is another use of const in the example.

	bool operator()(
		CLocationData const *const lhs,
		CLocationData const *const rhs) const;

Again, the const on the right indicates that the function is purely used as comparison, and does not mutate the data behind the this pointer. This is good.

So what about about CLocationData const *const lhs?

When it comes to pointer value, you read the declaration from right-to-left. So this is read as – the variable lhs is a constant pointer to a constant CLocationData object.

The const on the left says that the data pointed by lhs is immutable. This is a powerful and important promise, therefore a good use of const.

The const on the right says that the pointer value itself will not change. To the caller’s perspective, it is completely useless because the pointer value is “copied” in the first place, and can not have any side effects on the caller side.

Instead, the const on the right is a restriction imposed on the implementation of CLocationData::operator(). This is an implementation detail that does not need to be shared, and therefore it is a bad use of const.

Conclusion

When returning by value on non-pointer and non-reference types, applying const qualifier on the return value is almost always redundant.

Const imposes a restriction. If this restriction does not apply to the caller, don’t apply it. There is a good chance that it is just implementation details. In design-by-contract perspective, the contract is unnecessarily verbose.

Here’s the example after it has been sanitized.

// simple structure that holds location information
class CLocationData
{
public:
	float GetAltitude() const;
	float GetLatitude() const;
	float GetLongitude() const;
	std::string GetName() const;

	// Comparison predicate. No complain on it being member func.
	bool operator()(
		CLocationData const *lhs,
		CLocationData const *rhs) const;
};