Protobuf over JNI

On a recent project, I have been working with JNI to wrap up a C++ library into Java. Like most usages of JNI, it is not for performance, but for compatibility.

JNI itself is a beast. It is quite a challenge to get both the performance and the correctness aspects of its APIs right. While its programming style is close to C, exceptions needs to be checked frequently. Its API naming schemes are full of misleading traps that often lead to memory leaks.

It is so complicated that if you want to pass data through JNI, you want to stick with primitive types. That’s because passing complex data structure into JNI is rather painful.

Unfortunately, my project requires passing complex data structures from Java into C++. To solve this problem, I turned to Protobuf for help.

Pojo Over JNI

To get a taste of basic JNI, here’s an example. Say I want to pass the following data structure from Java into C++ through JNI.

// A POJO that we pass from Java into JNI layer.
public class CustomPojo {
   private String s;
   private int i;
   private double d;
   private boolean b;
   private String[] sa;
   private int[] ia;

   public String getS() { return s; }
   public void setS(String s) {this.s = s;}
   public int getI() {return i;}
   public void setI(int i) {this.i = i;}


Since the members within CustomPojo are private, it requires the native side to reach back through individual method calls. Here’s an example of how it would look in C++.

// A set of cached class and method IDs used for optimization.
   jclass customPojoClass;
   jmethodID midGetI;
   jmethodID midGetD;
   jmethodID midIsB;
   jmethodID midGetS;
// Cache all the necessary fields for the first time.
void init(JNIEnv *env, jobject customPojoObject)
   customPojoClass = env->GetObjectClass(customPojoObject);
   midGetI = env->GetMethodID(customPojoClass, "getI", "()I");
   midGetD = env->GetMethodID(customPojoClass, "getD", "()D");
   midIsB = env->GetMethodID(customPojoClass, "isB", "()Z");
   midGetS = env->GetMethodID(
     customPojoClass, "getS", "()Ljava/lang/String;");
   // This is not productive level code. In practice, rigorous error checking
   // is needed here per JNI best practice.

JNIEXPORT void JNICALL Java_com_askldjd_blog_HelloWorld_passCustomArgumentIn(
  JNIEnv *env,
  jobject obj,
  jobject customPojoObject)
  if(customPojoClass == NULL) { init(env, customPojoObject); }

  int i =  env->CallIntMethod(customPojoObject, midGetI);
  double d =  env->CallDoubleMethod(customPojoObject, midGetD);
  bool b =  (bool)env->CallBooleanMethod(customPojoObject, midIsB);
  jstring stringObj = (jstring)env->CallObjectMethod(
    customPojoObject, midGetS);
  char const *nativeString = env->GetStringUTFChars(stringObj, 0);
  env->ReleaseStringUTFChars(stringObj, nativeString);
  // This is not productive level code. In practice, rigorous error checking
  // is needed here per JNI best practice.

To follow the best performance practice, JNI requires caching the class and method IDs. And to access each fields, we need a seperate JNI call to invoke the individual accessors.

As this point, it should be clear that passing complex data through JNI is non-trivial.

Protobuf over JNI

Alternatively, we can use Protobuf as the JNI messaging medium between Java and C++. This way, the communication channel through JNI is strictly through byte arrays. In this approach, the JNI complexity is reduced to a simple byte array access. Therefore the code verbosity and the potential for programming error is drastically reduced,

Here is the same example as above, but with Protobuf over JNI. First, we redefine CustomPojo into a Protobuf message.

option java_outer_classname = "CustomProtoPojo";
message PojoMessage
   required string s = 1;
   required int32 i = 2;
   required double d = 3;
   required bool b = 4;
   repeated string sa = 5;
   repeated int32 ia = 6;

Now instead of passing a complicated data structure through the JNI interface, we can encode the Protobuf message into a byte array through JNI, and decode it in C++.

// Interface definition for javah
public class JniPojo{
   static { System.loadLibrary("jni");   }
   public native void passProtoArgumentIn(byte[] byteArray);
// Pass a protobuf encoded byte array into JNI layer.
private static void testProtoJNI() {
   JniPojo jniPojo= new JniPojo();
   Builder pb = CustomProtoPojo.PojoMessage.newBuilder()
      .setS("Secret POJO message").setI(i).setD(42.5566)
   byte[] ba =;

// Take in a protobuf encoded byte array and decode it into data structure.
JNIEXPORT void JNICALL Java_com_askldjd_blog_HelloWorld_passProtoArgumentIn(
   JNIEnv *env,
   jobject obj,
   jbyteArray buffer)
   using namespace com::askldjd::blog;
   PojoMessage pm;
   jbyte *bufferElems = env->GetByteArrayElements(buffer, 0);
   int len = env->GetArrayLength(buffer);
   try {
      // handle message here
   } catch (...) {}
   env->ReleaseByteArrayElements(buffer, bufferElems, JNI_ABORT);
  // This is not productive level code. In practice, rigorous error checking
  // is needed here per JNI best practice.


Conceptually, protobuf over JNI should be more expensive. After all, protobuf is first encoded in Java, deep copied in JNI, and then decoded in C++. In practice however, the performance of the protobuf-JNI approach is 7% faster than the POJO-JNI approach.


Pojo-JNI vs. Protobuf-JNI over 10 million JNI calls.


Protobuf is a good medium to pass complex data structure through JNI. Compare to the handcrafted reach-back JNI code, protobuf over JNI has far lower code complexity while having equal or better performance.

This approach is great for passing low volume traffic of complex data over JNI. For high volume traffic, it is best to avoid complex data altogether and stay within primitive types.

This only covers synchronous JNI call. Asynchronous JNI callback is a topic (nightmare) for another day.


Source files for the benchmark can be downloaded here.

Tools: Win7 64bit, Java7u10, VS2008, protobuf 2.4.1

Windows Wildcard Path Expansion

For the past few days, I’ve been working on a small Protobuf Compiler plugin. During the course of development, I was stuck on a trivial yet annoying problem – path expansion.

This trivial problem led me to a wild goose chase. From branch diff’ing to debugging, it took me several days to figure this out.

Long Story Short

Here’s a small program called wildcard_input.


int main(int argc, char **argv)
	for(int i=0; i<argc; ++i)
		std::cout << argv[i] << std::endl;
	return 0;

Now I invoke the following command.

wildcard_input *

This is the output from Win7 and Ubuntu 12.4.

Running from Window 7


Running from Ubuntu 12.04

Apparently path expansion is performed through the shell by default under Linux, whereas it is left to the program to handle under Windows. I prefer the Linux behavior, but others might disagree.

Windows Path Expansion

To get the automatic path expansion behavior in Windows, you need to add a linker options/link setargv.obj in Visual Studio.


With this option, wildcard paths are now expanded properly.


Same program, compiled with the new linker option /link setargv.obj



  1. Filed misleading bug to the Protobuf team.
  2. Diff’ed 4 different branches for code delta
  3. Hours of proto compiler debugging and Stackoverflow browsing.

… Alan

STL and erase()

The inconsistency of erase() method in STL containers has always bothered me.

Say if you would want to loop through a std::list, print its items, and erase the item from the list, you could do the following.

std::list<int> l(6); // add in 6 zeros
std::list<int>::iterator itr = l.begin();
while(l.end() != itr)
	std::cout << *itr << std::endl;
	itr = l.erase(itr);

Pretty straight forward.

The erase() method will remove the item pointed by the current iterator, and move the iterator to the next element.

Now what if you want to do the same to a std::set?

std::set<int> s; // add in 6 zeros
//do some initialization here
std::set<int>::iterator s_itr = s.begin();
while(s.end() != s_itr)
	std::cout << *s_itr << std::endl;
	s_itr = s.erase(s_itr); // COMPILER ERROR!

Accck! This won’t compile under gcc because erase() method in an associative container returns void. To work around the problem, you need to use the post-increment operator of the iterator within the erase() method.

std::set<int> s; // add in 6 zeros
//do some initialization here
std::set<int>::iterator s_itr = s.begin();
while(s.end() != s_itr)
	std::cout << *s_itr << std::endl;
	s.erase(s_itr++); // use post-increment

In C++, even erasing an item from a STL container is subtle and error-proning.

The Standard

To make matter worse, Visual Studio’s erase() method implementation violates the standard, and is consistent across all containers. This confuses a lot of people.

As far as I can see, this inconsistency has been deemed as a defect by LGW since 1999. Hell, even Josuttis mentioned a few dissatisfactions in his infamous STL Bible.

In the current C++0x standard, it looks like this issue will finally be put to rest. So… just a few more years of pain before the next compiler upgrade.

Performance Comparison on Reader-Writer Locks

Recently, I have been playing around with reader-writer (RW) locks. I have never encountered RW locks in practice, but I have read that they could be inefficient in practice, and often results in more harm than good.

Recall that traditional mutex ensures that only one thread may enter a critical region. But if the critical region is being written infrequently, it is possible to exploit this concurrency by allowing multiple reader with RW locks.

So when exactly should RW lock be used in place of traditional mutex? To answer this question, I wrote a benchmark program to understand the scalability of RW locks.

Boost shared_mutex Benchmark

Since C++ is my primary programming language at work, I started by picking on shared_mutex of the Boost threading library.

In my benchmark, I focus primarily on two variables – the writer frequency, and the hold time of the mutex.

For implementation, there are 4 worker threads (for my quad-core CPU) working with a critical region that approximate e. At each iteration, one of the threads has a certain probability to become a writer. My goal is to see the performance change as the writing frequency increases.

And to control the hold time of the mutex, each thread will performs a certain number of iterations called E. As E becomes larger, the hold time of the mutex increases.

At E = 1, even when there are zero contention, the overhead completely wipes out any performance gain of the concurrent readers.

E=1 shows the overhead of shared_mutex

At E = 50, the longer hold time pays off slightly under low contention. However, the performance degrades rapidly as contention increases.

E=50, longer hold times allows shared_mutex to scale slightly better.


As you can see, the results are very disappointing. Boost shared_mutex only offers performance gain under extremely low contention with large hold time. The large hold time is unrealistic in practice because most programmers are taught to minimize their critical region.


SRW Lock Benchmark

Since Vista, Microsoft has released a new set of synchronization API called Slim Reader Writer (SRW) Locks. These locks are heavily optimized for performance, but can’t be lock recursively (I hate recursive lock anyway), and is not upgradable.

I was curious to see if SRW performs any better, so I added SRW into my benchmark.


SRW outperforms Boost mutex and shared_mutex even under the shortest hold time.


At longer mutex hold time, SRW degrades similarly to shared_mutex with a lower overhead.

Although SRW offers similar scalability compare to boost shared_mutex, it has lower overhead, outperforms boost shared_mutex in almost all cases.

Final Thoughts

After looking into the implementation of boost shared_mutex, I realize that its lock-free algorithm is complex and tracks many states. This implementation has so much overhead that it is impractical.

SRW offers has far lower overhead, and can be useful under low contention. Unfortunately, it is only available for Vista and beyond.

Neither mutex type offer real performance advantage when contention goes beyond 2%. Somehow, I speculate that Amdahl’s Law is playing a part here. The chart looks very much like the inverse of speedup graph I plotted last year.

The source and datasheet can be download here.

Tools: Visual Studio 2008 (VC9), Boost 1.45

Machine Specification: Intel i5-750 with 4GB of RAM. Window 7 64bit.

shared_ptr and NULL

The interface of shared_ptr is carefully designed such that it has the syntax of a raw C pointer. So naturally, shared_ptr is comparable against NULL.

shared_ptr<SomeClass> sc;
if(sc != NULL)  { } // is it not NULL
if(sc == NULL)  { } // is it NULL

But NULL is really an abused integer. How would you implement such a comparison?

This is C++.  There is always devil in the details.

Obvious, but wrong solution

Attempt #1:

An obvious solution is to implement an operator== and operator != to compare against a pointer type of its type parameter.

template<typename T>
class shared_ptr
{ //...
   bool operator==(T *p) const // compare against T* and check for equality
      if(px_ == p)
         return true;
      return false;
   bool operator!=(T *p) const { /*inverse of == */}
   T* px_;

Why it fails

Yes, this will correctly support the NULL comparison listed above, but there are four other ways in C/C++ to check a pointer for NULL.

The comparison operator fails if the comparison order is reversed, or if implicit boolean conversion is used.

shared_ptr<SomeClass> sc;
if(NULL != sc) {} // no such conversion supported
if(NULL == sc) {}
if(sc) {} // fails the implicit conversion to bool
if(!sc) {}

And it really doesn’t make sense to compare a shared_ptr with a raw pointer.

shared_ptr<SomeClass> sc;
if(rp != sc) {} // doesn't make sense
if(rp == sc) {} // doesn't make sense

So operator== and operator!= provide a poor coverage to this problem. We need something better.

More sophisticated almost solutions

Attempt #2

So what about operator bool? Maybe we can convert the shared_ptr to a boolean type by return false if it is NULL, and return true otherwise.

template<typename T>
class shared_ptr
{ //...
   operator bool() const // conversion to bool
      if(NULL == px_)
         return false; // implicit conversion to false if NULL
      return true; // implicit conversion to true otherwise
   T* px_;

Why it fails

Although this solution supports all six ways of NULL comparison mentioned before, it comes with a bit of baggage.

Thanks to an implicit bool-to-integer promotion, you can now do stuff like this.

shared_ptr<SomeClass> sc;
float f = sc;  // this actually compiles
int i = sc;     // do not want!

Attempt #3

How about operator T*, where shared_ptr implicitly converts to a pointer type of its type parameter?

template<typename T>
class shared_ptr
{ //...
   operator T*() const // conversion to T*
      return px_;
   T* px_;

Why it fails

This solves the problem of implicit integer promotion, but opens a major hole. Now your shared_ptr is actually “leaky” and deletable. This behavior allows shared_ptr to be easily abused and misused.

shared_ptr<SomeClass> sp;
SomeClass *rp;
rp = sp; // uh oh, reference count leak
delete sp; // OMG! heap corruption

The Boost Solution

Here’s the solution chosen by boost library (similar solution also observed in VC10).

template<typename T>
class shared_ptr
{ //...
   typedef T * shared_ptr<T>::*unspecified_bool_type;
   operator unspecified_bool_type() const // never throws
       return px_ == 0? 0: &shared_ptr<T>::px_;
   T* px_;

This solution is very clever. It implicitly converts the shared_ptr into “a pointer to member variable”. Based on the NULLness of the shared_ptr, it will either return 0 or a pointer to member variable of type T*.

With this implementation, shared_ptr manages to support the six ways of checking for NULL, avoids the dangerous comparisons, and has no integer promotion side effects.

Is the boost solution perfect? Of course not. The code is confusing, and you can still do some crazy stuff.

shared_ptr<SomeClass> sp(new SomeClass);

// Grab the shared_ptr's "pointer to its member variable"
shared_ptr<SomeClass>::unspecified_bool_type ubt = sp;

// Extract the shared_ptr's inner pointer member in the most obscure way
SomeClass *innerPointer = sp.*ubt;

Final Thoughts

For such an innocent comparison, the depth of the solution is astonishing. It is amazing to see how far C++ library writers are willing to go to work around the nastiness of the language.

After figuring this out, I later learned that this technique is called the Safe Bool Idiom. (As usual, google is useless if you don’t know what you are looking for).

C++0x will address this mess with the explicit conversion operator.

true != true?

A co-worker was struggling with an urgent bug, and came by my office to ask an odd question.

Is it possible for true != true in C++?

Last time I checked, 1 is equal to 1. So I stopped by her cubical to see this magical event.

Is it true?

She told me that the code has been recompiled from scratch, and both debug and release build exhibit the same behavior.

Variable b is initialized to be true, and Visual Studio run-time checks didn’t catch anything strange.

Stepping through the code in Visual Studio 9, here’s what we saw.

Variable b is true, so it should pass the satisfy the first condition.

The first case failed, and went to the false case instead.

Wow, she’s right. This is quite something.

Diving in

C++ is a language well designed to shoot your foot. In the standard, bool is an integral type that may be 1 or more bytes, and can be either true, false or undefined.

Experience tells me that very likely, b is not true. Visual Studio is not displaying the truth.

To show this, just print out the value of b.

std::cout << std::hex << b << std::endl;

prints 0xcd

Ah ha, so b is an uninitialized variable, and falls under the category of “undefined” in the standard.


Visual Studio does have runtime checks against accessing uninitialized variables, but it can be easily fooled.

Runtime check fails below for VC 8, 9, and 10.

<pre>#include <iostream>

struct SBool {	bool b; };
SBool GetBool()
	SBool s;
	return s;
int main()
	bool b = GetBool().b;
	if(true == b)
		std::cout << "true"<< std::endl;
		std::cout << "false" << std::endl;
	std::cout << std::hex << b << std::endl;

	return 0;

IOCP Server 1.1 Released

While stressing a TCP server application, I found a nasty bug with the IOCP server library.

After handling 100,000 connections or so, the TCP server stops accepting connections. The output from TCPView shows that clients are still trying to connect to the server, but the connection was never established.

I was able to verify that all existing connections are unaffected. Therefore, the IO completion port is still functional. So I concluded that it is not a non-page pool issue, and has something to do with the handling of the accept completion status.

The Cause

The bug is simple, but it takes half a day to reproduce. Here’s the code snippet that causes the problem.

void CWorkerThread::HandleAccept( CIocpContext &acceptContext, DWORD bytesTransferred )
	// Update the socket option with SO_UPDATE_ACCEPT_CONTEXT so that
	// getpeername will work on the accept socket.
		(char *)&m_iocpData.m_listenSocket,
		) != 0)
		if(m_iocpData.m_iocpHandler != NULL)
			// This shouldn't happen, but if it does, report the error.
			// Since the connection has not been established, it is not
			// necessary to notify the client to remove any connections.
	... // more code here
	acceptContext.m_socket = CreateOverlappedSocket();
	if(INVALID_SOCKET != acceptContext.m_socket)
	... // more code here

See that innocent little “return” statement when setsockopt() fails, I foolishly concluded that “This shouldn’t happen”. And naturally, since it should never happen, I never thought about properly handling the error case.

Apparently in the real world, some connections comes and goes so quickly that immediately after accepting the connection, it has already been disconnected. setsockopt() would fail with error 10057, and the return statement causes the “accept chain” to break.

The fix is to remove the “return” statement and move on with life.


Along with this fix, I also removed an unnecessary event per Len Holgate’s suggestion. However, I have not yet removed the mutex in ConnectionManager. This require a slight redesign, and a bit more thoughts.

I can see myself maintaining this library for awhile, so I created a Projects page to host the different versions.


For latest version, please see the Projects page.

IOCP Server Library

So I wrote C++ library that provides a scalable TCP server using Windows I/O Completion Port (IOCP).

Couple weeks ago, I started studying IOCP to improve the scalability of a C++ application that may handle thousands of TCP/IP data stream.

It didn’t take long for me to realize why IOCP has the reputation of being difficult to learn. IOCP tutorial online usually fall into the category of difficult to read, overly simplified, or just plain wrong.

Worse yet, Winsock2 contains a mix of confusing APIs that perform very similar functions with subtle differences. I spent a few days just to decide whether I should use WSAAccept, accept or AcceptEx to accept a connection.

Eventually, I stumbled onto two books that helped me out a great deal – Windows Via C++ and Network Programming For Windows.

The Library

The library interface is rather simple. It follows the Proactor design pattern where user supplies a completion handler and event notifications flow through the completion handler asynchronously.

Everyone uses echo server as tutorial. So what the heck, here’s mine. :)

class CEchoHandler : public CIocpHandler
	virtual void OnReceiveData(
        uint64_t clientId,
        std::vector<uint8_t> const &data)
        // echo data back directly to the connected client
		std::vector<uint8_t> d(data);
		GetIocpServer().Send(clientId, d);
void main()
    // create a handler that echos data back
	boost::shared_ptr h(new CEchoHandler());
        // bind to port 50000 with the server
        CIocpServer *echoServer = new CIocpServer(50000,h);

        char c;
        std::cin >> c; // enter a key to exit

        delete echoServer;
    // RAII constructor that throws different exceptions upon failure

[10/27/2010 10:34AM EST]
Update: Moved “delete echoServer;” to within the try block per co-worker’s suggestion.


Of course, there are more to the IOCP server than the code snippet above.

Here are my area of focus when designing the library.

  1. Scalability – By ensuring that there are minimum number of critical section in the library.
  2. TCP Graceful shutdown – Allow user to perform TCP graceful shutdown and simplify the TCP half-closed state.
  3. RAII – A WYSIWYG constructor and a lenient destructor that allows you to do ungraceful shutdown.

Here is a screenshot of the CPU utilization of the echo server at 300 concurrent loopback file transfer.

IOCP Server scalability upon Intel I5-750 (quad-core)



IOCPServer is released under the Boost Software License 1.0.


For latest version, please see the Projects page.

IOCPServer is tested under the following configurations.

OS: Window XP, Window 7.

Compiler: Visual Studio 2009 with Boost 1.40

Build Type: ANSI, Unicode.

Enforce Alignment to Avoid False Sharing

I have been working on a C++ TCP server that utilizes Windows IO Completion Ports. So far, the toughest challenge has been maintaining the scalability of the server. Among all the concurrency problems, the one I absolutely try to avoid is false sharing, where one CPU modifies a piece of data that invalidates another CPU’s cache line.

The symptom of false sharing is extremely difficult to detect. As preventive measure, I grouped all shared data carefully into a single object so I can easily visualize the potential contentions. And I add padding accordingly if I think contention exists.

Then I came across a chapter in Windows Via C/C++, it provided a cleaner solution.

Just align them to different cache line

My TCP server follows the proactor pattern, so I have a I/O thread pool to handle send and receive requests and dispatch events. Naturally, the threads have some piece of data that they share in read, write or both.

Here’s just a dummy example.

class CSharedData
	CSharedData() : data1(0), data2(0), data3(0) {}
	unsigned int data1; // read write
	unsigned int data2; // read write
	unsigned int data3; // read write

Since my processor’s cache line is 64 bytes, the data structure above is definitely going to cause contention,  say data1 is updated by one thread, and data2 is read by another. To solve this, just simply force every read write data member to be in different cache line through __declspec(align(#)).

class __declspec(align(64)) CSharedData
	CSharedData() : data1(0), data2(0), data3(0) {}
		unsigned int data1;
		unsigned int data2;
		unsigned int data3;


With __declspec(align(#)), you can even specify the alignment of the data structure itself. This is very useful for putting shared objects in containers like std::vector. See Okef’s std::vector of Aligned Elements for why this is a bad idea.

It would be nice if the alignment can be changed at runtime base on processor spec. I know it doesn’t make sense technically, but it is on my wishlist. :)

Shallow Constness

Once awhile, I see programmers who are new to C++ frustrated by the use of the const qualifiers on member functions. These frustrations usually reduce to the following example.

struct X { int i; };
class Y
	Y() { m2 = &m1; } // m2 points to m1
	X *M1() const { return &m1; } // This won't compile.
	X *M2() const { return m2; }  // This does.
	X m1;
	X *m2;

When it comes to this, there are two camps of programmers.

  1. C++ is so inconsistent! M2() is fine, but why won’t M1() compile? I am clearly not modifying m1.
  2. C++ is so inconsistent! M1() is fine, but why would M2() compile? This is clearly a constness loophole because people can modify the content of m2.

Believe it or not, C++ is actually very consistent. It is just not very intuitive.

The “this” Pointer

The behavior can be traced back to the this pointer, and the side effects of the const qualifier on the member function.

In the C++ standard section

… If a member function is declared const, the type of this is T const*, if the member function is declared volatile, the type of this is T volatile *, and if the member function is declared const volatile, the type of this is  T const volatile *.

So in the example, the this pointer has type Y const *, which reads pointer to a const Y object.

Expand for Details

Now that we know the type of the this pointer, we can expand M1() and M2().

Let’s start with M1(). Since the this pointer is of type Y const *, this->m1 will inherit the const qualifier, and is of type X const.

X *M1() const
   // this has type Y const * ;
   X const tmp = this->m1; // this->m1 has type X const;
   X const *tmpAddr = &tmp;// &this->m1 has type X const *;
   X *tmp2 = tmpAddr;      // Can't compile! Can't copy X const * to X *.
   return &tmp2;

In line 6, the compiler fails to copy X const * to X *. In other words, the compiler can’t convert a “pointer to a const X” to a “pointer to X”. This is consistent with the definition of the const qualifier. Hence, M1 fails to compile.

For M2(), we can expand the function in a similar way.

X *M2() const
   // this has type Y const * ;
   X *tmp = this->m2; // this->m2 has type X * const;
   return tmp;

Unlike M1, it is perfectly legal to convert X * const to X*.  In other words, the compiler can copy a “const pointer to X” to a “pointer to X”. This is also consistent with the definition of the const qualifier.

But That’s Not The Point

Unfortunately, the answer above rarely satisfies the frustrated programmers. They are trying to follow the guidelines of const-correctness, and this behavior, although consistent, is ambiguous and makes no sense.

So here’s my recommendation – program defensively.

If you are going to return a member variable pointer (including smart ptr) or reference in a member function, never apply the const qualifier to the member function. Since C++ constness is shallow, the const qualifier only provides a false sense of security.  By assuming the worst, it will always be consistent and intuitive.