Protobuf over JNI

On a recent project, I have been working with JNI to wrap a C++ library for use in Java. Like most uses of JNI, it is not for performance, but for compatibility.

JNI itself is a beast. It is quite a challenge to get both the performance and the correctness aspects of its APIs right. While its programming style is close to C, exceptions need to be checked frequently. Its API naming schemes are full of misleading traps that often lead to memory leaks.
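
For example, a pending exception has to be checked after nearly every call back into Java before the result can be trusted. Here is a minimal sketch (the function and method ID names are placeholders, not code from the project):

#include <jni.h>

// A minimal sketch: check for a pending exception after calling back into Java.
void callAndCheck(JNIEnv *env, jobject obj, jmethodID someMethodID)
{
   jint value = env->CallIntMethod(obj, someMethodID);
   if (env->ExceptionCheck())
   {
      // A pending exception must be handled (or cleared) before further JNI calls.
      env->ExceptionClear();
      return;
   }
   // Only now is it safe to use value.
}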

It is so complicated that if you want to pass data through JNI, you want to stick with primitive types, because passing complex data structures through JNI is rather painful.

Unfortunately, my project requires passing complex data structures from Java into C++. To solve this problem, I turned to Protobuf for help.

POJO over JNI

To get a taste of basic JNI, here’s an example. Say I want to pass the following data structure from Java into C++ through JNI.

// A POJO that we pass from Java into JNI layer.
public class CustomPojo {
   private String s;
   private int i;
   private double d;
   private boolean b;
   private String[] sa;
   private int[] ia;

   public String getS() { return s; }
   public void setS(String s) {this.s = s;}
   public int getI() {return i;}
   public void setI(int i) {this.i = i;}

   ...
}

Since the members of CustomPojo are private, the native side has to reach back into Java through individual method calls. Here is how that looks in C++.

// A set of cached class and method IDs used for optimization.
namespace
{
   jclass customPojoClass;
   jmethodID midGetI;
   jmethodID midGetD;
   jmethodID midIsB;
   jmethodID midGetS;
}
// Cache all the necessary class and method IDs the first time through.
void init(JNIEnv *env, jobject customPojoObject)
{
   // A jclass cached across JNI calls must be pinned with a global reference.
   customPojoClass = static_cast<jclass>(
      env->NewGlobalRef(env->GetObjectClass(customPojoObject)));
   midGetI = env->GetMethodID(customPojoClass, "getI", "()I");
   midGetD = env->GetMethodID(customPojoClass, "getD", "()D");
   midIsB = env->GetMethodID(customPojoClass, "isB", "()Z");
   midGetS = env->GetMethodID(
     customPojoClass, "getS", "()Ljava/lang/String;");
   // This is not production-level code. In practice, rigorous error checking
   // is needed here per JNI best practice.
}

JNIEXPORT void JNICALL Java_com_askldjd_blog_HelloWorld_passCustomArgumentIn(
  JNIEnv *env,
  jobject obj,
  jobject customPojoObject)
{
  if(customPojoClass == NULL) { init(env, customPojoObject); }

  int i = env->CallIntMethod(customPojoObject, midGetI);
  double d = env->CallDoubleMethod(customPojoObject, midGetD);
  bool b = (bool)env->CallBooleanMethod(customPojoObject, midIsB);
  jstring stringObj = (jstring)env->CallObjectMethod(
    customPojoObject, midGetS);
  char const *nativeString = env->GetStringUTFChars(stringObj, 0);
  // ... use i, d, b, and nativeString here ...
  env->ReleaseStringUTFChars(stringObj, nativeString);
  // This is not production-level code. In practice, rigorous error checking
  // is needed here per JNI best practice.
}

To follow best performance practice, JNI requires caching the class and method IDs. And to access each field, we need a separate JNI call to invoke the individual accessor.

At this point, it should be clear that passing complex data through JNI is non-trivial.

Protobuf over JNI

Alternatively, we can use Protobuf as the JNI messaging medium between Java and C++. This way, the communication channel through JNI is strictly byte arrays, and the JNI complexity is reduced to a simple byte array access. The code verbosity and the potential for programming errors are therefore drastically reduced.

Here is the same example as above, but with Protobuf over JNI. First, we redefine CustomPojo into a Protobuf message.

option java_outer_classname = "CustomProtoPojo";
package com.askldjd.blog;
message PojoMessage
{
   required string s = 1;
   required int32 i = 2;
   required double d = 3;
   required bool b = 4;
   repeated string sa = 5;
   repeated int32 ia = 6;
}
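
The message definition is then compiled into Java and C++ classes with the protoc compiler. Assuming the definition above is saved as pojo_message.proto (the file name here is illustrative):

protoc --java_out=. --cpp_out=. pojo_message.proto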

Now instead of passing a complicated data structure through the JNI interface, we encode the Protobuf message into a byte array in Java, pass the array through JNI, and decode it in C++.

// Interface definition for javah
public class JniPojo {
   static { System.loadLibrary("jni"); }
   public native void passProtoArgumentIn(byte[] byteArray);
}
// Pass a protobuf-encoded byte array into the JNI layer.
private static void testProtoJNI() {
   JniPojo jniPojo = new JniPojo();
   CustomProtoPojo.PojoMessage.Builder pb = CustomProtoPojo.PojoMessage.newBuilder()
      .setS("Secret POJO message").setI(42).setD(42.5566)
      .setB(true).addIa(0);
   byte[] ba = pb.build().toByteArray();

   jniPojo.passProtoArgumentIn(ba);
}
// CPP JNI
// Take in a protobuf-encoded byte array and decode it into a data structure.
JNIEXPORT void JNICALL Java_com_askldjd_blog_HelloWorld_passProtoArgumentIn(
   JNIEnv *env,
   jobject obj,
   jbyteArray buffer)
{
   using namespace com::askldjd::blog;
   PojoMessage pm;
   jbyte *bufferElems = env->GetByteArrayElements(buffer, 0);
   jsize len = env->GetArrayLength(buffer);
   // ParseFromArray reports failure through its return value; it does not throw.
   if (pm.ParseFromArray(reinterpret_cast<void const *>(bufferElems), len))
   {
      // handle message here
   }
   env->ReleaseByteArrayElements(buffer, bufferElems, JNI_ABORT);
   // This is not production-level code. In practice, rigorous error checking
   // is needed here per JNI best practice.
}

Performance

Conceptually, protobuf over JNI should be more expensive. After all, the message is first encoded in Java, deep copied in JNI, and then decoded in C++. In practice, however, the protobuf-JNI approach is about 7% faster than the POJO-JNI approach.


POJO-JNI vs. Protobuf-JNI over 10 million JNI calls.

Thoughts

Protobuf is a good medium for passing complex data structures through JNI. Compared to the handcrafted reach-back JNI code, protobuf over JNI has far lower code complexity while offering equal or better performance.

This approach is great for passing low-volume traffic of complex data over JNI. For high-volume traffic, it is best to avoid complex data altogether and stay within primitive types.

This only covers synchronous JNI calls. Asynchronous JNI callbacks are a topic (nightmare) for another day.

Source

Source files for the benchmark can be downloaded here.

Tools: Win7 64bit, Java7u10, VS2008, protobuf 2.4.1

Windows Wildcard Path Expansion

For the past few days, I’ve been working on a small Protobuf Compiler plugin. During the course of development, I was stuck on a trivial yet annoying problem – path expansion.

This trivial problem led me on a wild goose chase. From branch diff'ing to debugging, it took me several days to figure this out.

Long Story Short

Here’s a small program called wildcard_input.

#include <iostream>

int main(int argc, char **argv)
{
	for(int i=0; i<argc; ++i)
	{
		std::cout << argv[i] << std::endl;
	}
	return 0;
}

Now I invoke the following command.

wildcard_input *

This is the output from Win7 and Ubuntu 12.04.

Running from Windows 7


Running from Ubuntu 12.04

Apparently path expansion is performed through the shell by default under Linux, whereas it is left to the program to handle under Windows. I prefer the Linux behavior, but others might disagree.

Windows Path Expansion

To get the automatic path expansion behavior on Windows, you need to link against setargv.obj by adding the linker option /link setargv.obj in Visual Studio.
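
For example, when building from the command line, the option can be passed straight through the compiler driver (the same setting also lives under the project's linker options in the IDE):

cl wildcard_input.cpp /link setargv.obj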


With this option, wildcard paths are now expanded properly.


Same program, compiled with the new linker option /link setargv.obj


Retrospection

  1. Filed a misleading bug with the Protobuf team.
  2. Diff'ed 4 different branches for code deltas.
  3. Spent hours debugging the proto compiler and browsing Stack Overflow.

… Alan

STL and erase()

The inconsistency of the erase() method across STL containers has always bothered me.

Say you want to loop through a std::list, print its items, and erase each item from the list. You could do the following.

std::list<int> l(6); // add in 6 zeros
std::list<int>::iterator itr = l.begin();
while(l.end() != itr)
{
	std::cout << *itr << std::endl;
	itr = l.erase(itr);
}

Pretty straightforward.

The erase() method removes the item pointed to by the current iterator and returns an iterator to the next element.

Now what if you want to do the same to a std::set?

std::set<int> s; // add in 6 distinct items (a set collapses duplicates)
//do some initialization here
std::set<int>::iterator s_itr = s.begin();
while(s.end() != s_itr)
{
	std::cout << *s_itr << std::endl;
	s_itr = s.erase(s_itr); // COMPILER ERROR!
}

Accck! This won't compile under gcc because the erase() method of an associative container returns void. To work around the problem, you need to use the post-increment operator on the iterator inside the erase() call.

std::set<int> s; // add in 6 distinct items (a set collapses duplicates)
//do some initialization here
std::set<int>::iterator s_itr = s.begin();
while(s.end() != s_itr)
{
	std::cout << *s_itr << std::endl;
	s.erase(s_itr++); // use post-increment
}

In C++, even erasing an item from an STL container is subtle and error-prone.

The Standard

To make matters worse, Visual Studio's erase() implementation violates the standard and is consistent across all containers: its associative erase() returns an iterator too. This confuses a lot of people.

As far as I can see, this inconsistency has been deemed a defect by the LWG since 1999. Hell, even Josuttis mentioned a few dissatisfactions in his famous STL bible.

In the current C++0x standard, it looks like this issue will finally be put to rest: erase() will return an iterator for every container. So… just a few more years of pain before the next compiler upgrade.
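
For reference, here is a sketch of what the unified loop should look like once compilers catch up; under C++0x the std::set version finally becomes identical to the std::list version:

#include <iostream>
#include <set>

void printAndClear(std::set<int> &s)
{
	std::set<int>::iterator itr = s.begin();
	while(s.end() != itr)
	{
		std::cout << *itr << std::endl;
		itr = s.erase(itr); // C++0x: returns the iterator to the next element
	}
}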

Performance Comparison on Reader-Writer Locks

Recently, I have been playing around with reader-writer (RW) locks. I have never encountered RW locks in production code, but I have read that they can be inefficient in practice and often result in more harm than good.

Recall that a traditional mutex ensures that only one thread may enter a critical region. But if the critical region is written infrequently, an RW lock can exploit the potential concurrency by allowing multiple simultaneous readers.

So when exactly should an RW lock be used in place of a traditional mutex? To answer this question, I wrote a benchmark program to understand the scalability of RW locks.

Boost shared_mutex Benchmark

Since C++ is my primary programming language at work, I started by picking on shared_mutex of the Boost threading library.

In my benchmark, I focus primarily on two variables – the writer frequency and the hold time of the mutex.

For the implementation, there are 4 worker threads (one per core of my quad-core CPU) working with a critical region that approximates e. At each iteration, a thread has a certain probability of becoming a writer. My goal is to see how the performance changes as the writing frequency increases.

And to control the hold time of the mutex, each thread performs a certain number of iterations, called E, inside the critical region. As E becomes larger, the hold time of the mutex increases.
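
To make the setup concrete, here is a minimal sketch of one worker iteration, assuming boost::shared_mutex guards the shared state. The names and the series used to approximate e are illustrative, not the exact benchmark code:

#include <boost/thread/shared_mutex.hpp>
#include <boost/thread/locks.hpp>
#include <cstdlib>

boost::shared_mutex rwMutex;
double sharedResult = 1.0;

// One benchmark iteration. E controls how long the lock is held.
void workerIteration(int E, double writerProbability)
{
   // Approximate e with the series sum of 1/n!; more terms = longer hold time.
   double e = 1.0, term = 1.0;
   if ((std::rand() / (double)RAND_MAX) < writerProbability)
   {
      boost::unique_lock<boost::shared_mutex> writeLock(rwMutex);
      for (int n = 1; n <= E; ++n) { term /= n; e += term; }
      sharedResult = e; // exclusive: a writer blocks everyone
   }
   else
   {
      boost::shared_lock<boost::shared_mutex> readLock(rwMutex);
      for (int n = 1; n <= E; ++n) { term /= n; e += term; }
      // shared: readers may run concurrently with each other
   }
}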

At E = 1, even when there is zero contention, the overhead completely wipes out any performance gain from the concurrent readers.

E=1 shows the overhead of shared_mutex

At E = 50, the longer hold time pays off slightly under low contention. However, the performance degrades rapidly as contention increases.

E=50: longer hold times allow shared_mutex to scale slightly better.


As you can see, the results are very disappointing. Boost shared_mutex only offers a performance gain under extremely low contention with a large hold time. And a large hold time is unrealistic in practice, because most programmers are taught to minimize their critical regions.


SRW Lock Benchmark

Since Vista, Microsoft has offered a new set of synchronization APIs called Slim Reader/Writer (SRW) Locks. These locks are heavily optimized for performance, but they can't be locked recursively (I hate recursive locks anyway) and are not upgradable.

I was curious to see if SRW performs any better, so I added SRW into my benchmark.
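
The SRW API surface is pleasantly small. A minimal usage sketch with the documented Win32 calls (the function names are illustrative):

#include <windows.h>

SRWLOCK srwLock = SRWLOCK_INIT;

void readShared()
{
   AcquireSRWLockShared(&srwLock);
   // read the protected state here
   ReleaseSRWLockShared(&srwLock);
}

void writeShared()
{
   AcquireSRWLockExclusive(&srwLock);
   // modify the protected state here
   ReleaseSRWLockExclusive(&srwLock);
}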


SRW outperforms Boost mutex and shared_mutex even under the shortest hold time.


At longer hold times, SRW degrades similarly to shared_mutex, but with lower overhead.

Although SRW offers scalability similar to boost shared_mutex, its lower overhead lets it outperform shared_mutex in almost all cases.

Final Thoughts

After looking into the implementation of boost shared_mutex, I realized that its lock-free algorithm is complex and tracks many states. The implementation carries so much overhead that it is impractical.

SRW has far lower overhead and can be useful under low contention. Unfortunately, it is only available on Vista and beyond.

Neither mutex type offers a real performance advantage once contention goes beyond 2%. I speculate that Amdahl's Law is playing a part here; the chart looks very much like the inverse of the speedup graph I plotted last year.

The source and datasheet can be downloaded here.

Tools: Visual Studio 2008 (VC9), Boost 1.45

Machine Specification: Intel i5-750 with 4GB of RAM. Window 7 64bit.

shared_ptr and NULL

The interface of shared_ptr is carefully designed such that it has the syntax of a raw C pointer. So naturally, shared_ptr is comparable against NULL.

shared_ptr<SomeClass> sc;
//...
if(sc != NULL)  { } // is it not NULL
if(sc == NULL)  { } // is it NULL

But NULL is really an abused integer. How would you implement such a comparison?

This is C++. The devil is always in the details.

Obvious, but wrong solution

Attempt #1:

An obvious solution is to implement operator== and operator!= to compare against a pointer to its type parameter.

template<typename T>
class shared_ptr
{ //...
   bool operator==(T *p) const // compare against T* and check for equality
   {
      if(px_ == p)
         return true;
      return false;
   }
   bool operator!=(T *p) const { /*inverse of == */}
   T* px_;
};

Why it fails

Yes, this will correctly support the NULL comparisons listed above, but there are four other ways in C/C++ to check a pointer for NULL.

The comparison operator fails if the comparison order is reversed, or if implicit boolean conversion is used.

shared_ptr<SomeClass> sc;
//...
if(NULL != sc) {} // no such conversion supported
if(NULL == sc) {}
if(sc) {} // fails the implicit conversion to bool
if(!sc) {}

And it really doesn’t make sense to compare a shared_ptr with a raw pointer.

shared_ptr<SomeClass> sc;
SomeClass*rp;
//...
if(rp != sc) {} // doesn't make sense
if(rp == sc) {} // doesn't make sense

So operator== and operator!= provide poor coverage of this problem. We need something better.

More sophisticated almost-solutions

Attempt #2

So what about operator bool? Maybe we can convert the shared_ptr to a boolean by returning false if it is NULL, and true otherwise.

template<typename T>
class shared_ptr
{ //...
   operator bool() const // conversion to bool
   {
      if(NULL == px_)
         return false; // implicit conversion to false if NULL
      return true; // implicit conversion to true otherwise
   }
   T* px_;
};

Why it fails

Although this solution supports all six ways of NULL comparison mentioned before, it comes with a bit of baggage.

Thanks to an implicit bool-to-integer promotion, you can now do stuff like this.

shared_ptr<SomeClass> sc;
float f = sc;  // this actually compiles
int i = sc;     // do not want!

Attempt #3

How about operator T*, where shared_ptr implicitly converts to a pointer type of its type parameter?

template<typename T>
class shared_ptr
{ //...
   operator T*() const // conversion to T*
   {
      return px_;
   }
   T* px_;
};

Why it fails

This solves the problem of implicit integer promotion, but opens a major hole. Now your shared_ptr is actually “leaky” and deletable. This behavior allows shared_ptr to be easily abused and misused.

shared_ptr<SomeClass> sp;
SomeClass *rp;
rp = sp; // uh oh, reference count leak
delete sp; // OMG! heap corruption

The Boost Solution

Here's the solution chosen by the Boost library (a similar solution is also found in VC10).

template<typename T>
class shared_ptr
{ //...
   typedef T * shared_ptr<T>::*unspecified_bool_type;
   operator unspecified_bool_type() const // never throws
   {
       return px_ == 0? 0: &shared_ptr<T>::px_;
   }
   T* px_;
};

This solution is very clever. It implicitly converts the shared_ptr into “a pointer to member variable”. Based on the NULLness of the shared_ptr, it will either return 0 or a pointer to member variable of type T*.

With this implementation, shared_ptr manages to support the six ways of checking for NULL, avoids the dangerous comparisons, and has no integer promotion side effects.

Is the boost solution perfect? Of course not. The code is confusing, and you can still do some crazy stuff.

shared_ptr<SomeClass> sp(new SomeClass);

// Grab the shared_ptr's "pointer to its member variable"
shared_ptr<SomeClass>::unspecified_bool_type ubt = sp;

// Extract the shared_ptr's inner pointer member in the most obscure way
SomeClass *innerPointer = sp.*ubt;

Final Thoughts

For such an innocent comparison, the depth of the solution is astonishing. It is amazing to see how far C++ library writers are willing to go to work around the nastiness of the language.

After figuring this out, I later learned that this technique is called the Safe Bool Idiom. (As usual, Google is useless if you don't know what you are looking for.)

C++0x will address this mess with the explicit conversion operator.
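
A sketch of what that looks like: the explicit keyword keeps the boolean tests working in conditional contexts while blocking the integer promotion (the NULL and nullptr comparisons are then handled by separate operator== overloads in the library):

template<typename T>
class shared_ptr
{ //...
   explicit operator bool() const
   {
      return px_ != 0; // if(sp) and if(!sp) work; int i = sp; does not compile
   }
   T* px_;
};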