A Reasonably Fast Python IP Sniffer

Recently my latest project came with a strange requirement: I need to route IP packets from Linux kernel space to user space. In other words, I need to write an IP packet sniffer similar to tcpdump or Wireshark.

The project does not have a high data rate requirement, so I chose Python for some rapid prototyping to get a feel for the problem.

Sniffing with Scapy … Slowly

My past experience with Python is that it often has a magical one-liner that just finishes the job. And this time, Python did not disappoint. My co-worker’s Google-fu quickly found that the Scapy package has a sniff feature, and yes, it is a one-liner. :)
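For reference, the one-liner is essentially a single call to Scapy's `sniff()`. Here is a minimal sketch of it (the callback wrapper is my own addition; it assumes Scapy is installed and the script runs as root):

```python
def summarize(pkt):
    """Callback for each captured packet; sniff() prints whatever it returns."""
    return pkt.summary()

if __name__ == "__main__":
    # Import inside the guard so the callback stays importable without Scapy.
    from scapy.all import sniff  # capturing requires root privileges
    sniff(prn=summarize)
```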

On the first try, the sniffer functioned perfectly, and I immediately saw all the incoming and outgoing packets as I browsed different webpages to trigger HTTP traffic.

So how about a stress test? For that, I browsed to the Ubuntu homepage and downloaded an Ubuntu ISO. The file is large, and the data rate is reasonably high for a quick test. Unfortunately, Scapy didn’t perform so well.


A 4.4 MBps data rate results in 100% CPU usage.

It turns out that a ~4.4 MBps (35 Mbps) capture consumed close to 100% of my CPU. That is an unacceptable amount of overhead just for routing packets from kernel space into user space.

Sniffing with Raw Socket

Since Scapy comes with too much overhead, the next step was to drop down a layer and implement a raw layer 2 socket. In user space, if an application creates a raw socket, the Linux kernel automatically forwards a copy of every datagram with the matching protocol number to the application. So with a layer 2 socket, the application receives all Ethernet frames.

*A layer 2 socket is chosen because I need to sniff both incoming and outgoing packets on a network interface. A layer 3 socket does not appear to provide this capability.
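A minimal sketch of such a layer 2 sniffer in Python (Linux-only, since it relies on `AF_PACKET` sockets; the interface name "eth0" and the IPv4 hand-off are placeholders):

```python
import socket
import struct

ETH_P_ALL = 0x0003  # from <linux/if_ether.h>: capture every protocol

def parse_ethernet_header(frame):
    """Unpack the 14-byte Ethernet header into (dst MAC, src MAC, EtherType)."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return dst, src, ethertype

if __name__ == "__main__":
    # A raw AF_PACKET socket receives every frame, incoming and outgoing.
    # Creating and binding it requires root.
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    s.bind(("eth0", 0))
    while True:
        frame = s.recv(65535)
        dst, src, ethertype = parse_ethernet_header(frame)
        if ethertype == 0x0800:  # IPv4
            ip_packet = frame[14:]  # hand the IP packet to user-space logic
```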

So how’s the performance this time around? It turned out to be surprisingly fast.


Same data rate as before, but only 16% CPU usage.

With the same data rate, the previous 100% CPU consumption dropped to only 16%.

This is more than fast enough for my application. Mission Accomplished!

Final Thoughts

If the Python raw socket was still too slow, the next step would be to re-write the raw socket in C.

Scapy carries a lot of overhead in practice as a live packet sniffer. If you don’t need all of Scapy’s power, an IP sniffer can be easily implemented with a Python raw socket, and it provides fairly reasonable performance.


Binding a raw socket requires root permission, so the scripts need to run as root.

scapy_sniff.py – IP sniffer using Scapy

ip_sniff.py – IP sniffer using Python raw socket


All tests were run in a Linux Mint 15 VirtualBox VM with Windows 7 as the host.

protoc-gen-docbook – Convert protobuf source into DocBook and PDF

Documentation has always been Protobuf‘s weakest area. Proto source files are expected to be used like an IDL. This works for simple interfaces, but falls apart as the interface grows in complexity across multiple layers of source files.

With the Protobuf 2.5.0 update, the protobuf compiler finally preserves the comments within proto source files in its Descriptor definition. This opens the door to documenting proto files.


I spent a week or so writing a protobuf compiler plugin that converts a proto source file into DocBook and PDF. The plugin is called protoc-gen-docbook, following the protobuf compiler’s plugin naming convention.

The results are very satisfying, and it has become a useful tool in my development life. I have open sourced the project on Google Code under the New BSD License.

Here are the shortcut links to the project homepage, which contains much more detail on the project.



Quick Start:


Sample Output:


Implementation Details:


As usual, feedback is always welcome.

SimpleDateFormat is slow

Optimization is often full of surprises. Whether it is low-level C or high-level JavaScript, I always learn something from the profiler. And more often than not, it is the little things that matter.

Java’s SimpleDateFormat is a powerful time formatter that can handle various date formats in different locales. That power comes at a price: it is slow, very slow.

I learned this last year while working on an Android application. The Android profiler pinpointed SimpleDateFormat as the cause of a poor refresh rate. The solution then was simply to reduce the formatting frequency. By avoiding frequent calls to SimpleDateFormat, performance improved, and that was the end of the story.

A year later, I am working on a NetBeans-platform-based project and facing the same situation again. This time I need to parse several million timestamps from a log file and convert them to epoch milliseconds. Unlike last year, avoiding the formatter is not an option.

Roll Your Own

The problem is CS-101 simple: parse a timestamp in the format yyyyMMddHHmmss into epoch time in milliseconds. The timestamp is always in GMT.

Here’s the sample code with SimpleDateFormat. This method is called getTimestamp1.

private static final SimpleDateFormat cachedSdf
   = new SimpleDateFormat("yyyyMMddHHmmss");
static {
   // The timestamps are always GMT.
   cachedSdf.setTimeZone(TimeZone.getTimeZone("GMT"));
}

// Parse date and time into epoch milliseconds.
public static long getTimestamp1(String date, String time) {
   if (!date.isEmpty() && !time.isEmpty()) {
      try {
         return cachedSdf.parse(date + time).getTime();
      } catch (ParseException e) {
         return 0;
      }
   }
   return 0;
}
Alternatively, here’s the do-it-myself version. I call it getTimestamp2.

private static final Calendar CachedCalendar = new GregorianCalendar();
static {
   CachedCalendar.setTimeZone(TimeZone.getTimeZone("GMT"));
   CachedCalendar.clear();
}

public static long getTimestamp2(String date, String time) {
   try {
      int y = Integer.parseInt(date.substring(0, 4));
      int m = Integer.parseInt(date.substring(4, 6));
      int d = Integer.parseInt(date.substring(6, 8));
      int h = Integer.parseInt(time.substring(0, 2));
      int mm = Integer.parseInt(time.substring(2, 4));
      int s = Integer.parseInt(time.substring(4, 6));

      // Calendar months are zero-based.
      CachedCalendar.set(y, m - 1, d, h, mm, s);
      CachedCalendar.set(Calendar.MILLISECOND, 0);

      // Reject values that rolled over (e.g. month 13 or day 32).
      if (CachedCalendar.get(Calendar.YEAR) != y) {
         return 0;
      }
      if (CachedCalendar.get(Calendar.MONTH) != m - 1) {
         return 0;
      }
      if (CachedCalendar.get(Calendar.DATE) != d) {
         return 0;
      }
      if (h < 0 || h > 23) {
         return 0;
      }
      if (mm < 0 || mm > 59) {
         return 0;
      }
      if (s < 0 || s > 59) {
         return 0;
      }
      return CachedCalendar.getTime().getTime();
   } catch (Exception e) {
      return 0;
   }
}

Here are the results using the NetBeans profiler.

The CPU time decreased from 1.13s to 0.288s, which is roughly a ~75% reduction.

CPU time: getTimestamp1 vs. getTimestamp2

The total object allocation decreased by ~3 KB per call.

Final Thoughts

Although simple, the performance improvement here was more significant than any other fancy optimization I’ve done on the application.

getTimestamp2 can be another 10-15% faster if you replace Integer.parseInt with a hand-rolled fixed-width digit parser.

SimpleDateFormat is slow. If you need the performance, it may be worthwhile to roll your own solution.

As always, benchmarking Java is hard. These performance numbers should only be used in relative terms.

Source & Tools

The source code can be found here.

JDK7u15 64 bit, Win7 64bit, NetBeans 7.3.

Protobuf over JNI

On a recent project, I have been working with JNI to wrap a C++ library for use from Java. Like most uses of JNI, it is not for performance, but for compatibility.

JNI itself is a beast. It is quite a challenge to get both the performance and the correctness aspects of its APIs right. While its programming style is close to C, exceptions need to be checked frequently, and its API naming scheme is full of misleading traps that often lead to memory leaks.

It is complicated enough that if you want to pass data through JNI, you want to stick with primitive types, because passing complex data structures into JNI is rather painful.

Unfortunately, my project requires passing complex data structures from Java into C++. To solve this problem, I turned to Protobuf for help.

POJO over JNI

To get a taste of basic JNI, here’s an example. Say I want to pass the following data structure from Java into C++ through JNI.

// A POJO that we pass from Java into the JNI layer.
public class CustomPojo {
   private String s;
   private int i;
   private double d;
   private boolean b;
   private String[] sa;
   private int[] ia;

   public String getS() { return s; }
   public void setS(String s) { this.s = s; }
   public int getI() { return i; }
   public void setI(int i) { this.i = i; }
   // ... accessors for d, b, sa, and ia follow the same pattern ...
}


Since the members of CustomPojo are private, the native side has to reach back through individual method calls. Here’s an example of how that looks in C++.

// A set of cached class and method IDs used for optimization.
jclass customPojoClass;
jmethodID midGetI;
jmethodID midGetD;
jmethodID midIsB;
jmethodID midGetS;

// Cache all the necessary IDs on the first call.
void init(JNIEnv *env, jobject customPojoObject)
{
   // A production version should also pin the class with NewGlobalRef.
   customPojoClass = env->GetObjectClass(customPojoObject);
   midGetI = env->GetMethodID(customPojoClass, "getI", "()I");
   midGetD = env->GetMethodID(customPojoClass, "getD", "()D");
   midIsB = env->GetMethodID(customPojoClass, "isB", "()Z");
   midGetS = env->GetMethodID(
     customPojoClass, "getS", "()Ljava/lang/String;");
   // This is not production-level code. In practice, rigorous error
   // checking is needed here per JNI best practices.
}

JNIEXPORT void JNICALL Java_com_askldjd_blog_HelloWorld_passCustomArgumentIn(
  JNIEnv *env,
  jobject obj,
  jobject customPojoObject)
{
  if (customPojoClass == NULL) { init(env, customPojoObject); }

  int i = env->CallIntMethod(customPojoObject, midGetI);
  double d = env->CallDoubleMethod(customPojoObject, midGetD);
  bool b = (bool)env->CallBooleanMethod(customPojoObject, midIsB);
  jstring stringObj = (jstring)env->CallObjectMethod(
    customPojoObject, midGetS);
  char const *nativeString = env->GetStringUTFChars(stringObj, 0);
  // ... use the string here ...
  env->ReleaseStringUTFChars(stringObj, nativeString);
  // This is not production-level code. In practice, rigorous error
  // checking is needed here per JNI best practices.
}

To follow best performance practices, JNI requires caching the class and method IDs. And to access each field, we need a separate JNI call to invoke the individual accessor.

At this point, it should be clear that passing complex data through JNI is non-trivial.

Protobuf over JNI

Alternatively, we can use Protobuf as the JNI messaging medium between Java and C++. This way, the communication channel through JNI is strictly byte arrays, and the JNI complexity is reduced to simple byte array access. The code verbosity and the potential for programming errors are therefore drastically reduced.

Here is the same example as above, but with Protobuf over JNI. First, we redefine CustomPojo into a Protobuf message.

option java_outer_classname = "CustomProtoPojo";
package com.askldjd.blog;

message PojoMessage {
   required string s = 1;
   required int32 i = 2;
   required double d = 3;
   required bool b = 4;
   repeated string sa = 5;
   repeated int32 ia = 6;
}

Now instead of passing a complicated data structure through the JNI interface, we can encode the Protobuf message into a byte array through JNI, and decode it in C++.

// Interface definition for javah.
public class JniPojo {
   static { System.loadLibrary("jni"); }
   public native void passProtoArgumentIn(byte[] byteArray);
}

// Pass a protobuf-encoded byte array into the JNI layer.
private static void testProtoJNI() {
   JniPojo jniPojo = new JniPojo();
   Builder pb = CustomProtoPojo.PojoMessage.newBuilder()
      .setS("Secret POJO message").setI(i).setD(42.5566);
   byte[] ba = pb.build().toByteArray();
   jniPojo.passProtoArgumentIn(ba);
}

// Take in a protobuf-encoded byte array and decode it into a data structure.
JNIEXPORT void JNICALL Java_com_askldjd_blog_HelloWorld_passProtoArgumentIn(
   JNIEnv *env,
   jobject obj,
   jbyteArray buffer)
{
   using namespace com::askldjd::blog;
   PojoMessage pm;
   jbyte *bufferElems = env->GetByteArrayElements(buffer, 0);
   int len = env->GetArrayLength(buffer);
   try {
      pm.ParseFromArray(bufferElems, len);
      // handle message here
   } catch (...) {}
   env->ReleaseByteArrayElements(buffer, bufferElems, JNI_ABORT);
   // This is not production-level code. In practice, rigorous error
   // checking is needed here per JNI best practices.
}

Conceptually, protobuf over JNI should be more expensive. After all, the message is encoded in Java, deep copied in JNI, and then decoded in C++. In practice, however, the protobuf-JNI approach measured 7% faster than the POJO-JNI approach.


Pojo-JNI vs. Protobuf-JNI over 10 million JNI calls.


Protobuf is a good medium for passing complex data structures through JNI. Compared to the handcrafted reach-back JNI code, protobuf over JNI has far lower code complexity while offering equal or better performance.

This approach is great for passing low volume traffic of complex data over JNI. For high volume traffic, it is best to avoid complex data altogether and stay within primitive types.

This only covers synchronous JNI call. Asynchronous JNI callback is a topic (nightmare) for another day.


Source files for the benchmark can be downloaded here.

Tools: Win7 64bit, Java7u10, VS2008, protobuf 2.4.1

Windows Wildcard Path Expansion

For the past few days, I’ve been working on a small Protobuf compiler plugin. During development, I got stuck on a trivial yet annoying problem: path expansion.

This trivial problem led me on a wild goose chase. From branch diffing to debugging, it took me several days to figure out.

Long Story Short

Here’s a small program called wildcard_input.


#include <iostream>

int main(int argc, char **argv)
{
	for (int i = 0; i < argc; ++i)
		std::cout << argv[i] << std::endl;
	return 0;
}

Now I invoke the following command.

wildcard_input *

This is the output from Win7 and Ubuntu 12.04.

Running from Window 7


Running from Ubuntu 12.04

Apparently path expansion is performed by the shell by default under Linux, whereas it is left to the program under Windows. I prefer the Linux behavior, but others might disagree.
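As an aside, a program can also do the expansion itself with a globbing library, which makes the behavior identical on both platforms. A hypothetical Python sketch:

```python
import glob
import sys

def expand_args(args):
    """Expand wildcard patterns in-process, much like setargv.obj does for MSVC.
    Patterns that match nothing are passed through unchanged."""
    expanded = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        expanded.extend(matches if matches else [arg])
    return expanded

if __name__ == "__main__":
    for path in expand_args(sys.argv[1:]):
        print(path)
```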

Windows Path Expansion

To get automatic path expansion behavior on Windows, you need to add the linker option setargv.obj (/link setargv.obj) in Visual Studio.


With this option, wildcard paths are now expanded properly.


Same program, compiled with the new linker option /link setargv.obj



  1. Filed misleading bug to the Protobuf team.
  2. Diff’ed 4 different branches for code delta
  3. Hours of proto compiler debugging and Stackoverflow browsing.

… Alan

IOCP Server 1.2 Released

It has been almost 10 months since I published the IOCP server library. This little library has been very useful to me across several C++ projects. It offers impressive scalability on multi-core processors and a simple interface for TCP graceful shutdown (unlike boost ASIO).

I recently found a crash bug where a utility function was copying a non-null-terminated IP address string. The bug is specific to the Unicode build and has no effect on the ANSI build.

Also, I ran CPPCheck against the library. I am proud to say that CPPCheck reported only one style warning, and that has also been addressed.

Anyway, you may get the latest version of IOCP Server here. Enjoy!


Plotting GetThreadTimes

As a side project, I have been developing an application to monitor the performance of our main product (hence the lack of updates). In my experience, our large C++ programs have not worked well under most profilers (either too slow or too resource-intensive). The PGO technique I posted last year works well within the scope of a library, but does not provide performance data from a system-wide perspective. ProcExp from Sysinternals provides good performance insight with thread CPU cycles and context switches, but the results are not recordable. So I started by duplicating ProcExp’s thread performance output through GetThreadTimes.

Plotting The Data

GetThreadTimes is a very powerful function that provides the cumulative user and kernel time consumed by a particular thread. The resolution is known to be coarse, and it has a tendency to overlook very short executions (e.g. a thread that doesn’t fully consume its quantum). A post online noted the coarse granularity of the measurement and suggested a 1 second sampling interval to get sufficiently accurate results.

Below is a timing report from GetThreadTimes sampling once a second. The target application is Media Player Classic playing a 720p H.264 video.

GetThreadTimes sampled at one second interval.

Sadly, the plot is almost completely indecipherable. There are large fluctuations from reading to reading, and thread times collected are occasionally zero.

So I spent a lot of time massaging the data. After much trial and error, I finally settled on smoothing the data points with a 15 second rolling average. The resulting plot is far more consumable.

Same data as before, but with 15 second rolling average.
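The smoothing step itself is tiny. Here is a sketch of a rolling average in Python (the 15-sample window is assumed from the 15-second figure at a 1 Hz sampling rate; early outputs simply average whatever samples exist so far):

```python
from collections import deque

def rolling_average(samples, window=15):
    """Smooth per-second readings with a fixed-size rolling mean."""
    buf = deque(maxlen=window)  # oldest sample drops out automatically
    smoothed = []
    for s in samples:
        buf.append(s)
        smoothed.append(sum(buf) / len(buf))
    return smoothed
```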

Final Thoughts

After digging into Windows Internals, I still couldn’t fully explain the output of GetThreadTimes. Without averaging, the data is extremely difficult to consume.

By smoothing out the data with a rolling average, the data plot became very practical. In fact, this method has already uncovered a performance bug during overnight runs.

STL and erase()

The inconsistency of the erase() method across STL containers has always bothered me.

Say you want to loop through a std::list, print its items, and erase each item from the list. You could do the following.

std::list<int> l(6); // add in 6 zeros
std::list<int>::iterator itr = l.begin();
while (l.end() != itr)
{
	std::cout << *itr << std::endl;
	itr = l.erase(itr);
}

Pretty straight forward.

The erase() method removes the item pointed to by the iterator and returns an iterator to the next element.

Now what if you want to do the same to a std::set?

std::set<int> s; // add in 6 zeros
// do some initialization here
std::set<int>::iterator s_itr = s.begin();
while (s.end() != s_itr)
{
	std::cout << *s_itr << std::endl;
	s_itr = s.erase(s_itr); // COMPILER ERROR!
}

Accck! This won’t compile under gcc, because in C++03 the erase() method of an associative container returns void. To work around the problem, you need to use the iterator’s post-increment operator inside the erase() call.

std::set<int> s; // add in 6 zeros
// do some initialization here
std::set<int>::iterator s_itr = s.begin();
while (s.end() != s_itr)
{
	std::cout << *s_itr << std::endl;
	s.erase(s_itr++); // use post-increment
}

In C++, even erasing an item from an STL container is subtle and error-prone.

The Standard

To make matters worse, Visual Studio’s erase() implementation violates the standard: it returns an iterator for every container, associative or not. At least it is consistent, but it confuses a lot of people.

As far as I can tell, this inconsistency has been deemed a defect by the LWG since 1999. Hell, even Josuttis mentions a few dissatisfactions in his famous STL bible.

In the upcoming C++0x standard, it looks like this issue will finally be put to rest. So… just a few more years of pain before the next compiler upgrade.

Shared Bathroom Problem

I came across an interesting synchronization problem called the shared bathroom problem. The shared resource is the bathroom, which must be shared between males and females. However, the bathroom sharing protocol must be implemented in the following manner:

  1. Mutual exclusion: persons of opposite sex may not occupy the bathroom simultaneously,
  2. Fairness: everyone must wait in line in a civil manner without cutting in line.

Rule #1 has characteristics similar to the reader/writer problem, but the rules here are slightly relaxed: multiple males or multiple females may occupy the resource at once.

To ensure starvation-freedom, rule #2 enforces fairness such that any person will eventually enter the bathroom.

A Jab at the Problem

I have just started (re)learning Java professionally, so I might as well write it in Java.

Mapping the problem to an implementation is fairly straightforward. There are two classes of threads, MaleThread and FemaleThread. Both types of threads try to enter a resource (class) called Bathroom. Bathroom consists of two APIs, enter(Person p) and leave(Person p), where Person is a data class that holds the gender and thread information.

So an example of a male thread entering the bathroom could be the following:

Person male = new Person(Thread.currentThread(), Gender.MALE);
bathroom.enter(male);
// do some private business
bathroom.leave(male);

From rule #1, the Bathroom class has the characteristics of a mutex, where pending threads need to be blocked and unblocked efficiently. For this, I chose LockSupport for thread synchronization. For the fairness requirement in rule #2, a ConcurrentLinkedQueue provides the FIFO behavior.

class Bathroom {
  private final AtomicInteger numMale = new AtomicInteger(0);
  private final AtomicInteger numFemale = new AtomicInteger(0);
  private final Queue<Person> waiters = new ConcurrentLinkedQueue<Person>();

  public void enter(Person person) {
    boolean wasInterrupted = false;
    waiters.add(person);

    // Block while not first in line, or if there is opposite sex in the
    // bathroom.
    while (waiters.peek() != person
        || getOppositeGenderCount(person.getGender()).get() > 0) {
      LockSupport.park();

      // ignore interrupts while waiting
      if (Thread.interrupted()) {
        wasInterrupted = true;
      }
    }

    // Increment the current sex count since this person is first in line
    // and there is no opposite sex in the bathroom.
    getGenderCount(person.getGender()).incrementAndGet();

    // Remove this person from the waiting line. This is the linearization
    // point.
    waiters.remove();

    // Wake up the next waiter so same-gender followers can also enter.
    Person nextPersonInLine = waiters.peek();
    if (nextPersonInLine != null) {
      LockSupport.unpark(nextPersonInLine.getOwnerThread());
    }

    // Reassert interrupt status on exit.
    if (wasInterrupted) {
      Thread.currentThread().interrupt();
    }
  }

  public void leave(Person person) {
    // Decrement the gender count, and wake up the next waiter.
    getGenderCount(person.getGender()).decrementAndGet();

    Person nextPersonInLine = waiters.peek();
    if (nextPersonInLine != null) {
      LockSupport.unpark(nextPersonInLine.getOwnerThread());
    }
  }

  private AtomicInteger getOppositeGenderCount(Gender gender) {
    if (gender == Gender.MALE) {
      return numFemale;
    } else if (gender == Gender.FEMALE) {
      return numMale;
    }
    // something is wrong here.
    return numMale;
  }

  private AtomicInteger getGenderCount(Gender gender) {
    if (gender == Gender.MALE) {
      return numMale;
    } else if (gender == Gender.FEMALE) {
      return numFemale;
    }
    // something is wrong here.
    return numMale;
  }
}
public class Person {
  private final Thread ownerThread;
  private final Gender sex;

  public Person(Thread ownerThread, Gender sex) {
    this.ownerThread = ownerThread;
    this.sex = sex;
  }

  public Thread getOwnerThread() {
    return ownerThread;
  }

  public Gender getGender() {
    return sex;
  }
}

public enum Gender {
  MALE, FEMALE
}

Some Explanation

The bathroom itself is lock-free. At a high level, every person who tries to enter the bathroom goes through a “test and set” process for two conditions: 1. he/she is the first person in the waiting line, and 2. no one of the opposite sex is in the bathroom. As soon as both conditions hold, the person enters the bathroom by incrementing his or her gender count and leaving the waiting line. This is the linearization point of the enter() method.

The leave() method simply decrements the gender count and notifies the next person in line to wake up and re-test the conditions for entering the bathroom.

For suspending and waking up threads, LockSupport’s park() and unpark() methods are used. The park() method is simple: the thread is suspended until a “permit” becomes available through a call to unpark(). The unpark() method, however, has a subtle yet beautiful post-condition. It guarantees the following:

Make available the permit for the given thread, if it was not already available. If the thread was blocked on park() then it will unblock. Otherwise, its next call to park() is guaranteed not to block.

Because unpark() guarantees unblocking the next call to park(), the bathroom class does not suffer from lost-wakeup.


To argue for the code’s correctness, I was able to identify the linearization points of enter() and leave(). It also passes my unit tests. However, I do not claim complete confidence in the correctness of the algorithm. After all, data race detection is an NP-hard problem.

I found two similar blog posts on this problem. One solution is different in that it only guarantees starvation-freedom (as opposed to fairness, which is more restrictive). Another solution guarantees fairness (also using a FIFO queue), but is written in a lock-based manner.

Performance was not measured here, as it is only a proof of concept.

The bathroom resource is not recursive (not re-entrant). After all, recursive lock is evil.

The source can be downloaded here. (Written with Java 1.6)

Some thoughts on Java

For the past month, I have been working on a server side web service in Java. Although I have used Java in numerous courses throughout college and grad school, this is the first time I have programmed in it professionally.

Compared to C++, Java is a much cleaner language. Here are a few things I have enjoyed so far.

  • Importing and exporting libraries is fairly trivial (no declspec and name mangling issues).
  • A memory model that provides visibility and ordering guarantees (unlike C++ volatile)
  • Safe casting without worrying about memory layout (no memory alignment or padding).
  • Garbage collection allows for slightly sloppier exit sequence. (vs. memory corruption)

In general, things that are difficult in C++ are still difficult in Java (such as concurrency). But Java’s cleaner design provides an invisible guiding hand that keeps me from shooting myself in the foot.


Deep copying in Java is an unpleasant experience. Java does not provide a default copy constructor, so every copy constructor must be hand-crafted. The clone interface counter-intuitively provides a shallow copy, and best practices avoid it.

Over the years with C++, I have formed a habit of cleanly separating business logic from data. Under such a setup, data is coded in the form of PODs, and the default copy constructor (and assignment operator) provides deep copying for free. In general, deep copying a POD is a trivial operation that carries a no-throw guarantee.

Unfortunately, Java does not have a well-defined concept of a POD. In fact, even the simplest object in Java, java.lang.Object, is non-trivial: each Object is associated with a monitor, and deep copying a monitor in a multi-threaded environment is almost never desirable. Since the internal monitor is not exposed in the Object interface, you can never truly perform a deep copy of any Java object.

And without automated POD copy constructors, the copy-and-swap idiom does not transfer as nicely to Java.

More Thoughts

If the lack of deep copy is the only thing I can complain about Java, it is not a bad experience so far.