Wednesday, July 31, 2013

I'm back(?) and polybuf

I hope I am!

I was just thinking that I'm in the mood to write something for my blog. And I have things to talk about: I have some new code online at GitHub; Polybuf and Harmonic might be interesting to you.

Mind you, I don't have any Laterna Magica code online there. I'm not sure about LM's future, other than that it will definitely have one! Looking at the history, it's been over two years since I added the last bit of real functionality. Somewhere in that time frame, I realized that LM's architecture couldn't support everything I wanted to get out of it - and that stifled the fun I had with the project.

But LM was always in the back of my head. I tried to figure out what I wanted, what I needed to do to get there, and what would be fun for me to work on at that moment. And here I am, writing a new blog entry, because it's fun. I hope it stays fun, so I hope I'm back!


Now to make this post not entirely meta, let's talk about protobuf and polybuf.

Protobuf (Protocol Buffers) is a serialization library that is apparently widely used inside Google. You specify a protocol using .proto files, and protobuf generates code to read and write that protocol. It's basically a replacement for serialization, and provides code generators for different target languages, so that programs written in different languages can exchange data.
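Just to give you an idea, here's what a round trip looks like in Java. The Person message and its .proto file are made up for this example; newBuilder(), toByteArray() and parseFrom() are the standard API that protoc generates for every message:

// hypothetical person.proto:
//   message Person {
//     optional string name = 1;
//     optional int32 id = 2;
//   }

import com.google.protobuf.InvalidProtocolBufferException;

public class ProtobufRoundTrip {
    public static void main(String[] args) throws InvalidProtocolBufferException {
        // messages are immutable and created through generated builders
        Person p = Person.newBuilder().setName("Alice").setId(42).build();

        // serialize to the compact binary wire format...
        byte[] bytes = p.toByteArray();

        // ...and parse it back - any supported language can do the same
        Person copy = Person.parseFrom(bytes);
        System.out.println(copy.getName()); // Alice
    }
}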

The protobuf code generator supports options that control how the code is generated; without losing protocol compatibility, you can generate code that is optimized for memory footprint, code size or parsing time (via option optimize_for, with the values SPEED, CODE_SIZE and LITE_RUNTIME). Protobuf is only one of many serialization replacement libraries, and depending on how you run your benchmarks, you'll probably get different results as to which is the fastest. Anyway, protobuf definitely has the features I need, whether or not it's the "best", and it's supposed to have good documentation. I can't complain so far.


One thing I didn't like about serialization from the beginning is that you throw the Serializable interface right into your code. Exporting and importing state is a very specialized aspect of a piece of software, and it's usually not directly related to the core classes. The idea behind serialization is that adding an implements Serializable clause immediately enables you to serialize your objects - but that's only true if your objects really correspond to whatever you want to transmit over the network or store on disk. Any discrepancy means you have to work around this central concept of serialization, often making the code even less readable and further clogging your classes with logic that has nothing to do with their original concerns.

And after you're finished with all this, you have a class that contains logic for one specific serialization mechanism. If you want to swap that mechanism out, you have to dig directly into your application logic classes, instead of replacing some auxiliary import/export classes.

And then there's the fact that serialization semantics are... funny to begin with. Deserialization does not invoke a constructor of the deserialized class. It creates an empty object without calling any of its constructors and then populates the fields directly from the stream. Deserialization can even modify final fields. But afterwards, you yourself can't - not even through the hooks that deserialization provides, like readObject().
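A quick demonstration of these semantics (plain java.io, nothing fancy):

import java.io.*;

public class NoConstructorDemo implements Serializable {
    final int value;

    NoConstructorDemo(int value) {
        System.out.println("constructor runs");
        this.value = value;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(new NoConstructorDemo(42)); // prints "constructor runs"
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            // nothing is printed here: no constructor of this class is invoked,
            // and the final field is set reflectively from the stream
            NoConstructorDemo copy = (NoConstructorDemo) in.readObject();
            System.out.println(copy.value); // 42
        }
    }
}

The object comes back fully populated, final field and all, without any constructor of NoConstructorDemo ever running a second time.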


Protobuf (and its kind) works around these problems. You're not serializing your logic objects, but dedicated messages with well-defined serialization properties. On the receiving end, you restore your objects using whatever constructors and methods you see fit. That seems like more work, but only in the short run; the benefits are familiar semantics (constructors and methods), separation of concerns, easier reaction to change (thanks to that separation, and to the fact that protobuf was created with compatibility between protocol versions as an explicit goal), and of course better performance through dedicated code instead of reflection.
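To sketch what that looks like (both Card and the protobuf-generated CardMessage are made up for this example): the domain class keeps its constructor and invariants, and a thin converter does the translation.

public final class Card {
    private final String name;
    private final int cost;

    public Card(String name, int cost) {
        if (cost < 0) throw new IllegalArgumentException("negative cost");
        this.name = name;
        this.cost = cost;
    }

    // CardMessage is a hypothetical protoc-generated message class
    static CardMessage toMessage(Card c) {
        return CardMessage.newBuilder().setName(c.name).setCost(c.cost).build();
    }

    static Card fromMessage(CardMessage m) {
        // restoring goes through the real constructor, so invariants hold
        return new Card(m.getName(), m.getCost());
    }
}

If the Card class changes, only these two methods need to follow; the wire format stays defined by the .proto file.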

One thing protobuf does not support, by the way, is polymorphism. The reason is simple: there are other patterns in protobuf that enable the same functionality, and polymorphism is handled differently in different languages (Java has no multiple inheritance, and C obviously has no inheritance at all). Making polymorphism a feature at that level would limit the interoperability of protocols.


And that's where polybuf comes into play, and where I'll finally show some example code. Polybuf is essentially a convention for formatting messages that contain polymorphic content, plus a collection of classes that facilitate (de)serializing these messages. At its core is the polybuf.proto file; these are the relevant parts:

message Obj {
  optional uint32 type_id = 1 [default = 0];
  optional uint32 id = 2 [default = 0];
  
  extensions 100 to max;
}

Proto files contain message definitions. Messages declare fields that may be required, optional or repeated. (required is actually not recommended, because it makes changing the message later slightly harder.) This message features only one data type: uint32. Protobuf encodes these as varints, as compactly as possible: small values take a single byte, and the largest take five. For fields where large values are expected, there are fixed-length types like fixed32 instead.

Every field has a numeric tag that is encoded on the wire instead of the field name. Fields can be removed and added without breaking compatibility, as long as these tags are not reused - at least on the protobuf level. Of course, the application must be able to process messages where some fields are missing; that's what the default values are for. Defaults are not transmitted; the receiver simply fills them in if the field was not present (very possible for optional fields).
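To make the varint thing concrete, here's the size calculation as a sketch - not protobuf's actual code, but the same seven-bits-per-byte scheme it uses on the wire:

public class VarintSize {
    // each varint byte carries 7 payload bits plus a continuation bit,
    // so the encoded size depends on the magnitude of the value
    static int varintSize(int value) {
        int size = 1;
        while ((value & 0xFFFFFF80) != 0) { // more than 7 bits left?
            size++;
            value >>>= 7; // unsigned shift: the value is treated as a uint32
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(varintSize(1));          // 1 byte
        System.out.println(varintSize(300));        // 2 bytes
        System.out.println(varintSize(0xFFFFFFFF)); // 5 bytes, the worst case
    }
}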

Below these two fields, there is an extensions declaration, and now it becomes interesting: 100 to max specifies that all field tags from 100 upward are free for extensions to use. There is an example directly in the file:

//  message Example {
//    extend Obj {
//      optional Example example = 100;
//    }
//    
//    optional string value = 1;
//  }

The Example message declares an extension to Obj, which is simply a set of additional fields, labeled with tags from the extension range. It has nothing to do with a Java extends clause, and is pretty independent of the Example message; it's just nested there for logical grouping. The only consequence of the placement is the namespace used by the extension.

The Example message itself declares a field value, plain and simple. Now, where's the benefit? Imagine there's a class Example in your program that you expect to be subclassed, but you still want to support protobuf serialization. This structure allows you to
  • reference an Example by declaring the field as Obj,
  • create an Example by setting type_id to 100 (the extension tag used by Example) and filling the example extension field with an Example message (see the sketch after this list),
  • create an ExampleSub message that uses a different extension tag, and simply fill both the Example and ExampleSub extension fields. From the type_id field, you can see that it's an ExampleSub object, and since you know the ExampleSub class structure, you know that there will also be an Example extension containing the superclass's fields,
  • do the same for multiple inheritance: if you know that the type_id-referenced class has two superclasses, simply read both superclass extension messages from the Obj.
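In Java, with classes generated from the messages above (assuming the commented-out Example message were enabled; the exact class names depend on your .proto options), this might look like:

Example payload = Example.newBuilder()
        .setValue("hello")
        .build();

// wrap it in an Obj; type_id tells the receiver which extension to read
Obj obj = Obj.newBuilder()
        .setTypeId(100) // Example's extension tag
        .setExtension(Example.example, payload)
        .build();

// the receiver dispatches on type_id and reads the matching extension
if (obj.getTypeId() == 100) {
    Example restored = obj.getExtension(Example.example);
    System.out.println(restored.getValue()); // hello
}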
Of course there's some Java code supporting this message structure, to give a more serialization-like experience. Look at it. Input and Output are similar to the streams you're used to; however, they don't convert objects to bytes and back, they convert between objects and Objs. From there, it's only a small step to networking and persistence. A Config stores the IO objects that do the actual translation between objects and their extension messages, and is reused across Input and Output instances.

The IO interface is the most interesting one. Its three methods clearly specify the semantics that polybuf expects for (de)serialization, making it easy to understand what's going on. It moves the logic out of private methods in the class itself into a place that is actually meant to handle that concern, and gives the programmer full control over what data is (de)serialized and which constructors and methods are called. And last but not least - these IO classes support inheritance! Deserializing a subclass can be as simple as subclassing the superclass's IO, adding your own subclass initialization and persistence, and delegating to the superclass IO for the fields that belong to it.
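To illustrate the idea - mind you, the names and signatures here are my own sketch, not polybuf's actual code - such an IO could look like this:

public interface IO<T> {
    // object -> message: write T's state into the Obj's extension fields
    void serialize(T object, Obj.Builder target);

    // first deserialization phase: create the instance,
    // e.g. by calling a real constructor with data from the message
    T initialize(Obj message);

    // second phase: populate the remaining fields
    void deserialize(Obj message, T object);
}

class ExampleSubIO implements IO<ExampleSub> {
    private final IO<Example> superIO = new ExampleIO(); // hypothetical superclass IO

    public void serialize(ExampleSub o, Obj.Builder target) {
        superIO.serialize(o, target); // the superclass's extension fields
        // ...write ExampleSub's own extension fields...
    }

    public ExampleSub initialize(Obj message) {
        return new ExampleSub(/* ...data from the message... */);
    }

    public void deserialize(Obj message, ExampleSub o) {
        superIO.deserialize(message, o);
        // ...populate ExampleSub's own fields...
    }
}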
