Wednesday, July 31, 2013


Check it out. Do. The "A successful Git branching model" post has, as far as I can tell, gone viral. I stumbled upon it some years ago, when git was almost new to me. I felt that it was a good idea to have some structure in your branching. Intriguing was the idea that, whenever you check out a repository, the head of the master branch is something stable, something you could immediately build and use. You don't have to search the history for an earlier commit where the code did work; if it's on the master branch, it works.

And I wasn't the only one intrigued by the branching model. Someone wrote a set of additional git commands that let you use the high-level "git-flow" concepts such as "start feature" and "finish release" instead of individual branch and merge commands.

Despite the benefits, I wasn't really comfortable with git on the command line. I'm still not. I feel that for efficient code versioning, a concise representation of changes is essential. I might be hacking along, not sure whether the code is worth committing, before finally saying, "yes". Then, I start up my git tool of choice, SmartGit, and review what changes I actually made, figure out what that means, stage individual changes that are logically related, and commit them with relevant commit messages. My personal feeling is that I'm more productive with a graphical tool.

But recently, SmartGit started to support git-flow out of the box, and I'm happy to be able to finally use git-flow comfortably. Really, go ahead and read the original post and try git-flow or SmartGit. You'll like it, and I'll like it if I check out your repository and have runnable code right in front of me.

I'm back(?) and polybuf

I hope I am!

I was just thinking I was in the mood to write something for my blog. And I have things to talk about. I have some new code online at GitHub; Polybuf and Harmonic might be interesting to you.

Mind you that I don't have any Laterna Magica code online there. I'm not sure about LM's future, besides that it will definitely have one! Looking at the history, it seems it was over two years ago when I added the last bit of real functionality. Somewhere in that time frame, I realized that the whole architecture of LM wasn't able to address all the stuff I wanted to get out of it - and that stifled the fun I had with the project.

But LM was always in the back of my head. I tried to figure out what I wanted, what I needed to do so, and what was fun for me to make at that moment. And here I am, writing a new blog entry, because it's fun to me. I hope it will stay fun to me, so I hope I'm back!

Now to make this post not entirely meta, let's talk about protobuf and polybuf.

Protobuf is a protocol library which is apparently widely used inside Google. You specify a protocol using .proto files, and protobuf creates code to read and write that protocol. It's basically a replacement for serialization, and provides code generators for different target languages, so that programs written in different languages can exchange data.

The protobuf code generator supports options that let you control how the code is generated; without losing protocol compatibility, you can generate code that is more or less optimized for memory footprint, code size, or parsing time. Protobuf is only one serialization replacement library of many, and depending on how you run your benchmarks, you will probably get different results as to which is the fastest. Anyway, protobuf definitely has the features I need, whether or not it's the "best", and it's supposed to have good documentation. I can't complain so far.

One thing I didn't like about serialization from the beginning is that you throw the Serializable interface right into your code. Exporting and importing state is a very specialized aspect of a piece of software, and it's likely not directly related to its core classes. The thought behind serialization is that adding the implements Serializable clause immediately enables you to serialize your objects - but that's only true if your objects really correspond to whatever you would want to transmit over the network or store on disk. Any discrepancies mean that you have to work around this central concept of serialization, often making the code even less readable and further clogging your classes with logic that does not correspond to their original concerns.

And after you're finished with all this, you have a class that has logic for one specific serialization mechanism. If you want to switch that mechanism out, you have to plumb directly in your application logic classes, instead of replacing some auxiliary import/export classes.

And then there's the fact that serialization semantics are... funny to begin with. Deserialization does not invoke a constructor of the class being deserialized. It creates an empty object without using constructors and then populates the fields directly from the stream. Deserialization can even modify final fields. But you can't do that yourself, not even from within the hooks (such as readObject) that serialization provides.
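To see the constructor behaviour for yourself, here is a minimal sketch. Person and SerializationDemo are made-up names for this demonstration only; the stream classes are the standard java.io ones. The static counter records how often the constructor actually runs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class, used only for this demonstration.
class Person implements Serializable {
    private static final long serialVersionUID = 1L;

    // Counts how often the constructor has actually run.
    static int constructorCalls = 0;

    // A final field: only the constructor may assign it in normal code.
    private final String name;

    Person(String name) {
        constructorCalls++;
        this.name = name;
    }

    String getName() {
        return name;
    }
}

class SerializationDemo {
    // Serializes the object into a byte array and reads it back.
    static Person roundTrip(Person person) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(person);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Person) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Person original = new Person("Ada");
        Person copy = roundTrip(original);
        // The final field was restored even though no constructor ran for the copy.
        System.out.println(copy.getName());
        // Counts constructor runs: one for the original, none for the copy.
        System.out.println(Person.constructorCalls);
    }
}
```

Two Person objects exist at the end, but the counter stays at 1: the copy got its final name field without any Person constructor being involved.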

Protobuf (and its like) kind of works around these problems. You're not serializing your logic objects, but dedicated messages that have well-defined serialization properties. On the receiving end, you can restore your object using whatever constructors and methods you see fit. That seems like more work, but only in the short run; your benefits are familiar semantics (constructors and methods), separation of concerns, easier reaction to changes (through the separation, and the fact that protobuf was created with the explicit goal of compatibility between protocol versions), and of course improved performance through dedicated code instead of reflection.

One thing, by the way, that protobuf does not support is polymorphism. The reason is simple: there are other patterns in protobuf that enable the same functionality, and polymorphism is a feature handled differently in different languages (Java has no multiple inheritance, and C obviously has no inheritance at all). Making polymorphism a feature on that level would limit the interoperability of protocols.

And that's where polybuf comes into play, and where I'll finally show some example code. Polybuf is essentially a convention on how to format messages that contain polymorphic content, and a collection of classes that facilitate deserializing these messages. At the core is the polybuf.proto file; here are the relevant parts:

message Obj {
  optional uint32 type_id = 1 [default = 0];
  optional uint32 id = 2 [default = 0];
  extensions 100 to max;
}

Proto files contain message definitions. The messages declare fields that may be required, optional or repeated. required is actually not recommended, because it makes changing the message slightly harder. This message features only one data type: uint32. Protobuf encodes these values as compactly as possible, using a variable-length encoding: small numbers take fewer than four bytes, while big numbers may take more (up to five). For fields where big values are expected, there are fixed-length types such as fixed32. Every field has a numeric tag that is encoded instead of the field name. Fields can be removed and added without breaking compatibility, as long as these tags are not reused - at least on the protobuf level. Of course, the application must be able to process messages where some fields are missing; that's what the default values are for. These are not transmitted. The receiver simply fills in the default if the field was not present (very possible for optional fields).
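To make the size argument concrete, here is a small sketch of the base-128 "varint" scheme that protobuf uses for uint32 values. It is hand-written for illustration (VarintDemo and encodeVarint32 are my own names, not part of the protobuf library), so treat it as a model of the wire format and not as the library's code.

```java
import java.io.ByteArrayOutputStream;

class VarintDemo {
    // Encodes an unsigned 32-bit value (held in an int) as a base-128 varint:
    // seven payload bits per byte, lowest bits first, and the high bit set
    // on every byte except the last one.
    static byte[] encodeVarint32(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7; // unsigned shift, so large values terminate
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // -1 is 0xFFFFFFFF, the largest uint32 value.
        int[] samples = {0, 127, 128, 300, 1 << 21, 1 << 28, -1};
        for (int v : samples) {
            System.out.println(Integer.toUnsignedString(v) + " -> "
                    + encodeVarint32(v).length + " byte(s)");
        }
    }
}
```

Values up to 127 fit in one byte and 300 takes two, while anything from 2^28 upward needs five bytes, one more than a fixed four-byte encoding would.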

Below these two fields, there is an extension declaration, and now it becomes interesting: 100 to max specifies that all field tags from 100 upward are free to use for extensions. There is an example directly in the file:

//  message Example {
//    extend Obj {
//      optional Example example = 100;
//    }
//    optional string value = 1;
//  }

The Example message declares an extension to Obj, which is simply a set of additional fields, labeled with tags from the extension range. It has nothing to do with an extends clause, and is pretty independent of the Example message; it just happens to be declared there for logical grouping. The only consequence of the placement is the namespace used by the extension.

The Example message itself declares a field value, plain and simple. Now, where's the benefit? Imagine there's a class Example in your program and you expect it to be subclassed, but want to still support protobuf serialization. This structure allows you to
  • reference an Example by declaring the field as Obj,
  • create an Example by setting type_id to 100 (the extension tag used by Example), and filling the example extension field with an Example message,
  • create an ExampleSub message that uses a different extension tag, and simply fill both the Example and ExampleSub extension fields. From the type_id field, you can see that it's an ExampleSub object, and since you know the ExampleSub class structure, you know that there will be an Example message containing the superclass fields.
  • The same goes for multiple inheritance: If you know that the type_id-referenced class has two superclasses, simply read both superclass extension messages from the Obj.
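The layout described in the list above can be sketched in plain Java. This is a model of the convention, not polybuf's or protobuf's generated code: ObjMsg, ExampleMsg and ExampleSubMsg are hand-written stand-ins for the messages, the extra field is invented, and tag 101 for ExampleSub is an assumption (only tag 100 appears in polybuf.proto).

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the Obj message; extension fields are keyed by their tag.
class ObjMsg {
    int typeId;
    int id;
    final Map<Integer, Object> extensions = new HashMap<>();
}

class ExampleMsg { String value; }   // carries the superclass fields
class ExampleSubMsg { int extra; }   // hypothetical subclass field

// The application classes that the messages describe.
class Example {
    final String value;
    Example(String value) { this.value = value; }
}

class ExampleSub extends Example {
    final int extra;
    ExampleSub(String value, int extra) { super(value); this.extra = extra; }
}

class PolyLayoutDemo {
    static final int EXAMPLE_TAG = 100;     // tag used in the polybuf.proto example
    static final int EXAMPLE_SUB_TAG = 101; // assumed tag for the subclass

    // Fills one extension per class in the hierarchy; type_id names the concrete class.
    static ObjMsg write(Example e) {
        ObjMsg obj = new ObjMsg();
        ExampleMsg base = new ExampleMsg();
        base.value = e.value;
        obj.extensions.put(EXAMPLE_TAG, base);
        obj.typeId = EXAMPLE_TAG;
        if (e instanceof ExampleSub) {
            ExampleSubMsg sub = new ExampleSubMsg();
            sub.extra = ((ExampleSub) e).extra;
            obj.extensions.put(EXAMPLE_SUB_TAG, sub);
            obj.typeId = EXAMPLE_SUB_TAG;
        }
        return obj;
    }

    // Dispatches on type_id, then reads the extensions that class is known to use.
    static Example read(ObjMsg obj) {
        ExampleMsg base = (ExampleMsg) obj.extensions.get(EXAMPLE_TAG);
        switch (obj.typeId) {
            case EXAMPLE_TAG:
                return new Example(base.value);
            case EXAMPLE_SUB_TAG:
                ExampleSubMsg sub = (ExampleSubMsg) obj.extensions.get(EXAMPLE_SUB_TAG);
                return new ExampleSub(base.value, sub.extra);
            default:
                throw new IllegalArgumentException("unknown type_id " + obj.typeId);
        }
    }

    public static void main(String[] args) {
        Example restored = read(write(new ExampleSub("hello", 7)));
        System.out.println(restored.getClass().getSimpleName());
    }
}
```

A field declared as Obj can thus hold either class, and the reader recovers the concrete type from type_id alone.
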
Of course there's some Java code supporting this message structure to give a more serialization-like experience. Look at it. Input and Output are similar to the streams you're used to; however, they don't convert objects into bytes and vice versa, they convert between objects and Objs. From there, it's only a small step to networking and persistence. Config stores the IO objects that actually translate between objects and the corresponding extension messages. The Config is reused for subsequent Input and Output uses.

The IO interface is the most interesting one. Its three methods specify clearly the semantics that polybuf expects for (de)serialization, making it easy to understand what's going on. It extracts the logic from private methods in the class itself into a place that is actually meant to handle that concern, and gives the programmer full control over what data is (de)serialized and what constructors and methods should be called. And last but not least - these IO classes support inheritance! Deserializing a subclass can be as simple as subclassing the superclass IO, adding your own subclass initialization and persistence, and delegating to the superclass IO for fields that belong to it.
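Here is a rough sketch of how such an inheritance-friendly IO could look. The method names (store, create, load), the simplified Obj stand-in and the tag 101 are all invented for illustration; they are not polybuf's actual signatures, so check the real source for those.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the Obj message: extension tag -> field values.
class Obj {
    int typeId;
    final Map<Integer, Map<String, Object>> extensions = new HashMap<>();

    Map<String, Object> extension(int tag) {
        return extensions.computeIfAbsent(tag, t -> new HashMap<>());
    }
}

// Hypothetical three-method contract, not polybuf's real interface.
interface IO<T> {
    void store(T object, Obj target); // copy the object's fields into the message
    T create(Obj source);             // instantiate via a constructor of your choice
    void load(T object, Obj source);  // copy the message's fields into the object
}

class Example { String value; }
class ExampleSub extends Example { int extra; } // hypothetical subclass field

class ExampleIO implements IO<Example> {
    static final int EXAMPLE_TAG = 100;

    @Override public void store(Example object, Obj target) {
        target.extension(EXAMPLE_TAG).put("value", object.value);
    }

    @Override public Example create(Obj source) {
        return new Example();
    }

    @Override public void load(Example object, Obj source) {
        object.value = (String) source.extension(EXAMPLE_TAG).get("value");
    }
}

// The subclass IO handles only its own fields and delegates the rest.
class ExampleSubIO extends ExampleIO {
    static final int SUB_TAG = 101; // assumed tag

    @Override public void store(Example object, Obj target) {
        super.store(object, target); // superclass fields
        target.extension(SUB_TAG).put("extra", ((ExampleSub) object).extra);
    }

    @Override public ExampleSub create(Obj source) {
        return new ExampleSub();
    }

    @Override public void load(Example object, Obj source) {
        super.load(object, source); // superclass fields
        ((ExampleSub) object).extra = (Integer) source.extension(SUB_TAG).get("extra");
    }
}

class IODemo {
    public static void main(String[] args) {
        ExampleSub original = new ExampleSub();
        original.value = "hello";
        original.extra = 7;

        ExampleSubIO io = new ExampleSubIO();
        Obj obj = new Obj();
        obj.typeId = ExampleSubIO.SUB_TAG;
        io.store(original, obj);

        ExampleSub copy = io.create(obj);
        io.load(copy, obj);
        System.out.println(copy.value + " " + copy.extra);
    }
}
```

Splitting create from load is what lets the subclass IO pick its own constructor while still reusing the superclass logic for the inherited fields.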