Wednesday, May 30, 2012

Under the Hood: Java Bytecode

So, now that we have cleared up how a Java application is executed, I can talk about bytecode manipulation, a technique to modify the compiled classes before execution.
There are two possible times when modification can occur: at build time, modifying the class filers after compilation, or at runtime, using a class loader that modifies the bytes of the class before passing them on and actually defining the class from it. Both have their advantages: Build time simplifies the structure at runtime; your application runs like an ordinary Java application, except that not all of its classes were ordinarily created from Java source code. Runtime gives you the possibility of applying different modifications in different executions, which could be used to, say provide compatibility to different environments. More importantly in my opinion, this allows to create completely new classes based on user input!

Let me give you an example: You make an application that draws the graph of a function the user types into a text field, so basically you take y = f(x) for each coordinate visible in the plot; that's a lot of times and you want to compute your function very quickly. So, instead of parsing the expression, building a tree from it, inserting a new value for x, and then evaluating the function, wouldn't it be nice to parse the expression into a real Java class that is executed on the tested and optimized JVM, with the possibility that your code is even JIT-compiled into machine code for an extra boost?

And guess what - that's not only possible, but even relatively easy, once you've grasped what java byte code looks like, so that's what were going to look at now. It's pretty straight forward that a class consists of fields and methods, and when you think about the fact that inner classes have separate class files, it becomes clear that there must be some information about enclosing class and inner classes, too.

The code of methods is of course where it gets interesting. The JVM is a stack based machine, which means that values are kept in a last-in-first-out data structure, and every operation pops a specific number of values from the stack and pushes its results onto the stack again. Note that this data is not related to local variables or fields; here, we're basically talking about the steps taken in evaluating an expression.
Also, type names have a special form in class files. Where a class name is needed, a string of the form java/lang/Object is used; where it could be a primitive type, too, this is wrapped into L...;, where L somehow was chosen to mean "object type", and ; denotes the end of the type name. As another example, long, which can't use the L, was given J as the identifier, and boolean uses Z; arrays prepend a [ for every dimension. A method, for example int indexOf(char ch) or String substring(int beginIndex, int endIndex) would be described as (C)I or (II)Ljava/lang/String; - note that the name is not part of the signature but stored separately.

And that's all you really need to know, except for the few less than 256 possible commands (opcodes) that the JVM knows. The details of how all the strings are stored in a classfile is hidden from you when you use ASM to generate bytecode.

Now let's look at some bytecode, as this is where it gets interesting. A simple example first:

package Example;
class Example {
    int a;
    void example(int a) {
        this.a = a;
    }
}

...and the bytecode for example():


example(I)V
ALOAD 0
ILOAD 1
PUTFIELD example/Example.a : I
RETURN


MAXSTACK = 2
MAXLOCALS = 2

ALOAD pushes a local variable onto the stack. In this example, there are - surprise - two! Not only is there parameter a, there's also the implicit this of every nonstatic method. When both are on the stack (explaining the MAXSTACK entry above), PUTFIELD can take the second value and store it into the named field of the first reference on the stack, which better be an instance of Example.

Let's take a more complex example:

package Example;
class Example {
    int a;
    Example(int a) {
        this.a = a;
    }
}

...doesn't seem so? Look at the bytecode:

(I)V
ALOAD 0
INVOKESPECIAL java/lang/Object.<init> ()V
ALOAD 0
ILOAD 1
PUTFIELD example/Example.a : I
RETURN

MAXSTACK = 2
MAXLOCALS = 2

First of all, as you see, constructors are internally void methods with special names. The special code here is ironically the one that we didn't write ourselves: the superconstructor invocation.

Okay, that's it for today! Thanks for reading!

No comments: