Saturday, June 2, 2012

Under the Hood: Creating a class with ASM

Now, let's get serious! We know how ClassLoaders are responsible for getting the binary class data to load Java classes, and we know how that binary class data actually looks. What's missing is a way to effectively modify that data, without worrying too much about the low-level details of data storage in a class file.
And that's where ASM comes into play. ASM is one library (BCEL is another) for bytecode manipulation, and it seems to be the "tool of choice" out there. That means it's relatively active in development, yet mature and usable. Let's dive straight into a code example for creating (not modifying) a class with ASM:

public class AsmBuilder implements Opcodes {
    public static final byte[] HelloWorld = HelloWorld();
  
    private static byte[] HelloWorld() {
        ClassWriter cw = new ClassWriter(0);
      
        cw.visit(V1_6, ACC_PUBLIC | ACC_SUPER, "net/slightlymagic/asm/test/HelloWorld", null, "java/lang/Object",
                null);
        cw.visitSource("<generated>", null);
      
        HelloWorld_init(cw);
        HelloWorld_main(cw);
        HelloWorld_hello(cw);
      
        return cw.toByteArray();
    }
  
    private static void HelloWorld_init(ClassWriter cw) {
        MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "<init>", "()V", null, null);
        mv.visitCode();
        mv.visitVarInsn(ALOAD, 0);
        mv.visitMethodInsn(INVOKESPECIAL, "java/lang/Object", "<init>", "()V");
        mv.visitInsn(RETURN);
        mv.visitMaxs(1, 1);
        mv.visitEnd();
    }
  
    private static void HelloWorld_main(ClassWriter cw) {
        MethodVisitor mv = cw.visitMethod(ACC_PUBLIC | ACC_STATIC, "main", "([Ljava/lang/String;)V", null, null);
        mv.visitCode();
        mv.visitTypeInsn(NEW, "net/slightlymagic/asm/test/HelloWorld");
        mv.visitInsn(DUP);
        mv.visitMethodInsn(INVOKESPECIAL, "net/slightlymagic/asm/test/HelloWorld", "<init>", "()V");
        mv.visitMethodInsn(INVOKEVIRTUAL, "net/slightlymagic/asm/test/HelloWorld", "hello", "()V");
        mv.visitInsn(RETURN);
        mv.visitMaxs(2, 1);
        mv.visitEnd();
    }
  
    private static void HelloWorld_hello(ClassWriter cw) {
        MethodVisitor mv = cw.visitMethod(ACC_PUBLIC, "hello", "()V", null, null);
        mv.visitCode();
        mv.visitFieldInsn(GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
        mv.visitLdcInsn("Hello, World");
        mv.visitMethodInsn(INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
        mv.visitInsn(RETURN);
        mv.visitMaxs(2, 1);
        mv.visitEnd();
    }
}

What you see here is a series of visitSomething() calls on a ClassWriter (which implements ClassVisitor) and the MethodVisitors created from it. The Opcodes interface impelemented just gives us easy access to some useful constants. Before elaborating on the Visitor pattern next time, let's go through the calls:

ClassVisitor.visit() gets all data that is necessary once for a class definition; that includes class version, access flags like public, class name, superclass name, signature (for generic classes like List<T>) and an array of implemented interfaces. visitSource() gets one piece of optional information; the source file name, plus a debug string that I know nothing about.

Now to the interesting method instructions: visitMethod() again takes some data that is needed once for each method: access flags, name, method descriptor, signature, and declared exceptions. visitCode() simply indicates that the actual instructions follow, as opposed to the optional annotations.
visitVarInsn(ALOAD, 0) takes an opcode that works with local variables (either LOAD or STORE) of a specific type (a reference in this case; don't ask me why the prefix is A) and takes the index of a local variable; as I said earlier, 0 is the implicit this reference for nonstatic methods. So, in plain words, this instruction pushes this onto the stack.
visitMethodInsn(INVOKESPECIAL, "java/lang/Object", "<init>", "()V") invokes a method. In this case, because it's a constructor (see the name), using the INVOKESPECIAL opcode. There are four other INVOKE opcodes, but we'll only see two others in this example. The next argument is the owner, i.e. the class containing the called method, followed by name and descriptor. If you haven't guessed it, this is a super() call, invoked on this. (Ironically, writing this.super(); in Java is illegal, yet the implicit reference is needed of course)
Last but not least, there's visitInsn(RETURN). Unlike visit*Insn(), visitInsn() takes no additional arguments, because the opcodes supported by this instruction simply don't need any. RETURN is only one of many RETURN opcodes for void methods (including constructors).
Finally, there's visitMaxs(1, 1) and visitEnd(). The maxs are stack and local. stack is 1, because there is at most one entry on the stack, namely this after ALOAD 0. And the sole local variable is, again, this, ammounting for 1 again.

A look at main([Ljava/lang/String;)V: We have a new type of instruction again! visitTypeInsn(NEW, "net/slightlymagic/asm/test/HelloWorld") takes an opcode that needs a class name. In this case, we allocate a new object (of our class) on the heap.
visitInsn(DUP) duplicates the top element on the stack, i.e. the reference.
visitMethodInsn(INVOKESPECIAL, "net/slightlymagic/asm/test/HelloWorld", "<init>", "()V") invokes the constructor on our new object. Think about it: the constructor has an implicit argument that must be on the stack prior to calling, and in the case of additional parameters, these have to come even after the new reference, so there's essentially no way to integrate the constructor call into the NEW opcode. (Well, of course you could, but why complicate the VM by introducing a different order of arguments for constructor calls if you can make it a really regular INVOKESPECIAL call simply by splitting object creating into two opcodes?) And here, you have the reason for the previous DUP: <init> is void, but we want to use the object afterwards, so we have to keep the reference on the stack some other way.
visitMethodInsn(INVOKEVIRTUAL, "net/slightlymagic/asm/test/HelloWorld", "hello", "()V") invokes hello(). INVOKEVIRTUAL is probably the most common INVOKE opcode. It's used for all nonstatic, non-constructor, non-interface methods (and now, if you wonder what the fifth variant, as hinted above, is: it's INVOKEDYNAMIC, which is a new feature targeted at dynamic JVM languages, not Java itself). The big point about virtual method invocation is overriding. The VM searches for an overridden variant of the called method, which is not necessarily that of the owner class given in the instruction.
visitInsn(RETURN); visitMaxs(2, 1); visitEnd(); there were two entries on the stack after DUP, and there is one parameter [Ljava/lang/String; in the main method.

Now hello()V: visitFieldInsn(GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;"), yet another type of instruction. visitFieldInsn() supports the four logically named opcodes GETFIELD, GETSTATIC, PUTFIELD and PUTSTATIC. The nonstatic variants take a reference from the stack for obvious reasons. Then there's an owner, a field name, and a field descriptor. Unlike LOAD and STORE, where there's no descriptor, but different opcodes for different types.
visitLdcInsn("Hello, World"), where LDC stands for "load constant". The argument is of type Object, but must be a primitive wrapper like Integer, or String.
visitMethodInsn(INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V") is our first method invocation with real arguments! I hope you can follow what's currently on the stack; there's a PrintStream (System.out) and a String ("Hello, World"), and now we're invoking a nonstatic method of PrintStream, taking a String as an argument. Knowing the order of elements on the stack is important when trying to modify bytecode; obviously, when you insert instructions, you have to leave the stack as you found it to have the following code work correctly (and don't forget to tamper with maxStack afterwards!)
visitInsn(RETURN); visitMaxs(2, 1); visitEnd(); there were two entries on the stack after LDC, and the method is not static.

Now, to make this post longer, and not make you wait on how to invoking your new class, let's look at the classloading part. we have byte[] containing the class and need a Class from it. Take a look at this simple custom ClassLoader:

public class DynamicClassLoader extends ClassLoader {
    private final HashMap<String, byte[]> bytecodes = new HashMap<String, byte[]>();
   
    public DynamicClassLoader() {
        super();
    }
   
    public DynamicClassLoader(ClassLoader parent) {
        super(parent);
    }
   
    public void putClass(String name, byte[] bytecode) {
        bytecodes.put(name, bytecode);
    }
   
    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        //remove here, because the class won't be loaded another time anyway
        byte[] bytecode = bytecodes.remove(name);
        if(bytecode != null) {
            return defineClass(name, bytecode, 0, bytecode.length);
        } else {
            return super.findClass(name);
        }
    }
}


... and take a look at that simple main method:

public static void main(String[] args) throws Exception {
    DynamicClassLoader l = new DynamicClassLoader();
    l.putClass("net.slightlymagic.asm.test.HelloWorld", AsmBuilder.HelloWorld);
   
    Class<?> HelloWorld = l.loadClass("net.slightlymagic.asm.test.HelloWorld");
    HelloWorld.getMethod("main", String[].class).invoke(null, (Object) args);
}


Of course, we can't plainly reference a class that isn't present at compile time, but we can use reflection to test it. Of course, normally the generated class would implement some interface so we can store it in a variable other than an Object, and make regular method calls through the interface.

No comments: