Adding some C++ to your coffee - transpling Java to C++

In this little article, I report from my fascinating journery into the innards of JVM, transpliation, and explore how feasible translating Java to C++ automaticaly is

JVM is fascinating

I have recently began to look a little bit more into JVM, and the Java class format. The format is a mix of pure genius and brilliance, such as the way generics are implemented(There is no special sauce, they are just normal classes), confusing decisions(the way class names are stored, seemingly needless indirection), and pure madness left form the 90s (why on earth do long and double take 2 entries in a table that already has variable length elements, such as strings?). Besides a few interesting surprises, and mistakes I could avoid by just reading the documentation and not skipping the very important section with the little "In retrospect, making 8-byte constants take two constant pool entries was a poor choice." under, parsing the class format turned to be almost trivial.

First attempt, Java Bytecode Interpreter

Out of that curiosity for inner workings of JVM, my first project jbi(JVM Bytcode Interpreter, currently private), was born. I got the implementation to a state in which it could be tested(Was able to run simple programs, calculating factorials and such) and benchmarked. Sadly, it did not fare well in the benchmarks, and had far too many structural issues.

A modern tower of babel, or translating Java To C

Because of those issues, I decided to abandon that project, and focus on something else. Still, it felt kinda wasteful to just let it sit there, and not have anything come out of the time I had already spent working on it. So I deiced to rip out the most broken parts(ones related to the Virtual Machine implementation), and re-purpose the java class importer implementation. At this time the project was called jtc, and generated C code, instead of C++ generated by jtcpp. I naively thought generating C instad of C++ couldn't be so much harder. I was, however, very wrong.

jtc used very, very, hacky C macros to emulate inheretence, and asign slots to virtual functions. Expectantly, the implementation was so brittle it collapsed if you looked at it wrong. Macros were used to select a slot in the vtable. #ifdef V_FUNCTION_NAME detected if the virtual function was overriding a function of the parent class, or defining a new one. But this could be easily messed up by another, unrelated class having a function with the same name. If such class was included at any point, it could make the macros assume a virtual function that wasn't inherited from the parent class, was inherited, and mess up the vtable compleatly. Needles to say, if the implementation of inhertence is broken simply by other classes being present, it is probably not a very good idea to keep it around. So I decided to stop trying to patch the shaky tower of incomprehensible code taped together with magic-macros, and just generate C++ like God intended.

Java To C++

Converting JVM bytecode C++ was far easier to do than the other way around. This also made me aware of a couple other issues that I had never thought to consider. The generated C++ code looks nothing like C++ a human would write, but it at least works.

Header files

This is how a header file generated by jtcpp looks like

    #pragma once
    #include "java_cs_lang_cs_Object.hpp"
    struct Vector3: public  java::lang::Object
    {
    virtual ~Vector3() = default;
    	float x;
    	float y;
    	float z;
    	static void _init___V(ManagedPointer<Vector3>);
    	static void _init__FFF_V(ManagedPointer<Vector3>,float,float,float);
    	static float Distance_Vector3_Vector3__F(ManagedPointer<Vector3>,ManagedPointer<Vector3>);
    	static ManagedPointer<Vector3> Random__Vector3_();
    	virtual float SqrMag__F();
    	virtual float Magnitude__F();
    	virtual ManagedPointer<Vector3> clone__Vector3_();
    	virtual ManagedPointer<Vector3> Add_Vector3__Vector3_(ManagedPointer<Vector3>);
    	virtual ManagedPointer<Vector3> Sub_Vector3__Vector3_(ManagedPointer<Vector3>);
    	virtual ManagedPointer<Vector3> Normalize__Vector3_();
    	virtual ManagedPointer<Vector3> Mul_F_Vector3_(float);
    	virtual ManagedPointer<java::lang::Object> clone__java_cs_lang_cs_Object_();
    };

You may quickly see some weird things. So, lets go trough them one-by-one.

First of all, the class converted from Java are not classes but structs. Why? It is actually pretty simple. Only real difference between them is that every member of a struct is public by default. The Java compiler already checked to ensure public, protected and private fields, functions and objects are used properly. Because of that, I can just assume that whatever Java byte-code does is OK, and the intended behaviour. Handling accessibility modifiers would just make everything far more completed. The problems that not handling those modifiers could introduce are already solved by the Java compiler for me, so why bother?

If you haven't seen any C++ code before, the public thingy majiggy before java::lang::Object may seem strange. It just makes the public members of the parent class remain public. If it was not there, they would become private, which is not the intended behaviour.

The virtual destructor (virtual ~Vector3() = default;) helps with some modes of the GC(Yes, the GC has a couple different modes!), and is the same as the default destructor(If you haven't guessed this is is exacly what = default part means).

Function names may seem VERY wierd at first. Part of them make sense(like Distance) or (Magnitude), but the rest seem like garbled mess. What is their purpose? They help with function overloading. But does not C++ already support function overloading? Yes, but it differs slightly from the way Java does stuff. And this very, very, slight difference is enough to make it sometimes break. The names are not very pretty, but they are very predictable, since they encode the function signature. By looking at functions like virtual ManagedPointer<Vector3> Mul_F_Vector3_(float); you can pretty easily deduce how they work(although init functions are special snowflakes that like to be diffrent, and break from this convention).

Another curious little thing is the ManagedPointer template. Why didn't I just use normal pointers? In some configurations of the GC, this is exactly what ManagedPointer is: just a normal pointer with a fancy name. But in other GC modes, they can allow us to do some crazy stuff, like clearing objects early, reducing memory consumption by orders of magnitude, and making GC pauses far more rare.

I had almost forgot about the #include "java_cs_lang_cs_Object.hpp" part. Why this strange file name? Original, java class Vector3 derives from is java/lang/Object. I could try to just use that as the path, and put each namespace in a separate directory, but it would force me to do far messier relative includes, so I opted to temporally stick with mangled class names. To get a mangled class name, I simply replace each / with _cs_(short for Class Separator). It is not the prettiest solution, but it works very relay ably.

Source files

This is how a source file generated by jtcpp looks like.

    #include "Vector3.hpp"
    ManagedPointer<Vector3> Vector3::Add_Vector3__Vector3_(ManagedPointer<Vector3> l1a){
    	ManagedPointer<Vector3> l0a = managed_from_this(Vector3);
    	bb0:
    	{
    		float i0 = l0a->x;
    		float i1 = l1a->x;
    		float i2 = i0+i1;
    		l0a->x = i2;
    		float i3 = l0a->y;
    		float i4 = l1a->y;
    		float i5 = i3+i4;
    		l0a->y = i5;
    		float i6 = l0a->z;
    		float i7 = l1a->z;
    		float i8 = i6+i7;
    		l0a->z = i8;
    		return l0a;
    	}
    }

Good Lord! What is Happening in There?! The generated C++ code may seem like some sort of arcane magic, but it has a very, very good reason to look like that. To clear everything up, lets run jtcpp with debug info enabled!

    float Vector3::SqrMag__F(){
    	ManagedPointer<Vector3> l0a = managed_from_this(Vector3);
    	float l1f;
    	bb0:
    	{
    		//ALoad(0)
    		//FGetField(ClassInfo { cpp_class: "Vector3" }, "x")
    		float i0 = l0a->x;
    		//ALoad(0)
    		//FGetField(ClassInfo { cpp_class: "Vector3" }, "x")
    		float i1 = l0a->x;
    		//FMul
    		float i2 = i0*i1;
    		//ALoad(0)
    		//FGetField(ClassInfo { cpp_class: "Vector3" }, "y")
    		float i3 = l0a->y;
    		//ALoad(0)
    		//FGetField(ClassInfo { cpp_class: "Vector3" }, "y")
    		float i4 = l0a->y;
    		//FMul
    		float i5 = i3*i4;
    		//FAdd
    		float i6 = i2+i5;
    		//ALoad(0)
    		//FGetField(ClassInfo { cpp_class: "Vector3" }, "z")
    		float i7 = l0a->z;
    		//ALoad(0)
    		//FGetField(ClassInfo { cpp_class: "Vector3" }, "z")
    		float i8 = l0a->z;
    		//FMul
    		float i9 = i7*i8;
    		//FAdd
    		float i10 = i6+i9;
    		//FStore(1)
    		l1f = i10;
    		//FLoad(1)
    		//FReturn
    		return l1f;
    	}
    }

That looks even scarier! But if you look at the comments, it explains nearly everything. jtcpp simply translated each java opcode individually. This makes the resulting code less readable to humans, but it guarantees it to be very accurate! There are still a couple funky things in there I would like to explain.

managed_from_this - this code snippet seems really weird. What does it actually do? Depending on GC settings, either nothing, or obtains some additionally info about the object. In on default GC settings, ManagedPointer behaves similarly to a rust Arc, with one small exception. It can additionally use Bohem GC to delete objects which are cyclicaly referencing each other.

bb0: - jtcpp uses gotos to emulate Java instructions. bb0 simply means basic block 0. Separating operations into those basic, isolated blocks is something I do to prevent numerous errors. One such error is skipping an initialisation of a variable. This is not premited in C++, because then a garbage value would be dropped. This could lead to you trying to free some random memory address, which is generally frowned upon.

iwhatever - intermediate variable number whatever. Exist because JVM is stack based, and this represents such value on the stack. In the Rust codegen I use a stack of variable names and types. This makes producing accurate code very easy.

lwhateversometing - function local variable number whatever, with type kind postfix someting. Type kind is not the specific type of something, but a more general idea. So integer gets a postifx i, float f, long l, and so on, with arrays and objects all having postifx a This is related to way those values are loaded/stored using JVMs opcodes, see FLoad(number) and ALoad(number)

A very, very quick word on GC

I could talk on this, or any other project of mine, for hours on end, but I am quickly approaching both the bottom of my last coup of tea, and the character limit. So, to all the 3 people still reading, hang on there. We are going to quickly come to the end as fast as possible.

here is the quick summary of the GC model of jtcpp

It can be changed in config.hpp in the target directory.
Includes 4 modes: No GC(The I bought 32 GB of RAM, I am going to use 32GB of RAM! mode), Arc-like GC(potentaily leaky, fastest, lowest memory footprint if used correctly), pure BohemGC(Good enough for most cases), Mixed-mode GC(The default. Combnation of the Arc like GC and BohemGC, should be the best of both worlds. Mixing Arcs and Bohem is not tested well enough, and could potentially lead to crashes if I understood something very wrong).

Performance

Performance varies from 2.5-3x times original Java in compute heavy tasks, to 30x slower in GC-heavy tasks. Why such performance loss? jtcpp evaluates everything on an individual level. Classes,methods,ops are all analysed one by one, so jtcpp is not capable of some kind of reasoning about lifetimes of local variables passed as arguments to other functions. It does not know that a Vector3 is used temporarily, and could be, for example, allocated on stack instead of on the heap. Java knows that and uses that info to the full potential.