CBBrowne Computing

1. About Bytecode

Contrary to the hype surrounding Java and MONO, bytecode compilation is hardly a new thing. It dates back to the days of BCPL and Pascal, perhaps further.

The general idea is that you take code written in some high level language, and rather than compiling it into "native" code for a specific hardware architecture, you compile it into a sort of "virtual assembly language," the instruction set for some sort of generic processor. This has quite a number of merits:

  • The code becomes somewhat more 'opaque', which is good for those who want to distribute proprietary software written in scripting languages like Perl or Python;

  • Code is parsed by the "bytecompiling" process and is transformed into some form that may be read in quickly without a need for complex parsing;

  • By removing whitespace and the like, there is sometimes a saving of space as compared to the source code form (this is typical with ELisp code).

    More importantly, there is almost always a huge savings of space as compared to compiling to machine code.

    For example, the calendrica code compiles in various forms to the following sizes:

    Table 1. Compiling calendrica.lisp

    File                   Form                             Size (bytes)
    calendrica.lisp        Source Code                            170347
    calendrica.x86f        CMUCL Machine Code                     472649
    calendrica.lbytef      CMUCL Bytecompiled                      87660
    calendrica.fas         CLISP Bytecode                         190873
    calendrica.lbytef.gz   CMUCL Bytecompiled, compressed          34941
    calendrica.fas.gz      CLISP Bytecode, compressed              30290
     

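    The critical comparison here is that the bytecoded forms are a whole lot smaller than the roughly 472K of calendrica.x86f: the CMUCL bytecompiled file is under a fifth of that size, and the CLISP bytecode about two fifths.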

    It is far more difficult to measure this, but bytecode is also likely to be stored more compactly in memory than machine code. This is one of the purposes of the way CMUCL combines native compilation with a bytecode compiler: code that is executed a lot will benefit from compilation to native code, whilst by bytecode-compiling those parts of a system that are seldom executed, substantial memory savings are attained. The compactness here comes from the fact that the "machine language" is designed not for the computer hardware, but rather for the application.

  • Hand in hand with the diminished size come convenience of implementation and improved computational efficiency.

    All three walk in together as joint merits of designing a "computational engine" specifically for the application. 

    • Consider that if the application is intended to process strings, it makes sense to have strings as basic data types. Parrot has "string" operations length, concat, repeat, tostring, which work with strings far more conveniently than operators you would get with "real machine language." That convenience can make it easier to write compact, efficient code.

      Expanding this to "real machine code" would increase the size of the code considerably.

    • A simulated "virtual machine" can be manipulated in ways that would be prohibitively complex to do on "bare hardware." For instance, in the Parrot system, it is easy enough to save sets of registers by pushing a few pointers onto a stack. On "bare hardware," the equivalent behaviour requires pushing a whole set of registers into memory locations. This has the unexpected result that bytecoding can, here and there, actually be faster than coding to bare hardware.

      Bytecode machines have traditionally been stack-oriented machines, where objects would be drawn in and out of memory onto a stack where they would then be processed.

      The Parrot virtual machine is a little different, having a register architecture with four sets of 32 registers, one set for each of four data types: integers, floats, strings, and Parrot Magic Cookies. The Parrot developers figure that this will lead to less stack thrashing.

    • It is convenient to create operations that do extremely complex processing.

      Such operations will provide a compact representation for something that is complex, which reduces the size of a program; they also substantially improve performance by allowing a lot of work to be done within the optimized code of the "virtual machine simulator."

      The classic, and arguably mistaken, example of this is the CRC operation on the old VAX architecture. Calculating CRC checksums and evaluating polynomials are wonderful examples of "extremely complex processing." Rather a lot of microcode silicon was likely consumed on these operations, and few compilers made use of them. At least not the C compiler! As a result, code implemented in C, which includes popular bytecoded language interpreters, is unlikely to use these operations!

      In an application where you expect to calculate a lot of polynomials, a POLY operator will certainly be of great value, as would, very likely, a whole set of matrix math operators.

      CLISP is known for having unusually good performance when processing BIGNUMs (quasi-infinite precision integers). Other Common Lisp implementations tend to beat its pants off when working with small integers when they can render code into native 32-bit arithmetic operations, as you might find with crypto applications, but once you cross the line to the BIGNUM, all the implementations wind up invoking function calls, and behave little differently from a bytecode interpreter. CLISP has an unusually good BIGNUM library, and so works better than many others in this area of strength.

      As for the CRC function on the VAX being a "mistake," it's a mistake when it consumes silicon on the CPU that would have better been used for something else, and then remains unused when your favorite compilers don't use it. The same is not true for rarely-used bytecode instructions. If there are 160000 gates on a CPU that aren't being used, that feels wasteful. If there is 16K of code in the bytecode interpreter that never gets used, and perhaps never even gets paged into memory, the waste is nowhere near as painful.

      In the hardware world, RISC may have become "king," in that it allows silicon to be devoted to having more registers and in improving the ability to execute code in parallel. In a bytecode interpreter, CISC is virtually always a win. 
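The point about compound operations can be made concrete with a minimal sketch in Python of a toy stack-oriented bytecode interpreter. Nothing in it is taken from Parrot, CLISP, or CMUCL: the run function and the opcode names (PUSH, LOAD, ADD, MUL, POLY) are invented purely for illustration. It evaluates the polynomial 3x^2 + 2x + 1 twice, once with primitive instructions only and once with a single hypothetical POLY instruction, and counts how many instructions the interpreter loop had to dispatch.

```python
# Toy stack machine. All opcode names here are hypothetical, invented for
# this sketch; they are not Parrot, CLISP, or CMUCL instructions.

def run(program, env):
    """Execute a list of (opcode, operand) pairs.

    Returns (result, dispatched), where dispatched counts how many
    instructions the interpreter loop had to decode.
    """
    stack = []
    dispatched = 0
    for opcode, operand in program:
        dispatched += 1
        if opcode == "PUSH":          # push a literal constant
            stack.append(operand)
        elif opcode == "LOAD":        # push the value of a named variable
            stack.append(env[operand])
        elif opcode == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif opcode == "POLY":
            # "CISC"-style instruction: pop x, then evaluate a whole
            # polynomial by Horner's rule. The operand lists the
            # coefficients, highest degree first.
            x = stack.pop()
            acc = 0
            for coefficient in operand:
                acc = acc * x + coefficient
            stack.append(acc)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return stack.pop(), dispatched


# 3x^2 + 2x + 1 with primitive instructions only: ((3 * x) + 2) * x + 1
PRIMITIVE = [
    ("PUSH", 3), ("LOAD", "x"), ("MUL", None),
    ("PUSH", 2), ("ADD", None),
    ("LOAD", "x"), ("MUL", None),
    ("PUSH", 1), ("ADD", None),
]

# The same polynomial using the single compound instruction.
COMPOUND = [("LOAD", "x"), ("POLY", [3, 2, 1])]

if __name__ == "__main__":
    print(run(PRIMITIVE, {"x": 4}))   # (57, 9)
    print(run(COMPOUND, {"x": 4}))    # (57, 2)
```

Both programs produce 57 for x = 4, but the first takes nine trips through the dispatch loop and the second takes two, and the second program is also the shorter one to store. In an interpreter written in a compiled language, the loop inside POLY would run as native code, which is where the performance gain described above would come from; this Python sketch only illustrates the difference in instruction count.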

 

1.1. MONO

The MONO project represents a free software implementation of a number of components of the Microsoft .NET "platform," notably including a C# Compiler, CLI Runtime, and class libraries.


1.2. Why MONO?

There have been some rather hysterical reports and theories about the relationship between MONO, GNOME and Microsoft. Many are quite wild, with rather incoherent ideas as to why anyone would have thought it sensible to implement MONO.

Contrary to some of the wild theories floating around on Slashdot, the reasoning has little to do with "using Microsoft code," or Microsoft Passport authentication, or anything else of the sort.

The real reasoning has to do with language. Microsoft is implementing all sorts of things as "part of .NET;" the parts MONO is looking at are:

 

  • A dynamic language

    The big-name Ximian project is the email-and-stuff application Evolution.

    The code for it is written in C, and apparently whopping huge portions of it consist of memory management code, which, in C, must be done quite manually.

    Using a more dynamic language offering garbage collection means not having to write hordes of malloc() and free() calls, which would allow an application like Evolution to be both smaller and more easily and quickly written.

    Java offers garbage collection, and so is one possible answer in this regard. So too are languages such as Lisp, Smalltalk, Eiffel, and Modula-3.

  • A bytecoded (perhaps JIT-compiled) platform to provide some independence of platform.

    This also would disconnect application code somewhat from the deep details of the many C-based libraries of GNOME. Apparently the not-always-organized growth of libraries in GNOME has led to it becoming somewhat difficult to make concurrent use of many of the services offered.

    Again, Java offers a "JVM." A number of other languages offer language-specific bytecoding schemes that somewhat parallel this.

  • Language- and platform-independence

    One of the important characteristics of the GNOME project is that it intends to be relatively agnostic about what languages are used (in contrast with the somewhat C++-partisan KDE and Objective-C-partisan GNUStep).

    The various "bytecode execution machines" that are presently available are generally not terribly friendly to the use of multiple languages. JVM is for Java, for instance.

    There is some "never-accomplished Holy Grail" to this; witness UNCOL.

In effect, MONO represents something rather like the "Java platform," except that it is specifically intended to be language neutral.

Here are some links to interviews and commentary from sundry GNOME folk about what they're about: 

linux users conflict resolution seminar