Tuesday, February 12, 2013

Announcing Topaz, an RPython powered Ruby interpreter

Hello everyone

Last week, Alex Gaynor announced the first public release of Topaz, a Ruby interpreter written in RPython. This is the culmination of a part-time effort over the past 10 months to build a Ruby interpreter that implements enough of Ruby's interesting constructs to show that the RPython toolchain can produce a Ruby implementation fast enough to beat the existing ones.

Disclaimer

The implementation is still very incomplete, particularly in how much of the standard library is available. We are working on making it usable. If you want to try it, grab a nightly build.

We have run some benchmarks from the Ruby benchmark suite and the metatracing VMs experiment. The preliminary results are promising, but at this point we are missing so many method implementations that most benchmarks won't run yet. So instead of performance, I'm going to talk about the high-level structure of the implementation.

Architecture

Topaz interprets a custom bytecode set. The basics are similar to Smalltalk VMs, with bytecodes for loading and storing locals and instance variables, sending messages, and stack management. Some syntactic features of Ruby, such as defining classes and modules, literal regular expressions, hashes, ranges, etc., also have their own bytecodes. A third kind of bytecode handles control flow constructs in Ruby, such as loops, exception handling, break, continue, etc.
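To make the general shape of such an interpreter concrete, here is a minimal sketch of a stack-based bytecode loop in plain Python. The opcode names, the (opcode, argument) encoding, and the METHODS table are all made up for illustration; they are not Topaz's actual bytecodes or dispatch mechanism.

```python
# Hypothetical opcodes for illustration; Topaz's real bytecode set differs.
LOAD_CONST, LOAD_LOCAL, STORE_LOCAL, SEND, JUMP, JUMP_IF_FALSE, RETURN = range(7)

# Stand-in for Ruby method dispatch: method name -> Python function.
METHODS = {
    "+": lambda receiver, other: receiver + other,
    "<": lambda receiver, other: receiver < other,
}


def interpret(code, consts, num_locals):
    """Run a list of (opcode, argument) pairs on an operand stack."""
    stack = []
    local_vars = [None] * num_locals
    pc = 0
    while True:
        op, arg = code[pc]
        pc += 1
        if op == LOAD_CONST:
            stack.append(consts[arg])
        elif op == LOAD_LOCAL:
            stack.append(local_vars[arg])
        elif op == STORE_LOCAL:
            local_vars[arg] = stack.pop()
        elif op == SEND:
            # The receiver sits below its arguments on the stack.
            name, argc = arg
            args = [stack.pop() for _ in range(argc)][::-1]
            receiver = stack.pop()
            stack.append(METHODS[name](receiver, *args))
        elif op == JUMP:
            pc = arg
        elif op == JUMP_IF_FALSE:
            if not stack.pop():
                pc = arg
        elif op == RETURN:
            return stack.pop()
        else:
            raise ValueError("unknown opcode %r" % (op,))


# i = 0; total = 0; while i < 5: total = total + i; i = i + 1; return total
CONSTS = [0, 5, 1]
PROGRAM = [
    (LOAD_CONST, 0), (STORE_LOCAL, 0),      # 0-1:   i = 0
    (LOAD_CONST, 0), (STORE_LOCAL, 1),      # 2-3:   total = 0
    (LOAD_LOCAL, 0), (LOAD_CONST, 1),       # 4-5:   push i and 5
    (SEND, ("<", 1)), (JUMP_IF_FALSE, 17),  # 6-7:   leave the loop unless i < 5
    (LOAD_LOCAL, 1), (LOAD_LOCAL, 0),       # 8-9:   push total and i
    (SEND, ("+", 1)), (STORE_LOCAL, 1),     # 10-11: total = total + i
    (LOAD_LOCAL, 0), (LOAD_CONST, 2),       # 12-13: push i and 1
    (SEND, ("+", 1)), (STORE_LOCAL, 0),     # 14-15: i = i + 1
    (JUMP, 4),                              # 16:    back to the loop test
    (LOAD_LOCAL, 1), (RETURN, None),        # 17-18: return total
]

result = interpret(PROGRAM, CONSTS, 2)
```

Note that even the arithmetic goes through SEND, which mirrors the Smalltalk-like design mentioned above: there is no dedicated "add" bytecode in this sketch, only message sends.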

In trying to get from Ruby source code to bytecode, we found that the easiest way to support all of the Ruby syntax is to write a custom lexer and use an RPython port of PLY (fittingly called RPly) to create the parser from the Ruby yacc grammar.

The Topaz interpreter uses an ObjectSpace (similar to how PyPy does it) to interact with the Ruby world. The object space contains all the logic for wrapping and interacting with Ruby objects from the VM. Its __init__ method sets up the core classes and initial globals, and creates the main thread (the only one right now, as we do not have threading yet).
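As a rough illustration of what "wrapping and interacting with Ruby objects" means, here is a small sketch of an object space. The class names and the newint/int_w style of method naming follow PyPy conventions and are assumptions here, not a copy of Topaz's actual ObjectSpace.

```python
class W_Root(object):
    """Base class of every wrapped (Ruby-level) object in this sketch."""


class W_Fixnum(W_Root):
    def __init__(self, intvalue):
        self.intvalue = intvalue


class W_String(W_Root):
    def __init__(self, strvalue):
        self.strvalue = strvalue


class ObjectSpace(object):
    """Hypothetical, heavily simplified object space."""

    def __init__(self):
        # A real object space would also build the core classes and the
        # main thread here; this sketch only creates a globals table.
        self.globals = {}

    # Wrapping: interpreter-level value -> Ruby-level object.
    def newint(self, value):
        return W_Fixnum(value)

    def newstr(self, value):
        return W_String(value)

    # Unwrapping: Ruby-level object -> interpreter-level value.
    def int_w(self, w_obj):
        if not isinstance(w_obj, W_Fixnum):
            raise TypeError("expected a wrapped Fixnum")
        return w_obj.intvalue

    def str_w(self, w_obj):
        if not isinstance(w_obj, W_String):
            raise TypeError("expected a wrapped String")
        return w_obj.strvalue


space = ObjectSpace()
w_answer = space.newint(42)
space.globals["$answer"] = w_answer
```

The point of routing everything through the space is that the rest of the VM never needs to know how a particular Ruby object is represented.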

Classes are mostly written in Python. We use ClassDef objects to define the Ruby hierarchy and attach RPython methods to Ruby via ClassDef decorators. These two points warrant a little explanation.

Hierarchies

All Ruby classes ultimately inherit from BasicObject. However, most objects are below Object (which is a direct subclass of BasicObject). This includes objects of type Fixnum, Float, Class, and Module, which may not need all of the facilities of full objects most of the time.

Most VMs treat such objects specially, using tagged pointers to represent Fixnums, for example. Other VMs (for example from the SOM Family) don't. In the latter case, the implementation hierarchy matches the language hierarchy, which means that objects like Fixnum share a representation with all other objects (e.g. they have class pointers and some kind of instance variable storage).

In Topaz, the implementation hierarchy and the language hierarchy are separate. The first is defined through Python inheritance. The second is defined through the ClassDef for each Python class, where the appropriate Ruby superclass is chosen. The diagram below shows how the implementation class W_FixnumObject inherits directly from W_RootObject. Note that W_RootObject doesn't have any attributes, specifically no storage for instance variables and no map (for determining the class - we'll get to that). These attributes are instead defined on W_Object, which is what most other implementation classes inherit from. However, on the Ruby side, Fixnum correctly inherits (via Numeric and Integer) from Object.

This simple structural optimization gives a huge speed boost; VMs that lack it pay for that in performance.
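The separation described in this section can be sketched in a few lines of Python. The ClassDef constructor signature and the ancestors helper below are assumptions made for the example; only the class names W_RootObject, W_Object and W_FixnumObject come from the text above.

```python
class ClassDef(object):
    """Hypothetical stand-in: records a Ruby class name and its Ruby superclass."""

    def __init__(self, name, superclassdef=None):
        self.name = name
        self.superclassdef = superclassdef

    def ancestors(self):
        # Walk the Ruby-side hierarchy, not the Python one.
        names = []
        classdef = self
        while classdef is not None:
            names.append(classdef.name)
            classdef = classdef.superclassdef
        return names


class W_RootObject(object):
    # No map and no instance variable storage at this level.
    classdef = ClassDef("BasicObject")


class W_Object(W_RootObject):
    classdef = ClassDef("Object", W_RootObject.classdef)

    def __init__(self):
        self.map = None      # would describe the instance variable layout
        self.storage = []    # would hold the instance variable values


numeric_def = ClassDef("Numeric", W_Object.classdef)
integer_def = ClassDef("Integer", numeric_def)


class W_FixnumObject(W_RootObject):
    # Python side: inherits from W_RootObject, so it carries only an int.
    # Ruby side: Fixnum < Integer < Numeric < Object < BasicObject.
    classdef = ClassDef("Fixnum", integer_def)

    def __init__(self, intvalue):
        self.intvalue = intvalue


w_five = W_FixnumObject(5)
ruby_ancestors = W_FixnumObject.classdef.ancestors()
```

Here w_five has no map or storage attribute at all, yet its Ruby-level ancestry still passes through Object.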

Decorators

Ruby methods can have symbols in their names that are not allowed in Python method names, for example !, ?, or =, so we cannot simply define Python methods and expose them to Ruby under the same name.

For defining the Ruby method name of a function, as well as argument number checking, Ruby type coercion and unwrapping of Ruby objects to their Python equivalents, we use decorators defined on ClassDef. When the ObjectSpace initializes, it builds all Ruby classes from their respective ClassDef objects. For each method in an implementation class that has a ClassDef decorator, a wrapper method is generated and exposed to Ruby. These wrappers define the name of the Ruby method, coerce Ruby arguments, and unwrap them for the Python method.

Here is a simple example:

@classdef.method("*", times="int")
def method_times(self, space, times):
    return self.strategy.mul(space, self.str_storage, times)

This defines the method * on the Ruby String class. When this is called, the first argument is converted into a Ruby Fixnum object using the appropriate coercion method, and then unwrapped into a plain Python int and passed as argument to method_times. The wrapper method also supplies the space argument.
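To show how such a decorator could work, here is a self-contained sketch. Everything in it is a simplified assumption: the ClassDef internals, the methods table, the args_w calling convention, and the minimal ObjectSpace are invented for the example, and the "int" handling only unwraps a Fixnum instead of running Ruby's coercion protocol.

```python
class W_FixnumObject(object):
    def __init__(self, intvalue):
        self.intvalue = intvalue


class ObjectSpace(object):
    """Minimal stand-in for the object space used by the wrappers."""

    def newstr(self, value):
        return W_StringObject(value)

    def int_w(self, w_obj):
        if not isinstance(w_obj, W_FixnumObject):
            raise TypeError("can't convert argument into Integer")
        return w_obj.intvalue


class ClassDef(object):
    def __init__(self, name):
        self.name = name
        self.methods = {}  # Ruby method name -> generated wrapper

    def method(self, ruby_name, **argspec):
        def decorator(func):
            # Parameter names after the leading (self, space) pair.
            code = func.__code__
            argnames = code.co_varnames[2:code.co_argcount]

            def wrapper(w_self, space, args_w):
                # Argument count check.
                if len(args_w) != len(argnames):
                    raise TypeError("wrong number of arguments (%d for %d)"
                                    % (len(args_w), len(argnames)))
                # Unwrap arguments according to the decorator's spec.
                args = []
                for name, w_arg in zip(argnames, args_w):
                    if argspec.get(name) == "int":
                        args.append(space.int_w(w_arg))
                    else:
                        args.append(w_arg)
                return func(w_self, space, *args)

            self.methods[ruby_name] = wrapper
            return func
        return decorator


class W_StringObject(object):
    classdef = ClassDef("String")

    def __init__(self, value):
        self.value = value

    @classdef.method("*", times="int")
    def method_times(self, space, times):
        # Receives a plain Python int, thanks to the wrapper.
        return space.newstr(self.value * times)


space = ObjectSpace()
star = W_StringObject.classdef.methods["*"]
w_result = star(W_StringObject("ab"), space, [W_FixnumObject(3)])
```

The Python method keeps a legal Python name (method_times), while the Ruby-visible name "*" lives only in the ClassDef's table.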

Object Structure

Ruby objects have dynamically defined instance variables and may change their class at any time in the program (through what Ruby calls a singleton class, which lets each object have unique behaviour). To still access instance variables efficiently, you want to avoid dictionary lookups and let the JIT know about objects of the same class that have the same instance variables. Topaz, like PyPy (which got the idea from Self), implements instances using maps, which transform dictionary lookups into array accesses. See the blog post for the details.
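The following is a minimal sketch of the map idea, not Topaz's implementation: the Map and W_Object classes and their method names are invented for the example, and the part where the map also identifies the object's class is left out. Objects that gain the same instance variables in the same order end up sharing one map, and each object keeps only a flat list of values.

```python
class Map(object):
    """Shared layout description: instance variable name -> storage index."""

    def __init__(self, names=()):
        self.names = tuple(names)
        self.indexes = dict((name, i) for i, name in enumerate(self.names))
        self.transitions = {}  # added name -> successor Map

    def find(self, name):
        return self.indexes.get(name, -1)

    def with_attr(self, name):
        # Reuse the successor map so equal layouts share one Map object.
        if name not in self.transitions:
            self.transitions[name] = Map(self.names + (name,))
        return self.transitions[name]


EMPTY_MAP = Map()


class W_Object(object):
    def __init__(self):
        self.map = EMPTY_MAP
        self.storage = []

    def set_ivar(self, name, value):
        idx = self.map.find(name)
        if idx >= 0:
            self.storage[idx] = value
        else:
            # New variable: move to the successor map and grow the storage.
            self.map = self.map.with_attr(name)
            self.storage.append(value)

    def get_ivar(self, name):
        idx = self.map.find(name)
        if idx < 0:
            return None
        return self.storage[idx]


a = W_Object()
b = W_Object()
for w_obj, x, y in ((a, 1, 2), (b, 10, 20)):
    w_obj.set_ivar("@x", x)
    w_obj.set_ivar("@y", y)
```

In plain Python the find call is still a dictionary lookup. The benefit appears under a tracing JIT: once the map of an object is known in a trace, the index can be treated as a constant and the access reduces to a read from the storage list.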

This is only a rough overview of the architecture. If you're interested, get in touch on #topaz.freenode.net, follow the Topaz Twitter account or contribute on GitHub.

Tim Felgentreff

4 comments:

Shin Guey said...

Interesting. Although I code a lot in python but still quite like Ruby. Am looking forward for a fast ruby...

Unknown said...

Does this mean that JVM is now obsolete?

Anonymous said...

Don't worry. JVM will outlive you and your grandgrandchildren.

smurfix said...

"Its __init__ method", not "It's".