Awesome
rhine: a C++ compiler middle-end for a typed ruby
rhine is designed to be a fast language utilizing the LLVM JIT featuring N-d tensors, first-class functions, and type inference; specifying argument types is enough. It has a full blown AST into which it embeds a UseDef graph.
rhine started off as rhine-ml, and rhine-ml was called rhine earlier.
- Effort put into rhine-ml: 2 months
- Effort put into rhine: 1 year, 1 month
Language Features
def bar(arithFn Function(Int -> Int -> Int)) do
println $ arithFn 2 4
end
def addCandidate(alpha Int, beta Int) do
ret $ alpha + beta
end
def subCandidate(gamma Int, delta Int) do
ret $ gamma - delta
end
def main() do
if false do
bar addCandidate
else
bar subCandidate
end
mu = {{2}, {3}}
println mu[1][0]
end
Int
is a type annotation, and only argument types need to be annotated,
return type is inferred. Function(Int -> Int -> Int)
is a function that takes
two integers and returns one integer, mixing in some Haskell syntax. $
is
again from Haskell, which is basically like putting the RHS in parens.
rhine-ml, in contrast, has arrays, first-class functions, closures, variadic arguments, macros. It's also much less buggy.
The recursive-descent parser
rhine uses a handwritten recursive-descent parser, which is faster and reports better errors, than the former Bison one. You will need to use a one-token lookahead atleast, if you want to keep the code simple. This gives you one level of:
parseSymbol(); // Oops, the lexed token indicates that we're not in the right
// function
parseInstruction(); // Ask it to use an existing token, not lex a new one
Another minor consideration is that newlines must be handled explicitly if you want to substitute ; with a newline in the language.
void Parser::getTok() {
LastTok = CurTok;
CurTok = Driver->Lexx->lex(&CurSema, &CurLoc);
LastTokWasNewlineTerminated = false;
while (CurTok == NEWLINE) {
LastTokWasNewlineTerminated = true;
CurTok = Driver->Lexx->lex(&CurSema, &CurLoc);
}
}
The AST
The AST is heavily inspired by LLVM IR, although it has some higher-level
concepts like Tensor
. It's an SSA and has a UseDef graph embedded in it,
making analysis and transformation easy.
The main classes are Type
and Value
. All types like IntType
, FloatType
inherit from Type
, most of the others inherit from Value
. A BasicBlock
is
a Value
, and so is ConstantInt
.
A BasicBlock
is a vector of Instruction
, and this is how the AST is an SSA:
assignments are handled as a StoreInst
; there is no real LHS, just RHS
references.
StoreInst::StoreInst(Value *MallocedValue, Value *NewValue);
UseDef in AST
Value
is uniquified using LLVM's FoldingSet
, and Use
wraps it, so we can
replace one Value
with another.
/// A Use is basically a linked list of Value wrappers
class Use {
Value *Val;
Use *Prev;
Use *Next;
// Laid out in memory as [User] - [Use1] - [Use2]. Use2 has DistToUser 2
unsigned DistToUser;
};
An Instruction
is a User
. User
and its Use
values are laid out
sequentially in memory, so it's possible to reach all the Use
values from the
User
. It's also possible to reach the User
from any Use
, using
DistToUser
.
class User : public Value {
protected:
unsigned NumOperands;
};
class Instruction : User;
The User
has a custom new
to allocate memory for the Use
instances
as well.
void *User::operator new(size_t Size, unsigned Us) {
void *Storage = ::operator new (Us * sizeof(Use) + Size);
auto Start = static_cast<Use *>(Storage);
auto End = Start + Us;
for (unsigned Iter = 0; Iter < Us; Iter++) {
new (Start + Iter) Use(Us - Iter);
}
auto Obj = reinterpret_cast<User *>(End);
return Obj;
}
};
The Context
The Context is a somewhat large object that keeps the uniqified Type
and
Value
instances. It also keeps track of Externals
, the external C functions
that are provided as part of a "standard library". Unique llvm::Builder
and
llvm::Context
objects, as well as the DiagnosticPrinter
are exposed member
variables. Finally, it is necessary for symbol resolution, and keeps the
ResolutionMap
.
Symbol resolution
src/Transform/Resolve is an example of something that utilizes the UseDef embedded in the AST.
B = A + 2
creates one UnresolvedValue
, A
, an AddInst
, and a MallocInst
,
which takes the string "B" and AddInst
as operands.
The transform basically goes over all the Instruction
in the BasicBlock
,
resolves UnresolvedValue
instances, and sets the Use
to the resolved value.
It hence replaces the Value
underneath the Use
, and since the Instruction
is referencing Use
instances, there are no dangling references.
if (auto S = K->Map.get(V, Block)) {
/// %S = 2;
/// ^
/// Came from here (MallocInst, Argument, or Prototype)
///
/// Foo(%S);
/// ^
/// UnresolvedValue; replace with %Replacement
if (auto M = dyn_cast<MallocInst>(S)) {
if (dyn_cast<StoreInst>(U->getUser()))
U.set(M);
}
}
Type Inference
Type Inference is too simple. One visit
function is overloaded for all
possible Value
classes.
Type *TypeInfer::visit(MallocInst *V) {
V->setType(visit(V->getVal()));
assert(!V->isUnTyped() && "unable to type infer MallocInst");
return VoidType::get(K);
}
Building
The desired directory structure is:
bin/ ; if you downloaded the tarball for this
cmake
ninja
flex
src/
rhine/
README.md
llvm/ ; git submodule update --init to get the sources
llvm-build/
bin/
llvm-config ; you need to call this to build
rhine-build/
rhine ; the executable
On an OSX where you have everything:
$ brew install flex
$ brew link --force flex
$ git submodule update --init
$ cd llvm-build
# rhine is buggy; without debugging symbols, you can't report a useful bug
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug ../llvm
$ export PATH=`pwd`/bin:$PATH
$ cd ../rhine-build
$ cmake -GNinja ..
# this will run the packages unittests, which should all pass
$ ninja check
On a Linux where you have nothing (and no root privileges are required):
Get git-lfs, and fetch cmake-ninja-flex.tar.bz2
$ git lfs fetch
Untar it and set up environment variables.
$ tar xf cmake-ninja-flex.tar.bz2
$ cd cmake-ninja-flex
# for bash/zsh
$ export TOOLS_ROOT=`pwd`
$ export PATH=$TOOLS_ROOT:$PATH
# for csh
$ setenv TOOLS_ROOT `pwd`
$ setenv PATH $TOOLS_ROOT:$PATH
Then,
$ git submodule update --init
$ cd llvm-build
# rhine is buggy; without debugging symbols, you can't report a useful bug
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug ../llvm
$ ninja
$ export PATH=`pwd`/bin:$PATH
$ cd ../rhine-build
# flex isn't picked up from $PATH
$ cmake -GNinja -DTOOLS_ROOT=$TOOLS_ROOT -DFLEX_EXECUTABLE=$TOOLS_ROOT/flex ..
# if there are build (usually link) errors, please open an issue
# tests are currently failing on Linux, need to look into it
$ ninja check
Commentary
An inefficient untyped language is easy to implement. println
taking 23 and
"twenty three" as arguments is a simple matter of switching on
type-when-unboxed. There's no need to rewrite the value in IR, and certainly no
need to come up with an overloading scheme.
Crystal made a good decision to start with Ruby. If your idea is to self-host, then the original language's efficiency does not matter. All you need is good generated assembly (which LLVM makes easy).