execute(do more inlining, extended dead code elimination, increased the number of cases where overflow checks can be eliminated, and made object instantiation fast in more cases. Here, I'll explain what the optimizations are and how they're implemented.
call( -- )and
execute( -- )are words which let you call quotations or execute words. Slava explained them in a blog post. They differ from
executein that they don't require that the word or quotation is available through combinator inlining. But they require an explicit stack effect to be given, to ensure that it takes and returns the right number of parameters. This is nice, because the versions with a stack effect have an additional safety property: they'll only run with if the code has the right stack effect.
Until now, these combinators carried with them a perfomance penalty over using
executewith known quotations. The penalty is that the stack effect of the quotation must be checked, at runtime, to match the stack effect of the callsite. With an optimization I implemented, the performance is the same when the quotation is known. With matching performance in this case (they're both completely free in the case where either would work), it should be easier to write code that uses the checked versions.
For example, the following implementation of an absolute value function compiles down to the same code that you'd write with the normal
:: iff ( x cond true-quot false-quot -- x' )
[ x true-quot call( x -- x' ) ]
[ x false-quot call( x -- x' ) ] if ; inline
: abs ( n -- abs(n) )
dup 0 < [ neg ] [ ] iff ;
This is implemented as part of the sparse conditional constant propagation (SCCP) pass in Factor's compiler.
execute-effecthave custom inlining behavior there, which takes advantage of information collected by the propagation pass. If enough is known about the quotation at this time (if it is a literal quotation, or a literal quotation with something (literal or non-literal) curried on to it, or two literal quotations composed together) so that the stack effect can be inferred and the code can be inlined, then it is inlined.
I could have implemented this as a transform in the stack checker, but this strategy gives a stronger optimization, since it can interact with everything in constant propagation. For example, it interacts with method inlining. This will help improve performance in the Factor Smalltalk implementation, where previously combinator inlining would have been impossible without special support from the Smalltalk-to-Factor translator.
Eliminating overflow checks
In Factor, unlike C and Java, calculations on integers never overflow. Instead, numbers that are too big are converted to a representation that scales to arbitrary size. The smaller integers which are faster to calculate on are called "fixnums" and the larger ones, which are slower to use, are called "bignums"
Factor's compiler does a lot of work to convert general arithmetic to arithmetic on fixnums when possible. One thing the compiler does is try to infer that integers will be in a small enough range that no overflow can happen. This is part of the SCCP pass.
Another compiler pass checks if a value is only used in places where overflowing doesn't matter. For example, if the code
+ >fixnumis run in a context where it's known that its arguments will both be fixnums, then it doesn't matter if an overflow check is done because of how modular arithmetic works.
I extended this pass to take into account a couple other words that have this property that
>fixnumhas. Specifically, the words which set memory at a given pointer address, like
set-alien-unsigned-1and take integers as arguments. Depending on the word size of the platform (32 vs 64 bits), this optimization is valid for a different set of words.
One time that this comes up is in
v+nwhen called with a byte array and a fixnum. Adding two fixnums gives back something that can be either a fixnum or a bignum, but the form of addition without an overflow check can be used since the result is going to be stored back into a byte array. Storing into a byte array is implemented with
set-alien-unsigned-1, so the optimization applies.
newinstantiates a tuple with default values for all slots, given a tuple class. By default, this is done dynamically: the tuple class does not need to be known before runtime. But usually it is known ahead of time. If it is known ahead of time, then code can be inlined which instantiates the specific tuple that is there.
This was previously implemented as part of the stack checker. I moved it to the propagation pass, which makes the optimization stronger. I plan on moving more transforms like this (for example, the one for
member?) to the propagation pass.
[Update: I did this, adding a utility called
define-partial-eval. Its interface is identical to
define-transform, but it operates on the propagation pass IR. Transformations which don't need to interact with stack checking should use
define-transform, since it creates a stronger optimization.]
Better dead code elimination
I extended dead code elimination in the low-level IR to be more accurate. Now, it realizes that an allocation is dead if it is only written to, and never read from. With this, together with alias analysis, the code
2array first2compiles into a no-op, and no allocation takes place. (This isn't optimized out by tuple unboxing in the high-level IR, because arrays are mutable.)