Well, finally getting back to some (or a) posting after some time. Fortunately or unfortunately depending on how you look at it I was heads down in project work at my last semester at GA Tech. Now, I have a few brief moments between ramping up on a new gig and the moving thing (back to SJ, once again). Figured I would take some time to convert over an overview of a fun project I did along with Shauvik Roy Choudhary and Jared Klumpp, two talented CS students here at Tech and authors of the content below. It was part of Nate Clark’s CS8803-DynamicCompilation course, which I highly recommend for anyone reading this from Tech. The class ROCKED, had quite the cast of characters in there with strong backgrounds, but unfortunately I missed the Tab tasting day, but wtf drinks Tab anyway 😉 Ok, thats a few inside remarks, but for anyone interested in VMs and Tracing read on. The idea is to improve tracing results by fusing loops together prior to any recording. The project was in large part driven by the “Trace-based just-in-time type specialization for dynamic languages” PLDI 09 paper, and thanks out to initial ideas from guys back at Adobe and Mozilla IRC channel.
Trace compilation represents a different paradigm than traditional just-in-time (JIT) compilation techniques. Rather than adhering to principles derived from traditional static compilers such as whole method compilation, trace compilation relies on runtime profiling information, making it unique to the domain of dynamic compilation. With trace compilation program execution starts in the interpreter. While the interpreter executes bytecodes, the VM maintains profiling information to identify frequently executed (hot) loops within the program. In a trace-based approach it is assumed that backward branches represent a control flow transfer from the end of the loop to the beginning of the loop. When such branches becomes hot, i.e., when the execution count crosses a certain threshold, traces are recorded and compiled. The benefits of tracing include the ability to easily inline procedure calls, simplified on-stack replacement, and the compilation of only relevant code areas (cold code is ignored).
Trace compilation was a major component of the JIT compiler provided in the FireFox 3.1 browser . The trace based JIT, TraceMonkey, used trace compilation as a means of asserting type stability within a loop. Unlike method-based dynamic compilers, TraceMonkey operates at the granularity of individual loops. This design choice was based on the expectation that programs spent most of their time in hot loops. Furthermore, we can expect hot loops to be mostly type-stable, meaning that the types of values are invariant. TraceMonkey relies on type stability to generate type specialized code during tracing, obtaining an overall 10x speedup in performance .
The objective of this project, JSS, is to facilitate Trace-Monkey’s use of Trace compilation by increasing the scope of what gets included in a trace. More specifically we look at applying loop-fusion as a means of effectively fusing what would otherwise be separate traces, creating larger traces overall. The details of our implementation are discussed in section 3, followed by the results in section 4. Fusing loops before the tracer adds difficulties because parsing and compilation into a bytecode representation within TraceMonkey occurs in a single pass. These constraints and their potential workarounds are discussed in section 5.
2. Related Work
There has been much significant work in the area of Tracing as of recent. In particular, the benefits of trace over whole method compilation are still being debated and evaluated. Accordingly, the two areas of related work discussed in this section are investigations into whole method compilation and expansions of prior work in trace compilation for dynamic languages.We discuss both below and present several other more loosely related topics.
At the same time there is possible work going forward to improve the usage of trace complication in the context of dynamic languages. Andreas Gal has discussed extending TraceMonkey’s current implementation with parallel dynamic compilation and recompilation of traces, the folding of trees with minimal differences, more precise load propagation within a tree, and improved register allocation .
In this paper, we improve the trace based type specialization in TraceMonkey by performing loop-fusion on the program’s AST. Hence, this augmented technique leverages both static as well as dynamic information of the program, thereby harnessing the benefits of both.
JSS seeks to broaden the granularity at which tracing occurs. Increasing the vision of the tracer through loop-fusion reduces the number of loops, and inherently branches, as well as increases the opportunity for optimizations within the loops. JSS therefore performs loop-fusion during parse time in order to occur before the tracer and avoid adding additional passes, which would decrease overall performance.
The TraceMonkey parser scans and parses one statement at a time. After a statement has been parsed, the generated AST is emitted to a stack-based intermediate representation and parsing begins for the next statement. Whenever loops are emitted to IR, additional instrumentation is inserted to control the tracer and count the number of trips. After the file has been parsed and emitted, the interpreter begins executing at the top of the IR. When the trace instructions have been executed a specified number of times, execution pauses while the tracer consumes the IR and emits native code for that loop. Parsing also performs rudimentary data-flow analysis for constant propagation and folding. JSS uses this information to characterize loops at parse time and identify data-flow dependencies. Only loops within a higher-level statement construct, such as a function, may be fused. This occurs because of the self-imposed restriction that loop-fusion occurs within parse time and characteristic of the parser to emit the IR on a per-statement basis (as opposed to emitting after the entire program has been parsed). First the AST nodes were augmented with an additional data structure, LoopInfo, which characterizes the loop. The characteristics required for fusion are the loops bounds, direction and stride. Identifying these features requires the use of usedef chains created during parsing. Very basic induction variable identification assumes that the induction variable must occur in the condition of the loop, for the proof-of-concept. The induction variable is then traced back to its definition through the use-def chains to identify one of the constant bounds. The other side of the conditional expression is then traced through the use-def information to identify the other constant bound. If either of these bounds results in an expression or global variable, then the array bounds cannot be safely determined and loop-fusion is not possible. Both the loop body and header (consisting of the conditional) are examined to identify the stride. The stride must be a monotonic, constant change, and only basic induction variables are supported (such as d = d +/- C). Dependency identification at parse time proved more challenging, because the ASTs lack features that would greatly ease their identification. Were these features present, they would still be undesirable because they would essentially add several passes of analysis for each loop. Instead JSS identifies dependencies after the loop has been completely parsed, using the aforementioned use-def chains and list of generated variables. Because the variables used by a loop cannot be propagated to the top-level AST node without augmenting every aspect of the parser, the list of all variables in the program serves as the mechanism for identifying dependencies. After loops are parsed the list of variables is scanned and each variable’s use-def set examined for a use or definition occurring within the loop. If this occurs then the loop will be conservatively marked as carrying a dependency on that name, although this could be improved to more aggressively remove dependencies. The code lying between two loops a and b can also create potential dependencies. If the definition of a variable, x1 used in loop b, depends upon a variable x2 used or defined in loop a, then a dependency is formed from b to a. This is the only dependency-creating condition, but additional source transformations must occur for code lying between the two loops:
- The intermediate code contains the use of a variable defined or used in a. In this situation the code can be sunk below b.
- The definition of a variable that is used by b occurs in the intermediate code. In this situation the definition can be
hoisted above a.
The appropriate actions for each of the above two situations occurs for all intermediate code between a and b. Any remaining code carries no dependence on the loops and can arbitrarily be sunk or hoisted depending upon other data-flow dependencies. The final dependency causing condition occurs because of global variables and function calls. Loops are dependent upon any global variable defined or accessed within the loop, and if the use or definition of a global variable occurs between the two loops then they cannot be merged, because the effects of globals are difficult to determine. This is discussed in further detail in section 5, Problems. Function calls cause dependencies from the loop to all global variables, because the effects of a call cannot be determined at parse time. Two loops a and b are then considered for fusing after the identification of loop-characteristics and dependencies. First the dependencies are checked to ensure that there is no dependency from b on a and that the intermediate code between a and b does not carry a use of a dependency from a in the same statement as the definition or use of a dependency from b. If the two loops pass the dependency analysis then they can be transformed and merged. The merging process first creates a pre-header for both loops and moves the initialization variables (for for-loops) into this pre-header. The update-component is then appended to the end of the loop body for for-loops. Then, the trip count for both loops is calculated. If both loops execute the same number of times then the bodies are fused and the condition for the fused loop comes from the first loop. Otherwise, the loop with the smaller trip count has a guard with its current condition inserted around the body, the two bodies are fused, and the condition for the fused loop is created using the condition condition from the larger loop. This is clearly an opportunity for loop-peeling and is discussed in section 6, Future Work.
TraceVis1 was used to profile TraceMonkey and determine how much time the tracer was spending in each phase of execution. TraceMonkey emits timing events to a log events whenever the compiler phase changes, which the TraceVis tool can later analyze. The post-execution analysis reduces the overhead and provides more accurate timing results. The two TraceVis tools used were Vis, a visualization creating tool, and binlog, a text-based analysis tool that was used to quantify times spent in each phase. These tools were used to analyze the performance of several benchmarks before and after application of JSS.
The results of a specific test case we designed for loopfusion and two benchmark programs from sunspider  are shown in table 1.
The first program, which we designed, tests a best-case situation where two loops are immediately next to each other, increment in the same direction, but have different bounds. The second program, modified from sunspider to allow for loop-fusion, measures binary tree access times. The third program, access-nbody.js, was the only SunSpider test that allowed for loop-fusion. The bound for access-nbody’s fusible loops, although the same was variable rather than constant. JSS was modified to account for variable bounds. Each benchmark was repeated 100 times and the average speed found. Figure 2 shows the amount of time spent in each phase of TraceMonkey, with different colors representing the different phases. The colors have been shown in the key in Figure 2(g).
Applying our optimization should result in the reduction of the overhead, which would mean less total time spent monitoring, recording, compiling and setting up and tearing down. The first and third images show the timing without the JSS optimization while the second and fourth show the timing after applying JSS, respectively. The second graph (Fusion1-Post) shows a definite reduction in the amount of time spent monitoring and recording traces, as well as fewer number of traces, indicating the application of JSS.
The results of applying JSS to the first benchmark is clearly visible, with a greatly reduced amount of time spent in the overhead. The results from table 1 help to more clearly quantify the change, and show that the time spent monitoring and recording traces was significantly lower after JSS than before. This reduction made up for the additional time spent in compilation because of the larger amount of code being compiled. The estimated speed up increased by 0.01, indicating improved performance. However, the additional guard added to prevent the loop with the smaller trip count from executing too often caused more time to be spent in native execution and caused the benchmark to execute more slowly than pre-JSS. Future optimizations, such as loop-peeling, could remove the need for a guard and increase the performance of JSS.
The second benchmark ran over a larger period of time, increasing the benefit of JSS. The time spent monitoring traces was reduced, and the overall native execution time was much lower. Discrepancies between this benchmark and the previous could be due to better loop-bounds alignment that removed the need for a guard to prevent the fused loop from executing too often.
The third benchmark shows the greatest improvement due to JSS. The time spent in the interpretation, monitoring, recording and compilation phases was greatly reduced and the code spent more time running natively. This third benchmark highlights the benefits of JSS, benefits that could be realized in other less-optimal situations with additional optimizations.
Side exits were ignored in JSS to show that loop-fusion was a viable optimization for increasing performance. Detecting side exits would not be difficult, but would require traversing an AST after parsing completion or carrying a flag up the tree to show that a break or continue statement occurs. Traversing the AST after creation is undesirable because it would add extra processing time by adding another analysis pass. Adding flags to the tree would require augmenting the parser in many locations and adding additional information to the AST node objects.
Global variable dependencies are not particular to Java-Script, but their difficulties are certainly more pronounced than other languages. Because variables default to global, they abound in real-world scripts. The global variables then cause unnecessary dependencies because the dependencies cannot be easily determined statically. They also cause problems because any loop calling a function causes a dependency from that loop to every global variable. Methods for removing the pressure from global variables could include copying the global variable in the pre-header and using the copied name inside the loop. Analyzing called functions would reduce the number of dependencies, but would add analysis time. As shown in the example in Figure 3, both variables a and b are globals as their declaration is not preceded by the keyword var. A naive analysis would wrongly try to fuse the two for loops that are declared on lines 3 and 7. However, function bar which is called at line 4 of the first loop, changes the global variable b making the loops infusible. Hence, a more thorough inter-procedural analysis is required to handle global variables.
6. Future Work
Future work would improve the safety of JSS and add other loop-related optimizations that could increase the performance of the resulting code. Currently the induction variable is identified based on the variable used in the loop condition, but true induction variable analysis would lead to a more accurate optimization. Dependency analysis, loop-peeling, and induction variable normalization would all aid in improving the performance and number of opportunities available to JSS.
Current dependency analysis marks any use or definition of a used variable as a dependency. The number of dependencies could be reduced by analyzing variable uses and reordering fused code to remove unnecessary dependencies. For example, if the same array, A, is used in both loops, the access variables can be analyzed to determine if it is a true dependence. This analysis could be performed using a standard combination of tests, such as GCD, Banerjee Inequalities, or I-Test. These tests have the undesirable quality of slowness, but a balance between analysis time and conservatism could possibly be found.
Loop-peeling could also increase the performance of JSS by removing the need for guard checks when the trip counts of the two loops are different. The guard checks add several unnecessary branch instructions to ensure the side effects after fusion remain the same as before. Although loop-peeling would reduce the amount of native code, because it would not occur within the trace, it would reduce the number of checks and branches occurring within the native code.
Finally, induction variable normalization would greatly improve the number of loops that could be fused and increase opportunity for additional optimizations. Normalization would allow loops that run in opposite directions to be fused, increasing the number of loops that could be fused by JSS. It would also increase opportunities for additional optimizations, such as induction variable elimination and strength reduction.
 A. Gal. Tracing the web. URL http://andreasgal.wordpress.com/2008/08/22/tracing-the-web/. Accessed on 06 May 2010.
 A. Gal, B. Eich, M. Shaver, D. Anderson, D. Mandelin, M. R. Haghighat, B. Kaplan, G. Hoare, B. Zbarsky, J. Orendorff, J. Ruderman, E. W. Smith, R. Reitmaier, M. Bebenita, M. Chang, and M. Franz. Trace-based just-in-time type specialization for dynamic languages. In PLDI ’09: Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, pages 465–478, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-392-1. doi:http://doi.acm.org/10.1145/1542476.1542528.
 K. Kennedy and K. S. McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, pages 301–320, London, UK, 1994. Springer-Verlag. ISBN 3-540-57659-2.
 M. Wiki. J¨agermonkey – method based jit engine, . URL https://wiki.mozilla.org/JaegerMonkey. Accessed on 06 May 2010.