Fusible Traces

Well, finally getting back to some (or a) posting after some time. Fortunately or unfortunately, depending on how you look at it, I was heads down in project work during my last semester at GA Tech. Now I have a few brief moments between ramping up on a new gig and the moving thing (back to SJ, once again). Figured I would take some time to convert over an overview of a fun project I did along with Shauvik Roy Choudhary and Jared Klumpp, two talented CS students here at Tech and authors of the content below. It was part of Nate Clark's CS8803 Dynamic Compilation course, which I highly recommend for anyone reading this from Tech. The class ROCKED, had quite the cast of characters in there with strong backgrounds; unfortunately I missed the Tab tasting day, but wtf drinks Tab anyway 😉 Ok, that's a few inside remarks, but for anyone interested in VMs and tracing, read on. The idea is to improve tracing results by fusing loops together prior to any recording. The project was in large part driven by the "Trace-based just-in-time type specialization for dynamic languages" PLDI '09 paper, and thanks go out to the folks from Adobe and the Mozilla IRC channel for the initial ideas.

1. Introduction

Trace compilation represents a different paradigm than traditional just-in-time (JIT) compilation techniques. Rather than adhering to principles derived from traditional static compilers, such as whole-method compilation, trace compilation relies on runtime profiling information, making it unique to the domain of dynamic compilation. With trace compilation, program execution starts in the interpreter. While the interpreter executes bytecodes, the VM maintains profiling information to identify frequently executed (hot) loops within the program. In a trace-based approach it is assumed that backward branches represent a control flow transfer from the end of a loop to its beginning. When such branches become hot, i.e., when their execution count crosses a certain threshold, traces are recorded and compiled. The benefits of tracing include the ability to easily inline procedure calls, simplified on-stack replacement, and the compilation of only relevant code areas (cold code is ignored).
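As a rough illustration of the hot-loop detection described above, the following sketch counts backward-branch executions and selects a loop for trace recording once a threshold is crossed. This is illustrative only; the threshold value, names, and bookkeeping here are our assumptions, not TraceMonkey's actual implementation.

```javascript
// Illustrative sketch of hot-loop detection (not TraceMonkey's code).
const HOT_THRESHOLD = 2;          // real VMs use a tuned threshold
const tripCounts = new Map();     // backward-branch counters, per loop
const traced = [];                // loops handed to the trace recorder

function onBackwardBranch(loopId) {
  const n = (tripCounts.get(loopId) || 0) + 1;
  tripCounts.set(loopId, n);
  if (n === HOT_THRESHOLD) traced.push(loopId); // start recording a trace
}

// Simulate a program where loop "L1" iterates 5 times and "L2" once:
for (let i = 0; i < 5; i++) onBackwardBranch("L1");
onBackwardBranch("L2");

console.log(traced); // only the hot loop "L1" was selected
```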

Trace compilation was a major component of the JIT compiler provided in the Firefox 3.1 browser [2]. The trace-based JIT, TraceMonkey, used trace compilation as a means of asserting type stability within a loop. Unlike method-based dynamic compilers, TraceMonkey operates at the granularity of individual loops. This design choice was based on the expectation that programs spend most of their time in hot loops. Furthermore, hot loops can be expected to be mostly type-stable, meaning that the types of values are invariant. TraceMonkey relies on type stability to generate type-specialized code during tracing, obtaining an overall 10x speedup in performance [4].
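To make type stability concrete, here is a small example of our own (not from the paper): the first loop keeps its values integer-typed on every iteration, so an integer-specialized trace stays valid; the second changes a value's type mid-stream, which would invalidate such a trace and force a side exit.

```javascript
// Type-stable: `sum` and `i` remain integers on every iteration, so a
// trace specialized to integer arithmetic can run the whole loop.
function stableSum(n) {
  let sum = 0;
  for (let i = 0; i < n; i++) {
    sum += i;
  }
  return sum;
}

// Type-unstable: on the last iteration `sum` becomes a string, so an
// integer-specialized trace no longer applies and must side-exit.
function unstableSum(n) {
  let sum = 0;
  for (let i = 0; i < n; i++) {
    sum = (i === n - 1) ? String(sum + i) : sum + i;
  }
  return sum;
}

console.log(stableSum(5));   // 10
console.log(unstableSum(5)); // "10"
```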

The objective of this project, JSS, is to facilitate TraceMonkey's use of trace compilation by increasing the scope of what gets included in a trace. More specifically, we look at applying loop-fusion as a means of effectively fusing what would otherwise be separate traces, creating larger traces overall. The details of our implementation are discussed in section 3, followed by the results in section 4. Fusing loops before the tracer adds difficulties because parsing and compilation into a bytecode representation within TraceMonkey occur in a single pass. These constraints and their potential workarounds are discussed in section 5.

2. Related Work

There has been significant recent work in the area of tracing. In particular, the benefits of trace compilation over whole-method compilation are still being debated and evaluated. Accordingly, the two areas of related work discussed in this section are investigations into whole-method compilation and expansions of prior work in trace compilation for dynamic languages. We discuss both below and present several other more loosely related topics.

As mentioned above, the debate of whole-method vs. trace compilation is still ongoing. Much of the debate has been spurred by the fact that Google's V8 [5] VM utilizes a whole-method compilation technique, as does a new research project at Mozilla, JägerMonkey [9]. Google's V8 performs whole-method compilation prior to execution in order to gain native-execution performance. The major benefit of performing whole-method compilation as opposed to trace-based compilation is the increase in granularity for optimization, which allows for loop optimizations in the native code. In addition to basic JIT compilation, V8 performs additional optimizations on JavaScript. For instance, to improve the speed of property access, V8 uses hidden classes in the background to establish commonalities between objects. This allows traditional class-based optimizations, such as inline caching, to be applied to JavaScript objects.
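A small illustrative example of our own (not taken from the V8 documentation) of the object-shape commonality that hidden classes exploit: objects built with the same property layout can share a class, keeping a property-access site monomorphic, while a differently shaped object makes the same site polymorphic.

```javascript
// Objects constructed with identical property layouts can share a
// hidden class, so an access like p.x becomes a fixed-offset load
// once the inline cache warms up.
function Point(x, y) {
  this.x = x;   // each assignment transitions the hidden class:
  this.y = y;   // {} -> {x} -> {x, y}
}

const a = new Point(1, 2);  // a and b follow the same transition path
const b = new Point(3, 4);

// A different property order produces a different shape, which can
// make the access site polymorphic and defeat the inline cache.
const c = { y: 6, x: 5 };

function sumX(points) {
  let total = 0;
  for (const p of points) total += p.x; // monomorphic for [a, b] only
  return total;
}

console.log(sumX([a, b]));    // 4
console.log(sumX([a, b, c])); // 9
```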

At the same time, work is ongoing to improve the use of trace compilation in the context of dynamic languages. Andreas Gal has discussed extending TraceMonkey's current implementation with parallel dynamic compilation and recompilation of traces, the folding of trees with minimal differences, more precise load propagation within a tree, and improved register allocation [3].

Other related efforts include the more general use of loop-fusion and type specialization. Static compilers use loop-fusion to improve data locality and reduce the number of branches and the loop overhead when loops have similar properties [7]. Loop-fusion also brings opportunities for parallelism and later optimizations. These additional optimizations could potentially increase the performance of the code, but come at the cost of complexity in analysis and restructuring. However, the run-time goals of JavaScript preclude overly complex optimizations that may not pay off in execution. There has also been work in type specialization using pure static analysis. TAJS [6] uses abstract interpretation to infer sound types that can be used to detect common programming errors and assist program comprehension. Similarly, RATS [8] is focused on aggressive type specialization that finds accurate information about each variable's interval (range), kind (integer, float, or NaN), and value relationships. In their study, the authors report considerable improvements on numerical benchmarks.
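A minimal before/after sketch of classic loop-fusion (function names are illustrative): two adjacent loops with identical bounds and stride become one loop with a single induction variable and half the branches.

```javascript
// Before fusion: two separate loops, two sets of loop branches.
function separate(n) {
  const a = new Array(n), b = new Array(n);
  for (let i = 0; i < n; i++) a[i] = i * 2;
  for (let i = 0; i < n; i++) b[i] = i * 3;
  return [a, b];
}

// After fusion: one loop, one induction variable, fewer branches, and
// better locality if a and b are traversed together later.
function fused(n) {
  const a = new Array(n), b = new Array(n);
  for (let i = 0; i < n; i++) {
    a[i] = i * 2;
    b[i] = i * 3;
  }
  return [a, b];
}
```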

In this paper, we improve the trace based type specialization in TraceMonkey by performing loop-fusion on the program’s AST. Hence, this augmented technique leverages both static as well as dynamic information of the program, thereby harnessing the benefits of both.

3. Implementation

JSS seeks to broaden the granularity at which tracing occurs. Broadening the tracer's view through loop-fusion reduces the number of loops, and inherently the number of branches, while increasing the opportunity for optimizations within the loops. JSS therefore performs loop-fusion at parse time, so that it occurs before the tracer runs and avoids adding additional passes, which would decrease overall performance.

Figure 1: Tool implementation within SpiderMonkey

The TraceMonkey parser scans and parses one statement at a time. After a statement has been parsed, the generated AST is emitted to a stack-based intermediate representation and parsing begins for the next statement. Whenever loops are emitted to the IR, additional instrumentation is inserted to control the tracer and count the number of trips. After the file has been parsed and emitted, the interpreter begins executing at the top of the IR. When the trace instructions have been executed a specified number of times, execution pauses while the tracer consumes the IR and emits native code for that loop. Parsing also performs rudimentary data-flow analysis for constant propagation and folding; JSS uses this information to characterize loops at parse time and identify data-flow dependencies.

Only loops within a higher-level statement construct, such as a function, may be fused. This follows from the self-imposed restriction that loop-fusion occurs at parse time and from the parser's characteristic of emitting the IR on a per-statement basis (as opposed to emitting after the entire program has been parsed).

First, the AST nodes were augmented with an additional data structure, LoopInfo, which characterizes the loop. The characteristics required for fusion are the loop's bounds, direction, and stride. Identifying these features requires the use-def chains created during parsing. For this proof of concept, induction variable identification is very basic and assumes that the induction variable must occur in the condition of the loop. The induction variable is then traced back to its definition through the use-def chains to identify one of the constant bounds. The other side of the conditional expression is then traced through the use-def information to identify the other constant bound. If either of these bounds resolves to an expression or a global variable, then the loop bounds cannot be safely determined and loop-fusion is not possible.

Both the loop body and header (consisting of the conditional) are examined to identify the stride. The stride must be a monotonic, constant change, and only basic induction variables are supported (such as d = d +/- C).

Dependency identification at parse time proved more challenging, because the ASTs lack features that would greatly ease it. Even were these features present, they would be undesirable, because they would essentially add several passes of analysis for each loop. Instead, JSS identifies dependencies after a loop has been completely parsed, using the aforementioned use-def chains and the list of generated variables. Because the variables used by a loop cannot be propagated to the top-level AST node without augmenting every aspect of the parser, the list of all variables in the program serves as the mechanism for identifying dependencies. After loops are parsed, the list of variables is scanned and each variable's use-def set is examined for a use or definition occurring within the loop. If one occurs, the loop is conservatively marked as carrying a dependency on that name, although this could be improved to more aggressively remove dependencies.

The code lying between two loops a and b can also create potential dependencies. If the definition of a variable x1 used in loop b depends upon a variable x2 used or defined in loop a, then a dependency is formed from b to a. This is the only dependency-creating condition, but additional source transformations must occur for code lying between the two loops:

  1. The intermediate code contains the use of a variable defined or used in a. In this situation the code can be sunk below b.
  2. The intermediate code contains the definition of a variable that is used by b. In this situation the definition can be hoisted above a.
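The two transformations above can be sketched as follows (all names are hypothetical): hoisting the definition and sinking the use leaves the two loops adjacent and ready for fusion.

```javascript
// Original shape: loop a, intermediate statements, loop b.
function original(n) {
  let sumA = 0;
  for (let i = 0; i < n; i++) sumA += i;        // loop a
  const report = 'a=' + sumA;                   // uses a's result -> sink below b
  const scale = 2;                              // defines value used by b -> hoist above a
  let sumB = 0;
  for (let i = 0; i < n; i++) sumB += i * scale; // loop b
  return [report, sumB];
}

// After hoisting `scale` and sinking `report`, the loops are adjacent
// and can be fused into a single body.
function transformed(n) {
  const scale = 2;                              // hoisted
  let sumA = 0, sumB = 0;
  for (let i = 0; i < n; i++) {                 // fused a + b
    sumA += i;
    sumB += i * scale;
  }
  const report = 'a=' + sumA;                   // sunk
  return [report, sumB];
}
```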

The appropriate action for each of the above two situations is taken for all intermediate code between a and b. Any remaining code carries no dependence on the loops and can arbitrarily be sunk or hoisted depending upon other data-flow dependencies.

The final dependency-causing condition arises from global variables and function calls. Loops are dependent upon any global variable defined or accessed within the loop, and if the use or definition of a global variable occurs between the two loops then they cannot be merged, because the effects of globals are difficult to determine. This is discussed in further detail in section 5, Problems. Function calls cause dependencies from the loop to all global variables, because the effects of a call cannot be determined at parse time.

Two loops a and b are then considered for fusing after the identification of loop characteristics and dependencies. First, the dependencies are checked to ensure that there is no dependency from b on a, and that the intermediate code between a and b does not carry a use of a dependency from a in the same statement as the definition or use of a dependency from b. If the two loops pass the dependency analysis then they can be transformed and merged. The merging process first creates a pre-header for both loops and moves the initialization statements (for for-loops) into this pre-header. The update component is then appended to the end of the loop body for for-loops. Then the trip count for both loops is calculated. If both loops execute the same number of times, the bodies are fused and the condition for the fused loop comes from the first loop. Otherwise, the loop with the smaller trip count has a guard with its current condition inserted around its body, the two bodies are fused, and the condition for the fused loop is taken from the larger loop. This is clearly an opportunity for loop-peeling and is discussed in section 6, Future Work.
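The merge for unequal trip counts can be sketched as follows. This is a hand-written illustration of the transformation, not JSS output: the smaller loop's body is wrapped in a guard, and the fused loop runs under the larger loop's condition.

```javascript
// Before fusion: two loops with different trip counts (3 and 5).
function beforeFusion() {
  const a = [], b = [];
  for (let i = 0; i < 3; i++) a.push(i);   // trip count 3
  for (let j = 0; j < 5; j++) b.push(j);   // trip count 5
  return [a, b];
}

// After fusion, following the merging process described above.
function afterFusion() {
  const a = [], b = [];
  let i = 0, j = 0;              // pre-header: initializations moved here
  while (j < 5) {                // condition taken from the larger loop
    if (i < 3) {                 // guard around the smaller loop's body
      a.push(i);
      i++;                       // update appended to the body
    }
    b.push(j);
    j++;
  }
  return [a, b];
}
```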

TraceVis was used to profile TraceMonkey and determine how much time the tracer was spending in each phase of execution. TraceMonkey emits timing events to a log whenever the compiler phase changes, which the TraceVis tool can later analyze. The post-execution analysis reduces the overhead and provides more accurate timing results. The two TraceVis tools used were Vis, a visualization tool, and binlog, a text-based analysis tool used to quantify the time spent in each phase. These tools were used to analyze the performance of several benchmarks before and after the application of JSS.

4. Results

The results for a specific test case we designed for loop-fusion and two benchmark programs from SunSpider [1] are shown in Table 1.

Table 1: Time spent in each JS execution and tracing phase

The first program, which we designed, tests a best-case situation where two loops are immediately next to each other and increment in the same direction, but have different bounds. The second program, modified from SunSpider to allow for loop-fusion, measures binary tree access times. The third program, access-nbody.js, was the only SunSpider test that allowed for loop-fusion; the bound for its fusible loops, although identical for both loops, was a variable rather than a constant, and JSS was modified to account for variable bounds. Each benchmark was repeated 100 times and the average speed recorded. Figure 2 shows the amount of time spent in each phase of TraceMonkey, with different colors representing the different phases; the colors are shown in the key in Figure 2(g).

Figure 2: Visualization of JS execution and tracing phases

The top row of each graph is read top to bottom, left to right. The middle section shows the expected speedup of native execution over interpretation, while the bottom row shows the proportion of time spent in each phase of the JavaScript engine.

Applying our optimization should reduce the overhead, meaning less total time spent monitoring, recording, compiling, and setting up and tearing down. The first and third images show the timing without the JSS optimization, while the second and fourth show the timing after applying JSS, respectively. The second graph (Fusion1-Post) shows a definite reduction in the amount of time spent monitoring and recording traces, as well as fewer traces, indicating the application of JSS.

The results of applying JSS to the first benchmark are clearly visible, with a greatly reduced amount of time spent in the overhead. The results from Table 1 help to quantify the change more clearly, and show that the time spent monitoring and recording traces was significantly lower after JSS than before. This reduction made up for the additional time spent in compilation due to the larger amount of code being compiled. The estimated speedup increased by 0.01, indicating improved performance. However, the additional guard added to prevent the loop with the smaller trip count from executing too often caused more time to be spent in native execution and caused the benchmark to execute more slowly than pre-JSS. Future optimizations, such as loop-peeling, could remove the need for a guard and increase the performance of JSS.

The second benchmark ran over a longer period of time, increasing the benefit of JSS. The time spent monitoring traces was reduced, and the overall native execution time was much lower. Discrepancies between this benchmark and the previous one could be due to better loop-bound alignment that removed the need for a guard to prevent the fused loop from executing too often.

The third benchmark shows the greatest improvement due to JSS. The time spent in the interpretation, monitoring, recording and compilation phases was greatly reduced and the code spent more time running natively. This third benchmark highlights the benefits of JSS, benefits that could be realized in other less-optimal situations with additional optimizations.

5. Problems

The goal of JSS is to show that increasing the scope of code the tracer can optimize leads to performance benefits; demonstrating this required ignoring many problematic situations, the most important of which is side exits occurring in loops. There are also many difficulties intrinsic to JavaScript that are beyond the scope of a static analyzer. In particular, global variables, with statements, and eval expressions are dynamic language features that fall outside its reach.

Side exits were ignored in JSS in order to show that loop-fusion is a viable optimization for increasing performance. Detecting side exits would not be difficult, but it would require either traversing the AST after parsing completes or carrying a flag up the tree to show that a break or continue statement occurs. Traversing the AST after creation is undesirable because it would add extra processing time in the form of another analysis pass. Adding flags to the tree would require augmenting the parser in many locations and adding additional information to the AST node objects.

Global variable dependencies are not particular to JavaScript, but their difficulties are certainly more pronounced here than in other languages. Because variables default to global, they abound in real-world scripts. Global variables then cause unnecessary dependencies, because those dependencies cannot easily be determined statically. They also cause problems because any loop calling a function incurs a dependency from that loop to every global variable. Methods for relieving the pressure from global variables could include copying the global variable in the pre-header and using the copied name inside the loop. Analyzing called functions would reduce the number of dependencies, but would add analysis time. As shown in the example in Figure 3, both variables a and b are globals, as their declarations are not preceded by the keyword var. A naive analysis would wrongly try to fuse the two for loops declared on lines 3 and 7. However, the function bar, called at line 4 within the first loop, changes the global variable b, making the loops infusible. Hence, a more thorough inter-procedural analysis is required to handle global variables.

Figure 3: Global variables in JavaScript

With statements cause difficulties in dependency analysis because they change the namespace of code. Variables used within a with block could be referencing properties of the parameter passed to the with statement, or variables defined in the local or global scopes. The properties of a JavaScript object are determined at runtime, which means that the data-flow analysis techniques used by JSS fail to determine the correct definition of variables used within the with block. As shown in Figure 4, the function myObj is called to create an object with two properties (i and j), which is assigned to the variable obj. The with construct on line 4 makes the properties of the object obj directly accessible to the code within the block it defines. The two while loops on lines 5 and 9 use the properties i and j from this object as their induction variables (instead of obj.i and obj.j). Even though these loops could be fused, a simple use-def analysis cannot find the definitions of these variables.

Figure 4: Use of with construct in JavaScript.

Another major difficulty arises with the use of eval expressions. JSS treats these expressions optimistically in order to show the viability of the optimization; in reality, however, eval kills the opportunity for loop-fusion. The major problem is that variables can be defined (or deleted!) within eval statements, and the side effects of eval cannot be determined until runtime. If an eval statement occurs between the definition of an induction variable and its uses, then the bounds of the induction variable cannot be correctly identified. As shown in Figure 5, the eval statement on line 4 introduces a dependency between the loops, so they cannot be fused. Analyzing the effects of an eval statement requires precise string analysis and is out of the scope of this paper; the reader can refer to [6] for a static string analysis algorithm for JavaScript.

Figure 5: Use of eval function in JavaScript.
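The original figure listing is not reproduced here; the following is a sketch consistent with the description above: an eval between the loops may redefine anything, including the second loop's bound, so the loops cannot be fused.

```javascript
var n = 5;
var a = 0, b = 0;
for (var i = 0; i < n; i++) a += i;  // first loop: bound n is 5 here
eval("n = 3;");                      // side effects unknown until runtime
for (var j = 0; j < n; j++) b += j;  // second loop: bound n is now 3
```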

A final problem we encountered was finding appropriate benchmarks that would benefit from loop-fusion. Only the access-nbody test from the SunSpider benchmark suite had loops that qualified for fusion under our criteria. The limited applicability of loop-fusion could very well be indicative of the kinds of programs developed or generated in JavaScript; however, we have not yet done or seen research to back such an assertion.

6. Future Work

Future work would improve the safety of JSS and add other loop-related optimizations that could increase the performance of the resulting code. Currently the induction variable is identified based on the variable used in the loop condition, but true induction variable analysis would lead to a more accurate optimization. Dependency analysis, loop-peeling, and induction variable normalization would all aid in improving the performance and number of opportunities available to JSS.

The current dependency analysis marks any use or definition of a used variable as a dependency. The number of dependencies could be reduced by analyzing variable uses and reordering fused code to remove unnecessary dependencies. For example, if the same array A is used in both loops, the access variables can be analyzed to determine whether a true dependence exists. This analysis could be performed using a standard combination of tests, such as the GCD test, the Banerjee inequalities, or the I-Test. These tests have the undesirable quality of being slow, but a balance between analysis time and conservatism could possibly be found.
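As a hedged sketch of the GCD test mentioned above (our own illustration): for a write A[a*i + b] and a read A[c*j + d], a cross-loop dependence requires integer solutions to a*i - c*j = d - b, which can exist only when gcd(a, c) divides (d - b). The test is a necessary condition only, so failing to disprove dependence does not prove one exists.

```javascript
// Euclid's algorithm for the greatest common divisor.
function gcd(x, y) {
  x = Math.abs(x); y = Math.abs(y);
  while (y !== 0) [x, y] = [y, x % y];
  return x;
}

// GCD dependence test for accesses A[a*i + b] and A[c*j + d].
// Returns false only when the test *proves* independence.
function mayDepend(a, b, c, d) {
  return (d - b) % gcd(a, c) === 0;
}

console.log(mayDepend(2, 0, 2, 1)); // false: A[2i] vs A[2j+1] never overlap
console.log(mayDepend(1, 0, 1, 1)); // true: A[i] vs A[j+1] may overlap
```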

Loop-peeling could also increase the performance of JSS by removing the need for guard checks when the trip counts of the two loops differ. The guard checks add several otherwise unnecessary branch instructions to ensure that the side effects after fusion remain the same as before. Although the peeled iterations would not run as native code, because they would not occur within the trace, loop-peeling would reduce the number of checks and branches occurring within the native code.

Finally, induction variable normalization would greatly improve the number of loops that could be fused and increase opportunity for additional optimizations. Normalization would allow loops that run in opposite directions to be fused, increasing the number of loops that could be fused by JSS. It would also increase opportunities for additional optimizations, such as induction variable elimination and strength reduction.
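The normalization above can be sketched as follows (function names are hypothetical): a downward-counting loop is rewritten to count upward, recovering the original index from the new induction variable, so its direction and stride match a potential fusion partner.

```javascript
// Original: downward induction variable, stride -1.
function downward(n) {
  const out = [];
  for (let i = n - 1; i >= 0; i--) out.push(i * i);
  return out;
}

// Normalized: same iterations in the same order, but with an upward
// induction variable k; the original index is recovered as n - 1 - k.
function normalized(n) {
  const out = [];
  for (let k = 0; k < n; k++) {
    const i = n - 1 - k;
    out.push(i * i);
  }
  return out;
}
```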

7. Conclusion

The JavaScript language proves difficult to optimize because of its dynamic nature and wide variety of constructs. TraceMonkey [4, 10] and V8 showed JIT compilation to be an effective method of increasing performance, but the growing importance of this browser-focused language demands further performance increases. JSS attempted to increase performance by combining the larger granularity of JITed code associated with V8 with the tracing used by TraceMonkey. JSS increased the amount of code available for optimization by performing parse-time loop-fusion. Increasing loop size and decreasing the number of loops means there are fewer calls to the JIT compiler and thus less overhead. JSS showed the viability of this optimization, but ignored some side issues due to outside constraints; these will be remedied in future work.

8. References

[1] WebKit. SunSpider JavaScript benchmarks. URL http://www2.webkit.org/perf/sunspider-0.9/sunspider.html.

[2] D. Anderson. TraceMonkey and type instability in JavaScript. URL http://www.bailopan.net/blog/?p=112. Accessed on 06 May 2010.

[3] A. Gal. Tracing the web. URL http://andreasgal.wordpress.com/2008/08/22/tracing-the-web/. Accessed on 06 May 2010.

[4] A. Gal, B. Eich, M. Shaver, D. Anderson, D. Mandelin, M. R. Haghighat, B. Kaplan, G. Hoare, B. Zbarsky, J. Orendorff, J. Ruderman, E. W. Smith, R. Reitmaier, M. Bebenita, M. Chang, and M. Franz. Trace-based just-in-time type specialization for dynamic languages. In PLDI ’09: Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, pages 465–478, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-392-1. doi:http://doi.acm.org/10.1145/1542476.1542528.

[5] Google. V8 JavaScript engine. URL http://code.google.com/p/v8/. Accessed on 06 May 2010.

[6] S. H. Jensen, A. Møller, and P. Thiemann. Type analysis for JavaScript. In Proc. 16th International Static Analysis Symposium, SAS ’09, volume 5673 of LNCS. Springer-Verlag, August 2009.

[7] K. Kennedy and K. S. McKinley. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, pages 301–320, London, UK, 1994. Springer-Verlag. ISBN 3-540-57659-2.

[8] F. Logozzo and H. Venter. Rapid atomic type analysis by abstract interpretation: Application to JavaScript optimization. In International Conference on Compiler Construction. Springer-Verlag, March 2010.

[9] Mozilla Wiki. JägerMonkey – method-based JIT engine. URL https://wiki.mozilla.org/JaegerMonkey. Accessed on 06 May 2010.

[10] Mozilla Wiki. TraceMonkey – JavaScript tracing engine. URL https://wiki.mozilla.org/TraceMonkey. Accessed on 06 May 2010.

