PoiNtEr->: Understanding Run-time Environment Representation and Control

In Algol-like languages like Pascal, Algol W, and C/C++, all of the environments that exist at any point during a computation can be collectively represented using a stack. This representation for environments is particularly advantageous because the environment stack can be implemented as an elaboration of the control stack included in the architecture of a modern computer to support procedure calls.

Algol-like languages are almost always compiled to machine code instead of interpreted like LC and Jam. Nevertheless, during program execution, the compiled machine code must perform all of the same operations on program data structures as interpreted code. The major difference between the two approaches is that a compiler typically has the opportunity to perform far more program analysis enabling it to precompute quantities that are computed by an interpreter during program execution.

Almost all modern machines provide a control stack to store the return addresses of procedure calls. In addition, other context information (such as the contents of registers that must be restored by the called procedure) is typically saved with the return address in a frame on the control stack. To return, the called procedure pops the current ("top") frame off the stack, restores the saved context information (typically the contents of registers that by convention must be preserved across procedure calls) and jumps to the specified return address. The popping of the "top" stack frame off the stack restores the stack to the form it had before the call (although some of the bindings for local variables stored in the stack may have changed).

Some machines also pass argument values to procedures in the stack. The argument values are pushed on the stack along with the return address and the saved context information. Another common convention is to pass arguments (assuming the number is small) in registers. The result returned by a procedure is typically returned in a register because the stack frame associated with the call is deallocated on return. (Another possible convention is to store the return value in a designated location in the calling stack frame.)

Now let us relate the stack representation of environments to our understanding of the evaluation of programming languages with lexical scoping, i.e., the (recursive) let and lambda constructs. We have discussed lexical scope in the context of mostly functional languages based on the lambda-calculus, but exactly the same lexical constructs are present in Algol-like languages. In Algol-like language, a rec-let is called a block and a lambda (which must occur as a rhs of a definition introduced in a block [rec-let definition]) is called a procedure declaration.

In a stack-based implementation of a lexically-scoped language, a new environment is constructed (the extend-env process in our LC interpreter) to evaluate the body of a let or a lambda application by allocating a new frame called an activation record on the control stack. The activation record contains:

* the new variable bindings introduced by the let or lambda,

* a pointer called the static link pointing back to the rest of the environment (a linked list of activation records),

* a pointer called the dynamic link to the preceding activation record,

* the return address (address of the next instruction in the code block that invoked the let or lambda application, and

* any register values that need to be saved for restoration on return from the let or lambda.

In this representation, an environment consists of a linked list of activation records; the link field connecting this list static link in each record. The first record in the sequence gives the local bindings (static distance 0), the second record gives the bindings at static distance 1, and so forth. The length of this list is simply the lexical nesting level of the body of the let form or lambda application being evaluated.

For let invocations (regardless of whether let is recursive)

(let ([x1 e1] ... [xn en]) E)

and raw lambda applications

((lambda (x1 ... xn) E) e1 ... en),

the static link and dynamic link in the new activation record both point to the same place, namely the preceding activation record on the stack (the activation record for the enclosing let form or lambda application.

For a function application

(f e1 ... en)

the static link in the new activation points to the activation record in the static chain corresponding the static distance between the application site and the definition of f. For a simple recursive function call (e.g., the recursive call in the usual definition of factorial), this static link is identical to the static link in the calling activation record (the preceding activation record on the stack).

In Algol-like languages, the only closures are functions that are passed as arguments to other functions. These closures are represented by a pair of pointers. One pointer points to a representation of the the function consisting of a record with the code for the function (or a pointer it) and a template describing the format of its activation record. The other pointer is the static link to be stored in the activation record for the closure when it is invoked (the saved environment for the closure).

In Algol-like languages, closures can only be passed as function arguments. As a result, the activation record identified by the static link stored in the closure is always available on the stack when the closure in invoked. When that activation is finally freed, the corresponding closure is guaranteed to be inaccessible. (Since C/C++ has no nested blocks or procedure definitions, closures degenerate to function pointers.)

The stack representation of environments breaks if a closure can be invoked after the activation record to which it points (through its static link) is deallocated ("popped"). The activation record no longer exists and the storage it occupied may have been reused. If the closure code tries to refer to a variable in the deallocated record, it retrieves corrupt data. Algol-like languages restrict the usage of closures to prevent this problem from occurring. (In C/C++ the same problem can arise if a data structure points to an object on the stack. Unfortunately, C/C++ does nothing to prevent the catastrophe that occurs if the activation record containing the object is deallocated while the reference still exists.)

In the absence of tail-call optimization which converts procedure calls to jumps, every procedure call allocates a new stack frame. Hence, a recursive procedure may allocate a large number of stack frames. The standard definition of the factorial function, for example, allocates 1001 frames to evaluate 1000! In this case, all of the activation records for invocations of factorial have the same static link.

Programs written in "advanced" languages like Scheme, ML, Jam, and LC can obviously be restricted to accommodate the stack representation of environments by prohibiting closures from being returned as values or stored in data structures. But this restriction reduces these advanced language to Algol with a garbage-collected heap. To accommodate general closures, two implementation strategies are widely used.

The first strategy is to abandon the stack discipline for managing activation records by allocating activation records in the heap and relying on garbage collection to reclaim the storage occupied by activation records. This approach does not use the control stack mechanism provided by the underlying computer architecture.

The second strategy uses a hybrid representation scheme for environments that supplements the stack representation with information stored in the heap. The critical flaw in the stack implementation is that it destroys variables when the evaluation of the corresponding lambda returns. If such variables are stored in the heap and the closure knows how to find them, then the static link stored in the closure representation is unnecessary: all lookups that would have followed the static chain simply access the appropriate variables in the heap. Such an implementation must identify the set of variables that occur free in each lambda-expression and force them to be allocated on the heap (adding a level of indirection to the access protocol). The activation records that would have contained these variables now contain pointers to them (located in the heap) instead. Fortunately, the "closure analysis" required to determine which variables must be heap allocated is easy for a compiler to do.

In the hybrid strategy for supporting closures, the activation record template used to represent a closure must include a pointer field for each free variable in the closure. When a lambda-expression is evaluated, an activation record template is allocated and the values of the pointers to free variables are copied from the relevant activation records on stack.

It is possible to build a good language implementation using either strategy. The hybrid scheme adds a level of indirection to some variable accesses (ordinary lookups of heap allocated variables) but reduces it in others (free variable lookups from within closure bodies). Overall, the hybrid scheme has a modest advantage because it manages memory allocation for activation records more efficiently (through memory reuse and less fragmentation). In addition, the hybrid scheme tends to recover more free memory during garbage collection because it only retains the variable bindings that actually appear in closure bodies, while the brute force heap allocation scheme retains all bindings in the activation records accessible to closures.

In the course of an computation, an exceptional condition may be encountered that requires abandoning a subcomputation. If that subcomputation has a large number of associated stack frames, the "bubbling" action require to pass a special value back up the call chain is time-consuming and awkward to program. What we want is a construct that lets us label a selected stack frame as a recovery point and return a value directly to that frame (just as if the next stack frame had returned normally). The "direct return" operation deallocates all of the stack frames from the current frame back to the recovery frame; this can be done simply by changing the contents of the register serving as the stack pointer. In addition, it places the return value in the standard place (usually a register) in determined by the procedure calling conventions.

PoiNtEr->

Search This Blog

Monday, January 17, 2011

Understanding Run-time Environment Representation and Control

No comments:

Post a Comment