<BODY >

"http://www.w3.org/TR/REC-html40/loose.dtd"> Pointers

3 Pointers

As in C, Cyclone pointers are just addresses. Operations on pointers, such as *x, x->f, and x[e], behave the same as in C, with the exception that run-time checks sometimes precede memory accesses. (Exactly when and where these checks occur is described below.) However, Cyclone prevents memory errors such as dereferencing dangling pointers, so it may reject legal C operations on pointers.

In order to enforce memory safety, Cyclone pointer types contain more information than their C counterparts. In addition to the type of the object pointed to, pointer types indicate:

Whether a value of the type may be NULL
The number of objects pointed to
The region into which the pointer points

For example, the type int *{7}`H is for possibly-null pointers to seven int objects in the heap. The syntax and semantics of all this additional pointer information is now explained. Then we introduce a new type for arrays of unknown size. Pointer arithmetic is allowed only on this last collection of types. Throughout, we mention planned improvements. We end with a summary.

Whether a value of the type may be `NULL`

Cyclone's type system distinguishes between pointers that may be NULL and those that may not.

Syntax and Semantics

The syntax is straightforward: The * character is for pointers that may be NULL (as in C), and the @ character is for pointers that may not be NULL. So ``int * x = NULL;'' is accepted, but ``int @ x = NULL;'' is not.

Subtyping

For any type t, the type t@ is a subtype of t*. The type of malloc(sizeof(t)) is t@, as is new e where e has type t. Hence in the declaration, ``int *x = malloc(sizeof(int))'', there is an implicit legal cast from t@ to t*. Note that even when t1 is a subtype of t2, the type t1* is not necessarily a subtype of t2*, nor is t1@ necessarily a subtype of t2@. For example, int@@ is not a subtype of int*@. This illegal code shows why:


  void f(int @@ x) {

    int *@ y = x; // would be legal were int *@ a subtype of int @@

    *y = NULL;    // legal because *y has type int *

    **x;          // seg faults even though the type of *x is int @

  }

You can explicitly cast a value of type t* to t@. Doing so will perform a run-time check. The cast can be omitted, but the compiler emits a warning and performs the run-time check. Because the current implementation does not consider tests to change a t* to t@, such casts are sometimes necessary to avoid spurious warnings, such as in this code:


  extern void f(int @);



  void g(int * x) {

    if (x != NULL)

       f((int @)x);

  }

Implementation

A run-time null check is a simple comparison. If it fails (i.e., the value is NULL), the exception Null_Exception is thrown. A check is inserted whenever a t* is (explicitly or implicitly) cast to a t@. Casting t@ to t* has no run-time effect.

Safety demands that if x may be NULL, then *e, e.f, e->f, and e[e2] are translated such that we first check that e is not NULL. e is only evaluated once. The only way to guarantee there is no check at run-time is to use @ instead of *. For example, the function on the left performs one check whereas the one on the right performs three (both throw Null_Exception if passed NULL):


  int sum3(int *{3} x) {           int sum3(int *{3} x) {

    int @{3} y = x;                  return x[0]+x[1]+x[2];

    return y[0]+y[1]+y[2];         }

  }

Note that &e->f and &e[e2] check (if necessary) that e is not NULL even though these constructs do not read through e.

Future

We may use dataflow information to avoid spurious warning about implicit casts from t* to t@ and to avoid inserting unnecessary checks. However, the analysis is non-trivial (due to the address-of operator, unstructured control flow, and undefined evaluation order), and the C compiler may be able to eliminate unnecessary checks for us.
For debugging purposes, we may have Null_Exception carry source-position information.

The number of objects pointed to

Syntax and Semantics

The type t@{37} (similarly t*{37}) describes pointers to 37 t values. In other words, if x has type t@{37}, then x[e] is safe so long as e is between 0 and 36, inclusive. If the {n} is omitted, it is implicitly {1}. Currently, the number must be a compile-time constant---see below for arrays of unknown size.

We are taking pains not to say t@{37} describes an array of 37 t values because C (and therefore Cyclone) distinguishes arrays and pointers in certain contexts. For example, a local declaration ``t@{37} x;'' allocates space on the stack for a pointer (which must hold a pointer to 37 t values) whereas ``t x[37];'' allocates space on the stack for 37 t values.

Subtyping

Pointers to more objects are subtypes of pointers to fewer objects (of the same type). An explicit cast is not necessary. Put another way, we could say t@{37} describes pointers to at least 37 t values.

Implementation

The length information is not present at run-time, except implicitly in run-time checks. That is, if e has type t @{37}, the compiler translates e[e2] to check that e2 is less than 37. e2 is evaluated only once. If e2 is a constant expression, there is no run-time check. If e2 is a constant expression not less than 37, it is a compile-time error.

Future

In the future, the bounds information on a pointer will not have to be a compile-time constant. For example, you will be able to write


  void f(int n) {}

    int *{n} arr = new {for i < n : 37};

    ...

  }

This addition is non-trivial because, in terms of the above example, the variable n may be mutated later in the function. In general, we are developing a general system where the sizes of pointer bounds may be expressed in terms of expressions, yet the compiler can always insert the correct bounds check or verify that the check is unnecessary.

Currently, pointer arithmetic is only allowed on types of the form t?. Soon we will allow adding a compile-time constant c to t@{n} (for example), with the type of the result being t@{n-c}. It will be a compile-time error if c > n.

The region into which the pointer points

Syntax and Semantics

The type t@`r describes pointers into region `r. All regions start with the ` character so that they are not confused with identifiers. If the region is omitted, the compiler inserts one. The region inserted depends on where the type occurs, as described below.

The heap region (written `H) conceptually lives forever; in practice, it is garbage-collected.

Every block (i.e., local scope) is a region. If you label a block with L:, then the region's name is `L. Similarly, the parameters to a function f are in a region named `f. Thanks to region inference, you can point into regions without explicit names. For example, you can say int *x = &y if y is a local variable in an unlabeled block. Conceptually, the compiler creates a label for the block and fills in the corresponding region name for you. (The output need not actually have a label.)

Because every pointer has a type and every pointer type has a region, a pointer cannot be mutated so that it points into a different region than it did before the assignment. Often subtyping (see below) is sufficient, but in some cases it is necessary to rewrite C code to use different variables for pointers into different regions. Note that there is no way for a global variable to hold a stack pointer.

Functions are implicitly polymorphic over the regions of their arguments. For example, void f(int *`r); is a prototype that can be passed a pointer into any accessible region. That is, it can be passed a stack pointer or a heap pointer, so long as it is not passed a dangling pointer. Note that our example function f could not possibly assign its argument to a global, whereas void g(int *`H); could. On the other hand, g cannot be passed a stack pointer.

The rules the compiler uses for filling in regions when they are omitted from pointer types are numerous, but they are designed to avoid clutter in the common case:

In function-argument types, a fresh region name is used.
In function-return types, `H is used.
In type definitions, including typedef, `H is used.
In function bodies, unification is used to infer the region based on how the location assigned the pointer type is used in the function.

In the future, we intend to change the rule for typedef so that the meaning can be different at each use of the typedef, as dictated by the other rules. Until then, be warned that


typedef int * foo_t;

void g(foo_t);

is different than


void g(int *);

Also, note that these rules are exactly the same as the rules for omitted regions in instantiations of parameterized types.

Subtyping

t *`r1 is a subtype of t *`r2 if `r1 is known to outlive `r2. In particular, you can always cast a heap pointer to a pointer into another region.

Implementation

A pointer's region is not stored with the pointer at run-time. So there is no way to ask for the region into which a pointer points. For stack regions there is no region object at run-time per se, just the stack space for the objects. As is normal with region-based systems, Cyclone does not prevent dangling pointers. Rather, it prevents dereferencing dangling pointers. But this is a subtle point.

Pointers to an Unknown Number of Objects---The `t ?` Types

So far, we have not provided a way to point to a number of objects when the number is not known at compile-time.

Syntax and Semantics

The type t ? describes such pointers to objects of type t. Such types may be assigned NULL. They may be annotated with a region, which (as with other pointer types) is the region into which the pointer points. Omitted region annotations are filled in by the compiler. Clearly, explicit bounds information makes no sense for these types. If e has type t ?, then e.size has type int and is the number of objects pointed to by type t. (Actually, e.size is allowed for any pointer type, but for other pointers it is evaluated at compile-time.) The meaning of operations on t ? objects is best explained in terms of the implementation.

Implementation

Unlike with types like t*{37}, the implementation stores bounds information with objects of type t ?. Currently, a t ? object occupies three machine words. Conceptually, the object maintains the starting address of the collection of objects pointed to, the length of the collection, and the current value of the pointer used for accessing memory. Pointer arithmetic may cause the access pointer to be out-of-bounds; no error occurs at this point. On the other hand, a subscript operation e1[e2] where e1 has type t? checks that the access-pointer of e1 plus e2 is within bounds of e1. Both e1 and e2 are evaluated once. If the bound is violated the exception Null_Exception is thrown.

When an object of type t? is assigned to, it gets the bounds information from the ``right-hand side'' of the assignment. So x=y copies all of y's fields to the fields of x. Similarly x = y + 17, copies y's fields and then adds 17 to x's access pointer. Finally, x++ just increments x's access pointer. As in C, pointer arithmetic is limited to addition of constants and subtraction. The result of pointer subtraction has type unsigned int, so there is no bounds information.

Even though, t ? types are implemented as multi-word values, comparison operations (e.g., ==) are defined on them---the comparison is performed on the access pointers.

Conversions to/from t ? `r types from/to t*{n}`r and t@{n}`r types exist. Converting to a t? type just uses the t* or t@'s static type information to initialize the bounds information. The cast may be implicit; no warning is given. Converting to a t* or t@ type incurs a run-time check that the access pointer has a value such that the target type's bounds information is sound. If so, the access pointer is returned, else the exception Null_Exception is thrown. Implicit casts of this form cause the compiler to give a warning.

Future

We may add a ``cannot be NULL'' version of these types for sake of completeness. More significantly, we intend to allow user-defined types to have certain fields describe the bounds information for other fields, rather than relying on types built into the language.

Summary and Discussion

A pointer type has one of the following forms, where t is a type and n is a constant unsigned expression:

t *{n}`r, a possibly NULL pointer to n elements of type t in region `r
t @{n}`r, a non-NULL pointer to n elements of type t in region `r
t ? `r, a pointer to an unknown number of elements of type t in region `r. Implemented as a multi-word object.

If {n} is omitted, it is {1}. If the region is omitted, the compiler inserts one. The region inserted depends on where the type is written.

The easiest way to port code is to replace uses of t * with t ?. Of course, this technique does not address region annotations. Functions that can take only heap pointers (because the pointers escape into data structures, for example) will need to add `H annotations for the relevant parameters.

Of course, using t? delays errors until run-time and is less efficient. Using t@ is the most efficient and guarantees that Null_Exception will not be thrown.

Currently, code performing pointer arithmetic must use t?.

3 Pointers

Whether a value of the type may be NULL

Syntax and Semantics

Subtyping

Implementation

Future

The number of objects pointed to

Syntax and Semantics

Subtyping

Implementation

Future

The region into which the pointer points

Syntax and Semantics

Subtyping

Implementation

Pointers to an Unknown Number of Objects---The t ? Types

Syntax and Semantics

Implementation

Future

Summary and Discussion

Whether a value of the type may be `NULL`

Pointers to an Unknown Number of Objects---The `t ?` Types