Walk through the implementation of deep field selection

2019-12-07

This blog entry is unusual in that it documents an aspect of Ramen implementation that is amongst the hardest to comprehend.

Arguably, such documentation should reside with the code; But it is a well-known fact that the order of exposition that's best for an explanation does not always match how code is organized in a code base. One typical solution to this problem is literate programming, which comes with its own flaws. Another typical solution which I'm trying here is to supplement code documentation with blog posts.

This one is going to describe the implementation of how subfield access is represented and compiled, with a special focus on unused fields elision.

Compound types

With compound types come the need to access subparts of a value. For now we have 4 possible compound types:

Array like compound types

The simplest compound types are vectors and lists, which are very similar.

All sub-items of vectors or lists are of the same type.

The difference between vectors and lists is that vectors have a dimension that is known at compile time and that must be at least 1, whereas lists have a dynamic length and can be empty. The distinction is useful for type checking but past that point vectors and lists are compiled into the same thing: arrays.

Those compound types, like all other types, are described in the very classical sum type RamenTypes.structure, as variants TVec of int * t and TList of t. Note that the type t here is simply:

type t =
  { structure : structure ;
    nullable : bool }

In this post we do not need to pay attention to NULLable values and therefore from now on you can equal t and structure.

Values of those types are described by the corresponding variants VVec of value array and VList of value array of the type RamenTypes.value.

Immediate values of type vectors can be created with this syntax, for instance here creating a vector of dimension 3: [ 1; 4; 9 ]. This will be parsed as the immediate value: RamenExpr.VVec [| VI32 1l; VI32 4l; VI32 9l |].

It is also possible to enter immediate vector expressions such as: [ 1^2; 2^2; 3^2 ], which will be parsed into an AST of type RamenExpr.r and value:

RamenExpr.{
  text = Vector [
    { text = Stateless (SL2 (Pow, { text = Const (VI32 1l) ; ... },
                                  { text = Const (VI32 2l) ; ... })) ; ... },
    { text = Stateless (SL2 (Pow, { text = Const (VI32 2l) ; ... },
                                  { text = Const (VI32 2l) ; ... })) ; ... },
    { text = Stateless (SL2 (Pow, { text = Const (VI32 3l) ; ... },
                                  { text = Const (VI32 2l) ; ... })) ; ... },

...omitting all fields from RamenExpr.t but text.

Notice that there is no way to write a constant list. Since a constant list would have a known length then it would be a vector (exception: the empty list would kind of make sense). Lists can only appear as the result of some functions.

Given v is a vector or a list and n is any integer, the syntax to access v's nth item is the familiar: v[n].

Product types

The other compound types supported by Ramen are tuples and records.

Unlike array-like types, tuples and records can hold together values of different types.

Tuple types are represented by the variant RamenTypes.TTuple of t array and tuple values by RamenTypes.VTuple of value array. Internally, tuple values will be compiled into OCaml tuples.

Immediate tuple values can be created with the syntax: (1; "two"; 3.0) similar to vectors but with parentheses instead of brackets (and no restriction that every values must have the same type).

Records are very similar to tuples, adding only the possibility to access fields by name. They are also compiled into OCaml tuples. Record types are described by the variant RamenTypes.TRecord of (string * t) array and values by RamenTypes.VRecord of (string * value) array.

Immediate record values are usually created with the SQL select clause syntax: 1 AS foo, "two" AS bar, 3.0 AS baz. Also, to help disambiguate the parsing in case of nested records, one can use parenthesis and the AZ keyword instead of AS: (1 AZ foo, "two" AZ bar, 3.0 AZ baz).

The syntax to access a subfield f of a record r is the familiar dotted notation r.f. To access the nth item of a tuple t there is no such convenient shorthand and one has to use the get function: get(n, t).

Actually, that get function is universal and all other forms seen so far are just syntactic sugar for get:

v[42] is equivalent to get(42, v), r.f is get("f", r), and foo.bar[7].baz is get("baz", get(7, get ("bar", foo))).

Accessing deep fields, and field elision

As far as parsing is concerned, get is pretty much an ordinary function, defined as the variant RamenExpr.Get of the RamenExpr.stateless2 type that denotes all stateless functions with 2 parameters, with some added syntactic sugar on top.

But get is much more than a usual function. In conjunction with the other special expressions RamenExpr.Variable, Path and Binding, Get implements subfield access.

Variables

What's called variable in this context is, not unlike in general purpose programming languages, a named handle for some value. Ramen variables are not only immutable, but they cannot even be created. There are only a handful of given variables depending on the context.

Variables are used to hold a function input value, its output value, the record of program parameters, the record representing the environment, etc. Some are seldom used such as the variable holding the smallest input value in the input sorting buffer.

Variables can be referenced by their name: in and out, for instance, are the variables holding respectively the input and output value of a function, whereas smallest is the name of the variable holding the smallest input value in the current sort buffer. Those variables are usually of a record type, and then their name can sometimes be omitted, so that for instance in the SELECT clause one can write simply foo instead of in.foo.

When it is omitted, a variable name is represented by the value Unknown. Therefore, foo is equivalent to get("foo", unknown).

The list of all those well-known variables can be found in RamenLang.parse_variable, the function that parses such a name into a variable.

The parser RamenExpr.Parser.sugared_get, which is responsible for parsing all the syntactic sugar around the get function, is also where omitted variable names are assigned to the variable Unknown.

Unknown variables are replaced by actual ones in RamenOperation.resolve_unknown_variables as part of RamenOperation.checked, a function that is called right after an operation has been parsed, to perform various checks, in particular to check that the accessed variables are indeed available when they are accessed (for instance, using smallest outside of a SORT operation will raise an error).



There is also a special variable named Record that is used to refer to any previously opened record (more on this later).

Paths

RamenExpr.Path expressions are similar to RamenExpr.Get expressions, with a few important differences:



To begin with, there is no way to create a path expression from the expression parser. Only later during the compilation process will some get functions be replaced by equivalent paths.

Then, path expressions can handle deep access directly: a chain of get functions such as get("baz", get(7, get ("bar", in))) will be replaced with a single Path expression.

Last but not least, paths apply only to input record.



Indeed, path expressions, which variant is RamenExpr.Path of path_comp list, materialize deep field selection into the parent functions output.

For instance, in.bar[7].baz (or merely bar[7].baz whenever In is the default variable) will be turned into RamenExpr.Path [ Name "bar"; Idx 7; Name "baz" ].

The function responsible for this change is RamenFieldMaskLib.subst_deep_fields, called as part of RamenCompiler.precompile (Note: what is called precompilation in this context is merely parsing + typing of ramen programs. The actual compilation, ie. generation of a native executable, takes place elsewhere both in time and place. Look for it in RamenMake).

Once all input field selections have been replaced by paths, Ramen can start to think about what it needs from the parent output values and how to obtain them.

FieldMasks

A field mask represents the selection of fields that a given function wants to receive from each of its parents.

For instance, in this simple program:

DEFINE parent AS YIELD 1 AS one, 2 AS two, 3 AS three EVERY 1s;
DEFINE child AS SELECT two + two AS four FROM parent;


...the parent function must send to its child the field two only, which cam be represented succinctly by the string "_X_" meaning: skip the first field, send the second field, and skip the third field. The equivalent OCaml expression is: RamenFieldMask.[| Skip; Copy; Skip |].

Those field masks in shorter string form are stored in the out_ref files as part of the destination ring-buffer file name.

Deep field selection complicates this picture significantly.

What was basically a flat bitmask becomes a recursive structure:

DEFINE parent AS YIELD 0 AS fist, (1 AZ thumb, [2; 3; 4; 5] AZ fingers) AS hand EVERY 1s;
DEFINE child AS SELECT finger[2] AS ring, fingers[3] AS pinky;


Here the child uses only the 3rd and 4th items of the fingers vector of the hand record (of the input record), and we want to transmit only these from the parent to the child; not the whole output record nor even the whole finger vector.

The field mask become: "_(_(__XX))", meaning: skip the first field (fist), then enter the second field, then skip its first field (thumb), then enter its second (finger), then skip, skip, and finally copy and copy. Or the equivalent OCaml expression: [| Skip; Rec; Skip; Rec; Skip; Skip; Copy; Copy |].

So, back to our story: Ramen has the set of all paths needed from the parent, and must be able to compute the corresponding field mask. That is the task of the RamenFieldMaskLib module, starting with function fieldmask_of_operation, and this is needed every time the out_ref specification is needed: when updating the out_ref files of course, but also when generating the code to serialize values.

This serialization function used in a parent function must handle all possible field masks (as we do not want to recompile a running worker every time a new child is attached to it, each with its special needs). The crux of the code generator is a field mask crawler in CodeGen_OCaml.emit_for_serialized_fields. The generated code will take a fully fledged output value and a field mask and serialize only the requested subfield from that value.

For now, there is still a small but important operation to perform in the precompilation stage: that of computing the input type of each function. This is subtly different from the set of all paths, for two reasons:


If all items of a compound value are selected then the whole value is selected rather than every individual items, simplifying the code to serialize a record;
If a compound value is selected then there is no more need to also select any individual sub items.


Bindings

Code is generated for many small functions that receive some well-known variable as parameter. For instance, the function implementing the fast variant of the where clause receives the Param, Env, In, MergeGreatest and OutPrevious variables, which hold respectively the record with all program parameters, the record with all used environment variables, the input value, the largest value form the optional merge buffer and the latest output value.


Instead of hard-coding in the code generator how to access each of those variables, a more general stack of accessible values is passed around. Each item of this stack is composed of a pair of a RamenExpr.binding_key and whatever code is supposed to be equivalent to this value at runtime. A binding_key identifies either a variable, a subfield of a variable, or a given stateful expression internal state, or, as a special case, any other runtime identifier.


This stack is initialised with all accessible variables (and all the required internal states of all present stateful expressions) before calling the code generator for a given function.

Provided all Path and Get expressions have been substituted with an equivalent Binding of binding_key expression, the AST walker can then retrieve any referenced object by looking it up in that stack. This substitution happens in CodeGen_OCaml.subst_fields_for_binding.

Beware that this stack is classically called the "environment" (~env named parameter), not to be confused with the UNIX process environment available via the Env variable.

This process or substitution from Path and Get expressions to bindings, then building the environment stack, and looking up in that stack how to actually transpose each access into concrete code, can look convoluted compared to hard-coding how to access each possible Path and Get expressions, but once the machinery is in place it is conceptually simpler, and allows an interesting trick: local opening of records.

Indeed, as soon as a record field value is known, a reference to it is pushed onto the environment stack (and popped after the record has been compiled) so that the following expressions can refer to it by name. This allows to write such things as:

SELECT 42 AS foo, foo+1 AS bar


...where at the location where bar is computed a binding to foo is present in the stack (may be shadowing another foo defined earlier).

Conclusion

Of course many things have not been covered by this short post, especially with regard to code generation, but at least these simple explanations will highlight some useful milestones and help the adventurous hacker to get the gist of some of the most obscure expressions.