Ramen Language Reference

Values are described first, then expressions, then operations, and finally programs. All these concepts reference each other, so no reading order saves the reader from jumping around. A first reading may not be clear, but everything should fall into place eventually. Starting with a quick look at the glossary may help.

Basic Syntax

Blanks

Any space, tab, newline or comment is a separator.

Comments

As in SQL, two dashes introduce a line comment. Everything from those dashes to the end of that line is treated as space.

There are no block comments.

Quotation

Some rare reserved keywords cannot be used as identifiers unless surrounded by single quotes. Quotes can also be used around operation names if they include characters that would be illegal in an identifier, such as spaces, dots or dashes.

Values

NULLs

Like in SQL, some values might be NULL. Unlike SQL though, not all values can be NULL: indeed value types as well as nullability are inferred at compile time. This benefits performance, as many NULL checks can be done away with, and also reliability, as there is some guarantee that a NULL value will not pop up where it's least expected, such as in a WHERE clause for instance.

Users can check if a nullable value is indeed NULL using the IS NULL or IS NOT NULL operators, which turn a nullable value into a (non-nullable) boolean.

To write a literal NULL value enter NULL.

For any type t, the type t? denotes the type of possibly NULL values of type t.
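For instance, assuming a nullable input field resp_time (a hypothetical name), the NULL-testing operators yield non-nullable booleans:

  SELECT resp_time IS NULL AS is_missing,
         resp_time IS NOT NULL AS is_known
  FROM source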

Booleans

The type for booleans is called boolean (bool is also accepted). The only two boolean values are spelled true and false.

It is possible to (explicitly) convert integers and even floating point numbers from or to booleans. In that case, 0 is equivalent to false and other values to true.
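As a sketch, using the type name as a conversion function (as described for integer casts below; the field names are illustrative):

  SELECT bool(0) AS f,      -- false
         bool(3.14) AS t,   -- true
         u8(true) AS one    -- 1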

Strings

The type for character strings is called string. A literal string is double quoted (with "). To include a double-quote within a string, backslash it. Other characters have a special meaning when backslashed: "\n" stands for linefeed, "\r" for a carriage return, "\b" for a backspace, "\t" stands for tab and "\\" stands for the backslash itself. If any other character is preceded with a backslash then the pair is replaced with that character without the backslash.

Some functions consider strings as UTF-8 encoded, while others treat them as mere sequences of bytes.

Numbers

Floats

The type for real numbers is called float. It is the standard IEEE 754 64-bit float.

Literal values will cause minimum surprise: dot notation ("3.14") and scientific notation ("314e-2") are supported, as well as hexadecimal notation (0x1.fp3) and the special values nan, inf and -inf.

Integers

Ramen allows integer types of 9 different sizes from 8 to 128 bits, signed or unsigned: i8, i16, i24, i32, i40, i48, i56, i64 and i128, which are signed integers, and u8, u16, u24, u32, u40, u48, u56, u64 and u128, which are unsigned integers.

Ramen uses the conventional two's-complement encoding of integers, with silent wrap-around in case of overflow.

When writing a literal integer it is possible to specify the intended type by suffixing it with the type name; for instance: 42u128 would be an unsigned integer 128 bits wide with value 42. If no such suffix is present then Ramen will choose the narrowest possible type that can accommodate that integer value and that's not smaller than i32. Thus, to get a literal integer smaller than i32 one has to suffix it. This is to avoid having non-intentionally narrow constant values that would wrap around unexpectedly.

In addition to the suffix, you can also use a cast, using the type name as a function: u128(42). This is equivalent but more general as it can be used on other expressions than literal integers, such as floats, booleans or strings.

Scales

Any number can be followed by a scale, which is a short string indicating a multiplier.

The recognized scales are listed in the table below:

Numeric suffixes
names       multiplier
pico, p     0.000 000 000 001
micro, µ    0.000 001
milli, m    0.001
kilo, k     1 000
mega, M     1 000 000
giga, G     1 000 000 000
Ki          1 024
Mi          1 048 576
Gi          1 073 741 824
Ti          1 099 511 627 776
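For instance, assuming scales are written directly after the number (the field names are illustrative):

  SELECT 2k AS two_thousand,     -- 2 000
         1.5M AS a_million_and_a_half,
         4Ki AS four_kibi        -- 4 096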

Network addresses

Ethernet

Ethernet addresses are accepted with the usual notation, such as: 18:d6:c7:28:71:f5 (without quotes; those are not strings).

Those values are internally stored as 48-bit unsigned integers (big endian) and can therefore be cast from/to other integer types.

Internet addresses

IP addresses are also accepted, either v4 or v6, using the conventional notations, again without strings.

CIDR addresses are also accepted; for instance 192.168.10.0/24. Notice that there is no ambiguity with integer division since arithmetic operators do not apply to IP addresses.

NOTE: the in operator can check whether an IP belongs to a CIDR.

Compound types

Vectors

Vectors of values of the same type can be formed with [ expr1 ; expr2 ; expr3 ].

The type of vectors of N values of type t is noted t[N].

There is no way to ever build a vector of dimension 0. [] is not a valid value. This simplifies type inference a lot while not impeding the language too much, as the usefulness of a type that can contain only one value is dubious.

One can access the Nth item of a vector with the function GET or by suffixing the vector with the index in between brackets: if v is the vector [1; 2; 3], then GET(2, v) as well as v[2] is 3 (indexes start at 0).

Arrays

An array is like a vector whose dimension is unknown until runtime.

It is not possible to create an immediate array (that would be a vector) but arrays are frequently returned by functions.

Accessing the Nth item of an array uses the same syntax as for vectors.

Tuples

Tuples are an ordered set of values of any type.

They are written within parentheses with values separated with semicolons, such as: (1; "foo"; true).

One can extract the Nth item of a tuple with the special notation 1st, 2nd, 3rd, 4th, and so on. For instance, 2nd (1; "foo"; true) is "foo".

Records

Records are like tuples, but with names given to each item.

Immediate values of record type use curly braces instead of parentheses, and each item is preceded with its name followed by a colon. For instance: { age: 64; name: "John Doe" }.

To access a given field of a record, suffix the record with a dot (.) and the field name: given the above record value as person, then person.name would be "John Doe".

Units

In addition to a type, values can optionally have units.

When values with units are combined then the combination units will be automatically computed. In addition, Ramen will perform dimensional analysis to detect meaningless computations and will emit a warning (but not an error) in such cases.

The syntax to specify units is quite rough: in between curly braces, separated with a star (*) (or, equivalently, with a dot (.)), units are identified by name. Those names can be anything; they are mere identifiers to Ramen. You can also use the power notation (^N), which avoids multiplying the same unit several times and allows negative powers, as there is no dedicated syntax for dividing units.

Finally, each individual unit can be interpreted as relative if its identifier ends with (rel). Relative units are expressed as a difference from some point of reference, whereas absolute units have no such reference. For instance, a duration is expressed in seconds (absolute), whereas a date can be expressed in seconds since some epoch (relative to the epoch). Similarly, heat can be expressed in Kelvins (absolute) whereas a temperature can be expressed in Kelvins above freezing water temperature (relative).

Relative or absolute units follow different rules when combined.
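Putting the above together, unit annotations could look like this (the unit names are arbitrary identifiers, chosen here for illustration):

  {seconds}             -- absolute seconds, e.g. a duration
  {seconds(rel)}        -- seconds relative to some epoch, e.g. a date
  {bytes*seconds^-1}    -- a rate: bytes per second
  {bytes.bytes}         -- equivalently, {bytes^2}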

Expressions

Literal values

Any literal value (as described in the previous section) is a valid expression.

Record field names

In addition to literal values one can refer to a record field. Which records are available depends on the clause but the general syntax is: record.field_name.

The prefix (before the dot) can be omitted in many cases; if so, the field is understood to refer to the "in" record (the input record).

Here is a list of all possible records, in order of appearance in the data flow:

Input value

The value that has been received as input. Its name is in and that's also the default when a record name is omitted.

You can use the in value in all clauses as long as there is an input. When used in a commit clause, it refers to the last received value.

Output value

The value that is going to be output (if the COMMIT condition holds true). Its name is out. The only place where it can be used is in the commit clause.

It is also possible to refer to fields from the out record in the select clause that creates the out value, but only if the referred fields have been defined earlier. So for instance this is valid:

  SELECT
    sum payload AS total,
    end - start AS duration,
    total / duration AS bps

...where we both define and reuse the fields total and duration. Notice that here the name of the record has been elided -- although "in" is the default, under some conditions it is OK to leave out the "out" prefix as well. This would be an equivalent, more explicit statement:

  SELECT
    sum in.payload AS total,
    in.end - in.start AS duration,
    out.total / out.duration AS bps

It is important to keep in mind that the input and output values have different types (in general).

Previous output value

Named previous, refers to the last output value for this group.

Can be used in select, where and commit clauses.

Notice that referring to this value in an expression has some runtime cost in terms of memory, since the group state must be kept in memory after the group value has been committed. So take care not to use this when grouping on potentially large key spaces.

When no values have been output yet, any field read from previous is just NULL. Therefore, using previous makes it mandatory to always test for NULL.

Technically, the previous value has the same type as the out value with all fields nullable.
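For instance, assuming a COALESCE operator that behaves as in SQL, and a hypothetical field counter, one could compute a per-group delta while handling the initial NULL:

  SELECT counter - COALESCE(previous.counter, 0) AS delta
  FROM source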

The other name of the previous value is local_last, to contrast it with the next special value.

Previous output value (globally)

Named global_last, this value is similar to previous aka local_last, but it holds the last value that's been output from any aggregation group, i.e. regardless of the key.

It therefore has the same type as local_last.

Unlike local_last, global_last incurs no runtime penalty.

Parameters

In addition to the values read from parent operations, an operation also receives some constant parameters that can be used to customize the behavior of a compiled program. See the section about defining programs below.

Such parameters can be accessed unambiguously by prefixing them with the value name param.

There is no restriction as to where this record can be used.

Environment

A RaQL program can also access its environment (as in the UNIX environment variables). Environment variables can be read from the env record.

Any field's value from that record is a nullable string, which will indeed be NULL whenever the environment variable is not defined.

There is no restriction as to where this record can be used.
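For instance, assuming a COALESCE operator that behaves as in SQL, one could read a (hypothetical) environment variable with a fallback for when it is unset:

  SELECT COALESCE(env.HOSTNAME, "unknown") AS hostname
  FROM source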

Conditionals

Conditional expressions can appear anywhere an expression can. Conditions are evaluated from left to right. Contrary to programming languages, but as in SQL, evaluation of the alternatives does not stop as soon as the outcome is determined; in particular, the state of stateful functions will be updated even in non-taken branches.

CASE Expressions

The only real conditional is the case expression. Other forms of conditionals are just syntactic sugar for it. Its general syntax is:

  CASE
    WHEN cond1 THEN cons1
    WHEN cond2 THEN cons2
    ...
    ELSE alt
  END

...where you can have as many when clauses as you want, including 0, and the else clause is optional.

Every condition must be of type bool. Consequents can have any type, as long as they all have the same one. That is also the type of the result of the CASE expression.

Regarding nullability: if there is no else branch, or if any of the conditions or consequents is nullable, then the result is nullable. Otherwise it is not.
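For instance, assuming a non-nullable numeric field age, the following CASE yields a non-nullable string (all branches have the same type, and there is an ELSE):

  SELECT
    CASE WHEN age < 18 THEN "minor"
         WHEN age < 65 THEN "adult"
         ELSE "senior"
    END AS age_class
  FROM source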

Variants

IF cond THEN cons or IF(cond, cons): simple variant that produce either cons (if cond is true) or NULL.

IF cond THEN cons ELSE alt or IF(cond, cons, alt): same as above but with an ELSE branch.

Operators

Predefined operators can be applied to expressions to form more complex expressions.

You can use parentheses to group expressions. A table of precedence is given at the end of this section.

There is no way to define your own operator, short of adding them directly into Ramen source code.

Operator States

Most operators are stateless but some are stateful. Most stateful operators can also behave as stateless operators when passed a sequence of values. Many operators will aggregate values into a set in a certain way, and can thus be used as operands to other aggregate operators.

For instance you can write: SELECT AVG v to get the average value of all input values of v, or SELECT AVG SAMPLE 100 v to get the average value of a random set of 100 values of v. In the latter case, AVG does not really perform any aggregation but SAMPLE does.

For every stateful operator that actually performs an aggregation, one can pick between two possible lifespans for the operator's state: local to the current aggregation group, which is the default whenever an explicit group-by clause is present, or global (a single state shared by all groups). Thus it is possible to simultaneously compute an aggregate over all values with the same key and over all values regardless of the key, such as in this example:

  SELECT AVG LOCALLY x AS local_avg,
         AVG GLOBALLY x AS global_avg,
         local_avg < global_avg / 2 AS something_is_wrong
  GROUP BY some_key;

Additionally, in some very rare cases it might be necessary to explicitly ask for the aggregate operator to operate over a given set of values, which can be enforced with the "IMMEDIATELY" keyword, such as in: SELECT AVG IMMEDIATELY SAMPLE 100 x.

Aggregates and NULLs

By default, aggregate functions skip over NULL values. Consequently, aggregating nullable values results in a nullable result (since all inputs might be NULL).

To configure that behavior, two keywords can be added right after the operator's name: SKIP NULLS (the default) and KEEP NULLS.
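For instance, assuming a nullable field x, both behaviors can be requested explicitly:

  SELECT AVG SKIP NULLS x AS avg_ignoring_nulls,
         AVG KEEP NULLS x AS avg_including_nulls
  FROM source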

Operator precedence

From higher precedence to lower precedence:

Operator precedence
operator                      associativity
functions                     left to right
not, is null, is not null     left to right
^                             right to left
*, //, /, %                   left to right
+, -                          left to right
>, >=, <, <=, =, <>, !=       left to right
or, and                       left to right

Functions

Read

From files

The simplest way to ingest values may be to read them from CSV files. The READ operation does just that: reading a set of files and then waiting for more files to appear.

Its syntax is:

  READ FROM FILES "file_pattern"
    [ PREPROCESS WITH "preprocessor" ]
    [ THEN DELETE [ IF condition ] ]
  AS CSV
    [ SEPARATOR "separator" ]
    [ NULL "null_character_sequence" ]
    [ [ NO ] QUOTES ]
    [ ESCAPE WITH "escapement_character_sequence" ]
    [ VECTORS OF CHARS AS [ STRING | VECTOR ] ]
    [ CLICKHOUSE SYNTAX ]
    (
      first_field_name first_field_type [ [ NOT ] NULL ],
      second_field_name second_field_type [ [ NOT ] NULL ],
      ...
    )

If THEN DELETE is specified then files will be deleted as soon as they have been read.

The file_pattern, which must be quoted, is a file name that can use the star character ("*") as a wildcard matching any possible substring. This wildcard can only appear in the file name section of the path and not in any directory, though.

In case a preprocessor is given, it must accept the file content on its standard input and output the actual CSV on its standard output.

The CSV will then be read line by line, and a tuple formed from each line by splitting it according to the separator (the one provided, or the default comma (",")). The rules for parsing each individual data type in the CSV are the same as for parsing literal values in the operation code. In case a line fails to parse it will be discarded.

The CSV reader cannot parse headers. CSV field values can be double-quoted to escape the CSV separator from that value.

If a value is equal to the string passed as NULL (the empty string by default) then the value will be assumed to be NULL.

Field names must be valid identifiers (i.e. strings made of letters, digits and underscores, not starting with a digit).

Examples:

  READ FROM FILES "/tmp/test.csv" SEPARATOR "\t" NULL "<NULL>"
  AS CSV (
    first_name string,
    last_name string?,
    year_of_birth u16,
    year_of_death u16
  )
  READ FROM FILES "/tmp/test/*.csv" || (IF param.compression THEN ".gz" ELSE "")
    PREPROCESS WITH (IF param.compression THEN "zcat" ELSE "")
    THEN DELETE IF param.do_delete
  AS CSV (
    first_name string?,
    last_name string
  )

It is also possible to read from binary files in the ClickHouse binary format, which is more efficient. In that case, instead of CSV, the format is called ROWBINARY, and the format specification in between parentheses must be given in ClickHouse's specific "NameAndType" format.

Example:

  READ FROM
    FILES "/tmp/data.chb" THEN DELETE
    AS ROWBINARY (
columns format version: 1
2 columns:
`first_name` Nullable(String)
`last_name ` String
);

From Kafka

It is also possible to read data from Kafka. Then, instead of the FILES specification, one enters a KAFKA specification, like so:

  READ FROM KAFKA
    TOPIC "the_topic_name"
    [ PARTITIONS [1;2;3;...] ]
    [ WITH OPTIONS
        "metadata.broker.list" = "localhost:9092",
        "max.partition.fetch.bytes" = 1000000,
        "group.id" = "kafka_group" ]
    AS ...

The options are transmitted verbatim to the Kafka client library rdkafka; refer to its documentation for more details.

Yield

If you just want a constant expression to supply data to its child functions you can use the yield expression. This is useful to build a periodic clock, or for tests.

Examples:

  YIELD sum globally 1 % 10 AS count
  YIELD 1 AS tick EVERY 10 MILLISECONDS

Yield merely produces an infinite stream of values. If no every clause is specified, then it will do so as fast as the downstream functions can consume them. With an every clause, it will output one tuple at that pace (useful for clocks).

Syntax:

  YIELD expression1 AS name1, expression2 AS name2, expression3 AS name3... [ EVERY duration ]

Select

The select operation is the meat of Ramen operations. It performs filtering, sorting, aggregation, windowing and projection. As each of those processes is optional, let's visit each of them separately before looking at the big picture.

Receiving values from parents - the from clause

Apart from the few functions receiving their input from external sources, most functions receive it from other functions.

The names of those functions are listed after the FROM keyword.

Those names can be either relative to the present program or absolute.

If only a function name is supplied (without a program name) then the function must be defined elsewhere in the same program. Otherwise, the source name must start with a program name. If that program name starts with ../ then it is taken relative to the current program. Otherwise, it is taken as the full name of the program.

Examples:
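Assuming hypothetical program and function names, the three forms could look like:

  FROM total_sizes                   -- a function from the same program
  FROM metrics/network/total_sizes   -- a full program name
  FROM ../base/total_sizes           -- relative to the current program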

Notice that contrary to unix file system names, Ramen program names do not start with a slash (/). The slash character's only special function is to allow relative names.

Filtering - the where clause

If all you want is to select incoming values matching some conditions, all you need is a filter. For instance, if you have a source of persons and want to filter only men older than 40, you could create an operation consisting of a single where clause, such as:

  WHERE is_male AND age > 40 FROM source

As is evidenced above, the syntax of the where clause is as simple as:

  WHERE condition FROM source

Notice that the order of clauses within an operation generally doesn't matter, so this would be equally valid:

  FROM source WHERE condition

The condition can be any expression whose type is a non-nullable boolean.

NOTE: The default where clause is WHERE true.

Joining sources - the merge clause

When selecting from several operations (as in FROM operation1, operation2, ...), the output of all those parent operations will be mixed together. As parents typically run simultaneously, it is quite unpredictable how their outputs will mix. Sometimes, though, we'd like to synchronize those inputs.

It is easy and cheap to merge sort those outputs according to some fields, and the merge clause does exactly that. For instance:

  SELECT * FROM source1, source2, source3 MERGE ON timestamp

In case some parents produce values at a much lower pace than the others, they might slow down the pipeline significantly. Indeed, after each tuple is merged in, Ramen has to wait for the next tuple of the slow source in order to select the smallest one.

In that case, you might prefer not to wait longer than a specified timeout, and then exclude the slow parent from the merge sort until it starts emitting values again. You can do that with the TIMEOUT clause:

  SELECT * FROM parent1, parent2 MERGE ON field1, field2 TIMEOUT AFTER 3 SECONDS

Whenever the timed-out parent emits a tuple again it will be injected into the merge sort, with the consequence that in that case the produced values might not all be ordered according to the given fields.

The merge clause syntax is:

  MERGE ON expression1, expression2, ... [ TIMEOUT AFTER duration ]

Sorting - the sort clause

Contrary to SQL, Ramen sorts the query input, not its output. This is because in SQL ORDER BY is mostly a way to present data to the user, while in Ramen SORT is used to enforce an ordering required by the aggregation operations or the windowing. Also, with a persistent query you do not necessarily know what the output of an operation will be used for, but you do know if and how the operation itself needs its input to be sorted.

Of course, since the operations never end, the sort cannot wait for all the inputs before sorting. The best that can be done is to wait for some entries to arrive, take the smallest of those, wait for the next one to arrive, and so on, thus sorting a sliding window.

The maximum length of this sliding window must be specified with a constant integer: SORT LAST 42, for instance. It is also possible to specify a condition on that window (as an expression) that, if true, will cause the next smallest tuple available to be processed, so that this sliding window is not necessarily of fixed length. For instance: SORT LAST 42 OR UNTIL AGE(creation_time) > 60 would buffer at most 42 values, but would also process one after receiving a tuple whose creation_time is older than 60 seconds.

Finally, it must also be specified according to what expression (or list of expressions) the values must be ordered: SORT LAST 42 BY creation_time.

The complete sort clause is therefore:

  SORT LAST n [ OR UNTIL expression1 ] BY expression2, expression3, ...
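Combining the elements above into a full example (the field name is illustrative):

  SELECT * FROM source
  SORT LAST 42 OR UNTIL AGE(creation_time) > 60 BY creation_time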

Projection - the select clause

To follow up on previous example, maybe you are just interested in the persons name and age. So now you could create this operation to select only those:

  SELECT name, age FROM source

Instead of mere field names you can write more interesting expressions:

  SELECT (IF is_male THEN "Mr. " ELSE "Ms. ") + name AS name,
         age date_of_birth AS age_in_seconds
  FROM source

The general syntax of the select clause is:

  SELECT expression1 AS name1, expression2 AS name2, ...

You can also replace _one_ expression anywhere in this list with a star (*). All fields from the input which are not already present in the list will then be copied over to the output. "Being present" here means having the same field name and a compatible type. Since field names must be unique, it is an error to alias an expression of an incompatible type to the name of an input field while also using the star selector.

NOTE: The default select clause is: SELECT *.

Aggregation

Some functions that might be used in the SELECT clause build their result from several input values, and output a result only when some condition is met. Aggregation functions are a special case of stateful functions, which are functions that need to maintain an internal state in order to produce a result. A simple example is the lag function, which merely outputs the previous value for every new value.

The internal state of those functions can be either global to the whole operation, or specific to a group, which is the default. A group is a set of input tuples sharing something in common, for instance all persons with the same age and sex. Let's take an example and compute the average salary per sex and age; avg is the archetypal aggregation function.

  SELECT avg salary FROM employee GROUP BY age, is_male

What happens here for each incoming tuple:

  1. Extract the fields age and is_male and make them the key of this tuple;
  2. Look for the group for this key.
    • If not found, create a new group made only of this tuple. Initialize its average salary with this employee's salary;
    • If found, add this salary to the average computation.

The group-by clause in itself is very simple: it consists merely of a list of expressions building a key from any input tuple:

  GROUP BY expression1, expression2, ...

You can mix stateful functions drawing their state from the group the tuple under consideration belongs to with stateful functions having a global state. Where a stateful function draws its state from depends on the presence or absence of the globally modifier after the function name. For instance, let's also compute the global average salary:

  SELECT avg salary, avg globally salary AS global_avg_salary
  FROM employee GROUP BY age, is_male

Each time the operation will output a result, it will have the average (so far) for the group that is output (automatically named avg_salary since no better name was provided) and the average (so far) globally (named explicitly global_avg_salary).

Contrary to SQL, it is not an error to select a value from the input tuple with no aggregation function specified. The output tuple will then just use the current input tuple to get the value (similarly to what the last aggregation function would do).

This is also what happens if you use the * (star) designation in the select clause. So for instance:

  SELECT avg salary, *
  FROM employee GROUP BY age, is_male

...would output records made of the average value of the input field salary and all the fields of input records, using the last encountered values.

NOTE: The default group-by clause is: nothing! All incoming records will be assigned to the same and only group, then.

Hopefully all is clear so far. The question that still needs to be addressed is: when does the operation output a result? That is controlled by the commit clause.

Windowing: the commit clause

Windowing is a major difference from SQL, which stops aggregating values when it has processed all the input. Since stream processors model an unbounded stream of inputs, this extra piece of information has to be given.

Conceptually, each time a tuple is received Ramen will consider each group one by one and evaluate the COMMIT condition to see if an output should be emitted.

Obviously, Ramen tries very hard *not* to actually do this, as it would be unbearably slow when the number of groups is large. Instead, it will consider only the groups for which the condition might have changed; usually, that means only the group the current tuple belongs to.

So, the syntax of the commit clause is simple:

  COMMIT AFTER condition

...where, once again, condition can be any expression whose type is a non-nullable boolean.

NOTE: The default commit clause is: true, to commit every selected incoming value.

The next and final step to consider is: when a tuple is output, what should be done with the group? The simplest and most sensible thing to do is to delete it, so that a fresh new one will be created if the same key is ever met again.

Indeed, the above syntax is actually a shorthand for:

  COMMIT AND FLUSH AFTER condition

This additional AND FLUSH means exactly that: when the condition is true, commit the tuple _and_ delete (flush) the group.

If this is the default, what's the alternative? It is to keep the group as-is and resume aggregation without changing the group in any way, with KEEP.

A last convenient feature is that instead of committing the tuple after a condition becomes true, it is possible to commit it before the condition turns true. In practice, this means committing the tuple that would have been committed the previous time the group received an input (and maybe also flushing the group), before adding the new value that made the condition true.

So the syntax for the commit clause that has been given in the previous section should really have been:

  COMMIT [ AND ( FLUSH | KEEP ) ] ( AFTER | BEFORE ) condition

So, as an example, suppose we want a preview of the average salaries emitted every time we added 10 persons in any aggregation group:

  SELECT avg salary, avg globally salary AS global_avg_salary
  FROM employee GROUP BY age, is_male
  COMMIT AND KEEP ALL AFTER (SUM 1) % 10 = 0

Outputting: How Values Are Sent To Child Functions

When Ramen commits a tuple, what tuple exactly is it?

The output tuple is the one created by the select clause, with no more and no fewer fields. The types of those fields are obviously heavily influenced by the type of the input tuple, which itself comes mostly from the output type of the parent operations. Therefore changing an ancestor operation might change the output type of an unmodified operation.

The output tuple is then sent to each of the child operations before a new input tuple is read. No batching takes place in the operations, although batching does take place in the communication between them (the ring-buffers). Indeed, when an operation has no tuple to read it _sleeps_ for a dynamic duration that is supposed to leave enough time for N values to arrive, so that next time the operation is run by the operating system there are, on average, N values waiting. This behavior is designed to be efficient (minimizing syscalls when busy and avoiding thrashing the cache), but offers no guaranteed batching. If a computation requires batches then those batches have to be computed using windowing, as described above.

Outputting: Notifying External Tools

Ramen is designed to do alerting, that is to receive a lot of information, to analyze and triage it, and eventually to send some output result to some external program. By design, there is a huge asymmetry between input and output: Ramen receives large amounts of data and produces very little. This explains why the mechanisms for receiving values are designed to be efficient while the mechanisms for sending values outside are rather designed to be convenient.

In addition (or instead) of committing a tuple, Ramen can output a notification, which is a very special type of tuple. While output values are destined to be read by children workers, notifications are destined to be read by the alerter daemon and processed according to the alerter configuration, maybe resulting in an alert or some other kind of external trigger.

A notification tuple has a name (that will be used by the alerter to route the notification) and some parameters supposed to give some more information about the alert.

So for example, given a stream of people with both a name and a location, we could send a notification each time a person named "Waldo" is spotted:

  FROM people
  SELECT name, location
  -- The notification with its name:
  NOTIFY "Waldo Spotted" WHEN name = "Waldo"

NOTE: Here the NOTIFY clause replaces the COMMIT clause altogether.

This works because the default select clause is SELECT * and WHEN is an alias for WHERE.
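To illustrate those defaults, the following two operations should be equivalent (the stream and field names are hypothetical):

  -- Relying on the default SELECT * and the WHEN alias:
  FROM people
  NOTIFY "Waldo Spotted" WHEN name = "Waldo"

  -- Spelled out in full:
  SELECT * FROM people
  NOTIFY "Waldo Spotted" WHERE name = "Waldo"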

Timeseries and Event Times

In order to retrieve time series from output values it is necessary to provide information about what time should be associated with each tuple.

Similarly, some functions need to know which time is associated with each value (such as TOP or REMEMBER).

Although it is convenient at times to be able to mix events whose times are specified in various ways, it would nonetheless be tedious to compute the timestamp of every event each time it is required.

This is why the way to compute the start and stop times of events is part of every function definition, so that Ramen can compute them whenever needed.

This is the general format of this event-time clause:

  EVENT STARTING AT identifier [ * scale ]
      [ ( WITH DURATION ( identifier [ * scale ] | constant ) |
          AND STOPPING AT identifier [ * scale ] ) ]

Contrary to most stream processing tools, events have not only a time but also a duration, which can be specified either as an actual length or as an ending time. This is because Ramen was originally designed to accommodate call records.

In the above, identifier represents the name of an output field where the event time (or duration) is to be found. scale must be a number; the field it applies to will be multiplied by that number to obtain seconds (either to build a time as a UNIX timestamp or to obtain a duration). constant is a constant number of seconds representing the duration of the event, if it is known and constant.
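As a sketch of the AND STOPPING AT form, a stream of call records carrying both a call start and a call end in milliseconds (the field names made_at and cleared_at are hypothetical) could declare:

  EVENT STARTING AT made_at * 0.001
    AND STOPPING AT cleared_at * 0.001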

Notice that Ramen tries hard to inherit event time clauses from parents to children so they do not have to be specified over and over again.

As a further simplification, if no event-time clause is present but the function outputs a field named start, then that field is assumed to contain the timestamp of the event start time; and similarly for a field named stop.
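For instance (with hypothetical parent and field names), a function reading a time field in milliseconds could simply name its outputs start and stop and omit the event-time clause altogether:

  SELECT time * 0.001 AS start, time * 0.001 + 60 AS stop, value
  FROM sensors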

With all this information, the timeseries command will be able to produce accurate results.

For instance, if we had a minutely metric collection from sensors with a field "time" in milliseconds, we could write:

  SELECT whatever FROM sensors WHERE some_condition
  EVENT STARTING AT time * 0.001 WITH DURATION 30

The Complete Picture

We are now able to give the full syntax and semantics of the Group By operation:

  [ SELECT expression1 AS name1, expression2 AS name2, ... ]
  [ ( WHERE | WHEN ) condition ]
  FROM source1, source2, ...
  [ GROUP BY expression1, expression2, ... ]
  [ [ COMMIT ],
    [ NOTIFY name ],
    [ ( FLUSH | KEEP ) ]
    ( AFTER | BEFORE ) condition ]
  [ EVENT STARTING AT identifier [ * scale ]
     [ ( WITH DURATION ( identifier [ * scale ] | constant ) |
         AND STOPPING AT identifier [ * scale ] ) ] ]

Each of those clauses can be specified in any order and can be omitted, except for the from clause.

The semantics are:

For each input tuple,

  1. compute the key;
  2. retrieve or create the group in charge of that key;
  3. evaluate the where clause:
    • if it is false:
      1. skip that input;
      2. discard the new aggregate that might have been created.
    • if it is true:
      1. accumulate that input into that aggregate (the actual meaning depends on which functions are used in the operation);
      2. compute the current output-tuple;
      3. evaluate the commit clause:
        • if it is true:
          1. send the output tuple to all children;
          2. also maybe store it on disk;
          3. unless KEEP, delete this group.
      4. consider the commit condition of other groups if they depend on the last input tuple.
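To make those steps concrete, here is a hypothetical per-minute aggregation (the field names, the 60 s grouping and the 30 s grace period are all illustrative):

  SELECT min time AS start, max time AS stop, avg value AS avg_value
  FROM sensors
  GROUP BY time // 60
  COMMIT AFTER in.time > out.stop + 30

Each group accumulates one minute worth of values; a group is committed (and, by default, flushed) once an input tuple arrives more than 30 seconds past the end of that group's minute.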

Programs

A program is a set of functions. The order of definitions does not matter. The semi-colon is used as a separator (although omitting the final semi-colon is allowed).

Here is a simple example:

  DEFINE foo AS
    SELECT * FROM other_program/operation WHERE bar > 0;

  DEFINE bar AS YIELD 1 EVERY 1 SECOND;

  DEFINE foobar AS
    SELECT glop, pas_glop FROM bazibar
    WHERE glop >= 42;

Parameters

TODO

Experiments

Internally, Ramen makes use of a system to protect some features behind control flags. This system is usable from the Ramen language itself, which makes it easier for users to run their own experiments.

First, the variant function makes it possible to know the name of the variant the server has been assigned to. This makes it easy to adapt a function's code according to the variant.

Then, the program-wide clause run if ..., followed by a boolean expression, instructs Ramen that this program should only really be run if that condition is true.

Using the variant function in this run if clause therefore makes it possible to run a program only in some variants.
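As a sketch (the experiment name my_experiment and the variant name test are hypothetical), a program could restrict itself to one variant like this:

  RUN IF variant("my_experiment") = "test";

  DEFINE f AS SELECT * FROM other_program/operation;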

Custom experiments can be defined in $RAMEN_DIR/experiments/v1/config.