Ramen Documentation
2019-04-05
It is a well-known fact in engineering that the same problem has distinct solutions at different scales.
What is less often agreed upon is how many scales there are. Many software engineers would consider only two: small or big; not scalable or scalable; distributed or not distributed. But for many problems there are more meaningful scales than that.
The scales we are interested in lie somewhere between "a dedicated server can handle it forever" and "a handful of servers, to which we may add a few more later". We are not interested in the "so much hardware that a server or network link fails every two minutes" kind of scale. If you are, then I'm sorry. How are things going at Google?
If a single server cannot handle your load, there are several things that can be tried before jumping to conclusions.
First, the operations being run can be simplified or optimised, for instance with the help of private fields.
A private field is any field whose name starts with an underscore. Like any other named value, it can be used anywhere in an expression after it has been defined, but it remains invisible to other functions, which are not allowed to select it. Indeed, private fields are not part of the output type of a tuple, as one can check with ramen info $SOME_PROGRAM. This also makes sure that those temporary values are never archived, which would waste disk space.
So for instance, one could replace this:
    SELECT 95th PERCENTILE OF (SAMPLE(1000, response_time)) AS resp_time_95th,
           99th PERCENTILE OF (SAMPLE(1000, response_time)) AS resp_time_99th,
           ...

...by:
    SELECT SAMPLE(1000, response_time) AS _sample,
           95th PERCENTILE OF _sample AS resp_time_95th,
           99th PERCENTILE OF _sample AS resp_time_99th,
           ...
...thus performing only one SAMPLE operation instead of two.
Another way to make Ramen faster, of course, is to optimise the functions themselves. Thankfully, there is plenty of room left for optimisation there!
Last month, for instance, the PERCENTILE operation was optimised in two important ways.
Firstly, as it is often the case that one wants to extract several percentiles from the same value, the percentile function's signature has been extended: it now accepts several percentiles, and in that case returns a vector with the various results, like so:
    SELECT [95th; 99th] PERCENTILE (SAMPLE(1000, response_time)) AS _rt_perc,
           1st(_rt_perc) AS rt_95th,
           2nd(_rt_perc) AS rt_99th,
           ...
Then, the percentile computation itself has been improved to use a form of quick-select; the idea is to skip sorting the subparts of the array in which no percentile of interest lies.
After these two optimisations some of the tests ran 10 times faster.
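For the curious, here is a minimal OCaml sketch of that quick-select idea (not Ramen's actual implementation): a quick-select that partitions as usual but recurses only into the partitions that still contain one of the ranks we are after. The name multi_select and its interface are made up for this example:

    (* Select several order statistics at once, sorting only the parts of
       the array where a wanted rank lies. Modifies [a] in place. *)
    let multi_select (a : float array) (ranks : int list) =
      let swap i j =
        let t = a.(i) in a.(i) <- a.(j) ; a.(j) <- t in
      (* Lomuto partition of a.(lo..hi) around the value initially at [hi];
         returns the final index of the pivot: *)
      let partition lo hi =
        let pivot = a.(hi) and k = ref lo in
        for i = lo to hi - 1 do
          if a.(i) < pivot then (swap i !k ; incr k)
        done ;
        swap !k hi ;
        !k in
      let rec loop lo hi ranks =
        if lo < hi && ranks <> [] then begin
          let p = partition lo hi in
          (* Recurse only on the sides where a wanted rank lies: *)
          loop lo (p - 1) (List.filter (fun r -> r < p) ranks) ;
          loop (p + 1) hi (List.filter (fun r -> r > p) ranks)
        end in
      loop 0 (Array.length a - 1) ranks ;
      (* Each wanted rank now holds the value it would have after a full
         sort: *)
      List.map (Array.get a) ranks

With a sample of 1000 values, the 95th and 99th percentiles correspond (by the nearest-rank method) to indices 949 and 989, so one would call multi_select samples [949; 989].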
If that's not enough, then one can throw more CPUs at the problem. Indeed, since workers run concurrently, the more CPUs there are, the less workers will step on each other's toes.
But a single hardware machine does not scale linearly very far with the number of CPUs, so once you've spent a hundred dollars per month on a good server, then what?
We want to be able to handle more input tuples than a single machine can possibly read from its disks or its network interfaces. On top of that, we also want to process data that is produced in several geographic locations, and that we would rather avoid copying over to a single site.
Since Ramen is already multi-program (i.e. several independent processes running concurrently), what prevents the current implementation from spanning several machines?
First, there is no shared hard disk between different machines.
We use the local file system to store a few configuration files and to archive workers' output. It is fine if archives stay on the machine where they are produced, as that is also where they will be re-read by replay workers. But we need a way to distribute the small set of configuration files.
Second, there are no shared signals and no process management between different machines.
I actually find it quite surprising that there is no Linux standard for sharing the list of pids of several kernels, forwarding signals and execve syscalls, to give the illusion of an operating system that extends beyond hardware and geographic boundaries. Of course, for a fully fledged UNIX to run on multiple machines it would take a lot more than that, but as far as Ramen is concerned that would be enough. It seems that a lot of effort has been spent recently, with containers and such, to scale Linux down rather than up.
Last but not least, there is no shared memory between different machines.
Ramen workers use memory-mapped ring-buffers to exchange data. Messages from one worker to another will have to go through the network, preferably directly.
The plan to scale Ramen by one order of magnitude is then straightforward, at least as a first approximation:

- distribute the small set of configuration files to every site;
- have each site's supervisor start only the workers assigned to that site;
- have workers send tuples to their remote children over the network;

...with plenty of details for the devil to hide.
Let's have a deeper look.
To begin with, a notion of site has to be added to the RC file.
I initially thought a site would be a hostname, but I also wanted to test multi-site setups locally without messing with my networking configuration; that problem is easily solved with a level of indirection: a site is now just a name, and there is a list of "services" that resolve into a hostname and a port. In other words, a service registry. That will come in handy when port numbers can no longer be known in advance.
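To make the indirection concrete, here is an illustrative OCaml sketch of such a registry; the types, the service name and the address below are all made up for the example, not Ramen's actual code:

    type site = string                       (* e.g. "dc13.cluster_b42" *)

    (* The registry maps a (site, service name) pair to an address: *)
    module Registry = Map.Make (struct
      type t = site * string
      let compare = compare
    end)

    type address = { host : string ; port : int }

    let registry =
      Registry.empty |>
      Registry.add ("dc13.cluster_b42", "tunneld")
                   { host = "192.168.10.42" ; port = 29329 }
                   (* made-up host and port *)

    (* Resolve a service running on some site into a host and port: *)
    let resolve site service = Registry.find (site, service) registry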
So, now when starting a program with ramen run, one can specify on what site the worker is supposed to actually run: ramen run --on-site 'dc13.cluster_a*'. Similarly, when starting the supervisor, one can specify which site this is: ramen supervisor --site dc13.cluster_b42. This supervisor will then only start the workers that are supposed to run on that site.
Easy enough, but it breaks an assumption: that function names are unique. Now the same function can be running in different places.
Then, workers have to be able to send tuples over the network.
We could merely connect each parent to each child with a socket, but that would not be efficient. Indeed, what if the child threw most of the input tuples away because of a picky WHERE clause? We'd rather have the tuples filtered before they board the plane.
So this is how it really works: for each individual parent and set of identical children, a child "top-half" is run on the same site as the parent. This top-half executes only the first stage of the WHERE clause, and then sends the filtered tuples over the network to the actual children running remotely, which treat those pre-filtered tuples as normal incoming tuples.
As an additional benefit, this frees normal workers from dealing with sending packets to the network, as to them the top-half is just like a locally running child.
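As a rough illustration, here is what a top-half boils down to, in OCaml; this is a conceptual sketch, not Ramen's actual worker code, and every parameter below is an assumption made for the example:

    (* Run next to the parent, apply the cheap first stage of the child's
       WHERE clause, and forward only the surviving tuples to the remote
       site over a single TCP connection: *)
    let run_top_half ~where ~serialize ~read_tuple ~remote_host ~remote_port =
      let addr =
        Unix.ADDR_INET (Unix.inet_addr_of_string remote_host, remote_port) in
      let _ic, oc = Unix.open_connection addr in
      let rec loop () =
        match read_tuple () with
        | None -> ()                      (* parent is done *)
        | Some tuple ->
            if where tuple then begin     (* first stage of the WHERE *)
              output_string oc (serialize tuple) ;
              flush oc
            end ;
            loop () in
      loop ()

The point is that the where filter runs before any byte hits the network; read_tuple would stand for the usual ring-buffer dequeue from the parent.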
And to avoid each child having to mess with the network (and the system administrator having to ensure that many ports are reachable), we instead run a single copy service per site, which accepts incoming connections and forwards incoming messages to the specified input ring-buffer. This is called the tunneld service, and it is the first (and so far only) service defined in the service registry.
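In the same spirit, here is a conceptual sketch of what tunneld does; the framing (two marshalled strings per message) and the write_to_ringbuf helper are made up for the illustration, and the port would in reality come from the service registry:

    (* Stand-in for appending to Ramen's memory-mapped ring-buffers: *)
    let write_to_ringbuf (rb_file : string) (payload : string) =
      ignore rb_file ; ignore payload

    (* Each message names a destination ring-buffer and carries one
       serialized, pre-filtered tuple: *)
    let serve_client ic _oc =
      try
        while true do
          let rb_file : string = input_value ic in
          let payload : string = input_value ic in
          write_to_ringbuf rb_file payload
        done
      with End_of_file -> ()

    let () =
      (* One listening socket per site; the port is arbitrary here: *)
      let addr = Unix.ADDR_INET (Unix.inet_addr_any, 29329) in
      Unix.establish_server serve_client addr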
This is mostly it, and it is enough to have a bare minimal distributed version of Ramen running. But there is still some work required before the whole thing runs properly.
In particular, statistics are not distributed yet, which means proper monitoring of Ramen itself is very hard. More importantly, the allocation of disk storage space is broken: to be reliable it has to be done globally, but it can now only work on a per-site basis (imagine if site A decided that it was better if site B's workers archived their output, while site B decided that it was actually better the other way around!)
So, this one order of magnitude scale-up will likely require that April be dedicated to it as well.