Portia: a text processor for literate programs

1. What’s literate programming?

2. Why another preprocessor?

Because no literate programming preprocessor that I know of accepts any documentation language together with any programming language. I, for one, like to embed fragments of OCaml in asciidoc files.

Here are the requirements, each with a short rationale:

R1. Must be able to build itself

That’s the fun part.

R2. Must be bootstrappable with funnelweb

Because funnelweb is quite usable. In particular, it is agnostic with regard to the programming language and almost so regarding the documentation language.

It follows from these requirements that we should either follow funnelweb syntax, which is ugly, or build a tool that’s flexible enough to act like funnelweb (or at least, that can understand a workable subset of funnelweb syntax).

Ideally, all escape sequences of the macro system can be redefined. When bootstrapping (with the actual funnelweb) we do not mind the quality of the generated documentation, since we can produce better documentation (and source code, thanks to requirements R3 and R4) by recompiling with the bootstrapped processor.

R3. Add correct file and line number information to the generated source code

Whatever programming language is being used, of course.

R4. Do not output doc; rather, make code extraction flexible enough for the doc to be written in any documentation language in the first place

The documentation being the important part, do not interfere with it.

R5. Code blocks may be inline

Just as short mathematical formulas read better inline.

R6. Must not require all the text, nor all the code, to be present in memory

Funnelweb builds a representation of the whole source code in memory before outputting anything. This is frightening. Although I have never used the literate style for anything but trivial programs, I believe the technique suits huge programs just as well.

R7. Can split content into several files

Quite obvious.

R8. Named code fragments may be defined from several places (even in different files)

In literate programming, the human reader must remember the main names used by the program as well as the names used to reference important code fragments. In order to limit the number of unessential definitions, Funnelweb allows a macro definition to be built incrementally, so that you need not introduce temporary names just to insert commentary in between two related code fragments.

R9. Support an include directive

Since we do not care about the order in which the code fragments are encountered, nor about the order in which we scan the documentation (since, according to R4, we do not produce any documentation), we can merely be given a list of files to scan on the command line. We do not need an include directive as funnelweb (and others) do.

Still, many documentation languages have an include directive, and if we were able to follow it then we could relieve the user of the need to maintain this list of files (since we could find everything from the root document).

So we do both: we will scan everything from the command line and additionally, if we are taught how to spot an include directive, then we will try to follow it.

3. Overview

For generating source code (remember R4: we do not have to generate the doc), we need a few directives:

define and name a block of code (see R5)
reference a block of code by name (from another block of code or from the literate text)
define and output to a file a block of code
add a file to the list of processed files (see R9)

We are going to use the OCaml language because it’s compendious yet fast.

The program basically reads its configuration file (or loads its compiled configuration file), then proceeds with reading the given files, building a dictionary of block definitions and loading additional files along the way, keeping track of code block definitions and of the modification times of the files used, and checking everything.

Then it outputs the code, rereading the files to fetch code blocks so that we do not have to hold in memory a quantity of information equivalent to the resulting source code.

So the basic skeleton, given a list of configuration plugins (plugins) and a list of source files to process (srcfiles), looks like:

@$@<Skeleton@>==@{
List.iter (load_lib !libdir) !plugins ;
(* addlst prepends, so restore the command line order before parsing *)
List.iter PortiaParse.parse (List.rev !srcfiles) ;
Output.all ()
@}

Notice that the configuration does not appear here, nor do the harvested definitions. They live in global variables, which suits this short-lived program just fine. We won’t worry about configuration parameters yet. Suffice it to say that all of these global parameters, regular expressions and functions (remember some filters may be functions) are references defined in a module unambiguously named PortiaConfig.

By loading a conf, we merely want to load a compiled .cmo file:

@$@<ConfigLoad@>==@{@-
let load_lib libdir fname =
    let libname = libdir ^"/"^ fname ^".cmo" in
    PortiaLog.debug "loading lib %s\n" libname ;
    Dynlink.(loadfile (adapt_filename libname))
@}

This is enough for a user to choose between several plugins (funnelweb, …).

So we only need this entry point to parse command line arguments and we are done with the boring work:

@$@<EntryPoint@>==@{@-
let main =
    let plugins = ref [] in
    let libdir = ref PkgConfig.plugindir in
    let srcfiles = ref [] in
    let addlst l s = l := s :: !l in
    Arg.(parse
        [ "-syntax", String (addlst plugins),
                     "Name of the plugin to use for parsing files \
                      (default to funnelweb)" ;
          "-libdir", Set_string libdir,
                     "Where to read plugins from" ;
          "-ignore-missing", Set PortiaDefinition.ignore_missing,
                     "Referenced but never defined blocks are not \
                      an error" ;
          "-debug",  Set PortiaLog.verbose,
                     "Output debug messages" ]
        (addlst srcfiles)
        "portia - literate programming preprocessor\n\
         \n\
         portia [options] files...\n\
         Will output source code from given files.\n") ;
    if !plugins = [] then addlst plugins "funnelweb" ; (* default syntax *)
    @<Skeleton@>
@}

So we have the Main module (linked last):

@O@<main.ml@>==@{@-
open Batteries

@<ConfigLoad@>
@<EntryPoint@>
@}

Let’s focus now on our main data type, the code block definition.

4. Code block definitions

It is well known that naming things is the most difficult part of programming. This has to do with the fact that a large part of programming consists of inventing many simple but abstract concepts with no counterpart in actual life or pre-existing abstractions from other intellectual fields. Sometimes we can borrow a term or two from mathematics, but mostly we have to give names to things that are relevant only to our program, maybe not even for its full lifetime.

So we have to repeatedly look for designations close enough to what we actually mean that, with enough context, it will become clear what concept they refer to. Of course, this context is not easy to acquire without prior knowledge of these somewhat arbitrary definitions, so it takes some time and effort to pull oneself out of this catch-22 situation.

So, now that this short introduction has hopefully inclined my reader to leniency, let’s ask ourselves what the name of a program fragment should be. Shall we keep calling it a "program fragment"? But this implies that it is part of a program, which is not required. Should it be an "extract", considering that this fragment is separated from the rest for display purposes? Or a "phrase", considering that it’s part of a larger "discourse" (the program)?

Or, if we forget what we manipulate to consider how we manipulate it, should it be called a "macro body"? Merely a "macro" or "body"? Or a "definition"? This point of view is appealing since it makes our program a more general purpose text processor rather than a specialized tool for literate programming (of course a type or variable name is not really part of the running program and so cannot alter its behavior in any way, but I believe in the power of names to influence our reasoning about abstractions and that giving generic names helps in building generic programs).

Let’s call these code fragments "definitions", then.

What information is there in a definition? We have already seen that we need its location, used both for error reporting (file name, line number and column) and for fetching it quickly on demand (offset and size in bytes).

Also note the recording of the mtime, so that we can tell, when fetching the body, whether the file has changed since we collected this definition.

@$@<Location@>==@{@-
type location = { file : string ;
                 mtime : float ;
                lineno : int ;
                 colno : int ;
                offset : int ;
                  size : int }@}

with the convenient convention that line and column numbers are actually offsets from the start, and thus start at 0, which allows us to add them naturally.

Rather than having a single location, we want to allow a definition to be split across many locations. The body of the definition is then the concatenation of all fragments in order of appearance, which is determined by the depth-first exploration of included files, the ordering of files on the command line and finally the order in which we meet definitions in a given file.

A definition also has an identifier (unique for the file it’s located in), which can be any string:

@$@<DefName@>==@{type id = string@}

and a flag telling us whether this definition is supposed to be output to a file, with two consequences:
- of course, the expanded body of this definition will be written into a file (whose name will be the identifier);
- and you are not allowed to refer to this definition from another one.

The user should be warned about any code fragment that is not, directly or indirectly, referenced from an output definition.

@$@<DefinitionType@>==@{@-
@<Location@>
@<DefName@>
type t = { locs : location list ; (* reverse order *)
             id : id ;
         output : bool }
@}

Notice that it’s possible that there is no location at all (empty list), meaning the definition was missing (this can conveniently be allowed with the ignore_missing flag, to make it possible to generate a valid program and write the extensions later on).

It’s always a good idea to write proper printers for any new type. This may look tedious but you are actually doing yourself a favor: better to have these printers ready before they are needed than to have to write them quickly while struggling with a bug. Especially when using Batteries, which makes writing and using such printers so easy.

So here they are:

@$@<Definitions@>+=@{@-
let location_print fmt loc =
    Printf.fprintf fmt "%s:%d.%d-%d"
        loc.file loc.lineno loc.colno (loc.colno+loc.size)

let mtime_print = Float.print (* TODO: user friendly date&time? *)

let rec locations_print fmt = function
    | [] -> Printf.fprintf fmt "undefined"
    | [loc] -> location_print fmt loc
    | _loc::locs' ->
        locations_print fmt locs' (* print only the first location *)

let print fmt t =
    Printf.fprintf fmt "%s@%a" t.id locations_print t.locs
@}

Now obviously we also want to fetch a definition body from its file (checking mtime):

@$@<Definitions@>+=@{@-
exception FileChanged of string
let fetch_loc loc =
    let open Unix in
    let fname = loc.file in
    if (stat fname).st_mtime > loc.mtime then
        raise (FileChanged fname) ;
    read_file fname loc.offset loc.size
@}

Then, we will need a way to add definitions to a global registry, and the associated lookup function. Definitions are created from a file, offset and length (line and column numbers are not given and will be computed when registering, so that the work of plugin authors is kept to a minimum) and of course the identifier for the definition.

@$@<Definitions@>+=@{@-

let ignore_missing = ref false

let registry = Hashtbl.create 31

let add id output fname off sz =
    let loc = location_in_file fname off sz in
    PortiaLog.debug "Add definition for %s at position %a\n"
        id location_print loc ;
    Hashtbl.modify_opt id (function
        | None   -> Some { id ; output ; locs = [loc] }
        | Some t -> Some { t with locs = loc :: t.locs })
        registry

let lookup id =
    try Hashtbl.find registry id
    with Not_found ->
        if !ignore_missing then { id ; output = false ; locs = [] }
        else (
            Printf.fprintf stderr "Cannot find definition for '%s'\n" id ;
            exit 1
        )
@}

Where location_in_file is responsible for returning a correct location (complete with proper line and column numbers) from the file name, offset and size:

@$@<LocationInFile@>==@{@-
let location_in_file file offset size =
    let mtime = Unix.((stat file).st_mtime) in
    let txt = read_file file 0 offset in
    let colno = colno_at txt
    and lineno = lineno_at offset txt in
    { file ; offset ; size ; mtime ; lineno ; colno }
@}
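To make the zero-based convention concrete, here is a minimal standalone sketch that computes both numbers in a single pass. The function name line_col_of_offset is hypothetical (it is not part of Portia, which uses lineno_at and colno_at instead) and it assumes the offset does not exceed the length of the text:

```ocaml
(* Return the zero-based (line, column) of byte [offset] in [txt].
   Hypothetical helper, for illustration only. *)
let line_col_of_offset txt offset =
  let line = ref 0 and col = ref 0 in
  for i = 0 to offset - 1 do
    if txt.[i] = '\n' then (incr line ; col := 0) else incr col
  done ;
  (!line, !col)

let () =
  let txt = "glop\npas glop\n" in
  (* Offset 7 is the 's' of "pas": second line, third column, counted from 0. *)
  let line, col = line_col_of_offset txt 7 in
  Printf.printf "%d:%d\n" line col
```

Because both numbers count from 0, the line number of a position inside a body is simply the line number of the body plus the line number of that position within the body.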

Now the last part: expansion. Given a function PortiaConfig.find_references (supplied by the configuration) that’s able to spot all expansion points from a non expanded body, and the registry of all known definitions, let’s build a function that will return the complete expanded body (or signal a problem).

@$@<Definitions@>+=@{@-

let line_start txt offset =
    let rec loop c =
        if c < 1 || txt.[c-1] = '\n' then c
        else loop (c-1) in
    let start_pos = loop offset in
    String.sub txt start_pos (offset - start_pos)

(*$= line_start & ~printer:identity
  (line_start "glop" 0) ""
  (line_start "glop" 2) "gl"
  (line_start "glop\npas glop\n" 4) "glop"
  (line_start "glop\npas glop\n" 5) ""
  (line_start "glop\npas glop\n" 7) "pa"
*)

let indent_at unexpanded start =
    line_start unexpanded start |>
    String.fold_left (fun (need_nl, len) c ->
        (if Char.is_whitespace c then need_nl else true),
        len+1) (false, 0)

(*$= indent_at & ~printer:dump
  (indent_at "glop" 0) (false, 0)
  (indent_at "glop" 2) (true, 2)
  (indent_at "glop\npas glop\n" 4) (true, 4)
  (indent_at "glop\npas glop\n" 5) (false, 0)
  (indent_at "glop\npas glop\n" 7) (true, 2)
*)

let rec expanded_loc tab loc =
    let unexpanded = fetch_loc loc in
    PortiaLog.debug "expand '%s'\n" unexpanded ;
    (* Start with line number information. *)
    let txt = !PortiaConfig.linenum loc.lineno loc.file in
    (* find_references returns a list of (id, start_offset, stop_offset) *)
    let refs = !PortiaConfig.find_references unexpanded |>
               List.sort (fun (_,o1,_) (_,o2,_) -> compare o1 o2) in
    PortiaLog.debug "found references: %a\n"
        (List.print (Tuple3.print String.print Int.print Int.print)) refs ;
    let txt, last_stop =
        List.fold_left (fun (txt,last_stop) (id,start,stop) ->
            assert (start >= last_stop) ;
            let need_new_line, tab' = indent_at unexpanded start in
            (* If we do not need a new_line it's because we have only blanks
               before the expansion, and we do not want to copy those because
               every body must start at column 0 *)
            let start = if need_new_line then start else start - tab' in
            if not need_new_line then (
                PortiaLog.debug "no new line needed, tab=%d, tab'=%d\n" tab tab'
            ) ;
            let txt = txt ^
                      indent tab (String.sub unexpanded last_stop
                                             (start - last_stop)) in
            if not need_new_line then (
                PortiaLog.debug "appended '%s'\n"
                    (String.sub unexpanded last_stop (start - last_stop))
            );
            let txt = if need_new_line then txt ^ "\n" else txt in

            let t' = lookup id in
            let body = expanded_body (tab+tab') t' in
            let txt = txt ^ body in

            (* add a linenum indication that we are back in this block *)
            let ln = !PortiaConfig.linenum
                         (loc.lineno + (lineno_at stop unexpanded))
                         loc.file in
            let txt = txt ^
                (if String.length ln > 0 &&
                    String.length txt > 0 &&
                    txt.[String.length txt - 1] <> '\n' then
                    "\n" else "") ^ ln in
            txt, stop)
            (txt, 0) refs in
    (* Complete with what's left *)
    let rest = String.length unexpanded - last_stop in
    txt ^ indent tab (String.sub unexpanded last_stop rest)

and expanded_body tab t =
    List.rev t.locs |>
    List.map (expanded_loc tab) |>
    String.concat ""
@}

Where each substituted definition is properly indented according to its insertion point. We must now complete this module with the helper functions we have used so far to deal with text files and locations:

@$@<TxtHelpers@>==@{@-
let indent =
    let open Str in
    let re = regexp "\n\\([^\n]\\)" in
    fun tab str ->
        let spaces = String.make tab ' ' in
        spaces ^ global_replace re ("\n"^ spaces ^"\\1") str

(*$= indent & ~printer:identity
  "glop"   (indent 0 "glop")
  "  glop" (indent 2 "glop")
*)

(* first char is at column 0 *)
let colno_at txt =
    let rec aux colno p =
        if p = 0 || txt.[p-1] = '\n' then colno else
        aux (colno+1) (p-1) in
    aux 0 (String.length txt)

(* first line is 0 *)
let lineno_at pos txt =
    let rec aux p n =
        if p >= pos then n else
        aux (p+1) (if txt.[p] = '\n' then n+1 else n) in
    aux 0 0

let read_file fname offset size =
    let open Unix in
    let fd = openfile fname [O_RDONLY] 0 in
    lseek fd offset SEEK_SET |> ignore ;
    let buf = Bytes.create size in
    (* read may return fewer bytes than asked for; 0 means end of file *)
    let rec read_chunk prev =
        if prev >= size then prev else
        match read fd buf prev (size-prev) with
        | 0 -> prev
        | act_sz -> read_chunk (prev + act_sz) in
    let len = read_chunk 0 in
    close fd ;
    Bytes.sub_string buf 0 len
@}
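To illustrate what indent does to a multi-line body, here is a small sketch that reproduces its behavior with the standard library only (no Str). It is a re-implementation written for this example, not the code used by Portia: the first line is always prefixed, and so is every line that follows a newline, except empty lines and the end of the text.

```ocaml
(* Standalone sketch of the indentation rule, standard library only. *)
let indent tab str =
  let spaces = String.make tab ' ' in
  let buf = Buffer.create (String.length str + tab) in
  Buffer.add_string buf spaces ;
  String.iteri (fun i c ->
    Buffer.add_char buf c ;
    (* Only indent after a newline that is followed by a non-newline char. *)
    if c = '\n' && i + 1 < String.length str && str.[i + 1] <> '\n' then
      Buffer.add_string buf spaces) str ;
  Buffer.contents buf

let () =
  (* The empty line in the middle receives no trailing spaces. *)
  print_string (indent 2 "let x = 1 in\n\nx + 1\n")
```

Leaving empty lines untouched avoids producing trailing whitespace in the generated source.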

Regarding linenum, this function depends on the programming language used. The default implementation from PortiaConfig does not output any linenum directive. But dedicated plugins are easy to write. First, for OCaml:

@O@<ocaml.ml@>==@{@-
let linenum lineno fname =
    Printf.sprintf "# %d \"%s\"\n" (lineno+1) fname

let () = PortiaConfig.linenum := linenum
@}

and for C:

@O@<c.ml@>==@{@-
let linenum lineno fname =
    Printf.sprintf "#line %d \"%s\"\n" (lineno+1) fname

let () = PortiaConfig.linenum := linenum
@}

Notice that those directives follow the GNU convention that:

Line numbers should start from 1 at the beginning of the file, and column
numbers should start from 1 at the beginning of the line.

One more word about the linenum directive in OCaml. It is documented in chapter 6.1 (lexical conventions) of the OCaml manual, and from this documentation it appears that it is not constrained to appear alone on a line. We do make some effort to place these directives on dedicated lines, in order to generate better looking source files.

Also, notice that we must insert a linenum directive at the insertion point of each definition body, and another after each expansion to return to the previous location.

With these functions we are now ready to start the real job of parsing input file(s) and writing output definitions.

@O@<portiaDefinition.ml@>==@{@-
open Batteries

@<DefinitionType@>
@<TxtHelpers@>
@<LocationInFile@>
@<Definitions@>
@}

5. Parsing

Parsing is a pretentious appellation, since we merely need to spot three things in the input files:

optional include command (with its filename parameter) to instruct us how to gather other file names to inspect;
code definitions;
in the body of a definition, references to other definitions.

For now we do not want to impose any format on these marks, so in all generality we are going to read a whole file into memory and ask a configuration-provided function to return the list of additional files to scan and the list of definitions that can be found in the file content.

So "parsing" is just:

@$@<Parsing@>==@{@-
let read_whole_file file =
    let ic = Unix.(openfile file [O_RDONLY] 0 |> input_of_descr) in
    IO.read_all ic (* autoclosed *)

let rec parse file =
    PortiaLog.debug "Parsing file %s\n" file ;
    let txt = read_whole_file file in
    !PortiaConfig.find_definitions txt |>
    List.iter (fun (id, output, start, stop) ->
        PortiaDefinition.add id output file start (stop-start)) ;
    !PortiaConfig.find_inclusions txt |>
    List.iter parse
@}

That we can group, with some helper functions to be defined later, in a parse module:

@O@<portiaParse.ml@>==@{@-
open Batteries

@<Parsing@>
@<ParsingHelpers@>
@}

Also, we want to be able to attach several code fragments to the same name (see R8), with the actual expansion being composed of the concatenation of these fragments. To handle this, we will merely register several definitions with the same name, and when writing the output of a given definition we will append all bodies in order of appearance.
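As a minimal sketch of this accumulation, assume a simplified registry that stores fragment bodies directly as strings (the real registry stores locations, and the names add_fragment and body_of are hypothetical). Fragments are consed in reverse order and reversed only when the body is assembled:

```ocaml
(* Simplified stand-in for the registry: id -> fragment bodies, newest first. *)
let fragments : (string, string list) Hashtbl.t = Hashtbl.create 31

(* Register one more fragment under [id], keeping the earlier ones. *)
let add_fragment id body =
  let previous = try Hashtbl.find fragments id with Not_found -> [] in
  Hashtbl.replace fragments id (body :: previous)

(* Concatenate all fragments of [id] in order of appearance. *)
let body_of id =
  let newest_first = try Hashtbl.find fragments id with Not_found -> [] in
  String.concat "" (List.rev newest_first)

let () =
  add_fragment "Definitions" "let a = 1\n" ;
  add_fragment "Definitions" "let b = 2\n" ;
  print_string (body_of "Definitions")
```

Prepending keeps registration cheap, and the single reversal at output time restores the order in which the fragments were met.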

5.1. FunnelWeb

Now of course the real difficulty lies in the find_definitions and find_inclusions functions, which by default could be the ones we need to bootstrap (i.e. funnelweb compatible).

So let’s first implement the simpler of the two. For inclusion, funnelweb uses a very straightforward syntax: a line consisting only of @i somefilename. This simple regular expression will easily collect all such commands for us:

@$@<RegexForInclusion@>==@{"^@i +\\(.+\\) *$"@}

Which leads to this find_inclusions function:

@$@<FW_FindInclusions@>==@{@-
let find_inclusions =
    let re = Str.regexp @<RegexForInclusion@> in
    fold_all_groups (fun l p -> match l with
        | [Some (f, _, _)] -> f::p
        | _ -> assert false) [] re
@}

With the almighty fold_all_groups, folding over all groups matched in a given string:

@$@<ParsingHelpers@>==@{@-
let fold_all_groups f p re str =
    let open Str in
    let rec aux p o =
        try search_forward re str o |> ignore ;
            let rec fetch_grps n groups =
                try let g = try Some (matched_group n str,
                                      group_beginning n,
                                      group_end n)
                            with Not_found -> None in
                    fetch_grps (n+1) (g::groups)
                with Invalid_argument _ -> List.rev groups in
            let groups = fetch_grps 1 [] in
            aux (f groups p) (Str.match_end ())
        with Not_found ->
            p in
    aux p 0 |> List.rev
@}

Regarding code definitions, the regular expression is more complex but can still handle the job. We have to take greater care here since code blocks typically span several lines and regular expressions are greedy. We handle this by forbidding the ending marker (@ followed by }) inside the definition body; fortunately this marker is both improbable and short.

We end up with this regular expression:

@$@<RegexForDefinition@>==@{@-
"^@\\(\\$\\|O\\)@<\\([^@]+\\)@>\\(==\\|\\+=\\)@{\
\\(@-\n\\)?\\(\\([^@]\\|@[^}]\\)*\\)@}"@}

Here we meet another difficulty: we must be able to write strings and regular expressions that describe funnelweb special commands without causing funnelweb (or portia in funnelweb mode) to interpret them as actual commands! In other words, we must write a regular expression that does not match itself. The easy trick is to split the regular expression into several lines right in the middle of problematic token sequences.

With the corresponding find_definitions:

@$@<FW_FindDefinitions@>==@{@-
let find_definitions =
    let re = Str.regexp @<RegexForDefinition@> in
    fold_all_groups (fun l p ->
        PortiaLog.debug "found def: %a\n"
            (List.print (Option.print
                (Tuple3.print String.print Int.print Int.print))) l ;
        match l with
        | [Some (c, _, _); Some (id, _, _); _; _; Some (_, start, stop); _] ->
            (id, c = "O", start, stop) :: p
        | _ -> assert false) [] re
@}

Now to finish with our regular expressions, we must be able to spot references to other definitions from within definition bodies. Funnelweb uses a straightforward syntax for that, again relying on the unlikelihood of the (short) sequence of @ followed by < or >:

@$@<RegexForReference@>==@{@-
"\\(@<\\([^@]+\\)@>\\)"@}

With the corresponding find_references (identical to find_inclusions but with another regular expression):

@$@<FW_FindReferences@>==@{@-
let find_references =
    let re = Str.regexp @<RegexForReference@> in
    fold_all_groups (fun l p -> match l with
        | [Some (_, start, stop); Some (id, _, _)] -> (id, start, stop)::p
        | _ -> assert false) [] re
@}

This function will be used later when untangling code fragments into output files.
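To make the shape of the returned triples concrete, here is a hedged, standard-library-only sketch of a reference scanner. The names find_from and scan_references are hypothetical, and unlike the regular expression above this sketch does not reject empty identifiers or identifiers containing the control character. The start offset points at the opening marker and the stop offset is one past the closing marker, so that the whole reference can be cut out of the body:

```ocaml
(* Find the first occurrence of [sub] in [s] at or after [from], if any. *)
let find_from s from sub =
  let n = String.length s and m = String.length sub in
  let rec loop i =
    if i + m > n then None
    else if String.sub s i m = sub then Some i
    else loop (i + 1) in
  loop from

(* Return (id, start, stop) for each reference, in order of appearance. *)
let scan_references body =
  let rec loop from acc =
    match find_from body from "@<" with
    | None -> List.rev acc
    | Some start ->
      (match find_from body (start + 2) "@>" with
       | None -> List.rev acc
       | Some close ->
         let id = String.sub body (start + 2) (close - start - 2) in
         loop (close + 2) ((id, start, close + 2) :: acc)) in
  loop 0 []

let () =
  scan_references "a @<B@> c"
  |> List.iter (fun (id, start, stop) ->
       Printf.printf "%s %d-%d\n" id start stop)
```

On the body "a @<B@> c" this yields the single triple ("B", 2, 7).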

Last and least, funnelweb (and probably other literate programming preprocessors as well) uses an escape character that can be used to include its control character (@) in the source code. Thus, before outputting the code we must run a final scan to unquote all these characters, especially since we have made heavy use of this quoting mechanism in this document:

@$@<FW_Postprocess@>==@{@-
let postprocess str =
    String.nreplace ~str ~sub:"@@" ~by:"@"
@}
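For readers without Batteries at hand, the same unquoting can be sketched with the standard library only. The function name unquote is hypothetical, and the sketch scans from left to right, collapsing each doubled control character into a single one:

```ocaml
(* Replace every doubled control character by a single one. *)
let unquote str =
  let n = String.length str in
  let buf = Buffer.create n in
  let rec loop i =
    if i < n then
      if str.[i] = '@' && i + 1 < n && str.[i + 1] = '@' then (
        Buffer.add_char buf '@' ;
        loop (i + 2)
      ) else (
        Buffer.add_char buf str.[i] ;
        loop (i + 1)
      ) in
  loop 0 ;
  Buffer.contents buf

let () = print_endline (unquote "user@@example.org")
```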

Of course, all these regular expressions and substring replacements do not add up to a proper parser for funnelweb syntax, which is much richer than that. It’s enough, though, to bootstrap Portia source code, so we will leave this funnelweb module here and return to the more interesting topic of generating output files.

@O@<funnelweb.ml@>==@{@-
open Batteries
open PortiaParse

@<FW_FindInclusions@>
@<FW_FindDefinitions@>
@<FW_FindReferences@>
@<FW_Postprocess@>

let () =
    PortiaConfig.find_definitions := find_definitions ;
    PortiaConfig.find_references  := find_references ;
    PortiaConfig.find_inclusions  := find_inclusions ;
    PortiaConfig.postprocess      := postprocess
@}

6. Output

Once all definitions have been gathered we can iterate over all of those which must be written into a file, retrieve their (expanded) body then write it into that file. We will not blindly overwrite the destination file, though: we first compare the new content with the existing file and write it only if they differ. We do this to avoid needlessly touching files, which would trigger whole rebuilds each time a single compilation unit is actually modified.

@$@<Output@>==@{@-
open PortiaDefinition

let read_file filename =
  (BatFile.lines_of filename |>
   List.of_enum |>
   String.concat "\n") ^ "\n"

(* output a given definition *)
let definition filename def =
    if def.output then (
        PortiaLog.debug "Generating %s...\n%!" filename ;
        let text = expanded_body 0 def |>
                   !PortiaConfig.postprocess in
        let content_is_new = match read_file filename with
        | exception _ -> true
        | old_text -> old_text <> text in
        if content_is_new then (
          PortiaLog.debug "Writing output file %s\n" filename ;
          output_file ~filename ~text
        ) else PortiaLog.debug "Skipping same file %s\n" filename
    ) else (
        PortiaLog.debug "No output file for %s\n%!" filename
    )

(* output all registered definitions *)
let all () =
    Hashtbl.iter definition registry
@}

And that’s all we need in our Output module:

@O@<output.ml@>==@{@-
open Batteries

@<Output@>
@}

7. Configuration

We have seen so far only five parameters taken from the configuration, the first three being references to functions taking a file content as a string and returning substrings of interest:

find_definitions, that spots new definitions
find_references, that spots references to definitions in definition bodies
find_inclusions, that spots declarations of other files to parse

and the others being simpler functions to post-process or beautify the output:

postprocess, that performs whatever modification is required on the expanded code
linenum, a function to output line number indications for the compiler

So our PortiaConfig module thus far is merely:

@O@<portiaConfig.ml@>==@{@-
let find_definitions =
    ref ((fun _txt -> []) : string -> (string * bool * int * int) list)
let find_references =
    ref ((fun _txt -> []) : string -> (string * int * int) list)
let find_inclusions =
    ref ((fun _txt -> []) : string -> string list)
let postprocess =
    ref ((fun txt -> txt) : string -> string)
let linenum =
    ref ((fun _n _f -> "") : int -> string -> string)
@}

Notice that separate compilation of this module imposes that we have to declare the types of these references.

Remember from the Main module that we load the funnelweb plugin by default, so when running portia without options it will behave (loosely) like funnelweb. This plugin does not implement linenum, though, so no line number directives will be output. It would be nice if, by default, the linenum function relied on the output file name to choose from a set of predefined implementations.
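One possible shape for such a default is sketched below, purely as an assumption about how it could look: the function linenum_for is hypothetical and is not part of Portia. It picks a directive style from the extension of the output file name and falls back to emitting nothing:

```ocaml
(* Hypothetical default: pick a line directive style from the output file name. *)
let linenum_for out_file =
  match Filename.extension out_file with
  | ".ml" | ".mli" ->
    fun lineno fname -> Printf.sprintf "# %d \"%s\"\n" (lineno + 1) fname
  | ".c" | ".h" ->
    fun lineno fname -> Printf.sprintf "#line %d \"%s\"\n" (lineno + 1) fname
  | _ ->
    (* Unknown language: emit nothing, like the current default. *)
    fun _lineno _fname -> ""

let () =
  (* Line numbers are zero-based internally, hence the +1 in the output. *)
  print_string (linenum_for "main.ml" 0 "portia.fw")
```

With this approach a shell script would get no directive at all, which also sidesteps the shebang problem.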

8. Misc

We have glossed over many trivial details to get there, but the program would not be complete without those.

8.1. Log

For such a simple tool, we merely want to display debug messages or nothing at all, so the only implemented function is PortiaLog.debug which, depending on the -debug flag, either prints to stderr or does nothing:

@O@<portiaLog.ml@>==@{@-
open Batteries

let verbose = ref false

let debug fmt =
  if !verbose then
    Printf.fprintf stderr fmt
  else
    Printf.ifprintf stderr fmt
@}

8.2. Asciidoc

Last but not least, let’s provide the configuration (in the form of the extraction functions) for asciidoc documents, both as an example and because that’s the documentation format I intend to use in the future:

@O@<asciidoc.ml@>==@{@-
open Batteries
open PortiaParse

let ext_of_lang = function
    | "shell" | "bash" | "csh" -> "sh"
    | "autoconf" -> "m4"
    | "docbook" -> "xml"
    | x -> x

let find_code_definitions =
    let re = Str.regexp ("^\\.\\([^:\n]+\\)\\(:[^\n]*\\)?\n" ^
                         "\\[source,\\([^]]+\\)\\]\n" ^
                         "----\n" ^
                         "\\(\\(\\([^-\n].*\\)?\n\\)+\\)" ^
                         "----\n") in
    fold_all_groups (fun l p ->
        match l with
            | Some (id,_,_)::_::Some (lang,_,_)::Some (_def, start, stop)::_ ->
                let is_file = String.ends_with id ("." ^ ext_of_lang lang) in
                (id, is_file, start, stop) :: p
            | _ -> assert false) [] re

(*$= find_code_definitions & ~printer:dump
  [ "Foo.ml", true, 30, 37 ] \
  (find_code_definitions \
    ".Foo.ml: bar\n\\
     [source,ml]\n\\
     ----\n\\
     glop.\n\\
     \n\\
     ----\n\\
     I'm out!\n\\
     ----\n")
 [ "Foo", false, 27, 38 ] \
 (find_code_definitions \
   ".Foo: bar\n\\
    [source,ml]\n\\
    ----\n\\
    pas glop.\n\\
    \n\\
    ----\n")
*)

let find_file_content =
    let re = Str.regexp ("^\\.Content of \\([^\n]*\\)\n" ^
                         "\\[source,[^\n]*\\]\n" ^
                         "----\n" ^
                         "\\(\\(\\([^-].*\\)?\n\\)+\\)" ^
                         "----\n") in
    fold_all_groups (fun l p ->
        match l with
            | Some (id,_,_)::Some (_def, start, stop)::_ ->
              (id, true, start, stop)::p
            | _ -> assert false) [] re

let find_definitions str =
    find_code_definitions str @ find_file_content str

let find_references =
    let re = Str.regexp "\\((\\* \\.\\.\\.\\([^\n\\*]+\\)\\.\\.\\. \\*)\\)" in
    fold_all_groups (fun l p -> match l with
        | [ Some (_, start, stop); Some (id, _, _) ] -> (id, start, stop)::p
        | _ -> assert false) [] re

(* Note that we have to break up the comment mark in order not to
   confuse qtest. *)
(*$= find_references & ~printer:dump
  (find_references ("xx ("^"* ...Inventory.Make functor... *"^") yy")) \
    [ "Inventory.Make functor", 3, 37 ]
*)

let find_inclusions =
    let re = Str.regexp "^include::\\([^\\[\n]+\\)\\[\\([^\\[\n]*\\)\\]\n" in
    fold_all_groups (fun l p -> match l with
        | [ Some (f, _, _); _ ] -> f::p
        | _ -> p) [] re

let () =
    PortiaConfig.find_definitions := find_definitions ;
    PortiaConfig.find_references  := find_references ;
    PortiaConfig.find_inclusions  := find_inclusions ;
@}

Notice here find_file_content, which makes it possible to specify in the documentation some content to be copied verbatim into a file. This is handy to generate files out of band, for test fixtures for instance.

The small function ext_of_lang tries to map language names that are understood by asciidoc (actually, by GNU source-highlight, its default source code highlighter) back to file extensions. Most of the time, though, asciidoc also understands the file extension itself, so we assume that’s what’s specified in the source command to make this function shorter.

Notice that for find_references we use a format that conveniently (for OCaml programmers) looks like OCaml comments. Since those expansion points will be completely replaced by their definition it does not really matter, and they would of course work just as well regardless of the surrounding language. One may prefer to use the comment syntax of that language in order not to confuse the documentation generator when syntax-highlighting the code block. Therefore, it would be nice to allow for more comment styles here, or to pick the proper one from another configuration file…

9. TODO

A mode in which portia just outputs Makefile-compliant dependencies.

Indenting the generated source code makes column positions reported by compilers (and other tools such as annot for OCaml) different from the ones in the source document, which somewhat defeats the line number information feature. Therefore it should be possible to disable indentation altogether for more accurate position reporting.

Add a warning at the beginning of generated files that they are automatically generated and should not be edited manually.

Do not output linenum for shell because it messes with the shebang. tests/*.expected is another case of annoying linenum. Implement per-language linenum as suggested at the end of config.fw. In other words, find_definitions should return the language (the file extension is good enough).