# Prolog DCG Primer

## Introduction

A Prolog definite clause grammar (DCG) describes a list. Operationally, DCGs can be used to parse, generate and check lists.

A DCG is defined by rules. A DCG rule has the form:
```Head --> Body.
```
A rule's body consists of terminals and nonterminals. A terminal is a list, which stands for the elements it contains. A nonterminal refers to a DCG or other grammar construct, which stand for the elements they themselves describe. We use the nonterminal indicator f//N to refer to the nonterminal f with arity N. Note that // distinguishes it from a Prolog predicate indicator.

As an example of a DCG, let us describe lists whose every element is the atom a. We shall use the nonterminal as//0 to refer to such lists:
```as --> [].
as --> [a], as.
```
The first rule states: The empty list is such a list. In the second rule, we read the comma as "and then". Therefore, the second rule states: A list that contains as its first element the atom a and then only atoms a is also such a list.

To invoke a grammar rule, we use Prolog's built-in phrase/2 predicate. The first argument is, syntactically, a DCG body. phrase(Body, Ls) is true iff Body describes the list Ls.

For example, let us use the single nonterminal as//0 to ask for all lists that it describes, by posting the most general query:
```?- phrase(as, Ls).
Ls = [] ;
Ls = [a] ;
Ls = [a, a] ;
Ls = [a, a, a] ;
Ls = [a, a, a, a] ;
etc.
```
Examples of more specific queries and the system's answers:
```?- phrase(as, [a,a,a]).
true.

?- phrase(as, [b,c,d]).
false.

?- phrase(as, [a,X,a]).
X = a.
```

## Relating trees to lists

Let us now use a DCG to relate a binary tree to the in-order sequence of its node names. Let us assume a binary tree consists of leaves represented by the atom nil and inner nodes of the form node(Name, Left, Right), where Left and Right are themselves binary trees. To obtain the in-order sequence of node names, consider:
```tree_nodes(nil) --> [].
tree_nodes(node(Name, Left, Right)) -->
tree_nodes(Left),
[Name],
tree_nodes(Right).
```
Example:
```?- phrase(tree_nodes(node(a, node(b, nil,
node(c, nil, nil)),
node(d, nil, nil))), Ns).
Ns = [b, c, a, d].
```
You can obtain other orders by moving the terminal [Name] in the DCG body.

## Left recursion

Conversely, given a sequence of node names, what are the trees that yield this sequence:
```?- phrase(tree_nodes(Tree), [a,b,c,d]).
Tree = node(a, nil, node(b, nil, node(c, nil, node(d, nil, nil)))) ;
(nontermination)
```
The system yields one (correct) solution, then loops. This is because the grammar is left-recursive: We recursively refer to a nonterminal (tree_nodes//1) before anything else. To be able to use this grammar for finding all matching trees, we need to encode that for the second rule to apply, at least one list element must be available since the rule contains exactly one terminal, and we need to check this in advance to avoid unbounded recursion. We can do this by introducing two additional arguments that let us limit the number of rule applications to the given list's length:
```tree_nodes(nil, Ls, Ls) --> [].
tree_nodes(node(Name, Left, Right), [_|Ls0], Ls) -->
tree_nodes(Left, Ls0, Ls1),
[Name],
tree_nodes(Right, Ls1, Ls).
```
Each respective pair of additional arguments describes a so-called list difference. To understand this method and its associated terminology, consider for example the pair Ls0 and Ls1. We can interpret the difference of these lists as describing the list of elements that are consumed by the first recursive invocation of tree_nodes//3 in this example. Analogously, the difference of the lists Ls1 and Ls is the list of elements consumed by the second invocation. And the difference between the lists that occur in the head, i.e., that of [_|Ls0] and Ls, describes the list of elements that are consumed by the entire second rule.

Of course, difference is to be understood in a symbolic sense in these cases, not in an arithmetic sense. For example, the difference between the lists [A,B,C] and [C] is [A,B], the difference between [X,Y|Ls] and Ls is [X,Y], and the difference between Ls and Ls is []. By reasoning about such list differences via argument pairs, we can compose and decompose entire lists by reasoning about their sublists. This powerful method is in fact also the key idea that underlies the common transformation of DCGs to Prolog predicates. In cases where you are using it, consider using DCGs instead. In the present case, we are using list differences explicitly because we are using them within DCGs.

In the literature, you will also encounter the term "difference list". However, this terminology is misleading: We are not talking about—as the name may suggest—a special kind of list. The additional arguments are completely ordinary lists. It is their differences that matter especially in such cases.

When working with list differences, you may be tempted to pass around each pair as a single argument, for example as compound terms like Ls0-Ls1 or Ls0/Ls1. However, this is not advisable, primarily because you will likely run into conflicts with other predicates and DCG expansions, and secondarily because it incurs performance overhead for constructing, decomposing and managing these compound terms.

Example query:
```?- Ns = [a,b,c,d], phrase(tree_nodes(Tree, Ns, _), Ns).
Ns = [a, b, c, d],
Tree = node(a, nil, node(b, nil, node(c, nil, node(d, nil, nil)))) ;
Ns = [a, b, c, d],
Tree = node(a, nil, node(b, nil, node(d, node(c, nil, nil), nil))) ;
Ns = [a, b, c, d],
Tree = node(a, nil, node(c, node(b, nil, nil), node(d, nil, nil))) ;
etc.
```
Another option to parse with left-recursive grammars is to use your Prolog system's tabling mechanism.

## Semicontext notation

Using semicontext notation, in now outdated terminology also called pushback lists or right-hand context, you can insert list elements that were initially not in the list that is being parsed. A DCG rule of the form:
```Head, [T1,...,Tn] --> Body.
```
can be read operationally as: parse the list using Body, then prepend the terms T1, ..., Tn to the remaining list. For example:
```nt1, [b] --> [a].
nt2      --> [b].
```
The body of nt1//0 describes a list whose single element is the atom a. Operationally, after nt1//0 has consumed the atom a in a list that is being parsed, it inserts the atom b in front of the remaining list. nt2//0 describes a list whose single element is the atom b. The following query therefore succeeds, since nt2//0 consumes the atom b, which is left in the list after nt1//0 succeeds:
```?- phrase((nt1,nt2), [a]).
true.
```
We can also use nt1//0 in isolation. However, the following query fails since phrase/2 only succeeds if all list elements are consumed by the given DCG body:
```?- phrase(nt1, [a]).
false.
```
The list difference version phrase/3 shows what remains after nt1//0 succeeds:
```?- phrase(nt1, [a], Rest).
Rest = [b].
```
As expected, the atom b remains in the list.

Using semicontext notation, we can implement look ahead, which lets us inspect the next element in the list without removing it. Operationally, we first remove it and then push it back:
```look_ahead(T), [T] --> [T].
```
Example:
```?- phrase(look_ahead(T), [a], Rest).
T = a,
Rest = [a].
```

## Implicitly passing states around

Semicontext notation is also useful to implicitly pass around a state representation that is only accessed and changed by a subset of rules. For example, let us count the leaves in a binary tree with the above presentation. Without using DCGs, we can relate a tree to the number of its leaves as follows:
```num_leaves(Tree, N) :-
num_leaves_(Tree, 0, N).

num_leaves_(nil, N0, N) :- N #= N0 + 1.
num_leaves_(node(_,Left,Right), N0, N) :-
num_leaves_(Left, N0, N1),
num_leaves_(Right, N1, N).
```
Notice that in the second clause of num_leaves_/3, the accumulator is only passed through as N0, N1, ..., N and not modified directly. When you encounter such a pattern, consider using DCG notation to pass around the arguments implicitly. In this concrete case, the state we shall pass around is a single integer denoting the number of leaves encountered so far. To increment the state, we use Prolog's declarative integer arithmetic like above. To invoke a Prolog predicate from within a DCG body, we use the DCG language construct {}//1. Operationally, when the construct {Goal} is encountered in a DCG body, Goal is called as a Prolog goal. Since a DCG must always describe a list, we wrap the state into a list and thus describe a list containing a single element. The initial state is again sensibly specified as 0, and the number of leaves is given by the remaining list element after num_leaves_//1 succeeds:
```num_leaves(Tree, N) :-
phrase(num_leaves_(Tree), , [N]).

num_leaves_(nil), [N] --> [N0], { N #= N0 + 1 }.
num_leaves_(node(_,Left,Right)) -->
num_leaves_(Left),
num_leaves_(Right).
```
Notice that the second rule of num_leaves_//1 makes no reference at all to the state, since the number of leaves is not modified when an inner node is processed.

Example query:
```?- num_leaves(node(a,node(b,nil,nil),
node(c,nil,
node(d,nil,nil))), N).
N = 5.
```
The following nonterminals are useful for describing states with DCGs:
```state(S), [S] --> [S].

state(S0, S), [S] --> [S0].```
Note: Code in grey boxes may be useful for you to reuse in your own projects.

The nonterminal state(S) can be read as: "The current state is S". The nonterminal state(S0, S) can be read as: "The current state is S0, and henceforth it is S".

Using state//2, you can write num_leaves_//1 as:
```num_leaves_(nil) --> state(N0, N), { N #= N0 + 1 }.
num_leaves_(node(_,Left,Right)) -->
num_leaves_(Left),
num_leaves_(Right).
```
This approach is readily generalized to multiple states that are threaded through: You can for example use a term s(S1,S2,...,Sn) in such cases.

## Describing lists with list//1

We now consider an important nonterminal that is useful in many cases. We shall call it list//1 and use its single argument to refer to the list that it describes:
```list([])     --> [].
list([L|Ls]) --> [L], list(Ls).```
We can use this nonterminal to embed a given list as a subsequence of a longer list that is being described:
```?- phrase((list([a,b]),list([c,d])), Ls).
Ls = [a, b, c, d].
```
Thus, list//1 is useful to describe the concatenation of lists. In this concrete case, list//1 was actually not needed, because we could have simply used:
```?- phrase(([a,b],[c,d]), Ls).
Ls = [a, b, c, d].
```
However, using list//1, we can readily define a much more general nonterminal that describes the concatenation of an arbitrary number of lists:
```concatenation([]) --> [].
concatenation([List|Lists]) -->
list(List),
concatenation(Lists).
```
Note the declarative name, which makes sense in all directions: It is the concatenation of lists that is being described. An imperative name like concat would heavily imply only a single direction, and we avoid imperative names for this reason. Sample use:
```?- phrase(concatenation([[a,b],[c,d],[e,f]]), Ls).
Ls = [a, b, c, d, e, f].
```
Moreover, like all pure relations, list//1 can be used in other directions too. For example, we can use it to refer to subsequences at specific positions:
```?- phrase(([alpha],list(Mid),[omega]), [alpha,beta,gamma,omega]).
Mid = [beta, gamma] ;
false.
```
Note that in cases where we use DCGs only to generate lists, you might be tempted to define list//1 as follows:
```badlist(Ls) --> Ls.
```
This still works as expected in cases like:
```?- phrase(badlist([a,b,c]), Ls).
Ls = [a, b, c].
```
However, it is still not a good idea for two reasons: First, it cannot be used in the other direction and is thus not a pure relation:
```?- phrase(badlist(Ls), [a,b,c]).
ERROR: call_dcg/3: Arguments are not sufficiently instantiated
```
Second, such a simplistic definition can be tricked into invoking arbitrary code by specifying {Goal} as the argument, as in:
```?- phrase(badlist({halt}), _).
```
For these two reasons, you should use the definition of list//1 shown above to refer to arbitrary lists with DCGs.

In SWI-Prolog, DCGs can be transparently applied to files using Ulrich Neumerkel's library(pio). When you need to process input from files, first write a DCG that describes the input. This lets you test the DCG in a pure way using phrase/2, and then apply the DCG to files too by simply using phrase_from_file/2 instead of phrase/2.

In the following examples, we assume that the Prolog flag double_quotes is set to chars with the following query:
```?- set_prolog_flag(double_quotes, chars).
true.
```
or that you use the following directive in your initialisation file or program:
`:- set_prolog_flag(double_quotes, chars).`
We recommend using this directive for better readability of answers in the following examples and also in other cases.

In our first example, we read a whole file to a list of characters. This is very easy to do with list//1 as defined above. For example, to read the contents of the file dcg.html to a list of characters, we use:
```?- phrase_from_file(list(Chars), 'dcg.html').
Chars = [<, h, t, m, l, >, '\n', ' ', ' '|...] ;
false.
```
As another example, consider the following DCG that uses list//1 to refer to a subsequence of characters:
```like(What) --> "I like ", list(What), ".", list(_).
```
Like before, we can use this DCG to parse a given list of characters:
```?- phrase(like(What), "I like it. Anything can follow!").
What = [i, t] ;
false.
```
As expected, What is unified with the atoms i and t.

Moreover, using library(pio) and its phrase_from_file/2, we can transparently parse from a file with the same DCG. Assume that the file like.txt starts with the string "I like it."
```?- phrase_from_file(like(What), 'like.txt').
What = [i, t] ;
false.
```
Again, What is unified with the atoms i and t.

library(pio) uses memory efficiently when parsing files: Parts that are no longer relevant do not consume additional memory. For this reason, library(pio) is well-suited also for parsing large files.

## Parsing tricks and techniques

When parsing, we often want to consume tokens greedily, meaning that the longest match should be given priority. We can do this by ordering the DCG rules accordingly. For example, the following DCG describes a list of whitespace characters in such a way that the longest such sequence is the first solution:
```ws --> [W], { char_type(W, space) }, ws.
ws --> [].
```
Identifiers that must start with a letter and may contain either letters or digits can be parsed by applying this technique:
```identifier([A|As]) --> [A], { char_type(A, alpha) }, symbol_r(As).

symbol_r([A|As]) --> [A], { char_type(A, alnum) }, symbol_r(As).
symbol_r([])     --> [].
```
It is sometimes useful to refer to "anything at all", and we can do this with the nonterminal ...//0 which is equivalent to list(_):
`... --> [] | [_], ... .`
The grammar construct ('|')//2 and its equivalent (;)//2 correspond to the Prolog control structure (;)/2 and are read as "or". When parsing with DCGs, using ('|')/2 is preferable over (;)//2 due to existing conventions for writing grammars.

Using this definition of ...//0, we can write the earlier example equivalently as:
```?- phrase(("I like ",list(What),".",...), "I like it. Anything can follow!").
What = [i, t] ;
false.
```
As another technique that is sometimes useful for parsing, here are DCG rules that describe a list of lines that are terminated by line feed or end of file:
```lines([])     --> call(eos), !.
lines([L|Ls]) --> line(L), lines(Ls).

line([])     --> ( "\n" | call(eos) ), !.
line([C|Cs]) --> [C], line(Cs).

eos([], []).```
The nonterminal call//1 is used to call a Prolog predicate in such a way that the implicit DCG arguments are made available as explicit arguments. We need this to describe the condition "end of string". The grammar construct !//0 is used like the corresponding Prolog control structure.

You can use these definitions to obtain the contents of a file as a list of lines for further reasoning:
```?- phrase_from_file(lines(Ls), 'like.txt').
Ls = [['I', ' ', l, i, k, e, ' ', i|...]].
```
Note: eos/2 is comparatively rarely needed. In the above case it is used because anything else is the beginning of a new line. Typically, you can more directly express what you want to describe, and phrase/2 and phrase_from_file/2 only succeed if the complete list or file matches the description. Therefore, it is typically not necessary to explicitly refer to "end of the string".

## Describing output

The following definitions are useful for describing more complex output with DCGs. They define the nonterminal format_//2, which you can use in DCGs like the well-known predicate format/2. However, instead of writing output to the terminal, the DCG nonterminal format_//2 declaratively describes a list of characters:
```format_(Format, Args) --> call(format_chars(Format,Args)).

format_chars(Format, Args, Cs0, Cs) :-
format(chars(Cs0,Cs), Format, Args).```
Using these definitions, we can for example describe a table with a DCG:
```table -->
row([this,is,a]),
row([very,nice,table]).

row(Ls) --> format_("~t~w~7+~t~w~7+~t~w~7+~n", Ls).
```
Sample query and its results:
```?- phrase(table, Ts).
Ts = [' ', ' ', ' ', t, h, i, s, ' ', ' '|...].
```
An important advantage of this declarative way to describe output is that you can write test cases and reason about it: It is quite hard to reason about output that only appears on the system terminal, with no other tangible manifestation. In contrast, it is easy to state properties that must hold about a list of characters.

When needed, we can show the table on the terminal with a single call of format/2:
```?- phrase(table, Ts), format("~s", [Ts]).
this     is      a
very   nice  table
Ts = [' ', ' ', ' ', t, h, i, s, ' ', ' '|...].
```

## Implementation

To see how DCGs are internally implemented in SWI-Prolog, you can use listing//1. For example, to see the actual source code for num_leaves_//1:
```?- listing(num_leaves_//1).
num_leaves_(nil, A, D) :-
state(C, B, A, E),
B#=C+1,
D=E.
num_leaves_(node(_, A, C), B, E) :-
num_leaves_(A, B, D),
num_leaves_(C, D, E).
```
We see that internally, the two DCG rules of num_leaves_//1 were translated into Prolog rules with two additional arguments, following mechanical rewriting steps. The translation of DCGs to Prolog code is done by term_expansion/2, a mechanism analogous to macros in other languages.

For portability, it is best not to rely on a particular expansion method. Instead, use semicontext notation to refer to states and always use the phrase/[2,3] interface to invoke a DCG. This way, a Prolog system is free to apply a different expansion method that may be more efficient. For example, the last two arguments may as well be swapped and allow more efficient predicate calls this way due to specifics of a particular virtual machine.

## Using DCGs

Consider using DCGs if you are:
• describing a list
• passing around a state representation that only a few predicates actually use or modify.
In every serious Prolog program, you will find many opportunities to use DCGs due to some subset of the above reasons.

Let us study one side-by-side comparison of a regular Prolog predicate and an equivalent DCG nonterminal to recall several important aspects of using DCGs and make some additional observations. We take the well-known Prolog predicate include/3 as an example:
 regular Prolog Prolog DCG ```include(Goal, List, Included) :- include_(List, Goal, Included). include_([], _, []). include_([L|Ls], Goal, Included0) :- ( call(Goal, L) -> Included0 = [L|Included] ; Included0 = Included ), include_(Ls, Goal, Included). ``` ```include(Goal, List, Included) :- phrase(include_(List, Goal), Included). include_([], _) --> []. include_([L|Ls], Goal) --> ( { call(Goal, L) } -> [L] ; [] ), include_(Ls, Goal). ```
Usage example for both versions:
```?- include(integer, [a,b,c,1,2,e], Is).
Is = [1, 2].
```
We recall and note in particular:
• phrase/2 must of course be used to refer to a DCG nonterminal within a regular Prolog predicate
• conversely, {}//1 is used to call a regular Prolog goal within a DCG
• the DCG version needs fewer explicit arguments, making it slightly easier to read and understand
• although it is strongly related to the previous point, it is still worth mentioning explicitly that the DCG version also refers to fewer variables
• using a DCG makes immediately clear that a list is being described.
Take your time and see for yourself which of the two versions you find easier to read and understand.

Some applications of DCGs that you can download from this site and study:
• markov.pl uses DCGs to describe output of a Markov chain
• Lisprolog uses DCGs to parse Lisp expressions and to implicitly thread through the state of the environment
• Tist uses DCGs for all of the above reasons throughout the document
• The comp.lang.prolog FAQ are described with a DCG.

## Current developments

DCGs are currently being considered for inclusion in the Prolog ISO standard.