# Prolog Data Structures

## Terms

In Prolog, all data are represented by Prolog terms.

Each term is either a variable, an atomic term or a compound term:
• variables start with an uppercase letter or with an underscore (_). A single underscore denotes an anonymous variable and can be read as "any term". For example, X, Y, _爱 and Prolog are variables.
• atomic terms are:
• atoms, such as x, test and 'quotes and space'
• integers, such as 42
• floating point numbers
• depending on the Prolog system, there are also other kinds of atomic terms, such as complex numbers and rational numbers.
• compound terms are defined inductively as follows: If T1, T2, ..., TN are terms, then F(T1, T2, ..., TN) is also a term, where F is called a functor name and adheres to the same syntax rules as atoms. F/N is called the principal functor of the compound term, and N is called the arity. Examples: f(a), g(f(X)) and +(a, f(X)).
All data can be represented in this way. Terms come into existence by simply writing them down.

Prolog is dynamically typed and allows us great freedom for representing data. For example, we could represent natural numbers as follows:
• we could use the atom zero to represent 0
• we could use the compound term s(X) to represent the successor of X.
In this representation, the term s(s(s(zero))) represents the number 3. Such a representation is of course highly impractical due to its enormous size for large numbers. For this reason, Prolog provides integer arithmetic with a much more efficient number representation.

Prolog terms naturally correspond to trees. There is a standard order on terms.

 Video: ## Lists

Prolog lists are a special case of terms.

Lists are defined inductively:
• the atom [] is a list, denoting the empty list
• if Ls is a list, then the term '.'(L, Ls) is also a list.
There is a special syntax for denoting lists conveniently in Prolog:
1. The list '.'(a, '.'(b, '.'(c, []))) can also be written as [a,b,c].
2. The term '.'(L, Ls) can also be written as [L|Ls].
These notations can be combined in any way. For example, the term [a,b|Ls] is a list iff Ls is a list.

Lists naturally represent collections of elements and arise in almost all Prolog programs. When choosing between lists and any other way to represent collections of elements, ask yourself the following questions:
• Can there be arbitrarily many elements?
• Can there be zero elements?
• Is the order of elements significant?
• Are the elements of the same kind?
If your answer is "yes" to most of these questions, lists are often a good fit.

In contrast, consider for example the case where each collection comprises exactly three elements. You can use lists of the form [A,B,C] to represent such collections, and thus benefit from meta-predicates that reason over lists. On the other hand, using compound terms of the form f(A,B,C) is more memory efficient: This is because the list [A,B,C] is the compound term '.'(A,'.'(B,'.'(C,[]))) and thus takes roughly twice the space of the term f(A,B,C) in memory.

DCG notation is frequently used for describing lists.

 Video: ## Pairs

By convention, the functor (-)/2 is often used to denote pairs of elements in Prolog. For example, the term -(A, B) denotes the pair of elements A and B. In Prolog, (-)/2 is defined as an infix operator. Therefore, the term can be written equivalently as A-B.

Many commonly available predicates also use this syntax to denote pairs. Examples of this are keysort/2 and pairs_keys_values/3.

## Association lists

In many Prolog systems, association lists are available to allow faster than linear access to a collection of elements. These association lists are typically based on balanced trees like AVL trees.

There is a public domain library called library(assoc) that ships with many Prolog systems and provides operations for inserting, fetching and changing elements of a collection.

Here is one way to represent an AVL tree in Prolog:
1. the empty tree is represented by the atom t.
2. an inner node is represented as a compound term of the form t(Key, Value, Balance, Left, Right) where:
• Key and Value represent the association of Key with Value.
• Balance is an atom that denotes a balance criterion, such as a relation between the number of children of the left and right subtrees. For example, we can use the atoms <, = and > to denote different states between the subtrees.
• Left and Right are AVL trees.
Every operation on AVL trees, such as adding or changing an association, can be described as a relation between different trees, one tree before the operation and one after it. See Thinking in States for more information.

AVL trees let us perform many important operations in O(log(N)) time, where N is the number of associations. This is acceptably efficient in many cases.

## Strings?

Almost everything that needs to be said about strings has been brilliantly said by Richard O'Keefe in his Prolog Library Proposal. In particular, strings are wrong, and:
For almost any use case that involves some kind of processing, the only sensible thing to do with string data is to turn it into some kind of tree early. Prolog is brilliant with trees.
In Prolog, a convenient representation of strings is available as lists of characters, which are atoms. This representation is available if you add the following directive to your program:

`:- set_prolog_flag(double_quotes, chars).`

With this setting, we obtain for example:
```?- Cs = "test".
Cs = [t, e, s, t].
```
Thus, working with strings is reduced to working with lists, which can be easily handled in Prolog.

## Arrays?

What about arrays? Is there any data structure in Prolog to represent a collection of terms, allowing for O(1) access to individual elements?

First of all, the concept of destructive modifications is alien to logic programming. In Prolog, we describe relations between entities, not destructive effects. To express a change in a Prolog data structure, we define a predicate that relates the state of the structure before the change to a different structure after the change. For this reason, pure modifications often entail some copying of data and typically lead to at least logarithmic overhead (for example, to copy a subtree of a balanced tree).

That being said, we can access the arguments of a term in O(1) with the built-in predicate arg/3. There even is an impure predicate called setarg/3 which allows destructive modifications to a term. Use it at your own peril: If you use setarg/3, you can no longer reason about your code in the way you expect from pure relations. If you really need efficient destructive modifications of terms, it is better to use attributed variables.

## Type tests

There are predicates to test for specific types of terms.

Unfortunately, the standard predicates for type testing (atom/1, integer/1, compound/1 etc.) are logically flawed because they are not monotonic: If you use these predicates, then generalizing a query may lead to fewer solutions, preventing declarative debugging based on logical properties. For example, atom/1 fails for the most general query, even though it succeeds for more specific queries:
```?-        atom(X).
false.

?- X = a, atom(X).
X = a.
```
Therefore, new families of predicates for type testing are now becoming available. They implement type tests with desirable logical properties:
• the ..._si/1 family of predicates which are like the standard predicates if the arguments are sufficiently instantiated
• must_be/2, used for input arguments
• can_be/2, used for output arguments and general arguments.

 Video: The predicates must_be/2 and can_be/2 are especially useful for Prolog library authors. The ..._si/1 family of type tests are useful in normal Prolog programs, if you want to explicitly test for specific types without raising type errors.

Importantly, all these predicates raise instantiation errors if the terms that are tested are not sufficiently instantianted to allow a sound decision.

## Clean vs. defaulty representations

When representing data with Prolog terms, ask yourself the following question:
Can I distinguish the kind of each component from its principal functor?
If this holds, your representation is called clean. If you cannot distinguish the elements by their principal functor, your representation is called defaulty, a wordplay combining "default" and "faulty". This is because reasoning about your data will need a "default case", which is applied if everything else fails. In addition, such a representation prevents argument indexing, and is considered faulty due to this shortcoming. Always aim to avoid defaulty representations! Aim for cleaner representations instead.

For example, suppose you represent a full binary tree in Prolog. There are two kinds of elements in a full binary tree:
• leaves, which are concrete elements of the tree
• nodes, which have two children that are again full binary trees.
We can represent such trees with Prolog terms as follows:
• leaf(L) represents the leaf L
• node(Left, Right) represents a node and its two children.
The above is a clean representation: It lets us distinguish the kinds of elements by their principal functor. We can reason about such trees in a way that keeps the code very general and usable in all directions. Such a representation is also amenable to argument indexing.

You can also recognize a clean representation if it lets you describe the general structure of your data while retaining enough flexibility to keep the concrete elements unspecified. For example, in the case of full binary trees, we can represent the general outline of all trees with exactly two leaves as node(leaf(_),leaf(_)). If you had chosen to omit the leaf/1 wrapper for representing leaves, this would become node(_,_): a defaulty representation that no longer represents precisely such manifestations.

Sometimes, it will appear to you that there is no clean way out. For instance, you may be faced with a representation that requires you to distinguish different cases in an impure and non-monotonic way, such as by testing the instantiation of certain arguments via var/1. In such cases, it is good practice to restrict the impure parts of your program to small fragments, and to convert any defaulty representation to a clean one by introducing suitable wrappers that let you distinguish the cases by pattern matching. This helps to ensure that you can use your core predicates in multiple directions.

## Term inspection

In Prolog, the most natural way to reason about terms is to rely on unification. In addition, there are several predicates that let you decompose and analyze terms. The most important of these term inspection predicates are functor/3, arg/3 and (=..)/2.

Here are examples that illustrate their usage:
```?- functor(f(a,g(X)), Functor, Arity).
Functor = f,
Arity = 2.

?- functor(Term, f, 2).
Term = f(_25002, _25004).

?- arg(2, f(a,g(X)), Arg).
Arg = g(X).

?- f(a,g(X)) =.. [Functor|Args].
Functor = f,
Args = [a, g(X)].

?- Term =.. [f,a,g(X)].
Term = f(a, g(X)).
```
These predicates cannot be defined by a finite set of clauses, and can therefore be considered higher-order predicates. Do not get carried away with these predicates! Everything that can be expressed by pattern matching should be expressed by pattern matching. For example, instead of functor(Term, f, 2), you can simply write Term = f(_,_).

Every Prolog term has a canonical representation. If you are ever unsure about the structure of a term, use write_canonical/1 to obtain the canonical representation. For example:
```?- write_canonical(a+b=[x,y,z]).
=(+(a,b),'.'(x,'.'(y,'.'(z,[]))))
```
The canonical representation shows the structure of the term in such a way that it can also be easily parsed by external programs.