Σ is a ranked alphabet (i.e., an alphabet whose symbols have an associated arity) disjoint from N,
Z is the starting nonterminal, with Z ∈ N, and
P is a finite set of productions of the form A → t, with A ∈ N, and t ∈ TΣ(N), where TΣ(N) is the associated term algebra, i.e. the set of all trees composed from symbols in Σ ∪ N according to their arities, where nonterminals are considered nullary.
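As a minimal sketch of this definition, the components (N, Σ, Z, P) can be held in a small Python record. The class name RegularTreeGrammar, its field names, the tuple encoding of trees, and the toy grammar at the end are illustrative assumptions, not part of the definition.

```python
from dataclasses import dataclass

# Encoding assumed for this sketch: a term in TΣ(N) is either a nonterminal
# name (a str) or a tuple (symbol, child_1, ..., child_n) headed by a symbol of Σ.


@dataclass
class RegularTreeGrammar:
    nonterminals: frozenset  # N
    arity: dict              # Σ, as a map from symbol to arity
    start: str               # Z
    productions: list        # P, as pairs (A, t)

    def is_term(self, t):
        """Check t ∈ TΣ(N): nonterminals are leaves, symbols respect their arity."""
        if isinstance(t, str):
            return t in self.nonterminals
        symbol, *children = t
        return (self.arity.get(symbol) == len(children)
                and all(self.is_term(c) for c in children))

    def is_well_formed(self):
        """Check Z ∈ N, Σ disjoint from N, and every production A → t valid."""
        return (self.start in self.nonterminals
                and self.nonterminals.isdisjoint(self.arity)
                and all(a in self.nonterminals and self.is_term(t)
                        for a, t in self.productions))


# Hypothetical toy grammar: A → a, A → f(A, A)
toy = RegularTreeGrammar(
    nonterminals=frozenset({"A"}),
    arity={"a": 0, "f": 2},
    start="A",
    productions=[("A", ("a",)), ("A", ("f", "A", "A"))],
)
```

Storing Σ as a symbol-to-arity map keeps the arity check a single dictionary lookup.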
Derivation of trees
The grammar G implicitly defines a set of trees: any tree that can be derived from Z using the rule set P is said to be described by G.
This set of trees is known as the language of G.
More formally, the relation ⇒G on the set TΣ(N) is defined as follows:
A tree t1 ∈ TΣ(N) can be derived in a single step into a tree t2 ∈ TΣ(N) (in short: t1 ⇒G t2), if there is a context S and a production (A → t) ∈ P such that:
t1 = S[A], and
t2 = S[t].
Here, a context means a tree with exactly one hole in it; if S is such a context, S[t] denotes the result of filling the tree t into the hole of S.
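The single-step relation can be sketched as a generator that tries every position of a nonterminal as the hole of a context. The function name one_step, the tuple encoding of trees (a str for a nonterminal, a tuple (symbol, children...) otherwise), and the toy productions are assumptions made for illustration.

```python
def one_step(t, productions):
    """Yield every t2 with t ⇒G t2, one per (hole position, production) pair."""
    if isinstance(t, str):
        # t is a nonterminal A: the context is just the hole, so S[A] = A.
        for lhs, rhs in productions:
            if lhs == t:
                yield rhs
        return
    symbol, *children = t
    for i, child in enumerate(children):
        # Otherwise the hole lies somewhere inside exactly one child.
        for new_child in one_step(child, productions):
            yield (symbol, *children[:i], new_child, *children[i + 1:])


# Hypothetical toy productions: A → a, A → f(A, A)
P_toy = [("A", ("a",)), ("A", ("f", "A", "A"))]
successors = list(one_step(("f", "A", ("a",)), P_toy))
```

Here successors holds the two trees f(a, a) and f(f(A, A), a), since the only nonterminal occurrence in f(A, a) is its first child.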
The tree language generated by G is the language L(G) = { t ∈ TΣ | Z ⇒G* t }.
Here, TΣ denotes the set of all trees composed from symbols of Σ, while ⇒G* denotes zero or more successive applications of ⇒G, i.e. its reflexive and transitive closure.
A language generated by some regular tree grammar is called a regular tree language.
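Because every derivation step replaces one nonterminal leaf by a tree with at least one node, tree size never shrinks along a derivation, so the finite part of L(G) below a size bound can be enumerated by exhaustive search. The following is a sketch under that observation; the names language_up_to, size and is_ground, the tuple encoding of trees, and the toy productions are assumptions.

```python
from collections import deque


def size(t):
    """Number of nodes; a nonterminal (str) counts as one leaf."""
    return 1 if isinstance(t, str) else 1 + sum(size(c) for c in t[1:])


def is_ground(t):
    """True if t contains no nonterminal, i.e. t ∈ TΣ."""
    return not isinstance(t, str) and all(is_ground(c) for c in t[1:])


def one_step(t, productions):
    """Yield every t2 with t ⇒G t2."""
    if isinstance(t, str):
        yield from (rhs for lhs, rhs in productions if lhs == t)
        return
    symbol, *children = t
    for i, child in enumerate(children):
        for new_child in one_step(child, productions):
            yield (symbol, *children[:i], new_child, *children[i + 1:])


def language_up_to(start, productions, max_nodes):
    """Return all t ∈ L(G) with at most max_nodes nodes (breadth-first search)."""
    seen, queue, result = {start}, deque([start]), set()
    while queue:
        t = queue.popleft()
        if is_ground(t):
            result.add(t)
            continue
        for t2 in one_step(t, productions):
            # Larger intermediate trees can never lead back to a small result.
            if size(t2) <= max_nodes and t2 not in seen:
                seen.add(t2)
                queue.append(t2)
    return result


# Hypothetical toy productions: A → a, A → f(A, A)
P_toy = [("A", ("a",)), ("A", ("f", "A", "A"))]
small = language_up_to("A", P_toy, max_nodes=3)
```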
Examples
Let G1 = (N1,Σ1,Z1,P1), where
N1 = {Bool, BList } is our set of nonterminals,
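Σ1 = { true, false, nil, cons(.,.) } is our ranked alphabet, arities indicated by dummy arguments (i.e. the symbol cons has arity 2),
Z1 = BList is our starting nonterminal, and
the set P1 consists of the following productions:
Bool → false
Bool → true
BList → nil
BList → cons(Bool,BList)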
The image shows the corresponding derivation tree; it is a tree of trees (main picture), whereas a derivation tree in word grammars is a tree of strings (upper left table).
The tree language generated by G1 is the set of all finite lists of boolean values. Note that L(G1) is a proper subset of TΣ1: trees such as true or cons(nil,nil) belong to TΣ1 but are not lists of boolean values.
The grammar G1 corresponds to the following algebraic data type declarations in the Standard ML programming language:
  datatype Bool = false | true
  datatype BList = nil | cons of Bool * BList
Every member of L(G1) corresponds to a Standard-ML value of type BList.
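Checking whether a tree belongs to L(G1) plays the role of a type check against BList. The sketch below assumes that P1 consists of the four productions Bool → true, Bool → false, BList → nil and BList → cons(Bool,BList); the function names matches and derives and the tuple encoding of trees are likewise assumptions. It also assumes the grammar has no cycle of productions of the form A → B, which would make the recursion loop.

```python
# Assumed production set P1, as pairs (A, t). A nonterminal is a str,
# any other tree is a tuple (symbol, child_1, ..., child_n).
P1 = [
    ("Bool", ("true",)),
    ("Bool", ("false",)),
    ("BList", ("nil",)),
    ("BList", ("cons", "Bool", "BList")),
]


def matches(pattern, tree, productions):
    """True if the ground tree can be derived from the right-hand side pattern."""
    if isinstance(pattern, str):
        return derives(pattern, tree, productions)
    return (pattern[0] == tree[0]
            and len(pattern) == len(tree)
            and all(matches(p, c, productions)
                    for p, c in zip(pattern[1:], tree[1:])))


def derives(nonterminal, tree, productions):
    """True if nonterminal ⇒G* tree, for a ground tree."""
    return any(lhs == nonterminal and matches(rhs, tree, productions)
               for lhs, rhs in productions)


a_list = ("cons", ("false",), ("cons", ("true",), ("nil",)))
not_a_list = ("cons", ("nil",), ("true",))

in_language = derives("BList", a_list, P1)
ill_typed = derives("BList", not_a_list, P1)
```

With these assumptions, in_language is True and ill_typed is False: cons(nil,true) is a tree over Σ1 but not a value of type BList.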
For another example, let G2 = (N1 ∪ {BList1}, Σ1, BList1, P1 ∪ P2), using the alphabet from above, adding the new nonterminal BList1 as the starting nonterminal, and extending the production set by P2, consisting of the following productions:
BList1 → cons(true,BList)
BList1 → cons(false,BList1)
The language L(G2) is the set of all finite lists of boolean values that contain true at least once. The set L(G2) has no datatype counterpart in Standard ML, nor in any other functional language.
It is a proper subset of L(G1).
The above example term happens to be in L(G2), too, as the following derivation shows:
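Membership in L(G2) can also be checked mechanically. The sketch below assumes the four productions Bool → true, Bool → false, BList → nil and BList → cons(Bool,BList) for P1, uses the two productions of P2 given above, and compares membership with the informal description "contains true at least once" for all lists of length up to 3. The helper names blist, matches and derives and the tuple encoding of trees are illustrative assumptions.

```python
from itertools import product

P1 = [  # assumed productions of G1
    ("Bool", ("true",)), ("Bool", ("false",)),
    ("BList", ("nil",)), ("BList", ("cons", "Bool", "BList")),
]
P2 = [  # the two productions added for G2
    ("BList1", ("cons", ("true",), "BList")),
    ("BList1", ("cons", ("false",), "BList1")),
]
P = P1 + P2


def matches(pattern, tree, productions):
    """True if the ground tree can be derived from the pattern."""
    if isinstance(pattern, str):  # nonterminal
        return derives(pattern, tree, productions)
    return (pattern[0] == tree[0] and len(pattern) == len(tree)
            and all(matches(p, c, productions)
                    for p, c in zip(pattern[1:], tree[1:])))


def derives(nonterminal, tree, productions):
    """True if nonterminal ⇒G* tree, for a ground tree."""
    return any(lhs == nonterminal and matches(rhs, tree, productions)
               for lhs, rhs in productions)


def blist(values):
    """Encode a Python sequence of bools as a cons/nil tree."""
    t = ("nil",)
    for v in reversed(values):
        t = ("cons", ("true",) if v else ("false",), t)
    return t


samples = [v for n in range(4) for v in product([False, True], repeat=n)]
all_in_g1 = all(derives("BList", blist(v), P) for v in samples)
g2_is_contains_true = all(derives("BList1", blist(v), P) == any(v)
                          for v in samples)
example_in_g2 = derives("BList1", blist([False, True]), P)
```

The term cons(false,cons(true,nil)) tested in example_in_g2 is chosen here only for illustration.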
If L1, L2 both are regular tree languages, then the tree sets L1 ∩ L2, L1 ∪ L2, and L1 \ L2 are also regular tree languages, and it is decidable whether L1 ⊆ L2, and whether L1 = L2.
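Of these closure properties, union is the easiest to see at the grammar level: a fresh start nonterminal with one production to each original start nonterminal suffices, since nonterminals are themselves (nullary) trees in TΣ(N). The sketch below shows only this case; intersection and difference are usually obtained through tree automata and are not shown. The function name union, the fresh name Z_union, and the two toy grammars are assumptions.

```python
def union(start1, prods1, start2, prods2, fresh="Z_union"):
    """Grammar for L1 ∪ L2. Assumes the two nonterminal sets are disjoint
    and that the name `fresh` occurs in neither grammar."""
    return fresh, [(fresh, start1), (fresh, start2)] + prods1 + prods2


def matches(pattern, tree, productions):
    """True if the ground tree can be derived from the pattern."""
    if isinstance(pattern, str):  # nonterminal
        return derives(pattern, tree, productions)
    return (pattern[0] == tree[0] and len(pattern) == len(tree)
            and all(matches(p, c, productions)
                    for p, c in zip(pattern[1:], tree[1:])))


def derives(nonterminal, tree, productions):
    """True if nonterminal ⇒G* tree, for a ground tree."""
    return any(lhs == nonterminal and matches(rhs, tree, productions)
               for lhs, rhs in productions)


# Hypothetical toy grammars: lists of only true, and lists of only false.
TRUES = [("T", ("nil",)), ("T", ("cons", ("true",), "T"))]
FALSES = [("F", ("nil",)), ("F", ("cons", ("false",), "F"))]

start, prods = union("T", TRUES, "F", FALSES)

only_true = ("cons", ("true",), ("nil",))
only_false = ("cons", ("false",), ("nil",))
mixed = ("cons", ("true",), ("cons", ("false",), ("nil",)))
results = [derives(start, t, prods) for t in (only_true, only_false, mixed)]
```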
Alternative characterizations and relation to other formal languages
Further reading
Regular tree grammars were already described in 1968 by:
Thatcher, J.W.; Wright, J.B. (1968). "Generalized Finite Automata Theory with an Application to a Decision Problem of Second-Order Logic". Mathematical Systems Theory. 2 (1): 57–81. doi:10.1007/BF01691346. S2CID 31513761.
A book devoted to tree grammars is: Nivat, Maurice; Podelski, Andreas (1992). Tree Automata and Languages. Studies in Computer Science and Artificial Intelligence. Vol. 10. North-Holland.
Algorithms on regular tree grammars are discussed from an efficiency-oriented view in: Aiken, A.; Murphy, B. (1991). "Implementing Regular Tree Expressions". ACM Conference on Functional Programming Languages and Computer Architecture. pp. 427–447. CiteSeerX 10.1.1.39.3766.
Given a mapping from trees to weights, Donald Knuth's generalization of Dijkstra's shortest-path algorithm can be applied to a regular tree grammar to compute for each nonterminal the minimum weight of a derivable tree. Based on this information, it is straightforward to enumerate its language in increasing weight order. In particular, any nonterminal with infinite minimum weight produces the empty language. See: Knuth, D.E. (1977). "A Generalization of Dijkstra's Algorithm". Information Processing Letters. 6 (1): 1–5. doi:10.1016/0020-0190(77)90002-3.
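The minimum-weight computation can be sketched as follows, in the greedy style of Knuth's generalization: the nonterminal with the smallest tentative weight is finalized in each round. The sketch assumes that the weight of a tree is the sum of non-negative weights of its symbols (by default 1 per node), and it recomputes tentative values naively instead of using a priority queue. The names min_weights and weight_of, the tuple encoding of trees, the example productions, and the extra nonterminal Loop are assumptions.

```python
import math


def weight_of(rhs, best, symbol_weight):
    """Lightest weight obtainable from rhs, given the finalized minima."""
    if isinstance(rhs, str):  # nonterminal
        return best[rhs]
    return symbol_weight(rhs[0]) + sum(weight_of(c, best, symbol_weight)
                                       for c in rhs[1:])


def min_weights(nonterminals, productions, symbol_weight=lambda s: 1):
    """Map each nonterminal to the minimum weight of a derivable tree."""
    best = {a: math.inf for a in nonterminals}  # inf = nothing derivable so far
    done = set()
    while True:
        # Tentative weights of unfinished nonterminals, via finalized ones only.
        candidates = {}
        for lhs, rhs in productions:
            if lhs not in done:
                w = weight_of(rhs, best, symbol_weight)
                if w < candidates.get(lhs, math.inf):
                    candidates[lhs] = w
        if not candidates:
            return best
        a = min(candidates, key=candidates.get)  # greedy choice, as in Dijkstra
        best[a] = candidates[a]
        done.add(a)


# Assumed example productions; Loop is a hypothetical unproductive nonterminal.
productions = [
    ("Bool", ("true",)), ("Bool", ("false",)),
    ("BList", ("nil",)), ("BList", ("cons", "Bool", "BList")),
    ("BList1", ("cons", ("true",), "BList")),
    ("BList1", ("cons", ("false",), "BList1")),
    ("Loop", ("cons", "Bool", "Loop")),
]
weights = min_weights({"Bool", "BList", "BList1", "Loop"}, productions)
```

Under these assumptions, Loop keeps the weight math.inf, which signals that it produces the empty language.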
Regular tree automata have been generalized to admit equality tests between sibling nodes in trees. See: Bogaert, B.; Tison, Sophie (1992). "Equality and Disequality Constraints on Direct Subterms in Tree Automata". Proc. 9th STACS. LNCS. Vol. 577. Springer. pp. 161–172.
Allowing equality tests between deeper nodes leads to undecidability. See: Tommasi, M. (1991). Automates d'Arbres avec Tests d'Égalités entre Cousins Germains. LIFL-IT.