In previous chapters we saw how the input source was lexed, parsed and semantically analysed and we looked at how the symbols and data types are represented. But we haven't looked at what happens once the semantic analysis finishes. In this installment we're going to talk about the AST.
Abstract Syntax Tree
Fortran has two kinds of statements: non-executable and executable. The former declare properties about entities of the program and we saw in the previous chapter that these entities or symbols are represented in a symbol table. There is usually not much more to do beyond registering and remembering the attributes implied by a non-executable statement.
Executable statements, on the other hand, describe the computation that our Fortran program has to do. So we need a mechanism to represent that computation. At some point we will use this representation to generate a program that does exactly the intended computation by the Fortran source. A common way to represent this is using an AST, or abstract syntax tree.
ASTs are called abstract because they do not represent all the lexical and syntactical details of the source. Instead, they represent the fundamental parts of the language that are going to be relevant for the computation. The flang AST does exactly that.
Example
Lets consider this very simple function below
test.f90
FUNCTION ADD(A, B)
INTEGER :: A, B, ADD
ADD = A + B
END
Symbol table
In the last chapter we talked about the symbol table but we didn't consider any specific example neither we attempted to dump any of the symbol tables. Now it is a good opportunity to do this.
The flang driver allows us to pass debug flags to flang1
using -Hq,x,y
(and to flang2
using -Mq,x,y
). The numbers x
and y
are documented in the flang documentation (file coding.html
). We can dump the symbol table doing -Hq,5,1
$ flang -c test.f90 -Hq,5,1
When we request debug information, flang will create a file named <name>.qbf
for every given <name>.f90
compiled. In our case test.qdbf
.
test.qdbf
F90 PARSER begins
{flang after parser
add integer entry
sptr: 624 hashlk: 0 nmptr: 5015 dtype: 6 scope: 0 lineno: 1 enclfunc: 0
dcld: 1 ccsym: 0 entstd: 0 entnum: 0
endline: 5 symlk: 1 paramct: 2 dpdsc: 2
funcline: 1 bihnum: 0 fval: 627 pure: 0 impure: 0 recur:0
adjarr:0 aftent:0 assumshp:0 private:0
gsame: 0
altname: 0
Parameters:
sptr = 625, a
sptr = 626, b
a integer variable
sptr: 625 hashlk: 0 nmptr: 5019 dtype: 6 scope: 624 lineno: 1 enclfunc: 0
dcld:1 ccsym:0 save:0 ref:0 dinit:0 vol:0 ptrv:0 cvlen:0
address: 0 sc:3(SC_DUMMY) symlk: 1 midnum: 0 socptr: 0 autobj: 0
addrtkn:0 eqv:0 hccsym:0 alloc:0 arg:0 seq:1 nml:0 assn:0
private:0 sdsc: 0 ptroff: 0 descr: 0
altname:0
optarg:0 intent:INOUT
b integer variable
sptr: 626 hashlk: 0 nmptr: 5021 dtype: 6 scope: 624 lineno: 1 enclfunc: 0
dcld:1 ccsym:0 save:0 ref:0 dinit:0 vol:0 ptrv:0 cvlen:0
address: 0 sc:3(SC_DUMMY) symlk: 1 midnum: 0 socptr: 0 autobj: 0
addrtkn:0 eqv:0 hccsym:0 alloc:0 arg:0 seq:1 nml:0 assn:0
private:0 sdsc: 0 ptroff: 0 descr: 0
altname:0
optarg:0 intent:INOUT
add integer variable
sptr: 627 hashlk: 0 nmptr: 5015 dtype: 6 scope: 624 lineno: 0 enclfunc: 0
dcld:1 ccsym:0 save:0 ref:0 dinit:0 vol:0 ptrv:0 cvlen:0
address: 0 sc:3(SC_DUMMY) symlk: 1 midnum: 0 socptr: 0 autobj: 0
addrtkn:0 eqv:0 hccsym:0 alloc:0 arg:0 seq:1 nml:0 assn:1
private:0 sdsc: 0 ptroff: 0 descr: 0
altname:0
result optarg:0 intent:INOUT
...
</code> The dump of the table is not super easy to read but basically we see four entries. Each entry starts with the name of the symbol and the kind of symbol. The next lines are several attributes of the symbol. The set of attributes dumped changes for each kind of symbol.
In our example we only have two kinds of symbols: a first add
which is an entry (which is the way flang names a FUNCTION
) then a
, b
and another add
. The second add
is there because it represents the result-name of the FUNCTION
(had we used RESULT(myname) in the FUNCTION statement we would not have had repeated symbol names). In general repeated names are not a problem (sometimes it happens in Fortran) but it in this case, note that the scope of the variables is 624 which is the id of the symbol. The id of symbols that entail scopes like FUNCTION
, SUBROUTINE
, MODULE
is used to define a scoping relationship. Also note that the variables are marked to have a storage-class of SC_DUMMY. In Fortran parlance a dummy-argument is a formal parameter.
AST
Ok, now we have seen what flang knows (or at least shows in the dumps) about the symbols. Let's see the AST it generates. We can get the AST using -Hq,4,256
. It is possible to combine several -Hq
flags so we can see at the same time several internal dumps.
$ flang -c test.f90 -Hq,5,1 -Hq,4,256
test.qdbf
...
AST Table
null hshlk/std: 0 opt=(0,0)
aptr: 1 <null_ast>
constant hshlk/std: 0 type:integer opt=(0,0)
aptr: 2 sptr: 609 (0)
constant hshlk/std: 0 type:integer opt=(0,0)
aptr: 3 sptr: 611 (1)
ident hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 4 sptr: 1 (..sqrt)
unaryop hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 5 lop : 4 optype:28 ptr0
ident hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 6 sptr: 2 (.sqrt)
unaryop hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 7 lop : 6 optype:28 ptr1
ident hshlk/std: 0 type:character*1 alias: 0 callfg:0 opt=(0,0)
aptr: 8 sptr: 3 (.dsqrt)
unaryop hshlk/std: 0 type:character*1 alias: 0 callfg:0 opt=(0,0)
aptr: 9 lop : 8 optype:28 ptr0c
constant hshlk/std: 0 type:integer*8 opt=(0,0)
aptr: 10 sptr: 610 (0)
constant hshlk/std: 0 type:integer*8 opt=(0,0)
aptr: 11 sptr: 612 (1)
ident hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 12 sptr: 625 (a)
ident hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 13 sptr: 626 (b)
binop hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 14 lop : 12 rop: 13 optype:1
ident hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 15 sptr: 627 (add)
assign hshlk/std: 1 type:integer opt=(0,0)
aptr: 16 dest: 15 src: 14
end hshlk/std: 2 opt=(0,0)
aptr: 17
...
For reasons I have not looked into them yet, the AST contains references to a couple of preregistered symbols .sqrt
and .dsqrt
(they look like to related to the corresponding Fortran intrinsics) and a couple of constants 0 and 1. Why no other preregistered symbols or constants appear is a bit puzzling to me. That said, the interesting bits of this dump are the following
test.qdbf
...
AST Table
...
ident hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 12 sptr: 625 (a)
ident hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 13 sptr: 626 (b)
binop hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 14 lop : 12 rop: 13 optype:1
ident hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 15 sptr: 627 (add)
assign hshlk/std: 1 type:integer opt=(0,0)
aptr: 16 dest: 15 src: 14
end hshlk/std: 2 opt=(0,0)
aptr: 17
...
An AST is made of nodes. A node represents some computation. Sometimes the computation is almost no computation, like just a constant or a reference to a variable. Some other times the computation is compound of other computations. This is, other nodes. Recall that in line 4 our Fortran program is doing
4: ADD = A + B
It is possible to relate that statement to the AST dump above. Let's see, that statement is in Fortran parlance an assignment statement. Flang represents it using a node for assignments.
assign hshlk/std: 1 type:integer opt=(0,0)
aptr: 16 dest: 15 src: 14
assign
is the kind of the node, in this case this node represents an assignment. The type
of the operation is integer. aptr
means the id
of this node (remember that these id, even if just numbers, are unrelated to the id's used for symbols or data types and the precise id number could be the same). Now we have the operands of this node, which are other computations, i.e. other AST nodes. In this case the destination (dest
) is encoded in the node 15 and the source (src
) in the node 14. What are the destination and the source of an assignment? They are the left and right-hand side of the assignment, respectively.
The left-hand side is ADD
in our Fortran program. Should be encoded in the tree 15.
ident hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 15 sptr: 627 (add)
Indeed, we could check the symbol table above and see that 627 is exactly the result name add
. So good so far.
The right-hand side of our assignment is A + B
and it is encoded in the tree 14.
binop hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 14 lop : 12 rop: 13 optype:1
This node is a bit more interesting because it is a binary operation (binop
) (again of type integer
). The left and right-hand sides of a binary operations are named by flang as lop
and rop
respectively. We can check the corresponding AST nodes.
ident hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 12 sptr: 625 (a)
ident hshlk/std: 0 type:integer alias: 0 callfg:0 opt=(0,0)
aptr: 13 sptr: 626 (b)
Looks sensible.
Kinds and attributes of AST
Flang has about 160 ASTs. They are defined in the file tools/flang1/utils/ast/ast.n
. This file is in a troff-like syntax which is not particularly readable. The form is relatively simple and we can extract information using grep.
$ grep -A2 "\.ST" tools/flang1/utils/ast/ast.n
.SI null
This AST type represents the AST at location 0 in the AST table.
Generally, the pointer (index) value of zero will not appear in
--
.SI ident LVAL EXPR
This node is a leaf node in the AST and represents an
identifer which is fully resolved (declared).
--
.SI constant EXPR
.lp
Leaf node representing a constant.
--
.SI label
.lp
Leaf node representing a label.
--
.SI binop EXPR
.lp
Binary Operator node.
--
.SI unaryop EXPR
.lp
Unary Operator node.
...
File tools/flang1/utils/ast/ast.n
is also used to generate a similar one in the build directory named tools/flang1/utils/ast/ast.out.n
. The latter includes extra information of the memory layout of the nodes. However the relevant attributes should already be documented in the former. Theoretically it should be possible to generate Sphinx documentation from this file (like it is already done with other similar files) but this does not seem to be implemented yet.
For each .SM
entry in ast.n
, look at the following entries .SI
, .SE
, .FL
and .OV
.
</tr> </tr> </tbody> </table>
For instance, AST nodes that represent a binary operation have this information, according to ast.n
.
- Kind of node:
BINOP
CALLFG
(flag). Set if a function reference appears in the left and/or right operand. (for instance in2.3 + SQRT(1.2)
)DTYPE
. Data type of the result of the operationLOP
. AST pointer to left (first) operandOPTYPE
. Type of operator (seeOP_
macros inast.h
)ROP
. AST pointer to right (second) operandALIAS
. If node evaluates to a constant, this field locates the constant node representing this value
The gory details
If you are really interested on seeing how an AST is stored, just take a look at tools/flang1/utils/ast/ast.h
in the build directory. You will see that there is a global variable astb
which represents the global block that stores all the ASTs. A single AST node is represented using a struct of the type AST
.
The names of the nodes are stored in astb.atypes
, indexed by the type of the node (i.e. A_TYPEG(ast)
).
The field astb.attr
stores some attributes intended to simplify some checks.
/* AST attributes: for fast AST checking -- astb.attr is a table indexed
* by A_<type>
*/
#define __A_LVAL 1
#define __A_EXPR 2
#define A_ISLVAL(a) (astb.attr[a]&__A_LVAL)
#define A_ISEXPR(a) (astb.attr[a]&__A_EXPR)
The astb
global variable also stores other info. We will discuss some of that extra info later in this chapter.
Creation of an AST
ASTs are created in many places of the compiler but semantic analysis is going to create the initial set that will be used by further analysis and transformations in the compiler.
Low level creation
To create an AST node we can use the function new_node
. This is a low level function that will return a new node with the given kind.
ast.c
int
new_node(int type)
{
int nd;
nd = STG_NEXT(astb);
if (nd > MAXAST || astb.stg_base == NULL)
errfatal(7);
A_TYPEP(nd, type);
return nd;
}
This function basically allocates a new ast in astb
. The macro STG_NEXT
expands to the call stg_next((STG *)&name.stg_base, 1)
. The function stg_next
is a bit too long to explain today but basically reserves a number of data and returns the first element allocated, as an integer of course. It is a generic function but as it is applied to astb
it is used to allocate ASTs.
Due to implementation details elsewhere there is a maximum number of ASTs we can allocate, so we check this. The number is fairly large 0x0400_0000 (and also gives us 5 extra bits in the top of the identifier if we ever need them).
#define MAXAST 67108864
Next thing this function does it setting the kind of the node. The kind of the node is called TYPE
in flang. A_TYPEP
is the setter we need to use.
Constructors
While new_node
will give us a node it is pretty bare. Most of the time our nodes have essential information that we will always need to create them correctly. For instance a binary operation always will have the kind of operation performed, the left hand side, the right hand side and the type of the operation. As such flang provides many helpers that ease creating those nodes. Their names are of the form mk_*.
Flang seems to distinguish, conventionally, the creation of statement ASTs from other ASTs. So there is a generic mk_stmt
. The code uses this to signal the AST is actually a statement. As a special element it receives a DTYPE
to state the data type of the statement. Statements where there is not an actual data computed, like control flow statements, are created with a data type of 0.
ast.c
int
mk_stmt(int stmt_type, DTYPE dtype)
{
int ast;
ast = new_node(stmt_type);
if (dtype)
A_DTYPEP(ast, dtype);
return ast;
}
Some statements have dedicated constuctors like mk_assn_stmt
, that is used to create an assignment statement. You can see the data type, the destination (the left hand side of the assignment) and the source (the right hand side). dest
and source
are ASTs (although not made obvious in the prototype).
int
mk_assn_stmt(int dest, int source, DTYPE dtype)
{
int ast;
ast = mk_stmt(A_ASN, dtype);
A_DESTP(ast, dest);
A_SRCP(ast, source);
return ast;
}
Statement descriptor
Flang ASTs are going to represent two different things:
- Expressions
- Statements
Expressions will have a type and denote a computation which mostly revolves about evaluating the expression to determine its value. Statements are instead executed, rather than evaluated. Or put differently, they are evaluated just for their side-effects. Because statements have side-effects the order in which we execute them may be important.
Sometimes it does not matter because there is no way to tell. For instance in the example below, it does not matter that A = 3
is executed after or before the first PRINT statement, we can't simply tell the order. But we can't run the second PRINT
before the first PRINT
or execute the assignment after the second PRINT
as that would be an observably different execution of the program.
A = 3
PRINT *, "HI"
PRINT *, A
Because not all ASTs are statements, flang decouples information that is strictly for statements in another structure called the Statement Descriptor (STD). It is an attribute of an AST that represents a statement and it is accessible via A_STDP
and A_STDG
as expected.
The STD data structure is relatively simple. The most relevant attributes are the the following.
Entry | Meaning | Accessor |
---|---|---|
.SM NAME | Kind of AST node: BINOP , UNOP , ID , etc.. | A_TYPEG(ast) will return NAME .A_TYPEP(ast, NAME) is possible but less common. |
.SI name | Textual description of the node: for debugging only | astb.atypes[A_TYPEG(i)] </tr> |
.FL NAME | Name of a flag (boolean value) of this node. | A_NAMEG(ast) A_NAMEP(ast, bool-value) |
.SE NAME | Name of an attribute of this AST, usually returning an id of another AST or data type. | A_NAMEG(ast) A_NAMEP(ast, id) |
.OV NAME | Used for attributes of integer nature that are not an id of an entity (like a kind or a hash link). | A_NAMEG(ast) A_NAMEP(ast, value) |
Attribute | Meaning | Accessor |
---|---|---|
ast | The AST that is described by this STD. Contains an AST identifier | STD_AST(std) |
next | The STD that follows this STD. This is a STD identifier | STD_NEXT(std) |
prev | The STD that precedes this STD. This is a STD identifier | STD_PREV(std) |
label | The label symbol of this statement if it has one. This is a SPTR | STD_LABEL(std) |
lineno | The line number of this statement | STD_LINENO(std) |
findex | This is the file index. It identifies the file that contains the statement. It changes because of INCLUDE lines and preprocessor #include directives. The identifier is a FIH (File Information Header) that we haven't discussed yet. | STD_FINDEX(std) |
Creation of an STD
To create an STD for a an AST the function mk_std
can be used. This will only link (bidirectionally) the STD and the AST but we still need to place the statement itself.
In the early stages of the compiler, the semantic actions of the parser are responsible for most of the creations of ASTs and STDs in the compiler. Because the creation of statements often happens in the order of the program there is a convenient function add_stmt
that will do a mk_std
to the new AST and then place the STD as the last one. Sometimes AST of statements have to be placed before or after other STDs and this can be achieved with helpers like add_stmt_after
and add_stmt_before
.
The file ast.c
contains many other functions useful to manipulate placement of statements. Note that where the code expects an statement it means an identifier of an STD, not an AST.
Summary
In this chapter we have seen the structure that flang uses to represent the computations performed by our Fortran program.
- The computation is represented using an AST. An AST is a data structure formed by AST nodes.
- There are about 160 different kinds of AST nodes. Each kind represents some different form of computation, ranging from simpler ones like just references to variables or constants to more complex ones like binary operands and statements.
- Each kind of node has different attributes. Each attribute has its own accessors. Some of these attributes are ASTs, SPTRs or DTYPEs.
- Some ASTs represent Fortran statements, where the order is important. The extra information of statements is decoupled from regular AST in a data structure called STD.
- STDs concern about ordering of the fortran statements.
In the next installment of this series we will look at ASTs a bit more, in particular we have not considered statements that change the control flow.