Project 4: Semantic Analysis¶
Optional checkpoint due Friday, Mar 26, 2021 at 8pm ET¶
Final deadline Friday, Apr 2, 2021 at 8pm ET¶
A PDF version of this document is located here.In this project, you will implement a semantic analyzer for uC, a small language in the C family. The main purpose of this project is to write a substantial piece of object-oriented code in Python and to gain practice with inheritance, modularity, and code reuse. You will also gain a basic understanding of how a compiler works.
As part of the code distribution for this project, we have provided you with a lexer and parser for uC, as well as a framework for a uC compiler. In this spec, we will provide suggestions for how to implement the remaining pieces of the semantic analyzer. You may make any modifications you like to the files marked with “modify it” in Distribution Code, as long as the interface remains the same.
The project is divided into multiple phases, and we recommend completing them in the order specified.
You may work alone or with a partner. Please see the syllabus for partnership rules. As a reminder, you may not share any part of your solution outside of your partnership. This includes both code and test cases.
Reported Time to Complete the Project¶
The following is the time students reported they spent on the project in Fall 2019.
These data are for planning purposes only. We do not consider the exact time spent on the project to be reflective of anyone’s learning or ability. Rather, completing the project regardless of how much time it takes is what is important to achieve the learning goals of the project.
Optional Checkpoint¶
The checkpoint consists of achieving at least 30% of the points on the public and private test cases before the checkpoint deadline. Your grade for the checkpoint will be computed as the better of:
\(min(0.3, score) / 0.3\), where \(score\) is the fraction of autograded points earned by your best submission before the checkpoint deadline.
\(finalScore\), where \(finalScore\) is the fraction of autograded points earned by your best submission before the final deadline.
Thus, completing the checkpoint is optional. However, doing it will work to your benefit, since you can guarantee full credit on the 20% of the project points dedicated to the checkpoint.
Overview¶
A compiler generally consists of multiple phases of analysis and transformation over a source program. The first phase is usually parsing, which includes lexical analysis to divide the program into tokens and syntax analysis to organize the tokens into a format that represents the syntactic structure of the code. The parsing phase produces an abstract syntax tree (AST), which is closely related to a derivation tree formed by deriving a program from a grammar. The main difference is that operations are generally encoded in the type of a node in an AST, rather than appearing on their own as a leaf node. As an example, consider the following uC program:
void main(string[] args)() {
println("Hello world!");
}
The AST in Figure 1 is produced by the parser in the distribution code.
The nodes in the diagram above are either AST nodes (if the name ends
with Node
), lists, or terminals, which are tokens such as names,
numbers, or strings. Each AST node has instance fields that refer to
other nodes or terminals, represented by the edges in the picture
above. The edge labels are the names of the corresponding fields.
Edges from a list represent an element in the list, labeled by its
index. The leaf nodes above are terminals, representing names such as
void
or main
, or literals such as the string "Hello
world!"
.
Once a program has been transformed into an AST, the AST can be analyzed recursively in order to compute information, check correctness, perform optimizations, and generate code. The object-oriented paradigm is particularly well-suited to analyzing an AST, since each AST node class can define its own handling, with base functionality inherited from the base AST node class.
The phases in our compiler are the following, in order:
Finding declarations. The AST is traversed to collect all function and type declarations. In a uC program, a type or function can be used before its declaration.
Resolving types. Type names that appear in the AST are matched to the actual type that they name. An error is reported if a type name does not match a defined type.
Resolving function calls. Function calls that appear in the AST are matched to the function they name. An error is reported if a function name does not match a defined function.
Checking fields and variables. The fields in a type definition and the parameters and variables in a function definition are checked for uniqueness in each scope. Local environments are built for each scope, mapping each field, parameter, or variable name to its type.
Checking basic control flow. The AST is examined to check that all uses of the
break
andcontinue
constructs occur in a valid context.Computing and checking types. Types are computed and stored for each expression in the program, and each use of a type is checked to ensure that it is valid in the context in which it is used.
Checking advanced control flow. In this optional phase, local variables are checked to ensure that they are initialized before use, and non-
void
functions are checked to ensure that they can only exit through areturn
statement.
The compiler is run as follows:
$ python3 ucc.py -S <source file>
This parses the source file and then runs the semantic analysis phases
described above. If any phase, including parsing, results in an error,
the compiler exits after the phase completes, noting how many errors
occurred. The -S
argument restricts the compiler to semantic
analysis, disabling the code generation phases that constitute the
backend of the compiler.
If the -T
command-line argument is present, then a representation
of the AST with type information is saved to a file:
$ python3 ucc.py -S -T <source file>
The compiler implementation is divided into several Python files, described below.
Distribution Code¶
Use the following commands to download and unpack the distribution code:
$ wget https://amirkamil.github.io/project-uc/frontend/starter-files.tar.gz
$ tar xzf starter-files.tar.gz
Start by looking over the distribution code, which consists of the following files:
File or directory |
Purpose |
What you need to do with it |
---|---|---|
|
Parser |
Nothing |
|
Parser |
Nothing |
|
Parser |
Nothing |
|
Driver |
Read it |
|
Driver |
Read it |
|
Driver |
Nothing |
|
Contextual Information |
Read and use it |
|
Errors |
Read and use it |
|
Core |
Read, use, and modify it |
|
Types |
Read, use, and modify it |
|
Functions |
Read, use, and modify it |
|
Statements |
Read, use, and modify it |
|
Expressions |
Read, use, and modify it |
|
Testing |
Nothing |
|
Testing |
Run it with |
|
Testing |
Read, use, and modify it |
Parser Files¶
The files lex.py
and yacc.py
are the source files for PLY, the parser generator we are using in
this project. You do not have to be familiar with the code in these
files.
The actual uC parser is specified in ucparser.py
, in the
format required by PLY. You do not have to examine the contents
closely, but if you want to know more detail about how a particular
AST node is constructed, the actual code to create a node accompanies
the corresponding grammar rule in ucparser.py
. However, you should
not construct any AST nodes yourself - the parser will create the
full AST for you.
Running the parser generates a file called parsetab.py
, which
contains the internal data structures required by the parser. You do
not have to be familiar with this file either, and we recommend
excluding it from source control.
Driver Files¶
The top-level entry point of the compiler is ucc.py
. It opens the
given uC source file, invokes the parser to produce an AST, and
then invokes each of the compiler phases on the AST. If the -T
command-line argument is present, it then writes a representation of
the AST, with annotated type information, to a file if no errors
occurred in semantic analysis. If an error occurs during an analysis
phase, the number of errors is reported and the compiler exits.
The PhaseContext
class in uccontext.py
records contextual
information needed by a compiler phase. A new context can be created
from an old one:
new_ctx = ctx.clone()
# modify new_ctx and pass it to child nodes
Contextual information is stored in the PhaseContext
, which can be
indexed as an associative container:
# whether or not the current node is within a loop
ctx['in_loop'] = False
The interface for the compiler phases is defined in ucfrontend.py
.
The top-level functions for each phase set the initial contextual
information and call the appropriate method on the root AST node. In
addition, the function for the type-checking phase also checks for a
valid main()
function, so you do not have to do so yourself.
Errors¶
The file ucerror.py
defines the function you should use to report
an error. The error()
function takes as its first argument the
current analysis phase, which you can extract from a PhaseContext
.
The function’s second argument is the source position that resulted in
the error. (Every AST node stores its position as an instance field,
which can then be passed to error()
.) The third argument is the
message to report to the user.
We evaluate correctness of your compiler by checking that the phases
and line numbers in the errors you report match ours. The
ucccheck.py
file compares your output to the expected output
included in the distribution files. See Testing and Evaluation for
more detail on what constitutes correct output.
Types¶
The file uctypes.py
contains definitions of classes that represent
uC types. The Type
class is the base class for all types, and
it implements functionality common to all types. In particular, the
array_type
property is the array type for the given type. The
lookup_field()
method looks up a field in a type, and by default
it produces an error. The mangle()
method computes the name of the
type to be used in code generation, which will not be used in this
project.
The ArrayType
class represents an array type, and it has a
corresponding element type. The check_args()
method checks that
the given arguments are valid for constructing an array of the given
type. You will need to complete the implementation of the
check_args()
and lookup_field()
methods.
The PrimitiveType
class represents a primitive type. You should
not have to modify the class definition.
The UserType
class represents a user-defined type. It is
associated with the AST node that declares it, and it stores a
reference to the set of fields defined by that node. You will need to
complete the check_args()
method, which determines whether or not
the given arguments are valid for a constructor call for the given
type. You will also need to fill in the definition for
lookup_field()
.
You will not have to directly construct an object of Type
or its
derived classes. The distribution code will do so when a type is added
to the global environment, as described below.
The is_compatible()
function determines whether or not a source
type matches or is convertible to a target type, as defined by the
uC specification.
The is_numeric_type()
function determines whether or not the given
type is a numeric type.
The is_integral_type()
function determines whether or not the
given type is an integral type.
The join_types()
function computes the type of a binary operation
with the given operand types.
The add_builtin_types()
function adds the built-in primitive types
to the given dictionary. You will not need to call this function
directly, as the distribution code does so when a global environment
is created.
Functions¶
The file ucfunctions.py
contains definitions of classes that
represent uC functions. The Function
class is the base class
for all functions. A function has a name, a return type, and a list of
of parameter types. The mangle()
method computes the name of the
function to be used in code generation. It will not be used in this
project. The check_args()
method checks whether the given
arguments are compatible with the parameters of the function. You will
have to fill in the definition.
The PrimitiveFunction
class represents a built-in function. You
will not need to modify this class.
The UserFunction
class represents a user-defined function. It has
an associated declaration node. When a UserFunction
is created,
the return type and parameter types are unknown. You will have to
manually set the return type and call the add_param_types()
method
to record the parameter types when they become available in Phase
2.
You will not have to directly construct an object of Function
or
its derived classes. The distribution code will do so when a function
is added to the global environment, as described below.
The make_conversion()
, add_conversion()
, and
add_builtin_functions()
functions construct the built-in uC
functions and add them to the global environment. You will not have to
call these functions, as the distribution code does so.
Compiler Overview¶
The following is a high-level overview of how the compiler works:
The parser reads a uC source file and produces an AST representation of the code. The root of the resulting AST is a
ProgramNode
, corresponding to the start variable of the grammar.For each phase of the compiler, the top-level function in
ucc.py
sets up the context for that phase and then calls theASTNode
method for the phase on the root of the AST.By default, the
ASTNode
method for a phase just recursively calls the method on the node’s children. This method should be overridden where necessary. Due to dynamic binding, this results in customized computation when the recursion reaches a node with an override. The overriding method can recurse on the node’s children with a super call, such as:def resolve_types(self, ctx): ... # customized computation super().resolve_types(ctx) # recurse on children
The recursion terminates when it reaches terminals in the AST, which are not
ASTNode
s themselves.
In some cases, semantic analysis only requires information that is
available directly in a node’s children. For instance, the type of a
binary expression such as 1 + 3.1
can be determined from the types
of the operands, which are the lhs
and rhs
children of the
binary operation. In other cases, however, the information is not
directly available at a node’s location. For example, the expression
in a return statement must be checked against the return type of the
function, the latter of which is not available directly to the return
statement. Instead, such information is stored in the context that is
the argument of an AST method. The contextual information is filled in
where it is available, and then the context is passed down in
recursive calls to the AST nodes that need that information. In the
case of the return type, ctx['rettype']
should be set in Phase
6 where the return type is known, so that the return statement can
read it and check its expression against the expected type. You can
see what contextual information is set for each phase in the top-level
functions in ucc.py
. The minimum information is the phase number
and global environment.
As described in Driver Files, we recommend cloning a context before modifying it, so that the modification does not have to be undone when it is no longer relevant:
def basic_control(self, ctx):
new_ctx = ctx.clone()
new_ctx['in_loop'] = True # modify clone rather than original
super().basic_control(new_ctx) # pass clone
# original still has old value of in_loop
Compiler Core¶
The core of the compiler is defined in ucbase.py
. This includes
utility functions for operating on AST nodes, classes that represent
environments, the base class for AST nodes, and nodes used in
declarations.
Node Definitions¶
AST nodes are defined as Python 3.7 data classes ; we strongly recommend reading the documentation on data classes before proceeding with the project.
An AST node has the following data members:
a unique ID (
node_id
), whose value is automatically generated when a node is createda position, which is the first argument that must be passed to its constructor
zero or more children, which are AST nodes or terminals that must also be passed to the constructor (a terminal is a token such as a name, integer, or string)
zero or more attributes, which are not passed to the constructor; instead, they are initially set to
None
, but their values may be recomputed during semantic analysis
The base ASTNode
class defines the node_id
and position
members.
A node definition introduces children by defining class-level variables with a designated type but no initialization:
name: NameNode
vardecls: List[VarDeclNode]
An attribute is specified by defining a class-level variable with an
Optional
type and initializing it with a call to the
attribute()
function:
type: Optional[uctypes.Type] = attribute()
A derived AST class inherits the children and attributes of its base class. AST classes that serve as “base nodes” do not define any children.
As an example, the following is the definition of StructDeclNode
:
@dataclass
class StructDeclNode(DeclNode):
"""An AST node representing a type declaration.
name is the name of the type and vardecls is a list of field
declarations. type is the instance of uctypes.Type that is
associated with this declaration.
"""
name: NameNode
vardecls: List[VarDeclNode]
type: Optional[uctypes.Type] = attribute()
...
The base class of StructDeclNode
is DeclNode
, which doesn’t
have any children or attributes itself. Thus, the only children of a
StructDeclNode
are those that it introduces, specifically name
and vardecls
. These are filled in by the parser when the
StructDeclNode
is created, and your code should not modify their
values. The lone attribute of a StructDeclNode
is type
, since
it is the only variable that is Optional
and defaulted. It is
initially None
, but its value will be recomputed by your code
during semantic analysis. Data members can be accessed as instance
fields: if node
is an instance of StructDeclNode
, then
node.name
, node.vardecls
, and node.type
refer to the
children and attributes, and node.position
refers to its position.
AST Functions¶
The attribute()
function is a shorthand for constructing a
data-class field
that is defaulted to None
and not passed to the constructor.
The ast_map()
function is used to map a function across the
children of an AST node. A child may be a list, in which case the
given function is mapped on the elements of the list. The optional
terminal_fn
argument, if given, is called on terminals that are
not AST nodes. Refer to the methods of ASTNode
for how
ast_map()
is used. You may not need to call ast_map()
directly, since you can use super()
to call the base class’s
implementation of a method.
Environments¶
The GlobalEnv
class represents the global environment of a uC
program. Only functions and types are contained in the global
environment. Creating a GlobalEnv
object automatically adds the
built-in types and functions into the environment. You do not have to
construct a GlobalEnv
, since the provided driver does so.
The add_type()
and add_function()
methods take in a name and a
definition node. They check whether or not an existing entity of the
same category has the same name and produce an error if that is the
case. Otherwise, they create an appropriate Type
or Function
object and add it to the global environment. They return the resulting
Type
or Function
object.
The lookup_type()
and lookup_function()
methods look up a type
or function name in the global environment. If the strict
argument
is true and the name is not found, an error is reported, and a default
type or function is returned. Otherwise, the type or function
associated with the given name is returned. If the strict
argument
is false, no error is reported if the name is not found. You will only
need to call these methods with the default strict
as true. (The
distribution calls them with strict
as false when checking for a
valid main()
function.)
The VarEnv
class represents a local scope, containing either the
fields of a user-defined type or the parameters and variables of a
user-defined function. The add_variable()
method takes in a field,
parameter, or variable name, its associated type, and a string
representing the kind of entity being added. If another entity of the
same name exists within the scope, an error is reported. Otherwise,
the given name is mapped to the given type in the local scope.
The contains()
method returns whether or not a name exists in the
local scope. The get_type()
method looks up a name and returns the
associated type. If the name is not in scope, an error is reported,
and a default type is returned.
You will need to create a VarEnv
object for each user-defined type
and function.
Base and Start Nodes¶
The ASTNode
class is the base class for all AST nodes. It defines
default implementations for each phase method that recursively call
the method on each of the node’s children.
The ProgramNode
class represents the start node, and it is the
root of a program’s AST. Its only child is a list of type and function
declarations. You will not have to modify this class.
Declarations¶
The DeclNode
class is the base class for type and function
declarations. Like other classes that represent base nodes, it does
not define any children.
The StructDeclNode
class represents the declaration of a
user-defined type. Its children are the name of the type, represented
as a NameNode
, and a list of field declarations, each represented
as a VarDeclNode
. The type
attribute is the instance of
Type
associated with this declaration, which you should set in
Phase 1. You will need to fill in the definition of
StructDeclNode
.
The FunctionDeclNode
class represents the declaration of a
user-defined function. Its children are the return type, represented
as a TypeNameNode
or ArrayTypeNameNode
, the name of the
function represented as a NameNode
, a list of parameters each
represented as a ParameterNode
, a list of local-variable
declarations each represented as a VarDeclNode
, and a body
represented as a BlockNode
. The func
attribute is the instance
of Function
associated with this declaration, which you should set
in Phase 1. You will need to fill in the definition of
FunctionDeclNode
.
A VarDeclNode
represents a variable or field declaration. It has
children representing the type of the variable as a TypeNameNode
or ArrayTypeNameNode
and the name of the variable as a
NameNode
.
A TypeNameNode
represents a simple type. It has a name,
represented as a NameNode
, and a type
attribute that must be
computed in Phase 2.
An ArrayTypeNameNode
represents an array type. It has an element
type, which is a TypeNameNode
or ArrayTypeNameNode
, and a
type
attribute that must be computed in Phase 2.
A NameNode
represents a name in a uC program. It has a raw
child, which is a string containing the name.
A ParameterNode
has the same structure as a VarDeclNode
. It is
included as a separate class for the purposes of Phase 7.
Printing Functions¶
The child_str()
function is used in constructing a string
representation of an AST. It is called by the __str__()
method of
an AST node. The resulting string representation, though it contains
the full AST structure, does not present it in a very readable format,
and it does not contain the attributes of an AST node. We recommend
you use the write_types()
method of an AST node to examine its
representation instead, since it produces a more readable format and
includes the type
attribute for AST nodes that have the attribute.
Alternatively, you can call the graph_gen()
function on a node,
which will print a graph representation of the node and its children,
in a format compatible with Graphviz, to
the given output file. If you have Graphviz installed, you can then
generate a PDF as follows:
$ dot -Tpdf -o <output file> <input file>
Alternatively, you can use Webgraphviz to generate an image from the output
of graph_gen()
.
Running ucparser.py
directly on a source file will call
graph_gen()
on the resulting AST and write the result to an output
file with extension .dot
. The AST representation shown above is
generated as follows:
$ python3 ucparser.py hello.uc
$ dot -Tpdf -o hello.pdf hello.dot
The ucc.py
driver will also generate a .dot
file, including
type information for nodes that have been assigned a type, given the
-G
option:
$ python3 ucc.py -S -G hello.uc
$ dot -Tpdf -o hello.pdf hello.dot
The .dot
file will only be generated if no errors are encountered.
You can force it to be generated by disabling errors with the -NE
option:
$ python3 ucc.py -S -NE -G hello.uc
$ dot -Tpdf -o hello.pdf hello.dot
You can also cut off semantic analysis after a given phase to generate a graph after that phase:
$ python3 ucc.py -S --frontend-phase=2 -G hello.uc
$ dot -Tpdf -o hello.pdf hello.dot
We strongly recommend you make use of graph generation for debugging and testing.
Statements¶
The file ucstmt.py
contains class definitions for AST nodes that
represent statements, as well as the BlockNode
class that
represents a sequence of statements. Read through the distribution
code for details about each class, and refer to the graph
representation of a source file’s AST if you are unsure about the
structure of a node.
Expressions¶
The file ucexpr.py
contains class definitions for AST nodes that
represent expressions. The base class for all expressions is
ExpressionNode
, which has a type
attribute that is computed in
Phase 6. Like all attributes, this is inherited by derived classes.
The LiteralNode
class is the base class for all literal
expressions. It has a text
child that consists of the literal
text as used in code generation, which will not be used in this
project.
The NameExpressionNode
class represents a name used as an
expression. Its child is a NameNode
representing the actual name.
A CallNode
represents a function call. Its children are the name
of the function, represented as a NameNode
, and a list of argument
expressions. The func
attribute refers to the instance of
Function
corresponding to the call, which you will compute in
Phase 3.
A NewNode
represents an allocation of an object of user-defined
type. Its children are the name of the type, represented as a
NameNode
, and a list of argument expressions. You should compute
the type of a NewNode
expression in Phase 2.
A NewArrayNode
represents an allocation of an array. Its children
are the element type, represented as a TypeNameNode
or
ArrayTypeNameNode
, and a list of argument expressions. You should
compute the type of a NewArrayNode
expression in Phase 2.
A FieldAccessNode
represents a field access, and its children
include an expression for the receiver object and the name of the
field, represented as a NameNode
.
An ArrayIndexNode
represents an array-indexing operation, and its
children include an expression for the receiver object and an index
expression.
The remaining nodes in the file represent unary or binary operations. Pay close attention to the inheritance structure of the classes. You should attempt to write as much code as possible in base classes to avoid repetition in derived classes. Many of the derived classes should not require any code at all to be added to them.
A unary prefix operation consists of the expression on which the operation is to be applied and the name of the operation. A binary infix operation consists of the left-hand side expression, the right-hand side expression, and the name of the operation.
The is_lvalue()
method of ExpressionNode
determines whether or
not the node produces an l-value. You will need to override this
method where appropriate and make use of it where an l-value is
required. It is defined as a static method in ExpressionNode
,
since it doesn’t depend on the receiver object, but you may override
it as either a static or instance method.
Testing Framework¶
A very basic testing framework is implemented in ucccheck.py
and
the Makefile
. The former takes in the name of a uC source
file, such as foo.uc
. It expects the result of running ucc.py
on the source to be located in a file of the same name, but with
.uc
replaced by .out
, as in foo.out
. If compiling the
source does not result in an error, it expects the type output to be
in the file generated by ucc.py
, e.g. foo.types
. Finally, the
correct outputs should be in files that end with .correct
, as in
foo.out.correct
and foo.types.correct
.
Running make
will invoke ucc.py
on each of the source files in
the tests
directory. It will then execute ucccheck.py
to check
the output against the correct files. Thus, you can use the
Makefile
rather than running ucccheck.py
directly. You may
need to change the value of the PYTHON
variable in the
Makefile
if the Python executable is not in the path or not called
python3
.
You can also run an individual test through the Makefile
by giving
make
a target that replaces the .uc
extension with .test
:
$ make tests/type_clash_user.test
Test Cases¶
The following basic test cases are provided in the tests
directory, with compiler output. If a test has no errors, then the
result of running the compiler with the -T
command-line argument
is also provided.
Correct Test |
Description |
---|---|
|
Test calling default constructors. |
|
Test equality and inequality comparisons on primitive, user-defined, and array types. |
|
Simple “hello world” example. |
|
Test accessing length field of arrays and structs. |
|
Test literals of primitive type. |
|
Complex example of a uC program. |
|
Test result of resolving types. |
|
Test using types or functions before their declaration, which should be valid. |
Error Test |
Description |
---|---|
|
Test incorrect function and allocation arguments |
|
Test incorrect array indexing |
|
Test using r-values in l-value contexts. |
|
Test the |
|
Test |
|
Test a user-defined function with the same name as a primitive |
|
Test defining multiple types with the same name. |
|
Test calling an unknown function. |
|
Test using unknown types. |
|
Test repeated variable or parameter names. |
These are only a limited set of test cases. You will need to write extensive test cases of your own to ensure your compiler is correct.
Inheritance and Polymorphism¶
There are a total of 63 node types that define a uC AST, arranged
in a five-level class hierarchy rooted at ASTNode
. You should take
advantage of inheritance and polymorphism in order to customize the
behavior of derived classes while avoiding repetition. In particular,
you should place code that is common to sibling classes in their
parent class, rather than repeating it. Also make use of super()
calls when you need to add behavior to a derived class that is in
addition to what the base class provides. With proper use of
inheritance and polymorphism, you will find that you will need far
fewer than the 378 (or 441 if you do Phase 7) methods you would if
you repeat code for every phase in every class.
Phase 1: Finding Declarations¶
The first phase of semantic analysis is collecting type and function
declarations, since types and functions can be used before their
definition. For each type or function declaration, the type or
function should be added to the global environment, referenced by
ctx.global_env
, using the appropriate insertion method. The latter
will perform error checking for you to make sure a type or function is
not redefined. Make sure to set the type
and func
attributes
of a StructDeclNode
and FunctionDeclNode
, respectively.
When you are done with this phase, your compiler should pass the following tests:
$ make tests/type_clash_user.test tests/function_clash_primitive.test
You should also write test cases of your own, as the provided ones are not comprehensive.
Phase 2: Resolving Types¶
The second phase is to match occurrences of type names to the actual
Type
object named. In particular, you should compute the types of
all TypeNameNode
s and ArrayTypeNameNode
s, as well as
allocation nodes. Use ctx.global_env
to look up a type name, and
store the result in the type
attribute of the AST node.
You will need to make use of recursion to resolve the types of
ArrayTypeNameNode
s and array-allocation nodes. For example, in
order to compute the type of an ArrayTypeNameNode
, its
elem_type
child must have its type computed first. You can then
use the array_type
property method of the resulting type to obtain
the type of the ArrayTypeNameNode
.
You will also need to use recursion to resolve types of subexpressions, such as in the following example:
new int[] { new int { 3 } }
Make use of super()
to avoid repeating code.
You should also check in this phase that the void
type is only
used as a type name in the return type of a function. Set and use the
contextual value ctx['is_return']
to help check this.
Also make sure to record the return type and parameter types of a
function in its associated Function
object in this phase. Set
the rettype
attribute of a Function
directly, and use the
add_param_types()
method to set the parameter types.
When you are done with this phase, your compiler should pass the following tests:
$ make tests/phase2.test tests/unknown_type.test
As always, you should write comprehensive tests of your own.
Phase 3: Resolving Function Calls¶
The third phase is to match function calls with the actual
Function
object that is being called. Use ctx.global_env
to
look up a function name, and store the result in the func
attribute of the AST node.
When you are done with this phase, your compiler should pass the following tests:
$ make tests/unknown_function.test
Phase 4: Checking Fields and Variables¶
In the fourth phase, you should construct VarEnv
objects
representing the local scope of a user-defined type or function. This
will ensure that field or parameter and variable names are distinct
within a type or function, as well as record the type of each such
entity. Store the VarEnv
object for each type or function in an
appropriate location, and then install the field or parameter and
variable types in the VarEnv
.
When you are done with this phase, you should pass the
variable_clash.uc
test.
Phase 5: Checking Basic Control Flow¶
The fifth phase is to check that break
and continue
statements
occur within a loop. Use contextual information to determine whether
or not this is the case.
When you are done with this phase, you should pass the
break_not_in_loop.uc
test.
Phase 6: Computing and Checking Types¶
This is the most substantial phase, as it involves computing the types
of every expression and checking that a type is used in an appropriate
context. Upon computing the type of an expression, store the result in
the type
attribute of the AST node for that expression.
Start by computing the type of literals. Refer to the uC specification for what the type of each literal is. You can use the global environment to look up a type by name.
Next, compute the type of a function call. Since uC does not have
overloading, you do not need to check that the arguments are valid to
compute the type of the call itself. You should pass the hello.uc
test case once you can compute the types of literals and calls.
Continue by implementing type checking on addition operations. You
will find the join_types()
function useful for computing the
resulting type. Refer to the uC specification for the requirements
on the operands of the +
operator. When you are done, you should
pass the literals.uc
test.
The remaining given test cases will require you to implement type computation and checking on substantially more expression and statement types. You should write your own test cases to allow you to implement and test incrementally.
You will need to set the contextual local environment and return type as appropriate in each function, so that you are able to look up variable names and check that a return statement is valid.
For statements, you should check that their child expressions have the appropriate type according to the semantics of the statement and the current context.
For each expression, you will need to compute its type according to its child expressions and terminals. Refer to the uC specification for what the type of an expression should be. Make use of the utility functions we provide you in
uctypes.py
.For simple allocations, ensure that the given type is not a primitive type (disallowing expressions such as
new int()
).For function calls and allocation expressions, you will need to check that the number of arguments and the argument types are compatible with the required types. Implement and make use of the
check_args()
methods onType
andFunction
objects. You will find the providedis_compatible()
function useful here.For field accesses, implement and make use of the
lookup_field()
methods onType
objects. You should return theint
type if the field is not defined and report an error.For array-indexing operations, you should assume that the result type is
int
if the receiver is not an array and report an error.In general, you should use the
int
type in the presence of an error if the type of an expression cannot be unambiguously determined. Use your best judgment on whether or not this is the case; it is not something we will test, except the specific cases mentioned above.Make sure to check for an l-value in contexts that require one. Implement and make use of the
is_lvalue()
method onExpressionNode
and its derived classes.
Refer to the uC specification for the type rules that must be met for each expression, which you are responsible for checking. Report an error if a rule is violated.
When you are done with this phase, you should pass all of the provided test cases.
Phase 7 (Optional): Advanced Control Flow¶
In this phase, check control flow to make sure that a local variable
is always initialized before use, and that a non-void function always
returns a value. You do not have to treat a while (true) { ... }
(or equivalent for loop) differently from any other loop.
In order to run this analysis phase, you will need to use the
--frontend-phase
command-line argument, which determines at which
phase to end semantic analysis. The default is Phase 6, so the
following overrides the default:
$ python3 ucc.py -S --frontend-phase=7 <source file>
This phase is optional and will not be graded. We will run your code with this phase turned off, with the default setting of ending at Phase 6.
Testing and Evaluation¶
We have provided a small number of test cases in the distribution code. However, the given test cases only cover a small number of possible errors, so you should write extensive tests of your own.
We will evaluate your compiler based on whether or not it reports an error on an erroneous input source file in the expected analysis phase on the expected source line numbers. You do not have to produce exactly the same number of errors as our solution, so long as the set of errors you report consists of the same phases and line numbers as ours. You also do not have to generate the exact error messages we do, but the error messages you use should be meaningful.
A compiler should never crash, even on erroneous input. Instead, your compiler should gracefully detect and report errors, and then avoid any actions that could result in a crash.
The code you write only needs to detect semantic errors, since the provided lexer and parser already detect grammatical errors. We will not test your code with input that is lexically or syntactically incorrect.
In addition to testing your error detection, we will also test your
compiler to make sure it computes types correctly. We will compare the
printed types of each AST node for a valid input source file with the
expected types using the -T
command-line argument to ucc.py
.
Grading¶
The approximate grade breakdown for this project is as follows:
20% checkpoint
70% final deadline autograded
10% final deadline hand graded
Hand grading will evaluate the comprehensiveness of your test cases as well as your programming practices, such as avoiding unnecessary repetition and making appropriate use of inheritance and polymorphism. In order to be eligible for hand grading, your solution must achieve at least half the points on the autograded, final-deadline portion of the project.
You are required to adhere to the coding practices in the course style guide. We will use the automated tools listed there to evaluate your code. You can run the style checks yourself as described in the guide.
Submission¶
All code that you write for the interpreter must be placed in
ucbase.py
, ucexpr.py
, ucfunctions.py
, ucstmt.py
, or
uctypes.py
. We will test all five files together, so you
are free to change interfaces that are internal to these files. You
may not change any part of the interface that is used by ucc.py
or ucfrontend.py
.
Submit ucbase.py
, ucexpr.py
, ucfunctions.py
,
ucstmt.py
, uctypes.py
, and any of your own test files to the
autograder before the deadline. We suggest including a README.txt
describing how to run your test cases.