Assignment 1, CSC431, Winter 2010
1 Guidelines
You and your partner must choose one of the supported class languages. Currently, that includes PLT Scheme, F#, and Scala. Let me know if you can get together a critical mass (about ten students, including one maintainer) to support another language.
2 Assignment
You must implement an interpreter for the language that you’ll later be writing a compiler for. Let’s call this language Footle. I don’t know why I picked that name. Your solution will map Footle programs to answers, as described below.
3 NO PARSING
Your assignment does not need to include a parser.
More specifically, your assignment must map programs to answers, but you must decide on (and document) your representation for the input programs (that is, your Abstract Syntax Trees, or ASTs). If you design your representation carefully, it will also make a good target for your parser.
4 How To Develop This Program
If this process sounds strange to you, follow these steps.
First, examine the specification of the inputs and outputs (that is, the Footle expressions and values) below.
Second, decide how you’re going to represent each one. The representation you choose should make it easy to access the sub-elements of each value; those would likely be classes in Scala, datatypes in F#, structs or define-types in PLT Scheme, etc.
In a statically typed language (e.g. F# or Scala or typed scheme) you’ll also need to explicitly formulate a representation for the union of all these forms. That is: what type do you give to an arbitrary value? Again, nearly every statically typed language has a means of doing this. Ocaml (and, by extension, presumably F#) has datatypes, Scala has "case class" definitions, etc.
In a dynamically typed language (e.g. PLT Scheme), you don’t need to tell the compiler about your datatypes, but you do need to tell those reading your code. You must therefore provide a comment in the form of a data definition that conveys the same information.
Third, figure out what the inputs and outputs of the interpreter are going to be. For instance, you’ll probably want to add an environment as an additional input. In my interpreter, I added another argument to hold the ’return’ jump target, but that may depend on how you implement ’return’ (more below).
Fourth, write some test cases; when the interpreter is called with this, it should produce that, etc. I cannot stress enough the importance of specifying these test cases before writing the program, and of starting with very simple test cases. It’s the simplest test cases that are going to show you how to implement the corresponding language forms; the more complex test cases help to isolate more subtle issues. If you start with a giant hairy test case, you’re going to spend hours debugging simple problems.
Finally, if you do discover a problem with a complex test case, don’t just dive into the debugger; debuggers are the fool’s first resort, and you can spend hours just watching the values bounce around. Instead:
Simplify the test case until you find the smallest test case that reproduces the problem.
Formulate a hypothesis about how the problem occurs. (Use your brain! Think!)
Use a printf or a debugger to test your hypothesis.
Repeat until finished, fired, retired, or dead.
The "thinking" part is generally the hardest.
5 Values
The values are the possible results of interpreting an expression. They include the following:
booleans
integers
floating-point numbers
void
objects : there are three kinds of objects: stringvals, closurevals, and plain objects. All three of them contain a list of slots, binding names to mutable values. Stringvals and closurevals also each contain an additional value: in the case of a stringval it’s a string, in the case of a closureval it’s a closure.
primitive functions : these represent built-in functions, listed below.
A closure contains a list of parameters, a body expression, and an environment. If it’s not clear to you what a closure is or why it should contain an environment, ask someone who’s taken CSC 430 recently.
6 Language Forms
In the Footle interpreter, there is no distinction between statements and expressions, or between programs and expressions. Everything is just an expression. This simplifies the interpreter.
The words ’interpret’ and ’evaluate’ are used interchangeably in this text.
The words ’field’ and ’slot’ are used interchangeably in this text.
literals: the set of literals should include strings, integers, floating point numbers, and booleans. In the case of strings, each literal expression should create a fresh string (with its own list of fields)
variable references: your language must include variable references. Interpreting a variable reference produces the value currently associated with that reference’s binding. The languge is lexically scoped, meaning that the binding for any reference is the nearest lexically enclosing declaration with that name.
if: this expression contains three expressions: a ’test’ clause, a ’then’ clause, and an ’else’ clause. To interpret it, first interpret the ’test’ clause. If this result is not a boolean, signal an error. Otherwise, if it’s true, produce the interpretation of the ’then’ clause. If it’s false, produce the interpretation of the ’else’ clause.
application: This form represents a call to a function. It contains a ’function’ expression indicating the function to be called, and zero or more ’argument’ expressions. To interpret it: first interpret the ’function’ expression. If it does not produce a primitive or a closure, signal an error. If it produces a primitive, apply the primitive to the result of interpreting the arguments. If it produces a closure, call the closure by evaluating each of the arguments, and then interpreting the body of the closure using the environment contained in the closure, extended with bindings from the closure’s parameters to the interpreted arguments.
sequence: This form represents a sequence of expressions (or ’statements’, if you prefer). To interpret it, evaluate each expression in the sequence, and produce the value of the last one. An empty sequence produces the ’void’ value.
variable binding: this form contains a variable name, a value-expression for the new variable, and a body for which the variable is bound. To interpret it: evaluate the value-expression, and then produce the result of interpreting the body in the current environment extended with a binding from the given name to the value-expression’s value.
Note that this means that the variable binding is only in scope for the single expression in its body. "But wait," I hear you say, "how will that work with code like this?":
var x = 14;
f(x,4);
g(x+14);
The answer is that your parser (when you write it) will parse this as a variable binding for ’x’ with a body that contains a single sequence expression. This sequence expression will contain the two function applications. So the formulation I’m proposing is rich enough to express all of the programs that we want to write.
What about multiple variable declarations? No problem; these can be parsed as nested single bindings. Once again, I’m not asking you to write a parser yet; I’m just answering questions about what will happen when we do come to write the parser.
function bindings: A set of function bindings contains a set of function declarations and a body expression. A function declaration contains the function’s name, a list of parameter names, and a single expression.
Note that unlike a variable binding, the function binding form has to handle multiple functions at once. This is necessary, in order to allow mutual recursion. If we handled multiple function declarations as nested bindings, then the "forward" references from the earlier function to the later one would not be bound correctly.
In order to allow recursion, then, the environment that goes into each of the closures defined by this form must itself contain bindings for all of the new functions. Here’s how to do this:
First, extend the environment with new bindings for the new function names; initially, each name will be bound to some bogus value. Then, using this environment, build a closure for each of the functions. Finally, use these function values to replace the bogus values in the bindings. Note that this creates cycles in our environment: the environment contains bindings for closures that contain environments that contain bindings for that same function, etc. etc.
In case it’s not already clear, this is precisely the same thing that we did in 430.
return: A return contains an expression.
Evaluating return is really (if you’ll pardon me saying it) quite horrible. You will be startled at how it messes up your interpreter. If you’re not startled, then you probably didn’t implement it correctly.
The problem with return is that it needs to discard context. That is, it needs to escape from whatever computation is occurring, be it an if or a while or an arithmetic expression or whatever, back out to the entry point of this function call. Since the interpreter recurs on subexpressions, this means that you’re going to have to escape from an arbitrary number of recursive calls in the interpreter when you encounter a return.
There are two main ways to implement this. The first one, that I recommend, is to use some kind of non-local escape mechanism that your implementing language has. For instance, exceptions work well for this.
The second one, that I don’t recommend, is to stop using the language’s stack and roll your own. In other words, rather than making recursive calls to the interpreter to evaluate subexpressions, you’ll explicitly push something onto your own stack and then just loop back to the beginning of your interpret procedure. Since the stack is explicitly represented, you can discard a whole bunch of frames (out to the function call boundary) when you encounter a ’return’. The down side of this design is that you have to manage your own stack explicitly, and this brings a whole new set of things that can go wrong.
Also, be careful of ’return’ within ’return’; it should be an error to use a ’return’ expression within a ’return’ expression.
It’s also an error to use ’return’ outside of a function body.
Finally, function bodies whose evaluation finishes without ever evaluating a ’return’ should simply produce the ’void’ value.
variable mutation: A variable mutation contains a variable name, and a new value expression. To interpret it, first find the binding corresponding to the variable name in the environment. If it’s missing, signal an error. Next, evaluate the new-value expression, and put the resulting value in place of the old binding’s value. This new value should also be the result of the interpretation.
while loop: A while loop contains a test expression and a body expression. To interpret it: first interpret the test expression. If the result is not a boolean, signal an error. Otherwise, if the result is false, the result is ’void’. If the result is ’true’, interpret the body expression, then evaluate the whole ’while’ expression again.
field lookup: A field lookup expression contains an ’object’ expression and a name. To interpret it, first evaluate the ’object’ expression. If this value is not an object, signal an error. Otherwise, look at the slots of the expression, for the named field. If the field is absent, signal an error. Otherwise, the result is the value associated with the named field.
field mutation: A field mutation expression contains an ’object’ expression, a name, and a new-value expression. To interpret it, first evaluate the ’object’ expression. If this value is not an object, signal an error. Otherwise, look at the slots of the expression, for the named field. If the field is absent, add a new slot with the given name . Finally, mutate the binding to contain the result of evaluating the new-value expression. The result is the result of the new-value expression.
method call: A method call expression contains an ’object’ expression, a name, and zero or more argument expressions. To interpret it, first evaluate the ’object’ expression. If this value is not an object, signal an error. Otherwise, look at the slots of the expression, for the named field. If the field is absent, signal an error. If the value in the field is not a closure, signal an error. Otherwise, call the closure with the results of interpreting the arguments much as described in ’application’, above, except that there is one additional binding: the identifier ’this’ is bound to the result of evaluating the ’object’ expression.
new: A new expression contains a ’function’ expression, and zero or more argument expressions. To interpret it, first interpret the ’function’ expression. If the result is not a closure, signal an error. Otherwise, create a new object with one slot, binding the name ’constructor’ to the closure. Then, call the closure as described in ’application’, above, except that there is one additional binding: the identifier ’this’ is bound to the newly created object.
That’s it! Well, unless we decide to add more things in later assignments.
6.1 A note on mutation of environments
Mutating environments: don’t do it. Your interpreter will wind up buggy. More specifically, make sure that when you extend an environment with new bindings, that references to the old environment are still unchanged, containing all the same bindings. The easiest way to guarantee this is never to mutate a variable that refers to the environment. (Note that mutating the bindings within an environment is not only okay, but in fact required in order to implement variable mutation. I’m just talking about mutating the environment itself.)
6.2 Where is mutation required?
Mutation is required in several places: the value associated with each binding must be mutable, the value associated with each slot must be mutable, and the set of slots associated with an object must be mutable (so you can add fields on the fly).
7 Primitive Functions
You’re required to implement a number of primitive functions. These primitive functions should appear in a base environment used by all calls to the interpreter.
+, -, *, /, <, >, <=, >= : binary numeric operations. Calling these with integer values should perform the corresponding "integer" operation, calling them with floating-point argument should perform the corresponding floating-point operation, and calling them with one of each should promote the integer to a floating-point number. Calling these with non-numeric arguments should signal an error.
and,or,not : boolean operators. Calling these with non-boolean arguments should signal an error. They need not be "short-circuiting"; that is, your evaluator should always evaluate both arguments to ’and’ or ’or’, even when the first one determines the result.
string-length : the string length operator. Calling it with a string produces an integer corresponding to the length of the string. Signals an error otherwise.
substring : consumes a string, a start position ’s’, and an end position ’e’. This operator returns the substring starting at position ’s’ and ending with position ’e-1’. If the first argument is not a string, or the other two are not integers, signals an error.
string-append : consumes two strings and returns the new string containing the characters of the first followed by the characters of the second.
string=? : consumes two strings, and returns true if they contain the same sequence of characters.
string<? : consumes two strings, and returns true if the first strictly precedes the second, using a dictionary ordering. Signals an error if either argument is not a string.
== : the comparison operator. This operator should return true for two integers that are equal, two floating point numbers that are equal, two booleans that are equal, two voids, or two objects that are "pointer-equal"; that is, objects ’a’ and ’b’ are pointer-equal if mutating a field in ’a’ also mutates a field in ’b’. For all other pairs of values, this operator should return false. It should never signal an error.
instanceof : this operator consumes an object and a value, and returns true when the object has a ’constructor’ slot and that slot’s value is the second argument.
integer? : this operator returns true when called with an integer, and false otherwise.
boolean? : this operator returns true when called with a boolean, and false otherwise.
floating-point? : this operator returns true when called with a floating-point number, false otherwise.
void? : this operator returns true when called with the void value, false otherwise.
string? : this operator returns true when called with a stringval, false otherwise.
closure? : this operator returns true when called with a closureval, false otherwise.
plain? : this operator returns true when called with a plain object, false otherwise.
print : this operator consumes any value and prints it out. I’m not going to nail this format down precisely at this point, but if I were you, I’d come up with an output format such that no two different values have the same printed representation.
read-line : this operator should read a single newline-terminated line from the standard input, and returns a string containing these characters, including the final newline.
8 Incomplete Specification
I’m sure there are ambiguities, missing primitives, and probably outright errors in this specification. Post to the newsgroup when you discover them.
9 Hints
For Heaven’s sake, develop this project incrementally. For instance, you can develop an interpreter that only handles constants. Then, you can add primitives, and the application of primitives. Don’t try to swallow the elephant in one bite.
10 An Example
It’s hard to provide an example, because the inputs to your function are going to be chosen from your own representations, and therefore specific to your team. However, I can show you an example of the representation that I’m using, and what it’s supposed to produce. With luck, you can read and understand this example.
| (check-equal? |
| (test-interp '(letrec ([even? (x) (return (if (zero? x) |
| #t |
| (odd? (- x 1))))] |
| [odd? (x) (return (if (zero? x) |
| #f |
| (even? (- x 1))))]) |
| (and (odd? 13) |
| (and (not (even? 13)) |
| (even? 14))))) |
| true) |
11 Requirements
Your solution should consist of several things: a README file, a TEAM file, a go.sh file, and source and test case files.
in the README file, a few lines of description on how you solved the project, including what language or languages you used.
also in the README file, a specification of the way that you represent the three forms of data required by the interpreter: the values, the expressions, and the environments. If you’re using a statically typed language, it should be enough just to point to the type that represents each one of them (That is: what type represents expressions? what type represents values? What type represents environments?) If you’re using Scheme in the PLAI language, you can specify these using define-type’s. If you’re using another dynamically typed language, you’ll probably have to specify these explicitly.
A TEAM file with precisely that name, containing one line for each team member with that member’s login id, a comma, and then his or her full name. So, if Benjamin Franklin were working with George Washington, this file might contain:
bfrankli, Benjamin Franklin
gwashing, George Washington
The order of the lines is not important. You must supply this file even if you’re working alone.
a shell script called go.sh that compiles the code (if necessary) and runs all test case, printing the results to stdout. It should be clear from the output which tests failed, if any.
source files containing the definition of the interpreter.
source files containing the test cases.
12 Handin Instructions
You’ll hand in your code simply by creating a branch called Assignment1 in your subdirectory of the private part of the class repository. Whatever’s there when the deadline occurs is your submission.