
This assignment is the first stage of the Amy compiler.

You can download the zip file of the project skeleton here.

Code Skeleton

In this lab you will start your own compiler from scratch, meaning that you will no longer rely on the compiler frontend that was previously provided to you as a jar file. You will build the lexical analysis phase (the lexer).

As we are now starting to work on your own compiler, we begin with a fresh scaffold. We suggest keeping your interpreter alongside; we will tell you at what point you can add it back into your project.

For now, you should work in this new project, which will become your full compiler. Subsequent labs will be delivered as files to add to this project.

The structure of your project directory should be as follows:

lib 
 ├── scallion-assembly-0.6.1.jar
 ├── stainless-library_3-0.9.9.2-6-g78f42ac-SNAPSHOT-sources.jar
 └── ziplex_3-0.1.0-SNAPSHOT.jar
 

library
 ├── ...
 └── ...

examples
 ├── ...
 └── ...

amyc
 ├── Main.scala                         
 │
 ├── parsing                             
 │    ├── Lexer.scala
 │    └── Tokens.scala
 │
 └── utils                               
      ├── AmycFatalError.scala
      ├── Context.scala
      ├── Document.scala
      ├── Env.scala
      ├── Pipeline.scala
      ├── Position.scala
      ├── Reporter.scala
      ├── UniqueCounter.scala
      └── ZiplexUtils.scala

test
├── scala
│    └── amyc
│         └── test
│               ├── CompilerTest.scala
│               ├── LexerTests.scala
│               ├── TestSuite.scala
│               └── TestUtils.scala
└── resources
      └── lexer
           └── ...

This lab will focus on the following two files: parsing/Lexer.scala and parsing/Tokens.scala.

Below you will find the instructions for this assignment, in which you will get to know and implement a lexer for the Amy language.

A Lexer for Amy

The role of a lexer is to read the input text as a string and convert it to a list of tokens. Tokens are the smallest useful units in a source file: a name referring to a variable, a bracket, a keyword, and so on. The role of the lexer is to group together those useful units (e.g. return the keyword else as a unit, as opposed to the individual characters e, l, s, e) and to abstract away all useless information (such as whitespace and comments).

Code structure

You can find the lexer in the Lexer.scala file. It is based on Ziplex, a formally verified lexer library for Scala. Ziplex allows you to transform an input character stream (such as the contents of an Amy source file) into a sequence of tokens. We include a short reference of Ziplex’s API in Lexer.scala.

To build a lexer using Ziplex, you first define rules based on regular expressions. Each rule is associated with a transformation function that converts the matched input characters into a TokenValue. Ziplex then offers a lex function that takes as input the list of your rules along with the input string and produces a sequence of tokens.

Ziplex follows what is known as the longest match semantics (or maximum munch): at each step, it finds the rule that matches the longest prefix of the remaining input to create a token. If multiple rules match the same longest prefix, the rule that appears first in the list of rules is chosen.
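To make the longest-match semantics concrete, here is a minimal, self-contained sketch in plain Scala. It does not use Ziplex’s actual API: the rules are ordinary regular expressions paired with a label, and all names here are illustrative. Note that the keyword rule is listed before the identifier rule, so a tie on an exact keyword resolves to the keyword.

```scala
object MaxMunchDemo {
  import scala.util.matching.Regex

  // Rules in priority order: the keyword rule comes before the
  // identifier rule, so ties on exact keywords resolve to keywords.
  val rules: List[(Regex, String)] = List(
    ("else|end|object".r, "Keyword"),
    ("[A-Za-z_][A-Za-z0-9_]*".r, "Identifier"),
    ("\\s+".r, "Whitespace")
  )

  // At each step, pick the rule matching the longest prefix of the
  // remaining input; maxBy keeps the first of equally long matches,
  // which implements the "first rule in the list wins" tie-breaking.
  def lex(input: String): List[(String, String)] =
    if (input.isEmpty) Nil
    else {
      val candidates = rules.flatMap { case (re, label) =>
        re.findPrefixOf(input).map(lexeme => (lexeme, label))
      }
      val (lexeme, label) = candidates.maxBy(_._1.length)
      (lexeme, label) :: lex(input.drop(lexeme.length))
    }
}
```

For example, lex("elsewhere") yields a single Identifier token (the identifier rule matches a longer prefix than the keyword rule), while lex("else") yields a Keyword token (both rules match the same prefix, and the keyword rule comes first).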

The TokenValue classes and the transformation functions from matched strings to TokenValues are defined in ZiplexTokens in Lexer.scala. These are provided, but you might want to see how they work.

Ziplex’s lex function returns a sequence of ziplex.Tokens, whose class definition is the following:

case class Token[C](value: TokenValue, rule: Rule[C], size: BigInt)

Here, value is the TokenValue produced by the transformation function of the matched rule, rule is the rule that matched the input, and size is the length of the matched input.

The rest of Amy’s pipeline expects tokens of type amyc.parsing.Tokens.Token. Therefore, the final step of the lexer is to convert each ziplex.Token into an amyc.parsing.Tokens.Token. This is done in the toAmyToken function in Lexer.scala. Some cases are already implemented for you, but you will need to complete the rest.
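The conversion is essentially one pattern-match case per TokenValue. The sketch below illustrates the shape of such a function; the type and constructor names are invented stand-ins, not the actual definitions from ZiplexTokens or amyc.parsing.Tokens.

```scala
object ConversionSketch {
  // Hypothetical stand-ins for the real ZiplexTokens values;
  // the actual cases are defined in Lexer.scala.
  sealed trait ZTokenValue
  case class KeywordValue(word: String) extends ZTokenValue
  case class IdValue(name: String)      extends ZTokenValue

  // Hypothetical stand-ins for the amyc.parsing.Tokens.Token hierarchy.
  sealed trait AmyToken
  case class KeywordToken(word: String)    extends AmyToken
  case class IdentifierToken(name: String) extends AmyToken

  // One case per TokenValue, each producing the corresponding Amy token.
  def toAmyToken(value: ZTokenValue): AmyToken = value match {
    case KeywordValue(w) => KeywordToken(w)
    case IdValue(n)      => IdentifierToken(n)
  }
}
```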

Positions are handled for you by the run method of the Lexer. The ziplex.Tokens returned by Ziplex do not contain position information, but since they contain the size of the matched input, it is possible to reconstruct the position of each token in the input string. The run method does this and sets the position of each token before returning it. You can have a look at ZiplexUtils.nextPosition and ZiplexUtils.addPositions in ZiplexUtils.scala to see how this is done.
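The idea behind this reconstruction can be sketched as follows (a simplified illustration, not the actual ZiplexUtils code): walk the input once, advancing a 1-based (line, column) pair over the characters each token consumed, and record the position where each token starts.

```scala
object PositionSketch {
  // Reconstruct 1-based (line, column) start positions from the input
  // string and the size of each token, in order. Simplified sketch of
  // what ZiplexUtils.nextPosition / addPositions accomplish.
  def addPositions(input: String, sizes: List[Int]): List[(Int, Int)] = {
    def advance(pos: (Int, Int), c: Char): (Int, Int) =
      if (c == '\n') (pos._1 + 1, 1)   // newline: next line, column 1
      else (pos._1, pos._2 + 1)        // any other character: next column

    var pos  = (1, 1)
    var rest = input
    sizes.map { size =>
      val start = pos                              // token starts here
      val (consumed, remaining) = rest.splitAt(size)
      pos = consumed.foldLeft(pos)(advance)        // step over the token
      rest = remaining
      start
    }
  }
}
```

For instance, for the input "object Hello\nend" tokenized as object, a space, Hello, a newline, and end, the reconstructed positions are (1,1), (1,7), (1,8), (1,13), and (2,1), matching the line:column format shown in the example output below.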

The Lexer has the following components:

Your task is to complete the rules in Lexer.scala and implement the filtering of irrelevant tokens.

Notes

Here are some details you should pay attention to:

Example

For reference, below is a possible output for the example under examples/Hello.amy, which you should get when entering the sbt shell and typing run examples/Hello.amy --printTokens. Two small things to note here:

You can always get reference output for the lexer from the reference compiler by typing:

java -jar amyc-assembly-1.7.jar --printTokens <files>

Example output for examples/Hello.amy:

KeywordToken(object)(1:1)
IdentifierToken(Hello)(1:8)
IdentifierToken(Std)(2:3)
DelimiterToken(.)(2:6)
IdentifierToken(printString)(2:7)
DelimiterToken(()(2:18)
StringLitToken(Hello )(2:19)
OperatorToken(++)(2:28)
StringLitToken(world!)(2:31)
DelimiterToken())(2:39)
KeywordToken(end)(3:1)
IdentifierToken(Hello)(3:5)
EOFToken()(4:1)

Deliverable

You are given 1.5 weeks for this assignment. Deadline: Wed, March 25, 11pm.

You will need to submit your src/amyc/parsing/Lexer.scala file to Canvas – follow this link to the assignment.