University of Helsinki Department of Computer Science
 

Department of Computer Science

Department information

 

Laudatur course on Compilers
Exercise and laboratory assignment pages, Spring 2006

End-term project

End-term project, deadline May 8th at 2100 hours.
The end-term project is a larger work. Instructions for the project will be given at tenth exercise session, April 24th.

The project work is to design and implement a scanner generator for compilers written in Java. Upon a prior agreement, projects implemented in C++ can be accepted. The work can be done either individually or in groups of 2-3 students. The grading is based on scope and quality of work: larger scope gains more maximum points and better quality gives more actual points.

Documentation
The project work requires that technical documentation of about 5-10 pages is included. The structure of the document follows scientific format: Abstract, introduction, 2-3 chapters of discussion, conclusions and references. The document should give a good overview of the chosen problem, discuss and problematize it, and finally give a conclusion of some sort.

Grading
Each star in the difficulty rating of a task represents 10% of maximum points. So, a flawless work with combined difficulty rating of 10 will give the maximum points. When the work implements multiple parts, the difficulty rating is calculated by applying the plus operator to amount of stars in each part. (See also How to talk like a computer scientist).

When working in groups, a group size of three persons gives up to 80% of the points represented by difficulty rating. Individual work boosts the rating up to 120%.

Project part description Max points Status
Basic scanner implementation fs fs fs fs fs       Mandatory
Extension of input language fs hs ns ns ns Optional
Reserved words Optional
Nondeterministic and Deterministic FA's Optional
Visualization of execution Optional
Support for backreferences       Optional
Code generation Optional

Basic scanner implementation

Interpreter for basic regexps - support for minimal set of regular operators (repetition, catenation, alternation, and grouping).

The input language matches as follows:

Expression Matches Example
c any non-operator character c     a
\c character c literally \*
r* zero or more r's a*
r1r2 r1 then r2 ab
r1|r2 r1 or r2 a|b
(r) r (a|b)

The interpreter can be an NFA or a DFA - See a DFA-based driver and an example datastructure in Scanner.java. (Bonus points for every bug located in the example :-).

If you base your work on the given example, remember to document the internal working of the example as well.

Extension of input language

Extensions to set of accepted operators and language definition.

Expression Matches Example
r+ one-or-more a+
r? zero-or-one a?
[c1-c2] character ranges, starting from c1 and ending to c2 [a-z]
r{d1-d2} counted instances, from d1 to d2 instances a{3-9}
id1 ::= r1
id2 ::= r2
named macros alphabet ::= [a-z]
number ::= [0-9]
character ::= (alphabet|number)

Different handling of reserved words

Implementation of two ways, e.g. Scott's afterwards-recognition and encoding keywords into the scanner states, to handle overlaps in input language ranges -- e.g. to distinguish between cases of
identifier ::= a-z([a-z0-9]*)
and
keyword ::= "keyword".

Comparision in terms of memory usage and speed.

NFA versus DFA

In the old days, NFA-based regexp implementations were preferred due to their smaller memory consumption. Has the situation changed?

Implement two different implementations for the interpreter: an NFA-based one and a DFA-based one. Compare the implementations in terms of memory consumption (number of states) and interpretation speed with a representative survey of regexps.

Discuss different factors and implications of choosing one approach over another.

Visualization of the scanner algorithm

Design and implement a system for visualizing the interpretation of the given input regexp. Basic operations would be at least setting break points, single stepping execution, continuing execution (until next break point) and a way to observe the relationship between given regexp and input string.

Backreferences

Extend the input language with backreferences (e.g. \1, \2, ..\9), that can be used to receive the actual matched string in the nth part of the input regexp.

Java bytecode backend

Implementation of JVM bytecode generating scanner generator. Practically every regexp generator use an interpretative approach: they define a set of regular operators, which are then interpreted when given the input string (but see additional references as well).

This work topic explores another approach: compile the regexp straight into Java bytecode. Hopefully, that would fasten up the process, as the overhead of regexp operator interpretation would be avoided. Also, a simpler structure might give benefits when the compiled regexp meets JVM's dynamic compiler.

Implement this kind of a scanner and benchmark its results against e.g. the pattern matcher found in package java.util.regex (since 1.4). Discuss the relative pros and cons of your implementation against the Java's default implementation.

Resources

There are numerous books written about the JVM byte code. It is probably very hard to complete this task without a hand reference to one of these (see resources listing). If you're not familiar with the JVM in the general level, it is anyways a good idea to browse through this material.

A comparative review is available at JavaWorld.
A list of all JVM instructions is compiled by Jon Meyer, first author of the O'Reilly's book. It is available at http://cat.nyu.edu/~meyer/jvmref/.


Pietu.Pohjalainen@cs.Helsinki.FI