Laudatur course on Compilers
Exercise and laboratory assignment pages, Spring 2006

Course page General Exercises Mid-term project End-term project

End-term project

End-term project, deadline May 8th at 2100 hours.
The end-term project is a larger work. Instructions for the project will be given at tenth exercise session, April 24th.

The project work is to design and implement a scanner generator for compilers written in Java. Upon a prior agreement, projects implemented in C++ can be accepted. The work can be done either individually or in groups of 2-3 students. The grading is based on scope and quality of work: larger scope gains more maximum points and better quality gives more actual points.

Documentation
The project work requires that technical documentation of about 5-10 pages is included. The structure of the document follows scientific format: Abstract, introduction, 2-3 chapters of discussion, conclusions and references. The document should give a good overview of the chosen problem, discuss and problematize it, and finally give a conclusion of some sort.

Grading
Each star in the difficulty rating of a task represents 10% of maximum points. So, a flawless work with combined difficulty rating of 10 will give the maximum points. When the work implements multiple parts, the difficulty rating is calculated by applying the plus operator to amount of stars in each part. (See also How to talk like a computer scientist).

When working in groups, a group size of three persons gives up to 80% of the points represented by difficulty rating. Individual work boosts the rating up to 120%.

Project part description	Max points	Status
Basic scanner implementation		Mandatory
Extension of input language		Optional
Reserved words		Optional
Nondeterministic and Deterministic FA's		Optional
Visualization of execution		Optional
Support for backreferences		Optional
Code generation		Optional

Basic scanner implementation

Interpreter for basic regexps - support for minimal set of regular operators (repetition, catenation, alternation, and grouping).

The input language matches as follows:

Expression	Matches	Example
c	any non-operator character c	a
\c	character c literally	\*
r*	zero or more r's	a*
r₁r₂	r₁ then r₂	ab
r₁\|r₂	r₁ or r₂	a\|b
(r)	r	(a\|b)

The interpreter can be an NFA or a DFA - See a DFA-based driver and an example datastructure in Scanner.java. (Bonus points for every bug located in the example :-).

If you base your work on the given example, remember to document the internal working of the example as well.

Extension of input language

Extensions to set of accepted operators and language definition.

Expression	Matches	Example
r+	one-or-more	a+
r?	zero-or-one	a?
[c₁-c₂]	character ranges, starting from c₁ and ending to c₂	[a-z]
r{d₁-d₂}	counted instances, from d₁ to d₂ instances	a{3-9}
id₁ ::= r₁ id₂ ::= r₂	named macros	alphabet ::= [a-z] number ::= [0-9] character ::= (alphabet\|number)

Different handling of reserved words

Implementation of two ways, e.g. Scott's afterwards-recognition and encoding keywords into the scanner states, to handle overlaps in input language ranges -- e.g. to distinguish between cases of
identifier ::= a-z([a-z0-9]*)
and
keyword ::= "keyword".

Comparision in terms of memory usage and speed.

NFA versus DFA

In the old days, NFA-based regexp implementations were preferred due to their smaller memory consumption. Has the situation changed?

Implement two different implementations for the interpreter: an NFA-based one and a DFA-based one. Compare the implementations in terms of memory consumption (number of states) and interpretation speed with a representative survey of regexps.

Discuss different factors and implications of choosing one approach over another.

Visualization of the scanner algorithm

Design and implement a system for visualizing the interpretation of the given input regexp. Basic operations would be at least setting break points, single stepping execution, continuing execution (until next break point) and a way to observe the relationship between given regexp and input string.

Backreferences

Extend the input language with backreferences (e.g. \1, \2, ..\9), that can be used to receive the actual matched string in the nth part of the input regexp.

Java bytecode backend

Implementation of JVM bytecode generating scanner generator. Practically every regexp generator use an interpretative approach: they define a set of regular operators, which are then interpreted when given the input string (but see additional references as well).

This work topic explores another approach: compile the regexp straight into Java bytecode. Hopefully, that would fasten up the process, as the overhead of regexp operator interpretation would be avoided. Also, a simpler structure might give benefits when the compiled regexp meets JVM's dynamic compiler.

Implement this kind of a scanner and benchmark its results against e.g. the pattern matcher found in package java.util.regex (since 1.4). Discuss the relative pros and cons of your implementation against the Java's default implementation.

Resources

There are numerous books written about the JVM byte code. It is probably very hard to complete this task without a hand reference to one of these (see resources listing). If you're not familiar with the JVM in the general level, it is anyways a good idea to browse through this material.

A comparative review is available at JavaWorld.
A list of all JVM instructions is compiled by Jon Meyer, first author of the O'Reilly's book. It is available at http://cat.nyu.edu/~meyer/jvmref/.

Ken Thompson: Regular Expression Search Algorithm, Communications of the ACM, Volume 11, Issue 6 (June 1968). Online
Russ Cox: Implementing Regular Expressions. A clarifying presentation. Online
comp.compilers: tips for writing regular expression interpreter/compiler Online archive
A DFA-based scanner driver and data structure implementation in Java 5
How Regexes Work
Henry Spencer: A Regular-Expression Matcher, chapter 3 in Software Solutions in C, edited by Dale Schumacher, AP Professional, 1994.
The Java Virtual Machine Specification (Sunsoft)
Java Virtual Machine (O'Reilly)
Programming for the Java Virtual machine (Addison-Wesley)
Inside the Java Virtual Machine(McGraw-Hill)
Some chapters of this book are available online at here.

Pietu.Pohjalainen@cs.Helsinki.FI

Department of Computer Science

Department information

Laudatur course on Compilers
Exercise and laboratory assignment pages, Spring 2006

End-term project

Basic scanner implementation

Extension of input language

Different handling of reserved words

NFA versus DFA

Visualization of the scanner algorithm

Backreferences

Java bytecode backend

Resources

Department of Computer Science

Department information

Laudatur course on Compilers Exercise and laboratory assignment pages, Spring 2006

End-term project

Basic scanner implementation

Extension of input language

Different handling of reserved words

NFA versus DFA

Visualization of the scanner algorithm

Backreferences

Java bytecode backend

Resources

Laudatur course on Compilers
Exercise and laboratory assignment pages, Spring 2006