Laudatur course on Compilers
Exercise and laboratory assignment pages, Spring 2006
End-term project
End-term project, deadline May 8th at 2100 hours.
The end-term project is a larger work. Instructions for the
project will be given at tenth exercise session, April 24th.
The project work is to design and implement a scanner generator for compilers written in Java. Upon a prior agreement, projects implemented in C++ can be accepted. The work can be done either individually or in groups of 2-3 students. The grading is based on scope and quality of work: larger scope gains more maximum points and better quality gives more actual points.
Documentation
The project work requires that technical documentation of about
5-10 pages is included. The structure of the document follows
scientific format: Abstract, introduction, 2-3 chapters of
discussion, conclusions and references. The document should give
a good overview of the chosen problem, discuss and problematize
it, and finally give a conclusion of some sort.
Grading
Each star in the difficulty rating of a task represents 10% of
maximum points. So, a flawless work with combined difficulty
rating of 10 will give the maximum points. When the work implements
multiple parts, the difficulty rating is calculated by applying
the plus operator to amount of stars in each part. (See also How to talk
like a computer scientist).
When working in groups, a group size of three persons gives up to 80% of the points represented by difficulty rating. Individual work boosts the rating up to 120%.
Project part description | Max points | Status |
---|---|---|
Basic scanner implementation | Mandatory | |
Extension of input language | Optional | |
Reserved words | Optional | |
Nondeterministic and Deterministic FA's | Optional | |
Visualization of execution | Optional | |
Support for backreferences | Optional | |
Code generation | Optional |
Basic scanner implementation
Interpreter for basic regexps - support for minimal set of regular operators (repetition, catenation, alternation, and grouping).
The input language matches as follows:
Expression | Matches | Example |
---|---|---|
c | any non-operator character c | a |
\c | character c literally | \* |
r* | zero or more r's | a* |
r1r2 | r1 then r2 | ab |
r1|r2 | r1 or r2 | a|b |
(r) | r | (a|b) |
The interpreter can be an NFA or a DFA - See a DFA-based driver and an example datastructure in Scanner.java. (Bonus points for every bug located in the example :-).
If you base your work on the given example, remember to document the internal working of the example as well.
Extension of input language
Extensions to set of accepted operators and language definition.
Expression | Matches | Example |
---|---|---|
r+ | one-or-more | a+ |
r? | zero-or-one | a? |
[c1-c2] | character ranges, starting from c1 and ending to c2 | [a-z] |
r{d1-d2} | counted instances, from d1 to d2 instances | a{3-9} |
id1 ::= r1 id2 ::= r2 |
named macros |
alphabet ::= [a-z] number ::= [0-9] character ::= (alphabet|number) |
Different handling of reserved words
Implementation of two ways, e.g. Scott's afterwards-recognition
and encoding keywords into the scanner states, to handle
overlaps in input language ranges -- e.g. to distinguish between
cases of
identifier ::= a-z([a-z0-9]*)
and
keyword ::= "keyword".
Comparision in terms of memory usage and speed.
NFA versus DFA
In the old days, NFA-based regexp implementations were preferred due to their smaller memory consumption. Has the situation changed?
Implement two different implementations for the interpreter: an NFA-based one and a DFA-based one. Compare the implementations in terms of memory consumption (number of states) and interpretation speed with a representative survey of regexps.
Discuss different factors and implications of choosing one approach over another.
Visualization of the scanner algorithm
Design and implement a system for visualizing the interpretation of the given input regexp. Basic operations would be at least setting break points, single stepping execution, continuing execution (until next break point) and a way to observe the relationship between given regexp and input string.
Backreferences
Extend the input language with backreferences (e.g. \1, \2, ..\9), that can be used to receive the actual matched string in the nth part of the input regexp.
Java bytecode backend
Implementation of JVM bytecode generating scanner generator. Practically every regexp generator use an interpretative approach: they define a set of regular operators, which are then interpreted when given the input string (but see additional references as well).
This work topic explores another approach: compile the regexp straight into Java bytecode. Hopefully, that would fasten up the process, as the overhead of regexp operator interpretation would be avoided. Also, a simpler structure might give benefits when the compiled regexp meets JVM's dynamic compiler.
Implement this kind of a scanner and benchmark its results against e.g. the pattern matcher found in package java.util.regex (since 1.4). Discuss the relative pros and cons of your implementation against the Java's default implementation.
Resources
There are numerous books written about the JVM byte code. It is probably very hard to complete this task without a hand reference to one of these (see resources listing). If you're not familiar with the JVM in the general level, it is anyways a good idea to browse through this material.
A comparative review is available at JavaWorld.
A list of all JVM instructions is compiled by Jon Meyer, first
author of the O'Reilly's book. It is available at
http://cat.nyu.edu/~meyer/jvmref/.
- Ken Thompson: Regular Expression Search Algorithm, Communications of the ACM, Volume 11, Issue 6 (June 1968). Online
- Russ Cox: Implementing Regular Expressions. A clarifying presentation. Online
- comp.compilers: tips for writing regular expression interpreter/compiler Online archive
- A DFA-based scanner driver and data structure implementation in Java 5
- How Regexes Work
- Henry Spencer: A Regular-Expression Matcher, chapter 3 in Software Solutions in C, edited by Dale Schumacher, AP Professional, 1994.
- The Java Virtual Machine Specification (Sunsoft)
- Java Virtual Machine (O'Reilly)
- Programming for the Java Virtual machine (Addison-Wesley)
-
Inside
the Java Virtual Machine(McGraw-Hill)
Some chapters of this book are available online at here.
Pietu.Pohjalainen@cs.Helsinki.FI