Tree language project work tools

On this page you'll find exact definitions for input and output formats, tools and references to tools, example code and links to papers necessary for completing a coding-oriented tree language course project work.

Input and output formats

You may assume that the character encoding of all input files is ISO-8859-1 and do not have to have support for other encodings.

Grammar

The tree-regular grammar will be specified in the following normalized format (in pseudo-BNF):

rule         :=  non-terminal \s* '->' \s* terminal (\s+ RE)? \s* '\n'
non-terminal :=  \w[\w\d._-]*
terminal     :=  \w[\w\d._-]*
RE           := any regular expression over non-terminals
                using * ? | , ( )

(\w: any letter, \d any digit, \s whitespace)

An example:

P_book  -> book P_frontmatter,P_mainmatter,P_backmatter
P_frontmatter   -> frontmatter
P_mainmatter    -> mainmatter (P_chapter,P_chapter*)
P_backmatter    -> backmatter
P_chapter       -> chapter (P_xref)*
P_xref  -> xref

Notes: terminals are not allowed in the regular expression, and you do not have to support the + quantifier. Since the alphabet of terminals is words, sequences are denoted by ,. Each rule ends with an end-of-line \n. Each terminal is represented by a rule with no regular expression. Character content and mixed content are effectively ignored.

The start expression r_0 is not explicitly given. Given grammar rules G_1 ... G_n in the form above, with the first rule G_1 being X -> a r_1 you may assume r_0 = X. An alternative approach is to look at the name of the root of the document and construct an r_0 that contains as alternatives all non-terminals with that label.

Queries

Queries are grammars with a number of non-terminals on the right-hand side marked with a #. An example:

P_book  -> book P_frontmatter,P_mainmatter,P_backmatter
P_frontmatter   -> frontmatter
P_mainmatter    -> mainmatter (#P_chapter,P_chapter*)
P_backmatter    -> backmatter
P_chapter       -> chapter (P_xref)*
P_xref  -> xref

is a query whose result set is the first chapter of the book.

Test cases

There will be a formal set of test cases with expected outcomes available. For the moment, you can see some example grammars and documents. Note that some of them are not meant to be valid.

Command-line syntax and output of a recognizer

A recognizer is a single runnable program (or script) that can be run with a command line PROGRAM name_of_grammar_file name_of_xml_file. It may produce any output on stdout and stderr, but must return a status code such that status 0 means that the input document matches the grammar given, 1 means that it does not match and any other status code means that the program encountered an error.

Example code

Regular-expression parser

Building a parser for regular expressions is not hard, but takes some time and is not the focus of this course. Below are the relevant bits of a lexer and parser for the regular-expressions used in the grammar above. They should enable you to implement such a parser in your favorite language. If you are totally unfamiliar with C++ and cannot read the code, drop me a line and I'll write some pseudocode (and for those of you that are, yes, it will leak memory when encountering an error).

Lexer

token_val get_token(input_if<charT>& buf) {
	for(;;) {
		charT c=buf.get();
		if (c==0 || isequal(c, '\n')) return token_val(END, "");
		if (isspace(c)) continue;
		if (isequal(c, '(')) return token_val(OPENBRACE, "");
		if (isequal(c, ')')) return token_val(CLOSEBRACE, "");
		if (isequal(c, '?')) return token_val(QUESTION, "");
		if (isequal(c, ',')) return token_val(COMMA, "");
		if (isequal(c, '*')) return token_val(STAR, "");
		if (isequal(c, '|')) return token_val(BAR, "");
		if (isalpha(c)) {
			typename re_traits<charT>::string name;
			name.push_back(c);
			while ( isnamechar(c=buf.get()) ) {
				name.push_back(c);
			}
			buf.put(c);
			return token_val(NAME, name);
		}
		typename re_traits::string tok; tok.push_back(c);
		return token_val(ERROR, tok);
	}
}

Parser

This is a straight-forward, hand-written recursive-descent parser. Probably not the prettiest way to implement it, but simple.

retype* parse() {
	lookahead=_lexer.get_token(_buf);
	retype* ret=seq();
	if (lookahead.first!=lexertype::END) {
		error();
	}
	if (ret) return ret;
	return new empty<charT>(_mgr);
}
void match(typename lexertype::token t) {
	if (lookahead.first==t) {
		lookahead=_lexer.get_token(_buf);
	} else {
		error();
	}
}
retype* single() {
	retype* ret=0;
	if (lookahead.first==lexertype::OPENBRACE) {
		match(lexertype::OPENBRACE);
		ret=seq();
		match(lexertype::CLOSEBRACE);
	} else if (lookahead.first==lexertype::NAME) {
		ret=new identifier<charT>(_mgr, lookahead.second);
		match(lexertype::NAME);
	}
	if (ret) {
		if (lookahead.first==lexertype::STAR) {
			match(lexertype::STAR);
			ret=new star<charT>(_mgr, ret);
		} else if (lookahead.first==lexertype::QUESTION) {
			match(lexertype::QUESTION);
			ret=new optional<charT>(_mgr, ret);
		}
	}
	if (!ret) {
		error();
		return 0;
	} else {
		return ret;
	}
}
retype* single_or_choice() {
	retype* first=single();
	choice<charT>* c=0;
	while (lookahead.first==lexertype::BAR) {
		if (!c) {
			c=new choice<charT>(_mgr);
			c->add_child(first);
		}
		match(lexertype::BAR);
		c->add_child(single());
	}
	if (c) return c;
	return first;
}
		
retype* seq() {
	if (lookahead.first==lexertype::END) {
		return 0;
	}
	retype* first=single_or_choice();
	sequence<charT>* s=0;
	while (lookahead.first==lexertype::COMMA) {
		if (!s) {
			s=new sequence<charT>(_mgr);
			s->add_child(first);
		}
		match(lexertype::COMMA);
		s->add_child(single_or_choice());
	}
	if (s) return s;
	return first;
}

Tools

XML Parsing

There are XML parsers for all languages available. If you are not familiar with them, here are some usage examples with libraries and languages available at the department:

I can help you get started with the XML parsing side if you are having problems.

DTD-to-grammar

This Simple perl script will convert a DTD to our grammar format. It should be run agains an XML file that uses a DTD, not directly against a DTD.


Last updated by Mika Raento, mikie (at) iki.fi on 2004-10-21