Custom version of Wikipedia content

Gazillion is a gaze-to-word mapping web browser for Wikipedia content. It provides a search interface that uses recorded eye-movement trajectories as implicit relevance feedback.

To map screen coordinates to individual words, the software needs each word to be wrapped in its own <span> tag. The span provides the bounding box (maximum extent on screen) of that word. Unfortunately, I have not found a better (easier) way to get bounding boxes for individual words from browser layout engines.

Here is an example of the formatting the browser accepts. Some 'stoplisted' words have been omitted.

<script src="gazillion.js"></script>
...
<p id="segment:1">
A
<span id="word:1">large</span>
<span id="word:2">block</span>
of
<span id="word:3">text</span>
<span id="word:4">should</span>
go
<span id="word:5">here</span>.
</p>

This page describes a hacked version of MediaWiki that serves wiki content in this format.
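To make the format concrete, here is a minimal sketch in JavaScript of how plain text could be turned into this markup. It is not the modified Parser.php; the function name wrapWords, its parameters, and the stoplist contents are assumptions made for illustration, and the sketch handles only ASCII words and does no HTML escaping.

```javascript
// Hypothetical helper: wraps every non-stoplisted word of a paragraph in
// <span id="word:n"> and the paragraph itself in <p id="segment:n">.
function wrapWords(text, segmentId, firstWordId, stoplist) {
  let nextId = firstWordId;
  const lines = text.split(/\s+/).filter(Boolean).map((token) => {
    // Separate the word from trailing punctuation such as "." or ",".
    const match = /^(\w+)(\W*)$/.exec(token);
    if (!match || stoplist.has(match[1].toLowerCase())) {
      return token; // stoplisted or non-word tokens stay untracked
    }
    const span = `<span id="word:${nextId}">${match[1]}</span>${match[2]}`;
    nextId += 1;
    return span;
  });
  return {
    html: `<p id="segment:${segmentId}">\n${lines.join("\n")}\n</p>`,
    nextWordId: nextId, // id to use for the first word of the next segment
  };
}
```

Calling wrapWords("A large block of text should go here.", 1, 1, new Set(["a", "of", "go"])) yields the paragraph markup shown in the example above.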

Javascript

The browser expects some JavaScript functions to be embedded in the page. This section describes the functions in the file gazillion.js.

On page load and on window resize, the browser-side JavaScript calls populateTrack(), which fills the trackWords and trackSegments arrays with the bounding boxes of words and segments. Currently, a span is tracked only if its node id has the form word:n, and a paragraph only if its id has the form segment:n. The client accesses this data through the WebBrowser control's interface.
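The collection step can be sketched roughly as follows. This is not the code of populateTrack() itself: the helper name collectBoxes and the field names id, left, top, right and bottom are assumptions, and the sketch uses modern JavaScript syntax rather than whatever the original script used.

```javascript
// Sketch only: gathers the bounding boxes of all elements whose id starts
// with the given prefix ("word:" or "segment:"). Field names are assumed.
function collectBoxes(elements, prefix) {
  const boxes = [];
  for (const el of elements) {
    if (!el.id || !el.id.startsWith(prefix)) continue; // not a tracked node
    const r = el.getBoundingClientRect();
    boxes.push({ id: el.id, left: r.left, top: r.top, right: r.right, bottom: r.bottom });
  }
  return boxes;
}

// In a browser this could be wired up roughly as:
//   trackWords = collectBoxes(document.getElementsByTagName("span"), "word:");
//   trackSegments = collectBoxes(document.getElementsByTagName("p"), "segment:");
// and run again on window load and resize.
```

Passing the element list in as a parameter keeps the helper independent of the document object, so it can be exercised with plain objects that provide id and getBoundingClientRect().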

The trackWords array contains the span ids and bounding boxes of the words to be tracked, in the following fields:

The browser calls these functions to determine the word or segment under x,y coordinates:
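Such a lookup amounts to a point-in-rectangle test over the tracked boxes. A minimal sketch, assuming each box is an object with the fields id, left, top, right and bottom; the name boxAt is hypothetical and is not one of the gazillion.js functions:

```javascript
// Hypothetical lookup: returns the id of the first box that contains the
// point (x, y), or null if the point lies outside every box.
function boxAt(boxes, x, y) {
  for (const b of boxes) {
    if (x >= b.left && x < b.right && y >= b.top && y < b.bottom) {
      return b.id;
    }
  }
  return null;
}
```

A linear scan is enough for a single page of text; the same function works for words and for segments, since both are stored as rectangles.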

The following JavaScript functions are used to highlight relevant words on the page:

Search functionality requires the page to contain search results in a data structure embedded in the page. Array searchResult contains search results in the following fields:

The following JavaScript functions are used to interface with search:

Installing MediaWiki

The following packages and versions were installed as prerequisites (on Ubuntu feisty):

mysql-server 5.0.38-0ubuntu1
apache2 2.2.3-3.2build1
libapache2-mod-php5 5.2.1-0ubuntu1.4
php5-mysql 5.2.1-0ubuntu1.4
sun-java6-jre 6-00-2ubuntu2

Create a MySQL database named wikidb and add a user wikiuser with permissions on that database.

Download mwdumper. This tool converts an XML-formatted Wikipedia dump into SQL inserts. It required Sun's Java to work correctly; otherwise it threw an exception while reading the dump XML file.

Download MediaWiki 1.10.1 from mediawiki.org. Extract the package to /var/www/wiki and give group ownership to the www-data group (on Debian-based systems). Point your browser to http://server/wiki/ to see your new wiki, and follow the installation instructions to complete LocalSettings.php. Once you have a working MediaWiki installation, extract the provided theme and the modified Parser.php into that directory. Then change the default skin (setting $wgDefaultSkin) in your LocalSettings.php file to read as follows:

$wgDefaultSkin = 'gazillion';

ParserFunctions and Cite are the bare minimum of extensions required for Wikipedia pages to render correctly. Both are available from the MediaWiki SVN repository. Copy them to /var/www/wiki/extensions and add the following lines to the end of your LocalSettings.php file:

require_once("$IP/extensions/ParserFunctions/ParserFunctions.php");
require_once("$IP/extensions/Cite/Cite.php");

Setting up Wikipedia content

Wikipedia database dumps in XML format are available from download.wikipedia.org. You will want the pages-articles.xml.bz2 file, which contains the most recent versions of the pages, for example enwiki-20070802-pages-articles.xml.bz2.

Run the following command to import the file into the wikidb database. NOTE: Importing a complete English-language Wikipedia dump took approximately a day.

$ java -server -jar mwdumper.jar --format=sql:1.5 \
        enwiki-20070802-pages-articles.xml.bz2 | mysql -u wikiuser wikidb

The following Java runtime options might help with importing speed:

java -Xmx512m -Xms128m -XX:NewSize=32m -XX:MaxNewSize=64m \
	-XX:SurvivorRatio=6 -XX:+UseParallelGC \
	-XX:GCTimeRatio=9 -XX:AdaptiveSizeDecrementScaleFactor=1

Search interface

The Gazillion browser sends the search text and the individual word fixations in an HTTP GET request. The server is defined in the File > Server Settings dialog of the client. The search text is sent right after the question mark. Leave the search text empty to request a suggest search, which uses only the word fixations. The number of fixations on each word and their total duration in msec are sent as additional GET parameters with the following syntax:

word=fixations,msec

For example, a search for the word pet where the user has looked at the word dog once for 100 msec and at the word cat twice for a total of 200 msec results in the following query string:

?pet&dog=1,100&cat=2,200
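A server that answers these requests has to take the query string apart again. The following sketch builds and parses query strings in this format; the function names buildQuery and parseQuery and the object fields word, count and msec are assumptions, as is the use of percent-encoding, since the escaping used by the client is not described here.

```javascript
// Builds "?<search text>&<word>=<fixations>,<msec>&..." from its parts.
// Assumption: words are percent-encoded with encodeURIComponent.
// An empty searchText yields "?&word=..."; whether the real client
// serializes a suggest search that way is not stated in this document.
function buildQuery(searchText, fixations) {
  const parts = [encodeURIComponent(searchText)];
  for (const f of fixations) {
    parts.push(`${encodeURIComponent(f.word)}=${f.count},${f.msec}`);
  }
  return "?" + parts.join("&");
}

// Parses such a query string back into the search text and the fixations.
function parseQuery(query) {
  const [first, ...rest] = query.replace(/^\?/, "").split("&");
  const fixations = rest.map((pair) => {
    const [word, value] = pair.split("=");
    const [count, msec] = value.split(",").map(Number);
    return { word: decodeURIComponent(word), count, msec };
  });
  return { searchText: decodeURIComponent(first), fixations };
}
```

With the example above, buildQuery("pet", [{ word: "dog", count: 1, msec: 100 }, { word: "cat", count: 2, msec: 200 }]) returns "?pet&dog=1,100&cat=2,200", and parseQuery recovers the same search text and fixation list from that string.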

Define your search results in an array named searchResult (see the section on Javascript) to show them in the search results list on the left-hand side of the browser.
