Formats

Data formats

Multiple data files

When the data is imported as a set of files, the program expects:

Variables formats:

Right-hand side and left-hand side variables are stored in separate files. Unlike the single XML data import mode, this import mode does not allow mixing variable types within one side. Several formats are available:

The format is determined from the filename extension. The table below lists the extensions; - means the format is not available for the given variable type:
                 dense        sparse        dat
1. Boolean       .densebool   .sparsebool   .datbool
2. categorical   .densecat    -             -
3. numerical     .densenum    .sparsenum    -
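
As a rough illustration of the table above, the lookup from extension to storage format and variable type could be sketched as follows. The dictionary contents come from the table; the function name describe_data_file and its return shape are hypothetical and not part of Siren.

```python
import os

# Extension table copied from the documentation above.
# The names EXTENSIONS and describe_data_file are illustrative only.
EXTENSIONS = {
    ".densebool": ("dense", "Boolean"),
    ".sparsebool": ("sparse", "Boolean"),
    ".datbool": ("dat", "Boolean"),
    ".densecat": ("dense", "categorical"),
    ".densenum": ("dense", "numerical"),
    ".sparsenum": ("sparse", "numerical"),
}


def describe_data_file(filename):
    """Return (storage format, variable type) for a data filename."""
    _, ext = os.path.splitext(filename)
    if ext not in EXTENSIONS:
        # Combinations marked "-" in the table end up here.
        raise ValueError("unsupported data file extension: %r" % ext)
    return EXTENSIONS[ext]
```

For instance, describe_data_file("some_path/LHS_file.densenum") yields the pair ("dense", "numerical"), while a filename ending in .sparsecat is rejected because the table has no sparse format for categorical variables.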

Coordinates format:

Each entity should be located by a pair of coordinates (i.e. latitude and longitude). Coordinates are stored one entity per line, with the two coordinates separated by a comma. If the number of coordinates found in the coordinates file does not match the number of entities, the coordinates will be omitted.
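
The rule above can be sketched as a small reader. This is only an assumption-laden illustration: the function name read_coordinates is hypothetical, blank lines are skipped by choice, and returning None stands in for "coordinates are omitted".

```python
def read_coordinates(lines, nb_entities):
    """Parse one 'x,y' coordinate pair per line.

    Returns a list of (float, float) pairs, or None when the number of
    pairs does not match nb_entities (the coordinates are then dropped).
    """
    coords = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        first, second = line.split(",")
        coords.append((float(first), float(second)))
    if len(coords) != nb_entities:
        return None
    return coords
```

Passing an open file object as lines works as well, since the function only iterates over it line by line.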

Names:

Finally, the name of the file containing the variable names is obtained from the corresponding data filename by replacing its extension with .names. For example, if the LHS variables data file is some_path/LHS_file.densenum, the program expects to find the LHS variable names in a file called some_path/LHS_file.names. Names of variables are stored one variable per line. Names may contain spaces and unicode characters but no equality or comparison signs (i.e. =, >, <). If the number of names found in the names file does not match the number of variables on the given side, the names will be omitted.
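
A minimal sketch of the naming convention and the checks described above, for illustration only. The helper names names_path and check_names are hypothetical, and raising an error on a forbidden character is an assumption: the documentation does not say how Siren reacts to such names.

```python
import os

FORBIDDEN_CHARS = "=<>"  # equality and comparison signs


def names_path(data_filename):
    """Replace the data file extension with .names."""
    root, _ = os.path.splitext(data_filename)
    return root + ".names"


def check_names(names, nb_variables):
    """Return the names, or None when their count does not match.

    Assumption: a name containing a forbidden character raises an error.
    """
    if len(names) != nb_variables:
        return None  # names are omitted
    for name in names:
        if any(ch in name for ch in FORBIDDEN_CHARS):
            raise ValueError("forbidden character in name: %r" % name)
    return names
```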

Single XML file

Here is an example of a simple XML data file. In this example, the data consist of two Boolean variables on the left-hand side, one categorical variable and two numerical variables on the right-hand side, describing a total of ten geolocated entities.

Each variable is stored between variable tags, with its name, type_id, number of entities (the number of entities should match across all variables on both sides), etc.

  1. For Boolean variables (cf. A_red and B_red in the example above) the ids of entities for which the variable holds true are separated by commas and stored between rows tags (cf. dat format).
  2. For categorical variables (cf. A_blue in the example above) the categories of the entities are separated by commas and stored in order between values tags (cf. dense format).
  3. Numerical variables can be stored either in sparse or dense format, depending on convenience. The format used should be indicated using store_type tags.

Inside a siren package, the data is stored as a single XML file.

Redescriptions formats

Queries

A query is formed by combining literals using Boolean operators.

For several reasons, Siren evaluates queries from left to right, irrespective of operator precedence. In other words, it supports only queries that can be parsed in linear order, without trees. For example, (a ∧ b) ∨ ¬c is supported, but (a ∧ b) ∨ (c ∧ d) is not. Parentheses delimiting groups of literals combined with the same operator can be added to ease readability.
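
To make the left-to-right rule concrete, here is a minimal sketch that folds the truth values of a query's literals in linear order. It is not Siren's implementation: the function name evaluate_left_to_right is hypothetical, and the literals are assumed to be already evaluated (negations included) for one entity.

```python
def evaluate_left_to_right(values, operators):
    """Combine literal truth values strictly from left to right.

    values: list of booleans, one per literal.
    operators: list of "AND" / "OR", one between each consecutive pair.
    """
    if len(operators) != len(values) - 1:
        raise ValueError("need exactly one operator between two literals")
    result = values[0]
    for op, value in zip(operators, values[1:]):
        if op == "AND":
            result = result and value
        elif op == "OR":
            result = result or value
        else:
            raise ValueError("unknown operator: %r" % op)
    return result
```

With a true, b false and c false, the query a ∨ b ∧ c is read as (a ∨ b) ∧ c and evaluates to false, whereas the usual precedence rules would read it as a ∨ (b ∧ c) and yield true.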

We consider three types of literals, defined over a Boolean, categorical or numerical variable respectively.

Below is an informal grammar of Siren's query language; parentheses denote optional elements. The preferred syntax for editing queries is marked with bold.

conjunction operator   AND    &, ∧, \land
disjunction operator   OR     |, ∨, \lor
operator               OP     AND, OR
negation               NEG    !, ¬, \neg
variable               VAR    integer, name
category               CAT    integer
interval bound         IBD    float with at least one decimal precision
less-than sign         LEQ    <, ≤, \leq{}
Boolean literal        BLIT   (NEG) VAR
categorical literal    CLIT   (NEG) VAR = CAT
                              VAR (\not)\in CAT
                              VAR ∈ CAT
                              VAR ∉ CAT
numerical literal      NLIT   (NEG) [ (IBD LEQ) VAR (LEQ IBD) ]
                              (NEG) VAR (> IBD) (< IBD)
literal                LIT    BLIT, CLIT, NLIT
query                  QRY    LIT (OP LIT)*

Naturally, the type of literal and the type of variable should match, i.e., [4.0 < Va < 8.32] is a valid numerical literal only if the corresponding variable Va is a numerical variable. Furthermore, the upper bound of a numerical literal should always be greater than or equal to the lower bound, and at least one of the two bounds should be specified.
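
The two constraints on the bounds of a numerical literal can be expressed as a short check. This is a sketch under assumptions: the function name check_numerical_bounds is hypothetical, and a missing bound is represented by None.

```python
def check_numerical_bounds(lower=None, upper=None):
    """Validate the bounds of a numerical literal.

    At least one bound must be given, and when both are given the
    upper bound must be greater than or equal to the lower bound.
    """
    if lower is None and upper is None:
        raise ValueError("at least one bound must be specified")
    if lower is not None and upper is not None and upper < lower:
        raise ValueError("upper bound is smaller than lower bound")
    return (lower, upper)
```

For the literal [4.0 < Va < 8.32], the call check_numerical_bounds(4.0, 8.32) passes, and so does a one-sided literal such as check_numerical_bounds(lower=4.0).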

Redescription statistics

The statistics of a redescription include:

Exporting Redescriptions

Redescriptions can be exported in three formats, determined by the extension of the provided filename:

When exporting redescriptions in the latter two formats, disabled redescriptions will not be printed.

Importing Redescriptions

Both the .queries and .xml formats can be imported into Siren; .tex cannot.

Inside a siren package, the redescriptions are stored in an XML file together with display order and disabled status.


Siren --- Last modified: Wed Aug 1 2012, galbrun@cs.helsinki.fi