For redescription mining, one considers entities described by variables divided into two sets, hereafter arbitrarily called the left-hand side and the right-hand side. This can be seen as a pair of data matrices, where entities are identified with rows and variables with columns. Both sets of variables describe the same entities; hence, the matrices have the same number of rows.
In Siren, data include:
Data can be imported to Siren via the interface menu File → Import → Import Data, either as a set of separate files (Menu: Import from separate files) or as a single XML file (Menu: Import from XML file). Below, we present the data formats supported by Siren.
In case the data is imported as a set of files, the program expects:
Right-hand side and left-hand side variables are stored in separate files. Unlike the single XML data import mode, this import mode does not allow for mixed types of variables. Several formats are available:
entity_id [SPC] variable_id [SPC] value
To make sure no empty entity or variable is omitted, the first row must be
nb_of_entities [SPC] nb_of_variables [SPC] 0
This corresponds roughly to the Matlab representation of sparse matrices.
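The sparse layout described above can be illustrated with a short reader. This is only a minimal sketch, not Siren's own parser: the function name `parse_sparse` is hypothetical, fields are assumed to be whitespace-separated, and indices are assumed to start at 1 as in Matlab (the `one_based` flag exists precisely because the text does not state this).

```python
def parse_sparse(lines, one_based=True):
    """Build a dense row-major matrix from 'entity variable value' triples.

    Assumptions of this sketch: whitespace-separated fields and, by default,
    1-based indices as in Matlab.
    """
    triples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entity, variable, value = line.split()
        triples.append((int(entity), int(variable), float(value)))

    # The first row carries the dimensions, with a zero value.
    nb_entities, nb_variables, _ = triples[0]
    offset = 1 if one_based else 0

    matrix = [[0.0] * nb_variables for _ in range(nb_entities)]
    for entity, variable, value in triples[1:]:
        matrix[entity - offset][variable - offset] = value
    return matrix


# Three entities, two variables, two non-zero entries.
example = ["3 2 0", "1 1 1", "3 2 1"]
dense = parse_sparse(example)
```

Because the first row fixes the dimensions, the second entity, which has no non-zero entry, still gets a row in the result.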
Variable type | dense | sparse | dat |
---|---|---|---|
1. Boolean | .densebool | .sparsebool | .datbool |
2. categorical | .densecat | - | - |
3. numerical | .densenum | .sparsenum | - |
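The combinations in the table can be written down as a lookup from file extension to format and variable type. The sketch below is illustrative only; `SUPPORTED` and `describe_data_file` are hypothetical names, not part of Siren.

```python
import os

# Extension -> (format, variable type), transcribed from the table above.
SUPPORTED = {
    ".densebool": ("dense", "Boolean"),
    ".sparsebool": ("sparse", "Boolean"),
    ".datbool": ("dat", "Boolean"),
    ".densecat": ("dense", "categorical"),
    ".densenum": ("dense", "numerical"),
    ".sparsenum": ("sparse", "numerical"),
}


def describe_data_file(filename):
    """Return the (format, variable type) pair implied by the extension."""
    extension = os.path.splitext(filename)[1]
    if extension not in SUPPORTED:
        raise ValueError("unsupported data file extension: %r" % extension)
    return SUPPORTED[extension]
```

For instance, `describe_data_file("some_path/LHS_file.densenum")` yields the dense numerical combination, while an extension such as `.sparsecat`, which the table marks as unavailable, raises an error.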
Each entity should be located by a pair of coordinates (i.e. latitude and longitude). Coordinates are stored one entity per line, with the two coordinates separated by a comma. If the number of coordinates found in the coordinates file does not match the number of entities, the coordinates will be omitted.
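As an illustration of the coordinates format, the following sketch reads one pair per line and drops the coordinates altogether when the count does not match. The function name `read_coordinates` is hypothetical, and returning `None` to signal omitted coordinates is a choice of this sketch, not necessarily how Siren represents it.

```python
def read_coordinates(lines, nb_entities):
    """Parse 'x,y' pairs, one entity per line.

    Returns a list of (float, float) pairs, or None when the number of
    pairs does not match the number of entities.
    """
    coordinates = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        first, second = line.split(",")
        coordinates.append((float(first), float(second)))

    if len(coordinates) != nb_entities:
        return None  # mismatch: coordinates are omitted
    return coordinates
```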
Finally, the name of the file containing the names of variables is obtained from the corresponding data filename by replacing its extension with .names. For example, if the LHS variables data file is some_path/LHS_file.densenum, the program expects to find the LHS variable names in a file called some_path/LHS_file.names. Names of variables are stored one variable per line. Names may contain spaces and Unicode characters but no equality or comparison signs (i.e. =, >, <). If the number of names found in the names file does not match the number of variables on the given side, the names will be omitted.
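The naming convention and the constraints on names can be sketched as follows. The helper names are hypothetical. Raising an error on a forbidden character is an assumption of this sketch, since the text does not say how the program reacts in that case.

```python
import os

FORBIDDEN = set("=<>")  # equality and comparison signs


def names_filename(data_filename):
    """Replace the data file extension with .names."""
    root, _ = os.path.splitext(data_filename)
    return root + ".names"


def check_names(names, nb_variables):
    """Return the names, or None when their count does not match."""
    if len(names) != nb_variables:
        return None  # mismatch: names are omitted
    for name in names:
        if FORBIDDEN & set(name):
            # Assumption of this sketch: reject rather than silently repair.
            raise ValueError("forbidden character in name: %r" % name)
    return names
```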
Here is an example of a simple XML data file. In this example, the data consist of two Boolean variables on the left-hand side, one categorical variable and two numerical variables on the right-hand side, describing a total of ten geolocated entities.
Each variable is stored between variable tags, with its name, type_id, number of entities (the number of entities should match across all variables on both sides), etc.
Inside a Siren package, the data is stored as a single XML file.
The product of redescription mining is a list of redescriptions. A redescription consists of a pair of queries over the variables describing the entities, one query for each set. The two sets of variables are arbitrarily called left-hand side and right-hand side, and so are the corresponding queries.
The support of a query is the set of entities for which the query holds. Any given redescription partitions the entities into four sets:
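Since each of the two queries either holds or does not hold for a given entity, the four sets correspond to the four possible combinations. A minimal sketch using Python sets, where the function name and the names of the four parts are hypothetical:

```python
def partition(entities, support_left, support_right):
    """Split entities according to which of the two queries hold."""
    entities = set(entities)
    support_left = set(support_left)
    support_right = set(support_right)

    left_only = support_left - support_right    # only the left query holds
    right_only = support_right - support_left   # only the right query holds
    both = support_left & support_right         # both queries hold
    neither = entities - (support_left | support_right)  # neither holds
    return left_only, right_only, both, neither


parts = partition(range(6), {0, 1, 2}, {2, 3})
```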
Redescriptions can be imported to Siren via the interface menu File → Import → Import Redescriptions. More importantly, they can be exported via the interface menu File → Export Redescriptions. Below, we present the redescription formats supported by Siren.
A query is formed by combining literals using Boolean operators.
For several reasons, Siren evaluates queries from left to right, irrespective of operator precedence. In other words, it supports only queries that can be parsed in linear order, without trees. For example, (a ∧ b) ∨ ¬c is supported, but (a ∧ b) ∨ (c ∧ d) is not. Parentheses delimiting groups of literals combined with the same operator can be added to ease readability.
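To make the left-to-right rule concrete, the sketch below folds a sequence of already evaluated literals one operator at a time. It illustrates the evaluation order only and is not Siren's implementation; `evaluate_linear` is a hypothetical name.

```python
def evaluate_linear(first, rest):
    """Evaluate a query strictly from left to right.

    first: truth value of the first literal.
    rest:  list of (operator, truth value) pairs, operator in {"AND", "OR"}.
    """
    result = first
    for operator, value in rest:
        if operator == "AND":
            result = result and value
        elif operator == "OR":
            result = result or value
        else:
            raise ValueError("unknown operator: %r" % operator)
    return result


# a OR b AND c with a=True, b=False, c=False
linear = evaluate_linear(True, [("OR", False), ("AND", False)])
usual = True or False and False  # Python gives AND precedence over OR
```

Here `linear` is computed as (a ∨ b) ∧ c and is false, whereas `usual` follows conventional precedence, a ∨ (b ∧ c), and is true.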
We consider three types of literals, defined over a Boolean, categorical or numerical variable respectively.
Below is an informal grammar of Siren's query language; parentheses denote optional elements. The preferred syntax for editing queries is marked in bold.
Element | Symbol | Definition |
---|---|---|
conjunction operator | AND | &, ∧, \land |
disjunction operator | OR | |, ∨, \lor |
operator | OP | AND, OR |
negation | NEG | !, ¬, \neg |
variable | VAR | integer, name |
category | CAT | integer |
interval bound | IBD | float with at least one decimal precision |
less-than sign | LEQ | <, ≤, \leq{} |
Boolean literal | BLIT | (NEG) VAR |
categorical literal | CLIT | (NEG) VAR = CAT |
 | | VAR (\not)\in CAT |
 | | VAR ∈ CAT |
 | | VAR ∉ CAT |
numerical literal | NLIT | (NEG) [ (IBD LEQ) VAR (LEQ IBD) ] |
 | | (NEG) VAR (> IBD) (< IBD) |
literal | LIT | BLIT, CLIT, NLIT |
query | QRY | LIT (OP LIT)* |
Naturally, the type of literal and the type of variable should match, i.e., [4.0 < Va < 8.32] is a valid numerical literal only if the corresponding variable Va is a numerical variable. Furthermore, the upper bound of a numerical literal should always be greater than or equal to the lower bound, and at least one of the two bounds should be specified.
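The two constraints on the bounds of a numerical literal can be checked as in the following sketch. The function name is hypothetical, and a missing bound is represented by `None`, which is a convention of this sketch only.

```python
def check_numerical_literal(lower=None, upper=None):
    """Validate the bounds of a numerical literal.

    A bound left unspecified is passed as None.
    """
    if lower is None and upper is None:
        raise ValueError("at least one bound must be specified")
    if lower is not None and upper is not None and upper < lower:
        raise ValueError("upper bound must not be smaller than lower bound")
    return (lower, upper)


two_sided = check_numerical_literal(4.0, 8.32)   # [4.0 < Va < 8.32]
one_sided = check_numerical_literal(lower=4.0)   # only a lower bound
```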
The statistics of a redescription include:
Redescriptions can be exported in three formats, determined by the extension of the provided filename:
Both the .queries and .xml formats can be imported into Siren; .tex cannot.
Inside a Siren package, the redescriptions are stored in an XML file together with their display order and disabled status.