Note
For redescription mining, one considers entities described by variables divided into two sets, hereafter arbitrarily called left-hand side and right-hand side. This can be seen as a pair of data matrices, where entities are identified with rows and variables with columns. Both sets of variables describe the same entities; hence, the matrices have the same number of rows.
In Siren, data include:
Obviously, this is required.
Data can be imported to Siren via the interface menu File ‣ Import ‣ Import Data. Below, we present the data formats supported by Siren.
Data can be imported into Siren as CSV files. The program expects a pair of files, one for each side, in character-separated values format, as can be imported into and exported from spreadsheet programs, for instance.
There are two main formats,
The two data files need not be in the same format.
If entity names and/or coordinates are provided, they will be used to match entities across the two sides. Otherwise, rows will be matched in order, and an error will occur if the two sides do not contain the same number of rows.
The data is stored as a table with one column for each variable and one row for each entity. The first row can contain the names of the variables. The entity names can be included as a column named id. Similarly, the coordinates can be included as a pair of columns named longitude and latitude, respectively.
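As a minimal sketch of the full format, the following Python snippet reads two small tables with the standard csv module and matches rows on the id column. The sample tables, the comma delimiter, the approximate coordinates, the coastal variable and the helper name read_full are all assumptions made for illustration; this is not Siren's own loader.

```python
import csv
import io

# Hypothetical sample data for the two sides (coordinates are approximate).
LEFT = (
    "id,longitude,latitude,population\n"
    "Espoo,24.66,60.21,260981\n"
    "Turku,22.27,60.45,182281\n"
)
RIGHT = (
    "id,coastal\n"
    "Turku,1\n"
    "Espoo,1\n"
)

def read_full(text, delimiter=","):
    """Read a full-format table into {entity id: {column name: value}}."""
    rows = {}
    for record in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        entity = record.pop("id")  # the id column identifies the entity
        rows[entity] = record
    return rows

left = read_full(LEFT)
right = read_full(RIGHT)

# Entities are matched by id, so the row order may differ between files.
matched = sorted(set(left) & set(right))
for entity in matched:
    print(entity, left[entity]["population"], right[entity]["coastal"])
```

Because the rows are keyed by id, the two files can list the entities in different orders, as in this sample.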
This format allows storing data that contains few non-zero entries more compactly, as in the Matlab sparse format (or like the edge list of a bipartite graph).
Each line contains an entry of the data as a triple (entity, variable, value). This way, the data is stored in three columns and as many rows as there are entries. In this case, the first line of the data file must contain id, cid and value, indicating the three columns containing the entities, variables and corresponding values, respectively. Coordinates can be provided in a similar way under the variable names longitude and latitude.
Variable names can be provided inline, that is, simply by using the name of the variable for each entry involving it. Alternatively, variable names can be specified separately with a special “-1” entity. Similarly, entity names can be provided inline or separately with a special “-1” variable. For example, the following four lines
Espoo; population; 260981
Helsinki; population; 614074
Tampere; population; 220609
Turku; population; 182281
are equivalent to the following:
20; -1; Espoo
7; -1; Tampere
2; -1; Turku
13; -1; Helsinki
-1; 3; population
2; 3; 182281
7; 3; 220609
13; 3; 614074
20; 3; 260981
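The equivalence of the two listings can be checked with a short Python sketch. The function name parse_sparse and the dictionary it returns are assumptions made for illustration, not Siren's parser, and the header line (id, cid, value) is omitted, as in the listings above.

```python
def parse_sparse(lines):
    """Parse sparse-format lines into {(entity name, variable name): value}."""
    entity_names, variable_names, entries = {}, {}, []
    for line in lines:
        entity, variable, value = [field.strip() for field in line.split(";")]
        if variable == "-1":
            entity_names[entity] = value      # special "-1" variable: names an entity
        elif entity == "-1":
            variable_names[variable] = value  # special "-1" entity: names a variable
        else:
            entries.append((entity, variable, value))
    # Identifiers without a separate name entry are treated as inline names.
    return {
        (entity_names.get(e, e), variable_names.get(v, v)): int(x)
        for e, v, x in entries
    }

inline = [
    "Espoo; population; 260981",
    "Helsinki; population; 614074",
    "Tampere; population; 220609",
    "Turku; population; 182281",
]
separate = [
    "20; -1; Espoo",
    "7; -1; Tampere",
    "2; -1; Turku",
    "13; -1; Helsinki",
    "-1; 3; population",
    "2; 3; 182281",
    "7; 3; 220609",
    "13; 3; 614074",
    "20; 3; 260981",
]

print(parse_sparse(inline) == parse_sparse(separate))  # prints True
```

Names are resolved only after all lines have been read, so the naming entries may appear before or after the data entries.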
Finally, in the case of fully Boolean data without coordinates, the value can be left out. Each (entity, variable) pair that appears is considered True, the rest False.
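As a small illustration of the Boolean case, the sketch below expands a list of (entity, variable) pairs into a dense Boolean table. The entity and variable names and the helper boolean_matrix are hypothetical.

```python
def boolean_matrix(pairs, entities, variables):
    """Expand (entity, variable) pairs into rows of True/False values."""
    present = set(pairs)
    # A listed pair is True; every pair that is not listed is False.
    return [[(e, v) in present for v in variables] for e in entities]

pairs = [("e1", "a"), ("e2", "b"), ("e3", "a"), ("e3", "b")]
matrix = boolean_matrix(pairs, ["e1", "e2", "e3"], ["a", "b"])
for row in matrix:
    print(row)
```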
Note
The product of redescription mining is a list of redescriptions. A redescription consists of a pair of queries over the variables describing the entities, one query for each set. The two sets of variables are arbitrarily called left-hand side and right-hand side, and so are the corresponding queries.
The support of a query is the set of entities for which the query holds. Any given redescription partitions the entities into four sets:
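These four sets follow directly from the two supports: entities supporting only the left-hand side query, only the right-hand side query, both queries, or neither. A minimal sketch with Python sets, where the entity identifiers and the dictionary keys are hypothetical names chosen for illustration:

```python
def partition(entities, supp_left, supp_right):
    """Split entities into four sets according to the two query supports."""
    entities = set(entities)
    return {
        "left_only": supp_left - supp_right,    # left query holds, right does not
        "right_only": supp_right - supp_left,   # right query holds, left does not
        "both": supp_left & supp_right,         # both queries hold
        "neither": entities - (supp_left | supp_right),
    }

parts = partition(range(1, 7), supp_left={1, 2, 3}, supp_right={2, 3, 4})
for name, members in parts.items():
    print(name, sorted(members))
```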
Redescriptions can be imported to Siren via the interface menu File ‣ Import ‣ Import Redescriptions. More importantly, they can be exported via the interface menu File ‣ Export Redescriptions. Below, we present the redescription formats supported by Siren.
A query is formed by combining literals using Boolean operators.
While ReReMi only generates linearly parsable queries (see references for more details), Siren can actually evaluate arbitrary queries, as long as they are well formed following the informal grammar below. In particular, parentheses should be used to separate conjunctive blocks and disjunctive blocks, alternating between operators. For example, \((a \land{} b) \lor{} \lnot{} c\) and \((a \land{} b) \lor{} (c \land{} d)\) are both supported, although the latter cannot be generated by ReReMi. \((a \land{} b) \land{} (c \land{} d)\) is not, because the operators do not alternate between parenthesized blocks. It should simply be written as \(a \land{} b \land{} c \land{} d\).
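The alternation rule can be illustrated with a short Python check. The nested-tuple representation of queries used here, with literals as plain strings and blocks as (operator, children) pairs, is an assumption made for this sketch and not Siren's internal representation or parser.

```python
def well_formed(query, parent_op=None):
    """Return True if nested blocks alternate between "and" and "or"."""
    if isinstance(query, str):  # a literal, possibly negated, e.g. "not c"
        return True
    op, children = query
    # A block must use a known operator that differs from its parent's.
    if op not in ("and", "or") or op == parent_op:
        return False
    return all(well_formed(child, op) for child in children)

ok_1 = ("or", [("and", ["a", "b"]), "not c"])                 # (a AND b) OR NOT c
ok_2 = ("or", [("and", ["a", "b"]), ("and", ["c", "d"])])     # (a AND b) OR (c AND d)
bad = ("and", [("and", ["a", "b"]), ("and", ["c", "d"])])     # (a AND b) AND (c AND d)
flat = ("and", ["a", "b", "c", "d"])                          # a AND b AND c AND d

for query in (ok_1, ok_2, bad, flat):
    print(well_formed(query))
```

Only the third query is rejected, since its inner blocks repeat the operator of the enclosing block; the flattened form expresses the same conjunction and passes.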
We consider three types of literals, defined over a Boolean, categorical or numerical variable respectively.
Below is an informal grammar of Siren's query language.
Tip
Naturally, the type of literal and the type of variable should match, i.e., \([4.0 \leq{} Va \leq{} 8.32]\) is a valid numerical literal only if the corresponding variable \(Va\) is a numerical variable. Furthermore, the upper bound of a numerical literal should always be greater than or equal to the lower bound, and at least one of the two bounds should be specified.
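These constraints on numerical literals can be written as a small validity check. The function below is a hypothetical helper for illustration, with the variable type passed as a plain string; it is not part of Siren.

```python
def valid_numerical_literal(variable_type, lower=None, upper=None):
    """Check the constraints on a numerical literal [lower <= V <= upper]."""
    if variable_type != "numerical":
        return False                 # literal type and variable type must match
    if lower is None and upper is None:
        return False                 # at least one bound must be specified
    if lower is not None and upper is not None:
        return lower <= upper        # upper bound must not be below lower bound
    return True

print(valid_numerical_literal("numerical", 4.0, 8.32))
print(valid_numerical_literal("numerical", 8.32, 4.0))
print(valid_numerical_literal("Boolean", 4.0, 8.32))
```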
The statistics of a redescription include:
Redescriptions from the Redescriptions tab can be exported to a file, one redescription per line, with both queries and basic statistics separated by tabs. Three formatting options are available, determined by the provided filename:
Inside a Siren package, the redescriptions are stored in tab-separated format together with their disabled status.
Tab-separated formats can be imported into Siren; TeX cannot.