Multi-Relational Data Mining



Multi-Relational Rule Discovery

Progress Report

November 2002

Mahmut Uludag

Eastern Mediterranean University

Supervisor: Prof. Dr. Mehmet Tolun

Summary

This report presents the current status of our multi-relational rule discovery research. As part of the research, a data mining system is being developed that allows discovery of multi-relational rules using data from a relational database. The basic assumption of the system is that the objects analysed are stored in a set of tables.

The discovered multi-relational rules can either be used to predict an object attribute value (by looking at the other attribute values of the object), or to reveal hidden relationships between the object attribute values. The second use of the discovered rules can be described as ‘data summarisation’.

The system developed so far cannot yet use data from every possible ‘connected’ schema; it requires a central table directly connected to all other tables. However, this report gives details of the general solution for a ‘connected’ schema. We plan to implement the general solution during December 2002.

The report also gives details of the possible usage of the rule discovery algorithm in SRS, LION bioscience’s data integration system [14]. Basically, when a relational library is defined in SRS, most of the input required by the rule discovery algorithm becomes automatically defined. The rule discovery process can be started by having a button on the SRS query results page allowing the user to link to “hidden relations in the results”.

1. Introduction

Most current data mining algorithms are designed to use datasets from a single table. They require that each object be described by a fixed set of attributes. Compared to a single-table dataset, a relational database containing multiple tables makes it possible to represent more complex and structured data. In addition, a significant amount of today's scientific data, such as gene expression data, is stored in relational databases. For these reasons, it is important to have discovery algorithms that run on relational data in its natural form, without requiring the data to be viewed as one single table.

Previous work on multi-relational data mining concentrated on decision tree induction [11] [5] and association rule mining [7]. However, in a more recent study [12], multi-relational data mining was applied to a rule discovery algorithm as a test case for a new paradigm.

In [10], a multi-relational data mining framework was described that can be used to analyse data in relational databases containing multiple relations. In this study, we have applied the described framework to a specific rule induction algorithm, namely ILA (Inductive Learning Algorithm). ILA is a ‘covering’ type machine learning algorithm that takes each class in turn and seeks a way of covering all instances in it, while excluding instances not in the class [2]. The concepts suggested in the framework, such as selection graphs, the target table and the target attribute, all proved very helpful in building the multi-relational version of the ILA algorithm. However, we needed to extend the framework with a notion of distinction between covered and not yet covered objects.

A relational data model consisting of multiple tables may represent several object classes; i.e., within a schema one set of tables may represent one class of objects while a different set represents another. Before starting the multi-relational data mining process, one should analyse the schema and select the list of tables that represents the kind of objects to be analysed. One of the selected tables is central for the objects, and each row in that table should correspond to a single object in the database. In previous multi-relational data mining publications this central table is called the ‘target table’ in [11] and [5], the ‘primary table’ in [7] and [8], the ‘master relation’ in [13], and the ‘hub table’ in [14]. The part of the object information stored in the other selected tables can be found by following the associations between the target table and those tables.

The target table in multi-relational data mining terminology has a meaning similar to that of the fact table in data warehouses, where the database usually consists of a fact table, storing all measures, with dimension tables around it. Having one table per dimension leads to a star schema; by normalising the dimension tables, a snowflake schema is obtained [9]. The fact table is related to several dimension tables using a foreign key relationship to each dimension table. That is, each row inserted into the fact table has values in the foreign key columns that correspond to primary key data in the various dimensions.

In many cases a database may contain several kinds of objects, in which case multiple target tables can be selected. However, Relational-ILA can analyse examples of only one kind of object at a time, so target tables should be selected one by one for each execution of Relational-ILA.

ILA requires a particular feature of the objects under consideration to be used as the dependent attribute for classification. For this reason, one of the columns of the target table must be defined as the target attribute.

The most apparent difference between a standard rule and a multi-relational rule is that in a multi-relational rule, the attribute names in conditions are annotated with the name of the relational table to which the attribute belongs. However, this table name detail does not need to be presented to the user; see for example Figure 1.


Figure 1. An example multi-relational rule, from the PharmaDM web site [15].

2. The Algorithm

This section summarises the basic multi-relational ILA algorithm. Input to the algorithm can be listed as follows.

□ The list of table names,

□ The name and type of each column in each table,

□ The list of foreign key relationships between the tables,

□ The name of the target table,

□ The name of the target attribute.

Given the input, the multi-relational ILA algorithm generates a list of multi-relational rules that cover all objects stored in the given instances of the list of tables. It is assumed that the target table is connected to all other tables through foreign key relations.

Relational-ILA starts by selecting the first value of the target attribute and then attempts to generate rules that correctly classify objects having this attribute value, in other words objects belonging to the current class. For this purpose, the following steps are repeated until all examples of the current class are covered, i.e. correctly classified. The same steps are then repeated for the other class values.

2.1 Candidate generation

The first step of the ILA algorithm is to generate the hypotheses. In ILA terminology, ‘hypotheses’ means a set of potentially interesting rules. Relational-ILA first attempts to cover all the class examples by finding hypotheses with only one condition. After exhausting all possible size-one hypotheses, ILA starts to refine the set of hypotheses by adding further conditions one at a time. The refinement process continues until either all examples are covered or the maximum size for rules is reached.

The following is the current version of the function used to build the initial hypotheses.

Vector buildHypothesis(String classValue) {
    Vector hypotheses = new Vector();
    For each table {
        For each column in the selected table {
            If (column is not the target attribute
                && column is not a primary key
                && column is not a foreign key) {
                Check whether the table is the target table
                Check whether the column is numeric
                Select the proper SQL templates and generate the SQL
                Execute the generated SQL in the database
                Retrieve the hypotheses and their frequency values from the result set
                For each retrieved hypothesis {
                    If its false values are less than a predefined threshold
                        Then append the hypothesis to hypotheses
                }
            }
        }
    }
    return hypotheses
}

For the general solution, we need to change this function so that it searches the table space not by simply iterating over the table list, but by following the relations between tables. We therefore need a table name as a second argument, and the loop that iterates over the table list is replaced by a loop over the foreign key relations from/to the current table. The general solution is then as follows; when the function is first called, it should be given the target table name.

Vector buildHypothesis(String classValue, String tableName) {
    Vector hypotheses = new Vector();
    For each column in the given table {
        If (column is not the target attribute
            && column is not a primary key
            && column is not a foreign key) {
            Check whether the table is the target table
            Check whether the column is numeric
            Select the proper SQL templates and generate the SQL
            Execute the generated SQL in the database
            Retrieve the hypotheses and their frequency values from the result set
            For each retrieved hypothesis {
                If its false values are less than a predefined threshold
                    Then append the hypothesis to hypotheses
            }
        }
    }
    For each not yet visited table linked to the given table by a foreign key relation {
        hypotheses += buildHypothesis(classValue, linkedTableName)
    }
    return hypotheses
}

2.1.1 Candidates in the target table

A list of hypotheses is generated by first searching the target table using the following SQL template. The template is applied for each column other than the class column and the primary key column(s).

Select attr, count(attr) from targetTable, covered where targetAttr = currentClass and covered.id = targetTable.pk and covered.mark=0 group by attr;

Here,

□ attr is one of the columns of the target table, excluding the primary key column(s) and the class column, i.e. the target attribute,

□ targetTable is the central table which has one row for each object for which classification rules are being searched,

□ targetAttr is the class column which is to be used in the right side of the classification rules,

□ pk is the name of the primary key column in the target table.

In the template, the table ‘covered’ is a temporary table generated by Relational-ILA. In order to reduce the complexity of communication between Relational-ILA and the database system, the information about covered objects is stored in this temporary table rather than in an internal data structure. An alternative approach would be to keep this information in Relational-ILA's main memory. This would require the hypothesis generation SQL templates to be extended so that the list of identifiers of the covered elements is included in the SQL. The alternative approach was not selected because it can make the SQL statements too long when the number of marked objects is large.

The temporary table ‘covered’ has two attributes named as ‘id’ and ‘mark’. When ILA starts processing a class, this table is initialised as follows. A new row is inserted for each object belonging to the current class. The ‘id’ field gets the value of the primary key and the mark field is set to zero. When a new rule is generated, the ‘mark’ fields of the rows that refer to the covered objects are set to one.
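As an illustration, the life cycle of this table can be sketched in SQL as follows. This is only a sketch: ‘targetTable’, ‘pk’, ‘targetAttr’ and ‘currentClass’ are placeholders as in the templates above, and the condition on ‘otherTable’ stands for the condition part of whatever rule was generated.

Create table covered (id integer, mark integer);

-- when processing of a class starts: one unmarked row per class object
Insert into covered (id, mark)
    Select pk, 0 from targetTable where targetAttr = currentClass;

-- when a new rule is generated: mark the objects the rule covers
Update covered set mark = 1
    where id in (Select targetTable.pk from targetTable, otherTable
                 where otherTable.fk = targetTable.pk and otherTable.attr = value);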

The above SQL template generates hypotheses together with their frequency values. However, ILA also needs to know the frequency of the hypotheses in the other classes. For this reason, the following SQL template is used to calculate the frequency values of the hypotheses in the other classes.

Select attr, count(attr) from targetTable where targetAttr <> currentClass group by attr;

2.1.2 Candidates in the other tables

After generating the list of all possible size-one hypotheses in the target table, the next step is to find the size-one hypotheses using the other tables. The following SQL template is used for this purpose. The SQL queries are repeated for each table and for each column except the foreign key column(s).

Select attr, count(attr) from table, covered, targetTable where table.fk = targetTable.pk and targetTable.targetAttr = currentClass and covered.id = targetTable.pk and covered.mark=0 group by attr

Here,

□ table refers to the current table where the hypothesis is being searched,

□ fk is the foreign key column which is connected to the target table.

As in the target table case, we need to calculate the frequency of the hypotheses in the other classes. The frequency values are calculated using SQL statements derived from the following SQL template.

Select attr, count(attr) from table, targetTable where table.fk = targetTable.pk and targetTable.targetAttr <> currentClass group by attr

Each time the above queries are executed, ILA uses the returned result set to build an internal list of hypotheses using the Hypothesis class, which has the following attributes.

□ The table attribute is used to store the table name information.

□ The column attribute is used to store the column name information.

□ The value attribute refers to the column value for which the hypothesis stands.

□ The trueValues attribute is used to store the number of times the hypothesis is true in the not yet covered elements of the training data.

□ The falseValues attribute is used to store the number of times the hypothesis is false.

The Hypothesis class also has an attribute called connectedTo, which relates hypotheses to each other so that multi-relational hypotheses can be formed. A multi-relational hypothesis is composed of simple hypotheses connected by the logical ‘and’ operator. Similar to selection graphs [10], multi-relational hypotheses can be translated into SQL in a straightforward manner. For this purpose we have implemented the toSQL method of the Hypothesis class. The method simply traverses the connected hypotheses and prepares the SQL equivalent of the hypothesis object.
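A minimal sketch of the Hypothesis class is given below. The attribute names follow the list above; the body of toSQL is illustrative only, showing the traversal idea rather than the exact SQL fragments Relational-ILA produces.

public class Hypothesis {
    String table;           // table the condition refers to
    String column;          // column the condition tests
    String value;           // value the condition compares against
    int trueValues;         // frequency among not yet covered class objects
    int falseValues;        // frequency among objects of the other classes
    Hypothesis connectedTo; // next condition of a multi-relational hypothesis

    // Translate the chain of connected conditions into a conjunction
    // of SQL conditions.
    public String toSQL() {
        StringBuffer where = new StringBuffer();
        for (Hypothesis h = this; h != null; h = h.connectedTo) {
            if (where.length() > 0) where.append(" and ");
            where.append(h.table).append(".").append(h.column)
                 .append(" = '").append(h.value).append("'");
        }
        return where.toString();
    }
}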

2.2 Hypothesis Evaluation and Rule Generation

At this stage, the list of hypotheses is ready, with frequency values showing how each hypothesis classifies the training data. This is the typical information needed by many rule discovery algorithms for the selection of rules, which means that the results of this study are not bound to the ILA algorithm; they can be used to build multi-relational versions of other rule discovery algorithms as well.

The rule selection criterion of ILA is simple: select the hypothesis with the maximum trueValues count whose falseValues count is zero. If there is a hypothesis satisfying this criterion, it is used to generate a new rule. Rules in Relational-ILA are represented using the following attributes.

□ The hypothesis attribute refers to the hypothesis object from which the rule was derived. The hypothesis object forms the left-hand side of a rule.

□ The targetTable attribute is used to store the name of the target table.

□ The targetAttr attribute is used to store the name of the target attribute.

In summary, the hypothesis attribute forms the left-hand side of a multi-relational rule, while the targetTable and targetAttr attributes form the right-hand side. Unlike a conventional classification rule, a multi-relational rule includes the name of the relational table to which the target attribute belongs.
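The selection criterion itself amounts to a single pass over the hypothesis list. The sketch below assumes the Hypothesis class outlined in section 2.1.2; RuleSelector and selectBest are hypothetical names.

class RuleSelector {
    // Pick the hypothesis with the largest trueValues count among those
    // whose falseValues count is zero; null means no valid rule exists
    // and the hypothesis set must be refined.
    static Hypothesis selectBest(java.util.Vector hypotheses) {
        Hypothesis best = null;
        for (int i = 0; i < hypotheses.size(); i++) {
            Hypothesis h = (Hypothesis) hypotheses.elementAt(i);
            if (h.falseValues == 0
                    && (best == null || h.trueValues > best.trueValues))
                best = h;
        }
        return best;
    }
}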

After a new rule has been generated, the objects classified by the new rule are marked as covered in the table ‘covered’ as described in the previous section. Then, in the further steps of the algorithm, the new set of hypotheses is generated only from the objects that have not yet been covered.

The hypothesis generation and the rule generation steps are repeated until no valid rule can be found in the current hypothesis set or all objects of the current class are covered by the already generated rules. If no valid rule can be produced but there are uncovered objects left, the active set of hypotheses is refined as described in the next section. Before describing the refinement algorithm, we first describe the basic algorithm as follows.

For each class {
    Reset the covered examples table (classValue);
    Repeat {
        Vector hypotheses = buildHypothesis(classValue);
        Rule rule = selectRules(hypotheses, classValue);
        If no rule was selected {
            int size = 1;
            refineSearch(hypotheses, classValue, size);
            Stop processing for the current class
        } Else {
            Update the covered examples information (rule);
            If all examples are covered
                Then stop processing for the current class
        }
    }
}

2.3 Refinement of Hypothesis

Refinement of a multi-relational hypothesis means extending the actual description of the hypothesis. It results in a new selection of objects that is a subset of the selection associated with the original multi-relational hypothesis.

The refineSearch function simply extends the given list of hypotheses, first using the attributes of the target table and then using the attributes of the other tables. After the list of extended hypotheses is generated, the usual rule selection criterion is applied to the extended set. If a rule is generated from the extended hypothesis set, the corresponding hypothesis is deleted from the current hypothesis set and the refinement process is repeated. When no more rules can be generated, the refineSearch function calls itself recursively, providing the last extended set as input. With each recursive call, the function tries to extend the hypotheses by one more condition.

The following is the pseudo code of the refineSearch function in the system being developed.

refineSearch(Vector currentHypotheses, String classValue, int size) {
    If size is larger than the maximum size then return;
    Repeat {
        Vector extendedHypotheses = extendHypothesis(classValue, currentHypotheses);
        Rule rule = selectRules(extendedHypotheses, classValue);
        If no rule was selected {
            refineSearch(extendedHypotheses, classValue, size + 1);
            Stop processing for the current class;
        } Else {
            Update the covered examples information (rule);
            If all examples are covered
                Then stop processing for the current class
        }
    }
}

The following is the pseudo code of the extendHypothesis function used by refineSearch. This is not the current version but the one for the general solution.

Vector extendHypothesis(String classValue, Vector currentHypotheses, String tableName) {
    Vector extendedHypotheses = new Vector();
    For each column in the given table {
        If (column is not the target attribute
            && column is not a primary key
            && column is not a foreign key) {
            For each hypothesis in currentHypotheses {
                Check whether the table is the target table
                Check whether the column is numeric
                Select the proper SQL template and generate the SQL for the selected hypothesis
                Execute the SQL
                Append all returned hypotheses to extendedHypotheses
            }
        }
    }
    For each not yet visited table linked to the given table by a foreign key relation {
        extendedHypotheses += extendHypothesis(classValue, currentHypotheses, linkedTableName)
    }
    return extendedHypotheses
}

2.3.1 Extending candidates using the target table attributes

The following is the SQL statement template to extend a hypothesis by adding a new condition from the target table.

Select attr, count(attr) from targetTable, covered, [table_list] where targetAttr = currentClass and [condition_list] and covered.id = targetTable.pk and covered.mark=0 group by attr;

Here the table_list variable includes the list of the tables to which the hypothesis being extended refers. The condition_list variable represents the conditions covered by the hypothesis being extended.
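As an illustration, suppose the hypothesis being extended is ‘table1.colour = red’ (a hypothetical table and condition). Then [table_list] is table1, [condition_list] contains the condition together with its join to the target table, and the template instantiates roughly to:

Select attr, count(attr) from targetTable, covered, table1 where targetAttr = currentClass and table1.colour = 'red' and table1.fk = targetTable.pk and covered.id = targetTable.pk and covered.mark=0 group by attr;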

As before, Relational-ILA needs to know the frequency of the extended hypotheses in the classes other than the current class. The following SQL template is used for this purpose.

Select attr, count(attr) from targetTable, [table_list] where targetAttr <> currentClass and [condition_list] group by attr;

2.3.2 Extending candidates using the other tables' attributes

The following is the SQL statement template to extend a hypothesis by adding a new condition from one of the tables other than the target table.

Select attr, count(attr) from targetTable, covered, [table_list] where targetAttr = currentClass and table.fk = targetTable.pk and [condition_list] and covered.id = targetTable.pk and covered.mark=0 group by attr;

The following SQL statement is used to find the frequency of the hypotheses, generated by the previous SQL, in the classes other than the current class.

Select attr, count(attr) from targetTable, [table_list] where targetAttr <> currentClass and table.fk = targetTable.pk and [condition_list] group by attr;

3. Discretization

The ILA algorithm can only handle symbolic attributes. For this reason, numeric attribute values need to be discretized in a pre-processing step. In the data mining system being developed, the Weka machine learning library [6] is used for discretizing the numeric attributes. Weka has a number of discretization methods implemented in its DiscretizeFilter class. Given the list of instances and the index of the attribute to be discretized, different flavours of the calculateCutPoints methods in the DiscretizeFilter class produce the list of cut points according to the selected discretization method. The generated cut points can be retrieved with the getCutPoints method of the same class.

When the user starts a rule induction process, Relational-ILA checks each table for numeric attributes. A list of Weka instance objects is then generated for the numeric attributes only. ILA then calls the related Weka functions, first to discretize the instances and then to retrieve the cut points. The cut points are stored in a temporary table, named ‘disc’, in the connected database. The table should have the following attributes:

□ table_name: name of the table the numeric attribute is from.

□ attribute_name: name of the column the numeric attribute is associated with.

□ interval_name: the interval name, given by Weka when the cut points are calculated.

□ min_val: minimum value for the current interval.

□ max_val: maximum value for the current interval.

Ideally, the table_name, attribute_name and interval_name columns should be indexed to reduce the time ILA spends retrieving this information.
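A possible definition of the ‘disc’ table, with the indexing suggested above, is sketched below; the column types and sizes are assumptions.

Create table disc (
    table_name varchar(64),
    attribute_name varchar(64),
    interval_name varchar(64),
    min_val double,
    max_val double
);
Create index disc_idx on disc (table_name, attribute_name, interval_name);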

In the previous section, all the database primitives were given for symbolic attributes. When they are to be used for a numeric attribute, Relational-ILA slightly modifies them: one more join condition is added using the following template:

attrname = disc.attribute_name and attr > disc.min_val and attr < disc.max_val

The ‘select’ part of the primitives is also changed so that the attribute column is replaced by the three columns interval_name, min_val and max_val. Accordingly, the ‘group by’ part of the queries gets the same replacement.
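For example, for a hypothetical numeric column attr of the target table, the size-one template of section 2.1.1 would instantiate roughly to:

Select disc.interval_name, disc.min_val, disc.max_val, count(*) from targetTable, covered, disc where targetAttr = currentClass and covered.id = targetTable.pk and covered.mark=0 and disc.table_name = 'targetTable' and disc.attribute_name = 'attr' and attr > disc.min_val and attr < disc.max_val group by disc.interval_name, disc.min_val, disc.max_val;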

Comparative studies indicate that the entropy-based MDL heuristic outperforms other methods of discretization in most cases [4]. For this reason we selected the entropy-based MDL heuristic as the default method of discretization for Relational-ILA.

4. Missing Attributes

For the missing attribute values problem, we plan to implement two well-known solutions. The first solution, described in [14], simply treats a missing attribute value as just another attribute value. The second solution is to replace missing values with the most common value of the attribute for the class. For a numeric attribute, the discretization procedure should first be applied to the non-null values. The null values are then mapped to the ‘missing’ attribute value under the first solution, and to the most common value under the second.
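For the second solution, the most common value of an attribute within a class can be found with a query along the following lines; this is a sketch in the spirit of the templates in section 2, and syntax details vary between databases. The first returned row holds the replacement value for the class.

Select attr, count(attr) from table, targetTable where table.fk = targetTable.pk and targetTable.targetAttr = currentClass and attr is not null group by attr order by count(attr) desc;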

5. Experiments

We have carried out experiments using data stored in MS Access and MySQL databases. In the experiments, the following three datasets, well known to machine learning researchers, were used.

- objects,

- promoter,

- iris.

Both single-table and multi-table versions of the objects dataset were used. Thanks to its small size, the objects dataset was especially useful for checking the correctness of the algorithm during initial development. In the multi-table version of the objects dataset, the target table has only two attributes: the ‘id’ column, which uniquely identifies each object, and the ‘class’ column, which is the target attribute. All other attributes of the objects dataset are stored in separate tables, each with two columns: one for the associated attribute and an ‘id’ column, a foreign key referring to the primary key of the target table (a DDL sketch of this layout is given below). We experimented with the promoter dataset, in both its single-table and multi-table versions, to verify that the developed system behaves correctly when the numbers of tables, columns and rows are somewhat larger. We experimented with the single-table version of the iris dataset to test the discretization solution.
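For illustration, the multi-table layout of the objects dataset can be sketched in DDL as follows; ‘colour’ stands for one of the attribute tables, and the names are hypothetical.

Create table objects (
    id integer primary key,   -- uniquely identifies each object
    class varchar(16)         -- the target attribute
);

Create table colour (
    id integer,               -- foreign key referring to objects.id
    colour varchar(16),
    foreign key (id) references objects (id)
);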

We decided to use the genes dataset [1] of KDD Cup 2001 for further testing of the implementation of the basic multi-relational ILA algorithm. The dataset was downloaded from the KDD Cup 2001 web site and then imported into a Microsoft Access database. In the first experiments, using the Sun ODBC-JDBC bridge driver, the program encountered several errors, so we transferred the dataset to a MySQL database.

There are two tables in the genes dataset.

- One relational table (genes_relation) specifies a variety of properties of individual genes or proteins.

- The other relational table (interaction_relations) specifies which genes (or their protein products) interact with which other genes.

The genes_relation table has information about 862 different genes; there can be more than one row per gene. The attribute gene_id identifies a gene uniquely. The dataset has one numeric attribute and several attributes with missing values.

The current version of Relational-ILA requires the target table to have a primary key. In order to have a target table with a primary key, we have normalized the genes_relation table as shown in Figure 2, similar to the way described in [5]. The Interaction table, having two foreign keys, will be a good test bed for the general solution.


Figure 2. Schema of the KDD Cup 2001 genes data after normalization

6. How can it be used in SRS?

The relational rule discovery algorithm described can be used in SRS either to analyse all objects in a relational database, or to analyse a selected list of objects after a user query (here objects correspond to what are called entries in SRS). Compared to running the algorithm on the whole dataset, the second approach may need relatively little time, depending on the size of the query result set. The discovered rules would also have more potential to help users understand their query results. In order to run the discovery algorithm on partial data, we need to extend the conditions of the generated SQL queries with the conditions of the user's SQL query.

As a starting point, to execute the discovery algorithm we may have a button on the SRS query results page such as “hidden relations in the results”.

We have two possibilities for integrating the rule discovery algorithm into SRS. One alternative would be to define it as an SRS application. The other would be to implement the algorithm in SRS native code. The second alternative would certainly give better runtime performance.

In SRS-Relational, when a library is defined, most of the inputs necessary for the discovery algorithm become defined automatically. The hub table normally corresponds to the target table. When the relational rule discovery process is started the user will only need to select the target attribute. Then the algorithm should produce a list of rules showing the relation between the target attribute and the other attributes of the object. The generated list of rules can be displayed using an SRS view. As shown in the example in Figure 3, we may use some metaphors or other means to display rules in natural language as far as possible.


Figure 3. Another multi-relational rule example, from PharmaDM [15].

7. Discussion

Until now we have addressed the case where the target table has a primary key and all other tables have foreign key relations to it. However, we have devised a general solution and described it in this report. Basically, the solution changes how the table space is searched: instead of visiting tables by simply iterating over the table list, the general solution starts from the target table and visits all connected tables in a depth-first manner by following the foreign key relations, as sketched below.
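A minimal sketch of this traversal is shown below; SchemaWalker and linkedTables are hypothetical names, the linked tables would be fed from the foreign key relationships given as input (section 2), and the visited set prevents infinite recursion on cyclic schemas.

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

class SchemaWalker {
    Set visited = new HashSet();

    // Depth-first traversal of the table graph, following foreign key
    // relations in both directions, starting from the target table.
    void visit(String table) {
        if (!visited.add(table)) return; // already seen: skip to avoid cycles
        // ... generate hypotheses from 'table' here ...
        for (Iterator it = linkedTables(table).iterator(); it.hasNext();)
            visit((String) it.next());
    }

    // Placeholder: in Relational-ILA the linked tables come from the
    // foreign key relationships supplied as input.
    Set linkedTables(String table) {
        return new HashSet();
    }
}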

The implementation of the general solution will be the highest priority task for December. One of the high-priority tasks on the GUI side will be to implement a GUI component that allows users to define the foreign key relations between the selected tables.

In the later stages, we should investigate the possibility of having the classification attribute in any of the tables. This was done in [5] but no details were given.

It seems that having a “discover hidden relations in your results” feature can make SRS a more useful tool for researchers.

References

1. Cheng, J., Krogel, M., Sese, J., Hatzis, C., Morishita, S., Hayashi, H., and Page, D., KDD Cup 2001 Report, ACM SIGKDD Explorations, vol. 3, issue 2, January 2002.

2. M.R. Tolun and Saleh M. Abu-Soud, ILA: An Inductive Learning Algorithm for Rule Extraction, Expert Systems with Applications, 14(3), April 1998, 361-370.

3. Holte, R.C. 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning 11:63-91.

4. Ron Kohavi, Mehran Sahami, Error-based and entropy-based discretization of continuous features, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.

5. H. Leiva and V. Honavar, Experiments with MRDTL—A Multi-relational Decision Tree Learning Algorithm. In Saso Dzeroski, Luc De Raedt, and Stefan Wrobel, editors, Proceedings of the Workshop on Multi-Relational Data Mining (MRDM-2002), pages 97-112, University of Alberta, Edmonton, Canada, July 2002.

6. Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, 2000.

7. Viviane Crestana-Jensen, Nandit Soparkar: Frequent Itemset Counting Across Multiple Tables. PAKDD 2000: 49-61.

8. V. Crestana and N. Soparkar. Mining decentralized data repositories. Technical Report: CSE-TR-385-99. The University of Michigan, EECS Dept. Ann Arbor, USA. 1999.

9. Themistoklis Palpanas: Knowledge Discovery in Data Warehouses. SIGMOD Record 29(3): 88-100(2000).

10. Knobbe, A.J., Blockeel, H., Siebes, A., Van der Wallen, D.M.G. Multi-Relational Data Mining, In Proceedings of Benelearn’99, 1999.

11. Knobbe, A.J., Siebes, A., Van der Wallen, D.M.G. Multi-Relational Decision Tree Induction, In Proceedings of PKDD’99, 1999.

12. A.J. Knobbe, A. Siebes, and B. Marseille, Involving Aggregate Functions in Multi-Relational Search. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, Springer-Verlag, August 2002.

13. Stefan Wrobel, An algorithm for multi-relational discovery of subgroups, Proceedings of PKDD’97, Springer-Verlag, Berlin, New York, 1997.

14. SRS-Relational White Paper, Working with relational databases using SRS, LION Bioscience Ltd.



15. PharmaDM, an integrated drug discovery solutions company.
