Course: Seminar (CSI 5902)

Presentation Topic: Link mining and link based clustering

Course Coordinator: Dr. S. Some


Supervisor: Dr. Herna L. Viktor

Student: Nadia Azam


Abstract:


Link mining is one of the newly emerging areas in data mining. The main concept behind link mining is to find interesting and hidden information in highly structured and/or relational datasets. In traditional data mining, the data are usually contained in a single relation where, most of the time, the instances are assumed to be independent. Various statistical models are subsequently used for mining these types of data. However, if the dataset is multi-relational and the instances are related or linked to each other, then existing statistical models may not give accurate results.

One of the key applications of link mining is to cluster linked data. The goal of clustering is to group data in such a way that objects in a group are similar to each other, whereas objects in two different groups are very different from each other. Cluster analysis algorithms group similar data together based on the distance between the data points, and vary in the distance measures they use to perform this grouping. Here, the similarity of cluster members is defined using measures such as the Euclidean distance, the Mirkin metric, the Jaccard index or the Rand index. In the case of linked data, objects can be seen as individual objects, as a group of linked objects, or as a subgraph of the original graph. However, traditional clustering techniques cannot be applied directly to these relational or linked datasets, since the relationships (or joins) between the relations need to be considered. Such link-based cluster analysis techniques need to identify frequent patterns within the data by measuring the level of connection, or similarity, between table content and links. It follows that the concepts of “level of connection”, “similarity” and “cluster distance” need special clarification when considering the relational database setting. New models that can deal with richly structured datasets, with various links or relationships, are thus necessary. One family of approaches that has gained interest in this field is graph partitioning algorithms. This is where link mining, and in particular link-based clustering, comes into play.
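
To make the contrast between attribute-based and link-based similarity concrete, the following minimal Python sketch compares two objects in both ways: by the Euclidean distance between their attribute vectors, and by the Jaccard index of their link neighborhoods. The toy objects, attribute values and edges are hypothetical and chosen only for illustration; using the Jaccard index on neighbor sets is one possible reading of "level of connection", not a method prescribed by this abstract.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_similarity(s, t):
    """Jaccard index of two sets: size of intersection over size of union."""
    if not s and not t:
        return 0.0  # convention assumed here for two empty sets
    return len(s & t) / len(s | t)


def neighbor_sets(edges):
    """Build an undirected neighbor set for every object that appears in a link."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors


# Hypothetical toy data: two objects with attributes, plus links to other objects.
attributes = {"a": (1.0, 2.0), "b": (4.0, 6.0)}
edges = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("b", "e")]

links = neighbor_sets(edges)

# Attribute view: how far apart the two objects are as independent instances.
attribute_distance = euclidean_distance(attributes["a"], attributes["b"])

# Link view: how much the two objects' neighborhoods overlap.
link_similarity = jaccard_similarity(links["a"], links["b"])

print(attribute_distance, link_similarity)
```

In this toy example the two objects are far apart in attribute space, yet share two of their three distinct neighbors, so a purely attribute-based measure and a link-based measure can disagree about whether they belong in the same cluster.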