Active Learning as an Enhancer of De-duplication


Divine Muhivuwomunda – University of Ottawa


Abstract


Recent years have seen a pressing need for solutions to the problem of duplicate records in databases, which usually lead to inaccurate results when data are manipulated and analyzed. A promising approach uses active learning, which continuously learns to label a pair of records as duplicate or non-duplicate. To accomplish this, a classifier is repeatedly trained on labeled data to which new instances are continuously added. The choice of which instance to append to the training data is made through ensemble learning. The motivating idea is that disagreement among the different learned hypotheses leads to a better selection of the instance that will bring the most information gain to the training session.
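The selection loop described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not specify: each record pair is reduced to a single similarity score in [0, 1], the ensemble is a committee of simple threshold classifiers trained on bootstrap resamples of the labeled data, and disagreement is measured by vote entropy. All function names (train_stump, build_committee, most_disputed, active_learn) and the oracle callback are hypothetical.

```python
import math
import random


def train_stump(samples):
    """Return the threshold t with the fewest training errors for the rule
    'duplicate if score >= t'. samples: list of (score, label), label in {0, 1}."""
    scores = sorted({s for s, _ in samples})
    midpoints = [(a + b) / 2 for a, b in zip(scores, scores[1:])]
    # The infinite thresholds cover resamples that contain a single class.
    candidates = [-math.inf] + midpoints + [math.inf]
    return min(candidates,
               key=lambda t: sum((s >= t) != bool(y) for s, y in samples))


def vote_entropy(votes):
    """Binary entropy (in bits) of the committee's duplicate/non-duplicate votes."""
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def build_committee(labeled, size, rng):
    """Train one stump per bootstrap resample of the labeled pairs."""
    return [train_stump([rng.choice(labeled) for _ in labeled])
            for _ in range(size)]


def most_disputed(pool, committee):
    """Return the unlabeled score on which the committee disagrees most."""
    return max(pool, key=lambda s: vote_entropy([s >= t for t in committee]))


def active_learn(labeled, pool, oracle, rounds, committee_size=7, seed=0):
    """Repeatedly query the oracle for the most disputed pair and add it
    to the training data. Returns the grown labeled set and the remaining pool."""
    rng = random.Random(seed)
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        committee = build_committee(labeled, committee_size, rng)
        query = most_disputed(pool, committee)
        pool.remove(query)
        labeled.append((query, oracle(query)))  # oracle stands in for a human labeler
    return labeled, pool
```

In this sketch, a pair whose score lies between the committee's thresholds splits the vote and is therefore queried first, whereas pairs on which all members agree carry little new information and stay in the pool.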