Authors: Nathalie Japkowicz and Robert Holte
The AAAI-2000 workshop on learning from imbalanced data sets provided a venue for researchers to discuss fundamental questions pertaining to machine learning and to challenge some of the field's institutional practices.
Several observations were made and certain issues were explored in particular depth. First, it was observed that a large number of applications suffer from the class imbalance problem. A distinction was nonetheless drawn between the small-sample problem and the class imbalance problem, and it was remarked that although smart sampling can sometimes help, it is not always possible. Among the issues that received a lot of attention was the problem of evaluating learning algorithms in the presence of class imbalances. It was emphasized that the use of common evaluation measures can yield misleading conclusions. More appropriate measures include ROC curves and cost curves. An evaluation measure was also proposed for the case where only data from one class is available. The other issues concerned the design of learning algorithms. It was shown that concept-learning methods can use a one-sided approach focusing on either the majority or the minority class. If both classes are used, however, it is useful to avoid fragmenting the minority class. Another important issue concerned the close connection between the class imbalance problem and cost-sensitive learning. Finally, the goal of creating a classifier that performs well across a range of costs and priors was declared to be an important one.
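The point about misleading evaluation measures can be illustrated with a small sketch. It is not taken from the workshop: the 990-to-10 class split, the choice of accuracy as the "common" measure, and the helper names `accuracy` and `roc_auc` are all assumptions made for illustration. The sketch scores a trivial classifier that always predicts the majority class, first by accuracy and then by the area under its ROC curve.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen positive receives a higher
    score than a randomly chosen negative, counting ties as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 990 majority-class (0) and 10 minority-class (1) examples.
y_true = [0] * 990 + [1] * 10

# A trivial classifier that always predicts the majority class
# and assigns every example the same score.
constant_preds = [0] * 1000
constant_scores = [0.0] * 1000

majority_accuracy = accuracy(y_true, constant_preds)
majority_auc = roc_auc(y_true, constant_scores)

print(f"accuracy = {majority_accuracy:.3f}")  # 0.990
print(f"ROC AUC  = {majority_auc:.3f}")       # 0.500
```

Under these assumptions the classifier reaches 99% accuracy without ever identifying a minority-class example, while its ROC AUC of 0.5 shows that it ranks examples no better than chance.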