Learning from Imbalanced Data Sets

AAAI 2000 Workshop:
Learning from Imbalanced Data Sets

Date: Monday, July 31, 2000

Location: Austin, Texas

Organizers: Rob Holte, University of Ottawa (holte@site.uottawa.ca), Nathalie Japkowicz, Dalhousie University (nat@cs.dal.ca), Charles Ling, University of Western Ontario (ling@csd.uwo.ca) and Stan Matwin, University of Ottawa (stan@site.uottawa.ca)

Workshop Description: As the field of machine learning makes a rapid transition from the status of "academic discipline" to that of "applied science", a myriad of new issues, not previously considered by the machine learning research community, are now coming to light. One such issue is the problem of imbalanced data sets. Indeed, the majority of learning systems previously designed and tested on toy problems or carefully crafted benchmark data sets assume that their training sets are well balanced. In the case of concept-learning, for example, classifiers typically expect their training set to contain as many examples of the positive class as of the negative class.

Unfortunately, this assumption of balance is often violated in real-world settings. Indeed, there exist many domains in which one class is better represented than the other. This is the case, for example, in fault-monitoring tasks, where non-faulty examples are plentiful since they typically involve recording from the machine during normal operation, whereas faulty examples involve recording from a malfunctioning machine, which is not always possible, easy, or financially worthwhile. More generally, the problem of imbalanced data sets occurs any time one class represents a circumscribed concept while the other represents the counterpart of that concept. The imbalanced data set problem can thus take two distinct forms: either the counterpart class is under-sampled relative to the concept class (as in the above example) or it is over-sampled but particularly sparse (e.g., it includes the profile of a large number of patients who do not have lung cancer).

Although the imbalanced data set problem is starting to attract researchers' attention, attempts at tackling it have remained isolated. It is our belief that much progress could be achieved through a concerted effort and more interaction between researchers interested in this issue. The purpose of our workshop is to provide a forum to foster such interaction and to identify future research directions.

To date, we have identified four categories of methods capable of tackling the imbalanced data set problem in concept-learning tasks:

Methods in which the class represented by the small data set is over-sampled so as to match the size of the other class.
Methods in which the class represented by the large data set is down-sized so as to match the size of the other class.
Methods that ignore (or make little use of) one of the two classes altogether, by using a recognition-based instead of a discrimination-based inductive scheme.
Methods that internally bias the discrimination-based process so as to compensate for the class imbalance.
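
As an illustration only, the first, second, and fourth categories can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a method proposed by the organizers: the function names, the 5-versus-95 fault-monitoring data, and the choice of inverse-frequency weights as the "internal bias" are all hypothetical, and random replication and random subsampling are only the simplest instances of over-sampling and down-sizing.

```python
import random


def oversample_minority(minority, majority, rng):
    """Category 1: replicate randomly chosen minority examples
    until the minority class matches the majority class in size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(minority) + extra, list(majority)


def undersample_majority(minority, majority, rng):
    """Category 2: keep a random subset of the majority class
    that is the same size as the minority class."""
    return list(minority), rng.sample(majority, len(minority))


def balancing_weights(minority, majority):
    """Category 4 (one possible bias, assumed here): weight each class
    inversely to its frequency so both classes carry equal total weight."""
    total = len(minority) + len(majority)
    return {
        "minority": total / (2 * len(minority)),
        "majority": total / (2 * len(majority)),
    }


# Hypothetical fault-monitoring data: 5 faulty recordings, 95 normal ones.
rng = random.Random(0)
faulty = [("faulty", i) for i in range(5)]
normal = [("normal", i) for i in range(95)]

over_min, over_maj = oversample_minority(faulty, normal, rng)
under_min, under_maj = undersample_majority(faulty, normal, rng)
weights = balancing_weights(faulty, normal)

print(len(over_min), len(over_maj))    # 95 95
print(len(under_min), len(under_maj))  # 5 5
print(weights)
```

Over-sampling keeps every original example but repeats minority examples, down-sizing discards majority examples, and weighting leaves the data untouched and shifts the burden to the learner. The third category (recognition-based learning) is not shown, since it depends on the particular one-class model used.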

It is our goal to gather a crowd of researchers who have experimented with systems falling in one of these four categories or other categories that we have not yet identified.

Proposed Format: The workshop will consist of four panels corresponding to the categories identified above. A fifth panel will be created for papers falling into categories that we did not anticipate. [Please note that this structure may be revisited once contributions have been received.] Each panel will consist of a short introduction by an invited discussant, a series of paper presentations, and a discussion also led by the discussant. The workshop will conclude with a general panel discussion during which four distinguished guests will comment on the presentations of the day, discuss future directions, and open the floor for general discussion.

Proposed Length: One day, during which each panel will be allocated 1 to 2 hours, depending on the number of contributions and the expected length of the discussion session.

Submissions: Authors are invited to submit papers on the topics outlined above or on other related issues. Submissions should be 6 pages long and conform to the AAAI style sheet. Electronic submissions, in PostScript format, are preferred and should be sent to Nathalie Japkowicz at nat@cs.dal.ca. If electronic submission is inconvenient, please send four hard copies of your submission to:

Nathalie Japkowicz
Faculty of Computer Science
DalTech/Dalhousie University
6050 University Avenue
Halifax, Nova Scotia
Canada, B3H 1W5

Timetable:

Submission deadline: March 10, 2000
Notification date: March 24, 2000
Final date for camera-ready copies to organizers: April 26, 2000
