Date: Monday, July 31, 2000
Location: Austin, Texas
Organizers: Rob Holte, University of Ottawa (firstname.lastname@example.org), Nathalie Japkowicz, Dalhousie University (email@example.com), Charles Ling, University of Western Ontario (firstname.lastname@example.org) and Stan Matwin, University of Ottawa (email@example.com)
As the field of machine learning makes a rapid transition from the status
of "academic discipline" to that of "applied science", a myriad of new
issues, not previously considered by the machine learning research
community, are now coming to light. One such issue is the problem of
imbalanced data sets. Indeed, the majority of learning systems previously
designed and tested on toy problems or carefully crafted benchmark data
sets usually assume that the training sets are well balanced. In the
case of concept-learning, for example, classifiers typically expect that
their training set contains as many examples of the positive as of the
negative class.
Unfortunately, this assumption of balance is often violated in real-world settings. Indeed, there exist many domains in which one class is better represented than the other. This is the case, for example, in fault-monitoring tasks: non-faulty examples are plentiful, since they typically involve recording from the machine during normal operation, whereas faulty examples involve recording from a malfunctioning machine, which is not always possible, easy, or financially worthwhile. More generally, the problem of imbalanced data sets occurs whenever one class represents a circumscribed concept while the other represents the counterpart of that concept. The imbalanced data set problem can thus take two distinct forms: either the counterpart class is under-sampled relative to the concept class (as in the above example), or it is over-sampled but particularly sparse (e.g., it includes the profiles of a large number of patients who do not have lung cancer).
Although the imbalanced data set problem is starting to attract researchers' attention, attempts at tackling it have remained isolated. It is our belief that much progress could be achieved through a concerted effort and greater interaction between researchers interested in this issue. The purpose of our workshop is to provide a forum to foster such interaction and to identify future research directions.
To date, we have identified four categories of methods capable of tackling the imbalanced data set problem in concept-learning tasks:
Proposed Format: The workshop will consist of four panels corresponding to the categories identified above. A fifth panel will be created for papers falling in categories which we did not anticipate. [Please note that this structure may be revisited once contributions have been received.] Each panel will consist of a short introduction by an invited discussant, a series of paper presentations, and a discussion also led by the discussant. The workshop will conclude with a general panel discussion during which four distinguished guests will comment on the presentations of the day, discuss future directions, and open the floor for general discussion.
Proposed Length: One day, during which each panel will be allocated 1 to 2 hours, depending on the number of contributions and the expected length of the discussion session.
Authors are invited to submit papers on the topics outlined above or
on other related issues. Submissions should be 6 pages long and conform to
the AAAI style sheet. Electronic submissions, in PostScript format, are
preferred and should be sent to Nathalie Japkowicz at firstname.lastname@example.org. If
electronic submissions are inconvenient, please send four hard copies of
your submission to: