Choosing a marginal class distribution for classifier induction

Foster Provost
New York University

(work done with Gary Weiss of AT&T)



Practitioners often must choose the marginal class distribution of the data from which to learn. The choice matters most when the class distribution is unbalanced, in which case practitioners often learn models from training data containing a larger percentage of the minority class than occurs naturally (a practical rule of thumb is to train with a balanced distribution). I will discuss various aspects and intricacies of this problem and present the results of an empirical study examining the relationship between class distribution and generalization performance. I will also present a new "budget-sensitive" progressive sampling algorithm that selects a class distribution while staying within a predetermined budget for procuring/preprocessing data.