CSI5388: Topics in Machine Learning

Performance Evaluation for Classification


Nathalie Japkowicz

Office: STE 5029
E-mail: nat@site.uottawa.ca
Telephone: 562-5800 ext. 6693 (Note: e-mail is more reliable)

Meeting Times and Locations

    Mondays: 2:30pm-4pm STE B0138
    Wednesdays: 2:30pm-4pm LMX 223

(Please note the different rooms!)

Office Hours and Locations

  • Times: Mondays and Wednesdays, 1pm-2pm
  • Location: STE 5029


Prerequisite: CSI5387, although the course can also be taken with permission from the instructor. Permission will be granted if the book entitled "Machine Learning" by Tom Mitchell (see below) has been read and well understood prior to taking the course.


Machine Learning is the area of Artificial Intelligence concerned with building computer programs that automatically improve with experience. An important problem in Machine Learning is how to evaluate our algorithms. The routine approach consists of averaging the error rate over 10 cross-validation folds and running a t-test, but this method can be problematic, at least in certain cases. In this course, we will look in depth at the issue of machine learning evaluation in an attempt to discover more suitable evaluation approaches.
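
As a concrete picture of the routine approach mentioned above, the following Python sketch averages the error rate of two classifiers over 10 folds and computes the paired t statistic on the per-fold differences. The function name `paired_t_statistic` and the per-fold error rates are hypothetical, chosen only for illustration; they are not course material or real experimental results.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic for the per-fold error rates of two classifiers."""
    if len(errors_a) != len(errors_b):
        raise ValueError("both classifiers need one error rate per fold")
    # Work on the per-fold differences, since both classifiers see the same folds.
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    k = len(diffs)
    sd = stdev(diffs)  # sample standard deviation (k - 1 in the denominator)
    if sd == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(k))


# Hypothetical error rates on 10 cross-validation folds (illustration only).
errors_a = [0.20, 0.22, 0.19, 0.25, 0.21, 0.23, 0.20, 0.24, 0.22, 0.21]
errors_b = [0.18, 0.21, 0.19, 0.22, 0.20, 0.21, 0.19, 0.22, 0.21, 0.20]

t = paired_t_statistic(errors_a, errors_b)
print(f"mean error A = {mean(errors_a):.3f}, mean error B = {mean(errors_b):.3f}")
print(f"paired t statistic over {len(errors_a)} folds = {t:.2f}")
```

With 10 folds the statistic has 9 degrees of freedom, so it would usually be compared against a two-sided 5% critical value of about 2.262. Note that the test treats the folds as independent samples, which they are not, since their training sets overlap.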

The course is a seminar course that will consist of a mixture of regular lectures and student presentations. The regular lectures will cover broad introductions to some of the major areas of research currently under investigation in the subfield of Machine Learning evaluation. The student presentations will be based on recent research papers that describe new results in the areas discussed in class. Each presentation will involve one or two papers, which will need to be contrasted and placed in the context of the class discussion.


Students will be evaluated as follows:

  • They will have to write short critical commentaries on the assigned research papers in six different weeks (Weeks 3, 4, 7, 8, 10, 11) [12%]
  • They will have to give three research paper presentations (Weeks 5, 9 and 12) [18%]
  • They will have to complete 3 assignments [30%]
  • They will have to propose and carry out the research for their final project. Suggestions for potential projects are given below, but the student is welcome to pick his/her own topic. Project proposals will be due in mid-semester. [40%]

Topics Covered

  • Week 1: Review of Machine Learning's main concepts


Review of Tom Mitchell's and/or Witten & Frank's textbook

  • Week 2: Current approaches for the evaluation of Machine Learning and their shortcomings


Drummond, 2006: “Machine Learning as an Experimental Science (Revisited)”, 2006 AAAI-Workshop on Evaluation Methods for Machine Learning I

Japkowicz, 2006: “Why Question Machine Learning Evaluation Methods?”, AAAI-Workshop on Evaluation Methods for Machine Learning I

Japkowicz and Drummond, 2008 (Draft): Warning: Statistical Benchmarking is Addictive. Kicking the Habit in Machine Learning

David Hand, 2006: “Classifier Technology and the Illusion of Progress”, Statistical Science, vol. 21, no. 1, pp. 1-15.

  • Week 3: Evaluation Metrics I: ROC Analysis / Cost Curves


Provost, F., Fawcett, T., and Kohavi, R. (1998): The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 43-48.

Davis & Goadrich, 2006: “The Relationship between Precision-Recall and ROC Curves”, ICML-2006

Corinna Cortes and Mehryar Mohri: “AUC Optimization vs. Error Rate Minimization”, NIPS 2004



Rich Caruana and Alexandru Niculescu-Mizil: “Predicting Good Probabilities with Supervised Learning”, ICML-05

Luke Hope and Kevin Korb: “A Bayesian Metric for Evaluating Machine Learning Algorithms”, The Australasian 2004 Conference

J. Huang and C.X. Ling: “Constructing New and Better Evaluation Measures for Machine Learning”, The Twentieth International Joint Conference on Artificial Intelligence (IJCAI 2007).


Clay Helberg: “Pitfalls of Data Analysis (or How to Avoid Lies and Damned Lies)”


Chong Ho Yu: “Resampling Methods: Concepts, Applications, and Justification”, Practical Assessment, Research and Evaluation, 8(19).

Dietterich, 1998: “Approximate statistical tests for comparing supervised classification learning algorithms”, Neural Computation 10, pp. 1895-1923.

T. G. Dietterich and E. B. Kong, 1995: “Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms”

Evgeniou, T., Pontil, M. and Elisseeff, A., 2004: “Leave-one-out Error, stability, and generalization of voting combination of classifiers”, Machine Learning, 65 (1): 95-130

Kohavi, R., 1995: “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection”, Proceedings of the 1995 International Joint Conference on Artificial Intelligence.

Y. Bengio and Y. Grandvalet, 2004: “No unbiased estimator of the variance of k-fold cross-validation”, Journal of Machine Learning Research: 5, pp. 1089-1105.



Ioannidis, 2005: “Why Most Published Research Findings are False”

R. R. Bouckaert, 2003: “Choosing between two learning algorithms based on calibrated tests”, Proceedings of the Twentieth International Conference on Machine Learning


J. Demšar, 2006: “Statistical Comparisons of Classifiers over Multiple Data Sets”, JMLR: 7, pp. 1-30


Gascuel, O. and Caraux, 1992: “Statistical Significance in Inductive Learning”, Proceedings of the 1992 European Conference on Artificial Intelligence.



Mukherjee et al., 2003: “Permutation Tests for Classification”


Salzberg, S.L., 1999: “On Comparing Classifiers: A Critique of Current Research and Methods”, Data Mining and Knowledge Discovery: 1, pp. 1-12.


Bay, Kibler, Pazzani, Smyth, 2000: “The UCI KDD Archive of Large Data Sets for Data Mining Research and Experimentation”, SIGKDD Explorations, 2(2): 14.





Marcus Hutter, 2007: “The Loss Rank Principle for Model Selection”


Arlot et al., 2007: “Re-sampling-based confidence regions and multiple-tests for a correlated random vector”

Hesterberg, 2004: “Unbiasing the Bootstrap – Bootknife Sampling versus Smoothing”

Mason and Newton, 1992: “A Rank Statistics approach to the consistency of a general bootstrap”



A compilation of selected research papers from the recent literature.


Machine Learning


Course Support:

Machine Learning Resources on the Web:


Class Notes and Assignments

Class notes and assignments are available here