ICML 2008 Workshop Proposal:
The Third Workshop on Evaluation Methods for Machine Learning

William Klement, Chris Drummond, Nathalie Japkowicz and Sofus Macskassy

1. Description of the topic
===========================

The workshop would be the third in a series, the previous ones having
taken place at AAAI over the past two years. Our continuing goal is to
encourage debate within the machine learning community on how we
experimentally evaluate new algorithms. The earlier workshops [8,9]
were successful in that they began the process of presenting, and
discussing, new ideas for evaluation. However, they did not raise all
the high-level questions we believe must be addressed by the
community. For this reason, we decided to change the format of the
workshop in the following two ways:

* First, by holding it at ICML. With access to a much larger group of
  ML researchers, we expect to hear from many more voices with an
  interesting take on the issue.

* Second, by soliciting position papers rather than research papers.
  This way, instead of getting lost in the nitty-gritty details of
  particular new evaluation methods, we can address the important,
  high-level issues surrounding machine learning evaluation.

Here we list some of the issues raised in previous workshops, along
with others we feel warrant further discussion. For each issue, we
also suggest the range of positions that might be taken:

* Should we substantially change how evaluation is performed in
  machine learning?
  o Yes: There are serious problems with how we do things, making our
    experimental results of questionable value.
  o No: We largely have the methodology worked out. A new metric and
    a few more data sets should solve any outstanding problems.

* Is evaluation the fundamental role of experiments?
  o Yes: Careful experimental evaluation is critical for separating
    good ideas from bad ones.
  o No: Experiments should be used to explore ideas, discover
    relationships, and compare alternatives; testing hypotheses is
    only a small part of the process.

* Should we use one well-chosen, community-wide evaluation measure
  (e.g., accuracy, AUC, F-measure)?
  o Yes: It gives a clear and definitive answer as to which algorithm
    is best.
  o No: Algorithms have many different, and important, properties.
    More than a few measures are needed to capture them all (a brief
    illustration follows at the end of this section).

* Are statistical tests critical to evaluation?
  o Yes: They are critical, but we need a deeper understanding of
    their meaning.
  o No: Their value is considerably overstated; they are of limited
    merit.

* Are the UCI data sets sufficient for evaluation?
  o Yes: They represent many practical problems, and new data sets,
    covering new domains, are added all the time.
  o No: They are not a representative sample of possible problems.
    Many more data sets, plus artificial data, are needed to explore
    all the variations that may occur.

This list certainly does not capture all the issues worthy of
discussion, nor all the possible positions. We expect, and very much
encourage, position papers raising other issues that members of the
machine learning community consider important.
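To make the measure-selection issue concrete, here is a minimal
sketch of how accuracy, F-measure, and AUC can disagree about which
of two classifiers is better on the same test set. The sketch is in
Python with scikit-learn (a library that postdates this proposal),
and the two classifiers and their scores are hypothetical,
constructed purely for illustration:

    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

    y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

    # Classifier A (hypothetical): well calibrated around the 0.5
    # threshold, but its ranking is imperfect (one positive scores
    # below one negative).
    scores_a = np.array([0.90, 0.80, 0.70, 0.60, 0.40,
                         0.55, 0.30, 0.20, 0.10, 0.05])

    # Classifier B (hypothetical): ranks every positive above every
    # negative (perfect AUC), but is poorly calibrated, so
    # thresholding at 0.5 hurts it.
    scores_b = np.array([0.55, 0.44, 0.43, 0.42, 0.41,
                         0.40, 0.30, 0.20, 0.10, 0.05])

    for name, scores in [("A", scores_a), ("B", scores_b)]:
        y_pred = (scores >= 0.5).astype(int)  # hard labels at 0.5
        print(name,
              "accuracy=%.2f" % accuracy_score(y_true, y_pred),
              "F1=%.2f" % f1_score(y_true, y_pred),
              "AUC=%.2f" % roc_auc_score(y_true, scores))

Accuracy (0.80 vs 0.60) and F-measure (0.80 vs 0.33) favour
classifier A, while AUC (0.96 vs 1.00) favours classifier B; no
single number settles which is "best".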
2. Timeliness of the topic
==========================

The timeliness of the topic is supported by the success of the two
workshops previously held at AAAI'06 [9] and AAAI'07 [8]. Attendees
all agreed that evaluation is a serious issue that we do not yet know
how to address. Other related workshops have been held recently: the
ICML'06 workshop "ROC Analysis in ML" [5], the NIPS'04 workshop
"Verification, Validation, and Testing of Learning Systems" [7], and
the NIPS'06 workshop "Testing of Deployable Learning and Decision
Systems" [6]. In addition, a number of papers have recently appeared
questioning the usefulness of our current research in machine
learning (papers by Hand [3], Huang and Ling [4], Caruana and
Niculescu-Mizil [1], Flach [2], and so on), which shows the need for
a community-wide discussion of these issues.

3. Proposed Format
==================

The format will be quite different from that of previous years.
Rather than requesting research papers, we will invite a number of
prominent machine learning researchers to write position papers on
the questions mentioned above, or related ones. At the same time, we
will open the workshop to everyone in the field, with a call for
position papers on the same topics. We will gather all these papers
and invite their authors to present their ideas in a brief session;
the workshop organizers will present the ideas of authors who are
unable to attend. We will then organize three round tables debating
the main points in evaluation:

* Metrics
* Sampling and Statistical Tests
* Data Sets

The audience will also be able (and expected) to participate in these
debates.

4. Length
=========

We are envisioning a one-day workshop.

References
==========

[1] Caruana, R. and Niculescu-Mizil, A., "Data Mining in Metric
    Space: An Empirical Analysis of Supervised Learning Performance
    Criteria". In Proceedings of the 10th ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining (KDD'04), 2004.

[2] Flach, P.A., "Putting Things in Order: On the Fundamental Role of
    Ranking in Classification and Probability Estimation" (invited
    talk). In Proceedings of the 18th European Conference on Machine
    Learning and the 11th European Conference on Principles and
    Practice of Knowledge Discovery in Databases, J.N. Kok, J.
    Koronacki, R. Lopez de Mantaras, S. Matwin, D. Mladenic and A.
    Skowron (eds.), ISBN 978-3-540-74975-2, pp. 2-3, September 2007.

[3] Hand, D., "Classifier Technology and the Illusion of Progress".
    Statistical Science, Vol. 21, No. 1, pp. 1-15, 2006.

[4] Huang, J. and Ling, C.X., "Constructing New and Better Evaluation
    Measures for Machine Learning". In Proceedings of the 20th
    International Joint Conference on Artificial Intelligence (IJCAI
    2007), pp. 859-864.

[5] The 3rd Workshop on ROC Analysis in ML (ICML'06).
    http://www.dsic.upv.es/~flip/ROCML2006/

[6] Testing of Deployable Learning and Decision Systems (NIPS'06).
    http://www.dmargineantu.net/NIPS06-TDLDS/

[7] Workshop on Verification, Validation, and Testing of Learning
    Systems (NIPS'04). http://www.dmargineantu.net/nips2004/

[8] The AAAI-07 Workshop on Evaluation Methods for Machine Learning II.
    http://www.site.uottawa.ca/~welazmeh/conferences/AAAI-07/workshop/

[9] The AAAI-06 Workshop on Evaluation Methods for Machine Learning.
    http://www.site.uottawa.ca/~welazmeh/conferences/AAAI-06/workshop/