ICML 2008 Workshop Proposal:
The Third Workshop on Evaluation Methods for Machine Learning

William Klement, Chris Drummond, Nathalie Japkowicz and Sofus Macskassy

1. Description of the topic
===========================

The workshop would be the third in a series, the previous ones having
taken place at AAAI over the past two years. Our continuing goal is to
encourage debate within the machine learning community on how we
experimentally evaluate new algorithms. The earlier workshops [8,9]
were successful in that they began the process of presenting, and
discussing, new ideas for evaluation. However, they did not raise all
the high-level questions we believe must be addressed by the
community. For this reason, we decided to change the format of the
workshop in the following two ways:

* First, by holding it at ICML. With access to a much larger group of
  ML researchers, we expect to hear from many more voices with an
  interesting take on the issue.

* Second, by soliciting position papers rather than research papers.
  This way, instead of getting lost in the nitty-gritty details of
  particular new evaluation methods, we can address the important,
  high-level issues surrounding machine learning evaluation.

Here we list some of the issues raised in previous workshops, along
with others we feel warrant further discussion. For each issue, we
also suggest the range of positions that might be taken:

* Should we substantially change how evaluation is performed in
  machine learning?
  o Yes: There are serious problems with how we do things, making our
    experimental results of questionable value.
  o No: We largely have the methodology worked out. A new metric and
    a few more data sets should solve any outstanding problems.

* Is evaluation the fundamental role of experiments?
  o Yes: Careful experimental evaluation is critical for separating
    good ideas from bad ones.
  o No: Experiments should be used to explore ideas, discover
    relationships, and compare alternatives; testing hypotheses is
    only a small part of the process.

* Should we use one well-chosen, community-wide evaluation measure
  (e.g., accuracy, AUC, F-measure)?
  o Yes: It gives a clear and definitive answer as to which algorithm
    is best.
  o No: Algorithms have many different, and important, properties.
    More than a few measures are needed to capture them all (a brief
    illustration follows at the end of this section).

* Are statistical tests critical to evaluation?
  o Yes: They are critical, but we need a deeper understanding of
    their meaning.
  o No: Their value is considerably overstated; they are of limited
    merit.

* Are the UCI data sets sufficient for evaluation?
  o Yes: They represent many practical problems, and new data sets,
    covering new domains, are added all the time.
  o No: They are not a representative sample of possible problems.
    Many more data sets, plus artificial data, are needed to explore
    all the variations that may occur.

This list certainly does not capture all the issues worthy of
discussion, nor all the possible positions. We expect, and very much
encourage, position papers raising other issues that members of the
machine learning community consider important.
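To make the measure-selection issue concrete, here is a minimal
sketch of how accuracy, F-measure, and AUC can disagree about which
of two classifiers is better on the same test set. The sketch is in
Python with scikit-learn (a library that postdates this proposal),
and the two classifiers and their scores are hypothetical,
constructed purely for illustration:

    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

    y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

    # Classifier A (hypothetical): well calibrated around the 0.5
    # threshold, but its ranking is imperfect (one positive scores
    # below one negative).
    scores_a = np.array([0.90, 0.80, 0.70, 0.60, 0.40,
                         0.55, 0.30, 0.20, 0.10, 0.05])

    # Classifier B (hypothetical): ranks every positive above every
    # negative (perfect AUC), but is poorly calibrated, so
    # thresholding at 0.5 hurts it.
    scores_b = np.array([0.55, 0.44, 0.43, 0.42, 0.41,
                         0.40, 0.30, 0.20, 0.10, 0.05])

    for name, scores in [("A", scores_a), ("B", scores_b)]:
        y_pred = (scores >= 0.5).astype(int)  # hard labels at 0.5
        print(name,
              "accuracy=%.2f" % accuracy_score(y_true, y_pred),
              "F1=%.2f" % f1_score(y_true, y_pred),
              "AUC=%.2f" % roc_auc_score(y_true, scores))

Accuracy (0.80 vs 0.60) and F-measure (0.80 vs 0.33) favour
classifier A, while AUC (0.96 vs 1.00) favours classifier B; no
single number settles which is "best".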
2. Timeliness of the topic
==========================

The timeliness of the topic is supported by the success of the two
workshops previously held at AAAI'06 [9] and AAAI'07 [8]. Attendees
all agreed that evaluation is a serious issue that we do not yet know
how to address. Other related workshops have been held recently: the
ICML'06 workshop "ROC Analysis in ML" [5], the NIPS'04 workshop
"Verification, Validation, and Testing of Learning Systems" [7], and
the NIPS'06 workshop "Testing of Deployable Learning and Decision
Systems" [6]. In addition, a number of papers have recently appeared
questioning the usefulness of our current research in machine
learning (papers by Hand [3], Huang and Ling [4], Caruana and
Niculescu-Mizil [1], Flach [2], and so on), which shows the need for
a community-wide discussion of these issues.

3. Proposed Format
==================

The format will be quite different from that of previous years.
Rather than requesting research papers, we will invite a number of
prominent machine learning researchers to write position papers on
the questions mentioned above, or related ones. At the same time, we
will open the workshop to everyone in the field, with a call for
position papers on the same topics. We will gather all these papers
and invite their authors to present their ideas in a brief session;
the workshop organizers will present the ideas of authors who are
unable to attend. We will then organize three round tables debating
the main points in evaluation:

* Metrics
* Sampling and Statistical Tests
* Data Sets

The audience will also be able (and expected) to participate in these
debates.

4. Length
=========

We are envisioning a one-day workshop.

References
==========

[1] Caruana, R. and Niculescu-Mizil, A., "Data Mining in Metric
    Space: An Empirical Analysis of Supervised Learning Performance
    Criteria". In Proceedings of the 10th ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining (KDD'04), 2004.

[2] Flach, P.A., "Putting Things in Order: On the Fundamental Role of
    Ranking in Classification and Probability Estimation" (invited
    talk). In Proceedings of the 18th European Conference on Machine
    Learning and the 11th European Conference on Principles and
    Practice of Knowledge Discovery in Databases, J.N. Kok, J.
    Koronacki, R. Lopez de Mantaras, S. Matwin, D. Mladenic and A.
    Skowron (eds.), ISBN 978-3-540-74975-2, pp. 2-3, September 2007.

[3] Hand, D., "Classifier Technology and the Illusion of Progress".
    Statistical Science, Vol. 21, No. 1, pp. 1-15, 2006.

[4] Huang, J. and Ling, C.X., "Constructing New and Better Evaluation
    Measures for Machine Learning". In Proceedings of the 20th
    International Joint Conference on Artificial Intelligence (IJCAI
    2007), pp. 859-864.

[5] The 3rd Workshop on ROC Analysis in ML (ICML'06).
    http://www.dsic.upv.es/~flip/ROCML2006/

[6] Testing of Deployable Learning and Decision Systems (NIPS'06).
    http://www.dmargineantu.net/NIPS06-TDLDS/

[7] Workshop on Verification, Validation, and Testing of Learning
    Systems (NIPS'04). http://www.dmargineantu.net/nips2004/

[8] The AAAI-07 Workshop on Evaluation Methods for Machine Learning II.
    http://www.site.uottawa.ca/~welazmeh/conferences/AAAI-07/workshop/

[9] The AAAI-06 Workshop on Evaluation Methods for Machine Learning.
    http://www.site.uottawa.ca/~welazmeh/conferences/AAAI-06/workshop/