FEATURE ENGINEERING FOR TEXT CLASSIFICATION

Sam Scott
Cognitive Science (IIS)
Carleton University
Ottawa, ON, K1S 5B6 (Canada)
sscott2@chat.carleton.ca

Stan Matwin
Computer Science Department
University of Ottawa
Ottawa, ON, K1N 6N5 (Canada)
stan@site.uottawa.ca

Abstract

Most research in text classification has used the "bag of words" representation of text. This paper examines alternative ways to represent text based on syntactic and semantic relationships between words (phrases, synonyms, and hypernyms). We describe the new representations and explain why we expected them to improve the performance of a rule-based learner. The representations are evaluated using the RIPPER rule-based learner on the Reuters-21578 and DigiTrad test corpora, but on their own the new representations do not produce a significant performance improvement. Finally, we combine classifiers based on different representations using a majority voting technique. This step does produce some performance improvement on both test collections. In general, our work supports the emerging consensus in the information retrieval community that more sophisticated Natural Language Processing techniques must be developed before better text representations can be produced. We conclude that, for now, research into new learning algorithms and methods for combining existing learners holds the most promise.

Keywords: text classification, feature engineering, text representations, phrases, synonyms, hypernyms, WordNet, RIPPER, semantic relationships
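The combination step summarized above can be pictured with a small sketch. The following Python snippet is illustrative only, not the authors' implementation: it assumes each base classifier is a callable returning a binary label for a document (one classifier per text representation) and combines them by simple majority vote.

    from collections import Counter

    def majority_vote(classifiers, document):
        """Combine binary text classifiers trained on different
        representations by simple majority voting.

        `classifiers` is any sequence of callables, each mapping a
        document to a label (e.g. "relevant" / "not relevant").
        These names are illustrative assumptions, not the paper's API.
        """
        votes = Counter(clf(document) for clf in classifiers)
        # The label predicted by the most classifiers wins; with an
        # odd number of binary classifiers no ties can occur.
        return votes.most_common(1)[0][0]

    # Hypothetical usage: predict_bow, predict_phrases, and
    # predict_hypernyms would be RIPPER-style rule sets learned
    # from the bag-of-words, phrase, and hypernym representations.
    # label = majority_vote(
    #     [predict_bow, predict_phrases, predict_hypernyms], doc_text)

With an odd number of voters and binary labels, the vote is always decisive, which is one reason simple majority voting is a convenient way to combine classifiers built from different representations.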
