Producing an Evaluation Corpus
Manual tagging is difficult and time-consuming. Unlike part-of-speech tagging, it cannot be done automatically.
SEMCOR is a 200,000-word corpus manually tagged by lexicographers as part of the WordNet project.
SENSUS, a large-scale ontology, was used to map the WordNet tags onto LDOCE tags.
Because part of the corpus consists of proper names and function words, and because some of the mappings from WordNet to LDOCE were not effective, the SEMCOR corpus yielded 36,869 tagged words.
This corpus, however, was still larger than those usually used in Word Sense Disambiguation research, and it was more systematic and reliable than it would have been had the authors tagged it themselves.
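The filtering described above can be illustrated with a minimal sketch. It is not the procedure actually used to build the corpus: the `Token` type, the `map_corpus` function, and the sense identifiers are all hypothetical, and a plain dictionary stands in for the SENSUS-based mapping. It only shows why tokens are lost: proper names, untagged function words, and senses with no LDOCE counterpart are dropped.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    """Hypothetical corpus token; not the real SEMCOR format."""
    word: str
    wordnet_sense: Optional[str]  # None for untagged tokens such as function words
    is_proper_name: bool = False


def map_corpus(tokens: List[Token], sense_map: Dict[str, str]) -> List[Tuple[str, str]]:
    """Keep only tokens whose WordNet sense maps to an LDOCE sense."""
    mapped = []
    for tok in tokens:
        # Proper names and untagged (function) words are discarded.
        if tok.is_proper_name or tok.wordnet_sense is None:
            continue
        # The dictionary lookup stands in for the ontology-based mapping.
        ldoce_sense = sense_map.get(tok.wordnet_sense)
        if ldoce_sense is None:
            continue  # mapping was not effective for this sense
        mapped.append((tok.word, ldoce_sense))
    return mapped


# Toy data with made-up sense identifiers.
toy_map = {"bank%1": "bank_1_1"}
toy_tokens = [
    Token("the", None),
    Token("London", "london%1", is_proper_name=True),
    Token("bank", "bank%1"),
    Token("interest", "interest%3"),  # no LDOCE counterpart in toy_map
]
result = map_corpus(toy_tokens, toy_map)
print(result)
```

In this toy run only one of the four tokens survives, which mirrors on a small scale how a 200,000-word corpus can shrink to a much smaller set of usable tagged words.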