Projects for undergraduate students -- CSI 4900 Fall 2003 and Winter 2004

1. Project code: inkpen1
Title: Acquisition of collocations from the Web
Status: taken
Description: In natural language, some words are often used together, while other words sound awkward when used together. This is called the collocational behaviour of words. This behaviour can be learned from a large collection of texts; this project will use the Web, the largest source of text available. The tool will implement a method I developed for differentiating between the collocations of the words in a group. For example, if the group is composed of the synonyms "blunder", "mistake", and "slip", the tool will learn that "serious mistake" and "serious blunder" are good collocations, while "serious slip" is not. Other groups could contain words that sound similar, such as "night" and "knight"; the tool will discover that they collocate with different words. The method is based on counting the frequency of words and pairs of words on the Web. It also requires extracting fragments of text from the Web. For example, "blunder" can be a noun or a verb, so we need to run a part-of-speech tagger on the extracted fragments before counting the occurrences of the noun "blunder". The tool is useful for the automatic choice of words in several natural language processing applications: natural language generation, machine translation, and speech understanding. The tool will be implemented in Perl, C++, or Java (to be decided).

2. Project code: inkpen2
Title: Language models for the texts of the Web
Status: available
Description: A language model reflects the distribution of the words in a large collection of texts. It computes probabilities of occurrence of individual words (unigrams) and pairs of consecutive words (bigrams). There are tools that compute language models for a given collection of texts. This project will modify such a tool to work with word co-occurrence counts collected from the Web.
In this way, the probabilities of rare words will be computed more accurately. The implementation will be done in C++, Java, or Perl (to be determined).

3. Project code: inkpen3
Title: Acquisition of near-synonyms from a large collection of texts
Status: available
Description: Near-synonyms are words that have the same core meaning but differ in nuances of meaning. An example of a group of near-synonyms is "error", "mistake", "blunder", "slip", "blooper". Existing lists of near-synonyms do not cover all the words of the English language. The goal of this project is to automatically acquire near-synonyms from texts. This can be done by extracting verbs that have similar syntactic behaviour (the same arguments in parse trees). Then, nouns and adjectives that occur with similar verbs will be extracted. This method will collect near-synonyms and other related words. Antonyms will be filtered out. The implementation will be done in Perl.
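
To make the first step of inkpen3 more concrete, here is a minimal sketch of grouping verbs that take the same arguments. It is an illustration only, not the method the project prescribes: it assumes the parser output has already been reduced to (verb, relation, argument) triples, it uses Jaccard overlap as a stand-in similarity measure, and the function names, the threshold, and the toy data are all hypothetical. Python is used here only for brevity; the project itself is to be implemented in Perl. Antonym filtering is not shown.

```python
from collections import defaultdict


def argument_profiles(triples):
    """Map each verb to the set of (relation, argument) pairs seen with it."""
    profiles = defaultdict(set)
    for verb, relation, argument in triples:
        profiles[verb].add((relation, argument))
    return dict(profiles)


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_verbs(profiles, threshold=0.5):
    """Return (verb1, verb2, score) for verb pairs whose argument sets overlap.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    verbs = sorted(profiles)
    pairs = []
    for i, v1 in enumerate(verbs):
        for v2 in verbs[i + 1:]:
            score = jaccard(profiles[v1], profiles[v2])
            if score >= threshold:
                pairs.append((v1, v2, score))
    return pairs


# Toy, hand-written triples standing in for real parser output (assumption).
TRIPLES = [
    ("make", "obj", "mistake"),
    ("make", "obj", "error"),
    ("make", "obj", "blunder"),
    ("commit", "obj", "error"),
    ("commit", "obj", "blunder"),
    ("commit", "obj", "mistake"),
    ("eat", "obj", "apple"),
]

if __name__ == "__main__":
    for v1, v2, score in similar_verbs(argument_profiles(TRIPLES)):
        print(v1, v2, score)
```

On the toy data, "commit" and "make" share all three object arguments and are paired, while "eat" is not. The second step described above would then collect the nouns and adjectives that occur with such verb pairs as candidate near-synonyms.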