Projects for undergraduate students -- CSI 4900 Fall 2003 and Winter 2004


1. Project code: inkpen1
    Title: Acquisition of collocations from the Web
    Status: taken 
    Description:  In natural language, some words are often used
together, while others sound awkward when combined. This is called the
collocational behaviour of words. This behaviour can be learned
from a large collection of texts. In particular, we will use the Web, the
biggest source of text available. The tool will implement a method I developed
for differentiating between the collocations of the words in a group. For example,
if the group is composed of the synonyms:  "blunder", "mistake", "slip", 
the tool will learn that "serious mistake" and "serious blunder" are good 
collocations, while "serious slip" is not. Other groups could contain 
words that sound similar, such as "night" and "knight"; the tool will 
discover that they collocate with different words.  The method is based on 
counting the frequency of words and pairs of words on the Web. It also
requires extracting fragments of text from the Web, because some words are
ambiguous: "blunder", for example, can be a noun or a verb, so we need to
run a part-of-speech tagger on the fragments before counting the
occurrences of the noun "blunder". The tool is useful
in the automatic choice of words in several natural language processing 
applications: natural language generation, machine translation, and speech
understanding. The tool will be implemented in Perl, C++, or Java (to be
determined).
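
To make the counting idea concrete, here is a minimal sketch in Python. It
is not the method the project will implement: the description above does
not fix a scoring formula, so pointwise mutual information (PMI) is used
here as a common stand-in. The function names and all the counts are
hypothetical toy values, not real Web counts.

```python
import math


def pmi(pair_count, word1_count, word2_count, total):
    """Pointwise mutual information of a word pair, computed from raw counts."""
    if pair_count == 0 or word1_count == 0 or word2_count == 0:
        return float("-inf")  # the pair was never observed together
    p_pair = pair_count / total
    p_word1 = word1_count / total
    p_word2 = word2_count / total
    return math.log2(p_pair / (p_word1 * p_word2))


def rank_collocations(modifier, group, word_counts, pair_counts, total):
    """Score 'modifier + word' for every word in the group, best first."""
    scores = {
        word: pmi(
            pair_counts.get((modifier, word), 0),
            word_counts[modifier],
            word_counts[word],
            total,
        )
        for word in group
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy counts, chosen only to illustrate the computation.
TOTAL = 1_000_000
WORD_COUNTS = {"serious": 5000, "mistake": 4000, "blunder": 500, "slip": 3000}
PAIR_COUNTS = {
    ("serious", "mistake"): 400,
    ("serious", "blunder"): 60,
    ("serious", "slip"): 1,
}

ranking = rank_collocations(
    "serious", ["blunder", "mistake", "slip"], WORD_COUNTS, PAIR_COUNTS, TOTAL
)
for word, score in ranking:
    print(f"serious {word}: {score:.2f}")
```

With these toy counts, "serious mistake" and "serious blunder" receive
positive scores and "serious slip" a negative one. In the actual tool the
counts would come from the Web, after part-of-speech tagging.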
      

2. Project code: inkpen2
    Title: Language models for the texts of the Web 
    Status: available
    Description:   A language model reflects the distribution of the words in
a large collection of texts. It computes the probabilities of occurrence of
individual words (unigrams) and pairs of consecutive words (bigrams). 
There are tools that compute language models for a given collection of
texts. This project will modify such a tool to work with word co-occurrence
counts collected from the Web. In this way, the probabilities of rare words
will be computed more accurately. The implementation will be done in C++,
Java, or Perl (to be determined). 
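
As an illustration of what such a tool computes, here is a minimal sketch in
Python of unigram and bigram probabilities estimated from raw counts
(maximum-likelihood estimates, with no smoothing). The function names and
the one-sentence corpus are hypothetical; the project itself would start
from an existing tool and replace the local counts with counts collected
from the Web, which this sketch does not attempt.

```python
from collections import Counter


def train(tokens):
    """Count unigrams, bigrams, and the words that start a bigram."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])  # every token that has a successor
    return unigrams, bigrams, contexts


def unigram_prob(word, unigrams):
    """P(word) = count(word) / total number of tokens."""
    total = sum(unigrams.values())
    return unigrams[word] / total if total else 0.0


def bigram_prob(first, second, bigrams, contexts):
    """P(second | first) = count(first second) / count(first as a context)."""
    if contexts[first] == 0:
        return 0.0  # 'first' was never followed by anything
    return bigrams[(first, second)] / contexts[first]


# Hypothetical toy corpus.
tokens = "the cat sat on the mat".split()
unigrams, bigrams, contexts = train(tokens)

print(unigram_prob("the", unigrams))
print(bigram_prob("the", "cat", bigrams, contexts))
print(bigram_prob("the", "dog", bigrams, contexts))
```

An unseen pair such as "the dog" gets probability zero here. This is the
kind of estimate for rare events that larger counts are meant to improve.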


3. Project code: inkpen3
    Title: Acquisition of near-synonyms from a large collection of texts
    Status: available
    Description:  Near-synonyms are words that have the same core meaning but
differ in nuances of meaning. An example of a group of near-synonyms is "error",
"mistake", "blunder", "slip", "blooper". Existing lists of near-synonyms do
not cover all the words of the English language. The goal of this project
is to automatically acquire near-synonyms from texts. This can be done by
extracting verbs that have similar syntactic behaviour (the same arguments
in parse trees). Then, nouns and adjectives that occur with similar verbs
will be extracted. This method will collect near-synonyms and other
related words. Antonyms will be filtered out. The implementation will be
done in Perl.
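
The first step, finding verbs with similar syntactic behaviour, could be
sketched as follows (in Python for brevity, although the project itself
will use Perl). The description above does not say how similarity is
measured, so the Jaccard overlap of argument sets and the threshold are
assumptions. The verb-argument table is hypothetical toy data standing in
for arguments read from parse trees.

```python
from itertools import combinations


def jaccard(first, second):
    """Overlap of two sets: size of intersection over size of union."""
    if not first and not second:
        return 0.0
    return len(first & second) / len(first | second)


def similar_verb_pairs(verb_args, threshold):
    """Return (verb1, verb2, score) for verb pairs sharing enough arguments."""
    pairs = []
    for verb1, verb2 in combinations(sorted(verb_args), 2):
        score = jaccard(verb_args[verb1], verb_args[verb2])
        if score >= threshold:
            pairs.append((verb1, verb2, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical toy data: each verb mapped to the nouns seen as its object.
VERB_ARGS = {
    "make": {"mistake", "error", "blunder"},
    "commit": {"error", "blunder", "crime"},
    "eat": {"apple", "bread"},
}

pairs = similar_verb_pairs(VERB_ARGS, threshold=0.4)
for verb1, verb2, score in pairs:
    print(f"{verb1} ~ {verb2}: {score:.2f}")
```

The nouns shared by a similar pair of verbs (here "error" and "blunder")
are then candidates for the second step.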