Abstract

Building a Lexical Knowledge-Base of Near-Synonym Differences
Diana Inkpen
Doctor of Philosophy
Department of Computer Science
University of Toronto
2004

Current natural language generation or machine translation systems cannot distinguish among near-synonyms: words that share the same core meaning but vary in their lexical nuances. This is due to a lack of knowledge about differences between near-synonyms in existing computational lexical resources. The goal of this thesis is to automatically acquire a lexical knowledge-base of near-synonym differences (LKB of NS) from multiple sources, and to show how it can be used in a practical natural language processing system.

I designed a method to automatically acquire knowledge from dictionaries of near-synonym discrimination written for human readers. An unsupervised decision-list algorithm learns patterns and words for classes of distinctions. The patterns are learned automatically, followed by a manual validation step; the subsequent extraction of distinctions between near-synonyms is entirely automatic. The main types of distinctions are stylistic (for example, "inebriated" is more formal than "drunk"), attitudinal (for example, "skinny" is more pejorative than "slim"), and denotational (for example, "blunder" implies "accident" and "ignorance", while "error" does not).

I enriched the initial LKB of NS with information extracted from other sources. First, information about the senses of the near-synonyms (WordNet senses) was added; the other near-synonyms in the same dictionary entry and the text of the entry provide a strong context for disambiguation. Second, knowledge about the collocational behaviour of the near-synonyms was acquired from free text. Collocations between a word and the near-synonyms in a dictionary entry were classified into preferred collocations, less-preferred collocations, and anti-collocations. Third, knowledge about distinctions between near-synonyms was acquired from machine-readable dictionaries (the General Inquirer and the Macquarie Dictionary). These distinctions were merged with the initial LKB of NS, and inconsistencies were resolved.

The generic LKB of NS needs to be customized in order to be used in a natural language processing system. The parts that need customization are the core denotations and the strings that describe peripheral concepts in the denotational distinctions. To show how the LKB of NS can be used in practice, I present Xenon, a natural language generation system that chooses the near-synonym that best matches a set of input preferences. I implemented Xenon by adding a near-synonym choice module and a near-synonym collocation module to an existing general-purpose surface realizer.
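The three-way classification of collocational behaviour mentioned above can be pictured as a thresholding decision over how often a near-synonym co-occurs with a candidate collocate in a corpus. The Python sketch below is illustrative only: the function name classify_collocations, the use of raw co-occurrence counts, and the numeric thresholds are assumptions made for exposition, not the association measures actually used in the thesis.

```python
def classify_collocations(cooccurrence_counts, strong_threshold=50, weak_threshold=5):
    """Toy three-way classification of (near-synonym, collocate) pairs.

    cooccurrence_counts: dict mapping (near_synonym, collocate) -> corpus count.
    The thresholds are arbitrary placeholders; a real system would use a
    statistical association measure rather than raw counts.
    """
    labels = {}
    for pair, count in cooccurrence_counts.items():
        if count >= strong_threshold:
            labels[pair] = "preferred collocation"
        elif count >= weak_threshold:
            labels[pair] = "less-preferred collocation"
        else:
            labels[pair] = "anti-collocation"
    return labels


# Hypothetical counts for two near-synonyms with the same collocate.
counts = {("error", "spelling"): 120, ("blunder", "spelling"): 1}
print(classify_collocations(counts))
# {('error', 'spelling'): 'preferred collocation',
#  ('blunder', 'spelling'): 'anti-collocation'}
```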