Noun-modifier relations data
Why this document?
I have received many requests in the past few years for my tagged set
of noun-modifier pairs. This document explains the data, the tag set
used, the format of my data file, how you can get it, and the
conditions attached.
The data
The file contains 600 tagged base noun phrases (modifier-noun pairs).
These phrases were collected from Judith Levi's "The syntax and
semantics of complex nominals" (1978) (manually), Nancy Larrick's "The
junior science book of rain, hail, sleet and snow" (1961)
(automatically), and SemCor, the version annotated with WordNet 1.6
senses (semi-automatically). Some examples were constructed and added
for relations that were infrequent in these texts. The examples that
were not extracted from SemCor were manually annotated with WordNet 1.6
senses (which was the rage at the time). All the pairs were annotated
with semantic relations from a list of 47 (only 30 of these relations
have instances in this data set), which I describe briefly in the next
section. There is also a second file which contains the same data,
annotated with 5 coarser relation tags (really relation classes:
causal, temporal, spatial, participant, quality).
The set of semantic relations
While there is no consensus on a comprehensive list of semantic
relations, the one we used contains 47 relatively generic relations (in
the sense that they are not domain specific), which are necessary and
sufficient for the analysis of pairs extracted from semi-technical
texts (Ken Barker showed this in his PhD thesis). You can see the list
of relations with examples here.
If you are curious to know how this list was developed, here is the
short story: Ken Barker developed three lists of relations for three
separate syntactic levels (clause level, intra-clause (cases) and noun
phrase) based on the literature on semantic relations at the time
(around 1997). I then combined these three lists (by aligning, grouping
and splitting) so that the same set of relations covers phenomena
at all three syntactic levels. If you want the long story, check Ken's thesis
and my thesis.
Someone to whom I gave this set asked me why certain relations were
assigned to certain pairs when other relations also looked like
possible options. First of all, it is our premise that one and only one
relation should be assigned to a pair of units (words, clauses, etc.).
There are ambiguities, that is true. However, when this set was
annotated, each pair was discussed by two judges, and the relation that
seemed most appropriate was assigned.
I was asked, for example, why concert hall is assigned the PURPOSE
relation and why LOCATION is not a better choice. The reason is that
you would call something a concert hall if it is a place designed for
the purpose of holding concerts, even though other events may also take
place there. A room, or hall, where concerts are occasionally held is
not necessarily a concert hall.
The file format
The data is in Prolog format, as facts that give information for modifier-noun pairs
(base NPs):
rel(nmr,HeadNoun,HeadInfo,Modifier,ModifierInfo,Relation).
The name of each variable is self-explanatory. Relation is one of the 47
relations in our list. For each word in the pair there are two pieces
of additional information (both HeadInfo and ModifierInfo contain
them):
[PartOfSpeech,WordNet1.6_sense]
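If you prefer to read the facts outside Prolog, something along the following lines may work in Python. This is only a minimal sketch, not part of the distributed data. It assumes one fact per line, that nmr is a literal first argument, and that words and relation names are simple atoms without commas or brackets. The sample fact (its sense numbers and the lowercase relation name) and the names NounModifierPair, FACT_RE and parse_facts are made up for illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class NounModifierPair(NamedTuple):
    """One rel(nmr,...) fact (hypothetical container, not from the data files)."""
    head: str
    head_pos: str
    head_sense: str
    modifier: str
    modifier_pos: str
    modifier_sense: str
    relation: str


# Assumed layout: rel(nmr,HeadNoun,[POS,Sense],Modifier,[POS,Sense],Relation).
FACT_RE = re.compile(
    r"""^\s*rel\(\s*nmr\s*,
        \s*(?P<head>[^,\[\]]+?)\s*,
        \s*\[\s*(?P<head_pos>[^,\[\]]+?)\s*,\s*(?P<head_sense>[^,\[\]]+?)\s*\]\s*,
        \s*(?P<modifier>[^,\[\]]+?)\s*,
        \s*\[\s*(?P<modifier_pos>[^,\[\]]+?)\s*,\s*(?P<modifier_sense>[^,\[\]]+?)\s*\]\s*,
        \s*(?P<relation>[^,\[\]()]+?)\s*\)\s*\.\s*$""",
    re.VERBOSE,
)


def parse_facts(lines: Iterable[str]) -> Iterator[NounModifierPair]:
    """Yield one NounModifierPair per matching line; skip everything else."""
    for line in lines:
        match = FACT_RE.match(line)
        if match is None:
            continue  # blank lines, Prolog comments, other clauses
        # Drop single quotes around quoted atoms, if any.
        fields = {name: value.strip("'") for name, value in match.groupdict().items()}
        yield NounModifierPair(**fields)


# Made-up sample fact; the sense numbers and relation spelling are assumptions.
sample = ["rel(nmr,hall,[n,1],concert,[n,1],purpose)."]
for pair in parse_facts(sample):
    print(pair.modifier, pair.head, pair.relation)
```

Sense values are kept as strings because the sketch makes no assumption about how they are written in the files. Lines that do not match the assumed layout are skipped silently, so it is worth comparing the number of parsed pairs with the 600 expected.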
How you can get the data, and the conditions attached
You can get the data annotated with 47 relations here, or annotated
with 5 general relations here.
Citing/References
Vivi Nastase, Jelber Sayyad-Shirabad, Marina Sokolova, Stan Szpakowicz, Learning noun-modifier semantic relations with corpus-based and WordNet-based features, AAAI 2006
Vivi Nastase and Stan Szpakowicz, Exploring Noun-Modifier Semantic Relations, IWCS 2003