Dear Colleague,

You have inquired about the ACL/DCI CD-ROM. This disk, as well
as much other material, is available from the Linguistic Data
Consortium (LDC). A current LDC catalog is appended. More information
is available by anonymous ftp from ftp.cis.upenn.edu in directory
/pub/ldc.

To place orders, or to ask other questions, please send email to
ldc@unagi.cis.upenn.edu, or call 215-898-0464. Note that it is possible
that your institution is already an LDC member; be sure to check this.

      Best wishes,

          Mark Liberman                 myl@unagi.cis.upenn.edu

          619 Williams Hall             
          University of Pennsylvania        Phone: 215-898-0141
          Philadelphia, PA 19104-6305       Fax:   215-573-2175
__________________________________________________________________________

\documentstyle[11pt]{article}
\setlength{\parskip}{0in}
\setlength{\oddsidemargin}{0in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{9.0in}

\begin{document}
\thispagestyle{empty}
\begin{center}
{\Large \bf Corpora Available from\\ The Linguistic Data
Consortium\\August 1994}
\vspace{0.2in}
\end{center}

\pagebreak

\tableofcontents

\pagebreak

\section{Introduction}
%\begin{bf} INTRODUCTION \end{bf}

Each section below describes a corpus or a set of corpora.  They are
listed by type (speech, text, lexicons, other) and within type by the
membership year (MY) in which they were released by LDC.  In the case
of series or sets of corpora, each individual corpus or segment is
described in a separate subsection.  The descriptions are not intended
to be complete; in most cases a more complete document can be found
by logging in to the ftp server at LDC: ftp.cis.upenn.edu, and
browsing the pub/ldc directory for a README file on the corpus.  

In the catalog, each description is followed by six items of
information:
\begin{itemize}
\item The name by which the corpus is generally known
\item The LDC catalog order number(s), explained below
\item The NIST Catalog numbers assigned to the discs, if they have 
ever been available through NIST or NTIS, otherwise ``NA''
\item The date and membership year of official release by LDC
\item The current price for nonmembers, if available
\item  Whether a separate license (User Agreement) is required
\end{itemize}

The {\bf LDC catalog order number} is a unique identifier for
convenience in referring to corpora, parts of corpora, and individual
discs as needed.  It is made up of the following:
\begin{itemize}
\item The letters ``LDC'' and two digits for the membership year (MY)
of release
\item A letter indicating whether the contents are mainly speech
``S'', text ``T'', lexicon ``L'', or other ``X''
\item A number designating the corpus
\item A letter, used only where needed, designating a subcorpus
or corpus segment, in case several segments or configurations of the
corpus can be ordered separately
\item After a hyphen, the number of the disc in the corpus or segment,
used only when needed to refer to individual discs
\item  After a period,  the revision number of the individual disc, 
used only where a revised version has been released
\end{itemize}
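
As an illustration only, the components listed above can be read off a
catalog number mechanically.  The following Python sketch is not an LDC
tool: the names {\tt CatalogNumber} and {\tt parse\_catalog\_number} are
hypothetical, and the pattern assumes that segment letters are upper
case and that a revision number appears only after a disc number.

\begin{verbatim}
import re
from typing import NamedTuple, Optional

# Type letters as defined in the list above.
KINDS = {"S": "speech", "T": "text", "L": "lexicon", "X": "other"}

# LDC + 2-digit MY + type letter + corpus number, then an
# optional segment letter, -disc, and .revision (assumed syntax).
PATTERN = re.compile(
    r"LDC(\d{2})([STLX])(\d+)([A-Z])?(?:-(\d+)(?:\.(\d+))?)?"
)


class CatalogNumber(NamedTuple):  # hypothetical name
    year: int
    kind: str
    corpus: int
    segment: Optional[str]
    disc: Optional[int]
    revision: Optional[int]


def parse_catalog_number(text):  # hypothetical helper
    """Split a catalog order number into its components."""
    match = PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a catalog number: %r" % text)
    year, kind, corpus, segment, disc, rev = match.groups()
    return CatalogNumber(
        int(year), KINDS[kind], int(corpus), segment,
        None if disc is None else int(disc),
        None if rev is None else int(rev),
    )
\end{verbatim}

Under these assumptions, a number such as LDC94T4B-3.1 would be split
into membership year 94, type text, corpus 4, segment B, disc 3, and
revision 1; a string that does not fit the pattern raises an error
rather than being guessed at.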

Here are two examples to illustrate both why this system was adopted
and how it works.

{\bf Example 1: The United Nations Parallel Text Corpus} was published
in March of 1994, thus in membership year (MY) 94, consists entirely
of text (T), and was assigned text corpus number 4.  It comprises
three discs: the first contains English texts, the second the
corresponding French texts, and the third the corresponding Spanish.
They are available either (A) as a set of three or (B) separately.
Thus LDC94T4A refers to the UN corpus as a whole, LDC94T4B-1 to the
English disc alone, LDC94T4B-2 to the French alone, and LDC94T4B-3 to
the Spanish alone.

Shortly after release, the Spanish disc was found to have a
manufacturing defect and was replaced with a new one, so if there is
need to refer to them individually, the original is now called 3.0
and the replacement 3.1.

{\bf Example 2: The second Continuous Speech Recognition corpus},
collected in 1993 and distributed in early 1994, was assigned corpus
number 13.  It contains 14 discs of speech recorded over a Sennheiser
microphone (a {\em de facto} standard in ARPA evaluations); 15 discs
with the same speech recorded over another microphone; and 5 discs
containing unique (unpaired) data: speech recorded only once,
transcriptions, test or evaluation data, etc., much of which is also
needed to make full use of the paired speech recordings.  To satisfy
customer preferences, the corpus is offered by LDC in three
configurations: (A) the complete corpus of 34 discs; (B) the
``Sennheiser corpus,'' i.e., the whole corpus minus the ``other
microphone'' data, on 19 discs; and (C) the ``other microphone''
corpus, i.e., the whole corpus minus the Sennheiser data, 20 discs.
These are designated as follows:
\vspace{.2in}
\noindent CSR-II Complete: LDC94S13A, consisting of LDC94S13A-1 through
LDC94S13A-34 \\

CSR-II Sennheiser: 19 discs, LDC94S13B, consisting of LDC94S13B-1
through S13B-7, S13B-11, S13B-13 through S13B-16, S13B-18 through S13B-21,
and S13B-32 through S13B-34 \\

CSR-II Other: 20 discs, LDC94S13C, consisting of LDC94S13C-8 through
S13C-10, S13C-12 through S13C-14, S13C-17, S13C-22 through S13C-34 \\
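
The two partial configurations can be cross-checked against the disc
counts given above.  This short Python sketch (the helper
{\tt disc\_range} and the variable names are illustrative, not part of
any LDC software) builds the disc sets from the ranges just listed:

\begin{verbatim}
def disc_range(first, last):
    """Disc numbers first..last, inclusive."""
    return set(range(first, last + 1))


complete = disc_range(1, 34)

# Configuration B (Sennheiser), as listed above.
sennheiser = (disc_range(1, 7) | {11} | disc_range(13, 16)
              | disc_range(18, 21) | disc_range(32, 34))

# Configuration C (other microphone), as listed above.
other = (disc_range(8, 10) | disc_range(12, 14) | {17}
         | disc_range(22, 34))

# Discs that appear in both partial configurations.
shared = sennheiser & other

print(len(sennheiser), len(other), sorted(shared))
print((sennheiser | other) == complete)
\end{verbatim}

With the ranges as listed, the Sennheiser set has 19 discs and the
other-microphone set has 20.  The five discs common to both (13, 14,
and 32 through 34) match the count of five discs of unpaired data, and
the two sets together cover all 34 discs.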

\pagebreak

\section{Prices and Conditions of Purchase}

%{\Large \bf PRICES AND CONDITIONS OF PURCHASE}

The following are the procedures and conditions for obtaining corpora
from the LDC:
\vspace{.25in}

{\bf For LDC Members:}

\vspace{.25in}
LDC membership is annual, with the membership year (MY) running from 1
September to 31 August.  Each LDC corpus is identified by the MY of
its release, and membership fees purchase a paid-up license to that
MY's LDC corpora.

For each MY, Senior Members receive extra copies of each requested
corpus per approved site at no extra cost.

Other Members receive one copy of each requested LDC corpus at no
charge; there may be charges for corpora owned or produced by others
and distributed by LDC.

Members may also purchase extra ``convenience copies'' of LDC corpora,
at \$100 per disk or the catalog price, for use at approved sites.
These convenience copies are subject to the same restrictions and
covered by the same license, if any, as the primary copies.

Notices will be mailed to all members when new data sets are
available.  When corpora are re-issued in revised, enhanced, or
supplemented form, unless the reason is defective materials, they will
be distributed only to those whose LDC membership is current in the MY
of re-issue.

\vspace{.25in}
{\bf For Nonmembers:}

\vspace{.25in}
Nonmembers may purchase single copies of most listed items, at prices
which are set by the LDC from time to time, and normally only under
a ``research-only'' license.  No commercial licenses are granted to
nonmembers.  Payment may be made by check drawn from a bank with
branches in the United States or payment may be wired to: Mellon Bank
East, ABA NO.  03100003, Philadelphia, PA, for credit to The Trustees
of the University of Pennsylvania, Account No 2945020, Attn: Judith
Storniolo, 215-898-0464.

Prices are subject to change; the prices below are effective until
December 31, 1994.  Nonmembers add a shipping charge for each order:
\$30 US and Canada, \$50 overseas.

\pagebreak

\begin{center}
{\large\bf 1993 RELEASES}
\end{center}
\begin{tabular}{|r|r|l|l|}
\cline{1-4}
Price & Set-of & Title            &               LDC Catalog No. \\
\cline{1-4}
%-----   ------  -----------				------------

  \$100  & 1  &   TIMIT               &              LDC93S1  \\
  250 &	 2 &      NTIMIT		&	LDC93S2 \\
 1000 &  6 &   Resource Management Complete Set & LDC93S3A \\
  600 &	 4  &   Resource Management RM1  & LDC93S3B  \\
  600 &	 2 &   	Resource Management RM2 &  LDC93S3C \\
1000 &   6 &      ATIS0 Complete Corpora Set &  LDC93S4A\\
500 &	 1 &     	ATIS0 Pilot  &          LDC93S4B-1\\
  500 &	 1 &     	ATIS0 Read   &          LDC93S4B-2\\
  500 &	 4 &     	ATIS0 SD Read &		LDC93S4B-3\\
 2500 &	 4 &	ATIS2                &          LDC93S5\\
4000 &  15 &   CSR-I (WSJ0) Complete  &             LDC93S6A\\
 2000 &	 9 &	CSR-I (WSJ0) Sennheiser  &         LDC93S6B\\
 2000 &	 9 &	CSR-I (WSJ0) Other     &           LDC93S6C\\
10000 &	28 &    	SWITCHBOARD  &             LDC93S7\\ 
1000 &	 1 &     	SWITCHBOARD Credit Card &  LDC93S8\\
  125 &	 1 &     	TI 46-Word  &              LDC93S9\\
   750 & 3 &     	TIDIGITS               &   LDC93S10\\
  500 &	 1 &     	Road Rally  &              LDC93S11\\
  200 &    8 &	HCRC Map Task Corpus   &           LDC93S12\\
  25 & 1 &	ACL/DCI			&               LDC93T1\\ 
 2500 &	 1 &      Penn Treebank Corpus 	&		LDC93T2\\
 1000 &	 1 &      TIPSTER Volume 1		&	LDC93T3-1.1 \\
 1000 &	 1 &      TIPSTER Volume 2	&		LDC93T3-2.1 \\
 1000 &    1 &	TIPSTER Volume 3 		&	LDC93T3-3.1 \\
\cline{1-4}
\end{tabular}

\pagebreak

\begin{center}
{\large\bf 1994 RELEASES}
\end{center}
\begin{tabular}{|r|r|l|l|}
\cline{1-4}
Price & Set-of & Title            &               LDC Catalog No. \\
\cline{1-4} 
%-----   ------  -----------				------------
10000 & 34 &    CSR-II (WSJ1) Complete		& 	LDC94S13A\\
 5000 &	19 &	CSR-II (WSJ1) Sennheiser	&	LDC94S13B\\
 5000 &	20 &	CSR-II (WSJ1) Other		&	LDC94S13C\\
2500 &	 8 &    Air Traffic Control 	&	     LDC94S14\\
 2500 &  2 &      SPIDRE                      &      LDC94S15\\
 1000 &	 1 &	YOHO Speaker Verification	&	LDC94S16\\
  200 &	 1 &	OGI Multilanguage Corpus	&    LDC94S17\\
  100 &    1 &	OGI Spelled \& Spoken Word      &    LDC94S18\\
 5000 &	 3 & 	ATIS3				&    LDC94S19\\
 2500 &	 9 &	BRAMSHILL                       &    LDC94S20\\
10000 &	8 &	MACROPHONE (American English) 	&    LDC94S21\\
 5000 & 3  &  UN Parallel Text (Complete)  & 	      LDC94T4A\\
 2500 &	 1 &	UN Parallel Text (English)	&	LDC94T4B-1  \\
 2500 &	 1 &	UN Parallel Text (French)	&	LDC94T4B-2  \\
 2500 &	 1 &	UN Parallel Text (Spanish)	&	LDC94T4B-3.1\\
 35 & 	 1 &    ECI Multilingual Text           & 	LDC94T5 \\
  150 &    1 &      CELEX Lexical Database      &    LDC94L1 \\
10000 &	 1 &	COMLEX English Syntax Lexicon, Version 0 & LDC94L2   \\
10000 &	 1 &	COMLEX Pronouncing Dictionary, Version 0 & LDC94L3  \\
\cline{1-4}
\end{tabular}
 
\vspace{.5in}

\begin{center}
{\large\bf 1995 RELEASES (TENTATIVE)}
\end{center}
\begin{tabular}{|r|r|l|l|}
\cline{1-4}
%-----   ------  -----------				------------
Price & Set-of &  Description	& Release Date \\
\cline{1-4}
\$5000 & 5 &    PHONEBOOK: NYNEX Isolated Words &	Fall 1994 \\
 2500 &	 2 &	KING Speaker Verification	&	Fall 1994 \\
  TBD &  2 &     Hansard			&	Fall 1994 \\
  TBD &	 1 &	Speech Collection Interface 	&       Fall 1994  \\
  TBD &	19 &	CSRS-III    			&	January 1995 \\
  TBD &	19 &	CSRO-III 			&	January 1995 \\
10000 &	10 &	POLYPHONE-II (American Spanish) &	January 1995 \\
  2000 &1 & 	TIPSTER Volume 4		&	Fall 1994  \\
  TBD &  1 &    Treebank-2			&       Winter 1995 \\
\cline{1-4}
\end{tabular}

\pagebreak
\section{Speech Corpora: Descriptions and Ordering Information}

\subsection {TIMIT Acoustic-Phonetic Continuous Speech Corpora}

The TIMIT corpus of read speech is designed to provide speech data for
acoustic-phonetic studies and for the development and
evaluation of automatic speech recognition systems.  TIMIT contains broadband
recordings of 630 speakers of 8 major dialects of American English, each
reading 10 phonetically rich sentences.  The TIMIT corpus includes time-aligned
orthographic, phonetic, and word transcriptions as well as speech waveform data
for each utterance.  Corpus design was a joint effort among the Massachusetts
Institute of Technology (MIT), SRI International (SRI), and Texas Instruments,
Inc.  (TI).  The speech was recorded at TI at 16 kHz, transcribed at
MIT, and verified and prepared for CD-ROM production by the National
Institute of Standards and Technology (NIST).

The TIMIT corpus transcriptions have been hand verified.  Test and
training subsets, balanced for phonetic and dialectal coverage, are
specified.  Tabular computer-searchable information is included as
well as written documentation.

\subsubsection{Original ARPA-sponsored Version (TIMIT)}

This is the original 16 kHz version, recorded over a high quality
microphone in studio conditions.

\vspace{.25in}
\noindent Item Name:	TIMIT \\
LDC Catalog No.:        LDC93S1\\
NIST Catalog No.: 	\#1-1 \\
LDC Release date:	10/90  (MY93) \\
Nonmember price: 	\$100 \\
Special license:	NO \\


\subsubsection{NYNEX Telephone Version of TIMIT Corpus (NTIMIT)}
The NYNEX Science and Technology Laboratories produced a telephone
channel version of the TIMIT corpus by transmitting all 6300 TIMIT
utterances through a handset and across various NYNEX telephone
channels in a controlled manner.  The data have been prepared for
CD-ROM production by NIST.  Waveform files use the NIST SPHERE format.
For more information about NTIMIT, see ``NTIMIT: A Phonetically
Balanced, Continuous Speech, Telephone Bandwidth Speech Database,'' by
C. Jankowski et al., in Volume 1 of the Proceedings of ICASSP-90
(pp. 109--112).


\vspace{.25in}
\noindent Item Name:		NTIMIT  \\
LDC Catalog No.:  	LDC93S2\\
NIST Catalog No.: 	\#10-1.1, 10-2.1  \\
LDC Release date:	8/92 (MY93) \\
Nonmember price: 	\$250  \\
Special license:	NO  \\


\subsection{The Resource Management Corpora}
The DARPA Resource Management Continuous Speech Corpora (RM) consist
of digitized and transcribed speech for use in designing and
evaluating continuous speech recognition systems.  There are two main
sections, often referred to as RM1 and RM2.  RM1 contains four
CD-ROMs, two with Speaker-Dependent (SD) training data, one with
Speaker-Independent (SI) training data, and one with test and
evaluation data.  RM2 has 2 CD-ROMs with an additional and larger SD
data set, including test material.

All RM material consists of read sentences from a naval resource
management task.  The complete corpus contains over 25,000 utterances
from more than 160 speakers representing a variety of American
dialects.  The material was recorded at 16 kHz, with 16-bit resolution,
using a Sennheiser HMD-414 headset microphone.  All discs conform to
the ISO-9660 data format.

RM sentences are consistent with a limited language model with a
1000 word vocabulary that allows queries about ships, ports, etc.,
along with commands to control a graphics display system, but little
else.  There is no ``official'' language model, but a simple
non-probabilistic word-pair grammar that provides complete coverage of
the sentences in this corpus is provided.

The Resource Management text corpus was designed at BBN Laboratories,
Inc. and SRI International.  BBN also developed and made available the
``Word-Pair'' grammar that has been used in the benchmark tests.  Texas
Instruments, Inc. recruited the subjects and recorded and digitized
the speech.  For more information about the design and collection of
this corpus see: P. Price, W.M. Fisher, J. Bernstein and D.S. Pallett,
``The DARPA 1000-Word Resource Management Database for Continuous
Speech Recognition,'' Proceedings of the 1988 International Conference
on Acoustics, Speech and Signal Processing (Paper S.13.21,
pp. 651--654).

A series of benchmark speech recognition performance assessment tests
were conducted beginning in March 1987 using this corpus in
conjunction with standardized scoring software.  For more information
see D.S.  Pallett, ``Benchmark Tests for DARPA Resource Management
Database Performance Evaluations,'' in Proceedings of the 1989
International Conference on Acoustics, Speech and Signal Processing
(Paper S10.b.6, pp. 536--539) and related papers in the Proceedings of
the February 1989, October 1989, June 1990, and February 1991 DARPA
Speech and Natural Language Workshops.


\subsubsection {Complete Resource Management Corpus (RM Complete)}

In addition to the RM1 and RM2 subsets as described in the following
sections, LDC now offers the entire RM series at a reduced price,
bundled as follows:

\vspace{.25in}
\noindent Item Name:		RM Complete  \\
LDC Catalog No.:  	LDC93S3A\\
NIST Catalog No.: 	\#2-1.1 through 2-4.2, 3-1.2 and 3-2.2  \\
LDC Release date:	MY93 \\
Nonmember price: 	\$1000  \\
Special license:	NO  \\


\subsubsection {Resource Management SD and SI Training and Test Data (RM1)} 
      
The first two CD-ROMs contain Speaker-Dependent (SD) Training Data: 12
subjects, each reading a set of 600 ``training sentences,'' 2 ``dialect''
sentences, and 10 ``rapid adaptation'' sentences, for a total of 7344
recorded sentence utterances.  The 600 sentences designated as
training cover 97\% of the lexical items in the corpus.

The third CD-ROM contains the Speaker-Independent (SI) Training Data:
80 speakers each read the 2 ``dialect'' sentences plus 40 sentences from
the Resource Management text corpus, for a total of 3360 recorded
sentence utterances.  Any given sentence from a set of 1600 Resource
Management sentence texts was recorded by two subjects, while no
sentence was read twice by the same subject.
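
As a quick check of the totals quoted for these three discs (a worked
restatement of the figures above, not additional data): each SD speaker
reads $600 + 2 + 10 = 612$ sentences and each SI speaker reads
$2 + 40 = 42$, so that
\[
12 \times 612 = 7344, \qquad 80 \times 42 = 3360 .
\]
Likewise, 1600 sentence texts each recorded by two subjects give
$1600 \times 2 = 3200 = 80 \times 40$ utterances, which accounts for all
of the SI recordings other than the dialect sentences.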

The fourth CD-ROM contains all SD and SI system test material used in
5 DARPA benchmark tests conducted in March and October of 1987, June
1988, and February and October 1989, along with scoring and diagnostic
software and documentation for those tests.  Documentation is also
provided outlining use of the Resource Management training and test
material at CMU in development of the SPHINX system.  Example output
and scored results for state-of-the-art speaker-dependent and
speaker-independent systems (i.e., the BBN BYBLOS and CMU SPHINX
systems) for the October 1989 benchmark tests are included, as well as
SPeech HEader REsources (SPHERE) software and SPHERE-to-SAM conversion
software. 

\vspace{.25in}
\noindent Item Name:		RM1 \\
LDC Catalog No.:  	LDC93S3B\\
NIST Catalog No.: 	\#2-1.1 through 2-4.2  \\
LDC Release date:	8/92 (MY93) \\
Nonmember price: 	\$600  \\
Special license:	NO  \\


\subsubsection {Extended Resource Management Speaker-Dependent Corpus (RM2)}

This 2-disc set forms a speaker-dependent extension to the Resource
Management (RM1) corpus.  The corpus consists of a total of 10,508
sentence utterances (2 male and 2 female speakers each speaking 2,652
sentence texts).  These include the 600 ``standard'' Resource Management
speaker-dependent training sentences, 2 dialect calibration sentences,
10 rapid adaptation sentences, 1800 newly-generated extended training
sentences, 120 newly-generated development-test sentences, and 120
newly-generated evaluation-test sentences.  The evaluation-test
material on the discs was used as the test set for the June 1990 DARPA
SLS Resource Management Benchmark Tests (see the Proceedings.)

The RM2 corpus was recorded at Texas Instruments.  The NIST speech
recognition scoring software originally distributed on the RM1 ``Test''
Disc was adapted for RM2 sentences, and is included on these discs as
well as the SPHERE speech file header manipulation software.

\vspace{.25in}
\noindent Item Name:		RM2 \\
LDC Catalog No.:  	LDC93S3C\\
NIST Catalog No.: 	\#3-1.2, 3-2.2  \\
LDC Release date:	8/90 (MY93) \\
Nonmember price: 	\$600  \\
Special license:	NO  \\

\pagebreak

\subsection{Air Travel Information System (ATIS) Corpora}

During 1989 and 1990, the DARPA Spoken Language Systems (SLS) Program
initiated plans for development of a ``common corpus'' for both speech
recognition and natural language research, using ``spontaneous
goal-directed'' speech, rather than ``read speech.''  The common task
domain that was chosen is termed the ``Air Travel Information System''
(ATIS).  The corpora developed to date in order to train and test
systems in this domain are known as ATIS0, ATIS2, and ATIS3.  (ATIS1
will not be published.)

In all the ATIS corpora, users make spoken inquiries to simulated
(ATIS0) or prototypical (ATIS2, ATIS3) speech understanding systems to
obtain air travel information.  The system has the information in the
form of a relational database derived from the Official Airline
Guide; the initial ATIS0 relational database, for example, contains
information relevant to travel among 9 major airports serving 11
cities.  To measure performance, the system's answers to the spoken
inquiries are expressed in a logical form known as the ``canonical
answer specification'' (CAS) language, and compared with canonical
answers reviewed by human experts.  There are thus a number of
auxiliary files associated with each utterance, including orthographic
transcriptions and, for answerable queries, ``reference answers''.

Texas Instruments developed ATIS0, the pilot corpus for this program,
using a ``Wizard of Oz'' technique to simulate an ATIS SLS. (See
Hemphill, Godfrey and Doddington's paper ``The ATIS Spoken Language
Systems Pilot Corpus'' in the Proceedings of the June 1990 DARPA
Speech and Natural Language Workshop.)

Since 1991, the data for ATIS2 and ATIS3 have been collected at
multiple sites and pooled for common use.  The number of speakers and
utterances, the coverage of the travel information database, the
collection scenarios and platforms, have all changed as documented in
each corpus section.

For further information on the ATIS domain, on the test paradigm, and
on ATIS-domain benchmark tests, see the Proceedings of the DARPA
Speech and Natural Language Workshops held in October 1989, June 1990
and February 1991. (Morgan Kaufmann, Publishers, Inc., 2929 Campus
Drive, San Mateo, CA 94403.  ISBN numbers: 1-55860-112-0,
1-55860-157-0, and 1-55860-207-0.)

\subsubsection{ATIS0 Spontaneous Speech Pilot Corpus and Relational Database}

The ATIS0 Corpus totals 6 CD-ROMs: one with spontaneous data from 36
speakers; one with read versions of the data from 20 of those
speakers, along with some adaptation material; and four with extensive
speaker dependent material from the ATIS domain, read by 10 of the
same speakers.  

All ATIS speech data is recorded at a 16 kHz sample rate with 16-bit
quantization, from two different microphones: a close-talking
(Sennheiser) and a desk-top (Crown PCC-160) model.

The first disc (ATIS0 Pilot) contains spontaneous utterances elicited
in a ``Wizard-of-Oz'' simulation, along with the relational database
containing the travel information (excluding connecting flights).
Thirty-six speakers produced a total of 912 utterances.

The second disc (ATIS0 Read) contains ``read'' versions of the
spontaneous utterances for 20 of the 36 speakers above, for a total of
478 productions.  This is supplemented by a set of 40 ``adaptation''
sentences read by each of the 20 speakers.
 
The third through the sixth discs (ATIS0 SD-Read) contain ``read''
speech in the ATIS domain for ten of the speakers on the first disc.
They read a total of 3171 utterances, or approximately 317 utterances
per speaker.  This data was collected for the purpose of training
speaker-dependent speech recognition systems for the ATIS0 domain.
Two of these four discs contain the close-talking (Sennheiser)
microphone data, and the other two contain corresponding data for the
desk-top (Crown PCC-160) microphone.  Thus there are 6342 waveform
files on the four discs.

The entire ATIS0 set of six discs is now offered at a reduced price:

\vspace{.25in}
\noindent Item Name:		ATIS0 Complete \\
LDC Catalog No.:  	LDC93S4A\\
NIST Catalog No.: 	\#5-1.1 through 5-6.1  \\
LDC Release date:	4/94 (MY93) \\
Nonmember price: 	\$1500  \\
Special license:	NO  \\

Any individual subcorpus from the ATIS0 set can be purchased for \$500:

\vspace{.25in}
\noindent Item Names:		ATIS0 Pilot/Read/SD-Read \\
LDC Catalog Nos.:  	LDC93S4B-1/LDC93S4B-2/LDC93S4B-3\\
NIST Catalog No.: 	\#5-1.1 through 5-6.1 \\
LDC Release date:	4/92 (MY93) \\
Nonmember price: 	\$500 per subcorpus \\
Special license:	NO  \\


\subsubsection{ATIS2}

The ATIS2 corpus, on four CD-ROMs, contains approximately 15,000
utterances recorded from approximately 450 subjects at five sites:
AT\&T, BBN, CMU, MIT's Laboratory for Computer Science, and SRI.  All
utterances have been transcribed and almost 10,000 of them annotated
with categorizations and canonical reference answers.  Unlike the
ATIS0 corpus, much of the data in ATIS2 was collected using partially
or fully-automated data collection systems.  The fully-automated data
collection systems were, in fact, working ATIS prototypes.

For ATIS2, the 10-city relational database of ATIS0 was revised to
accommodate connecting flights and fares and some table headings were
renamed.

In addition to training data, the February and November '92 ATIS
Benchmark Tests are included as well.  Each contains approximately
1,000 utterances from the pool of data collected by the five sites.

\vspace{.25in}
\noindent Item Name:		ATIS2  \\
LDC Catalog No.:  	LDC93S5\\
NIST Catalog No.: 	\#12-1.1 through 12-4.1  \\
LDC Release date:	4/92 (MY93) \\
Nonmember price: 	\$2,500  \\
Special license:	NO  \\

\subsubsection{ATIS3}

The ATIS3 corpus, on three CD-ROMs, includes over 774 scenarios
completed by 137 subjects, yielding a total of over 7,300 utterances.
All utterances are transcribed and 2,900 of them have been categorized
and annotated with canonical reference answers.

The relational database for this dataset included flight information
for 46 cities and 52 airports.  Data was collected at BBN, CMU, MIT,
and SRI, using their own ATIS systems, and at NIST using systems
provided by BBN and SRI.

Two 1000-utterance test sets were set aside from the data pooled by
the collection sites.  The first set was used in a December 1993 ARPA
test, and is included in ATIS3.  The second has been reserved for
future testing. 

\vspace{.25in}
\noindent Item Name:		ATIS3  \\
LDC Catalog No.:  	LDC94S19\\
NIST Catalog No.: 	\#17-1.1 through 17-3.1 \\
LDC Release date:	8/94 (MY94) \\
Nonmember price: 	\$5000  \\
Special license:	NO  \\

\pagebreak

\subsection {Continuous Speech Recognition (CSR) Corpora sponsored 
by ARPA}

During 1991, the DARPA Spoken Language Program initiated efforts to
build a new corpus to support research on large-vocabulary Continuous
Speech Recognition (CSR) systems.

The first two CSR Corpora consist primarily of read speech with texts
drawn from a machine-readable corpus of Wall Street Journal text, and
are thus often known as WSJ0 and WSJ1.  (Later sections of the CSR set
of corpora, however, will consist of read texts from other sources of
North American business news, and eventually from other news domains.)

The read data was collected using 5,000-word and 20,000-word subsets
of the WSJ text corpus.  Some spontaneous dictation is included in
addition to the read speech.  The dictation portion was collected
using journalists who dictated hypothetical news articles.

Two microphones are used throughout: a close-talking Sennheiser model,
and a secondary microphone which may vary.  The corpora are thus
offered in three configurations: the speech from the Sennheiser, the
speech from the other microphone, and the speech from both; all three
sets include all transcriptions, tests, documentation, etc.

In general, transcriptions of the speech, test data from ARPA
evaluations, scores achieved by various speech recognition systems,
and software used in scoring are included on separate discs from the
waveform data.

\subsubsection {ARPA Continuous Speech Recognition Corpus I: Wall Street
Journal Sentences (WSJ0, or CSR-I)}

MIT's Laboratory for Computer Science, SRI International and Texas
Instruments collected approximately 40 hours of speech and over 31,000
utterances.  Prompts were taken from the Wall Street Journal.

Development and evaluation test sets are included and so marked.


\vspace{.25in}
\noindent Item Name:		CSR-I Complete \\
LDC Catalog No.:  	LDC93S6A\\
NIST Catalog No.: 	\#11-1.1 through 11-12.1, 11-14.1, 11-15.1  \\
LDC Release date:	7/93 (MY93) \\
Nonmember price: 	\$4,000  \\
Special license:	NO  \\

\vspace{.25in}
\noindent Item Name:		CSR-I Sennheiser \\
LDC Catalog No.:  	LDC93S6B\\
NIST Catalog No.: 	\#11-1.1 through 11-6.1, 11-13.1 through 11-15.1  \\
LDC Release date:	4/93 (MY93) \\
Nonmember price: 	\$2,000  \\
Special license:	NO  \\

\vspace{.25in}
\noindent Item Name:		CSR-I Other \\
LDC Catalog No.:  	LDC93S6C\\
NIST Catalog No.: 	\#11-7.1 through 11-15.1\\
LDC Release date:	4/93 (MY93) \\
Nonmember price: 	\$2,000  \\
Special license:	NO  \\


\subsubsection {ARPA Continuous Speech Recognition Corpus II: Wall Street
Journal Sentences (WSJ1, or CSR-II)}

The complete WSJ1 corpus contains approximately 78,000 training
utterances ($\sim$73 hours of speech), 4,000 of which are the result of
spontaneous dictation by journalists with varying degrees of
experience in dictation.  The corpus contains approximately 8,200
``conventional'' development test utterances ($\sim$8 hours of speech), 6,800
of which are from spontaneous dictation.  As with the pilot corpus,
the entire corpus was collected using 2 microphones, so the amount of
speech in the entire corpus is about 162 hours.
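
The 162-hour figure follows from the approximate durations quoted
above, on the assumption that it counts the training and development
test speech once for each of the two microphones:
\[
(73 + 8) \times 2 = 162 \mbox{ hours.}
\]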

In early 1993, a ``Hub and Spoke'' test paradigm was designed, calling
for eleven test sets, each a specific variation of the basic or
``hub'' condition.  The eleven Hub and Spoke Development and
Evaluation Test sets each contain approximately 7500 waveforms ($\sim$11
hours of speech).

WSJ1 waveforms have been compressed by about 2:1 using the
SPHERE-embedded ``Shorten'' compression algorithm developed at
Cambridge University.

\vspace{.25in}
\noindent Item Name:		CSR-II Complete \\
LDC Catalog No.:  	LDC94S13A\\
NIST Catalog No.: 	\#13-1.1 through 13-34.1  \\
LDC Release date:	7/93 (MY94) \\
Nonmember price: 	\$10,000  \\
Special license:	NO  \\

\vspace{.25in}
\noindent Item Name:		CSR-II Sennheiser \\
LDC Catalog No.:  	LDC94S13B\\
NIST Catalog No.: 	\#13-1.1 through 13-7.1, 13-11.1, 13-13.1 \\
			through 13-16.1, 13-18.1 through 13-21.1, and \\
                        13-32.2 through 13-34.1 \\
LDC Release date:	7/93 (MY94) \\
Nonmember price: 	\$5,000  \\
Special license:	NO  \\

\vspace{.25in}
\noindent Item Name:		CSR-II Other \\
LDC Catalog No.:  	LDC94S13C\\
NIST Catalog No.: 	\#13-8.1 through 13-10.1, 13-12.1 through \\
			13-14.1, 13-17.1, and 13-22.1 through 13-34.1  \\
LDC Release date:	7/93 (MY94) \\
Nonmember price: 	\$5,000  \\
Special license:	NO  \\


\subsection{Switchboard Corpus of
	Recorded Telephone Conversations}

SWITCHBOARD is a collection of about 2400 two-sided telephone
conversations among 543 speakers (302 male, 241 female) from all areas
of the United States. A computer-driven ``robot operator'' system
handled the calls, giving the caller appropriate recorded prompts,
selecting and dialing another person to take part in a conversation,
introducing a topic for discussion, and recording the speech from the
two subjects into separate channels until the conversation was
finished.  About 70 topics were provided, of which about 50 were used
frequently.  Selection of topics and callers was constrained so that:
(1) no two speakers would converse together more than once, and (2) no
one spoke more than once on a given topic.  

Waveform files were recorded into two channels directly from the T1
digital telephone circuits, at an 8 kHz sample rate and 8-bit mu-law
quantization. Complete orthographic transcriptions were made for each
conversation, with codes to identify overlapping portions (both
speakers talking at the same time), certain non-speech events
(laughter, coughs, etc.), and interruptions/hesitations.  Each
conversation was also rated by transcribers for various quality
factors (amount of cross-talk between channels, static and background
noise, topicality, etc.).  In addition, each transcription was
verified, and then used in a forced speech-recognition algorithm to
establish timing marks for word and utterance boundaries;
transcriptions are provided in the corpus in both ``plain text'' and
``time-aligned'' forms.  A description is published in the 1993 ICASSP
Proceedings: Godfrey, McDaniel, and Holliman, ``SWITCHBOARD:  A
Telephone Speech Corpus for Research and Development.''

The original issue of SWITCHBOARD in early 1993 lacked about 150
conversations which were intended for publication but omitted by
error.  They were published in May 1994 and distributed to all
previous recipients of SWITCHBOARD. 

The Switchboard Corpus was collected at Texas Instruments and produced
on CD-ROM at the National Institute of Standards and Technology.  It
is distributed in a notebook-style binder with 28 CD-ROMs (27
containing speech data, and one containing all transcription data).
Preparation of the data for CD-ROM production was done by NIST.  The
waveform files use the NIST SPHERE format.


\vspace{.25in}
\noindent Item Name:		Switchboard  \\
LDC Catalog No.:  	LDC93S7\\
NIST Catalog No.: 	\#9-1.1, 9-3.1 through 9-29.1  \\
LDC Release date:	4/92 (MY93) \\
Nonmember price: 	\$10000  \\
Special license:	NO  \\


\subsection {Switchboard Corpus Excerpts, Credit Card Conversations}

This CD-ROM contains 35 conversations on the topic of ``Credit Card
Use''. Most but not all can also be found in the Switchboard Corpus
(see above).  The conversations can be used in training and testing
wordspotting systems.  In addition to 2-channel mu-law encoded audio
waveform files, the disc contains transcriptions, time-alignments, and
wordspotting targets.


\vspace{.25in}
\noindent Item Name:		Switchboard Credit Card  \\
LDC Catalog No.:  	LDC93S8\\
NIST Catalog No.: 	\#8-1.2  \\
LDC Release date:	5/92 (MY93) \\
Nonmember price: 	\$1000  \\
Special license:	NO  \\


\subsection {Texas Instruments 46-Word Speaker-Dependent Isolated Word
      Corpus (TI46)}

This CD-ROM contains a corpus of speech which was originally designed
and collected at Texas Instruments, Inc. (TI) in 1980, and used
initially in performance assessment tests of isolated-word
speaker-dependent technology. (See ``Speech Recognition: Turning Theory
to Practice'' by G. R. Doddington and T. B. Schalk, in IEEE Spectrum,
Vol. 18, No. 9, September 1981.)

The 46-word vocabulary consists of two sub-vocabularies: (1) the TI
20-word vocabulary (consisting of the digits zero through nine plus
the words ``enter'', ``erase'', ``go'', ``help'', ``no'', ``rubout'',
``repeat'', ``stop'', ``start'', and ``yes''), and (2) the TI 26-word
``alphabet set'' (consisting of the letters ``a'' through ``z'').

The corpus contains read utterances from 16 speakers (8 males and 8
females) each speaking 26 utterances of the 46-word vocabulary: 16
tokens designated as training and 10 as test.

The corpus was collected at Texas Instruments in a quiet acoustic
enclosure using an Electro-Voice RE-16 Dynamic Cardioid microphone at
a 12.5 kHz sample rate with 12-bit quantization.  The files are in NIST
SPHERE format, and have a ``.wav'' filename extension.

\vspace{.25in}
\noindent Item Name:		TI 46 Word  \\
LDC Catalog No.:  	LDC93S9\\
NIST Catalog No.: 	\#7-1.1  \\
LDC Release date:	4/92 (MY93) \\
Nonmember price: 	\$125  \\
Special license:	NO  \\


\subsection {Texas Instruments Speaker-Independent
      Connected-Digit Corpus (TIDIGITS)}

This three-disc set contains speech which was originally designed and
collected at Texas Instruments, Inc. (TI) for the purpose of designing
and evaluating algorithms for speaker-independent recognition of
connected digit sequences.  There are 326 speakers (111 men, 114
women, 50 boys, and 51 girls) each pronouncing 77 digit sequences.
Each speaker group is partitioned into test and training subsets.

The corpus was collected at TI in 1982 in a quiet acoustic enclosure
using an Electro-Voice RE-16 Dynamic Cardioid microphone, digitized at
20 kHz.  The waveform files are in the NIST SPHERE format.

\vspace{.25in}
\noindent Item Name:		TIDIGITS  \\
LDC Catalog No.:  	LDC93S10\\
NIST Catalog No.: 	\#4-1, 4-2, 4-3  \\
LDC Release date:	4/92 (MY93) \\
Nonmember price: 	\$750  \\
Special license:	NO  \\


\subsection  {Road Rally Conversational Speech Corpora}

The ``Road Rally'' corpus was designed for the development and testing
of word-spotting systems, and consists of two sub-corpora, known as
``Stonehenge'' and ``Waterloo.''  

Stonehenge was collected using telephone handsets modified to contain
a high quality microphone.  To gather conversational data, two talkers
were located in separate rooms, given a road map, and asked to
participate in a road rally planning task.  The digitized speech was
filtered using a 300 Hz to 3300 Hz PCM FIR bandpass filter to simulate
telephone quality.

The Stonehenge corpus contains 3 ``styles'' of speech data: (1) the
spontaneous conversations, (2) a read paragraph, containing at least
one occurrence of each of the key words, and (3) a set of read
``carrier'' sentences. There are 80 speakers, 52 males and 28 females.
Twenty words were selected as keywords, and their occurrences and
locations are marked.

The Waterloo corpus was collected as an extension of Stonehenge,
providing similar domain material, but collected under different
conditions. It is intended for use in training models of keywords in
the conversational portion of the Stonehenge corpus.  The Waterloo
material was collected from 56 speakers (28M, 28F) using conventional
telephone handsets and dialed-up telephone lines in the Massachusetts
area, and consists of a read passage only (not the same as that in
Stonehenge).  For this release, the naturally band-limited telephone
handset and line speech data were subsequently filtered with the same
300 Hz to 3300 Hz PCM FIR bandpass filter that was used for this
release's Stonehenge data.

Suggested wordspotting training and test procedures are outlined in
the documentation.

\vspace{.25in}
\noindent Item Name:		ROAD RALLY  \\
LDC Catalog No.:  	LDC93S11\\
NIST Catalog No.: 	\#6-1.1  \\
LDC Release date:	4/92 (MY93) \\
Nonmember price: 	\$500  \\
Special license:	NO  \\


\subsection{The HCRC Map Task Corpus}
		
The Map Task Corpus is a set of 8 CD-ROMs containing a total of about
18 hours of spontaneous speech that was recorded from 128 two-person
conversations, involving 64 different speakers (32 female, 32 male,
all adults, each taking part in four conversations).  The 64 speakers
were all students at the University of Glasgow, 61 of them being
native Scots.  The conversations were carried out in an experimental
setting, in which each participant has a schematic map in front of
them, not visible to the other. Each map consists of an outline
and roughly a dozen labelled features (e.g. ``a white cottage'', ``an oak
forest'', ``Green Bay'', etc.). Most features are common to the two maps,
but not all. One map has a route drawn in, the other does not. The
task is for the participant without the route to draw one on the basis
of discussion with the participant with the route. In addition to the
conversations, each speaker provides a wordlist reading, consisting of
the major vocabulary items contained in the conversations.

The experimental design allows a number of different phonemic,
syntactico-semantic and pragmatic contrasts to be explored in a
controlled way.  In particular, maps and feature names were designed
to allow for controlled exploration of phonological reductions of
various kinds in a number of different referential contexts, and to
provide, via varying patterns of matches and mis-matches between the
two maps, a range of different stimuli for referent negotiation.  Also,
the conditions of the conversations were carefully balanced: in half
of them the talkers were strangers, in half friends; in half of them
the talkers could see each other's faces, in half they could not.

The waveform data are provided in ``raw'' (headerless) files (16-bit
samples, 20 kHz sample rate, 2 channels per conversation), and
alternative header files are provided for use with software based on
either the NIST ``SPHERE'' header structure or the European ``SAM''
header structure.  Text transcriptions are provided for each
conversation, along with PostScript files of the map images used in
the experiments.  Additional materials include full documentation of
the experimental design and data collection protocol, resources for
using SGML tools on the transcriptions and other text materials, and
an extensive set of source code for performing basic signal
processing functions on the waveform data, such as down-sampling,
de-multiplexing, channel summation, and D/A conversion for Sun
workstations (including playback of segments selected via inspection
of transcripts in Emacs).  


\vspace{.25in}
\noindent Item Name:		HCRC MAP TASK  \\
LDC Catalog No.:  	LDC93S12\\
NIST Catalog No.: 	NA  \\
LDC Release date:	4/92 (MY93) \\
Nonmember price: 	\$200  \\
Special license:	NO  \\


\subsection {Air Traffic Control Corpus (ATC0)}

The Air Traffic Control Corpus (ATC0) is an eight-disc set of recorded
speech for use in supporting research and development activities in
the area of robust speech recognition in domains similar to air
traffic control (several speakers, noisy channels, relatively small
vocabulary, constrained language, etc.).  The audio data on these
discs is composed of voice communication traffic between various
controllers and pilots.

The audio files are 8 kHz, 16-bit linear sampled data, representing
continuous monitoring, without squelch or silence elimination, of a
single FAA frequency for one to two hours.  There are also files which
indicate the amplitude of the received AM carrier signal at 10 msec.
intervals.

Full transcripts, including the start and end times of each
transmission, are provided for each audio file.  Each flight is
identified by its flight number.

ATC0 consists of three subcorpora, one for each airport at which the
transmissions were collected -- Dallas/Fort Worth (DFW), Logan
International (BOS), and Washington National (DCA).  The complete set
contains approximately 70 hours of controller and pilot transmissions
collected via antennas and radio receivers which were located in the
vicinity of the respective airports.
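
As a rough plausibility check on the size of the set, the figures above
can be combined as follows.  This is only a sketch: it assumes
single-channel, uncompressed storage, neither of which is stated in this
description.
\begin{eqnarray*}
\mbox{data rate} & = & 8000 \ \mbox{samples/s} \times 2 \ \mbox{bytes/sample}
    = 16\,000 \ \mbox{bytes/s} \\
\mbox{total audio} & \approx & 70 \times 3600 \ \mbox{s} \times 16\,000 \ \mbox{bytes/s}
    \approx 4.0 \times 10^{9} \ \mbox{bytes} \\
\mbox{per disc} & \approx & 4.0 \times 10^{9} / 8
    \approx 0.5 \times 10^{9} \ \mbox{bytes}
\end{eqnarray*}
Under these assumptions the audio alone would occupy roughly half a
gigabyte on each of the eight discs.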

Detailed information regarding the collection process and the
equipment used can be found on each disc in the file ``atc.doc'' in
the ``/doc'' directory.

The ATC0 Corpus was collected by Texas Instruments under contract to
ARPA.  It was produced on CD-ROM by the National Institute of
Standards and Technology for distribution by the Linguistic Data
Consortium.

 \vspace{.25in}
\noindent Item Name:		AIR TRAFFIC CONTROL  \\
 LDC Catalog No.:  	LDC94S14\\
 NIST Catalog No.: 	\#16-1.1 through 16-8.1  \\
 LDC Release date:	3/94 (MY94) \\
 Nonmember price: 	\$2500  \\
 Special license:	NO  \\  


\subsection {SPIDRE Speaker Identification Corpus} 

This is a 2-CD subset of the SWITCHBOARD collection (see above),
selected for speaker ID research, and with special attention to
telephone instrument variation.  It contains training and testing data
for experiments in closed or open set recognition or verification.
Combining the two sides of the conversations also permits speaker
change detection, or speaker monitoring, experiments.

There are 45 ``target'' speakers; four conversations from each target
are included, of which two are from the same handset. There are also
100 calls in which no target appears.  Since all conversations are
two-sided, this results in 180 target sides and 180 + 200 = 380
nontarget sides.  
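
The counts above can be written out step by step.  The breakdown assumes
that each target's four conversations are distinct and that no
conversation pairs two targets, which is implied by the totals but not
stated explicitly.
\begin{eqnarray*}
\mbox{target sides} & = & 45 \times 4 = 180 \\
\mbox{nontarget sides in target calls} & = & 180 \\
\mbox{nontarget sides in calls with no target} & = & 100 \times 2 = 200 \\
\mbox{total nontarget sides} & = & 180 + 200 = 380 \\
\mbox{total sides} & = & 180 + 380 = 560 = 2 \times 280 \ \mbox{calls}
\end{eqnarray*}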

Except for truncations of a few longer calls at 5 minutes, the calls
themselves are as described under SWITCHBOARD.


\vspace{.25in}
\noindent Item Name:		SPIDRE  \\
LDC Catalog No.:  	LDC94S15 \\
NIST Catalog No.: 	\#18-1.1 and 18-2.1  \\
LDC Release date:	4/94 (MY94) \\
Nonmember price: 	\$2500  \\
Special license:	NO  \\

\subsection{YOHO Speaker Verification Corpus}

The YOHO database is a three-disc set containing a large-scale,
high-quality speech corpus to support text-dependent speaker
authentication research, such as is used in ``secure access''
technology.  The data was collected in 1989 by ITT under a US
Government contract, but has not been available for public use before.
Note that certain changes have been made to the corpus, mainly to
ensure the privacy of the speakers, and some data has been withheld by
the government for future use in testing.

YOHO contains:
\begin{itemize}

  \item ``Combination lock'' phrases (e.g., 36-24-36)
  \item Collected over a 3-month period in a real-world office environment
  \item 4 enrollment sessions per subject with 24 phrases per session
  \item About 10 test sessions per subject with 4 phrases per session
  \item 8 kHz sampling with 3.8 kHz analog bandwidth
  \item 1.5 gigabytes of data

\end{itemize}

The number of trials is thus sufficient to permit evaluation testing
at high confidence levels.  In each session, a speaker was prompted
with a series of phrases to be read aloud; each phrase was a sequence
of three two-digit numbers (e.g.  ``35 - 72 - 41,'' pronounced
``thirty-five seventy-two forty-one'').  The first four sessions for a
given speaker were enrollment sessions of 24 phrases, and all
additional sessions were verification trials of four phrases each.  In
all there are 552 enrollment sessions, and 1380 trial sessions, with a
nominal time interval of three days between sessions.
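
The session totals can be unpacked as follows.  The speaker count is not
given in this description; it is derived here on the assumption that
every speaker completed exactly four enrollment sessions.
\begin{eqnarray*}
\mbox{speakers} & = & 552 / 4 = 138 \\
\mbox{trial sessions per speaker} & = & 1380 / 138 = 10 \\
\mbox{enrollment phrases} & = & 552 \times 24 = 13\,248 \\
\mbox{trial phrases} & = & 1380 \times 4 = 5520
\end{eqnarray*}
The derived average of 10 trial sessions per speaker agrees with the
figure given in the list above.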

\vspace{.25in}
\noindent Item Name:		YOHO  \\
LDC Catalog No.:	LDC94S16 \\
NIST Catalog No.:	NA  \\
LDC Release date:	4/94 (MY94)  \\
Nonmember price:	\$1000 \\
Special license: 	NO  \\


\subsection {OGI Multi-Language Corpus }

The corpus consists of responses to prompts spoken over commercial
telephone lines by speakers of English, Farsi (Persian), French,
German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and
Vietnamese.  It contains a total of 1927 calls, an average of 175
calls per language.

Speech was collected using an automated system that answered the
telephone, played digitized prompts in the appropriate language to
request the speech samples, and digitized the callers' responses for a
designated period of time.  

Log files are included that provide a set of automatic measurements
made on each utterance. In addition, some utterances were
automatically segmented into broad phonetic categories. The speech
data are compressed, with NIST SPHERE headers.


\vspace{.25in}
\noindent Item Name:		OGI MULTI-LANGUAGE TELEPHONE  \\
LDC Catalog No.:  	LDC94S17   \\
NIST Catalog No.: 	NA  \\
LDC Release date:	4/94 (MY94) \\
Nonmember price: 	\$2000  \\
Special license:	NO  \\

\subsection{OGI Spelled and Spoken Telephone Corpus}

The OGI Spelled and Spoken Telephone Corpus consists of speech
recordings from over 3650 telephone calls, each made by a different
speaker to an automated prompting/recording system installed at the
Oregon Graduate Institute. Speakers were asked to say their name,
where they were calling from, and where they grew up; they were asked
to answer a couple of yes/no questions, and to spell their first and
last names; many were also asked to repeat a few specific words, and
to recite the letters of the alphabet.

Each response to a prompt is stored as a separate waveform file, and
the files are organized according to prompt (response type); all
responses from a given call share a unique caller-index number as part
of the file name, so that responses can easily be sorted by speaker.
Waveform data are stored in compressed form, using the NIST SPHERE 2.0
software package, which is available separately at no charge to users.
SPHERE 2.0 provides the decompression software needed to extract the
waveform data, as well as tools for accessing and modifying file
headers.

Time-aligned phonetic transcriptions are provided for a subset of
responses, and a complete log of each call (giving speaker sex, quality
judgments, and orthographic transcriptions of all responses) is
included in a form suitable for use as a relational data base.

\vspace{.25in}
\noindent Item Name:            OGI SPELLED \& SPOKEN WORD  \\
LDC Catalog No.:        LDC94S18  \\
NIST Catalog No.:       NA  \\
LDC Release date:       4/94 (MY94) \\
Nonmember price:        \$100  \\
Special license:        NO  \\


\subsection{BRAMSHILL}

The recordings on this nine-disc set were
originally made in 1978-79 as part of a British Home Office study into
speaker identification techniques.  Subsequently, it was realised that
a large body of unconstrained conversational material might be of
interest to researchers working in other speech processing fields.
The recordings were transcribed and the CD-ROMs prepared during 1993.

The recordings were made at the Police Staff College, Bramshill,
Hampshire, England.  The participants were police officers taking part
in the various courses at the college. This provided a wide range of
regional accents and a range of ages from late teens to early fifties.
Each speaker is described by nine demographic attributes.

Three adjacent bedrooms were used. The two participants, each
alone in their rooms, conversed by telephone. The third room
was used as a monitoring and recording station.

In addition to the telephone recordings, reference recordings were
made using a high quality dynamic microphone in each room.  It is
these higher quality recordings, {\em not the telephone speech}, which are
provided on the BRAMSHILL CD-ROM set.

The recordings were made on a Sony Elcaset EL-7 cassette machine,
chosen at the time because of its good speed stability.  The
microphone was a Shure SM-7 cardioid type.  The speech data was
sampled at 10 kHz, 16-bit resolution.

Some attempt was made to control the acoustic environment.
It is evident from listening to the recordings that, while
these measures produced a reasonable recording environment,
the rooms were far from soundproof. A variety of external
noises (engines, aircraft, etc.) can be heard on some of the
recordings.

Each speaker was given a pile of photographs.  In response to a bleep
signal, each speaker introduced himself by name and read a set of test
sentences.  After this, the main part of the conversation took place,
in which participants were asked to determine which of each pair of
photographs had been taken first (if indeed they were related at all).
The conversations continued for 10 minutes until terminated by another
bleep signal.

During the digitisation process, some periods of silence were
removed, so some recordings now appear to be shorter than the
original ten minutes. Furthermore, this means that recordings
of two sides of a conversation {\em are no longer time-aligned}. In
addition, to preserve the anonymity of the speakers, some
passages (mainly the introductions) have been erased by
replacing them with binary zeroes. Finally, the bleep signals have
also been erased with binary zeroes. The transcriptions
indicate where this has occurred.

The speech was transcribed verbatim. No attempt was made to correct
grammar, fill in missing words, etc.  Transcription conventions are
detailed in the documentation.  Every lexical word from the
transcriptions is contained in the dictionary supplied in the INDEX
directory.  There are about 6500 word types in the 600k words of the
transcripts.  Contractions, part-words, slang words, hesitation sounds
and non-speech sounds are all treated as words in their own
right in the dictionary.

\vspace{.25in}
\noindent Item Name:            BRAMSHILL  \\
LDC Catalog No.:        LDC94S20 \\
NIST Catalog No.:       NA  \\
LDC Release date:       8/94 (MY94)  \\
Nonmember price:        \$150  \\
Special license:        NO  \\


\subsection{MACROPHONE}

MACROPHONE consists of approximately 200,000 utterances by 5000
speakers. It is designed to provide material sufficient and suitable
for research, development, and evaluation of automatic speech
recognition technology for common telephone applications, such as
shopping, transportation, database access, and autodialing.  In
addition to application-oriented phrases and numerous digit strings,
seven sentences are spoken by each talker to provide ensemble phoneme,
diphone and triphone coverage of the language.  The spoken material
also refers to times, locations, monetary amounts, spellings, and
interactive operations.

The utterances were collected automatically over the telephone network
by recording directly from a T1 connection in 8 kHz, 8-bit mu-law
format.  The participants, roughly equal numbers of males and females,
were solicited by a marketing firm from all regions of the United
States.  They ranged in age from the teens to the seventies, and
represented a broad range of educations and incomes as well.  Each
recorded utterance is accompanied by an orthographic transcription
which also notes any unusual acoustic events or anomalies.

Macrophone is the American English contribution to an international
database of telephone speech corpora called POLYPHONE.  Similar data
sets are expected for major languages of the world, and at least some
of these will be made available through LDC.  Prospects are currently
good for American Spanish (by early 1995), Dutch, Standard French,
Standard German, Japanese, Mandarin Chinese, Swiss French, and Danish
versions of POLYPHONE, all with basically the same structure and
methods of collection.

MACROPHONE was collected at SRI under LDC sponsorship.  A paper
describing it was presented at ICASSP-94: ``Macrophone: An American
English Telephone Speech Corpus for the POLYPHONE Project,'' by Jared
Bernstein, Kelsey Taussig, and Jack Godfrey.

\vspace{.25in}
\noindent Item Name:		MACROPHONE  \\
LDC Catalog No.:  	LDC94S22   \\
NIST Catalog No.: 	NA  \\
LDC Release date:	8/94 (MY94) \\
Nonmember price: 	\$10000  \\
Special license:	NO  \\


\section{Text Corpora: Descriptions and Ordering Information}

\subsection{Association for Computational Linguistics Data Collection
	Initiative (ACL/DCI)}
	
The ACL Data Collection Initiative disc contains text from: Wall
Street Journal, copyright 1987, 1988, 1989, provided by Dow Jones,
Inc.; the Collins English Dictionary, Copyright 1979, William Collins
Sons \& Co., Ltd.; scientific abstracts provided by the U.S.
Department of Energy; and a variety of grammatically tagged and parsed
materials from the Treebank project at the University of Pennsylvania,
copyright 1990, 1991, University of Pennsylvania. The total amount of
uncompressed text is 620 Mbytes.

The many formats in which the originals of these texts came have all,
to one extent or another, been mapped into a markup language
consistent with the SGML standard (ISO 8879).

The format of the material from the Wall Street Journal uses a
labelled bracketing, expressed in the style of SGML, although no
formal SGML DTD is provided. The tag set has been modified by turning
the Dow Jones header categories into tags and by creating ad hoc tags
such as ``{\tt <dateline>}.'' The original datelines are presented as
separate text units; the text is divided and tagged into paragraphs
and sentences with each sentence presented on a single line. Nothing
has been done to modify the typographical methods used to subdivide
headlines and stories into sections, nor are any of the text features
within sentences (quotes, ellipsis, etc.) normalized.

The Collins English Dictionary is present in two forms. One form was
approximately parsed into fielded records as an exercise in learning a
language called ``FIT'', by a student working under the direction of
Lloyd Nakatani at AT\&T Bell Laboratories during the summer of 1990.
The original digital image of the typographer's tape that the database
version was prepared from had serious flaws that were not detected and
corrected until later; the corrected version, a clean typographer's
tape, is presented in a separate directory. A properly-analyzed
database version will be provided in the future.  The documentation
includes notes developed during the new attempt to analyze the tape
from scratch.

The Department of Energy abstracts reside in files that are
approximately one megabyte each. The original 950 separators have
been replaced with newlines, and space padding between articles was
removed.  An acronym dictionary that was extracted from the database
as an indication of the material's topic areas has been included in a
separate directory.

Provisional material from the Penn Treebank project is divided into
two subdirectories on this disk. The subdirectory ``postext'' contains
text with part-of-speech annotations; ``parstext'' contains text with
syntactic bracketing.

\vspace{.25in}
\noindent Item Name:		ACL/DCI  \\
LDC Catalog No.:  	LDC93T1\\
NIST Catalog No.: 	NA  \\
LDC Release date:	4/92 (MY93) \\
Nonmember price: 	\$25  \\
Special license:	YES  \\

\subsection {The Penn Treebank Project}


This CD-ROM contains over 1.6 million words of hand-parsed material
from the Dow Jones News Service, plus an additional 1 million words
tagged for part-of-speech. This material is a subset of the corpus for
the current DARPA large-vocabulary speech recognition project.

It also contains the first fully parsed version of the Brown Corpus,
which has also been completely retagged using the Penn Treebank tag
set. Also included are tagged and parsed data from Department of
Energy abstracts, IBM computer manuals, MUC-3, and ATIS.

In addition, the CD-ROM includes source code for several software
packages, including tgrep, which permits the user to search for
specific constituents in tree structures.

Later versions of Treebank in MY95 will include greater depth of
annotation, and more varied corpus materials.

\vspace{.25in}
\noindent Item Name:		PENN TREEBANK  \\
LDC Catalog No.:  	LDC93T2\\
NIST Catalog No.: 	NA  \\
LDC Release date:	1/93 (MY93) \\
Nonmember price: 	\$2500  \\
Special license:	NO  \\
\pagebreak


\subsection {TIPSTER Information Retrieval Text Research Collection}

The TIPSTER project is sponsored by the Software and Intelligent
Systems Technology Office of the Advanced Research Projects Agency
(ARPA/SISTO) in an effort to significantly advance the state of the
art in effective document detection (information retrieval) and data
extraction from large, real-world data collections.

The detection data consists of a new test collection built at NIST
to be used both for the TIPSTER project and the related TREC project.
The TREC project has many other participating information retrieval
research groups, working on the same task as the TIPSTER groups, but
meeting once a year in a workshop to compare results (similar to MUC).
The test collection built at NIST consists of 3 disks (gigabytes) of
documents, 150 topics, and the answers (relevant documents) for those
topics.

The documents in the test collection are varied in style, size, and
subject domain.  The first disk contains material from the Wall Street
Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal
Register (1989), information from Computer Select disks (Ziff-Davis
Publishing), and short abstracts from the Department of Energy.  The
second disk contains information from the same sources, but from
different years.  The third disk contains more information from the
Computer Select disks, plus material from the San Jose Mercury News
(1991), more AP newswire (1990), and about 250 megabytes of formatted
U.S. Patents.  The format of all the documents is relatively clean and
easy to use, with SGML-like tags separating documents and document
fields.  There is no part-of-speech tagging or breakdown into
individual sentences or paragraphs as the purpose of this collection
is to test retrieval against real-world data.

The three Tipster discs so far released have been re-issued with
updates and corrections, and all recipients of the earlier versions
should have received these replacements free of charge.  If you think
you have the unrevised original, contact LDC for confirmation.

A fourth Tipster volume is planned for release during MY95.

\subsubsection{TIPSTER Volume 1, March 1992}

\begin{tabular}{ll}
Directory Name & Description\\ \\

/ap   &       Associated Press Newswire material, copyright 1989\\
           /fr   &       Federal Register material, 1989\\
          /wsj   &      Wall Street Journal, copyright 1987, 1988, 1989\\
           /doe   &      Department of Energy abstracts\\
\end{tabular}


\vspace{.25in}
\noindent Item Name:		TIPSTER vol.1 \\
LDC Catalog No.:  	LDC93T3-1.1  \\
NIST Catalog No.: 	NA  \\
LDC Release date:	4/92 (MY93) \\
Nonmember price: 	\$1000  \\
Special license:	YES  \\


\subsubsection{TIPSTER Volume 2, July 1992}


\begin{tabular}{ll}
 Directory Name &             Description\\ \\

           /ap    &      Associated Press Newswire material, copyright 1988\\
           /fr    &      Federal Register, 1988\\
           /wsj   &      Wall Street Journal, copyright 1990, 1991, 1992\\
           /ziff  &      Ziff-Davis Publishing, copyright 1989, 1990\\
           /doe  &       Department of Energy abstracts\\

\end{tabular}


\vspace{.25in}
\noindent Item Name:		TIPSTER vol.2  \\
LDC Catalog No.:  	LDC93T3-2.1   \\
NIST Catalog No.: 	NA  \\
LDC Release date:	7/92 (MY93) \\
Nonmember price: 	\$1000  \\
Special license:	YES  \\

\subsubsection{TIPSTER Volume 3, April 1993}

\begin{tabular}{ll}
Directory Name   &           Description\\ \\


            /ap  &       Associated Press material, copyright 1990\\
            /patents &   U.S. Patent documents, 1983-1991\\
            /sjm  &      San Jose Mercury News, copyright 1991\\

\end{tabular}

\vspace{.25in}
\noindent Item Name:		TIPSTER vol.3  \\
LDC Catalog No.:  	LDC93T3-3.1   \\
NIST Catalog No.: 	NA  \\
LDC Release date:	4/93 (MY93) \\
Nonmember price: 	\$1000  \\
Special license:	YES  \\

\pagebreak


\subsection {United Nations Parallel Text Corpus  (English, French, Spanish)}


This set of three compact discs contains documents provided
to the LDC by the United Nations, for use in research on machine
translation technology.  The documents come from the Office of
Conference Services at the UN in New York, and are drawn from
archives that span the period between 1988 and 1993.  

This publication contains the English, French and Spanish archives,
with data from each language stored on a separate disc in the set.
Care has been taken to arrange the document files in a parallel
directory structure for each language, so that corresponding
translations of a document are found directly by means of the
directory paths and file names.

All parallel files in this corpus are English-based: for every file on
the English disc, there will be a corresponding file on either the
French or Spanish disc, or both.  Tables are included on all discs to
assist in determining which parallels are present.  Due to the nature
and organization of UN translation services and the original
electronic text archives, the process of finding and sorting out
parallel documents yielded numerous gaps, with many files in each
language having no parallel in other languages.

In preparing the text for publication, we have applied a
fully-compliant SGML format (Standard Generalized Markup Language).
For those researchers who use SGML, a working DTD (Document Type
Definition) is provided on each disc.  For those who do not need SGML
markup, a simple script is included that can be used to filter out the
SGML-specific material, and leave only the plain text.  The character
set used is the 8-bit ISO 8859-1 (Latin-1), in which accented letters and
some other non-ASCII characters occupy the upper 128 entries of the
character table.


\vspace{.25in}
\noindent Item Name:		UNITED NATIONS PARALLEL TEXT Complete Set \\
LDC Catalog No.:  	LDC94T4A\\
NIST Catalog No.: 	NA  \\
LDC Release date:	4/94 (MY94) \\
Nonmember price: 	\$5000  \\
Special license:	NO  \\

\vspace{.25in}
\noindent Item Name:		UNITED NATIONS PARALLEL TEXT English \\
LDC Catalog No.:  	LDC94T4B-1    \\
NIST Catalog No.: 	NA  \\
LDC Release date:	4/94 (MY94) \\
Nonmember price: 	\$2500  \\
Special license:	NO  \\

\vspace{.25in}
\noindent Item Name:		UNITED NATIONS PARALLEL TEXT French \\
LDC Catalog No.:  	LDC94T4B-2    \\
NIST Catalog No.: 	NA  \\
LDC Release date:	4/94 (MY94) \\
Nonmember price: 	\$2500  \\
Special license:	NO  \\

\vspace{.25in}
\noindent Item Name:		UNITED NATIONS PARALLEL TEXT Spanish \\
LDC Catalog No.:  	LDC94T4B-3    \\
NIST Catalog No.: 	NA  \\
LDC Release date:	4/94 (MY94) \\
Nonmember price: 	\$2500  \\
Special license:	NO  \\


\subsection {ECI-1}

The first release of the European Corpus Initiative, the Multilingual
Corpus 1 (ECI/MCI), has 46 subcorpora in 27 (mainly European)
languages.  The total size of these is roughly 92 million (lexical)
words.  The corpora are marked up using TEI P2 conformant SGML (to
varying levels of detail), with easy access to the source text without
markup.  Twelve of the component corpora are multilingual parallel corpora
with from two to nine sub-corpora.  All the alphabetic corpora (there
is some Japanese and Chinese) are encoded in the ISO LATIN family of
8-bit character sets (ISO 8859-1, -5 and -7).  The CD-ROM is in High
Sierra format (ISO 9660), readable on UN*X, MSDOS and Apple systems at
least.

The amount of material per language varies, from about 36 million
words (German) to about 5 thousand words (Bulgarian).  The majority of
sources are journalistic in nature (newspapers, magazines,
broadcasts); additional sources include dictionaries (Albanian,
Gaelic, Turkish, Japanese/English), literature, technical reports, and
proceedings or publications of international organizations.  The table
on the next page lists the languages included, the subcorpus numbers
for each language (in parentheses), and the amount of data per
language in thousands of lexical words.

\newpage

\begin{tabular}{lrrrrrrrrr}

Language &   (Subcorpus \#)& Kwords & & & & & & & Totals\\ \\

German   &       (70)& 34291 & (09) &  191 & (65) &  20 & (28) & 187 &\\
         &       (29)&    59 & (30) &  76 & (47) &  24 & (59) &  50 &\\
         &       (71) &   21 & (70A)&  999  & & & & &                 35918\\
French  &        (31) &  4775 & (04) & 4121 & (28) &  187 & (29) &  59 &\\
        &        (30) &   76 & (47) &  24 & (51) & 6 & (59) &  50 &\\
         &       (71) &  21 & (32) & 1667 & & & & &                    10986\\
Spanish   &      (31) &  4500 & (13) & 830 & (14) & 1041 & (15) & 447 &\\
          &      (47) &    24 & (32) & 1667 & (59) &  50 & (71) &   21 &\\
          &           &       &      &      &      &     &      &      &  8580\\
English   &      (31) &  4222 &  (36) &  1141 & (74) & 95 & (28) & 187 &\\
          &      (47) &   24 & (51) & 6 & (56) & 97 & (59) & 50 &\\
          &      (71) &   21 & (32) & 1667 &   &    &      &    &  7510\\
Dutch     &      (03) &  5500 & (02) & 600 & (47) & 24 & (71) & 21 & 6145\\
Czech     &      (44) &  4726 &      &    &       &    &     & &  4726\\
Italian   &     (11)  & 3518 & (42) & 303 & (58) & 13 & (29) &  59 &\\
          &      (30) &   76 & (47) &  24 & (71) & 21 &      & &  4014\\
Chinese   &      (78) &  2895 &     &     &      &    &      & &  2895\\
Greek     &      (10) &  2515 & (47) &  24 & (59) & 50 & (71) & 21 & 2610\\
Norwegian &      (41) &  2226 &      &     &      &    &      &  &  2226\\
Swedish   &      (37) &  1718 &      &     &      &    &     &   & 1718\\
Serb/Croat/Slov & (24) &  700 & (56) & 289 &      &    &    &   &  989\\
Tibetan     &    (76) &  834  &      &   &        &    &   &   & 834\\
Portuguese  &    (60) &  675 & (47) & 24 & (71) & 21   &  &    &  720\\
Malay       &    (80) &  563 &      &     &     &      &  &   &  563\\
Russian     &    (73) &  364 &      &     &     &       &  &  &  364\\
Japanese    &    (57) &  203 &      &     &     &       &  &  &  203\\
Turkish     &    (20) &  173 & (20A) & 110 &    &       &  &  & 283\\
Albanian    &    (82) &  205 &       &   &      &       &  &  &  205\\
Gaelic      &    (55) &  141 &       &   &      &       &  &  &  141\\
Estonian    &    (39) &  100 &       &   &      &       &  &  &  100\\
Uzbek       &    (81) &   88 &       &   &      &       &  &  &  88\\
Latin       &    (74) &   75 &       &   &      &       &  &  &  75\\
Danish      &    (47) &   24 & (71) & 21 &      &       &  &  &  45\\
Lithuanian  &    (89) &   20 &     &     &      &       &  &  &  20\\
Bulgarian   &    (84) &    5 &     &     &      &       &  &  &  5\\ \\

Total       &         &      &     &     &      &       &  &  &  91969\\

\end{tabular}
\vspace{.2in}
\noindent Item Name:		ECI/MCI  \\
LDC Catalog No.:  	LDC94T5\\
NIST Catalog No.: 	NA  \\
LDC Release date:	6/94 (MY94) \\
Nonmember price: 	\$35  \\
Special license:	YES  \\

\newpage

\section{Lexical Databases: Descriptions and Ordering Information}

\subsection {CELEX Lexical Database}

This corpus contains ASCII versions of the CELEX lexical databases of
English (version 2.5), Dutch (version 3.1) and German (version 2.0).
CELEX was developed as a joint enterprise of the University of
Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck
Institute for Psycholinguistics in Nijmegen, and the Institute for
Perception Research in Eindhoven.  Pre-mastering and CD-ROM production
were done by the LDC.

       For each language, this CD-ROM contains detailed information on:
\begin{itemize}
          \item the orthography (variations in spelling, hyphenation),

         \item the phonology (phonetic transcriptions, variations in pronunciation,
            syllable structure, primary stress),

         \item the morphology (derivational and compositional structure,
            inflectional paradigms),

          \item the syntax (word class, word-class specific subcategorizations,
            argument structures) and

        \item word frequency (summed word and lemma counts, based on recent and
            representative text corpora).

\end{itemize}

        The databases have not been tailored to fit any particular
database management program.  Instead, the information is in ASCII
files in a UNIX directory tree that can be queried with tools such as
AWK or ICON.  Unique identity numbers allow the linking of information
from different files. Some kinds of information have to be computed
on-line; wherever necessary, AWK functions have been provided to
recover this information.  README files specify the details of their
use.
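
The linking of files through identity numbers can be sketched in
Python as follows.  This is only an illustration, not one of the
supplied AWK functions: the record layout (one record per line, fields
separated by a backslash, identity number in the first field) is an
assumption, the sample records are invented, and the names
\verb|read_table| and \verb|link| are hypothetical.

\begin{verbatim}
```python
def read_table(lines, sep="\\"):
    """Index records by their first field, assumed to be the identity number."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        table[fields[0]] = fields[1:]
    return table


def link(left, right):
    """Join two tables on the identity numbers they share."""
    return {key: left[key] + right[key] for key in left if key in right}


if __name__ == "__main__":
    # Invented records, not actual CELEX data.
    orthography = ["1\\abandon", "2\\abbey"]
    frequency = ["1\\120", "2\\45"]
    linked = link(read_table(orthography), read_table(frequency))
    print(linked["1"])
```
\end{verbatim}

The README files on the disc remain the authority on the actual file
layout and on which fields carry the identity numbers.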

        A detailed User Guide describing the various kinds of lexical
information available is supplied.  All sections of this guide are
POSTSCRIPT files, except for some additional notes on the German
lexicon in plain ASCII.


\vspace{.25in}
\noindent Item Name:            CELEX  \\
LDC Catalog No.:        LDC94L1   \\
NIST Catalog No.:       NA  \\
LDC Release date:       4/94 (MY94) \\
Nonmember price:        \$150  \\
Special license:        YES  \\


\subsection
{COMLEX: COMmon LEXical Database of English} 

This is a three-part project: COMLEX English Syntax, COMLEX English
Pronunciation, and COMLEX English Semantics.  The first two have
resulted in electronic dictionaries, released by LDC as MY94 products
and described below.

The Semantics part is expected to yield, in 1995, a corpus annotated
using WordNet, a public domain compendium of lexical semantic
relations.  Annotation of the same corpus using COMLEX Syntax is also
planned for 1995.

For a description of WordNet, see George Miller (ed.), WordNet: An
on-line lexical database, in International Journal of Lexicography
(special issue), 3(4):235-312, 1990, or George Miller, Claudia
Leacock, Randee Tengi, and Ross Bunker: A semantic concordance, in
Proceedings of the Human Language Technology Workshop, pages 303--308,
Princeton, NJ, March 1993. 

These products are intended to provide a comprehensive set of lexical
resources for research and development in computational linguistics.
They will be revised and expanded continuously, with feedback from the
community of users, and current members will receive all new versions.

{\em The initial (MY94) versions of the electronic dictionaries are being
distributed only by ftp.}  Contact LDC for instructions to obtain
license forms and the dictionaries.


\subsubsection{COMLEX English Syntax}   This is a moderately broad 
coverage English lexicon (with about 38,000 lemmas) developed at New
York University under LDC sponsorship.  It contains detailed
information about the syntactic characteristics of each lexical item,
and is particularly detailed in its treatment of subcategorization
(complement structures).  It includes 92 different subcategorization
features for verbs, 14 for adjectives, and 9 for nouns.  These
features distinguish not only the different constituent structures
which may appear in a complement, but also the different control
features associated with a constituent structure.

Version 0, released in August 1994, is available by ftp to members who
sign a license agreement, which is also found on the LDC ftp site.

A reference for the syntax work:

Ralph Grishman, Catherine Macleod, and Adam
Meyers.  Comlex syntax: Building a computational lexicon.  To appear
in Proc. 15th Int'l Conf. on Computational Linguistics (COLING 94),
Kyoto, Japan, August 1994.

\vspace{.25in}
\noindent Item Name:    COMLEX English Syntax Lexicon, Version 0 \\
LDC Catalog No.:        LDC94L2 \\
NIST Catalog No.:       NA  \\
LDC Release date:       6/94 (MY94) \\
Nonmember price:        \$10,000 \\
Special license:        YES  \\


\subsubsection{COMLEX English Pronunciation}

Version 0, released in August 1994, is a 50,000 word pronouncing
dictionary of English, including the standard 30,000 word WSJ
vocabulary.  It is available only by ftp to members who sign a license
agreement, which is also found on the LDC ftp site.

A more complete description will be available shortly.  

\vspace{.25in}
\noindent Item Name:    COMLEX Pronouncing Dictionary, Version 0 \\
LDC Catalog No.:        LDC94L3 \\
NIST Catalog No.:       NA  \\
LDC Release date:       6/94 (MY94) \\
Nonmember price:        \$10,000 \\
Special license:        YES  \\


\pagebreak


\end{document}