Dear Colleague, You have inquired about the ACL/DCI CD-ROM. This disk, as well as much other material, is available from the Linguistic Data Consortium (LDC). A current LDC catalog is appended. More information is available by anonymous ftp from in directory /pub/ldc. To place orders, or to ask other questions, please send email to, or call 215-898-0464. Note that it is possible that your institution is already an LDC member; be sure to check this. Best wishes, Mark Liberman 619 Williams Hall University of Pennsylvania Phone: 215-898-0141 Philadelphia, PA 19104-6305 Fax: 215-573-2175 __________________________________________________________________________ \documentstyle[11pt]{article} \setlength{\parskip}{0in} \setlength{\oddsidemargin}{0in} \setlength{\headheight}{0in} \setlength{\headsep}{0in} \setlength{\textwidth}{6.5in} \setlength{\textheight}{9.0in} \begin{document} \thispagestyle{empty} \begin{center} {\Large \bf Corpora Available from\\ The Linguistic Data Consortium\\August 1994} \vspace{0.2in} \end{center} \pagebreak \tableofcontents \pagebreak \section{Introduction} %\begin{bf} INTRODUCTION \end{bf} Each section below describes a corpus or a set of corpora. They are listed by type (speech, text, lexicons, other) and within type by the membership year (MY) in which they were released by LDC. In the case of series or sets of corpora, each individual corpus or segment is described in a separate subsection. The descriptions are not intended to be complete; in most cases a more complete document can be found by logging in to the ftp server at LDC:, and browsing the pub/ldc directory for a README file on the corpus. 
In the catalog, each description is followed by six items of information: \begin{itemize} \item The name by which the corpus is generally known \item The LDC catalog order number(s), explained below \item The NIST Catalog numbers assigned to the discs, if they have ever been available through NIST or NTIS, otherwise ``NA'' \item The date and membership year of official release by LDC \item The current price for nonmembers, if available \item Whether a separate license (User Agreement) is required \end{itemize} The {\bf LDC catalog order number} is a unique identifier for convenience in referring to corpora, parts of corpora, and individual discs as needed. It is made up of the following: \begin{itemize} \item The letters ``LDC'' and two digits for the membership year (MY) of release \item A letter indicating whether the contents are mainly speech ``S'', text ``T'', lexicon ``L'', or other ``X'' \item A number designating the corpus \item A letter, used only where needed, designating a subcorpus or corpus segment, in case several segments or configurations of the corpus can be ordered separately \item After a hyphen, the number of the disc in the corpus or segment, used only when needed to refer to individual discs \item After a period, the revision number of the individual disc, used only where a revised version has been released \end{itemize} Here are two examples to illustrate both why this system was adopted and how it works. {\bf Example 1: The United Nations Parallel Text Corpus} was published in March of 1994, thus in membership year (MY) 94, consists entirely of text (T), and was assigned text corpus number 4. It comprises three discs: the first contains English texts, the second the corresponding French texts, and the third the corresponding Spanish. They are available either (A) as a set of three or (B) separately. 
Thus LDC94T4A refers to the UN corpus as a whole, LDC94T4B-1 to the English disc alone, LDC94T4B-2 to the French alone, and LDC94T4B-3 to the Spanish alone. Shortly after release, the Spanish disc was found to have a manufacturing defect and was replaced with a new one, so if there is a need to refer to them individually, the original is now called 3.0 and the replacement 3.1. {\bf Example 2:} The second Continuous Speech Recognition corpus, collected in 1993 and distributed in early 1994, was assigned speech corpus number 13. It contains 14 discs of speech recorded over a Sennheiser microphone (a {\em de facto} standard in ARPA evaluations); 15 discs with the same speech recorded over another microphone; and 5 discs containing unique (unpaired) data: speech recorded only once, transcriptions, test or evaluation data, etc., much of which is also needed to make full use of the paired speech recordings. To satisfy customer preferences, the corpus is offered by LDC in three configurations: (A) the complete corpus of 34 discs; (B) the ``Sennheiser corpus,'' i.e., the whole corpus minus the ``other microphone'' data, on 19 discs; and (C) the ``other microphone'' corpus, i.e., the whole corpus minus the Sennheiser data, on 20 discs. 
These are designated as follows: \vspace{.2in} \noindent CSR-II Complete: LDC94S13A, consisting of LDC94S13A-1 through LDC94S13A-34 \\ CSR-II Sennheiser: 19 discs, LDC94S13B, consisting of LDC94S13B-1 through S13B-7, S13B-11, S13B-13 through S13B-16, S13B-18 through S13B-21, and S13B-32 through S13B-34 \\ CSR-II Other: 20 discs, LDC94S13C, consisting of LDC94S13C-8 through S13C-10, S13C-12 through S13C-14, S13C-17, S13C-22 through S13C-34 \\ \pagebreak \section{Prices and Conditions of Purchase} %{\Large \bf PRICES AND CONDITIONS OF PURCHASE} The following are the procedures and conditions for obtaining corpora from the LDC: \vspace{.25in} {\bf For LDC Members:} \vspace{.25in} LDC membership is annual, with the membership year (MY) running from 1 September to 31 August. Each LDC corpus is identified by the MY of its release and membership fees purchase a paid-up license to that MY's LDC corpora. For each MY, Senior Members receive extra copies of each requested corpus per approved site at no extra cost. Other Members receive one copy of each requested LDC corpus at no charge; there may be charges for corpora owned or produced by others and distributed by LDC. Members may also purchase extra "convenience copies" of LDC corpora, at \$100 per disk or the catalog price, for use at approved sites. These convenience copies are subject to the same restrictions and covered by the same license, if any, as the primary copies. Notices will be mailed to all members when new data sets are available. When corpora are re-issued in revised, enhanced, or supplemented form, unless the reason is defective materials, they will be distributed only to those whose LDC membership is current in the MY of re-issue. \vspace{.25in} {\bf For Nonmembers:} \vspace{.25in} Nonmembers may purchase single copies of most listed items, at prices which are set by the LDC from time to time, and normally only under a ``research-only'' license. No commercial licenses are granted to nonmembers. 
Payment may be made by check drawn from a bank with branches in the United States or payment may be wired to: Mellon Bank East, ABA NO. 03100003, Philadelphia, PA, for credit to The Trustees of the University of Pennsylvania, Account No 2945020, Attn: Judith Storniolo, 215-898-0464. Prices are subject to change; the prices below are effective until December 31, 1994. Nonmembers add a shipping charge for each order: \$30 US and Canada, \$50 overseas. \pagebreak \begin{center} {\large\bf 1993 RELEASES} \end{center} \begin{tabular}{|r|r|l|l|} \cline{1-4} Price & Set-of & Title & LDC Catalog No. \\ \cline{1-4} %----- ------ ----------- ------------ \$100 & 1 & TIMIT & LDC93S1 \\ 250 & 2 & NTIMIT & LDC93S2 \\ 1000 & 6 & Resource Management Complete Set & LDC93S3A \\ 600 & 4 & Resource Management RM1 & LDC93S3B \\ 600 & 2 & Resource Management RM2 & LDC93S3C \\ 1000 & 6 & ATIS0 Complete Corpora Set & LDC93S4A\\ 500 & 1 & ATIS0 Pilot & LDC93S4B-1\\ 500 & 1 & ATIS0 Read & LDC93S4B-2\\ 500 & 4 & ATIS0 SD Read & LDC93S4B-3\\ 2500 & 4 & ATIS2 & LDC93S5\\ 4000 & 15 & CSR-I (WSJ0) Complete & LDC93S6A\\ 2000 & 9 & CSR-I (WSJ0) Sennheiser & LDC93S6B\\ 2000 & 9 & CSR-I (WSJ0) Other & LDC93S6C\\ 10000 & 28 & SWITCHBOARD & LDC93S7\\ 1000 & 1 & SWITCHBOARD Credit Card & LDC93S8\\ 125 & 1 & TI 46-Word & LDC93S9\\ 750 & 3 & TIDIGITS & LDC93S10\\ 500 & 1 & Road Rally & LDC93S11\\ 200 & 8 & HCRC Map Task Corpus & LDC93S12\\ 25 & 1 & ACL/DCI & LDC93T1\\ 2500 & 1 & Penn Treebank Corpus & LDC93T2\\ 1000 & 1 & TIPSTER Volume 1 & LDC93T3-1.1 \\ 1000 & 1 & TIPSTER Volume 2 & LDC93T3-2.1 \\ 1000 & 1 & TIPSTER Volume 3 & LDC93T3-3.1 \\ \cline{1-4} \end{tabular} \pagebreak \begin{center} {\large\bf 1994 RELEASES} \\ \end{center} \begin{tabular}{|r|r|l|l|} \cline{1-4} Price & Set-of & Title & LDC Catalog No. 
\\ \cline{1-4} %----- ------ ----------- ------------ 10000 & 34 & CSR-II (WSJ1) Complete & LDC94S13A\\ 5000 & 19 & CSR-II (WSJ1) Sennheiser & LDC94S13B\\ 5000 & 20 & CSR-II (WSJ1) Other & LDC94S13C\\ 2500 & 8 & Air Traffic Control & LDC94S14\\ 2500 & 2 & SPIDRE & LDC94S15\\ 1000 & 1 & YOHO Speaker Verification & LDC94S16\\ 200 & 1 & OGI Multilanguage Corpus & LDC94S17\\ 100 & 1 & OGI Spelled \& Spoken Word & LDC94S18\\ 5000 & 3 & ATIS3 & LDC94S19\\ 2500 & 9 & BRAMSHILL & LDC94S20\\ 10000 & 8 & MACROPHONE (American English) & LDC94S21\\ 5000 & 3 & UN Parallel Text (Complete) & LDC94T4A\\ 2500 & 1 & UN Parallel Text (English) & LDC94T4B-1 \\ 2500 & 1 & UN Parallel Text (French) & LDC94T4B-2 \\ 2500 & 1 & UN Parallel Text (Spanish) & LDC94T4B-3.1\\ 35 & 1 & ECI Multilingual Text & LDC94T5 \\ 150 & 1 & CELEX Lexical Database & LDC94L1 \\ 10000 & 1 & COMLEX English Syntax Lexicon, Version 0 & LDC94L2 \\ 10000 & 1 & COMLEX Pronouncing Dictionary, Version 0 & LDC94L3 \\ \cline{1-4} \end{tabular} \vspace{.5in} \begin{center} {\large\bf 1995 RELEASES (TENTATIVE)} \end{center} \begin{tabular}{|r|r|l|l|} \cline{1-4} %----- ------ ----------- ------------ Price & Set-of & Description & Release Date \\ \cline{1-4} \$5000 & 5 & PHONEBOOK: NYNEX Isolated Words & Fall 1994 \\ 2500 & 2 & KING Speaker Verification & Fall 1994 \\ TBD & 2 & Hansard & Fall 1994 \\ TBD & 1 & Speech Collection Interface & Fall 1994 \\ TBD & 19 & CSRS-III & January 1995 \\ TBD & 19 & CSRO-III & January 1995 \\ 10000 & 10 & POLYPHONE-II (American Spanish) & January 1995 \\ 2000 &1 & TIPSTER Volume 4 & Fall 1994 \\ TBD & 1 & Treebank-2 & Winter 1995 \\ \cline{1-4} \end{tabular} \pagebreak \section{Speech Corpora: Descriptions and Ordering Information} \subsection {TIMIT Acoustic-Phonetic Continuous Speech Corpora} The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. 
TIMIT contains broadband recordings of 630 speakers of 8 major dialects of American English, each reading 10 phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions as well as speech waveform data for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI), and Texas Instruments, Inc. (TI). The speech was recorded at TI at 16 kHz, transcribed at MIT, and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation. \subsubsection{Original ARPA-sponsored Version (TIMIT)} This is the original 16 kHz version, recorded over a high quality microphone in studio conditions. \vspace{.25in} \noindent Item Name: TIMIT \\ LDC Catalog No.: LDC93S1\\ NIST Catalog No.: \#1-1 \\ Release date: 10/90 (MY93) \\ Nonmember price: \$100 \\ Special license: NO \\ \subsubsection{NYNEX Telephone Version of TIMIT Corpus (NTIMIT)} The NYNEX Science and Technology Laboratories produced a telephone channel version of the TIMIT corpus by transmitting all 6300 TIMIT utterances through a handset and across various NYNEX telephone channels in a controlled manner. The data have been prepared for CD-ROM production by NIST. Waveform files use the NIST SPHERE format. For more information about NTIMIT, see ``NTIMIT: A Phonetically Balanced, Continuous Speech, Telephone Bandwidth Speech Database,'' by C. Jankowski et al., in Volume 1 of the Proceedings of ICASSP-90, pp. 109--112. 
\vspace{.25in} \noindent Item Name: NTIMIT \\ LDC Catalog No.: LDC93S2\\ NIST Catalog No.: \#10-1.1, 10-2.1 \\ LDC Release date: 8/92 (MY93) \\ Nonmember price: \$250 \\ Special license: NO \\ \subsection{The Resource Management Corpora} The DARPA Resource Management Continuous Speech Corpora (RM) consist of digitized and transcribed speech for use in designing and evaluating continuous speech recognition systems. There are two main sections, often referred to as RM1 and RM2. RM1 contains four CD-ROMs, two with Speaker-Dependent (SD) training data, one with Speaker-Independent (SI) training data, and one with test and evaluation data. RM2 has 2 CD-ROMs with an additional and larger SD data set, including test material. All RM material consists of read sentences from a naval resource management task. The complete corpus contains over 25,000 utterances from more than 160 speakers representing a variety of American dialects. The material was recorded at 16 kHz, with 16-bit resolution, using a Sennheiser HMD-414 headset microphone. All discs conform to the ISO-9660 data format. RM sentences are consistent with a limited language model with a 1000-word vocabulary that allows queries about ships, ports, etc., along with commands to control a graphics display system, but little else. There is no ``official'' language model, but a simple non-probabilistic word-pair grammar that provides complete coverage of the sentences in this corpus is provided. The Resource Management text corpus was designed at BBN Laboratories, Inc. and SRI International. BBN also developed and made available the ``Word-Pair'' grammar that has been used in the benchmark tests. Texas Instruments, Inc. recruited the subjects and recorded and digitized the speech. For more information about the design and collection of this corpus see: P. Price, W.M. Fisher, J. Bernstein and D.S. 
Pallett, "The DARPA 1000-Word Resource Management Database for Continuous Speech Recognition", Proceedings of the 1988 International Conference on Acoustics, Speech and Signal Processing (Paper S.13.21, pp. 651- 654). A series of benchmark speech recognition performance assessment tests were conducted beginning in March 1987 using this corpus in conjunction with standardized scoring software. For more information see D.S. Pallett, "Benchmark Tests for DARPA Resource Management Database Performance Evaluations", in Proceedings of the 1989 International Conference on Acoustics, Speech and Signal Processing (Paper S10.b.6, pp. 536-539) and related papers in the Proceedings of the February 1989, October 1989, June 1990, and February 1991 DARPA Speech and Natural Language Workshops. \subsubsection {Complete Resource Management Corpus (RM Complete)} In addition to the RM1 and RM2 subsets as described in the following sections, LDC now offers the entire RM series at a reduced price, bundled as follows: \vspace{.25in} \noindent Item Name: RM Complete \\ LDC Catalog No.: LDC93S3A\\ NIST Catalog No.: \#2-1.1 through 2-4.2, 3-1.2 and 3-2.2 \\ LDC Release date: MY93 \\ Nonmember price: \$1000 \\ Special license: NO \\ \subsubsection {Resource Management SD and SI Training and Test Data (RM1)} The first two CD-ROMs contain Speaker-Dependent (SD) Training Data: 12 subjects, each reading a set of 600 "training sentences", 2 "dialect" sentences, and 10 "rapid adaptation" sentences, for a total of 7344 recorded sentence utterances. The 600 sentences designated as training cover 97\% of the lexical items in the corpus. The third CD-ROM contains the Speaker-Independent (SI) Training Data: 80 speakers each read the 2 "dialect" sentences plus 40 sentences from the Resource Management text corpus, for a total of 3360 recorded sentence utterances. 
Any given sentence from a set of 1600 Resource Management sentence texts was recorded by two subjects, while no sentence was read twice by the same subject. The fourth CD-ROM contains all SD and SI system test material used in 5 DARPA benchmark tests conducted in March and October of 1987, June 1988, and February and October 1989, along with scoring and diagnostic software and documentation for those tests. Documentation is also provided outlining use of the Resource Management training and test material at CMU in development of the SPHINX system. Example output and scored results for state-of-the-art speaker-dependent and speaker-independent systems (i.e., the BBN BYBLOS and CMU SPHINX systems) for the October 1989 benchmark tests are included, as well as SPeech HEader REsources (SPHERE) software and SPHERE-to-SAM conversion software. \vspace{.25in} \noindent Item Name: RM1 \\ LDC Catalog No.: LDC93S3B\\ NIST Catalog No.: \#2-1.1 through 2-4.2 \\ LDC Release date: 8/92 (MY93) \\ Nonmember price: \$600 \\ Special license: NO \\ \subsubsection {Extended Resource Management Speaker-Dependent Corpus (RM2)} This 2-disc set forms a speaker-dependent extension to the Resource Management (RM1) corpus. The corpus consists of a total of 10,508 sentence utterances (2 male and 2 female speakers each speaking 2,652 sentence texts). These include the 600 "standard" Resource Management speaker-dependent training sentences, 2 dialect calibration sentences, 10 rapid adaptation sentences, 1800 newly-generated extended training sentences, 120 newly-generated development-test sentences, and 120 newly-generated evaluation-test sentences. The evaluation-test material on the discs was used as the test set for the June 1990 DARPA SLS Resource Management Benchmark Tests (see the Proceedings.) The RM2 corpus was recorded at Texas Instruments. 
The NIST speech recognition scoring software originally distributed on the RM1 "Test" Disc was adapted for RM2 sentences, and is included on these discs as well as the SPHERE speech file header manipulation software. \vspace{.25in} \noindent Item Name: RM2 \\ LDC Catalog No.: LDC93S3C\\ NIST Catalog No.: \#3-1.2, 3-2.2 \\ LDC Release date: 8/90 (MY93) \\ Nonmember price: \$600 \\ Special license: NO \\ \pagebreak \subsection{Air Travel Information System (ATIS) Corpora} During 1989 and 1990, the DARPA Spoken Language Systems (SLS) Program initiated plans for development of a "common corpus" for both speech recognition and natural language research, using "spontaneous goal-directed" speech, rather than "read speech." The common task domain that was chosen is termed the "Air Travel Information System" (ATIS). The corpora developed to date in order to train and test systems in this domain are known as ATIS0, ATIS2, and ATIS3. (ATIS1 will not be published.) In all the ATIS corpora, users make spoken inquiries to simulated (ATIS0) or prototypical (ATIS2, ATIS3) speech understanding systems to obtain air travel information. The system has the information in the form of a relational database derived from the Official Airline Guide; the initial ATIS0 relational database, for example, contains information relevant to travel among 9 major airports serving 11 cities. To measure performance, the system's answers to the spoken inquiries are expressed in a logical form known as the "canonical answer specification" (CAS) language, and compared with canonical answers reviewed by human experts. There are thus a number of auxiliary files associated with each utterance, including orthographic transcriptions and, for answerable queries, ``reference answers''. Texas Instruments developed ATIS0, the pilot corpus for this program, using a "Wizard of Oz" technique to simulate an ATIS SLS. 
(See Hemphill, Godfrey and Doddington's paper ``The ATIS Spoken Language Systems Pilot Corpus'' in the Proceedings of the June 1990 DARPA Speech and Natural Language Workshop.) Since 1991, the data for ATIS2 and ATIS3 have been collected at multiple sites and pooled for common use. The number of speakers and utterances, the coverage of the travel information database, and the collection scenarios and platforms have all changed, as documented in each corpus section. For further information on the ATIS domain, on the test paradigm, and on ATIS-domain benchmark tests, see the Proceedings of the DARPA Speech and Natural Language Workshops held in October 1989, June 1990 and February 1991. (Morgan Kaufmann Publishers, Inc., 2929 Campus Drive, San Mateo, CA 94403. ISBN numbers: 1-55860-112-0, 1-55860-157-0, and 1-55860-207-0.) \subsubsection{ATIS0 Spontaneous Speech Pilot Corpus and Relational Database} The ATIS0 Corpus totals 6 CD-ROMs: one with spontaneous data from 36 speakers; one with read versions of the data from 20 of those speakers, along with some adaptation material; and four with extensive speaker dependent material from the ATIS domain, read by 10 of the same speakers. All ATIS speech data is recorded at a 16 kHz sample rate with 16-bit quantization, from two different microphones: a close-talking (Sennheiser) and a desk-top (Crown PCC-160) model. The first disc (ATIS0 Pilot) contains spontaneous utterances elicited in a ``Wizard-of-Oz'' simulation, along with the relational database containing the travel information (excluding connecting flights). Thirty-six speakers produced a total of 912 utterances. The second disc (ATIS0 Read) contains ``read'' versions of the spontaneous utterances for 20 of the 36 speakers above, for a total of 478 productions. This is supplemented by a set of 40 ``adaptation'' sentences read by each of the 20 speakers. 
The third through the sixth discs (ATIS0 SD-Read) contain ``read'' speech in the ATIS domain for ten of the speakers on the first disc. They read a total of 3171 utterances, or approximately 317 utterances per speaker. This data was collected for the purpose of training speaker-dependent speech recognition systems for the ATIS0 domain. Two of these four discs contain the close-talking (Sennheiser) microphone data, and the other two contain corresponding data for the desk-top (Crown PCC-160) microphone. Thus there are 6342 waveform files on the four discs. The entire ATIS0 set of six discs is now offered at a reduced price: \vspace{.25in} \noindent Item Name: ATIS0 Complete \\ LDC Catalog No.: LDC93S4A\\ NIST Catalog No.: \#5-1.1 through 5-6.1 \\ LDC Release date: 4/94 (MY93) \\ Nonmember price: \$1500 \\ Special license: NO \\ Any individual subcorpus from the ATIS0 set can be purchased for \$500: \vspace{.25in} \noindent Item Names: ATIS0 Pilot/Read/SD-Read \\ LDC Catalog Nos.: LDC93S4B-1/LDC93S4B-2/LDC93S4B-3\\ NIST Catalog No.: \#5-1.1 through 5-6.1 \\ LDC Release date: 4/92 (MY93) \\ Nonmember price: \$500 per subcorpus \\ Special license: NO \\ \subsubsection{ATIS2} The ATIS2 corpus, on four CD-ROMs, contains approximately 15,000 utterances recorded from approximately 450 subjects at five sites: AT\&T, BBN, CMU, MIT's Laboratory for Computer Science, and SRI. All utterances have been transcribed and almost 10,000 of them annotated with categorizations and canonical reference answers. Unlike the ATIS0 corpus, much of the data in ATIS2 was collected using partially or fully-automated data collection systems. The fully-automated data collection systems were, in fact, working ATIS prototypes. For ATIS2, the 10-city relational database of ATIS0 was revised to accommodate connecting flights and fares and some table headings were renamed. In addition to training data, the February and November '92 ATIS Benchmark Tests are included as well. 
Each contains approximately 1,000 utterances from the pool of data collected by the five sites. \vspace{.25in} \noindent Item Name: ATIS2 \\ LDC Catalog No.: LDC93S5\\ NIST Catalog No.: \#12-1.1 through 12-4.1 \\ LDC Release date: 4/92 (MY93) \\ Nonmember price: \$2,500 \\ Special license: NO \\ \subsubsection{ATIS3} The ATIS3 corpus, on three CD-ROMs, includes over 774 scenarios completed by 137 subjects, yielding a total of over 7,300 utterances. All utterances are transcribed and 2,900 of them have been categorized and annotated with canonical reference answers. The relational database for this dataset included flight information for 46 cities and 52 airports. Data was collected at BBN, CMU, MIT, and SRI, using their own ATIS systems, and at NIST using systems provided by BBN and SRI. Two 1000-utterance test sets were set aside from the data pooled by the collection sites. The first set was used in a December 1993 ARPA test, and is included in ATIS3. The second has been reserved for future testing. \vspace{.25in} \noindent Item Name: ATIS3 \\ LDC Catalog No.: LDC94S19\\ NIST Catalog No.: \#17-1.1 through 17-3.1 \\ LDC Release date: 8/94 (MY94) \\ Nonmember price: \$5000 \\ Special license: NO \\ \pagebreak \subsection {Continuous Speech Recognition (CSR) Corpora sponsored by ARPA} During 1991, the DARPA Spoken Language Program initiated efforts to build a new corpus to support research on large-vocabulary Continuous Speech Recognition (CSR) systems. The first two CSR Corpora consist primarily of read speech with texts drawn from a machine-readable corpus of Wall Street Journal text, and are thus often known as WSJ0 and WSJ1. (Later sections of the CSR set of corpora, however, will consist of read texts from other sources of North American business news, and eventually from other news domains.) The read data was collected using 5,000-word and 20,000-word subsets of the WSJ text corpus. Some spontaneous dictation is included in addition to the read speech. 
The dictation portion was collected using journalists who dictated hypothetical news articles. Two microphones are used throughout: a close-talking Sennheiser model, and a secondary microphone which may vary. The corpora are thus offered in three configurations: the speech from the Sennheiser, the speech from the other microphone, and the speech from both; all three sets include all transcriptions, tests, documentation, etc. In general, transcriptions of the speech, test data from ARPA evaluations, scores achieved by various speech recognition systems, and software used in scoring are included on separate discs from the waveform data. \subsubsection {ARPA Continuous Speech Recognition Corpus I: Wall Street Journal Sentences (WSJ0, or CSR-I)} MIT's Laboratory for Computer Science, SRI International and Texas Instruments collected approximately 40 hours of speech and over 31,000 utterances. Prompts were taken from the Wall Street Journal. Development and evaluation test sets are included and so marked. 
\vspace{.25in} \noindent Item Name: CSR-I Complete \\ LDC Catalog No.: LDC93S6A\\ NIST Catalog No.: \#11-1.1 through 11-12.1, 11-14.1, 11-15.1 \\ LDC Release date: 7/93 (MY93) \\ Nonmember price: \$4,000 \\ Special license: NO \\ \vspace{.25in} \noindent Item Name: CSR-I Sennheiser \\ LDC Catalog No.: LDC93S6B\\ NIST Catalog No.: \#11-1.1 through 11-6.1, 11-13.1 through 11-15.1 \\ LDC Release date: 4/93 (MY93) \\ Nonmember price: \$2,000 \\ Special license: NO \\ \vspace{.25in} \noindent Item Name: CSR-I Other \\ LDC Catalog No.: LDC93S6C\\ NIST Catalog No.: \#11-7.1 through 11-15.1\\ LDC Release date: 4/93 (MY93) \\ Nonmember price: \$2,000 \\ Special license: NO \\ \subsubsection {ARPA Continuous Speech Recognition Corpus II: Wall Street Journal Sentences (WSJ1, or CSR-II)} The complete WSJ1 corpus contains approximately 78,000 training utterances ($\sim$73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus contains approximately 8,200 ``conventional'' development test utterances ($\sim$8 hours of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus, the entire corpus was collected using 2 microphones, so the amount of speech in the entire corpus is about 162 hours. In early 1993, a ``Hub and Spoke'' test paradigm was designed, calling for eleven test sets, each a specific variation of the basic or ``hub'' condition. The eleven Hub and Spoke Development and Evaluation Test sets each contain approximately 7500 waveforms ($\sim$11 hours of speech). WSJ1 waveforms have been compressed by about 2:1 using the SPHERE-embedded ``Shorten'' compression algorithm developed at Cambridge University. 
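As a rough check on the 162-hour figure, assuming that it counts the training speech and the conventional development-test speech once for each of the two microphones (the description does not say whether the Hub and Spoke test sets are included in this total): \[ 2 \times (73 + 8) = 162 \mbox{ hours}. \] 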
\vspace{.25in} \noindent Item Name: CSR-II Complete \\ LDC Catalog No.: LDC94S13A\\ NIST Catalog No.: \#13-1.1 through 13-34.1 \\ LDC Release date: 7/93 (MY94) \\ Nonmember price: \$10,000 \\ Special license: NO \\ \vspace{.25in} \noindent Item Name: CSR-II Sennheiser \\ LDC Catalog No.: LDC94S13B\\ NIST Catalog No.: \#13-1.1 through 13-7.1, 13-11.1, 13-13.1 \\ through 13-16.1,13-18.1 through 13-21.1, and \\ 13-32.2 through 13-34.1 \\ LDC Release date: 7/93 (MY94) \\ Nonmember price: \$5,000 \\ Special license: NO \\ \vspace{.25in} \noindent Item Name: CSR-II Other \\ LDC Catalog No.: LDC94S13C\\ NIST Catalog No.: \#13-8.1 through 13-10.1, 13-12.1 through \\ 13-14.1, 13-17.1, and 13-22.1 through 13-34.1 \\ LDC Release date: 7/93 (MY94) \\ Nonmember price: \$5,000 \\ Special license: NO \\ \subsection{Switchboard Corpus of Recorded Telephone Conversations} SWITCHBOARD is a collection of about 2400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven "robot operator" system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person to take part in a conversation, introducing a topic for discussion, and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callers was constrained so that: (1) no two speakers would converse together more than once, and (2) no one spoke more than once on a given topic. Waveform files were recorded into two channels directly from the T1 digital telephone circuits, at an 8kHz sample rate and 8-bit mu-law quantization. Complete orthographic transcriptions were made for each conversation, with codes to identify overlapping portions (both speakers talking at the same time), certain non-speech events (laughter, coughs, etc), and interruptions/hesitations. 
Each conversation was also rated by transcribers for various quality factors (amount of cross-talk between channels, static and background noise, topicality, etc.). In addition, each transcription was verified, and then used in a forced speech-recognition algorithm to establish timing marks for word and utterance boundaries; transcriptions are provided in the corpus in both ``plain text'' and ``time-aligned'' forms. A description is published in the 1993 ICASSP Proceedings: Godfrey, McDaniel, and Holliman, ``SWITCHBOARD: A Telephone Speech Corpus for Research and Development.'' The original issue of SWITCHBOARD in early 1993 lacked about 150 conversations which were intended for publication but omitted by error. They were published in May 1994 and distributed to all previous recipients of SWITCHBOARD. The Switchboard Corpus was collected at Texas Instruments and produced on CD-ROM at the National Institute of Standards and Technology. It is distributed in a notebook-style binder with 28 CD-ROMs (27 containing speech data, and one containing all transcription data). Preparation of the data for CD-ROM production was done by NIST. The waveform files use the NIST SPHERE format. \vspace{.25in} \noindent Item Name: Switchboard \\ LDC Catalog No.: LDC93S7\\ NIST Catalog No.: \#9-1.1, 9-3.1 through 9-29.1 \\ LDC Release date: 4/92 (MY93) \\ Nonmember price: \$10000 \\ Special license: NO \\ \subsection {Switchboard Corpus Excerpts, Credit Card Conversations} This CD-ROM contains 35 conversations on the topic of ``Credit Card Use''. Most but not all can also be found in the Switchboard Corpus (see above). The conversations can be used in training and testing wordspotting systems. In addition to 2-channel mu-law encoded audio waveform files, the disc contains transcriptions, time-alignments, and wordspotting targets. 
\vspace{.25in} \noindent
Item Name: Switchboard Credit Card \\
LDC Catalog No.: LDC93S8\\
NIST Catalog No.: \#8-1.2 \\
LDC Release date: 5/92 (MY93) \\
Nonmember price: \$1000 \\
Special license: NO \\

\subsection {Texas Instruments 46-Word Speaker-Dependent Isolated Word Corpus (TI46)}

This CD-ROM contains a corpus of speech which was originally designed and collected at Texas Instruments, Inc. (TI) in 1980, and used initially in performance assessment tests of isolated-word speaker-dependent technology. (See ``Speech Recognition: Turning Theory to Practice'' by G. R. Doddington and T. B. Schalk, in IEEE Spectrum, Vol. 18, No. 9, September 1981.) The 46-word vocabulary consists of two sub-vocabularies: (1) the TI 20-word vocabulary (consisting of the digits zero through nine plus the words ``enter'', ``erase'', ``go'', ``help'', ``no'', ``rubout'', ``repeat'', ``stop'', ``start'', and ``yes''), and (2) the TI 26-word ``alphabet set'' (consisting of the letters ``a'' through ``z''). The corpus contains read utterances from 16 speakers (8 males and 8 females) each speaking 26 utterances of the 46-word vocabulary: 16 tokens designated as training and 10 as test. The corpus was collected at Texas Instruments in a quiet acoustic enclosure using an Electro-Voice RE-16 Dynamic Cardioid microphone at a 12.5 kHz sample rate with 12-bit quantization. The files are in NIST SPHERE format, and have a ``.wav'' filename extension.
\vspace{.25in} \noindent
Item Name: TI 46 Word \\
LDC Catalog No.: LDC93S9\\
NIST Catalog No.: \#7-1.1 \\
LDC Release date: 4/92 (MY93) \\
Nonmember price: \$125 \\
Special license: NO \\

\subsection {Texas Instruments Speaker-Independent Connected-Digit Corpus (TIDIGITS)}

This three-disc set contains speech which was originally designed and collected at Texas Instruments, Inc. (TI) for the purpose of designing and evaluating algorithms for speaker-independent recognition of connected digit sequences.
There are 326 speakers (111 men, 114 women, 50 boys, and 51 girls) each pronouncing 77 digit sequences. Each speaker group is partitioned into test and training subsets. The corpus was collected at TI in 1982 in a quiet acoustic enclosure using an Electro-Voice RE-16 Dynamic Cardioid microphone, digitized at 20 kHz. The waveform files are in the NIST SPHERE format.
\vspace{.25in} \noindent
Item Name: TIDIGITS \\
LDC Catalog No.: LDC93S10\\
NIST Catalog No.: \#4-1, 4-2, 4-3 \\
LDC Release date: 4/92 (MY93) \\
Nonmember price: \$750 \\
Special license: NO \\

\subsection {Road Rally Conversational Speech Corpora}

The ``Road Rally'' corpus was designed for the development and testing of word-spotting systems, and consists of two sub-corpora, known as ``Stonehenge'' and ``Waterloo.'' Stonehenge was collected using telephone handsets modified to contain a high quality microphone. To gather conversational data, two talkers were located in separate rooms, given a road map, and asked to participate in a road rally planning task. The digitized speech was filtered using a 300 Hz to 3300 Hz PCM FIR bandpass filter to simulate telephone quality. The Stonehenge corpus contains 3 ``styles'' of speech data: (1) the spontaneous conversations, (2) a read paragraph, containing at least one occurrence of each of the key words, and (3) a set of read ``carrier'' sentences. There are 80 speakers, 52 males and 28 females. Twenty words were selected as keywords, and their occurrences and locations are marked. The Waterloo corpus was collected as an extension of Stonehenge, providing similar domain material, but collected under different conditions. It is intended for use in training models of keywords in the conversational portion of the Stonehenge corpus. The Waterloo material was collected from 56 speakers (28M, 28F) using conventional telephone handsets and dialed-up telephone lines in the Massachusetts area, and consists of a read passage only (not the same as that in Stonehenge).
For this release, the naturally band-limited telephone handset and line speech data were subsequently filtered with the same 300 Hz to 3300 Hz PCM FIR bandpass filter that was used for this release's Stonehenge data. Suggested wordspotting training and test procedures are outlined in the documentation.
\vspace{.25in} \noindent
Item Name: ROAD RALLY \\
LDC Catalog No.: LDC93S11\\
NIST Catalog No.: \#6-1.1 \\
LDC Release date: 4/92 (MY93) \\
Nonmember price: \$500 \\
Special license: NO \\

\subsection{The HCRC Map Task Corpus}

The Map Task Corpus is a set of 8 CD-ROMs containing a total of about 18 hours of spontaneous speech that was recorded from 128 two-person conversations, involving 64 different speakers (32 female, 32 male, all adults, each taking part in four conversations). The 64 speakers were all students at the University of Glasgow, 61 of them being native Scots. The conversations were carried out in an experimental setting, in which each participant had a schematic map in front of them, not visible to the other. Each map consists of an outline and roughly a dozen labelled features (e.g. ``a white cottage'', ``an oak forest'', ``Green Bay''). Most features are common to the two maps, but not all. One map has a route drawn in, the other does not. The task is for the participant without the route to draw one on the basis of discussion with the participant with the route. In addition to the conversations, each speaker provides a wordlist reading, consisting of the major vocabulary items contained in the conversations. The experimental design allows a number of different phonemic, syntactico-semantic and pragmatic contrasts to be explored in a controlled way.
In particular, maps and feature names were designed to allow for controlled exploration of phonological reductions of various kinds in a number of different referential contexts, and to provide, via varying patterns of matches and mismatches between the two maps, a range of different stimuli for referent negotiation. The conditions of the conversations were also carefully balanced: in half of them the talkers were strangers, in half friends; in half of them the talkers could see each other's faces, in half they could not. The waveform data are provided in ``raw'' (headerless) files (16-bit samples, 20 kHz sample rate, 2 channels per conversation), and alternative header files are provided for use with software based on either the NIST ``SPHERE'' header structure or the European ``SAM'' header structure. Text transcriptions are provided for each conversation, along with PostScript files of the map images used in the experiments. Additional materials include full documentation of the experimental design and data collection protocol, resources for using SGML tools on the transcriptions and other text materials, and an extensive set of source code for performing basic signal processing functions on the waveform data, such as down-sampling, de-multiplexing, channel summation, and D/A conversion for Sun workstations (including playback of segments selected via inspection of transcripts in Emacs).
\vspace{.25in} \noindent
Item Name: HCRC MAP TASK \\
LDC Catalog No.: LDC93S12\\
NIST Catalog No.: NA \\
LDC Release date: 4/92 (MY93) \\
Nonmember price: \$200 \\
Special license: NO \\

\subsection {Air Traffic Control Corpus (ATC0)}

The Air Traffic Control Corpus (ATC0) is an eight-disc set of recorded speech for use in supporting research and development activities in the area of robust speech recognition in domains similar to air traffic control (several speakers, noisy channels, relatively small vocabulary, constrained language, etc.).
The audio data on these discs is composed of voice communication traffic between various controllers and pilots. The audio files are 8 kHz, 16-bit linear sampled data, representing continuous monitoring, without squelch or silence elimination, of a single FAA frequency for one to two hours. There are also files which indicate the amplitude of the received AM carrier signal at 10 msec intervals. Full transcripts, including the start and end times of each transmission, are provided for each audio file. Each flight is identified by its flight number. ATC0 consists of three subcorpora, one for each airport in which the transmissions were collected: Dallas Fort Worth (DFW), Logan International (BOS), and Washington National (DCA). The complete set contains approximately 70 hours of controller and pilot transmissions collected via antennas and radio receivers which were located in the vicinity of the respective airports. Detailed information regarding the collection process and the equipment used can be found on each disc in the file ``atc.doc'' in the ``/doc'' directory. The ATC0 Corpus was collected by Texas Instruments under contract to ARPA. It was produced on CD-ROM by the National Institute of Standards and Technology for distribution by the Linguistic Data Consortium.
\vspace{.25in} \noindent
Item Name: AIR TRAFFIC CONTROL \\
LDC Catalog No.: LDC94S14\\
NIST Catalog No.: \#16-1.1 through 16-8.1 \\
LDC Release date: 3/94 (MY94) \\
Nonmember price: \$2500 \\
Special license: NO \\

\subsection {SPIDRE Speaker Identification Corpus}

This is a 2-CD subset of the SWITCHBOARD collection (see above), selected for speaker ID research, with special attention to telephone instrument variation. It contains training and testing data for experiments in closed or open set recognition or verification. Combining the two sides of the conversations also permits speaker change detection, or speaker monitoring, experiments.
There are 45 ``target'' speakers; four conversations from each target are included, of which two are from the same handset. There are also 100 calls in which no target appears. Since all conversations are two-sided, this results in 180 target sides and 180 + 200 = 380 nontarget sides. Except for truncations of a few longer calls at 5 minutes, the calls themselves are as described under SWITCHBOARD.
\vspace{.25in} \noindent
Item Name: SPIDRE \\
LDC Catalog No.: LDC94S15 \\
NIST Catalog No.: \#18-1.1 and 18-2.1 \\
LDC Release date: 4/94 (MY94) \\
Nonmember price: \$2500 \\
Special license: NO \\

\subsection{YOHO Speaker Verification Corpus}

The YOHO database is a three-disc set containing a large scale, high-quality speech corpus to support text-dependent speaker authentication research, such as is used in ``secure access'' technology. The data was collected in 1989 by ITT under a US Government contract, but has not been available for public use before. Note that certain changes have been made to the corpus, mainly to ensure the privacy of the speakers, and some data has been withheld by the government for future use in testing. YOHO contains:
\begin{itemize}
\item ``Combination lock'' phrases (e.g., 36-24-36)
\item Collected over a 3-month period in a real-world office environment
\item 4 enrollment sessions per subject with 24 phrases per session
\item About 10 test sessions per subject with 4 phrases per session
\item 8 kHz sampling with 3.8 kHz analog bandwidth
\item 1.5 gigabytes of data
\end{itemize}
The number of trials is thus sufficient to permit evaluation testing at high confidence levels. In each session, a speaker was prompted with a series of phrases to be read aloud; each phrase was a sequence of three two-digit numbers (e.g. ``35 - 72 - 41,'' pronounced ``thirty-five seventy-two forty-one''). The first four sessions for a given speaker were enrollment sessions of 24 phrases, and all additional sessions were verification trials of four phrases each.
In all there are 552 enrollment sessions and 1380 trial sessions, with a nominal time interval of three days between sessions.
\vspace{.25in} \noindent
Item Name: YOHO \\
LDC Catalog No.: LDC94S16 \\
NIST Catalog No.: NA \\
LDC Release date: 4/94 (MY94) \\
Nonmember price: \$1000 \\
Special license: NO \\

\subsection {OGI Multi-Language Corpus}

The corpus consists of responses to prompts spoken over commercial telephone lines by speakers of English, Farsi (Persian), French, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. It contains a total of 1927 calls, an average of 175 calls per language. Speech was collected using an automated system that answered the telephone, played digitized prompts in the appropriate language to request the speech samples, and digitized the callers' responses for a designated period of time. Log files are included that provide a set of automatic measurements made on each utterance. In addition, some utterances were automatically segmented into broad phonetic categories. The speech data are compressed, with NIST SPHERE headers.
\vspace{.25in} \noindent
Item Name: OGI MULTI-LANGUAGE TELEPHONE \\
LDC Catalog No.: LDC94S17 \\
NIST Catalog No.: NA \\
LDC Release date: 4/94 (MY94) \\
Nonmember price: \$2000 \\
Special license: NO \\

\subsection{OGI Spelled and Spoken Telephone Corpus}

The OGI Spelled and Spoken Telephone Corpus consists of speech recordings from over 3650 telephone calls, each made by a different speaker to an automated prompting/recording system installed at the Oregon Graduate Institute. Speakers were asked to say their name, where they were calling from, and where they grew up; they were asked to answer a couple of yes/no questions, and to spell their first and last names; many were also asked to repeat a few specific words, and to recite the letters of the alphabet.
Each response to a prompt is stored as a separate waveform file, and the files are organized according to prompt (response type); all responses from a given call have a unique caller-index number as part of the file name, so that responses can easily be sorted by speaker. Waveform data are stored in compressed form, using the NIST SPHERE 2.0 software package, which is available separately at no charge to users. SPHERE 2.0 provides the decompression software needed to extract the waveform data, as well as tools for accessing and modifying file headers. Time-aligned phonetic transcriptions are provided for a subset of responses, and a complete log of each call (giving speaker sex, quality judgments, and orthographic transcriptions of all responses) is included in a form suitable for use as a relational database.
\vspace{.25in} \noindent
Item Name: OGI SPELLED \& SPOKEN WORD \\
LDC Catalog No.: LDC94S18 \\
NIST Catalog No.: NA \\
LDC Release date: 4/94 (MY94) \\
Nonmember price: \$100 \\
Special license: NO \\

\subsection{BRAMSHILL}

The recordings on this nine-disc set were originally made in 1978-79 as part of a British Home Office study into speaker identification techniques. Subsequently, it was realised that a large body of unconstrained conversational material might be of interest to researchers working in other speech processing fields. The recordings were transcribed and the CD-ROMs prepared during 1993. The recordings were made at the Police Staff College, Bramshill, Hampshire, England. The participants were police officers taking part in the various courses at the college. This provided a wide range of regional accents and a range of ages from late teens to early fifties. Each speaker is described by nine demographic attributes. Three adjacent bedrooms were used. The two participants, each alone in a room, conversed by telephone. The third room was used as a monitoring and recording station.
In addition to the telephone recordings, reference recordings were made using a high quality dynamic microphone in each room. It is these higher quality recordings, {\em not the telephone speech}, which are provided on the BRAMSHILL CD-ROM set. The recordings were made on a Sony Elcaset EL-7 cassette machine, chosen at the time because of its good speed stability. The microphone was a Shure SM-7 cardioid type. The speech data was sampled at 10 kHz, 16-bit resolution. Some attempt was made to control the acoustic environment. It is evident from listening to the recordings that, while these measures produced a reasonable recording environment, the rooms were far from soundproof. A variety of external noises (engines, aircraft, etc.) can be heard on some of the recordings. Each speaker was given a pile of photographs. In response to a bleep signal, each speaker introduced himself by name and read a set of test sentences. After this, the main part of the conversation took place, in which participants were asked to determine which of each pair of photographs had been taken first (if indeed they were related at all). The conversations continued for 10 minutes until terminated by another bleep signal. During the digitisation process, some periods of silence were removed, so some recordings now appear to be shorter than the original ten minutes. Furthermore, this means that recordings of two sides of a conversation {\em are no longer time-aligned}. In addition, to preserve the anonymity of the speakers, some passages (mainly the introductions) have been erased by replacing them with binary zeroes. Finally, the bleep signals have also been erased with binary zeroes. The transcriptions indicate where this has occurred. The speech was transcribed verbatim. No attempt was made to correct grammar, fill in missing words, etc. Transcription conventions are detailed in the documentation.
Every lexical word from the transcriptions is contained in the dictionary supplied in the INDEX directory. There are about 6500 word types in the 600k words of the transcripts. Contractions, part-words, slang words, hesitation sounds, and non-speech sounds are all treated as words in their own right in the dictionary.
\vspace{.25in} \noindent
Item Name: BRAMSHILL \\
LDC Catalog No.: LDC94S20 \\
NIST Catalog No.: NA \\
LDC Release date: 8/94 (MY94) \\
Nonmember price: \$150 \\
Special license: NO \\

\subsection{MACROPHONE}

MACROPHONE consists of approximately 200,000 utterances by 5000 speakers. It is designed to provide material sufficient and suitable for research, development, and evaluation of automatic speech recognition technology for common telephone applications, such as shopping, transportation, database access, and autodialing. In addition to application-oriented phrases and numerous digit strings, seven sentences are spoken by each talker to provide ensemble phoneme, diphone and triphone coverage of the language. The spoken material also refers to times, locations, monetary amounts, spellings, and interactive operations. The utterances were collected automatically over the telephone network by recording directly from a T1 connection in 8 kHz, 8-bit mu-law format. The participants, roughly equal numbers of males and females, were solicited by a marketing firm from all regions of the United States. They ranged in age from the teens to the seventies, and represented a broad range of education and income levels as well. Each recorded utterance is accompanied by an orthographic transcription which also notes any unusual acoustic events or anomalies. Macrophone is the American English contribution to an international database of telephone speech corpora called POLYPHONE. Similar data sets are expected for major languages of the world, and at least some of these will be made available through LDC.
Prospects are currently good for American Spanish (by early 1995), Dutch, Standard French, Standard German, Japanese, Mandarin Chinese, Swiss French, and Danish versions of POLYPHONE, all with basically the same structure and methods of collection. MACROPHONE was collected at SRI under LDC sponsorship. A paper describing it was presented at ICASSP-94: ``Macrophone: An American English Telephone Speech Corpus for the POLYPHONE Project,'' by Jared Bernstein, Kelsey Taussig, and Jack Godfrey.
\vspace{.25in} \noindent
Item Name: MACROPHONE \\
LDC Catalog No.: LDC94S22 \\
NIST Catalog No.: NA \\
LDC Release date: 8/94 (MY94) \\
Nonmember price: \$10000 \\
Special license: NO \\

\section{Text Corpora: Descriptions and Ordering Information}

\subsection{Association for Computational Linguistics Data Collection Initiative (ACL/DCI)}

The ACL Data Collection Initiative disc contains text from: Wall Street Journal, copyright 1987, 1988, 1989, provided by Dow Jones, Inc.; the Collins English Dictionary, copyright 1979, William Collins Sons \& Co., Ltd.; scientific abstracts provided by the U.S. Department of Energy; and a variety of grammatically tagged and parsed materials from the Treebank project at the University of Pennsylvania, copyright 1990, 1991, University of Pennsylvania. The total amount of uncompressed text is 620 Mbytes. The many formats in which the originals of these texts came have all, to one extent or another, been mapped into a markup language consistent with the SGML standard (ISO 8879). The format of the material from the Wall Street Journal uses a labelled bracketing, expressed in the style of SGML, although no formal SGML DTD is provided. The tag set has been modified by turning the Dow Jones header categories into tags and by creating ad hoc tags such as ``.'' The original datelines are presented as separate text units; the text is divided and tagged into paragraphs and sentences with each sentence presented on a single line.
Nothing has been done to modify the typographical methods used to subdivide headlines and stories into sections, nor are any of the text features within sentences (quotes, ellipsis, etc.) normalized. The Collins English Dictionary is present in two forms. One form was approximately parsed into fielded records as an exercise in learning a language called ``FIT'', by a student working under the direction of Lloyd Nakatani at AT\&T Bell Laboratories during the summer of 1990. The original digital image of the typographer's tape that the database version was prepared from had serious flaws that were not detected and corrected until later; the corrected version, a clean typographer's tape, is presented in a separate directory. A properly-analyzed database version will be provided in the future. The documentation includes notes developed during the new attempt to analyze the tape from scratch. The Department of Energy abstracts reside in files that are approximately one megabyte each. The original 950 separators have been replaced with newlines, and space padding between articles was removed. An acronym dictionary that was extracted from the database as an indication of the material's topic areas has been included in a separate directory. Provisional material from the Penn Treebank project is divided into two subdirectories on this disk. The subdirectory ``postext'' contains text with part-of-speech annotations; ``parstext'' contains text with syntactic bracketing. \vspace{.25in} \noindent Item Name: ACL/DCI \\ LDC Catalog No.: LDC93T1\\ NIST Catalog No.: NA \\ LDC Release date: 4/92 (MY93) \\ Nonmember price: \$25 \\ Special license: YES \\ \subsection {The Penn Treebank Project} This CD-ROM contains over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional 1 million words tagged for part-of-speech. This material is a subset of the corpus for the current DARPA large-vocabulary speech recognition project. 
It also contains the first fully parsed version of the Brown Corpus, which has also been completely retagged using the Penn Treebank tag set. Also included are tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3, and ATIS. In addition, the CD-ROM includes source code for several software packages, including tgrep, which permits the user to search for specific constituents in tree structures. Later versions of Treebank in MY95 will include greater depth of annotation, and more varied corpus materials. \vspace{.25in} \noindent Item Name: PENN TREEBANK \\ LDC Catalog No.: LDC93T2\\ NIST Catalog No.: NA \\ LDC Release date: 1/93 (MY93) \\ Nonmember price: \$2500 \\ Special license: NO \\ \pagebreak \subsection {TIPSTER Information Retrieval Text Research Collection} The TIPSTER project is sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections. The detection data is comprised of a new test collection built at NIST to be used both for the TIPSTER project and the related TREC project. The TREC project has many other participating information retrieval research groups, working on the same task as the TIPSTER groups, but meeting once a year in a workshop to compare results (similar to MUC). The test collection built at NIST consists of 3 disks (gigabytes) of documents, 150 topics, and the answers (relevant documents) for those topics. The documents in the test collection are varied in style, size, and subject domain. The first disk contains material from the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal Register (1989), information from Computer Select disks (Ziff-Davis Publishing), and short abstracts from the Department of Energy. 
The second disk contains information from the same sources, but from different years. The third disk contains more information from the Computer Select disks, plus material from the San Jose Mercury News (1991), more AP newswire (1990), and about 250 megabytes of formatted U.S. Patents. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging or breakdown into individual sentences or paragraphs as the purpose of this collection is to test retrieval against real-world data. The three Tipster discs so far released have been re-issued with updates and corrections, and all recipients of the earlier versions should have received these replacements free of charge. If you think you have the unrevised original, contact LDC for confirmation. A fourth Tipster volume is planned for release during MY95. \subsubsection{TIPSTER Volume 1, March 1992} \begin{tabular}{ll} Directory Name & Description\\ \\ /ap & Associated Press Newswire material, copyright 1989\\ /fr & Federal Register material, 1989\\ /wsj & Wall Street Journal, copyright 1987, 1988, 1989\\ /doe & Department of Energy abstracts\\ \end{tabular} \vspace{.25in} \noindent Item Name: TIPSTER vol.1 \\ LDC Catalog No.: LDC93T3-1.1 \\ NIST Catalog No.: NA \\ LDC Release date: 4/92 (MY93) \\ Nonmember price: \$1000 \\ Special license: YES \\ \subsubsection{TIPSTER Volume 2, July 1992} \begin{tabular}{ll} Directory Name & Description\\ \\ /ap & Associated Press Newswire material, copyright 1988\\ /fr & Federal Register, 1988\\ /wsj & Wall Street Journal, copyright 1990, 1991, 1992\\ /ziff & Ziff-Davis Publishing, copyright 1989, 1990\\ /doe & Department of Energy abstracts\\ \end{tabular} \vspace{.25in} \noindent Item Name: TIPSTER vol.2 \\ LDC Catalog No.: LDC93T3-2.1 \\ NIST Catalog No.: NA \\ LDC Release date: 7/92 (MY93) \\ Nonmember price: \$1000 \\ Special license: YES \\ \subsubsection{TIPSTER Volume 3, April 
1993}
\begin{tabular}{ll}
Directory Name & Description\\
\\
/ap & Associated Press material, copyright 1990\\
/patents & U.S. Patent documents, 1983-1991\\
/sjm & San Jose Mercury News, copyright 1991\\
\end{tabular}
\vspace{.25in} \noindent
Item Name: TIPSTER vol.3 \\
LDC Catalog No.: LDC93T3-3.1 \\
NIST Catalog No.: NA \\
LDC Release date: 7/92 (MY93) \\
Nonmember price: \$1000 \\
Special license: YES \\
\pagebreak

\subsection {United Nations Parallel Text Corpus (English, French, Spanish)}

This set of three compact discs contains documents provided to the LDC by the United Nations, for use in research on machine translation technology. The documents come from the Office of Conference Services at the UN in New York, and are drawn from archives that span the period between 1988 and 1993. This publication contains the English, French and Spanish archives, with data from each language stored on a separate disc in the set. Care has been taken to arrange the document files in a parallel directory structure for each language, so that corresponding translations of a document are found directly by means of the directory paths and file names. All parallel files in this corpus are English-based: for every file on the English disc, there will be a corresponding file on either the French or Spanish disc, or both. Tables are included on all discs to assist in determining which parallels are present. Due to the nature and organization of UN translation services and the original electronic text archives, the process of finding and sorting out parallel documents yielded numerous gaps, with many files in each language having no parallel in other languages. In preparing the text for publication, we have applied a fully-compliant SGML format (Standard Generalized Markup Language). For those researchers who use SGML, a working DTD (Document Type Definition) is provided on each disc.
For those who do not need SGML markup, a simple script is included that can be used to filter out the SGML-specific material, and leave only the plain text. The character set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table. \vspace{.25in} \noindent Item Name: UNITED NATIONS PARALLEL TEXT Complete Set \\ LDC Catalog No.: LDC94T4A\\ NIST Catalog No.: NA \\ LDC Release date: 4/94 (MY94) \\ Nonmember price: \$5000 \\ Special license: NO \\ \vspace{.25in} \noindent Item Name: UNITED NATIONS PARALLEL TEXT English \\ LDC Catalog No.: LDC94T4B-1 \\ NIST Catalog No.: NA \\ LDC Release date: 4/94 (MY94) \\ Nonmember price: \$2500 \\ Special license: NO \\ \vspace{.25in} \noindent Item Name: UNITED NATIONS PARALLEL TEXT French \\ LDC Catalog No.: LDC94T4B-2 \\ NIST Catalog No.: NA \\ LDC Release date: 4/94 (MY94) \\ Nonmember price: \$2500 \\ Special license: NO \\ \vspace{.25in} \noindent Item Name: UNITED NATIONS PARALLEL TEXT Spanish \\ LDC Catalog No.: LDC94T4B-3.1 \\ NIST Catalog No.: NA \\ LDC Release date: 4/94 (MY94) \\ Nonmember price: \$2500 \\ Special license: NO \\ \subsection {ECI-1} The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI), has 46 subcorpora in 27 (mainly European) languages. The total size of these is roughly 92 million (lexical) words. The corpora are marked up using TEI P2 conformant SGML (to varying levels of detail), with easy access to the source text without markup. Twelve of the component corpora are multilingual parallel corpora with from two to nine sub-corpora. All the alphabetic corpora (there is some Japanese and Chinese) are encoded in the ISO LATIN family of 8-bit character sets (ISO 8859-1, -5 and -7). The CD-ROM is in High Sierra format (ISO 9660), readable on UN*X, MSDOS and Apple systems at least. 
The amount of material per language varies, from about 36 million words (German) to about 5 thousand words (Bulgarian). The majority of sources are journalistic in nature (newspapers, magazines, broadcasts); additional sources include dictionaries (Albanian, Gaelic, Turkish, Japanese/English), literature, technical reports, and proceedings or publications of international organizations. The table on the next page lists the languages included, the subcorpus numbers for each language (in parentheses), and the amount of data per language in thousands of lexical words.
\newpage
\begin{tabular}{lrrrrrrrrr}
Language & (Subcorpus \#)& Kwords & & & & & & & Totals\\
\\
German & (70)& 34291 & (09) & 191 & (65) & 20 & (28) & 187 &\\
 & (29)& 59 & (30) & 76 & (47) & 24 & (59) & 50 &\\
 & (71) & 21 & (70A)& 999 & & & & & 35918\\
French & (31) & 4775 & (04) & 4121 & (28) & 187 & (29) & 59 &\\
 & (30) & 76 & (47) & 24 & (51) & 6 & (59) & 50 &\\
 & (71) & 21 & (32) & 1667 & & & & & 10986\\
Spanish & (31) & 4500 & (13) & 830 & (14) & 1041 & (15) & 447 &\\
 & (47) & 24 & (32) & 1667 & (59) & 50 & (71) & 21 & 8580\\
English & (31) & 4222 & (36) & 1141 & (74) & 95 & (28) & 187 &\\
 & (47) & 24 & (51) & 6 & (56) & 97 & (59) & 50 &\\
 & (71) & 21 & (32) & 1667 & & & & & 7510\\
Dutch & (03) & 5500 & (02) & 600 & (47) & 24 & (71) & 21 & 6145\\
Czech & (44) & 4726 & & & & & & & 4726\\
Italian & (11) & 3518 & (42) & 303 & (58) & 13 & (29) & 59 &\\
 & (30) & 76 & (47) & 24 & (71) & 21 & & & 4014\\
Chinese & (78) & 2895 & & & & & & & 2895\\
Greek & (10) & 2515 & (47) & 24 & (59) & 50 & (71) & 21 & 2610\\
Norwegian & (41) & 2226 & & & & & & & 2226\\
Swedish & (37) & 1718 & & & & & & & 1718\\
Serb/Croat/Slov & (24) & 700 & (56) & 289 & & & & & 989\\
Tibetan & (76) & 834 & & & & & & & 834\\
Portuguese & (60) & 675 & (47) & 24 & (71) & 21 & & & 720\\
Malay & (80) & 563 & & & & & & & 563\\
Russian & (73) & 364 & & & & & & & 364\\
Japanese & (57) & 203 & & & & & & & 203\\
Turkish & (20) & 173 & (20A) & 110 &
& & & & 283\\
Albanian & (82) & 205 & & & & & & & 205\\
Gaelic & (55) & 141 & & & & & & & 141\\
Estonian & (39) & 100 & & & & & & & 100\\
Uzbek & (81) & 88 & & & & & & & 88\\
Latin & (74) & 75 & & & & & & & 75\\
Danish & (47) & 24 & (71) & 21 & & & & & 45\\
Lithuanian & (89) & 20 & & & & & & & 20\\
Bulgarian & (84) & 5 & & & & & & & 5\\
\\
Total & & & & & & & & & 91969\\
\end{tabular}

\vspace{.2in}
\noindent Item Name: ECI/MCI \\
LDC Catalog No.: LDC94T5\\
NIST Catalog No.: NA \\
LDC Release date: 6/94 (MY94) \\
Nonmember price: \$35 \\
Special license: YES \\

\newpage
\section{Lexical Databases: Descriptions and Ordering Information}

\subsection{CELEX Lexical Database}

This corpus contains ASCII versions of the CELEX lexical databases of English (version 2.5), Dutch (version 3.1) and German (version 2.0). CELEX was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. Pre-mastering and CD-ROM production were done by the LDC.

For each language, this CD-ROM contains detailed information on:
\begin{itemize}
\item the orthography (variations in spelling, hyphenation),
\item the phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress),
\item the morphology (derivational and compositional structure, inflectional paradigms),
\item the syntax (word class, word-class specific subcategorizations, argument structures) and
\item word frequency (summed word and lemma counts, based on recent and representative text corpora).
\end{itemize}

The databases have not been tailored to fit any particular database management program. Instead, the information is stored in ASCII files in a UNIX directory tree that can be queried with tools such as AWK or ICON. Unique identity numbers allow the linking of information from different files.
Some kinds of information have to be computed on-line; wherever necessary, AWK functions have been provided to recover this information. README files specify the details of their use.

A detailed User Guide describing the various kinds of lexical information available is supplied. All sections of this guide are PostScript files, except for some additional notes on the German lexicon, which are in plain ASCII.

\vspace{.25in}
\noindent Item Name: CELEX \\
LDC Catalog No.: LDC94L1 \\
NIST Catalog No.: NA \\
LDC Release date: 4/94 (MY94) \\
Nonmember price: \$150 \\
Special license: YES \\

\subsection{COMLEX: COMmon LEXical Database of English}

This is a three-part project: COMLEX English Syntax, COMLEX English Pronunciation, and COMLEX English Semantics. The first two have resulted in electronic dictionaries, released by LDC as MY94 products and described below. The Semantics project is expected to result, in 1995, in a corpus annotated using WordNet, a public-domain compendium of lexical semantic relations. Annotation of the same corpus using COMLEX Syntax is also planned for 1995.

For a description of WordNet, see George Miller (ed.), ``WordNet: An on-line lexical database,'' {\em International Journal of Lexicography} (special issue), 3(4):235--312, 1990, or George Miller, Claudia Leacock, Randee Tengi, and Ross Bunker, ``A semantic concordance,'' in {\em Proceedings of the Human Language Technology Workshop}, pages 303--308, Princeton, NJ, March 1993.

These products are intended to provide a comprehensive set of lexical resources for research and development in computational linguistics. They will be revised and expanded continuously, with feedback from the community of users, and current members will receive all new versions.

{\em The initial (MY94) versions of the electronic dictionaries are being distributed only by ftp.} Contact LDC for instructions on obtaining the license forms and the dictionaries.
\subsubsection{COMLEX English Syntax}

This is a moderately broad-coverage English lexicon (with about 38,000 lemmas) developed at New York University under LDC sponsorship. It contains detailed information about the syntactic characteristics of each lexical item, and is particularly detailed in its treatment of subcategorization (complement structures). It includes 92 different subcategorization features for verbs, 14 for adjectives, and 9 for nouns. These features distinguish not only the different constituent structures which may appear in a complement, but also the different control features associated with a constituent structure.

Version 0, released in August 1994, is available by ftp to members who sign a license agreement, which is also available on the LDC ftp site.

A reference for the syntax work: Ralph Grishman, Catherine Macleod, and Adam Meyers. Comlex syntax: Building a computational lexicon. To appear in Proc. 15th Int'l Conf. Computational Linguistics ({COLING} 94), Kyoto, Japan, August 1994.

\vspace{.25in}
\noindent Item Name: COMLEX English Syntax Lexicon, Version 0 \\
LDC Catalog No.: LDC94L2 \\
NIST Catalog No.: NA \\
LDC Release date: 6/94 (MY94) \\
Nonmember price: \$10,000 \\
Special license: YES \\

\subsubsection{COMLEX English Pronunciation}

Version 0, released in August 1994, is a 50,000-word pronouncing dictionary of English, including the standard 30,000-word WSJ vocabulary. It is available only by ftp to members who sign a license agreement, which is also available on the LDC ftp site. A more complete description will be available shortly.

\vspace{.25in}
\noindent Item Name: COMLEX Pronouncing Dictionary, Version 0 \\
LDC Catalog No.: LDC94L3 \\
NIST Catalog No.: NA \\
LDC Release date: 6/94 (MY94) \\
Nonmember price: \$10,000 \\
Special license: YES \\

\pagebreak
\end{document}