This page contains extracts from vendor's literature that are pertinent to us.
Q. Is Oracle new to this area?
A. No. Oracle's effort to provide solutions for the problems of
unstructured text started over a decade ago and ConText Option is the
result of an evolutionary development process. After several years of
research and development, Oracle released SQL*TextRetrieval in 1990
and then TextServer3 in 1994. ConText Option, for the first time,
integrates full text management with the Oracle database (Oracle7).
The core of ConText has been developed over the past 20 years. It is a
highly sophisticated natural language processor which uses a knowledge
base consisting of over 1,000,000 words and phrases, related to more
than 200,000 concepts, associated with approximately 2,000 thematic
categories.
Q. How does the text retrieval work?
A. Developers can "hot-upgrade" existing text columns in the database
by creating a text index. This takes just a few PL/SQL commands. Once
the text-index is built, developers can query the text just like any
other column. For example, from SQL*Plus:
"SQL> Select RESUME from EMP where contains (RESUME, 'UNIX') > 0;"
returns the resumes of all employees with UNIX skills.
Users can query on:
- exact word/phrase,
- logical combinations ("and", "or", "accumulate", "minus"),
- wild-card searching (character strings),
- expansions (stem, fuzzy match, phonetic),
- multilingual stemming (Spanish, French, German, Dutch, Italian, and
others),
- proximity searching (words near each other in the text),
- synonyms and related terms searching (thesaurus-based).
Users also can:
- Grade search terms (by giving them different "weights"),
- Limit results (by score, number of hits),
- Use PL/SQL functions to produce search terms.
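As a sketch of what a CONTAINS-style query evaluates behind the scenes, the following Python fragment (an illustration only, not Oracle code; all names are invented) builds a tiny inverted index over a text column and answers "and"/"or" combinations against it:

```python
# Illustrative only -- not Oracle code. A tiny inverted index with
# "and"/"or" evaluation, mimicking what a CONTAINS-style query does.

def build_index(rows):
    """Map each lower-cased word to the set of row ids containing it."""
    index = {}
    for row_id, text in rows.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(row_id)
    return index

def query_and(index, *terms):
    """Row ids matching ALL terms (the "and" operator)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def query_or(index, *terms):
    """Row ids matching ANY term (the "or"/"accumulate" style)."""
    result = set()
    for t in terms:
        result |= index.get(t.lower(), set())
    return result

resumes = {
    1: "Ten years of UNIX and C experience",
    2: "Oracle DBA with PL/SQL background",
    3: "UNIX administrator with Oracle and networking skills",
}
idx = build_index(resumes)
print(sorted(query_and(idx, "UNIX", "Oracle")))  # [3]
print(sorted(query_or(idx, "UNIX", "Oracle")))   # [1, 2, 3]
```

A real text index adds stemming, proximity, and scoring on top of this basic structure, but the lookup principle is the same.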
Q. How does text reduction and text classification work?
A. Using the knowledge base, ConText isolates the significant concepts
in a document. Then through complex processes of comparison and
abstraction in relation to the knowledge base, ConText discovers and
ranks the themes for each paragraph and for the document as a whole.
The text reduction function uses this output to produce either general
or theme-specific summaries (called "Gists") of the document. In
other words, it cuts a document down by selecting the most
representative paragraphs which convey the main ideas and concepts of
the full document. Readers can scan the Gists quickly before (or
instead of) reading the full text and continuing their search.
The text classification feature automatically categorizes documents
according to the main themes of the documents. For example, a document
that mentions interest rates, treasury bills, and money supply would
be classed as a banking or financial document.
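The banking example can be caricatured with a small keyword-scoring sketch. ConText's knowledge base of concepts is of course far richer; the theme vocabularies below are invented purely for illustration:

```python
# Illustrative theme classification: score a document against small,
# invented theme vocabularies. ConText's real knowledge base is far larger.

THEMES = {
    "finance": {"interest", "rates", "treasury", "bills", "money", "supply"},
    "sport":   {"goal", "match", "team", "score", "league"},
}

def classify(text):
    """Return the best-matching theme and its overlap score."""
    words = {w.strip(".,") for w in text.lower().split()}
    scores = {theme: len(words & vocab) for theme, vocab in THEMES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

doc = "The report mentions interest rates, treasury bills and money supply."
print(classify(doc))  # ('finance', 6)
```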
The Text Summariser
The first version of the Text Summariser, from which NetSumm evolved,
was a stand-alone program running on Apple Macintosh computers. The
Summariser program accepts any plain-text document and will
automatically pick out the sentences that it considers make up the
"most important" part of the text. One way in which these sentences
can be presented is by highlighting them in the text. This mimics the
procedure which many people use when faced with a printed document -
using a highlighter pen to mark those parts of the text which are of
interest.
How the Text Summariser displays an article
Alternatively, the program can extract these sentences from the text
to produce an abridgement of the article.
The process is interactive, allowing the user to choose longer or
shorter extracts at will, from a single sentence to the full text.
Typically, one begins with a very short abridgement, to see if the
article is relevant. If so, the length of the abridgement can be
increased to see more details.
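The extract-and-abridge idea can be sketched with a simple word-frequency heuristic. NetSumm's actual sentence-scoring method is not described here, so this is only an illustration of the general technique of extractive summarisation with a user-chosen length:

```python
# A frequency-based sketch of extractive summarising with a user-chosen
# length. NetSumm's real scoring is proprietary; this is only illustrative.

from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "was"}

def summarise(text, n_sentences=1):
    """Return the n highest-scoring sentences, in document order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(
        w for s in sentences for w in s.lower().split() if w not in STOPWORDS
    )
    # Score each sentence by the total frequency of its words.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w] for w in sentences[i].lower().split()),
    )
    chosen = sorted(ranked[:n_sentences])
    return ". ".join(sentences[i] for i in chosen) + "."

article = (
    "Red squirrels are native to Britain. "
    "Red squirrels eat seeds and fungi. "
    "The weather was pleasant."
)
print(summarise(article, 1))  # the shortest abridgement
print(summarise(article, 2))  # the reader asks for more detail
```

Raising `n_sentences` mimics the interactive lengthening described above: the chosen sentences always appear in their original document order.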
The basic technique that lies behind the Summariser is very robust,
giving it the ability to work on any text, independent of subject. In
our evaluations, equally good results have been achieved with articles
on such disparate subjects as semiconductor lasers and red squirrels!
Our evaluations have confirmed that the Summariser works well
regardless of the subject area. However, certain types of article are
more suitable for summarisation than others (this applies whether a
computer or a human is doing the summarisation!). In general, the Text
Summariser works best on factual articles with a single, reasonably
well-defined theme - articles in newspapers, technical journals, etc.
are all good subjects. Two areas which we know are more difficult for the
Summariser are lists and narrative works.
Lists are difficult to summarise (even for people!) because they are
in many ways summaries in themselves. All we can do to summarise them
further is to pick out the most important items, and this requires
much more knowledge and understanding than the Summariser possesses.
Some lists are obvious - when they are introduced by bullet points,
for example. However, there are also some more subtle forms of list.
Some articles, particularly in legal and medical fields, consist
largely of a series of example cases - effectively a list. Summaries
of this type of article are generally less effective than average.
In the context of the WWW, this means that home pages and index pages
which consist of a list of other pages are not good candidates for
summarisation.
Even if the contents of the text are not a list, HTML listing tags
may have been used to format the whole text for the WWW in a certain
way. View the source and look for initial use of e.g. <ul>,
<ol>, <menu> or <dir>, and then repeated use throughout
the text of e.g. <li> or <dl>/<dt> tags.
In this case also we do not yet provide a summary.
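The source-inspection step just described might look like the following sketch. The tag names come from the text above, while the function name and the item threshold are invented for illustration:

```python
# A sketch of the screening step: look for list markup in the raw HTML
# before attempting a summary. Tag names are from the text above; the
# function name and item threshold are invented for illustration.

import re

LIST_OPENERS = ("<ul", "<ol", "<menu", "<dir")

def looks_like_list(html, min_items=5):
    """True if the page opens with a list tag and is dominated by items."""
    head = html[:500].lower()
    opens_with_list = any(tag in head for tag in LIST_OPENERS)
    items = len(re.findall(r"<li|<dt", html.lower()))
    return opens_with_list and items >= min_items

index_page = "<ul>" + "".join(
    f"<li><a href='{i}.html'>Page {i}</a>" for i in range(8)
) + "</ul>"
print(looks_like_list(index_page))            # True -> skip summarisation
print(looks_like_list("<p>Plain prose</p>"))  # False
```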
Because the Summariser needs to work out what an article is "about",
we do not expect it to work well on narrative works, where the topic
changes as events unfold and characters come and go. This means that
the answer to the frequently-asked question: "Can you summarise the
Complete Works of Shakespeare down to half a page?" is No!
We don't see this as an important limitation of the Summariser, since
narrative works are (usually) read in full, for pleasure.
Now that the information revolution is in full swing, electronic documents
are the principal media of business information. Companies are producing new
electronic documents at a breakneck pace. Despite advances in software systems,
however, companies continue to be unable to effectively utilize the information
captured in oceans of on-line documents. The problem is not finding documents,
but finding the right document and finding the information in it. The Xerox
Linguistic Technologies (XLT) provide a collection of advanced natural language
processing components that give software the power to tap into the information
in documents. XLT's components include state-of-the-art software for automatic
document summarization, information extraction and morphological analysis that
can improve a broad spectrum of applications ranging from full-text search
engines to handwriting recognition. By partnering with XSoft, software companies
can differentiate their products from their competition with leading edge
linguistic technology from the world-renowned Xerox research centers in Palo
Alto, California (PARC) and Grenoble, France (RXRC).
Working with document content means
working with language. XLT's linguistic transducers include advanced language
descriptions which enable highly accurate document analysis and word morphology.
Developed using state-of-the-art linguistic methodologies pioneered by Xerox
research linguist Lauri Karttunen, XLT transducers in eight languages offer
benchmark accuracy, performance and human-language portability.
Fast, Compact Design
In the early 1980s, PARC scientists Ron Kaplan and Martin Kay developed
a system for efficient compression of linguistic transducers. This compression
enables XLT to hold more than 500,000 English words in under 250K of storage
(2 words per byte) and to hold more than 5.7 million French words in under 250K
(20 words per byte!). By making use of this compression technology, XLT can
both save space in memory and on hard disks
and outperform traditional dictionary-based software.
XLT provides a full range of advanced linguistic capabilities not available
in the components offered by other vendors. The XSoft XLT suite includes:
Tokenizing allows you to work with documents by first separating them into
sentences and individual words. While English punctuation and spacing often
provide a good indication of word and sentence boundaries, special cases such as
contractions (isn't or y'all), possessives (Bart's), and abbreviations (Inc.)
can make accurate tokenization a non-trivial task. Additionally, documents in
languages that do not place spaces between words, such as Japanese, require very
sophisticated analysis for correct tokenization.
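The special cases listed above can be illustrated with a small regular-expression tokenizer. This is a rough sketch of the problem, not XLT's algorithm:

```python
# A rough sketch (not XLT's algorithm) of tokenization beyond splitting
# on spaces: keep contractions, possessives, and abbreviations whole.

import re

# A word, optionally followed by an abbreviation period ("Inc.") or an
# apostrophe suffix ("isn't", "Bart's", "y'all").
TOKEN_RE = re.compile(r"[A-Za-z]+(?:\.[A-Za-z.]*|'[A-Za-z]+)?")

def tokenize(sentence):
    return TOKEN_RE.findall(sentence)

print(tokenize("Bart's dog isn't at Acme Inc. today"))
```

Even this pattern cannot tell a sentence-final period from an abbreviation period, which is exactly why accurate tokenization is a non-trivial task.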
Stemming enables your software to identify all possible root forms of a
word, providing for more efficient full-text indexes. For instance, when a user
inputs swam, stemming will deliver swim. Unlike non-linguistic methods,
such as wildcarding or tail-chopping, which will degrade the accuracy of
text-based information retrieval, XLT generates only linguistically correct
stems. For example, a tail-chopping system would match dining with din. XLT
recognizes the correct word root as dine.
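The dining/din/dine contrast can be made concrete with a sketch. The tiny lexicon below stands in for XLT's full morphological lexicon and is purely illustrative:

```python
# The dining/din/dine contrast: crude tail-chopping versus lexicon-based
# stemming. The tiny lexicon stands in for a full morphological lexicon.

LEXICON = {"dining": "dine", "dined": "dine", "swam": "swim", "swims": "swim"}

def tail_chop(word, suffixes=("ing", "ed", "s")):
    """Naive suffix stripping -- can produce non-words."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def stem(word):
    """Lexicon lookup -- returns only linguistically correct roots."""
    return LEXICON.get(word, word)

print(tail_chop("dining"))  # 'din' -- not a real root
print(stem("dining"))       # 'dine'
print(stem("swam"))         # 'swim' -- irregular past handled
```

Note that no suffix rule can recover swim from swam; only lexical knowledge handles irregular forms.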
Morphological Analysis is a more advanced form of stemming that identifies
the grammatical features of a word in addition to its root forms. Morphological
analysis will show, for example, that the root word ground can be considered as
a noun, an adjective, or a verb. Morphological analysis provides value to
applications where the grammatical features of a word are important.
Tagging builds on morphological analysis by choosing the part-of-speech
category for a group of words by examining them in context within a
sentence. For
example, the grammatical category of the word ground is different in each of the
following contexts:
- Falafel is made from ground (adjective) chickpeas.
- It is safer to stand on the ground (noun) than on the table.
- I ground (verb) the coffee beans.
Tagging is especially important for applications, such as language
translation, where proper grammatical understanding is critical, and in
situations where searches based on a grammatical aspect of a sentence are
desired (such as noun phrase identification searches).
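A toy sketch of context-based disambiguation for the three "ground" examples. Real taggers use statistical or rule-based models trained on corpora; these hand-written rules and mini-lexicons are only an illustration:

```python
# Hand-written toy rules (not XLT's tagger) choosing a part of speech
# for "ground" from its neighbours in the three examples above.

PRONOUNS = {"i", "we", "you", "they", "she", "he"}
NOUNS_AFTER = {"chickpeas", "beef", "coffee"}  # invented mini-lexicon

def tag_ground(sentence):
    words = sentence.lower().rstrip(".").split()
    i = words.index("ground")
    prev = words[i - 1] if i > 0 else ""
    nxt = words[i + 1] if i + 1 < len(words) else ""
    if prev in PRONOUNS:
        return "verb"        # "I ground the coffee beans"
    if nxt in NOUNS_AFTER:
        return "adjective"   # "ground chickpeas"
    return "noun"            # "stand on the ground"

print(tag_ground("Falafel is made from ground chickpeas"))
print(tag_ground("It is safer to stand on the ground than on the table"))
print(tag_ground("I ground the coffee beans"))
```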
Morphological inflection and generation can provide application software
with the linguistic knowledge necessary to expand a limited vocabulary
for computer-generated phrases. The inverse of stemming and analysis,
these processes convert root forms
into inflected forms, such as turning can into could or peux into puisse.
Summarization can add additional document analysis capabilities to your
application. The XLT Summarizer automatically examines the content of a document
in real-time to identify the document's key phrases and extract sentences to
form an indicative summary, either by highlighting excerpts within a document or creating
a bulleted list of the document's key phrases.
Language identification enables your application to determine a document's
language and character set encoding, an essential step when sorting documents
for automated processing.
Full Text Search and Information Retrieval
Full-text search and information retrieval engines rely on highly
accurate relevance assessment capabilities to sort
search results. Stemming, tagging, morphological analysis, and summarization can
all aid in searching and improve relevance evaluation.
Full text indexes can be very large and expensive to operate. Reducing the
number of unique words in the index by storing only stems can reduce the size of
an index by up to 40% when combined with other compression techniques.
Restricting an index to only nouns (identified by tagging) or summary key
phrases reduces the index further. Similarly, using XLT language identification
to filter out documents not in the language of interest to the index can further
reduce its size.
Automatic Document Highlighting and Summarization
Today's business environment is in "information overload." Email, reports, and memos
are just a few examples of the documents that need reading, understanding, and
often, action. Highlighting and summarization provide a snapshot of a document's
key content. The position of certain sentences, certain phrases, and frequently
used content words are just a few of the features used in targeting highlights
or excerpts. Highlighting and summarization can greatly simplify the sorting and
prioritization of email, assist in searches based on document content, or
provide prereading or summaries of large documents.
Fuzzy Word Matching for OCR
Even with the increasingly large
number of documents being prepared on word processors, there is still a great
need for OCR. The traditional problems with OCR have been speed and accuracy.
Considerable human editing is often required to correct the resulting on-line
text. For example, the word above may appear as ahove following the scanning and
OCR of the hard copy document. The XLT morphological analyzer has considerable
linguistic knowledge and knows, in a sense, what words are possible in a
particular language and how they are spelled. Running in parallel with the OCR
process, the morphological analyzer can help judge the correctness of whole
words, and within words can provide letter-by-letter guidance to help resolve
ambiguities in recognition. Using the above example, the XLT morphological
analyzer eliminates ahove as a possible word and correctly identifies it as
above. Similar applications of the Xerox Linguistic Technologies would include
handwriting and speech recognition.
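The ahove/above scenario can be sketched as dictionary-guided correction over visually confusable letters. The confusion table and word list below are invented illustrations; the real XLT analyzer works letter-by-letter with full morphological knowledge:

```python
# Dictionary-guided OCR correction, as sketched in the text: reject
# non-words and try visually confusable letter swaps. The confusion
# table and dictionary here are invented illustrations.

DICTIONARY = {"above", "about", "shove", "move"}
CONFUSABLE = {"h": "b", "b": "h"}  # e.g. 'h' misread for 'b'

def correct(ocr_word):
    if ocr_word in DICTIONARY:
        return ocr_word
    for i, ch in enumerate(ocr_word):
        if ch in CONFUSABLE:
            candidate = ocr_word[:i] + CONFUSABLE[ch] + ocr_word[i + 1:]
            if candidate in DICTIONARY:
                return candidate
    return ocr_word  # unresolved: flag for human review

print(correct("ahove"))  # 'above'
```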
Natural Language Input
Traditionally, commands used in
operating systems, text retrieval, database queries, and similar applications
have been formal and cryptic, designed to be easily interpreted by the software.
Human end-users were forced to deal with the computer on the computer's
terms. This created a barrier for professionals who often needed to
extract vital information from a computer but had no time to learn
complex, computer-oriented command languages.
One of the keys to making computers easier to use is to make them deal with
end users in natural language, or in human terms. In an increasingly competitive
hardware market, where any product can be upstaged with a bit more memory or a
slightly faster processor, many manufacturers are now trying to differentiate
their products with software that is easier to use. XLT products are the
foundation for improving user interfaces with natural-language
capabilities, and can provide the first steps leading to full-scale
syntactic and semantic analysis.
The Most Intelligent Search Tool Ever Invented
Let's say you're looking for information about dinosaurs. Use just any
search engine and you'll get page after page of vague summaries and
dead links to plow through before you turn up anything useful. And
then, since no single search engine knows every site on the Web,
you'll have to repeat your search over and over again.
Now try looking for dinosaurs with WebCompass. First, it searches all
the top search engines at once. Then it visits every site it finds,
noting dead URLs and useless links along the way. You can even
narrow or expand your search instantly by adding and linking topics
like fossils, Jurassic, or Tyrannosaurus.
Once WebCompass finds promising sites, it uses artificial intelligence
to pull keywords from those sites and build comprehensive summaries,
so you can zero in on Brontosaurus, not Barney. It even analyzes
and ranks each site for you in order of relevancy. And once you've
tracked down the exact beast you're searching for, you can surf over
to it with a simple double-click!
Best of all, you can set WebCompass to remember your favorite topics
or scan for new information about a particular topic (say, new
Mastodon bone discoveries), then automatically check for any new
information as often as you choose. And it'll perform its work in the
background, while you do yours!
Searches The Whole Internet
Out of the box, WebCompass 2.0 accesses over 35 search resources. If
you want to search more, just add additional search resources. You can
even add resources to search Intranets.
But WebCompass doesn't just search the Web. Because it searches via so
many engines, it accesses Usenet, FTP, and Gopher sites as well.
Creates Comprehensive Summaries On The Fly
Once WebCompass finds a hit, it analyzes the entire document to create
a comprehensive summary. It does this by noting how often keywords are
used and analyzing the context in which the keywords are used.
Find The Most Relevant Information At A Glance
Not only does WebCompass find the information you're looking for, it
also ranks it so you can find the most important information at a
glance. Hits are given a relevancy ranking between 1 and 100. The
higher the number, the more relevant it is.
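A hypothetical sketch of such a 1-to-100 score (WebCompass's actual ranking formula is not published): rate a page by the fraction of query keywords it contains, scaled to the stated range. The function name and scoring rule are invented for illustration:

```python
# A hypothetical 1-100 relevancy score (WebCompass's real formula is
# unpublished): the fraction of query keywords present, scaled to 1-100.

def relevancy(query_terms, document):
    words = set(document.lower().split())
    hits = sum(1 for term in query_terms if term.lower() in words)
    return max(1, round(100 * hits / len(query_terms)))

doc = "fossil finds suggest the tyrannosaurus was a fast dinosaur"
print(relevancy(["dinosaur", "tyrannosaurus", "jurassic"], doc))  # 67
```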
Organizes Your Search Results By Topics
WebCompass organizes search results into topic folders, and creates
sub-topics. For example, you can have a topic "Jazz" and create
subtopics like "Jazz Artists" and "Jazz Festivals." Just drag and
drop URLs from one topic folder to another to manage your information.
Automatically Updates Your Results
You can tell WebCompass to check for new information about topics on a
regular basis. Updated information will automatically arrive on your
PC while you're doing other work!
Features and Benefits
* Comes configured to search 35 search engines; easily add new ones
at any time.
* Results of searches ranked for relevance on 1 to 100 scale.
* Builds comprehensive summaries of results gathered from searching.
* Automatically organizes results by topic.
* Automatically updates results.
* Uses search results to fine tune further searches.
* Includes customizable relational database of search topics.
* Includes thesaurus to assist in searching.