This page contains extracts from vendor's literature that are pertinent to us.


Q. Is Oracle new to this area?

A. No. Oracle's effort to provide solutions for the problems of unstructured text started over a decade ago and ConText Option is the result of an evolutionary development process. After several years of research and development, Oracle released SQL*TextRetrieval in 1990 and then TextServer3 in 1994. ConText Option, for the first time, integrates full text management with the Oracle database (Oracle7 release 7.3).

The core of ConText has been developed over the past 20 years. It is a highly sophisticated natural language processor which uses a knowledge base consisting of over 1,000,000 words and phrases, related to more than 200,000 concepts, associated with approximately 2,000 thematic categories.


Q. How does the text retrieval work?

A. Developers can "hot-upgrade" existing text columns in the database by creating a text index. This takes just a few PL/SQL commands. Once the text-index is built, developers can query the text just like any other column. For example, from SQL*Plus:

"SQL> Select RESUME from EMP where contains (RESUME, 'UNIX') > 0;"

returns the resumes of all employees with UNIX skills.



Q. How does text reduction and text classification work?

A. Using the knowledge base, ConText isolates the significant concepts in a document. Then through complex processes of comparison and abstraction in relation to the knowledge base, ConText discovers and ranks the themes for each paragraph and for the document as a whole.

The text reduction function uses this output to produce either general or theme-specific summaries (called "Gists") of the document. In other words, it cuts a document down by selecting the most representative paragraphs which convey the main ideas and concepts of the full document. Readers can scan the Gists quickly before (or instead of) reading the full text and continuing their search.

The text classification feature automatically categorizes documents according to the main themes of the documents. For example, a document that mentions interest rates, treasury bills, and money supply would be classed as a banking or financial document.

More Information


The Text Summariser

The first version of the Text Summariser, from which NetSumm evolved, was a stand-alone program running on Apple Macintosh computers. The Summariser program accepts any plain-text document and will automatically pick out the sentences that it considers make up the "most important" part of the text. One way in which these sentences can be presented is by highlighting them in the text. This mimics the procedure which many people use when faced with a printed document - using a highlighter pen to mark those parts of the text which are of particular interest.

How the Text Summariser displays an article

Alternatively, the program can extract these sentences from the text to produce an abridgement of the article.

The process is interactive, allowing the user to choose longer or shorter extracts at will, from a single sentence to the full text. Typically, one begins with a very short abridgement, to see if the article is relevant. If so, the length of the abridgement can be increased to see more details.

The basic technique that lies behind the Summariser is very robust, giving it the ability to work on any text, independent of subject. In our evaluations, equally good results have been achieved with articles on such disparate subjects as semiconductor lasers and red squirrels!

Our evaluations have confirmed that the Summariser works well regardless of the subject area. However, certain types of article are more suitable for summarisation than others (this applies whether a computer or a human is doing the summarising!). In general, the Text Summariser works best on factual articles with a single, reasonably well-defined theme - articles in newspapers, technical journals, etc. are all good subjects. Two areas which we know are more difficult for the Summariser are lists and narrative works.


Lists are difficult to summarise (even for people!) because they are in many ways summaries in themselves. All we can do to summarise them further is to pick out the most important items, and this requires much more knowledge and understanding than the Summariser possesses.

Some lists are obvious - when they are introduced by bullet points, for example. However, there are also some more subtle forms of list. Some articles, particularly in legal and medical fields, consist largely of a series of example cases - effectively a list. Summaries of this type of article are generally less effective than average.

In the context of the WWW, this means that home pages and index pages which consist of a list of other pages are not good candidates for summarisation.

Even if the contents of the text are not a list, HTML list tags may have been used to format the whole text for the WWW in a certain way. View the source and look for initial use of e.g. <ul>, <ol>, <menu> or <dir> tags, and then repeated use throughout the text of e.g. <li> or <dl>/<dt> tags. In this case also we do not yet provide a summary.

Narrative works

Because the Summariser needs to work out what an article is "about", we do not expect it to work well on narrative works, where the topic changes as events unfold and characters come and go. This means that the answer to the frequently-asked question: "Can you summarise the Complete Works of Shakespeare down to half a page?" is No!

We don't see this as an important limitation of the Summariser, since narrative works are (usually) read in full, for pleasure.


Now that the information revolution is in full swing, electronic documents are the principal media of business information. Companies are producing new electronic documents at a breakneck pace. Despite advances in software systems, however, companies are still unable to make effective use of the information captured in oceans of on-line documents. The problem is not finding documents, but finding the right document and finding the information in it.

The Xerox Linguistic Technologies (XLT) provide a collection of advanced natural language processing components that give software the power to tap into the information in documents. XLT's components include state-of-the-art software for automatic document summarization, information extraction and morphological analysis that can improve a broad spectrum of applications ranging from full-text search engines to handwriting recognition. By partnering with XSoft, software companies can differentiate their products from the competition with leading-edge linguistic technology from the world-renowned Xerox research centers in Palo Alto, California (PARC) and Grenoble, France (RXRC).


Linguistically Accurate
Working with document content means working with language. XLT's linguistic transducers include advanced language descriptions which enable highly accurate document analysis and word morphology. Developed using state-of-the-art linguistic methodologies pioneered by Xerox research linguist Lauri Karttunen, XLT transducers in eight languages offer benchmark accuracy, performance and human-language portability.

Fast, Compact Design
In the early 1980s, PARC scientists Ron Kaplan and Martin Kay developed a system for efficient compression of linguistic transducers. This compression enables XLT to hold more than 500,000 English words in under 250K of storage (2 words per byte) and to hold more than 5.7 million French words in under 250K (20 words per byte!). By making use of this compression technology, XLT can both save space in memory and on hard disks and outperform traditional dictionary-based software.


XLT provides a full range of advanced linguistic capabilities not available in the components offered by other vendors. The XSoft XLT suite includes:

Tokenizing allows you to work with documents by first separating them into sentences and individual words. While English punctuation and spacing often provide a good indication of word and sentence boundaries, special cases such as contractions (isn't or y'all), possessives (Bart's), and abbreviations (Inc.) can make accurate tokenization a non-trivial task. Additionally, documents in languages that do not place spaces between words, such as Japanese, require very sophisticated analysis for correct tokenization.
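The special cases above can be illustrated with a minimal sketch. This is not XLT's tokenizer: the abbreviation list and the regular-expression rules are simplified assumptions for illustration only.

```python
import re

# Illustrative abbreviation list; a real tokenizer would use a much
# larger lexicon (this set is an assumption, not XLT's).
ABBREVIATIONS = {"Inc.", "Dr.", "Mr.", "etc."}

TOKEN_PATTERN = re.compile(
    r"[A-Za-z]+\.(?=\s|$)"      # candidate abbreviations like "Inc."
    r"|[A-Za-z]+'[A-Za-z]+"     # contractions/possessives: isn't, Bart's
    r"|[A-Za-z]+"               # plain words
    r"|[.,!?;]"                 # sentence punctuation
)

def tokenize(text):
    tokens = []
    for tok in TOKEN_PATTERN.findall(text):
        # Keep known abbreviations whole; otherwise split a trailing
        # period off so it can serve as a sentence boundary.
        if tok.endswith(".") and tok not in ABBREVIATIONS:
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens
```

On an input such as "Bart's friends at Acme Inc. aren't here.", this keeps "Bart's", "Inc." and "aren't" as single tokens while still splitting the sentence-final period - exactly the distinctions the paragraph above describes as non-trivial.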

Stemming enables your software to identify all possible root forms of a word, providing for more efficient full-text indexes. For instance, when a user inputs swam, stemming will deliver swim. Unlike non-linguistic methods, such as wildcarding or tail-chopping, which will degrade the accuracy of text-based information retrieval, XLT generates only linguistically correct stems. For example, a tail-chopping system would match dining with din. XLT recognizes the correct word root as dine.
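The contrast between the two approaches can be sketched in a few lines. The exception table and suffix rule below are toy assumptions, not XLT's lexicon; they exist only to show why tail-chopping conflates words a linguistic stemmer keeps apart.

```python
# Toy exception table for irregular forms (an illustrative assumption).
IRREGULAR = {"swam": "swim", "swum": "swim", "ran": "run", "mice": "mouse"}

def linguistic_stem(word):
    """Toy linguistic stemmer: exception lookup, then suffix rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ing") and len(w) > 4:
        # Restore the silent 'e' dropped before -ing (dining -> dine).
        # A real analyser consults its lexicon instead of this toy rule.
        return w[:-3] + "e"
    if w.endswith("s") and len(w) > 3:
        return w[:-1]
    return w

def tail_chop(word, keep=3):
    # Non-linguistic truncation: conflates "dining" with "dinner".
    return word.lower()[:keep]
```

Here linguistic_stem("swam") yields swim and linguistic_stem("dining") yields dine, while tail_chop maps both "dining" and "dinner" to the meaningless fragment "din" - the retrieval-accuracy problem the paragraph describes.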

Morphological Analysis is a more advanced form of stemming that identifies the grammatical features of a word in addition to its root forms. Morphological analysis will show, for example, that the root word ground can be considered as a noun, an adjective, or a verb. Morphological analysis provides value to applications where the grammatical features of a word are important.

Tagging builds on morphological analysis by choosing the part-of-speech category for a group of words by examining them in the context of a sentence. For example, the grammatical category of the word ground is different depending on the sentence in which it appears.

Tagging is especially important for applications, such as language translation, where proper grammatical understanding is critical, and in situations where searches based on a grammatical aspect of a sentence are desired (such as noun phrase identification searches).
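A minimal rule-based sketch shows how neighbouring words disambiguate "ground". The rules and example sentences are illustrative assumptions only; a real tagger uses statistical models over full morphological analyses.

```python
# Tiny context windows for disambiguation (illustrative assumptions).
MODALS_AND_TO = {"to", "will", "can", "may", "would"}
DETERMINERS = {"the", "a", "an", "this"}
COMPOUND_HEADS = {"floor", "crew", "rules"}

def tag_ground(words):
    """Return a part-of-speech guess for 'ground' in one sentence."""
    words = [w.lower().strip(".,") for w in words]
    i = words.index("ground")
    prev = words[i - 1] if i > 0 else ""
    nxt = words[i + 1] if i + 1 < len(words) else ""
    if prev in MODALS_AND_TO:
        return "verb"        # e.g. "to ground the wire"
    if prev in DETERMINERS and nxt in COMPOUND_HEADS:
        return "adjective"   # e.g. "the ground floor"
    return "noun"            # e.g. "the ground is wet"
```

With these rules, "to ground the wire" tags ground as a verb, "the ground floor" as an adjective, and "the ground is wet" as a noun - the three readings mentioned above.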

Morphological inflection and generation can provide application software with the linguistic knowledge necessary to expand a limited vocabulary for computer-generated phrases. The inverse of stemming and analysis, these processes convert root forms into inflected forms, such as turning can into could or peux into puisse.

Summarization can add additional document analysis capabilities to your application. The XLT Summarizer automatically examines the content of a document in real-time to identify the document's key phrases and extract sentences to form an indicative summary, either by highlighting excerpts within a document or creating a bulleted list of the document's key phrases.
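A toy indicative summariser in the spirit described above can be sketched as follows. It scores each sentence by the document-wide frequency of its content words and extracts the top scorers; the stopword list and scoring are simplified assumptions, not the XLT Summarizer's actual method.

```python
import re
from collections import Counter

# A tiny stopword list (an illustrative assumption).
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "it", "on"}

def summarize(text, n_sentences=1):
    """Extract the n highest-scoring sentences, in original order.
    A sentence's score is the mean document frequency of its content words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        toks = [w for w in re.findall(r"[a-z']+", sentence.lower())
                if w not in STOPWORDS]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in ranked]
```

Given "Lasers emit coherent light. Lasers need a gain medium. The weather was nice today.", the off-topic weather sentence scores lowest, so a one-sentence summary stays on the document's theme.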

Language identification enables your application to determine a document's language and character set encoding, an essential step when sorting documents for automated processing.
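One simple way to guess a document's language is to compare its words against per-language profiles of very common function words. The profiles below are tiny illustrative assumptions; production systems (XLT presumably included) use much richer statistics such as character n-grams.

```python
# Minimal stopword profiles per language (illustrative assumptions).
PROFILES = {
    "english": {"the", "and", "of", "to", "in", "is", "that"},
    "french": {"le", "la", "les", "et", "de", "un", "une", "est"},
}

def identify_language(text):
    """Return the profile language sharing the most words with the text."""
    words = set(text.lower().split())
    scores = {lang: len(words & profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)
```

For example, "le chat est sur la table" matches three French profile words and no English ones, so it is classed as French.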


Full Text Search and Information Retrieval
Full-text search engines rely on having highly accurate relevance assessment capabilities to sort search results. Stemming, tagging, morphological analysis, and summarization can all aid in searching and improve relevance evaluation.

Full text indexes can be very large and expensive to operate. Reducing the number of unique words in the index by storing only stems can reduce the size of an index by up to 40% when combined with other compression techniques. Restricting an index to only nouns (identified by tagging) or summary key phrases reduces the index further. Similarly, using XLT language identification to filter out documents not in the language of interest to the index can further reduce its size.
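The vocabulary reduction from stemming can be demonstrated concretely. The stem table here is a toy stand-in for a real morphological analyser, so the exact percentage is illustrative, not the "up to 40%" figure quoted above.

```python
# Toy stem table (an illustrative assumption, not a real analyser).
STEMS = {"swims": "swim", "swam": "swim", "swimming": "swim",
         "dines": "dine", "dined": "dine", "dining": "dine"}

def vocabulary(words, use_stems=False):
    """Unique index terms, optionally collapsing inflected forms to stems."""
    return {STEMS.get(w, w) for w in words} if use_stems else set(words)

terms = ["swims", "swam", "swimming", "dines", "dined", "dining", "fish"]
full = vocabulary(terms)           # 7 distinct index terms
stemmed = vocabulary(terms, True)  # 3 distinct index terms
reduction = 1 - len(stemmed) / len(full)
```

In this toy corpus the stemmed index holds 3 terms instead of 7; real corpora show smaller but still substantial reductions.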

Automatic Document Highlighting and Summarization
Today's work environment is in "information overload." Email, reports, and memos are just a few examples of the documents that need reading, understanding, and often, action. Highlighting and summarization provide a snapshot of a document's key content. The position of certain sentences, certain phrases, and frequently used content words are just a few of the features used in targeting highlights or excerpts. Highlighting and summarization can greatly simplify the sorting and prioritization of email, assist in searches based on document content, or provide previews or summaries of large documents.

Fuzzy Word Matching for OCR
Even with the increasingly large number of documents being prepared on word processors, there is still a great need for OCR. The traditional problems with OCR have been speed and accuracy. Considerable human editing is often required to correct the resulting on-line text. For example, the word above may appear as ahove following the scanning and OCR of the hard copy document. The XLT morphological analyzer has considerable linguistic knowledge and knows, in a sense, what words are possible in a particular language and how they are spelled. Running in parallel with the OCR process, the morphological analyzer can help judge the correctness of whole words, and within words can provide letter-by-letter guidance to help resolve ambiguities in recognition. Using the above example, the XLT morphological analyzer eliminates ahove as a possible word and correctly identifies it as above. Similar applications of the Xerox Linguistic Technologies would include handwriting and speech recognition.
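The lexicon-guided correction step can be approximated with a fuzzy match of each OCR output against a word list. This sketch uses Python's standard difflib rather than XLT's morphological analyser, and the tiny lexicon is an illustrative assumption.

```python
import difflib

# A tiny stand-in lexicon; a morphological analyser effectively knows
# every possible word of the language (this list is an assumption).
LEXICON = ["above", "about", "abode", "ahoy", "aboard"]

def correct(word, lexicon=LEXICON):
    """Return the closest lexicon entry to an OCR output, or the word
    unchanged if nothing in the lexicon is similar enough."""
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=0.6)
    return matches[0] if matches else word
```

Applied to the paragraph's example, correct("ahove") returns "above": "ahove" is not a possible English word, and "above" is its closest real neighbour in the lexicon.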

Natural Language Input
Traditionally, commands used in operating systems, text retrieval, database queries, and similar applications have been formal and cryptic, designed to be easily interpreted by the software. Human end-users were forced to deal with the computer on the computer's terms. This created a barrier for professionals who often needed to extract vital information from a computer but had no time to learn complex, computer-oriented command languages.

One of the keys to making computers easier to use is to make them deal with end users in natural language, or in human terms. In an increasingly competitive hardware market, where any product can be upstaged with a bit more memory or a slightly faster processor, many manufacturers are now trying to differentiate their products with software that is easier to use. XLT products are the foundation for improving user interfaces with natural-language capabilities, and can provide the first steps leading to full-scale syntactic and semantic processing.


The Most Intelligent Search Tool Ever Invented

Let's say you're looking for information about dinosaurs. Use just any search engine and you'll get page after page of vague summaries and dead links to plow through before you turn up anything useful. And then, since no single search engine knows every site on the Web, you'll have to repeat your search over and over again.

Now try looking for dinosaurs with WebCompass. First, it searches all the top search engines at once. Then it visits every site it finds, noting dead URLs and useless links along the way. You can even narrow or expand your search instantly by adding and linking topics like fossils, Jurassic, or Tyrannosaurus.

Once WebCompass finds promising sites, it uses artificial intelligence to pull keywords from those sites and build comprehensive summaries, so you can zero in on Brontosaurus, not Barney. It even analyzes and ranks each site for you in order of relevancy. And once you've tracked down the exact beast you're searching for, you can surf over to it with a simple double-click!

Best of all, you can set WebCompass to remember your favorite topics or scan for new information about a particular topic (say, new mastodon bone discoveries), then automatically check for any new information as often as you choose. And it'll perform its work in the background, while you do yours!

Searches The Whole Internet

Out of the box, WebCompass 2.0 accesses over 35 search resources. If you want to search more, just add additional search resources. You can even add resources to search Intranets.

But WebCompass doesn't just search the Web. Because it searches via so many engines, it accesses Usenet, FTP, and gopher sites as well - the whole Internet!

Creates Comprehensive Summaries On The Fly

Once WebCompass finds a hit, it analyzes the entire document to create a comprehensive summary. It does this by noting how often keywords are used and analyzing the context in which the keywords are used.

Find The Most Relevant Information At A Glance

Not only does WebCompass find the information you're looking for, it also ranks it so you can find the most important information at a glance. Hits are given a relevancy ranking between 1 and 100. The higher the number, the more relevant the hit.

Organizes Your Search Results By Topics

WebCompass organizes search results into topic folders, and creates sub-topics. For example, you can have a topic Jazz and create subtopics, like Jazz Artists and Jazz Festivals. Just drag and drop URLs from one topic folder to another to manage your information.

Automatically Updates Your Results.

You can tell WebCompass to check for new information about topics on a regular basis. Updated information will automatically arrive on your PC while you're doing other work!

Features and Benefits

* Comes configured to search 35 search engines; easily add new ones at any time.
* Results of searches ranked for relevance on a 1 to 100 scale.
* Builds comprehensive summaries of results gathered from searching.
* Automatically organizes results by topic.
* Automatically updates results.
* Uses search results to fine-tune further searches.
* Includes customizable relational database of search topics.
* Includes thesaurus to assist in searching.