Discovery Natural Language Processing Technology

How Discovery can be used

Discovery's sentence analysis capability could be applied wherever extensive sentence analysis is needed: as a front end to relational database management systems, in software help, statistical analysis, improved website navigation, e-learning and improved speech recognition. For now, though, we are focusing primarily on two applications: knowledge management and language translation.

Knowledge management

Knowledge management is defined as a method of codifying what employees, suppliers, business partners and customers know, and then sharing that knowledge with employees and other companies to devise best practices. In a broader sense, it is a way for any group to improve the creation, retention, sharing and reuse of its knowledge, insights and intellectual assets. With conventional methods, such as in-person discussions, email exchanges and forums, that knowledge often gets lost.

In contrast, ordinary information management systems manage only a specific range of data, depending on their purpose. New data can be added to such systems, but unless their developers consistently produce updates, the kinds of information they were designed to store are fixed and unalterable.

Discovery breaks through this information "straitjacket." Because Discovery can analyze English sentences without restriction, we would like to offer this capability as a feature for knowledge management systems already on the market and in use. Discovery presents an even better, more natural approach to knowledge management: a system which can collect and manage the content of any body of knowledge expressed in English, about virtually any subject matter, and allow users to retrieve that knowledge simply by asking the system questions in English. In effect, one could conduct knowledge management by having a chat- or messenger-like conversation with the system, much as one would with a human being.

Such an approach would enable members of a group to share knowledge that conventional information management systems were never designed to manage. And it would require no special skills beyond the ability to type and speak English.

To construct this knowledge management application, two main problems must be solved:

1. A method of database design which can accommodate the storage and management of sentences of any grammatical structure (in this case, the content of the hierarchical diagrams Discovery generates to diagram sentences).

To accommodate the vast range of English sentence structure, we have devised a special type of relational database with multiple key values, in contrast to the conventional method of unique key values, similar to surrogate keys. Each of these keys represents a path of tables that lets the system store the content of any sentence, no matter what combination of grammatical elements it uses. Since these keys are maintained in a fixed list within the system and their values never change, data integrity in this model is maintained.

2. A method of semantic representation, or logic, which enables the system to maintain the logical consistency of the information it captures. It would apply even when the information being stored is logically contradictory. It would also prevent the entry of information judged to be redundant, that is, information easily deduced by the logic the system applies.
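The first problem above, storing sentences of any structure under a fixed list of keys, might be sketched as follows. This is only one possible reading of the design, not Discovery's actual schema: the path names, the table names path_key and constituent, and the helper functions are all hypothetical, and SQLite is used purely for illustration.

```python
import sqlite3

# Hypothetical fixed list of path keys. Each key names one route through the
# grammatical hierarchy; the real system's keys and tables are not shown in
# this document, so these values are assumptions.
PATH_KEYS = {
    1: "sentence/subject/noun",
    2: "sentence/predicate/verb",
    3: "sentence/predicate/object/noun",
    4: "sentence/predicate/prepositional_phrase/noun",
}


def open_store():
    """Create an in-memory database seeded with the fixed key list."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE path_key (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE constituent ("
        " sentence_id INTEGER NOT NULL,"
        " path_key_id INTEGER NOT NULL REFERENCES path_key(id),"
        " word TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO path_key VALUES (?, ?)", PATH_KEYS.items())
    return conn


def store_sentence(conn, sentence_id, parts):
    """Store one sentence; parts maps a path string to the word found there."""
    lookup = {path: key for key, path in PATH_KEYS.items()}
    rows = [(sentence_id, lookup[path], word) for path, word in parts.items()]
    conn.executemany("INSERT INTO constituent VALUES (?, ?, ?)", rows)


def words_at(conn, path):
    """Return the words stored at a given path, in sentence order."""
    cur = conn.execute(
        "SELECT c.word FROM constituent c"
        " JOIN path_key p ON p.id = c.path_key_id"
        " WHERE p.path = ? ORDER BY c.sentence_id",
        (path,),
    )
    return [row[0] for row in cur]
```

In this sketch a sentence uses only the keys its structure needs, so sentences with different combinations of grammatical elements share the same two tables, and the key list itself never changes once the store is created.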

What kind of common-sense logic would best manage this knowledge? Many types of semantic logic used in experimental AI are limited in scope and application. Why not integrate them into a more effective, workable whole, to cover as many of the ways logic can be applied to real-world situations as possible?

Since the validity of a logical argument depends upon the meaning, or semantics, of the sentences that make it up, I have begun work in this area using FrameNet, a project created by Charles J. Fillmore and housed at the International Computer Science Institute, affiliated with the University of California at Berkeley. It consists of a network of frames representing practically any kind of event: any kind of human activity or natural occurrence. Recently I have begun establishing relationships between the frames, treating each one as a type of logical premise from which logical conclusions and implications can be drawn.

Relationships in WordNet, a lexical database created in the Cognitive Science Laboratory of Princeton University, will also serve to implement this logic. In WordNet, words are grouped together into sets of synonyms, each of which identifies an individual concept. At a higher level, WordNet connects different concepts through a series of lexical associations, such as antonyms, as shown below. There are about twenty others, which apply according to part of speech.

Patterned according to theories of human semantic memory developed in the late 1960s, WordNet is a model of how human beings mentally organize concepts in an economical, hierarchical fashion, in effect an expandable map of the totality of concepts available to the human mind. For its knowledge management application, WordNet will enable Discovery to manage information in a broader scope than a dictionary would allow, because its relationships will enable Discovery to locate and accurately assess the relevance of sentence data as it responds to a user's questions. Consider the following examples:

The user may enter a statement or question for which information already exists in memory, but express it in different words:

User: Thomas met with the board.

System: I already know that Thomas met with the committee.

Here Discovery, when seeking information in memory, simply replaces board with committee, a member of the same synonym set, and thus prevents the entry of redundant information.

The user may enter a statement or question that, lexically, contradicts existing information in memory:

User: Is Hilda ugly?

System: No, Hilda is beautiful.

Here, Discovery replaces ugly with its antonym beautiful during the data search and spots the contradiction.
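The two checks in these examples can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Discovery's implementation: the synonym sets and antonym pair are hand-written stand-ins for WordNet data, and the stored facts are reduced to hypothetical (subject, relation, object) triples rather than full sentence diagrams.

```python
# Toy stand-ins for WordNet data; the entries are illustrative only.
SYNSETS = [
    {"board", "committee"},
    {"ugly", "unattractive"},
    {"beautiful", "lovely"},
]
ANTONYM_PAIRS = [("ugly", "beautiful")]

# Hypothetical simplification of what the system holds in memory.
MEMORY = [
    ("Thomas", "met with", "committee"),
    ("Hilda", "is", "beautiful"),
]


def synonyms(word):
    """Return every word sharing a synonym set with `word`, itself included."""
    found = {word}
    for synset in SYNSETS:
        if word in synset:
            found |= synset
    return found


def antonyms(word):
    """Return the words opposed to `word` or to any of its synonyms."""
    same = synonyms(word)
    result = set()
    for a, b in ANTONYM_PAIRS:
        if a in same:
            result |= synonyms(b)
        if b in same:
            result |= synonyms(a)
    return result


def check(subject, relation, obj):
    """Classify an input against memory as known, contradicted or new."""
    for s, r, o in MEMORY:
        if (s, r) != (subject, relation):
            continue
        if o in synonyms(obj):
            return ("known", o)
        if o in antonyms(obj):
            return ("contradicted", o)
    return ("new", None)


print(check("Thomas", "met with", "board"))
print(check("Hilda", "is", "ugly"))
```

A "known" result corresponds to the redundant statement about Thomas, and a "contradicted" result corresponds to the question about Hilda; anything else would be stored as new information.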

In this way, for purposes of knowledge management, WordNet also provides Discovery with "default" knowledge of concepts, sparing users the unnecessary effort of entering obvious factual knowledge, such as the fact that a horse is an animal. As shown in the following diagrams, multiple associations also imply others.
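As a rough illustration of how one association can imply another, the sketch below follows "is a kind of" links from word to word. The chain shown is a shortened, hand-written assumption; WordNet's own hierarchy has more intermediate levels, and the function names are hypothetical.

```python
# Illustrative "is a kind of" links; not copied from WordNet.
HYPERNYM = {
    "horse": "equine",
    "equine": "mammal",
    "mammal": "animal",
    "animal": "organism",
}


def ancestors(word):
    """Follow the links upward and return every broader concept in order."""
    chain = []
    seen = set()
    while word in HYPERNYM and word not in seen:
        seen.add(word)  # guards against an accidental cycle in the data
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def is_a(word, category):
    """True when `category` is reachable from `word` through the links."""
    return category in ancestors(word)


print(ancestors("horse"))
print(is_a("horse", "animal"))
```

Because each link is followed in turn, the fact that a horse is an animal never has to be entered by a user: it follows from the links between horse, equine, mammal and animal.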

Language translation

At best, when translating text from one language to another, software like Google Translate merely makes its best guess at translatable words and sentence structures. The extent of Discovery's ability to analyze sentences, on the other hand, provides a unique means to build translation software that needs only occasional, and more importantly minimal, assistance from the user. The result would be translations as accurate as the words, grammar and syntax of the other language will allow.

We are focusing on constructing a prototype system which is able to translate text between Spanish and English.

As its vocabulary, Discovery uses WordNet, a lexical database created by George A. Miller and compiled by Princeton University's Cognitive Science Laboratory. WordNet groups English words into sets of synonyms called "synsets", which in turn are interconnected in a network of lexical relationships. It has the qualities of both a dictionary and a thesaurus, and many more besides.

To use Discovery's sentence analysis capability for translating text from one language to another, it would first be necessary to have similar lexical databases in the target language or languages. Fortunately, versions of WordNet are available in more than 200 languages, though not all of them are presently as extensive as the English original. Some synsets in the Spanish version, for example, remain empty or incomplete. Nevertheless, every synset in one language version, if it exists in that language, must correspond to synsets in all the others according to a standard list of concepts, providing direct translations of individual words and phrases. The associations between directly translatable word entries in the English and Spanish versions, in addition to associations between grammatical structures in both languages, will provide the means to construct the Spanish-to-English/English-to-Spanish prototype. As proof of concept, it could serve as the basis of translation software for other language pairs.
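The idea of a shared concept list can be sketched as a lookup through common concept identifiers. The identifiers ("c001" and so on) and the word lists are made up for illustration and are not taken from any WordNet release; one Spanish entry is deliberately left empty to mimic the gaps mentioned above.

```python
# Hypothetical concept identifiers shared by both language versions.
EN_SYNSETS = {
    "c001": ["horse"],
    "c002": ["board", "committee"],
    "c003": ["beautiful", "lovely"],
}
ES_SYNSETS = {
    "c001": ["caballo"],
    "c002": ["comité", "junta"],
    "c003": [],  # an empty synset: the concept exists but has no words yet
}


def concepts_for(word, synsets):
    """Return the identifiers of every concept the word belongs to."""
    return [cid for cid, words in synsets.items() if word in words]


def translate_word(word, source, target):
    """Return candidate translations found through shared concepts."""
    candidates = []
    for cid in concepts_for(word, source):
        for candidate in target.get(cid, []):
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


print(translate_word("horse", EN_SYNSETS, ES_SYNSETS))
print(translate_word("caballo", ES_SYNSETS, EN_SYNSETS))
print(translate_word("beautiful", EN_SYNSETS, ES_SYNSETS))
```

The same function works in both directions because the concept identifiers, not the words, carry the correspondence. An empty result signals a gap in the target vocabulary rather than an untranslatable word.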

Just as important as directly translatable words and phrases is Discovery's method of grammatical analysis, which is divided into twelve successive stages covering all the kinds of phrases and clauses which occur in sentences in any language. Each of these would also have a directly translatable equivalent in the target language, and each equivalent would take into account differences in syntax or word order between the two languages.
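A very small sketch of structure-to-structure mapping is given below. It assumes a hypothetical rule table that pairs an English phrase pattern with the order of the same roles in Spanish, plus a tiny hand-written lexicon. It ignores gender and number agreement and assumes each role appears only once per phrase, so it illustrates the word-order idea only and is not one of the twelve analysis stages.

```python
# Hypothetical rule table: English phrase pattern -> Spanish role order.
EN_TO_ES_ORDER = {
    ("det", "adj", "noun"): ("det", "noun", "adj"),
    ("det", "noun"): ("det", "noun"),
}

# Tiny illustrative lexicon keyed by (English word, role). The Spanish forms
# are fixed here; a real system would have to choose them by agreement.
LEXICON = {
    ("the", "det"): "la",
    ("white", "adj"): "blanca",
    ("house", "noun"): "casa",
}


def translate_phrase(tagged):
    """Translate one phrase given as a list of (word, role) pairs."""
    pattern = tuple(role for _, role in tagged)
    target_order = EN_TO_ES_ORDER[pattern]
    by_role = {role: LEXICON[(word, role)] for word, role in tagged}
    return " ".join(by_role[role] for role in target_order)


phrase = [("the", "det"), ("white", "adj"), ("house", "noun")]
print(translate_phrase(phrase))
```

The adjective moves after the noun because the rule table says so, not because of anything in the word lookup, which keeps vocabulary and word order as separate concerns.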

And although it would be preferable to have the system perform automatic translations as in Google Translate and Babelfish, there are times when a sentence is truly ambiguous, even to a human being. In such a case, the system would have to ask the user to choose from two or more rewordings or interpretations in the source language before translating. For this reason, we wish to build a user-assisted system which prompts the user only in those comparatively rare instances in which it encounters sentences containing ambiguities it cannot resolve by itself.
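The prompting behaviour described here might look like the following sketch. The function names and the callback design are assumptions made for illustration: the system is asked for a choice only when more than one reading survives analysis, and the question itself is delegated to whatever user interface is in use.

```python
def choose_interpretation(readings, ask):
    """Return the single reading to translate.

    readings: rewordings of the source sentence produced by analysis.
    ask: a callback that shows the readings to the user and returns the
         zero-based index of the chosen one. It is called only when the
         sentence is ambiguous.
    """
    if not readings:
        raise ValueError("no interpretation available")
    if len(readings) == 1:
        return readings[0]  # unambiguous: no need to interrupt the user
    choice = ask(readings)
    if not 0 <= choice < len(readings):
        raise ValueError("choice out of range")
    return readings[choice]


def ask_on_console(readings):
    """One possible callback: list the readings and read a number."""
    for number, reading in enumerate(readings, start=1):
        print(f"{number}. {reading}")
    return int(input("Which meaning is intended? ")) - 1
```

For a sentence such as "I saw the man with the telescope", the analysis stage would pass two rewordings (one in which the telescope is the instrument, one in which the man is holding it) and the user's answer would decide which is translated.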

Constructing this application will require the following steps:

1. Adding more vocabulary entries to the Spanish version of WordNet, which currently contains 57,000 entries. Doing so will provide broader vocabulary coverage and bring it closer to the original English version, which contains 149,000.

2. Building a "sister" system which analyzes Spanish sentences in the same manner as its English counterpart, while accommodating features of Spanish, such as gender, which are absent in English.

3. Building a relational database which associates equivalent grammatical structures between both languages, taking into account differences in syntax.

4. Integrating both systems, once the Spanish system is thoroughly tested, to translate text between the two languages.

Thereafter, additional language pairs, such as Japanese and English, could likewise be integrated to serve speakers of more languages, enabling accurate translation from one language to many others in the form of an ever-growing network.
