How Discovery Works

Introduction—Historical Context Contrasting Rule-Based and Statistical Approaches to NLP

Two things distinguish Discovery from other NLP systems:

  • Discovery is entirely rule-based, and
  • Discovery's grammatical analysis is divided into twelve successive stages, which together govern every kind of phrase and clause that occurs in English sentences.

With regard to rule-based analysis, some historical background is necessary to understand the difference.

From the 1950s until the late 1980s, NLP systems consisted of small sets of carefully handcrafted rules based on at most a few hundred examples. Their developers meant to devise grammars with coverage broad enough to parse any conceivable sentence, but fell short of that goal. It wasn't for lack of competence or effort. One hypothesis for the shortfall was that while human grammar is largely modular, the interaction of its components made any system modeling it unimaginably complex. Developers accepted this hypothesis and looked for other approaches.

They eventually settled on systems that analyzed text by conducting statistical analysis of corpora—large, structured, annotated sets of English text—putting more emphasis on the computer's ability to parse sentences and disambiguate words. The computer compares a given sentence to examples in the corpora that use the same words.

While the results were better, statistical approaches still have limitations. No matter how extensive corpora are, computers lack awareness of what words represent. The computer essentially makes its best guess, without due judgment, mindlessly following statistical rules. Typically, such systems don't identify true ambiguities that a human would ask for clarification to resolve—and such a system would have to ask, because how can we expect a computer to be better at resolving ambiguity than the human beings who built it?

In contrast to rule-based approaches, statistical analysis takes control away from the developer, who would otherwise ensure that the system mimics or emulates human comprehension of language as closely as possible for whatever practical and productive purpose it serves. Furthermore, many psychological studies indicate that while examples remain important, rule learning is generally more efficient than exposure to examples alone.

As such, computerized linguistic analysis is a double-edged sword. Nevertheless, however difficult it eventually proved to be, I decided to devise a vastly improved rule-based system of my own. Overall, I believe that I have succeeded, and I wish to open it to peer review. But now, for you, the prospective investor, here is a brief overview of the entire process:

Word and Phrase Lookup

To begin the process, the user types a sentence into a textbox and presses return. The system isolates individual words while also attempting to identify multi-word phrases as single vocabulary entries, such as "United States of America". It also identifies inflections, such as noun plurals, possessives, verb conjugations, and the comparative and superlative forms of adjectives.

If a word is not found, the user has three options:

  • retype the word, if it is misspelled,
  • add it as a new vocabulary entry and integrate it into the network of lexical relationships in WordNet, if it is correctly spelled, or
  • direct the system to ignore the sentence.

When all words are found, every possible part of speech for each is ascertained.

Discovery's main vocabulary is WordNet, a lexical database—or word reference system—created in the Cognitive Science Laboratory of Princeton University, initially under the direction of psychology professor George Armitage Miller starting in 1985 and more recently under Christiane Fellbaum.
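
To make this concrete, here is a minimal sketch of the lookup stage in Python, using NLTK's interface to WordNet (it assumes the WordNet corpus has been downloaded via nltk). The function names and the greedy longest-match strategy are illustrative assumptions, not Discovery's actual code:

    from nltk.corpus import wordnet as wn

    def possible_parts_of_speech(word):
        """Collect the WordNet parts of speech a surface form can take,
        reducing inflections (plurals, conjugations, comparatives) to base forms."""
        tags = set()
        for pos in (wn.NOUN, wn.VERB, wn.ADJ, wn.ADV):
            if wn.morphy(word.lower(), pos) is not None:  # e.g. "elects" -> "elect"
                tags.add(pos)
        return tags

    def find_multiword_entry(tokens, start, max_len=5):
        """Greedily match the longest multi-word entry starting at tokens[start];
        WordNet stores such entries with underscores, e.g. United_States_of_America."""
        for length in range(min(max_len, len(tokens) - start), 1, -1):
            candidate = "_".join(tokens[start:start + length])
            if wn.synsets(candidate):
                return candidate, length
        return None, 1

    tokens = "The United States of America elects a president".split()
    i = 0
    while i < len(tokens):
        entry, consumed = find_multiword_entry(tokens, i)
        word = entry or tokens[i]
        print(word, possible_parts_of_speech(word) or "identifier or unknown word")
        i += consumed

Running this groups "United States of America" into one entry with the part of speech noun, while "The" and "a" fall through as candidates for the identifier handling described next.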

Fragment Analysis

Fragments are defined as either

  • individual identifiers or "signal" words: articles, prepositions, conjunctions and all manner of pronouns or interjections, or
  • one or more series of consecutive non-identifiers: nouns, verbs, adjectives or adverbs.

In his book The Structure of English, linguist Charles Carpenter Fries addresses the question of how speakers of a language recognize structural meaning. He argued that structural meaning is "signaled by specific and definite devices...that signal structural meanings which constitute the grammar of a language."

Fries called these words function or signal words—words which do not belong to the four major part-of-speech classes of nouns, verbs, adjectives and adverbs. This leaves a limited list of more commonly used words, including prepositions, conjunctions, articles and interjections, whose meanings are generally unambiguous. In development, we refer to these words as "identifiers". To assess structural meaning more readily, Discovery adopts a strategy of first dividing sentences into fragments, each consisting of the nouns, verbs, adjectives and adverbs between a pair of identifiers.

This makes sentence analysis far faster and more efficient, because it eliminates certain part-of-speech fragments—those which would otherwise invariably result in ungrammatical part-of-speech permutations for the entire sentence—before they are ever tested. The remaining fragments are recombined into a more limited list of permutations to test.
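
Here is a minimal sketch of that first segmentation step, assuming a toy identifier lexicon; Discovery's actual identifier list is far larger:

    # A small sample of identifiers; the real lexicon covers all articles,
    # prepositions, conjunctions, pronouns and interjections.
    IDENTIFIERS = {
        "the", "a", "an",           # articles
        "in", "on", "of", "with",   # prepositions
        "and", "or", "but",         # conjunctions
        "he", "she", "it", "they",  # pronouns
    }

    def split_into_fragments(tokens):
        """Split a token list into identifiers and maximal runs of
        consecutive non-identifiers (candidate nouns/verbs/adjectives/adverbs)."""
        fragments, run = [], []
        for tok in tokens:
            if tok.lower() in IDENTIFIERS:
                if run:
                    fragments.append(("fragment", run))
                    run = []
                fragments.append(("identifier", [tok]))
            else:
                run.append(tok)
        if run:
            fragments.append(("fragment", run))
        return fragments

    print(split_into_fragments("The old man saw a light in the fog".split()))
    # [('identifier', ['The']), ('fragment', ['old', 'man', 'saw']),
    #  ('identifier', ['a']), ('fragment', ['light']), ('identifier', ['in']),
    #  ('identifier', ['the']), ('fragment', ['fog'])]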

This highlights yet another way Discovery differs from other NLP systems. Instead of the traditional method of part-of-speech tagging, it parses the sentence by determining which part-of-speech permutations result in a grammatically correct sentence.

The process can be compared to a bank vault. Only one combination, or rather one permutation of numbers in a specific order, will succeed in opening it. But depending on how many turns there are, and how many possibilities for each turn, there can be a vast number of other permutations which fail. The more turns in the permutation, and the more possibilities for each turn, the more time must be spent testing them all. Yet you could save an enormous amount of time if there were a way to reduce the number of permutations you had to test.

This is the purpose of fragment analysis. The solution is to break up the sentence into these fragments and figure out which ones occur in grammatically correct English sentences. The system extracts each possible part-of-speech permutation of each fragment, and determines

  • the number of words in the fragment,
  • its type, and
  • the combination of any identifiers preceding or following it.

Based on these three criteria, the system arrives at a radically smaller number of permutations for the entire sentence to test in the next stage of the analysis.
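
The arithmetic of this pruning can be sketched as follows; the fragment rule here is hypothetical, invented purely to show the effect on the permutation count:

    from itertools import product

    # Possible parts of speech per ambiguous word in "The old man saw a light":
    pos_options = {
        "old":   {"adj", "noun"},
        "man":   {"noun", "verb"},
        "saw":   {"noun", "verb"},
        "light": {"noun", "verb", "adj"},
    }

    # Naive approach: test every sentence-level permutation.
    naive = list(product(*pos_options.values()))
    print(len(naive))  # 2 * 2 * 2 * 3 = 24 permutations

    # Fragment-level filter: suppose a (hypothetical) fragment rule says the run
    # "old man saw" between the identifiers "The" and "a" only occurs as
    # adj+noun+verb or noun+noun+verb in grammatical English.
    valid_runs = [("adj", "noun", "verb"), ("noun", "noun", "verb")]
    pruned = [run + (light,) for run in valid_runs for light in pos_options["light"]]
    print(len(pruned))  # 2 * 3 = 6 permutations remain to be tested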

If the system can't find a valid part-of-speech sequence for a fragment in a sentence that is grammatically correct, at least in the estimation of the developer, an interface is available to add the missing sequence easily and continue processing.

Part-of-Speech Analysis

In part-of-speech analysis, these fragments are re-combined to generate a list of part-of-speech permutations for the entire sentence. These permutations then pass through the twelve stages of grammar rules, which identify the series of words that form noun, verb, adjective, adverb and prepositional phrases, and relative and subordinate clauses—all the grammatical constructions that make up English sentences. If a permutation can be narrowed down to a single main clause, it's grammatically correct.
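
As a toy illustration of this reduction process (the rule format below is an assumption; Discovery's twelve stages are far more extensive), consider:

    def apply_rule(tags, pattern, label):
        """Replace the first occurrence of `pattern` (a tuple of tags) with `label`."""
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                return tags[:i] + [label] + tags[i + n:]
        return tags

    STAGES = [
        (("det", "adj", "noun"), "NP"),   # noun phrases
        (("det", "noun"), "NP"),
        (("prep", "NP"), "PP"),           # prepositional phrases
        (("NP", "PP"), "NP"),             # PP attached to a noun phrase
        (("verb", "NP"), "VP"),           # verb phrases
        (("NP", "VP"), "CLAUSE"),         # main clause
    ]

    def reduce_to_clause(tags):
        """Apply the staged rules until nothing changes; a permutation is
        grammatically valid if it collapses to a single main clause."""
        tags = list(tags)
        changed = True
        while changed:
            changed = False
            for pattern, label in STAGES:
                new = apply_rule(tags, pattern, label)
                if new != tags:
                    tags, changed = new, True
        return tags

    # "The old man saw a light": one permutation reduces to a single clause...
    print(reduce_to_clause(["det", "adj", "noun", "verb", "det", "noun"]))  # ['CLAUSE']
    # ...while an ill-formed permutation does not, so it is rejected.
    print(reduce_to_clause(["det", "adj", "verb", "verb", "det", "noun"]))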

In the rare instances in which no sentence permutation results in a main clause, either the sentence is grammatically incorrect, at least according to the available rules, or the system lacks a grammar rule. In the former case, the user can again have the system ignore the sentence. In the latter, there's an interface to add grammar rules as needed for a sentence that's actually grammatically correct.

Slightly more often, depending on the words used, there may be two or more valid sentence permutations. In that case, the sentence is structurally ambiguous, but those permutations pass to yet another stage to determine which one results in a comprehensible or meaningful sentence.

Again, depending on the types of words used, there are different grammar rules that govern how the eventual diagram of the sentence should be structured.

The series of grammar rules executed for each valid sentence permutation embed phrases and clauses into one another, resulting in a diagram of the sentence showing the relationship of each word to every other.

Word Sense Analysis

This stage applies three word sense disambiguation (WSD) techniques to individual words. WSD is the process of identifying which sense of a word (i.e. meaning or definition) applies in the context of a sentence, when the word has more than one.

• VerbNet

The first WSD method uses verb patterns established in VerbNet, another word reference system linked to WordNet and the brainchild of Dr. Martha Palmer of the Department of Linguistics at the University of Colorado at Boulder. VerbNet provides a large set of syntactic frames associated with different definitions of verbs. Each verb frame contains thematic roles with restrictions under which only certain types of noun, adjective and preposition definitions can apply. Three types of grammar rules are mapped according to these frames to disambiguate the words used in the frame.

VerbNet organizes 8419 definitions of 6818 English verbs into 270 classes sharing common sentence structures. Its design draws heavily on Levin classes, a classification of verbs by meaning, and on Tree-Adjoining Grammar, pioneered at the Institute for Research in Cognitive Science at the University of Pennsylvania.

For each sentence structure, VerbNet provides

  • a syntax—the structure itself, including a description of each kind of word it uses, and
  • the semantics—an expression of one or more logical predicates. This expression describes the event or idea the structure is supposed to convey, given the limited subset of verbs it can use.

The Levin-derived verb frames on which VerbNet is built serve to establish rules by which Discovery determines the roles that words—principally nouns, verbs and prepositions—play in the event or statement a sentence expresses. It does so at two levels: first, by basic primary or part-of-speech patterns, and second, by more specific extended syntaxes which specify what kinds of nouns apply according to verb sense (for example, pattern codes such as svss denote different sentence structures).
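
The following sketch shows the kind of role-restriction check a frame enables. The frame encoding and field names here are assumptions made for illustration; real VerbNet classes such as give-13.1 are considerably richer:

    # A hypothetical, simplified encoding of a VerbNet-style frame.
    FRAME_GIVE = {
        "class": "give-13.1",
        "primary": "NP V NP PP.recipient",  # e.g. "The teacher gave a book to the student"
        "roles": {
            "Agent":     "+animate",
            "Theme":     None,              # unrestricted
            "Recipient": "+animate",
        },
    }

    def roles_satisfied(bindings, frame):
        """Check each noun bound to a thematic role against the frame's restriction;
        if all pass, the verb sense tied to this frame remains a candidate."""
        for role, restriction in frame["roles"].items():
            noun = bindings.get(role)
            if restriction == "+animate" and noun and not noun["animate"]:
                return False
        return True

    bindings = {
        "Agent":     {"lemma": "teacher", "animate": True},
        "Theme":     {"lemma": "book",    "animate": False},
        "Recipient": {"lemma": "student", "animate": True},
    }
    print(roles_satisfied(bindings, FRAME_GIVE))  # True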

These relationships provide the means by which Discovery creates a grammatical diagram of a given sentence in a tree structure, thus establishing both how each word relates to every other and which definitions of each word apply in the context of the sentence.

Once a set of type rules is found that identifies a single clause comprising the sentence, the system determines which structural rules apply—the "definition rules" that make up the second half of the stages of grammar rules. It does so based on whether the extended syntaxes associated with the senses of the verbs evaluated in a rule (linked by VerbNet to verb definitions in WordNet) are also associated with one of these definition rules.

• Word Pair WSD

The second method governs sense disambiguation of words in pairs, based on the relationships provided by the sentence diagram Discovery generates. These pairs include:

  • an adverb modifying an adjective or verb
  • an adjective modifying a verb or noun
  • a direct object noun of a verb, and
  • a subject noun with a predicate verb

There are four different types of rules that apply to each of these pairs. Limiting the rules to four types prevents an excess of such rules, the problem which plagued rule-based NLP systems in the past. This strategy greatly reduces the number of rules necessary by compelling Discovery to search for rules according to priority: specific rules that apply to particular word properties are consulted before more general ones. In order of priority, the four types are:

  • definition / selectional restriction (provided by VerbNet)
  • definition / hypernym ancestor
  • definition / lexical category
  • lexical category / lexical category
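
A sketch of that priority-ordered lookup follows. The rule tables, the sense-key notation lemma#part-of-speech#sense-number, and the verdict strings are all assumptions made for illustration:

    # Most specific table first, matching the priority order listed above.
    RULE_TABLES = [
        ("definition / selectional restriction", {("eat#v#1", "+edible"): "compatible"}),
        ("definition / hypernym ancestor",       {("eat#v#1", "food#n#1"): "compatible"}),
        ("definition / lexical category",        {("eat#v#1", "noun.food"): "compatible"}),
        ("lexical category / lexical category",  {("verb.consumption", "noun.food"): "compatible"}),
    ]

    def disambiguate_pair(keys_by_type):
        """Return the first matching rule, consulting the most specific table first."""
        for rule_type, table in RULE_TABLES:
            key = keys_by_type.get(rule_type)
            if key in table:
                return rule_type, table[key]
        return None  # no rule found: a candidate for the rule-authoring interface

    # Verb-object pair from "eat an apple": apple#n#1 is edible, descends from
    # food#n#1, and belongs to the lexical category noun.food.
    print(disambiguate_pair({
        "definition / selectional restriction": ("eat#v#1", "+edible"),
        "definition / hypernym ancestor":       ("eat#v#1", "food#n#1"),
        "definition / lexical category":        ("eat#v#1", "noun.food"),
        "lexical category / lexical category":  ("verb.consumption", "noun.food"),
    }))
    # ('definition / selectional restriction', 'compatible')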

Here again, an interface has been built to easily add these rules where necessary to broaden Discovery's ability to exercise reliable WSD, a feature not found in NLP systems using statistical analysis.

• Prepositional Phrase WSD

The third method governs rules that apply to prepositional phrases. Discovery takes into account prepositional phrases which modify nouns, verbs, and even nouns within other prepositional phrases. It does so with another economical set of rules similar to those for word pairs, again based on pattern and rule type, but specific to prepositional phrases.
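
A final sketch shows the shape of such a rule table for deciding what a prepositional phrase modifies; the category names and verdicts are illustrative assumptions, not Discovery's actual rules:

    # Rules keyed by (head lexical category, preposition, object lexical category).
    PP_RULES = {
        ("verb.motion",   "to",   "noun.location"): "attach to verb",  # "drove to the city"
        ("noun.artifact", "with", "noun.artifact"): "attach to noun",  # "the house with a porch"
        ("noun.food",     "in",   "noun.artifact"): "attach to noun",  # "the soup in the bowl"
    }

    def attach_pp(head_category, preposition, object_category):
        """Look up the attachment rule for a prepositional phrase; a miss is a
        candidate for the same rule-authoring interface as the word-pair rules."""
        return PP_RULES.get((head_category, preposition, object_category),
                            "no rule: prompt the developer to add one")

    print(attach_pp("verb.motion", "to", "noun.location"))  # attach to verb
    print(attach_pp("noun.food",   "in", "noun.artifact"))  # attach to noun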