Introduction: Historical Context Contrasting Rule-Based and Statistical Approaches to NLP
Two things distinguish Discovery from other NLP systems:
- Discovery is entirely rule-based, and
- Discovery's grammatical analysis is divided into twelve successive stages, which together cover every kind of phrase and clause that occurs in English sentences (a minimal sketch of such a staged pipeline follows this list).
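To make the second point concrete, here is a minimal Python sketch of what a successive-stage pipeline can look like. The stage names, rules, and signatures are illustrative assumptions for this overview, not Discovery's actual twelve stages:

```python
# A minimal sketch of a successive-stage, rule-based pipeline.
# The two stages below are hypothetical stand-ins; a real system
# would chain twelve of them, each feeding the next.

from typing import Callable, List

Stage = Callable[[List[str]], List[str]]

def tag_parts_of_speech(tokens: List[str]) -> List[str]:
    """Illustrative stage: label each token (crude placeholder rule)."""
    return [f"{t}/NOUN" if t[0].isupper() else f"{t}/WORD" for t in tokens]

def group_phrases(tokens: List[str]) -> List[str]:
    """Illustrative stage: bracket the tagged tokens into one phrase."""
    return ["[" + " ".join(tokens) + "]"]

# Each stage's output becomes the next stage's input, which is what
# keeps the rules modular and the stages independently testable.
STAGES: List[Stage] = [tag_parts_of_speech, group_phrases]

def analyze(sentence: str) -> List[str]:
    result = sentence.split()
    for stage in STAGES:
        result = stage(result)
    return result

print(analyze("Discovery parses English sentences"))
```

The design point is the succession itself: because every stage consumes the previous stage's output, each set of rules can stay small and focused.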
With regard to rule-based analysis, some historical background is necessary to understand the difference between the two approaches.
From the 1950s until the late 1980s, all NLP systems consisted of small sets of carefully handcrafted rules, each based on at most a few hundred examples. Their developers meant to devise grammars with broad enough coverage to parse any conceivable sentence, but the systems fell short of those expectations. This was not for lack of competence or effort. One hypothesis for the shortcoming was that while human grammar is largely modular, the interactions among its components made any comprehensive rule system unmanageably complex. Developers accepted this hypothesis and looked for other approaches.
They eventually settled on systems that analyze text through statistical analysis of corpora (large, structured, annotated sets of English text), placing more emphasis on the computer's ability to parse sentences and disambiguate words. The computer compares a given sentence against examples in the corpora that use the same words.
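As a sketch of the statistical idea (not any particular system's implementation), the following minimal Python example shows corpus-driven disambiguation; the toy annotated corpus is an invented stand-in for the large corpora such systems actually rely on:

```python
# Minimal sketch of statistical disambiguation: choose the tag a
# word carries most often in an annotated corpus. The tiny corpus
# below is invented for illustration.

from collections import Counter, defaultdict

annotated_corpus = [
    ("saw", "VERB"), ("saw", "VERB"), ("saw", "NOUN"),
    ("bank", "NOUN"), ("bank", "NOUN"),
]

# Count how often each word appears with each tag.
tag_counts = defaultdict(Counter)
for word, tag in annotated_corpus:
    tag_counts[word][tag] += 1

def disambiguate(word: str) -> str:
    """Return the word's most frequent tag in the corpus; the program
    has counts, not comprehension of what the word represents."""
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

print(disambiguate("saw"))    # VERB: it simply follows the counts
print(disambiguate("sawed"))  # UNKNOWN: no corpus evidence at all
```

The sketch also previews the limitation discussed next: the program has frequencies, not judgment.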
While the results were better, statistical approaches still have limitations. No matter how extensive the corpora, the computer lacks any awareness of what words represent. It essentially makes its best guess, without judgment, mindlessly following statistical patterns. Typically, such systems do not even identify the genuine ambiguities that a human would resolve by asking for clarification, and a system would have to ask: how can we expect a computer to be better at resolving ambiguity than the human beings who built it?
In contrast to rule-based approaches, statistical analysis takes control away from the developer, who can no longer ensure that the system emulates human comprehension of language as closely as possible for whatever practical and productive purpose it serves. Furthermore, many psychological studies indicate that while examples remain important, learning rules is generally more efficient than mere exposure to examples.
Computerized linguistic analysis is thus a double-edged sword. Nevertheless, however difficult the task promised to be (and in fact turned out to be), I decided to revisit the problem and devise a vastly improved rule-based system of my own. Overall, I believe that I have succeeded, and I wish to open the work to peer review. But first, for you, the prospective investor, here is a brief overview of the entire process: