Validating Natural Language Using an Object Database


Some rules governing natural language must be explained prior to producing object and relational models. The rules of natural language are known as linguistics. From these rules both object and relational representations can then be devised.

The Basics of Natural Language Processing

The term natural language processing generally implies the interpretation of written or textual language. Natural language processing is the syntactical and semantical validation of words and sentences based on known words and strings of words. For example, a apple is invalid whereas an apple is valid, since a noun beginning with a vowel is preceded by the article an and not the article a in this case.
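As a very rough illustration of this kind of check, the sketch below (in Python, with a hypothetical article_is_valid function invented for the example) validates the indefinite article against the first letter of the following word. It is a deliberate simplification: the real rule depends on sound rather than spelling, so words such as hour and university would need to be handled as exceptions.

  # A minimal sketch of the a/an check described above. The vowel-letter test
  # is a simplification of the real (sound-based) rule.
  VOWELS = set("aeiou")

  def article_is_valid(article: str, next_word: str) -> bool:
      """Return True if the indefinite article agrees with the following word."""
      starts_with_vowel = next_word[0].lower() in VOWELS
      if article.lower() == "an":
          return starts_with_vowel
      if article.lower() == "a":
          return not starts_with_vowel
      return True  # not an indefinite article, so nothing to check

  print(article_is_valid("a", "apple"))   # False - "a apple" is invalid
  print(article_is_valid("an", "apple"))  # True  - "an apple" is valid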

Syntax or Spelling and Grammatical Validation

Syntax validation can be broken into two distinct steps, spelling and grammar. Syntax validation is the validation of the positioning of words in a sentence based on their spellings, their existence in a lexicon (a set of valid words) and their relationships to each other based on their word types, i.e. the grammar or linguistics of a language. Syntax analysis validates the structure of a word, a group of words or a sentence, not its meaning. Note that computerised grammatical validation is not always an exact match to normal linguistic rules. Sometimes computerisation requires a certain amount of specialisation added to a set of grammatical rules.

  1. Spelling and existence of words in a language. This step validates the words themselves, not their meanings or inter-related meanings: find and validate each word in a dictionary or lexicon of words (a simple version of this check, together with a basic agreement check, is sketched after this list).

  2. Grammatical validation validates relationships between different words according to the types of those words and the required grammatical relationships between those word types. Grammar defines rules for a language, but no language fits all of its rules precisely. Thus there are grammatical exceptions which sound right but do not fit the rules, and all exceptions must be catered for explicitly. Note that there are rules in the Latin-based languages, such as extensive verb conjugation, which apply to English only in a much reduced form, as the conjugation examples further down show.

    1. Word types.

      1. Noun.
      2. Verb.
      3. Adjective.
      4. Adverb.
      5. Preposition.
      6. Conjunction.

    2. Word type relationships.

      1. A noun is a thing or object - Jack.
      2. A verb describes what a noun is doing - Jack runs.
      3. An adjective describes the noun - Big Jack runs.
      4. An adverb describes the verb - Jack runs fast.
      5. Articles - a, an, the.
      6. Conjunctions - and, or.
      7. Punctuation - ., !, etc.

    3. Other grammatical validation factors.

      1. Tenses.

        1. Past.
        2. Present.
        3. Future.
        4. Past continuous.
        5. Present continuous.
        6. Future continuous.
        7. English has more tenses than this (the perfect and perfect continuous forms, for example), but I will stick with the simple ones: past, present and future.

      2. Verb conjugation - in terms of computerised validation verb conjugation is helpful: the more something can be systematised, the more easily it can be computerised. Verb conjugations are either regular or irregular. A regular verb conjugation sticks to a common pattern whereas an irregular verb does not.

        1. Regular verbs - note how the he and she forms end with an s and all other forms are left unchanged.

          1. To run.

            I run
            you run
            he runs
            she runs
            we run
            they run

          2. To walk.

            I walk
            you walk
            he walks
            she walks
            we walk
            they walk

          3. The verb to speak (French: parler) is a regular verb in both English and French.

            English                      French
            I speak                      je parle
            you speak (familiar)         tu parles
            you speak (non-familiar)     vous parlez
            he speaks / she speaks       il parle / elle parle
            we speak                     nous parlons
            they speak (masculine)       ils parlent
            they speak (feminine)        elles parlent

        2. Irregular verbs - note how the spelling of the verb following the pronoun (generally its ending letters) does not follow the consistent pattern of the two regular verbs run and walk.

          1. To be.

            I am
            you are
            he is
            she is
            we are
            they are

          2. The verb to be (French: être) is an irregular verb in both English and French.

            English                      French
            I am                         je suis
            you are (familiar)           tu es
            you are (non-familiar)       vous êtes
            he is / she is               il est / elle est
            we are                       nous sommes
            they are (masculine)         ils sont
            they are (feminine)          elles sont

      3. Note that tenses apply to verb conjugations as well.

        1. Past tense.

          I was
          you were
          he was
          she was
          we were
          they were

          I ran
          you ran
          he ran
          she ran
          we ran
          they ran

        2. Present tense.

          I run
          you run
          he runs
          she runs
          we run
          they run

        3. Future tense.

          I will/should/etc. run
          you will/should/etc. run
          he will/should/etc. run
          she will/should/etc. run
          we will/should/etc. run
          they will/should/etc. run
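The sketch below pulls the two syntax-validation steps together in Python: a spelling check against a small lexicon and a regular-verb agreement check based on the he/she -s pattern shown above. The lexicon contents and the function names (spelling_errors, conjugate_regular, agreement_is_valid) are assumptions made for this example, not a complete grammar.

  # (1) Spelling: is each word in the lexicon?
  # (2) Grammar:  does a regular verb agree with its pronoun?
  LEXICON = {"i", "you", "he", "she", "we", "they", "run", "runs", "walk", "walks", "jack", "fast"}
  THIRD_PERSON_SINGULAR = {"he", "she", "it"}

  def spelling_errors(sentence: str) -> list[str]:
      """Return the words that do not appear in the lexicon."""
      return [w for w in sentence.lower().split() if w not in LEXICON]

  def conjugate_regular(verb_stem: str, pronoun: str) -> str:
      """Present-tense form of a regular verb: he/she/it take an -s, the rest are unchanged."""
      return verb_stem + "s" if pronoun.lower() in THIRD_PERSON_SINGULAR else verb_stem

  def agreement_is_valid(pronoun: str, verb: str, verb_stem: str) -> bool:
      """Check that the verb used matches the expected regular conjugation."""
      return verb.lower() == conjugate_regular(verb_stem, pronoun)

  print(spelling_errors("he runz fast"))          # ['runz'] - fails the spelling step
  print(agreement_is_valid("he", "runs", "run"))  # True  - "he runs"
  print(agreement_is_valid("he", "run", "run"))   # False - "he run" breaks the pattern

Irregular verbs such as to be would have to be stored form by form rather than generated from a stem, which is one reason a computerised lexicon ends up holding more than just a list of words.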

Semantic Validation or the Validation of Meaning

Semantics is the meaning of words and of the combinations of those words. A lexicon, as already stated, stores words and their meanings. A program cannot understand these meanings since, to a program, a sentence is simply a string of characters. Meaning, or semantics, is probably closer to what is meant by natural language processing than syntax is.

In some respects semantics is probably more complicated than voice recognition, for instance. Voice recognition is possibly more dependent on microphone technology than anything else: it is the breaking up of sounds into patterns of frequencies, which are known quantities. Dialects could cause problems, but new frequency patterns can be learned over a period of time. A computer can learn new frequency patterns more easily than it can learn semantics, because meaning is not necessarily interpretable from a fixed number of known patterns. Even where it is, the known patterns of the meanings of word combinations could produce a phenomenal number of permutations of the meanings that can be derived from groups of similar, and sometimes even the same, words. In some languages, accents or the expression of words or groups of words can give them multiple meanings.

The only way a computer could interpret meaning would be to take all of these factors into account at once, judging syntax, sound and word order simultaneously. The permutations involved are unimaginable, at least with current computer technology. The future is, however, open to debate. The best currently available supercomputers may be able to accomplish this task, and five years from now those supercomputers will probably be on your wrist. Programmers probably will not be able to solve the software construction problem that quickly. However, we can now construct software which can learn simply by human interaction, so computers will learn how to communicate with us at this level.

Let us take a quick look at two simple sentences. How would these simple sentences be validated semantically such that their meaning is correctly interpreted and understood, to the point where the two sentences could perhaps be explained by a computer in the computer's own words?

  1. This is their cat.
  2. Go over there cat.

How would the computer distinguish between the two words there and their? Both words sound the same but are spelt differently. The only way to deduce the different meanings of the two sentences is to look at the other words in each sentence and their order, which may assist in determining meaning.

  1. The cat is owned. Actually this is a poor example: word order is not required. The word their is a possessive pronoun, therefore the cat is owned, at least by someone.
  2. The cat is going somewhere. This is a better example. The word cat is a noun and the word go is a verb. However, the word over could have many meanings: something is over something, she is over him, the cat is going over the hill or over there. The word there implies direction (strictly an adverb of place, though you could call it a directional pronoun if you like). However, this one is still a little confusing from my point of view. A rough sketch of how word types and word order might be used to pick the right spelling follows below.
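The sketch below illustrates the word-type idea in Python. It guesses whether the homophone should be spelt their or there by looking at the neighbouring words: after a preposition of place such as over it assumes direction, and before a noun it assumes possession. The word lists and the function name expected_spelling are invented for the example, and a heuristic this small will of course fail on plenty of real sentences.

  PLACE_PREPOSITIONS = {"over", "under", "in", "out", "up", "down"}
  NOUNS = {"cat", "dog", "house", "book"}

  def expected_spelling(words: list[str], position: int) -> str:
      """Guess 'their' or 'there' for the homophone at the given position."""
      w = [word.lower() for word in words]
      preceding = w[position - 1] if position > 0 else ""
      following = w[position + 1] if position + 1 < len(w) else ""
      # After a preposition of place the word indicates direction.
      if preceding in PLACE_PREPOSITIONS:
          return "there"
      # A possessive pronoun is normally followed directly by a noun.
      if following in NOUNS:
          return "their"
      return "there"

  print(expected_spelling(["this", "is", "their", "cat"], 2))  # their - possession
  print(expected_spelling(["go", "over", "there", "cat"], 2))  # there - direction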

Even these two simple sentences could generate enormous complexity in semantic analysis. Semantics, even in its simplest computer-based form, can potentially reference a very large amount of information at once. Meaning is uncovered based on inter-relationships between words on an enormous scale and can include standard responses to standard questions or statements. A machine can appear to understand by varying its answers and ensuring that things like tense are correctly constructed. Different languages, colloquialisms (slang) and multiple dialects could also cause problems in terms of the quantity of information required to validate word and sentence meanings.

What is a Lexicon?

A lexicon is by definition a set of words. A dictionary contains the same set of words but also contains a meaning for each of those words. A computerised lexicon, at least as I use the term here, is closer to a dictionary in that it stores both the words and their meanings.

It should now be apparent that a computerised lexicon is much more complex than a book-form dictionary. A program cannot judge the meaning of a word from that word's written definition, since the machine does not understand the meanings of the words comprising the explanation of that word either.

A lexicon or a dictionary is a group of words and their associated meanings. Book form and computerised forms of lexicons could be very different. In general a lexicon is a storage area for data. In book form a lexicon contains words and meanings. A computerised form of lexicon could contain words, word inter-relationships and rules governing or triggered by access to those words or relationships.
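One possible shape for such a computerised lexicon is sketched below in Python. Each entry carries the word, its word type, a meaning, its inter-relationships to other words, and an optional rule that is triggered when the entry is accessed. The class and field names (Lexicon, LexiconEntry, on_lookup and so on) are assumptions made for this sketch rather than any standard structure.

  from dataclasses import dataclass, field
  from typing import Callable, Optional

  @dataclass
  class LexiconEntry:
      word: str
      word_type: str                                       # noun, verb, adjective, ...
      meaning: str                                         # human-readable definition
      related: list[str] = field(default_factory=list)     # inter-relationships to other words
      on_lookup: Optional[Callable[["LexiconEntry"], None]] = None  # rule triggered by access

  class Lexicon:
      def __init__(self) -> None:
          self.entries: dict[str, LexiconEntry] = {}

      def add(self, entry: LexiconEntry) -> None:
          self.entries[entry.word.lower()] = entry

      def lookup(self, word: str) -> Optional[LexiconEntry]:
          entry = self.entries.get(word.lower())
          if entry and entry.on_lookup:
              entry.on_lookup(entry)                       # fire the rule attached to this entry
          return entry

  lexicon = Lexicon()
  lexicon.add(LexiconEntry("cat", "noun", "a small domesticated feline", related=["kitten", "feline"]))
  entry = lexicon.lookup("cat")
  print(entry.word_type, "-", entry.meaning)               # noun - a small domesticated feline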

At some stage the computerised lexicon becomes firstly, an expert system and secondly, a knowledge base.

Expert Systems

An expert system is a program, or database, that is an expert in a specific area. Thus if one asks an expert system a question about engineering and the expert system was programmed by an architect, one could get an abstract answer to a precise question. The point to note is that an expert system is written by an expert, for an expert, in a very specific field. An expert system gives standard answers to standard questions. In its most simple form an expert system will always give the same answer to the same question, and it could have multiple questioning and answering pathways depending on user responses. However, it is important to note that there is no semantic capability: an expert system only appears to be intelligent based on its content.
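A minimal sketch of that behaviour in Python is shown below: standard answers reached by walking a fixed pathway of yes/no questions. The pathway, the questions and the consult function are all invented for the example; a real expert system would be far larger but would behave in essentially the same deterministic way.

  PATHWAYS = {
      "start": {"question": "Is the structure load-bearing?", "yes": "load", "no": "cosmetic"},
      "load": {"question": "Is the span greater than 6 metres?",
               "yes": "Consult a structural engineer.",
               "no": "A standard beam table applies."},
      "cosmetic": {"answer": "No structural review is required."},
  }

  def consult(replies: list[str]) -> str:
      """Walk the question pathway using the user's yes/no replies; the same replies always give the same answer."""
      node = PATHWAYS["start"]
      for reply in replies:
          target = node.get(reply)
          if target in PATHWAYS:          # the reply leads to another question
              node = PATHWAYS[target]
          else:                           # the reply leads to a terminal, standard answer
              return target
      return node.get("answer", node.get("question", ""))

  print(consult(["yes", "yes"]))   # Consult a structural engineer.
  print(consult(["no"]))           # No structural review is required.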

Knowledge Base

A knowledge base is a database repository of expert information. An expert system is like a knowledge base with a user-friendly front-end. A knowledge base, however, not only contains standard answers to standard questions but can also contain rules, and these rules can be triggered under specified circumstances. The original idea of a knowledge base was that of an expert system repository with some inherent processing power, i.e. the processing power is contained within the database in the form of event triggers firing rules and performing other tasks such as creating new knowledge base entries.
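The sketch below illustrates that trigger-and-rule idea in Python: a small repository of facts whose rules fire on every insert and can create new entries themselves. The fact format and the example rule are invented for the illustration.

  class KnowledgeBase:
      def __init__(self) -> None:
          self.facts: set[tuple[str, str, str]] = set()   # (subject, relation, object)
          self.rules = []                                  # callables fired on every insert

      def add_rule(self, rule) -> None:
          self.rules.append(rule)

      def add_fact(self, subject: str, relation: str, obj: str) -> None:
          fact = (subject, relation, obj)
          if fact in self.facts:
              return
          self.facts.add(fact)
          for rule in self.rules:                          # event trigger: fire each rule
              rule(self, fact)

  # Example rule: if X is-a cat, derive the new entry that X is-a animal.
  def cats_are_animals(kb: "KnowledgeBase", fact) -> None:
      subject, relation, obj = fact
      if relation == "is-a" and obj == "cat":
          kb.add_fact(subject, "is-a", "animal")

  kb = KnowledgeBase()
  kb.add_rule(cats_are_animals)
  kb.add_fact("Felix", "is-a", "cat")
  print(kb.facts)   # contains both the stored fact and the derived one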

This leads us to the next item of interest, learning.

Learning

Intelligence is made up of a number of things, and one of these is learning. Learning implies gaining more information by interaction with an environment. Thus if a computer is asked a question it does not have an answer to, it could store the question and request an answer. By storing the question and answer the computer is learning. This is a very simplistic description. Neural networks are effective at learning, particularly with language, because language is generally finite in comparison to other things, and language is easily learned because there are lots of people to talk to. The problem with learning is that the more one learns, the more searching is done when questions are asked, and searching takes time. Neural networks attempt to solve this problem by attempting to model the way the brain works: in short, direct and indirect dynamic node interconnections. Isaac Asimov called them positronic pathways, routes from one point to another. Neural networks are highly complex and intensive in processing time.
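That store-the-question, learn-the-answer loop can be sketched in a few lines of Python. The class name LearningAssistant and its methods are invented for the example, and the "memory" is just an in-memory dictionary, but it shows the simplest possible version of learning by interaction.

  class LearningAssistant:
      def __init__(self) -> None:
          self.answers: dict[str, str] = {}    # learned question -> answer pairs
          self.unanswered: list[str] = []      # questions waiting for an answer

      def ask(self, question: str) -> str:
          if question in self.answers:
              return self.answers[question]
          self.unanswered.append(question)     # store the question it could not answer
          return "I do not know yet. Please teach me."

      def teach(self, question: str, answer: str) -> None:
          self.answers[question] = answer      # learning: store the new question/answer pair
          if question in self.unanswered:
              self.unanswered.remove(question)

  assistant = LearningAssistant()
  print(assistant.ask("What is a lexicon?"))   # does not know yet, stores the question
  assistant.teach("What is a lexicon?", "A set of words and their meanings.")
  print(assistant.ask("What is a lexicon?"))   # now answers from what it has learned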