Parsing


            So far we have pursued a strictly pattern-matching approach to processing natural language input from a user. This is an efficient method for handling a limited number of commands, but it quickly becomes unwieldy as the number of different queries grows. We would have to write a new subroutine to process each new query type, and we run the risk of creating conflicts between subroutines. Ultimately, the pattern matching approach fails to provide a credible model of the way people process language. Pattern matching alone will not allow computers to process language intelligently.

            In this chapter we will start to explore the grammar of English, and how we can use Perl to recognize this grammar. A grammar provides an explicit set of rules for processing all aspects of a language, including its phonology, morphology, syntax, semantics and pragmatics. We will assume that the standard alphabet provides an adequate structure for processing the sounds, or phonological structure, of English; writing a program that processes the sounds of spoken language is beyond the scope of this book (cf. Jurafsky & Martin 2000). In this chapter we will tackle the basic syntactic rules that describe the structure of English sentences. The chapter covers two different procedures for parsing sentences in Perl. Our goal will be to provide a grammar the computer can use to parse a set of test sentences, or any other sentences that have a similar structure.

            Human languages use three different sentence types or modalities. Declarative sentences make a statement, imperative sentences issue a command, and interrogative sentences ask for information. A miniature world program should be able to handle all three sentence types. In this chapter we will construct a grammar for processing declarative sentences. There are also two modes of language processing–comprehension and production. Our grammar should be useful for producing sentences as well as for responding to them. We will begin by constructing a parser for comprehension.

            Grammars have two main parts: a set of phrase structure rules and a lexicon. The phrase structure rules supply information about how words combine into larger units. The lexicon supplies grammatical information about each word, such as the part of speech it belongs to. We will concentrate on the construction of the phrase structure rules in this chapter.


Top Down Parsing


            Words in natural languages combine with other words to form constituents. Phrase structure rules capture this hierarchical structure by specifying the elements contained in any constituent phrase. The phrase structure rule S —> NP VP, for example, states that any sentence (an S) contains two parts or constituents – a noun phrase (NP) and a verb phrase (VP). The rule can also be interpreted as stating that a noun phrase and verb phrase combine to make a sentence. Before we can produce a sentence, though, we need to know what constituents combine to make a noun phrase and a verb phrase. We'll assume for now that the phrase structure rule for a noun phrase is NP —> N and the rule for a verb phrase is VP —> V. We now have the phrase structure grammar shown in Figure 7.1 with three rules.


Figure 7.1. A simple phrase structure grammar


S —> NP VP

NP —> N

VP —> V


            This grammar takes any noun and any verb and shows how to combine them to form a sentence. With such nouns as ‘I’ and ‘You’ and verbs like ‘see’ and ‘know’ our grammar will produce sentences like ‘I see’ and ‘You know’. It will not produce ungrammatical sentences such as ‘*Know you’ since our grammar states that the noun phrase always precedes the verb. It will not produce sentences such as ‘She see’ since we have not provided it with the word ‘she’ yet.
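            To see the generative side of the grammar at work, here is a minimal sketch (not one of the chapter's programs) that pairs every noun with every verb, following the rules S —> NP VP, NP —> N and VP —> V:

my %lexicon = ( "I" => "N", "you" => "N", "see" => "V", "know" => "V" );
my @nouns = grep { $lexicon{$_} eq "N" } keys %lexicon;
my @verbs = grep { $lexicon{$_} eq "V" } keys %lexicon;

foreach my $noun (@nouns) {
    foreach my $verb (@verbs) {
        print "$noun $verb\n";    # every S is an N followed by a V
    }
}

This sketch prints the four sentences 'I see', 'I know', 'you see' and 'you know' (in no particular order, since Perl hash keys are unordered).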

            We will need a more complicated set of phrase structure rules to handle more complex declarative sentences. For example, the noun phrase rule in Figure 7.1 only produces noun phrases with a single noun. Such a rule will work for simple pronouns (‘I’, ‘you’) or proper names, but it will not process a noun phrase that includes an article and adjectives. A reasonable rule for noun phrases should be able to handle the types illustrated in Figure 7.2. We can use four simple phrase structure rules to process these noun phrases. We can even combine these rules into one rule by allowing some of the constituents to be optional. Using parentheses to mark the optional constituents, the single phrase structure rule NP —> (DET) (ADJ) N allows us to generate all of the noun phrases in Figure 7.2.


Figure 7.2. Types of noun phrases

 

Type               Example

N                     I, you, there

DET N            the table, an egg

ADJ N            pretty bowls

DET ADJ N    a red block


            We also need to complicate our verb phrase rule a bit more. I list some different types of verb phrases and their constituent structures in Figure 7.3. Once again, we can use parentheses to mark optional constituents and reduce this set of structures to one phrase structure rule: VP —> V (NP) (PP). This rule introduces one new phrase – the prepositional phrase (PP). We can handle prepositional phrases with the rule PP —> P NP. By combining our phrase structure rules for noun phrases and verb phrases, we derive a more complete phrase structure grammar shown in Figure 7.4.


Figure 7.3. Types of verb phrases.

 

V                     move, eat

V NP               get the book

V PP               go to the table, eat with a fork

V NP PP         put the book on the table


Figure 7.4. Phrase structure grammar for declarative sentences


S —> NP VP

NP —> (DET) (ADJ) N

PP —> P NP

VP —> V (NP) (PP)


            The only component we are missing now is a set of rules that tell us how the phrase structure rules apply to real words. If we apply the phrase structure rules, we are left with a set of symbols that cannot be replaced by other symbols. These final symbols are the terminal strings in our grammar. The grammar in Figure 7.4 has the terminal strings DET, ADJ, N, P and V. The lexicon provides us with the means for linking these terminal strings with our vocabulary. Each lexical entry specifies the part of speech, or terminal string, that applies to each word. The pronouns 'I' and 'you' are nouns and will be associated with the terminal string N. The verbs 'understand' and 'found' will be associated with the terminal string V. A lexicon for a miniature block world is shown in Figure 7.5.


Figure 7.5. A Miniature World Lexicon


Nouns

     "block" => "N",

     "egg" => "N"

     "table" => "N"

     "bowl" => "N",

     "floor" => "N"


Verbs

     "put" => "V",

     "has" => "V",

     "is" => "V",

     "get" => "V",

     "move" => "V"


Prepositions

     "on" => "P",

     "in" => "P",

     "with" => "P"


Determiners

     "a" => "DET",

     "an" => "DET",

     "the" => "DET"


            We can use pattern matching to implement this grammatical system in Perl. To recognize the syntactic structure of an input sentence, we would first look up the syntactic category of the words in our lexicon, and output a string of terminal categories. We can then build the grammar into a complex pattern that we apply to the category string to determine its syntactic structure. The words in the sentence ‘The block is on the table’ would produce the category string DET N V P DET N. We would like our grammar to group these terminal categories into the proper constituent structure, e.g., NP[DET N] VP[V PP[P NP[DET N]]]. We will begin by building a pattern for noun phrases. The phrase structure rule for a noun phrase in our grammar is NP —> (DET) (ADJ) N. We can translate this rule into the Perl statement $NP = "(DET)? ?(ADJ )*(N)"; This statement defines the scalar variable $NP as a string that contains a possible DET, a possible space, zero or more ADJs and an N. The sequence space question mark (‘ ?’) adds spaces where they are needed between category symbols.
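            As a quick check on this translation, a small sketch (separate from the full program below) confirms that the pattern accepts all four noun phrase types from Figure 7.2:

my $NP = "(DET)? ?(ADJ )*(N)";

foreach my $phrase ("N", "DET N", "ADJ N", "DET ADJ N") {
   print "'$phrase' is a noun phrase\n" if $phrase =~ /^$NP$/;
}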

            The verb phrase rule can be translated to the Perl statement $VP = "V ?($NP)? ?($PP)?"; This statement defines the scalar variable $VP as a string that contains a V possibly followed by an $NP, which is possibly followed by a prepositional phrase. The Perl statement for a sentence would then be $S = "($NP) ($VP)";. The program in Figure 7.6 provides one example of a way to implement these phrase structure rules in Perl. It also demonstrates a way to print the results of our analysis using implicit variables.



Figure 7.6. A syntactic parser in Perl


#!/usr/local/bin/perl

# syn.pl

# test syntactic matches


# Print a prompt for the user

print "Type in a test string, \ne.g., DET N V DET ADJ ADJ N P N\n";

chop( my $string = <> );                    #Get the string from the standard input


# Main program loop

until ( $string eq 'thanks' ) {

   parse($string);          

   print "parse = $parse\n";

             

   chop( $string = <> );                        #Get another string from the standard input

} # end until


sub parse {

   my $NP = "(DET)? ?(ADJ )*(N)";

   my $PP = "P ($NP)";

   my $VP = "(V) ?($NP)? ?($PP)?";

   my $S = "($NP) ($VP)";


   my $string = shift;

   if ( $string =~ /($S)/ ) {

      $parse = "NP[$2] VP[$7 NP[$8] PP[$12]]";

      foreach $expr (1..$#-) {

         print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";

      } # end foreach

   } # end if

   else {

      print "The string \"$string\" did not parse.\n";

   }

} # end sub parse


            The program syn.pl puts Perl's regular expression engine to work in recognizing the constituent structure of the input string of terminal elements. We can use the question mark quantifier ? to indicate that we expect to find zero or one instance of a category in our string. This quantifier is ideal for determiners, since a noun phrase contains at most one determiner. The star quantifier * allows the pattern to match zero or more instances of a given category. This quantifier is ideal for the prenominal adjectives, which can expand indefinitely (e.g., 'my raggedy old flea-bitten sweater'). Together, the question mark and star quantifiers provide us with a method of searching for optional constituents in the input string.

            The parentheses in our patterns allow us to record the result of the match through Perl’s implicit variables. I include the foreach loop in the program to show what matches the program finds as well as the route Perl took to find these matches. The foreach loop comes from the Perl regular expressions tutorial (perlretut) by Mark Kvale in the Perl manual, and provides a useful tool for discovering the values of the implicit variables that Perl produces in pattern matching. The Perl interpreter keeps track of each element of the matching pattern that is between parentheses. The interpreter passes these values to a list of implicit variables $1, $2, $3, etc. Such record keeping can place a strain on the computer’s memory resources, so it is best to minimize the use of parentheses in matching patterns. A colon extension (?:) can be used when parentheses are required to disambiguate a pattern, but a record of the match is not needed. If for some reason we did not need a record of the prenominal adjectives, we could alter the noun phrase pattern using a colon extension as follows:


            my $NP = "(DET)? ?(?:ADJ )*(N)";


This statement would still extract noun phrases with any number of prenominal adjectives, but would no longer pass the adjective string to a backreference variable.
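            A short experiment shows the difference (a sketch, separate from syn.pl). With the capturing version of the pattern the adjectives occupy the second backreference; with the colon extension they are skipped, and the noun moves up to $2:

my $NP_capture = "(DET)? ?(ADJ )*(N)";
my $NP_nocapture = "(DET)? ?(?:ADJ )*(N)";

"DET ADJ ADJ N" =~ /$NP_capture/;
print "\$2 = $2\n";    # prints 'ADJ ' -- the adjective group
"DET ADJ ADJ N" =~ /$NP_nocapture/;
print "\$2 = $2\n";    # prints 'N' -- the noun is now the second group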

            Following the route Perl's parser uses to match our sentence pattern against the input string "DET N V DET ADJ ADJ N P N" reveals the details of Perl's pattern matching procedure. The position numbers record the character positions in the input string. The sentence pattern ($S) spans the entire string, from position 0, which precedes the first character, to position 25, which follows the last character. The determiner in the first noun phrase (DET ) spans from position 0 to position 4, since the determiner pattern includes the following space. I present a diagram of the parse in Figure 7.7. The parser starts at the top of the tree and works its way down each branch from left to right. The first match applies to the entire sentence. The second match applies to the subject noun phrase. The third match applies to the determiner in the subject noun phrase, and the fourth match would apply to the adjectives in the subject noun phrase if there were any. The technical term for this procedure is top-down, depth-first parsing.


Figure 7.7. Path of a Perl parse


Match 1 = S  (0,25)
   Match 2 = NP  (0,5)
      Match 3 = DET  (0,4)
      Match 5 = N  (4,5)
   Match 6 = VP  (6,25)
      Match 7 = V  (6,7)
      Match 8 = NP  (8,21)
         Match 9 = DET  (8,12)
         Match 10 = ADJ  (16,20)
         Match 11 = N  (20,21)
      Match 12 = PP  (22,25)
         Match 13 = NP  (24,25)

Input string: DET N V DET ADJ ADJ N P N


zero-length matches:

                 4 = ADJ, 14 = DET, 15 = ADJ


            In cases where the parser expects a possible constituent, but does not find one, the parser reports a null or zero-length match. The parse of our example sentence yields zero-length matches for the adjective in the subject noun phrase and for the determiner and adjective in the object of the prepositional phrase. Our input string contains two adjectives in its direct object, but our parser only records one of them. What happened? In this case, the greediness of Perl's pattern matching automatically takes the pattern past the first adjective to the second adjective. The capture group records the second adjective, even though the starred pattern has consumed the whole series of adjectives.
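            We can watch this happen with a pattern match in isolation (a minimal sketch):

"ADJ ADJ N" =~ /(ADJ )*(N)/;
print "group: $1\n";    # prints 'ADJ ' -- only the last repetition is captured
print "match: $&\n";    # prints 'ADJ ADJ N' -- the star consumed both adjectives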

            Parsers can take either a top-down or a bottom-up approach. Top-down parsers are able to take advantage of the phrase structure rules to look ahead in the parse and apply a complete or partial phrase structure rule. They run into difficulty when the rule they apply fails and they have to backtrack to the point in the tree that caused the problem. For example, a top-down parser encountering the sentence 'A computer program can process language' might apply the noun phrase rule NP –> DET N to the first two words before reading the next word 'program', which would require backtracking to find the correct noun phrase rule NP –> DET ADJ N. If the parser instead treats 'computer' and 'program' as adjectives in the phrase 'a computer program', it could process the word 'can' as a noun. This would yield the parse NP –> DET ADJ ADJ N, and the parser would encounter a problem when it processes the word 'process'. The parser could ultimately spend considerable time following these dead ends before discovering the correct parse S –> DET ADJ N AUX V N.

            As I stated earlier, the parse tree shown in Figure 7.7 displays a top-down, depth-first parse of the input sentence. We could also construct a top-down, breadth-first parse by expanding every node at one level of the tree before expanding any node at the next level down. I illustrate the result of such a parse in Figure 7.8. A breadth-first parse can limit the amount of backtracking necessary in a top-down approach by providing a limited 'look ahead' capability. In reality, though, there will always be sentences in a natural language with syntactic structures that fall beyond the range of this look ahead approach. Garden path sentences such as 'The horse raced past the barn fell' reveal the problem involved in looking far enough ahead to effectively guide the parse.


Figure 7.8. A top-down, breadth-first parse


Match 1 = S  (0,25)
   Match 2 = NP  (0,5)
      Match 4 = DET  (0,4)
      Match 6 = N  (4,5)
   Match 3 = VP  (6,25)
      Match 7 = V  (6,7)
      Match 8 = NP  (8,21)
         Match 10 = DET  (8,12)
         Match 11 = ADJ  (16,20)
         Match 12 = N  (20,21)
      Match 9 = PP  (22,25)
         Match 13 = NP  (24,25)

Input string: DET N V DET ADJ ADJ N P N



Bottom Up Parsing


            Another approach to parsing languages begins with the terminal elements and works back up to the initial syntactic node. Bottom up parsers start with the terminal elements and try to fit these together into larger constituents. Bottom up parsers run into difficulty when the constituent they build has to be reassembled in light of the remaining features in the input string. For example, a bottom up parser encountering the noun phrase ‘a computer program’ might first construct the noun phrase ‘a computer’ and then have to backtrack when it reads the next word ‘program’ to produce the correct parse NP –> DET ADJ N.

            We can implement a bottom up parse in Perl as well as the top down parse we used earlier. The bottom up procedure allows the program to use any features of the words that it has already processed to constrain its parse of the following words. For example, if the noun phrase in the sentence begins with a determiner, we know that the following word should be either a noun or adjective. If the following word is a verb or preposition then we can indicate exactly where the parse went astray. I provide an example of a bottom up parser in Figure 7.9.



Figure 7.9. A bottom up breadth first parser


#!/usr/local/bin/perl

# shift.pl

# demonstrate a simple shift-reduce parser


print "Type in a test string, \ne.g., DET N P ADJ N V N P DET ADJ N P ADJ N\n";

chop( my $string = <> );


# Main program loop

until ( $string eq 'thanks' ) {


   @symbol = split / /, $string;


   parse(@symbol); # parse produces a parse for each string of terminal elements


   print "parse = $parse\n";


   $string = ''; #erase the string for the next input


   chop( $string = <> );                        #Get another string from the standard input

} # end until


################# Subroutines ##################


sub parse {


   $npflag = 'no'; # reset flags

   $ppflag = 'no'; # records whether the verb has a direct object

   $parse = ''; # erase the parse string for the next input


   $pos = shift(@symbol); # get the first part of speech


SUBJECT:


  np($pos);


  $pos = shift(@symbol);


PLOOP:                      # Preposition Loop

  if ( $pos eq "P" ) {

     prep($pos);

     $rbracket = $rbracket . ']NP]PP';

     $pos = shift (@symbol);


     if ( $pos eq "P" ) {

        goto PLOOP;

     }

     else {

        $parse = $parse . $rbracket;

        $rbracket = '';

     }

  } # end if P



PREDICATE:


  if ( $pos eq "V" ) {

    $parse = $parse . ']NP VP[V';

  }


  else { # parse fails!

     $parse = $parse . '; VERB PARSE FAILS!';

     return;

  }


  $pos = shift (@symbol);

  if ( $pos eq '' ) {

     $parse = $parse . ']VP';

     return;

  }


  if ( $pos eq 'P' ) {

     goto PREP;

  }


  np($pos);

  $pos = shift (@symbol);


  if ( $pos eq '' ) {

     $parse = $parse . ']NP]VP';

     return;

  }


  $ppflag = "yes";


PREP:

  if ( $pos eq "P" ) {

     prep($pos);

     $rbracket = $rbracket . ']NP]PP';

     $pos = shift @symbol;


     if ( $pos eq "P" ) {

        goto PREP;

     }


     if ( $ppflag eq "no" ) {

        $parse = $parse . $rbracket . ']VP';

     }


     else {

        $parse = $parse . $rbracket . ']NP]VP';

     } # end yes

     $rbracket = '';

  } # end prep if


  return;


} #end sub parse


sub np {


  my $npflag = 'no';


  if ( $pos eq "DET" ) {

    $parse = $parse . ' NP[DET ';

    $pos = shift (@symbol);

    $npflag = 'yes';

  } #end det if


ADJ:

  if ( $pos eq "V" ) { # parse fails!

    $parse = $parse . '; NP PARSE FAILS!';

    return;

  }


  elsif ( $pos eq "N" ) { goto NOUN; }


  elsif ( $pos eq "ADJ" ) {

    if ( $npflag eq "no" ) {

       $parse = $parse . ' NP[ADJ ';

       $npflag = 'yes';

       $pos = shift (@symbol);

       goto ADJ;

    }

    else {

       $parse = $parse . 'ADJ ';

       $pos = shift (@symbol);

       goto ADJ;

    }

  } # end adj elsif


NOUN:

  if ( $pos eq "N" ) {

    if ( $npflag eq "no" ) {

       $parse = $parse . ' NP[N';

       $npflag = 'yes';

    }

    else {

       $parse = $parse . 'N';

    }


  } # end noun if


} #end sub np


sub prep {


  if ( $pos eq "P" ) {

     $parse = $parse . ' PP[P';

     $pos = shift (@symbol);

     np($pos);

  }

  else {

     $parse = $parse . '; PP PARSE FAILS!';

  } #end prep if

} #end sub prep;



            This program performs its parse on the input string in the subroutine parse. This routine starts with the beginning of the sentence and calls the np subroutine. The np function constructs a noun phrase by checking whether the first part of speech symbol in the input string is a determiner, adjective or noun. The np subroutine will accept noun phrases that contain any number of prenominal adjectives such as ‘the rotten old wooden table’.

            When the np function is finished, it returns control to the parse subroutine which then looks to see if the noun phrase is followed by a prepositional phrase. The preposition block begins at the line labeled PLOOP:. If the program finds a preposition, it calls the prep subroutine, which creates a parse for the preposition and its noun phrase object. When the prep subroutine finishes, control returns to the preposition block, which checks to see if there is another preposition (signaling the start of another prepositional phrase). If the program finds a new P symbol in the input string, it returns to the beginning of the preposition block via the goto statement


        goto PLOOP;


This cycle enables the program to process noun phrases that contain an indefinite number of modifying prepositional phrases.

            Perl allows programmers to define block labels by following a name with a colon. Putting the block labels in upper case letters makes them easy to spot when debugging the program. Goto statements have a notorious past in the annals of computer programming, since their use can lead to arbitrary program branching that is extremely difficult to troubleshoot. It is possible to convert any program with goto statements into a structured program that uses control structures such as for and while loops. The resulting structured architecture is much easier to maintain since the control blocks keep a coherent structure. I only used block labels and goto statements in this program to illustrate a basic bottom up parsing procedure, as the rewrite below shows.
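            For comparison, here is a sketch of the PLOOP block rewritten without a goto, using a while loop to the same effect:

# Structured version of the PLOOP block
while ( $pos eq "P" ) {
   prep($pos);                      # parse the preposition and its object
   $rbracket = $rbracket . ']NP]PP';
   $pos = shift(@symbol);           # look at the next symbol
}
if ( $rbracket ) {                  # close off any prepositional phrases we found
   $parse = $parse . $rbracket;
   $rbracket = '';
}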

            If the parse subroutine does not find a preposition it looks for a verb. If it finds one the parse continues, but if it does not find one it prints an output statement that says the parse failed to find a verb. Finding a verb is the first step in processing the predicate. After finding a verb, the procedure looks to see if there are any following parts of speech. If not, the parse is complete and control returns to the main program which prints the final parse. If there is another part of speech in the input, the procedure tests to see if this symbol is a preposition. If so, the procedure advances to the line labeled PREP. If not, the procedure calls the np function.

            I named this program shift.pl since it implements a shift-reduce type of parsing procedure. The program converts the part of speech symbols in the input string into an array via the statement


            @symbol = split / /, $string;


The parse subroutine then uses the shift function to remove the first symbol from the array, thereby reducing the array by one element. The parse subroutine adds each part of speech symbol to the end of the parse string until the symbol array is empty. At that point control returns to the main program, which prints the completed parse.
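            The effect of shift is easy to see in isolation (a sketch):

my @symbol = split / /, 'DET N V';
while ( @symbol ) {
   my $pos = shift(@symbol);    # remove the first symbol from the array
   print "shifted $pos; ", scalar(@symbol), " symbols remain\n";
}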

            The bottom up parsing procedure is only successful when it can identify the final element of a constituent. The implementation in shift.pl uses the verb to identify the end of the subject noun phrase. Our garden path sentence ‘The horse raced past the barn fell’ illustrates one problem with this implementation. Presumably, we could equip the parser with a procedure to recognize headless relative clauses such as ‘raced past the barn’, but we would have to allow for relative clauses of unlimited length. The program does not allow backtracking for even the simple cases of noun phrases like ‘a computer program’ with possible ‘DET N’ and ‘DET ADJ N’ parses. We will look at one method for handling these difficulties in the next chapter.



The Recursion Problem


            One limitation of our top-down parser is that it does not recognize recursive phrase structures. The prepositional phrases we discussed in the previous sections will create a problem for this parser. If we try to build recursive patterns with regular expressions in Perl we wind up with statements such as


$NP = "(DET)? ?(ADJ )*N( $PP)?"

$PP = "P $NP"


Perl treats the $PP in the definition of the noun phrase as an undefined variable and replaces the $PP with its value – the empty string. Perl interprets the definition of the prepositional phrase by replacing the variable $NP with its value, i.e., 'DET ADJ N ( )?'. Unfortunately, the Perl interpreter does not reapply the prepositional phrase rule to the $PP variable in the expansion of the noun phrase. To create a real pattern, we would need to add another line of code to replace all of our promissory variables with real variables, and yet another line to replace the variables introduced by that substitution. We would have to continue substituting for each $PP variable indefinitely, which has obvious limitations.

            We can circumvent this limitation to a certain extent by adding a definition for the prepositional phrase inside our definition of the noun phrase. By applying the star quantifier * to this prepositional phrase, we can construct a noun phrase pattern that contains zero or more prepositional phrases without using a recursive definition. For example, the statement


$NP = "(DET)? ?(ADJ )* N( P (DET)? ?(ADJ )*N)*"


will match a noun phrase containing an indefinite number of prepositional phrases. While this pattern solves the problem of matching noun phrases, it will not allow us to record the hierarchical structure of the prepositional phrases. Instead, our noun phrase rule treats the prepositional phrases as a flat sequence with no internal structure. This is a fundamental limitation of the regular expressions used by most of today's computer programs.

            One solution to this problem would be to combine the pattern match with subroutines that allow recursion. I provide an example of a recursive subroutine for parsing noun phrases in Figure 7.10.


Figure 7.10. A recursive noun phrase subroutine


sub NP {

   my $string = shift;

   $string =~ s/($NP)//;                                      #Test if $string contains an NP

   $parse = "NP[$1" . PP ($string) . "]";            #Call PP subroutine

} # end sub NP


sub PP {

   my $string = shift;

   if ( $string !~ s/$PP// ) {                             #Test if $string contains a PP

      return ''; }                                          #If not, return an empty string

   else {

      $parse = $parse . "PP[$1 NP[$2" . PP($string) . "]]";        #Call PP subroutine again!

      return $parse;

   }

} # end sub PP


The subroutine NP first deletes a noun phrase containing a determiner, adjective(s) and noun from the string. The NP subroutine then calls the PP subroutine. The PP subroutine begins by checking if the string contains a PP. If it does not contain a PP, the subroutine returns an empty string to the NP subroutine. If the PP subroutine finds a prepositional phrase, it adds the preposition and its noun phrase to the parse and makes a recursive call to itself to see if there is another prepositional phrase embedded in the first PP. The else block uses the implicit variables $1 and $2 to keep track of the preposition and noun phrase matched by the substitution in the if condition. Figure 7.11 shows a Perl program that uses these subroutines to parse noun phrases. The program calls the NP subroutine in the statement


            $parse = NP ($string);


Figure 7.11. A recursive noun phrase parser


#!/usr/local/bin/perl

# factor3.pl

# demonstrates recursive parse of NPs


$NP = "(DET )?(ADJ )*(N) ?";

$PP = "(P) ($NP)";


print "Type test string, e.g., DET N P DET ADJ N P ADJ N\n";

chop( my $string = <> );                    #Get the string from the standard input


# Main program loop

until ( $string eq 'thanks' ) {


   $parse = NP ($string);

   print "parse = $parse\n";

   chop( $string = <> );                        #Get another string from the standard input

}


sub NP {

   my $string = shift;

   $string =~ s/($NP)//;                                      #Test if $string contains an NP

   $parse = "NP[$1" . PP ($string) . "]";            #Call PP subroutine

} # end sub NP


sub PP {

   my $string = shift;

   if ( $string !~ s/$PP// ) {                             #Test if $string contains a PP

      return ''; }                                          #If not, return an empty string

   else {

      $parse = $parse . "PP[$1 NP[$2" . PP($string) . "]]"; #Call PP subroutine again!

      return $parse;

   }

} # end sub PP



            The noun phrase parser is no longer a pure top-down parser. The noun phrase subroutine first extracts a determiner, adjective(s) and noun before checking to see if there is a following prepositional phrase. A top-down parser would extract the entire noun phrase including any prepositional phrases.

            Recursion is such a central feature of language that I think it worthwhile adding two additional programs that provide further examples of recursion. The program in Figure 7.12 shows how to add noun phrase recursion to a parse for simple sentences.


Figure 7.12. A recursive parser for simple sentences


#!/usr/local/bin/perl

# factor5.pl

# demonstrates recursive parse of Sentences


$NP = "(DET )?(ADJ )*(N)";

$PP = "(P )($NP)";

$NP2 = "($NP( $PP)*)";

$V = "V ?";


print "Type in a test string, \ne.g., DET N P ADJ N V N P DET ADJ N P ADJ N\n";

chop( $string = <> );


until ( $string eq 'thanks' ) {


   $string =~s/($NP2)//; # extract the subject

   $subject = $1;

   $parse = NP ($subject) . " " . VP ($string);


   print "parse = $parse\n";


   chop( $string = <> );

}


####### Subroutines ################

 

sub NP {                                                         # parse NP

   my $string = shift;

   my $parse;

   $string =~ s/($NP)//;

   $parse = "NP[$1" . PP ($string) . "]";         # call PP

} # end sub NP

 

sub PP {                                                        # parse PP

   my $string = shift;

   my $parse;

   if ( $string !~ s/(P) ($NP)// ) { # end recursion

      return;

   }

   else {                                                           # PP recursion

      $parse = " PP[$1 NP[$2" . PP($string) . "]]";

   }

} # end sub PP


sub VP {

   my $string = shift;

   my $parse;

   $string =~ s/($V)//;

   if ( $string =~ /^ $PP/ ) { #A PP verb complement?

      $parse = "VP[V" . PP ($string) . "]"

   }

   elsif ( $string !~ /$NP/ ) { #No direct object?

      $parse = "VP[$1]";

   }

   else { $parse = "VP[V " . NP ($string) . "]" } #A direct object

} # end sub VP



            My final example of recursion provides a means for parsing complex sentences with embedded clauses such as ‘Bill told Mary that he heard that computational linguistics was the greatest’. Such sentences are generated by the following phrase structure rules


            S –> NP VP

            VP –> V COMP S


The symbol COMP stands for complementizer, the linguistic term for words such as 'that' and 'whether' which introduce embedded clauses in the verb phrase. The program will parse input strings such as 'DET N V COMP ADJ N V COMP DET N V'. This program is shown in Figure 7.13.


Figure 7.13. A program for parsing complex sentence structures


#!/usr/local/bin/perl

# factor7.pl

# demonstrates recursive parse of Sentences


$NP = "(DET)? ?(ADJ )*(N)";

$PP = "(P )($NP)";

$NP2 = "($NP( $PP)*)";



print "Type in a test string, \ne.g., DET N P ADJ N V N P DET ADJ N P ADJ N\n";

chop( $string = <> );


until ( $string eq 'thanks' ) {


   parse($string);


   print "parse = $parse\n";


   chop( $string = <> );

}


####### Subroutines ################


sub parse {


   my $subject = ''; my $verb = ''; my $object = ''; my $prep = '';


   my $string = shift;

   $string =~s/($NP2) ?//; # extract the subject

   $subject = $1;

   $parse = NP ($subject) . " " . VP ($string);

   return $parse;

}


sub NP {                                                                     # parse NP

   my $string = shift;

   my $parse;

   $string =~ s/($NP)//;

   $parse = "NP[$1" . PP ($string) . "]"; # call PP

} # end sub NP


sub PP {                                                                     # parse PP

   my $string = shift;

   my $parse;

   if ( $string !~ s/(P) ($NP)// ) { # end recursion

      return;

   }

   else {                                                                        # PP recursion

      $parse = " PP[$1 NP[$2" . PP($string) . "]]";

   }

} # end sub PP


sub VP {

   my $string = shift;

   my $parse;


   if ( $string =~ s/V ($NP)// ) {

      $object = $1; $prep = $string;

      $parse = "VP[V " . NP($object) . PP($prep) . "]"

   }

   elsif ( $string =~ s/V COMP // ) {

      $object = $string;

      $parse = "VP[V S[COMP " . parse($string) . "]]"

   }

   elsif ( $string =~ s/V // ) {

      $object = ''; $prep = $string;

      $parse = "VP[V" . PP($prep) . "]"

   }

   else { $parse = "VP[V]"; }


} # end sub VP



A Mathematical Constraint


            It is reasonable to ask whether there is an advantage to using a top down or bottom up parse procedure. In practice, the two methods are equally efficient. Many parsers combine the two approaches in an attempt to utilize the best features of both approaches. However, we can use even the simple parsing we have done so far to investigate one of the central ideas in computational theory. The problem is to provide a program that recognizes, or parses, all possible symbol strings in a natural language like English. We can proceed as before, by assuming that we have a dictionary which converts the words in an English sentence into their corresponding parts of speech. The problem for our parsing program is to divide these symbol strings into grammatical and ungrammatical sets. The input string ‘DET N V DET ADJ N’ would belong to the grammatical set since it corresponds to such English sentences as ‘The program parsed the input string’. The string ‘DET N ADJ P V’, however, belongs in the ungrammatical set since there is no acceptable English sentence with this word order.

            The top-down parsing procedure used in the program syn.pl (Figure 7.6) relies on Perl's regular expressions to build the patterns the program uses to parse the input strings. Any program that utilizes these regular expressions will be limited to the patterns that regular expressions can compose. Our parsing problem boils down to deciding whether the language composed from regular expressions is powerful enough to match symbol strings from a natural language.

            The set of patterns one can compose with regular expressions defines what is termed a regular language. A regular language can be built by the concatenation, union (disjunction), or repetition (the Kleene * operation) of regular expressions. The grammar of a regular language only permits the final element in a phrase structure rule to be a non-terminal. We can generate a regular language with a phrase structure rule such as S –> mO, where the first element on the right hand side is a terminal and the final element is a non-terminal. Adding the rules O –> o and O –> oO to the first rule generates the sentences 'mo', 'moo', 'mooo', etc. If we applied the * quantifier to the string 'mo', we would generate the sentences 'momo', 'momomo', and so on. These sentences can also be produced with the right-linear rule S –> moS, which shows that they still belong to a regular language.
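            We can mirror these rules directly with Perl patterns (a sketch; both patterns are regular expressions in the strict sense):

# S -> mO with O -> o and O -> oO describes the same language as /^mo+$/
print "'mooo' is in the language\n" if 'mooo' =~ /^mo+$/;

# repeating the string 'mo' describes the same language as /^(?:mo)+$/
print "'momomo' is in the language\n" if 'momomo' =~ /^(?:mo)+$/;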

            At the dawn of the modern computing era (when the word 'computer' referred to humans who did calculations rather than machines), Alan Turing (1936) proposed the thesis that any fully specified algorithm can be realized by a simple device now known as a Turing machine. A Turing machine scans the symbols in its input, which is typically depicted as a one-dimensional tape divided into squares, each carrying a single symbol. After reading a symbol, the machine can change the symbol on the tape, move the tape to the right or to the left, and/or change its internal state. The machine's reaction to the input symbol depends on the machine's internal state and the symbol being read. Turing machines are assumed to have a limited or finite number of internal states.

            I provide a transition table for a simple Turing machine in Figure 7.14. The Odd_or_Even machine scans a tape with the symbols 0, 1 and !, and reports whether it has found an odd or even number of ones by the time it scans the ! symbol. When the machine scans a zero, it maintains its current state, but when the machine scans a one, it changes to the opposite state. While this is not an impressive result, it underlines the surprising nature of Turing's claim that machines of this simple design can be built to compute any fully specified algorithm. Hopcroft and Ullman (1979) published a proof that Turing machines can recognize the languages described by our regular expressions.


Figure 7.14. Transition table for an Odd_or_Even Turing machine


 

                 Input
State        0        1        !        Move
Even         Even     Odd      Halt     Right
Odd          Odd      Even     Halt     Right
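            A Perl rendering of this table makes the machine's operation concrete (a minimal sketch; the tape is represented as a string, and the head simply moves right on every step):

sub odd_or_even {
   my @tape = split //, shift;    # the tape, e.g. '0110!'
   my $state = 'Even';
   foreach my $symbol (@tape) {
      return $state if $symbol eq '!';    # halt and report the current state
      $state = ( $state eq 'Even' ) ? 'Odd' : 'Even' if $symbol eq '1';
      # a '0' leaves the state unchanged; the head moves right each step
   }
   return $state;    # no '!' found; report the state at the end of the tape
}

print odd_or_even('0110!'), "\n";    # prints 'Even' -- the tape holds two ones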



            Any operation that can be performed by a Turing machine is said to be a computable function. More formally, a function is computable if a Turing machine that computes it halts after a finite number of steps. Hopcroft and Ullman's proof shows that recognizing the expressions in a regular language is a computable function. There are functions, however, that are not computable. Minsky (1967) provides a proof that the function that decides whether a machine will ever halt for a given input is not computable. This means there is no general procedure for estimating how long it will take a computer to complete a task, or whether it will complete the task at all. The hourglass that appears on personal computers only shows that the computer is busy, not the amount of time that remains. We can rephrase our original problem as asking whether the procedure that recognizes the strings of a natural language is computable.

            While we cannot show that the procedure that recognizes the strings of a natural language is computable, we can show that the grammar needed to produce such strings is not regular. Natural languages, unlike regular languages, require more powerful grammars containing context free phrase structure rules. A context free grammar allows a recursive element to appear in the middle of a phrase. For example, the rules

 

            S –> mo          S –> mSo


describe a simple context free grammar that generates the strings 'mo', 'mmoo', 'mmmooo', and so on, with matching numbers of m's and o's. The rules that generate noun phrases in English have this feature. Consider the following rules for prepositional phrases

 

            NP –> DET N (PrepP)            PrepP –> P NP


These rules generate strings such as NP[DET N PrepP[P NP[DET N PrepP[P NP[DET N]]]]]. The prepositional phrase is generated inside the noun phrase, allowing the process to continue recursively inside the noun phrase. We can add a relative clause at the end of the noun phrase to show that the recursive expansion of the prepositional phrases takes place inside the main noun phrase. The noun phrase 'the spider near the top of the wall, that is spinning a web' displays this structure. The following rule, which uses the * quantifier, does not generate the correct noun phrase structure


            NP –> DET N (PrepP)*


It only generates pseudo noun phrases such as NP[DET N PrepP[P DET N] PrepP[P DET N]].
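            The contrast can be demonstrated in Perl. A short recursive subroutine recognizes the language generated by the rules S –> mo and S –> mSo, something no pattern built from concatenation and the star quantifier can do (a sketch):

sub S_rule {
   my $string = shift;
   return 1 if $string eq 'mo';                   # S -> mo
   return S_rule($1) if $string =~ /^m(.+)o$/;    # S -> mSo: strip one m and one o
   return 0;                                      # anything else is not in the language
}

print S_rule('mmmooo') ? "grammatical\n" : "ungrammatical\n";    # prints 'grammatical'
print S_rule('mmoooo') ? "grammatical\n" : "ungrammatical\n";    # prints 'ungrammatical'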

            Parsing provides a good example of the processing limitations that restrict the computer analysis of natural languages. While there are alternatives to pure top down and bottom up parsing, none of the alternatives provides a complete solution to the problem. The problem is to find a parse for every sentence in a finite amount of time. We can easily construct a parsing procedure that will eventually determine the phrase structure of an input sentence, but we cannot guarantee that our procedure will discover the correct parse in our lifetime. Every parser has a limit on the length or complexity of the sentences it can parse successfully, i.e., on how far its backtracking ability can recover from parsing errors. Sentences that go beyond this critical length cannot be analyzed by the parser. This problem seems to be inherent in parsing natural languages, though. People encounter the parsing dilemma in the form of garden path sentences such as 'The horse raced past the barn fell.' The Virginia Woolf passage that we analyzed in Chapter 4 would create tremendous difficulties for any parsing program. In the absence of a complete solution to the parsing problem, we can employ various practical parsing solutions as long as we recognize their limitations.



Reading

 

Hopcroft, J. E. & Ullman, J. D. 1979. Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley.

Jurafsky, D. & Martin, J. H. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.

Minsky, M. 1967. Computation: Finite and Infinite Machines. Englewood Cliffs, NJ: Prentice-Hall.

Turing, A. M. 1936. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 42, 230-265.