Feature Checking
The parsers that we constructed last week rely solely on part of speech categories to analyze the hierarchical structure of sentences. In this chapter, we will examine how lexical features can help the parser complete its task. We will first add a morphological component to the parser to enable it to process common inflections. We will then add verb subcategorization features to the dictionary that will help the parser determine the proper verb complements. The final section of this note will demonstrate one method for putting the parser on the web.
Inflectional Morphology
Human languages commonly inflect each part of speech in a distinct fashion (see Figure 8.1). Nouns are inflected for case and number, verbs for tense and agreement, and adjectives for agreement and gender. Adding an inflectional morphology component to our parser will enable it to recognize new words and to use the inflections to help determine the syntactic structure of a sentence. Finding a word that ends with the inflection /-ed/, for example, suggests that the word may be a verb inflected for past tense. A word with the inflection /-s/ could be either a plural noun or a verb in the third person singular present tense, but at least we eliminate the possibility that the word is a determiner, adjective or preposition. We can recognize words inflected for tense or number by removing the inflection and looking up the base in our dictionary. This procedure is known as ‘stemming’ since it reduces a word to its inflectional base or stem. The stemming procedure will enable us to distinguish between inflected words such as looked and eggs, and words that merely end in /ed/ or /s/, such as red and pass. We will also have to equip the parser with a procedure to recognize irregularly inflected words such as found and people. We will make use of features stored in the lexical entries of irregular vocabulary items to accomplish this task.
Figure 8.1. English Inflectional Features
Part of Speech | Inflections
Nouns | Plural -s; Possessive -s
Verbs | Third person singular present tense -s; Past tense -ed; Progressive -ing
Adjectives | Comparative -er; Superlative -est; Adverbial -ly
The morphological procedure we require would examine each of the words stored in the array @words to see if they carry a regular plural or tense inflection. It can do this by checking to see if the word ends in an ‘s’ or ‘ed’. For now, we will assume that all words with an /-ed/ inflection are verbs. Our procedure will have to check the words with an /-s/ inflection to see if they are plural nouns or third person singular present tense verbs. If the word can be either a noun or verb, the morphological procedure will have to mark the word as lexically ambiguous and add the appropriate features for either a noun or verb.
We will need to expand our dictionary entry for each word in order to implement this procedure. Our current dictionary only lists the part of speech for each word. The irregularly inflected words will force us to add features for number, tense and agreement to their dictionary entries. If we add such features to the irregular vocabulary items we can make the assumption that the regularly inflected items have default values for the same features. We will assume that all regularly inflected nouns, for example, have a default value of third person singular. I provide an example of a stemming package in Figure 8.2.
Figure 8.2. Stemming package
#!/usr/bin/perl
# Stem.pm
# A stemming module
package Stem;
use Exporter;
#use strict;
#use warnings;
our @ISA = ('Exporter');
our @EXPORT = qw(&stem @array );
sub stem {
my @words = @{ $_[0] };
my %lex = %{$_[1]};
my %pos = %{$_[2]};
my @array = ();
my $item = '';
foreach $word (@words) {
if ( !exists $lex{$word} ) {
#see if word is regular past tense verb
if ( substr($word, -2) eq 'ed' ) {
$stem = $word;
substr($stem, -1) = ''; # remove the -d suffix
if ( $pos{$stem} eq 'V' ) {
$lex{$word} = "V/0";}
substr($stem, -1) = ''; # remove the -e suffix
if ( $pos{$stem} eq 'V' ) {
$lex{$word} = "V/0";}
} # end -ed if
#see if word has -s inflection
elsif ( substr($word, -1) eq 's' ) {
$stem = $word;
substr($stem, -1) = ''; # remove the -s suffix
if ( $pos{$stem} eq 'N' ) {
$lex{$word} = "N/2"; }
elsif ( $pos{$stem} eq 'V' ) {
$lex{$word} = "V/3"; }
} # end -s elsif
} # end if !exists
$item = $word . "." . $lex{$word};
push ( @array, $item );
} # end foreach $word
return @array;
} # end sub stem
return 1;
This package makes use of the perl function substr to see if a word contains an inflection. The statement
substr($stem, -2)
extracts the last two letters from the string in the variable $stem. The statement
substr($stem, 0, 2)
would extract the first two letters from this string, whereas substr($stem, 2) would return everything after the first two letters.
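The different forms of substr are easy to confuse, so a small sketch may help. The word 'looked' and the variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $word = 'looked';

my $last_two  = substr($word, -2);     # 'ed': the last two characters
my $first_two = substr($word, 0, 2);   # 'lo': two characters starting at offset 0
my $rest      = substr($word, 2);      # 'oked': everything from offset 2 onward

# substr can also stand on the left of an assignment. Assigning the
# empty string deletes the selected characters from the variable itself.
my $stem = $word;
substr($stem, -2) = '';                # $stem is now 'look'

print "$last_two $first_two $rest $stem\n";
```

A negative offset counts from the end of the string, which is why it is convenient for testing suffixes.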
The stemming routine first checks whether the word is missing from the dictionary. If it is missing, the subroutine checks whether the word ends in /-ed/. If so, it removes the final /-d/ by assigning the empty string to a substr expression.
substr($stem, -1) = ''
The subroutine then checks the dictionary to see if the resulting stem (e.g., ‘move’) is listed as a verb. If so, the subroutine creates a new dictionary entry for the inflected verb and adds the feature ‘0’, which marks a form that agrees with any subject. The subroutine then removes the /-e/ as well and repeats the check, which handles verbs such as ‘looked’. If the subroutine instead finds that the word ends in an ‘s’, it removes the suffix and then checks to see if the base is a noun or verb. If the base is a noun, the subroutine adds the feature ‘2’ which indicates that the noun is plural. If the base is a verb, the subroutine adds the feature ‘3’ to indicate that the verb is in the third person. The stemming subroutine will convert the string
the eggs are in the bowl
to
the.DET/0 eggs.N/2 are.V/2 in.P/ the.DET/0 bowl.N/3
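The suffix tests can also be tried on single words. The following is only a sketch under stated assumptions: the helper stem_word and the small %pos table are hypothetical, and the code assumes Perl 5.10 or later for the // operator. It is not a replacement for the Stem package.

```perl
use strict;
use warnings;

# A toy part-of-speech table; the entries are assumptions for this sketch.
my %pos = ( egg => 'N', look => 'V', move => 'V', see => 'V' );

# Hypothetical helper: return a POS/feature tag for a regularly inflected
# word, or undef when no stem in %pos licenses the inflection.
sub stem_word {
    my ($word) = @_;
    if ( $word =~ /ed$/ ) {
        # Strip -d first ("moved" -> "move"), then -ed ("looked" -> "look").
        for my $cut ( 1, 2 ) {
            my $stem = substr( $word, 0, length($word) - $cut );
            my $tag  = $pos{$stem} // '';
            return 'V/0' if $tag eq 'V';
        }
    }
    elsif ( $word =~ /s$/ ) {
        my $stem = substr( $word, 0, length($word) - 1 );
        my $tag  = $pos{$stem} // '';
        return 'N/2' if $tag eq 'N';    # plural noun
        return 'V/3' if $tag eq 'V';    # third person singular verb
    }
    return;    # not a recognizable regular inflection
}

for my $w (qw(eggs moved looked sees red pass)) {
    my $tag = stem_word($w) // 'unknown';
    print "$w => $tag\n";
}
```

Words such as red and pass come back as unknown because no stem in the table licenses the suffix, which is the distinction the stemming procedure is meant to draw.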
Since the stemming subroutine alters the form of the words in the input, we have to adjust the ambilex subroutine. The adjusted form of the ambilex package is shown in Figure 8.3. The adjustment uses the statement
$ambig =~ /(.*)\/(.*)/ ;
to split each reading into its category and its feature. The subroutine then rejoins the two parts with a period, stores the result in the variable $ambig, and appends it to each of the parse strings that it builds in the array @string. The slash inside the pattern must be preceded by a backslash (\) so that Perl treats it as an ordinary character rather than as the delimiter that ends the pattern.
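Perl also allows a pattern to be written with other delimiters, which removes the need for the backslash. The sketch below uses a hypothetical helper named split_tag. It also tests whether the match succeeded, since $1 and $2 keep their old values after a failed match.

```perl
use strict;
use warnings;

# Hypothetical helper: split a "POS/feature" tag into its two parts.
sub split_tag {
    my ($tag) = @_;
    # With m{...} as the delimiters, the slash needs no backslash.
    if ( $tag =~ m{^([^/]*)/(.*)$} ) {
        return ( $1, $2 );
    }
    return ( $tag, '' );    # no slash: treat the whole tag as the category
}

my ( $pos, $feature ) = split_tag('N/2');
print "$pos.$feature\n";    # prints "N.2"
```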
Figure 8.3. The adjusted ambilex package
#!/usr/local/bin/perl
# Ambi2.pm
# Works with ambi6.pl
# Produces ambiguous strings for input to Parse.pm
package Ambi2;
use Exporter;
our @ISA = ('Exporter');
our @EXPORT = qw(&ambilex @string);
sub ambilex {
my @words = @{ $_[0] };
my $no1 = 0; # initialize loop index for old string array
my $no2 = 0; # initialize loop index for expanded string array
my $index = 0; # initialize loop index
my @string = '';
my $item = '';
foreach $item (@words) { # loop over the tagged words passed in by the caller
(my $word, my $pos) = split /\./, $item ;
my @ambig = split /;/, $pos ;
$no1 = $#string; # $no1 records the index of the last element in the array of strings
@string = (@string) x @ambig; #replicate the string array by the number of ambiguities
my $i = 0;
$no2 = $#string; # the index of the expanded array of strings
foreach $ambig (@ambig) { #pick one of the ambiguous values
if ( $ambig =~ /(.*)\/(.*)/ ) { # rewrite POS/feature as POS.feature only when the match succeeds
$ambig = $1 . '.' . $2; }
for ($index = 0; $index <= $no1; $index = $index + 1) {
$string[$i] = $string[$i] . $ambig . ' '; #add the ambiguous value to each string in array
if ( $no2 > 0 ) {
$i = $i + 1;
} #end if
} #end for index
} #end foreach ambig
} #end foreach word
return @string;
} #end sub ambilex
1;
With these changes, the program will now transform strings such as
I see eggs on the table
into
N.1 V.2 N.2 P. DET.0 N.3
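The expansion of ambiguous words into separate strings can be sketched independently of the package. The helper expand_readings below is hypothetical, and the tags are taken from the lexicon in Figure 8.4. Since table is listed as both a noun and a verb, the sketch returns two strings for this sentence.

```perl
use strict;
use warnings;

# Hypothetical helper: build one tag string per combination of readings.
# Input items look like "table.N/3;V/2", as produced by the stemmer.
sub expand_readings {
    my @items   = @_;
    my @strings = ('');
    for my $item (@items) {
        my ( $word, $tags ) = split /\./, $item, 2;
        my @readings = split /;/, $tags;
        s{/}{.} for @readings;    # rewrite POS/feature as POS.feature

        # Append every reading to every string built so far.
        my @next;
        for my $s (@strings) {
            push @next, "$s$_ " for @readings;
        }
        @strings = @next;
    }
    s/ $// for @strings;    # drop the trailing space
    return @strings;
}

print "$_\n"
    for expand_readings(qw(I.N/1 see.V/2 eggs.N/2 on.P/ the.DET/0 table.N/3;V/2));
```

Each ambiguous word multiplies the number of strings by its number of readings, so a sentence with several ambiguous words quickly yields many candidate strings for the parser.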
The next step is to add a routine that checks agreement between the features. We can assume that past tense verbs agree with any person whereas verbs marked for third person singular will only agree with third person singular subjects. A few verbs require special marking in the lexicon. The copula am is marked for first person singular, while the copula is is marked for the third person singular. Likewise, the determiners a and an can only occur with singular nouns while the determiner the can occur with both singular and plural nouns. We will assume that unless otherwise indicated, verbs are inflected as non-third person present tense forms, and nouns are inflected as third person singular forms. Past tense verbs agree with all persons while the determiner the agrees with all nouns. We will first build a routine that checks agreement in noun phrases between the determiner and the nouns. We will then add a separate routine that checks agreement between the subject and verb.
We will add a noun phrase agreement subroutine to the parse package. This ensures that the program checks agreement within the noun phrase every time the NP subroutine is called. The parser only needs to record the agreement features and store their values in specific variables. I use the variable $det_agr to store the determiner’s agreement feature and $n_agr to store the noun’s agreement feature. Once we have captured the determiner and noun features we can add the following NP agreement check subroutine.
# NP Agreement Check
my($det_agr, $n_agr, $v_agr) = @_;
if ( ( $det_agr eq '.3' ) && ( $n_agr ne '.0' ) && ( $n_agr ne $det_agr ) ) {
$parse[$j] = $parse[$j] . ' NP PARSE FAILS AGREEMENT CHECK!';
return;}
else {
$np_agr = $n_agr; }
This routine first checks whether the determiner carries the singular feature (‘.3’) and the noun carries a feature other than the default (‘.0’). If so, the routine checks whether the two values differ. If they differ (e.g., ‘a eggs’), the routine signals a failure in the noun phrase agreement check. Otherwise, the routine records the noun’s feature as the agreement feature of the noun phrase. We can use a similar check for subject-verb agreement, as shown below.
# Subject-Verb Agreement Check
if ( $np_agr ne '.3' && $np_agr ne '.0' && $v_agr eq '.3' ) {
$parse[$j] = $parse[$j] . ' PARSE FAILS SUBJ-VERB AGREEMENT CHECK!';
return; }
elsif (( $np_agr eq '.3' ) && ( $v_agr ne '.0' ) && ( $v_agr ne '.3' )) {
$parse[$j] = $parse[$j] . ' PARSE FAILS SUBJ-VERB AGREEMENT CHECK!';
return; }
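The two checks can also be written as small functions that return true or false, which makes them easy to try out on their own. This is a sketch under stated assumptions: the function names are hypothetical, features are strings such as '.0', '.2' or '.3', and a missing feature is treated as the empty string.

```perl
use strict;
use warnings;

# A singular determiner ('.3') needs a singular ('.3') or unmarked ('.0') noun.
sub np_agrees {
    my ( $det_agr, $n_agr ) = map { $_ // '' } @_;
    return !( $det_agr eq '.3' && $n_agr ne '.0' && $n_agr ne '.3' );
}

# A third person verb needs a third person or unmarked subject, and a
# third person subject needs a third person or unmarked verb.
sub subj_verb_agrees {
    my ( $np_agr, $v_agr ) = map { $_ // '' } @_;
    return 0 if $v_agr eq '.3'  && $np_agr ne '.3' && $np_agr ne '.0';
    return 0 if $np_agr eq '.3' && $v_agr ne '.3'  && $v_agr ne '.0';
    return 1;
}

print np_agrees( '.3', '.2' )        ? "ok\n" : "NP fails\n";           # 'a eggs'
print subj_verb_agrees( '.1', '.3' ) ? "ok\n" : "subj-verb fails\n";    # 'I is'
```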
With these routines in place, our program will now parse the input sentence and check agreement features. The main program is shown in Figure 8.4. It uses the packages Stem (Figure 8.2), Ambi2 (Figure 8.3) and Parse2 (Figure 8.5).
Figure 8.4 The main parsing program
#!/usr/local/bin/perl
# feature5.pl
# implements inflectional morphology and feature checking
# with packages Stem.pm, Ambi2.pm and Parse2.pm
use Stem;
use Ambi2;
use Parse2;
use English;
my @string = ''; my $word = '';
my %pos = ();
#The Lexicon
my %lex = ("I" => "N/1",
"you" => "N/2",
"it" => "N/3",
"block" => "N/3;V/2",
"egg" => "N/3",
"table" => "N/3;V/2",
"bowl" => "N/3",
"floor" => "N/3",
"found" => "V/0",
"has" => "V/3",
"is" => "V/3",
"am" => "V/1",
"are" => "V/2",
"get" => "V/2",
"move" => "V/2",
"put" => "V/0",
"see" => "V/2",
"on" => "P/",
"in" => "P/",
"with" => "P/",
"a" => "DET/3",
"an" => "DET/3",
"the" => "DET/0");
foreach $word (keys %lex) {
$lex{$word} =~ /(.*)\/(.*)/ ;
$pos{$word} = $1;
} # end foreach word
print "Please type a sentence\n\n";
chomp( my $input = <> ); #Get the string from the standard input
# Main program loop
until ( $input eq 'thanks' ) {
my @words = split / /, $input;
@words = stem(\@words, \%lex, \%pos); # stem analyzes inflectional morphology
my @string = ambilex(\@words); # ambilex constructs a different string for ambiguous words
parse(\@string); # parse produces a parse for each string of terminal elements
print "\n";
chomp( $input = <> ); #Get another string from the standard input
} #end until
Figure 8.5 The parse package
#!/usr/local/bin/perl
# Parse2.pm
# Works with feature5.pl
# implements syntactic module
package Parse2;
use Exporter;
our @ISA = ('Exporter');
our @EXPORT = qw( &parse &print );
#The Grammar
my $NP = "(DET(\.[0-3]) )?(ADJ )*N(\.[0-3]?)?";
my $PP = "P(?:\. )$NP";
my $NP2 = "$NP( $PP)*";
my $V1 = "(V(\.[0-3]))(?: )?($NP)?(?: )?";
my $V2 = "(V(\.[0-3]))(?: )?";
sub parse {
our @parse = '';
our @string = @{ $_[0] }; # package variable so that sub print can display the current string
$i = 0; # initialize the string index (lexical ambiguity)
foreach $string (@string) {
chop($string);
$j = 0; # initialize the parse index (structural ambiguity)
my $string1 = $string;
my $string2 = $string1;
if ( $string1 =~s/($NP2) ($V1)// ) { # VP with object NP
my $subject = $1; my $object = $14;
my $det_agr = $3; my $n_agr = $5; my $v_agr = $13;
if ( $string1 =~m/$V2/ ) {
$parse[$j] = " SENTENCE FAILS GRAMMAR CHECK!\n";
}
else {
$parse[$j] = NP ($subject) . " VP[V$v_agr" . NP ($object) . PP ($string1) . "]\n";
}
agreechk($det_agr, $n_agr, $v_agr);
}
else {
$parse[$j] = " SENTENCE FAILS GRAMMAR CHECK!\n";
}
$j = $j + 1;
if ( $string2 =~s/($NP2) ($V2)// ) { # plain VP
my $subject = $1;
my $det_agr = $3; my $n_agr = $5; my $v_agr = $13;
if ( $string2 =~m/$V2/ ) {
$parse[$j] = " SENTENCE FAILS GRAMMAR CHECK!\n";
}
elsif ( $string2 =~ /^$PP/ ) { #A PP verb complement?
$parse[$j] = NP ($subject) . " VP[V$v_agr" . PP ($string2) . "]\n"
}
elsif ( $string2 !~ /$NP/ ) { #No direct object?
$parse[$j] = NP ($subject) . " VP[V$v_agr]\n";
}
else { #A direct object
$parse[$j] = NP ($subject) . " VP[V$v_agr" . NP ($string2) . "]\n";
}
agreechk($det_agr, $n_agr, $v_agr);
}
else {
$parse[$j] = " SENTENCE FAILS GRAMMAR CHECK!\n";
}
&print;
$i = $i + 1; # increment the string index
} #end foreach string
} #end sub parse
sub NP { # parse NP
my $string = shift;
my $parse;
if ( $string =~ s/($NP)// ) {
$parse = " NP[$1" . PP ($string) . "]"; # call PP
}
} # end sub NP
sub PP { # parse PP
my $string = shift;
my $parse;
if ( $string !~ s/(P\.) ($NP)// ) { # end recursion
return;
}
else { # PP recursion
$parse = " PP[$1 NP[$2" . PP($string) . "]]";
}
} # end sub PP
# Check agreement
sub agreechk {
# NP Agreement Check
my($det_agr, $n_agr, $v_agr) = @_;
if ( ( $det_agr eq '.3' ) && ( $n_agr ne '.0' ) && ( $n_agr ne $det_agr ) ) {
$parse[$j] = $parse[$j] . ' NP PARSE FAILS AGREEMENT CHECK!';
return;}
else {
$np_agr = $n_agr; }
# Subject-Verb Agreement Check
if ( $np_agr ne '.3' && $np_agr ne '.0' && $v_agr eq '.3' ) {
$parse[$j] = $parse[$j] . " PARSE FAILS SUBJ-VERB AGREEMENT CHECK!\n";
return; }
elsif (( $np_agr eq '.3' ) && ( $v_agr ne '.0' ) && ( $v_agr ne '.3' )) {
$parse[$j] = $parse[$j] . " PARSE FAILS SUBJ-VERB AGREEMENT CHECK!\n";
return; }
} # end sub agreechk
sub print {
print "\nstring[$i] is: $string[$i]\n";
my $j = 0;
for ( $j = 0; $j <= 1; $j = $j + 1 ) {
print " parse[$j] = $parse[$j]\n"; }
@parse = '';
} #end sub print
1;
Verb Subcategorization
We have now seen one concrete example of the way in which feature checking can help us with the ambiguity of natural language. Our program makes use of agreement checks to reject sentences with ungrammatical combinations of features. We will now add a verb subcategorization feature as an additional aid to resolving lexical and syntactic ambiguity problems.
The category of verbs contains subcategories of verbs defined by the types of complements they take. The verb laugh usually appears by itself in the predicate while the verb find can appear with both a following noun phrase and prepositional phrase. It cannot be used without a following noun phrase as shown by the peculiarity of the following sentence:
*Jill found.
We expect to be told that Jill found something. Table 8.6 shows the subcategorization frames for some of the verbs that we have been using so far.
Table 8.6 Verb subcategorization features
Verb | Subcategorization | Example Sentence
is | _pp | The bowl is on the table.
put | _np_pp | I put the egg in the bowl.
found | _np(_pp) | I found the bowl.
move | (_np)(_pp) | I moved onto the table.
see | (_np) | You see a bowl.
The subcategorization notation indicates the type of phrase that we expect to follow each verb. We expect the verb is to be followed by a prepositional phrase. We will use parentheses again to indicate that a phrase is optional. The verb move may appear by itself, with just a noun phrase, with just a prepositional phrase, or with both a noun phrase and a prepositional phrase.
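To make the notation concrete, the following sketch lists every complement sequence that a frame allows. The helper expand_frame is hypothetical and handles only the _np and _pp symbols used in Table 8.6.

```perl
use strict;
use warnings;

# Hypothetical helper: list every complement sequence that a frame such as
# "_np(_pp)" allows. Parenthesized phrases are optional.
sub expand_frame {
    my ($frame) = @_;
    my @sequences = ( [] );
    while ( $frame =~ /(\()?_(np|pp)\)?/g ) {
        my ( $optional, $phrase ) = ( $1, $2 );
        my @with = map { [ @$_, $phrase ] } @sequences;
        # An optional phrase keeps the sequences that lack it as well.
        @sequences = $optional ? ( @sequences, @with ) : @with;
    }
    return map { join ' ', @$_ } @sequences;
}

for my $frame ( '_pp', '_np_pp', '_np(_pp)', '(_np)(_pp)' ) {
    my @allowed = map { $_ eq '' ? '(nothing)' : $_ } expand_frame($frame);
    print "$frame: ", join( ' | ', @allowed ), "\n";
}
```

For the frame (_np)(_pp) the sketch yields four sequences: no complement, np alone, pp alone, and np followed by pp.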
Adding subcategorization features to our computer lexicon requires some planning. A portion of the current lexicon is shown in Table 8.7. At the moment, the lexicon exists in the form of a hash, where the words are the keys and the parts of speech and features are the values. We need a lexical structure that we can expand indefinitely to accommodate any number of features. The easiest way to do this is to use a flat structure for the values that we can then analyze via Perl’s matching statement. This approach leads to the lexicon shown in Table 8.8 that contains information about the words’ subcategorization features in addition to their part of speech and number features.
Table 8.7 The lexicon
%lex = ("I" => "N/1",
"you" => "N/2",
"it" => "N/3",
"block" => "N/3;V/2",
.
.
.
"see" => "V");
Table 8.8 A flat lexicon
%lex = ("I" => "N/1/",
"you" => "N/2/",
"it" => "N/3/",
"block" => "N/0/;V/2/:_np:",
"found" => "V/past/_np:_pp:",
.
.
.
"see" => "V//:_np:");
The main difference between the structures of these two lexicons is that the lexicon in Table 8.8 uses a slash (‘/’) to separate each of the values, whereas it is not clear how we would add a subcategorization feature to the lexicon in Table 8.7. Any additional features we wish to track can be easily tacked on to the entries in a flat lexicon. I used a colon ‘:’ in place of parentheses to mark optional phrases, since parentheses would be interpreted as grouping operators in Perl’s patterns. We can also make use of a simple pattern matching routine to extract the information from a flat lexicon. Figure 8.9 provides an example of a routine that uses the flat lexicon to create hashes that record each word’s part of speech, its number feature and its subcategorization feature.
Figure 8.9 A feature extraction routine for a flat lexicon
foreach $word (keys %lex) {
$lex{$word} =~ /(.*)\/(.*)\/(.*)/ ;
$pos{$word} = $1;
$number{$word} = $2;
$subcat{$word} = $3;
} # end foreach word
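Entries with more than one reading, such as block, contain several slash-separated groups joined by semicolons. One possible way to keep the readings apart, sketched here with a hypothetical helper named readings and Perl 5.10 or later assumed, is to split on the semicolon first and on the slash second.

```perl
use strict;
use warnings;

# A few entries in the flat format of Table 8.8.
my %lex = (
    block => 'N/0/;V/2/:_np:',
    found => 'V/0/_np:_pp:',
    the   => 'DET/0/',
);

# Hypothetical helper: return one hash reference per reading of an entry.
sub readings {
    my ($entry) = @_;
    my @readings;
    for my $reading ( split /;/, $entry ) {
        # Split into at most three fields: category, number, subcategorization.
        my ( $pos, $number, $subcat ) = split m{/}, $reading, 3;
        push @readings,
            { pos => $pos, number => $number // '', subcat => $subcat // '' };
    }
    return @readings;
}

for my $word ( sort keys %lex ) {
    for my $r ( readings( $lex{$word} ) ) {
        print "$word: pos=$r->{pos} number=$r->{number} subcat=$r->{subcat}\n";
    }
}
```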
With this routine in place we can use the subcategorization information to resolve some syntactic ambiguities. I inserted the routine shown in Figure 8.10 to resolve the verb phrase preposition attachment ambiguity. It checks the verb’s subcategorization to see if it allows for a prepositional phrase as well as a noun phrase. If so, the program restricts the parse to a verb phrase with a noun phrase and prepositional phrase complements. If not, the routine checks to see if the verb requires a noun phrase or a prepositional phrase. The routine notes if the parse does not pass the verb subcategorization check.
Figure 8.10 A verb subcategorization check
if ( ( $subcat eq ";_np" || $subcat eq ";_np:_pp:" ) && $17 eq '' ) { # parentheses needed: && binds more tightly than ||
$parse[$j] = $parse[$j] . ' PARSE FAILS SUBCAT CHECK!'; }
elsif ( ($subcat eq ";_pp" || $subcat eq ";:_np:_pp") && $24 eq '' ) {
$parse[$j] = $parse[$j] . ' PARSE FAILS SUBCAT CHECK!'; }
elsif ( $subcat eq ";_np_pp" && ( $17 eq '' || $24 eq '' ) ) {
$parse[$j] = $parse[$j] . ' PARSE FAILS SUBCAT CHECK!'; }
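Comparing the frame against a list of literal strings grows quickly as frames are added. One alternative, sketched here as an assumption rather than as the method used in this chapter, reads the frame itself: a phrase wrapped in colons is optional and a bare phrase is required. The helper subcat_ok is hypothetical, and it takes the frame without the leading semicolon.

```perl
use strict;
use warnings;

# Hypothetical helper: does a verb with frame $subcat have all of its
# required complements? $has_np and $has_pp are booleans set by the parser.
sub subcat_ok {
    my ( $subcat, $has_np, $has_pp ) = @_;
    my %found = ( np => $has_np, pp => $has_pp );
    # First alternative: an optional phrase such as ":_np:".
    # Second alternative: a required phrase such as "_np".
    while ( $subcat =~ /:_(np|pp):|_(np|pp)/g ) {
        my $required = $2;    # defined only when the second alternative matched
        return 0 if defined $required && !$found{$required};
    }
    return 1;
}

print subcat_ok( '_pp', 1, 0 )      ? "ok\n" : "PARSE FAILS SUBCAT CHECK!\n";
print subcat_ok( '_np:_pp:', 1, 0 ) ? "ok\n" : "PARSE FAILS SUBCAT CHECK!\n";
```

The first call models a sentence such as ‘I am the table’, where the verb requires a prepositional phrase that is missing. The sketch only checks for missing required phrases and does not reject extra ones.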
Figure 8.11 provides a full program that checks both number and subcategorization features. It parses the sentence ‘I am on the table’ as
NP[I.1]NP VP[am.1 PP[on. NP[the.0 table.3]NP]PP]VP
For a sentence such as ‘I am the table’ which violates the verb subcategorization requirements, the program produces the response
NP[I.1]NP VP[am.1 NP[the. table.;
PARSE FAILS VERB SUBCAT CHECK!
The program does not check all possible verb subcategorizations. In its current form, the program only allows the verb is to appear with a following prepositional phrase. I leave it as an exercise for the reader to add further subcategorization checks. The revised parsing package is provided in Figure 8.12.
Figure 8.11 A program that checks number and subcategorization features
#!/usr/local/bin/perl
# subcat4.pl
# demonstrate subcategory and feature checking
# in a top-down parser
use English;
use Stem2;
use Ambi3;
use Parse3;
my @string = ''; my $word = '';
my %pos = (); my %subcat = ();
# lexicon arranged as pos/number/subcategory
%lex = ("I" => "N/1/",
"you" => "N/2/",
"it" => "N/3/",
"block" => "N/0/;V/2/:_np:",
"egg" => "N/0/",
"table" => "N/0/;V/2/_np",
"bowl" => "N/0/",
"floor" => "N/0/",
"found" => "V/0/_np:_pp:",
"has" => "V/3/_np",
"is" => "AUX/3/_np;V/3/_pp",
"am" => "AUX/1/_np;V/1/_pp",
"are" => "AUX/2/_np;V/2/_pp",
"get" => "V/2/_np:_pp:",
"got" => "V/0/_np:_pp:",
"give" => "V/2/_np:_pp:;V//_np_np",
"gave" => "V/0/_np:_pp:;V/0/_np_np",
"move" => "V/2/_np",
"put" => "V/2/_np_pp",
"see" => "V/2/:_np:",
"saw" => "V/0/:_np:",
"on" => "P//",
"in" => "P//",
"with" => "P//",
"a" => "DET/3/",
"an" => "DET/3/",
"the" => "DET/0/");
foreach $word (keys %lex) {
$lex{$word} =~ /(.*)\/(.*)\/(.*)/ ;
$pos{$word} = $1;
$subcat{$word} = $3;
} # end foreach word
print "Please type a sentence\n\n";
chomp( $input = <> ); #Get the string from the standard input
# Main program loop
until ( $input eq 'thanks' ) {
print "\n";
@words = split / /, $input;
@words = stem(\@words, \%lex, \%pos, \%subcat); # stem the input string
my @string = ambilex(\@words); # ambilex constructs a string for each lexical ambiguity
parse(\@string); # parse produces a parse for each string of terminal elements
print "\n";
chomp( $input = <> ); #Get another string from the standard input
} #end until
Figure 8.12 The parse package
#!/usr/local/bin/perl
# Parse3.pm
# Works with subcat4.pl
# implements syntactic module
package Parse3;
use Exporter;
our @ISA = ('Exporter');
our @EXPORT = qw( &parse &print );
#The Grammar
my $NP = "(DET(\.[0-3]); )?(ADJ )*N(\.[0-3]?)?;";
my $PP = "P(?:\.; )($NP)";
my $NP2 = "$NP( $PP)*";
my $V1 = "(V(\.[0-3])(;[:np_]*))(?: )?($NP)?(?: )?";
my $V2 = "(V(\.[0-3])(;[:np_]*))(?: )?";
sub parse {
our @parse = '';
our @string = @{ $_[0] }; # package variable so that sub print can display the current string
$i = 0; # initialize the string index (lexical ambiguity)
foreach $string (@string) {
chop($string);
$j = 0; # initialize the parse index (structural ambiguity)
my $string1 = $string;
my $string2 = $string1;
if ( $string1 =~s/($NP2) ($V1)// ) { # VP with object NP
my $subject = $1; my $object = $16;
my $det_agr = $3; my $n_agr = $5; my $v_agr = $14;
my $subcat = $15;
if ( $string1 =~m/$V2/ ) {
$parse[$j] = " SENTENCE FAILS GRAMMAR CHECK!\n";
}
else {
$parse[$j] = NP ($subject) . " VP[V$v_agr" . NP ($object) . PP ($string1) . "]\n";
}
agreechk($det_agr, $n_agr, $v_agr);
subcatchk($subcat, $object, $string1);
}
else {
$parse[$j] = " SENTENCE FAILS GRAMMAR CHECK!\n";
}
$j = $j + 1;
if ( $string2 =~s/($NP2) ($V2)// ) { # plain VP
my $subject = $1;
my $det_agr = $3; my $n_agr = $5; my $v_agr = $14;
my $subcat = $15;
if ( $string2 =~m/$V2/ ) {
$parse[$j] = " SENTENCE FAILS GRAMMAR CHECK!\n";
}
elsif ( $string2 =~ /^$PP/ ) { #A PP verb complement?
$parse[$j] = NP ($subject) . " VP[V$v_agr" . PP ($string2) . "]\n"
}
elsif ( $string2 !~ /$NP/ ) { #No direct object?
$parse[$j] = NP ($subject) . " VP[V$v_agr]\n";
}
else { #A direct object
$parse[$j] = NP ($subject) . " VP[V$v_agr" . NP ($string2) . "]\n";
}
agreechk($det_agr, $n_agr, $v_agr);
subcatchk($subcat, $string2, $string2);
}
else {
$parse[$j] = " SENTENCE FAILS GRAMMAR CHECK!\n";
}
&print;
$i = $i + 1; # increment the string index
} #end foreach string
} #end sub parse
sub NP { # parse NP
my $string = shift;
my $parse;
if ( $string =~ s/($NP)// ) {
$parse = " NP[$1" . PP ($string) . "]"; # call PP
}
} # end sub NP
sub PP { # parse PP
my $string = shift;
my $parse;
if ( $string !~ s/(P\.;) ($NP)// ) { # end recursion
return;
}
else { # PP recursion
$parse = " PP[$1 NP[$2" . PP($string) . "]]";
}
} # end sub PP
# Check agreement
sub agreechk {
# NP Agreement Check
my($det_agr, $n_agr, $v_agr) = @_;
if ( ( $det_agr eq '.3' ) && ( $n_agr ne '.0' ) && ( $n_agr ne $det_agr ) ) {
$parse[$j] = $parse[$j] . ' NP PARSE FAILS AGREEMENT CHECK!';
return;}
else {
$np_agr = $n_agr; }
# Subject-Verb Agreement Check
if ( $np_agr ne '.3' && $np_agr ne '.0' && $v_agr eq '.3' ) {
$parse[$j] = $parse[$j] . ' PARSE FAILS SUBJ-VERB AGREEMENT CHECK!';
return; }
elsif (( $np_agr eq '.3' ) && ( $v_agr ne '.0' ) && ( $v_agr ne '.3' )) {
$parse[$j] = $parse[$j] . ' PARSE FAILS SUBJ-VERB AGREEMENT CHECK!';
return; }
} # end sub agreechk
# check subcategory restrictions
sub subcatchk {
my($subcat, $object, $pp) = @_;
if ( ( $subcat eq ";_np" || $subcat eq ";_np:_pp:" ) && $object eq '' ) { # parentheses needed: && binds more tightly than ||
$parse[$j] = $parse[$j] . ' PARSE FAILS SUBCAT CHECK!'; }
elsif ( ($subcat eq ";_pp" || $subcat eq ";:_np:_pp") && $pp eq '' ) {
$parse[$j] = $parse[$j] . ' PARSE FAILS SUBCAT CHECK!'; }
elsif ( $subcat eq ";_np_pp" && ( $object eq '' || $pp eq '' ) ) {
$parse[$j] = $parse[$j] . ' PARSE FAILS SUBCAT CHECK!'; }
} # end sub subcatchk
sub print {
print "\nstring[$i] is: $string[$i]\n";
my $j = 0;
for ( $j = 0; $j <= 1; $j = $j + 1 ) {
print " parse[$j] = $parse[$j]\n"; }
@parse = '';
} #end sub print
return 1;
Putting the Parser on the Internet
Since we now have a relatively powerful parsing program, it is time to consider how to display the program on an internet page. The pronoun survey program from Figure 6.5 provides a good starting point for our web-based parser. This program contains procedures for processing data sent to the server from the web page as well as a procedure for using Perl to write to a web page. One important feature of the pronoun survey program is its final section, which prints a web page document. Ordinary Perl print statements will cause a server error in such a program unless the program first prints a Content-type header followed by a blank line. The main change we have to make to our parsing program will be to cause it to print to a web page rather than to the monitor.
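A minimal sketch of such output follows. The helper cgi_response is hypothetical, and the sketch assumes that the text passed to it contains no characters that need HTML escaping.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap parser output in a complete CGI response.
sub cgi_response {
    my ($body) = @_;
    # The header and the blank line after it must come before any HTML.
    return "Content-type: text/html\n\n"
         . "<html><body><pre>$body</pre></body></html>\n";
}

print cgi_response('NP[I.1] VP[am.1]');
```

Building the whole response as a string before printing it keeps the header and the page together, so nothing can be printed ahead of the header by accident.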
I will start with an example of a web page that contains an input window for the sentences to be parsed as well as an output window for the parse. This html document is shown in Figure 8.13.
Figure 8.13 An HTML template for a web parser
<HTML>
<head>
</head>
<BODY>
<FORM method="post"
action="http://web.ku.edu/~pyersqr/cgi-bin/subcat5.pl">
<p><center><h1>A Web Parser</h1></center></p>
<p><p><textarea name="output" rows=10 cols=60>
To see the parser in action, type your sentence in the box
below, and click parse.
</textarea></p>
<p>The parser recognizes the following words:</p>
<p>I, you, it, block, egg, table, bowl, floor, red, blue, found, has, is, am,
are, get, got, give, gave, move, put, see, saw, in, on, with, a, an,
the</p>
<p><input type=text name="query" size=50 maxlength=100>
<P><INPUT type=submit value="parse">
<INPUT type=reset value=Reset></P></FORM>
</BODY>
</HTML>
This web document contains two windows. The first is defined by the html tag <textarea>. The textarea tag defines a window for displaying the output of the Perl program on the web. The size of the window is defined by the number of rows and columns it contains. The window in the example has space for 10 rows with 60 characters in each row. You can change both of these dimensions as needed.
The window for input is defined by the input tag with the type equal to text. It contains a space for a single row with a width of 50 characters. Input that is longer than 50 characters will cause the text to scroll off the window to the right. I limit the maximum length of the input text to one hundred characters as a safeguard against users who try to send an encyclopedia-sized text. The user can send the text by clicking the submit button or reset the input window by clicking the reset button.
The web page calls the parsing program by the usual post method. Figure 8.14 provides an example of a web version of our parsing program subcat4.pl. I have simply added statements to process the input from the browser and output a standard html form. The program makes use of Perl’s here document style to print the block of statements that define the html document. The statement
print<<EndOfHEAD;
begins the here document. Perl prints all of the following lines up to the line that contains the identifier ‘EndOfHEAD’. The identifier must be the only word on its line, must appear at the beginning of the line, and must be followed by a newline character. I had to add an extra line at the end of the program to supply the newline character after the second identifier ‘EndOfHTML’.
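A here document can also be assigned to a variable, which makes its behavior easy to inspect. In this sketch the array @sentence and the variable $head are chosen only for illustration.

```perl
use strict;
use warnings;

my @sentence = qw(I see eggs);

# Everything up to the line containing only EndOfHEAD becomes one string.
# Because the identifier is written in double quotes, @sentence is
# interpolated with a space between its elements.
my $head = <<"EndOfHEAD";
<textarea name="output" rows=10 cols=60>
Your sentence: @sentence
EndOfHEAD

print $head;
```

Writing the identifier in single quotes instead would switch interpolation off, so that @sentence would be printed literally.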
The program outputs the first part of the html document, and then calls the parse subroutine. This sequence allows the parse subroutine to generate parses for each ambiguous string. When the parser has finished, the subcat5 program resumes by printing out the remainder of the html document.
Figure 8.14 A web parsing script
#!/usr/local/bin/perl
# subcat5.shtml
# demonstrate simple cgi script for webparse.htm
# a web version of subcat4.pl
use Stem2;
use Ambi3;
use Parse3;
my @string = ''; my $word = ''; my %pos = ();
%lex = ("I" => "N/1/",
"you" => "N/2/",
"it" => "N/3/",
"block" => "N/0/;V/2/:_np:",
"egg" => "N/0/",
"table" => "N/0/;V/2/_np",
"bowl" => "N/0/",
"floor" => "N/0/",
"red" => "ADJ//",
"blue" => "ADJ//",
"found" => "V/0/_np:_pp:",
"has" => "V/3/_np",
"is" => "AUX/3/_np;V/3/_pp",
"am" => "AUX/1/_np;V/1/_pp",
"are" => "AUX/2/_np;V/2/_pp",
"get" => "V/2/_np:_pp:",
"got" => "V/0/_np:_pp:",
"give" => "V/2/_np:_pp:;V//_np_np",
"gave" => "V/0/_np:_pp:;V/0/_np_np",
"move" => "V/2/_np",
"put" => "V/2/_np_pp",
"see" => "V/2/:_np:",
"saw" => "V/0/:_np:",
"on" => "P//",
"in" => "P//",
"with" => "P//",
"a" => "DET/3/",
"an" => "DET/3/",
"the" => "DET/0/");
foreach $word (keys %lex) {
$lex{$word} =~ /(.*)\/(.*)\/(.*)/ ;
$pos{$word} = $1;
$subcat{$word} = $3;
} # end foreach word
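$DataLen = $ENV{'CONTENT_LENGTH'};
read (STDIN, $QueryString, $DataLen);
@input=split(/&/,$QueryString); # split into name=value pairs before decoding
$input[1]=~s/^query=//; # the second field holds the sentence
@sentence=split(/\+/,$input[1]); # '+' separates the words
s/%([\dA-Fa-f][\dA-Fa-f])/pack("C", hex($1))/eg foreach @sentence; # decode %XX escapes in each word
@words=@sentence;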
@words = stem(\@words, \%lex, \%pos, \%subcat); # morph analyzes inflectional morphology
my @string = ambilex(\@words); # ambilex constructs a string for each ambiguous word
print "Content-type:text/html\n\n";
print<<EndOfHEAD;
<HTML>
<head>
</head><BODY>
<FORM method="post"
action="http://web.ku.edu/~pyersqr/cgi-bin/subcat5.shtml">
<p><center><h1>A Web Parser</h1></center></p>
<p><p><textarea name="output" rows=10 cols=60>
Your sentence: @sentence
EndOfHEAD
parse(\@string); # parse produces a parse for each string of terminal elements
print<<EndOfHTML;
</textarea></p>
<p>The parser recognizes the following words:</p>
<p>I, you, it, block, egg, table, bowl, floor, red, blue, found, has, is, am,
are, get, got, give, gave, move, put, see, saw, in, on, with, a, an,
the</p>
<p><input type=text name="query" size=50 maxlength=100>
<P><INPUT type=submit value="parse">
<INPUT type=reset value=Reset></P></FORM>
</BODY></HTML>
EndOfHTML
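The handling of the form data in Figure 8.14 can also be sketched as a reusable function. The helper parse_form is hypothetical and assumes Perl 5.10 or later. It splits the data into fields before decoding them, so that an encoded ampersand or plus sign inside a value is not mistaken for a separator.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "output=...&query=I+see+eggs" into a hash.
sub parse_form {
    my ($data) = @_;
    my %field;
    for my $pair ( split /&/, $data ) {
        my ( $name, $value ) = split /=/, $pair, 2;
        $value //= '';
        for ( $name, $value ) {
            tr/+/ /;                                 # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;     # %XX encodes one byte
        }
        $field{$name} = $value;
    }
    return %field;
}

my %field = parse_form('output=&query=I+see+a+red%2Bblue+bowl');
my @words = split ' ', $field{query};
print join( '|', @words ), "\n";
```

Looking the sentence up by its field name, rather than by its position in the list of fields, means the script keeps working if the order of the form fields changes.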