Text Analysis in Perl


            Now that you have some idea of how to write simple Perl programs, it is time to focus on using Perl to process natural language. In this section we will look at several basic Perl programs for text analysis. Word processing programs offer simple search-and-replace tools that allow a user to make crude changes to a text. The addition of macros enhances the ability of word processors to perform text analysis, but only in a very restricted fashion. Dedicated text analysis programs such as The Oxford Concordance Program (Hockey & Marriott 1980) provide a more powerful set of text analysis tools that compute word indexes, word frequencies and lexical concordances, among other measures. While these programs can be extremely useful, they have limitations that preclude their use for all forms of text analysis. They cannot, for example, be used in any simple way to analyze the discourse of a single speaker in a transcript with several speakers who use various languages. Perl gives programmers the ability to construct their own text analysis tools for such situations.

            In Figure 5.1 I present an example of a Perl program that counts the number of times that a word appears in a text. The program contains five blocks. The first block identifies the name of the program and describes its function. The second block prompts the user to enter a word and stores the user’s response in the variable $word. The third block accesses the text file and extracts a line from the file. The fourth block tests to see whether the target word appears in the line of text, while the last block displays the frequency count for the word.



Figure 5.1 A Perl program that determines the frequency of a specified word


#!/usr/local/bin/perl

# freq.pl

# This program counts lexical frequencies in a pre-defined text


# Block 2. Request a target word for the search

print "What is the word you wish to count? ";

$word = <STDIN>; #read the word from the keyboard input

chomp $word; #remove the trailing line return

$word = lc($word);

$copy = ' ' . $word . ' ';


# Block 3. Open the text file and get a line from it

open text_in, "< text.txt" or die "Cannot open text.txt: $!";

while ($line = <text_in>) {

   chomp($line);

   $line = lc($line);

   $line = ' ' . $line . ' ';

   $line =~ s/['.!?]/ /g;


#Block 4. Test to see if the line contains the target word

foreach ($line =~ m/$copy/g ) {

   $frequencyCount = $frequencyCount + 1;

} #end foreach


} #end while

close text_in;


# Block 5. Display the results

print "The word \"$word\" occurred $frequencyCount times in the text.\n";



            The program uses some tricks that we have not needed before. The second block ends with the statement


$copy = ' ' . $word . ' ';


This line adds spaces before and after the target word. The periods are Perl’s concatenation operator, which joins the spaces to the word. This statement converts a word like ‘fish’ to ‘ fish ’ and prevents the program from counting the word ‘lungfish’ as an instance of the word ‘fish’. The third block opens the file text.txt for input and attaches it to the filehandle text_in. I used the text in Figure 5.2 as the sample text for testing the program. Any text will do as long as you save it as ‘text.txt’. The program then uses a while loop to read the sample text one line at a time. The final line in block three


   $line =~ s/['.!?]/ /g;


substitutes a space for every apostrophe or punctuation mark in $line. This statement provides a good example of Perl’s pattern matching capability. The characters must be enclosed in square brackets for Perl to search for either an apostrophe, an exclamation mark, a question mark or a period. Perl uses square brackets to define a character class. If you omitted the brackets, Perl would search for the literal string “'.!?” and probably not find it in any text. The g at the end of this statement tells Perl to repeat the search so that every punctuation mark in the line is changed to a space. Once this operation is complete, every word in the sample text will be surrounded by spaces, just like the target word.
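The substitution can be tried in isolation. In this sketch the sample sentence is invented for illustration:

```perl
# Replace every apostrophe, period, exclamation mark, and question
# mark in the line with a space, using a character class and /g.
my $line = "Don't stop! Why?";
$line =~ s/['.!?]/ /g;
print "$line\n";                    # prints "Don t stop  Why "
```

Note that the replaced punctuation can leave runs of two spaces, which is harmless here because the target word pattern only needs one space on each side.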

            The fourth block uses pattern matching to see if the line contains the target word. The foreach loop searches for all instances of the target word in $line. It uses the m operator to show that it is searching for a match and the g modifier to find all the occurrences of the target word in each line. When the program detects a match, it increments the $frequencyCount variable. Once the text file is finished, the program closes the file and prints the result. If everything works, the program should ask for a target word. Try counting the number of times that the pronoun ‘he’ appears in the text.
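The effect of the space padding on the count can be seen in a small sketch. The sample line is invented, and the counting loop follows the idiom of the fourth block:

```perl
# Invented sample line; every word is already surrounded by spaces.
my $line = ' the lungfish is a fish ';

my $bare = 0;
foreach ($line =~ m/fish/g) {       # also matches inside 'lungfish'
   $bare = $bare + 1;
}

my $padded = 0;
foreach ($line =~ m/ fish /g) {     # matches the whole word only
   $padded = $padded + 1;
}

print "$bare $padded\n";            # prints "2 1"
```

Without the surrounding spaces the pattern counts ‘lungfish’ as an instance of ‘fish’; with them it counts only the whole word.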


Figure 5.2 Sample text (Virginia Woolf, “Mrs. Dalloway”)


A magnificent figure he cut too, pausing for a moment (as the sound of the half-hour died away) to look critically, magisterially, at socks and shoes; impeccable, substantial, as if he beheld the world from a certain eminence, and dressed to match; but realized the obligations which size, wealth, health entail, and observed punctiliously, even when not absolutely necessary, little courtesies, old-fashioned ceremonies, which gave a quality to his manner, something to imitate, something to remember him by, for he would never lunch, for example, with Lady Bruton, whom he had known these twenty years, without bringing her in his outstretched hand a bunch of carnations, and asking Miss Brush, Lady Bruton's secretary, after her brother in South Africa, which, for some reason, Miss Brush, deficient though she was in every attribute of female charm, so much resented that she said 'Thank you, he's doing very well in South Africa,' when, for half-a-dozen years, he had been doing badly in Portsmouth.



Lexical Frequencies


            Our frequency program only calculates the frequency of a single word at a time. It would be more useful if we modified the program so that it displayed the frequencies of all the words in a text. The program in Figure 5.3 shows how to modify the initial program so that it includes all the words in the frequency count. I also modified the text input statement in the new program so that it prompts the user to type in the filename for the text.


Figure 5.3. Lexical Frequency Program


#!/usr/local/bin/perl

#

# freq2.pl

# Modified version of freq.pl

# This version determines the frequency of all the words in a text


# Clear word hash

%words = ();


print "What is the text you wish to analyze? \n";

print "Please include the filename and extension \n\n";

 

$textfile = <STDIN>;                              # Read the filename from the keyboard input

chomp $textfile;                                  # Remove line return

print "Now analyzing lexical frequency in $textfile\n\n";


open text_in, "< $textfile" or die "Cannot open $textfile: $!";

while ($line = <text_in>) {

   chomp($line);                                                           # Remove line returns

   $line = lc($line);                                                       # Change to lower case

   $line = ' ' . $line . ' ';                                                              # Add spaces

   $line =~ s/["(),;:.!?]/ /g;                                            # Remove punctuation

   @words = split (" ", $line );                                                 # Put words into an array

      foreach $word (@words) {

         $words{$word} = $words{$word} + 1;             # Put word and word frequency in hash

      } #end foreach word

} #end while


close text_in;


# Display the results

 

foreach $word (sort keys %words) {                          # Sort the word hash

   print "$word $words{$word}\n";                         # Print word and frequency

   }



            Comparing the two lexical frequency programs shows that only a few modifications of the original program are necessary to produce a much more powerful analysis program. I made use of one of Perl’s most useful data structures, the hash. I created the hash %words in which each word serves as a key and its frequency is referenced by the variable $words{$word}. I used the temporary array @words to store the words in each line of the text until they could be entered in the hash.
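The hash technique can be reduced to a few lines. This sketch uses three made-up words in place of a text file:

```perl
# Count word frequencies in a hash, then print them in sorted order.
my %words = ();

foreach my $word ('the', 'cat', 'the') {
   $words{$word} = $words{$word} + 1;    # increment the word's count
}

foreach my $word (sort keys %words) {
   print "$word $words{$word}\n";        # prints "cat 1" then "the 2"
}
```

Each distinct word becomes a key, so the hash grows to exactly the size of the vocabulary in the text, no matter how long the text is.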



Phonological Analysis


            The first step in any phonological analysis is to organize the words in a language sample by their sounds (cf. Ingram 1981). We can apply the same Perl procedures we used in counting lexical frequency to the process of organizing lexical data for phonological analysis. I am most familiar with analyzing the phonology of children’s words. Children learning to speak a language will often produce words differently from the way their parents say the words. I provide an example of a child’s lexicon in Figure 5.4. The child’s words are transcribed in a form of the phonetic alphabet that is easy to write using a computer keyboard (see Pye 1987). The sample includes the adult orthographic form for comparison.


Figure 5.4 A Child’s Phonetic Lexicon


         Adult Form       Child's Form


            a                      6

            airplane            eytHeyn

            and                  eynd

            another            n6nE

            banana            nAnA

            basket             bAsAtH

            bike                 BeykH

            boat                 bowt

            book                bEkH

            call                  kHow

            car                   kHuA

            corn                kown6

            danceing         dAnsin

            it                     In

            laces                maysEn

            let's                 E7

            little                Iy6

            my                  may

            no                    nu


            We would like to write a Perl program that groups together the child’s words by their sound. For example, the program should group together the child’s words for basket, boat and book since the child produced all of these words with an initial sound [b]. I provide an example program for phonological analysis in Figure 5.5. It analyzes the words in Figure 5.4 saved in the file ‘Daniel.txt’.


Figure 5.5. A Perl Phonological Analysis Program


#!/usr/local/bin/perl

# phon.pl

# This program produces a phonetic inventory

# It analyzes data from the file Daniel.txt


@start = ('p','t','k','b','d','g','m','n','N');

@sounds = reverse( @start );


# Read in the data file


open text_in, "< Daniel.txt" or die "Cannot open Daniel.txt: $!";


$line = <text_in>; #Read the first four information lines

$line = <text_in>;

$line = <text_in>;

$line = <text_in>;


while ($line = <text_in>) {

   chomp($line);

   $line =~ m/(\w+)\s+(\w+)/ ;                      #get the words

   @words = ($1, $2);

   $words[1]=~ m/(\w)/;                       #get the initial sound

   $phon{$1} = $phon{$1} . $words[1] . ' ';   #add word to sound hash

}


# Display the results


while ( @sounds ) {

   $first = pop( @sounds );

   $second = pop( @sounds );

   $third = pop( @sounds );


   @words1 = split( / /, $phon{$first} );

   @sorted1 = sort @words1;


   @words2 = split( / /, $phon{$second} );

   @sorted2 = sort @words2;


   @words3 = split( / /, $phon{$third} );

   @sorted3 = sort @words3;


   print "\n$first: \t\t $second: \t\t $third: \n";


   $big = $#sorted1;

   if ($#sorted2 > $big) {

      $big = $#sorted2;

   }

   if ($#sorted3 > $big) {

      $big = $#sorted3;

   }


   for ($i = 0; $i <= $big; $i = $i + 1 ) {

      print " $sorted1[$i]\t\t $sorted2[$i]\t\t $sorted3[$i]\n";

   }


} #end while



            The program begins by creating the array @start for the sounds that we wish to analyze. I have only included nine sounds in this array, but it can be expanded to other sounds. I immediately apply the reverse function to this array in order to arrange its contents for access via the pop function. The program then opens the text file ‘Daniel.txt’ for input and reads each line. The line

 

   $line =~ m/(\w+)\s+(\w+)/ ;                      #get the words

 

uses the special character sequence \w+ to match the words in the line. Each \w+ is enclosed in parentheses so that Perl will store the matching results in its implicit variables $1 and $2. The special character sequence \s+ tells Perl that the words are separated by one or more whitespace characters. Once the program extracts the words, it stores them in the array @words via the command


   @words = ($1, $2);
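The two steps can be tried together on one line from Figure 5.4, pasted in here as a literal string for illustration:

```perl
# Capture the adult form and the child's form from one data line.
my $line = 'airplane            eytHeyn';
my @words = ();

if ($line =~ m/(\w+)\s+(\w+)/) {
   @words = ($1, $2);               # ('airplane', 'eytHeyn')
}

print "@words\n";                   # prints "airplane eytHeyn"
```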


The program extracts the first sound from each of the child’s words in the command

 

   $words[1]=~ m/(\w)/;                       #get the initial sound


Notice that this line uses the special character \w rather than \w+, so it matches only the initial character of the word. The program then stores the child’s word in the hash %phon using the initial sound as the hash key. I used the concatenation operator in this line to separate each word by a space. The line is

 

   $phon{$1} = $phon{$1} . $words[1] . ' ';   #add word to sound hash


Once the child’s word and initial sound have been stored in the hash, the program is ready to read in the next line of the language sample. The program displays the results by popping sounds from the @sound array and using them as keys to access the words stored in the %phon hash. The program uses the split function to create the @words arrays by splitting each hash entry at the spaces


   @words1 = split( / /, $phon{$first} );


Perl uses the characters between the slashes in the split function to separate the string stored in the hash entry. Converting each hash entry to an array in this fashion makes it possible to apply the sort function to the words stored in each hash entry. The program uses the special character \t to tab the results in the printout. The program produces the output shown in Figure 5.6.


Figure 5.6. Output from the program Phon.pl

 

p:                    t:                     k:

                                                 kHow

                                                 kHuA

                                                 kown6

 

b:                    d:                    g:

 bAsAtH           dAnsin

 bEkH

 bowt

 

m:                   n:                    N:

 may                  n6nE

 maysEn            nAnA

                         nu



Sentence Complexity


            Researchers studying language acquisition often use the average length of children’s sentences in words as a measure for comparing children’s syntactic development. We can use the same procedures we made use of in the preceding programs to write a Perl program that calculates the average sentence length from a sample of a child’s speech. I use a small portion of a language sample collected by Roger Brown (1973) from the child Adam for this program. The language sample is shown in Figure 5.7.


Figure 5.7. A Language Sample from Brown (1973)


@UTF8

@Begin

@Languages:  en

@Participants:            CHI Adam Target_Child, MOT Mother, URS Ursula_Bellugi

            Investigator, RIC Richard_Cromer Investigator

@ID:   en|brown|CHI|2;5.12|male|normal|middle_class|Target_Child||

@ID:   en|brown|MOT|||||Mother||

@ID:   en|brown|URS|||||Investigator||

@ID:   en|brown|RIC|||||Investigator||

@Time Duration:       10:30-11:30

*CHI:  come Cromer ?

%mor: v|come n:prop|Cromer ?

*CHI:  where dat come from ?

%mor: adv:wh|where det|that v|come prep|from ?

*MOT:            what happened ?

%mor: pro:wh|what v|happen-PAST ?

*CHI:  train track .

%mor: n|train n|track .

*MOT:            oh # is that a train whistle ?

%mor: co|oh v|be&3S pro:dem|that det|a n|train n|whistle ?

*MOT:            what's its name ?

%mor: pro:wh|what~v|be&3S pro:poss:det|its n|name ?

*CHI:  horn .

%mor: n|horn .

%spa:  $RES

*MOT:            oh # it's a horn .

%mor: co|oh pro|it~v|be&3S det|a n|horn .

*CHI:  what dat from ?

%mor: pro:wh|what pro:dem|that prep|from ?

*CHI:  hole in (th)ere ?

%mor: n|hole prep|in adv:loc|there ?

*MOT:            hole ["] .

%mor: n|quote .

*MOT:            that's what makes it whistle .

%mor: pro:dem|that~v|be&3S pro:wh|what v|make-3S pro|it v|whistle .

*CHI:  pipe .

%mor: n|pipe .

*MOT:            is that a pipe ?

%mor: v|be&3S pro:dem|that det|a n|pipe ?

*MOT:            there's a shell on the end .

%mor: pro:exist|there~v|be&3S det|a n|shell prep|on det|the n|end .

*CHI:  what shell doing ?

%mor: pro:wh|what n|shell part|do-PROG ?

%sit:   Richard Cromer is checking his watch

*CHI:  checking clock .

%mor: part|check-PROG n|clock .

%sit:   Richard Cromer is checking his watch

*MOT:            no .

%mor: co|no .

*MOT:            he's checking his watch .

%mor: pro|he~v|be&3S part|check-PROG pro:poss:det|his n|watch .

*CHI:  make [?] [!!] noise .

%mor: v|make n|noise .

*MOT:            <make a noise> ["] ?

%mor: n|quote ?

*CHI:  what dat ?

%mor: pro:wh|what pro:dem|that ?

*MOT:            go and ask Ursula about that .

%mor: v|go conj:coo|and v|ask n:prop|Ursula prep|about pro:dem|that .

*CHI:  what dat for ?

%mor: pro:wh|what pro:dem|that prep|for ?

*URS: that's the hole you blow into .

%mor: pro:dem|that~v|be&3S det|the n|hole pro|you v|blow prep|into .

*CHI:  happen ?

%mor: v|happen ?

*CHI:  happen paper ?

%mor: v|happen n|paper ?

*CHI:  happen ?

%mor: v|happen ?

*MOT:            what happened to the paper ?

%mor: pro:wh|what v|happen-PAST prep|to det|the n|paper ?

*CHI:  kangaroo .

%mor: n|kangaroo .

*MOT:            kangaroo ["] ?

%mor: n|quote ?

*CHI:  kangaroo .

%mor: n|kangaroo .

%spa:  $IMIT

*CHI:  what happen .

%mor: pro:wh|what v|happen .

*URS: what happened # Adam ?

%mor: pro:wh|what v|happen-PAST n:prop|Adam ?

*MOT:            who had a whistle like that ?

%mor: pro:wh|who v:aux|have&PAST det|a v|whistle prep|like pro:dem|that ?

*CHI:  xxx .

%mor: unk|xxx .

%spa:  $RES

*CHI:  come from ?

%mor: v|come prep|from ?

*CHI:  taper@c ?

%mor: chi|taper ?

*MOT:            no # it's tape .

%mor: co|no pro|it~v|be&3S n|tape .

*MOT:            how many are there ?

%mor: adv:wh|how qn|many v|be&PRES adv:loc|there ?

*CHI:  come from # taper@c ?

%mor: v|come prep|from chi|taper ?

*CHI:  put dat in (th)ere ?

%mor: v|put&ZERO pro:dem|that prep|in adv:loc|there ?

*CHI:  come from ?

%mor: v|come prep|from ?

*CHI:  come from # for ?

%mor: v|come prep|from prep|for ?

*CHI:  come from ?

%mor: v|come prep|from ?

*CHI:  come from for ?

%mor: v|come prep|from prep|for ?

*CHI:  green box .

%mor: adj|green n|box .

*CHI:  green .

%mor: n|green .

*CHI:  dat box .

%mor: pro:dem|that n|box .

*CHI:  get open .

%mor: v|get v|open .

*CHI:  open # Mommy .

%mor: v|open n:prop|Mommy .

*MOT:            oh no # she doesn't need that now .

%mor: co|oh co|no pro|she v:aux|do&3s~neg|not v|need det|that adv|now .

*CHI:  oh # cup .

%mor: co|oh n|cup .

*URS: this is a cork .

%mor: pro:dem|this v|be&3S det|a n|cork .

*URS: you say # cork ["] .

%mor: pro|you v|say n|quote .

*CHI:  cork .

%mor: n|cork .

%spa:  $RES $IMIT

*URS: put the cork on the cup .

%mor: v|put&ZERO det|the n|cork prep|on det|the n|cup .

*CHI:  0 .

%act:  puts cork on cup

*URS: Adam # can you put the cork in the cup ?

%mor: n:prop|Adam v:aux|can pro|you v|put&ZERO det|the n|cork prep|in det|the n|cup ?

*CHI:  0 .

%act:  puts cork in cup

*URS: again .

%mor: adv|again .

%sit:   Adam doesn't put the cork on the cup

*URS: put the cork on the cup .

%mor: v|put&ZERO det|the n|cork prep|on det|the n|cup .

%sit:   Adam doesn't put the cork on the cup

*URS: can you put it on top of the cup ?

%mor: v:aux|can pro|you v|put&ZERO pro|it prep|on n|top prep|of det|the n|cup ?

%sit:   Adam doesn't put the cork on the cup

*URS: put the cork on the cup .

%mor: v|put&ZERO det|the n|cork prep|on det|the n|cup .

*URS: ok .

%mor: co|ok .

*CHI:  I caught it .

%mor: pro|I v|catch&PAST pro|it .

*CHI:  upsadaisy caught it .

%mor: co|upsadaisy v|catch&PAST pro|it .

*URS: Adam # put the penny in the cup .

%mor: n:prop|Adam v|put&ZERO det|the n|penny prep|in det|the n|cup .

*CHI:  0 .

%com:            puts penny in cup

*URS: put the penny on the cup .

%mor: v|put&ZERO det|the n|penny prep|on det|the n|cup .

*CHI:  jack+o+lantern .

%mor: n|+n|jack+conj|o+n|lantern .

%sit:   doesn't put cork on the cup

*MOT:            jack+o+lantern # yes .

%mor: n|+n|jack+conj|o+n|lantern co|yes .

*MOT:            where are the pennies ?

%mor: adv:wh|where v|be&PRES det|the n|penny-PL ?

*CHI:  don('t) know .

%mor: v:aux|do~neg|not v|know .

%spa:  $RES

*MOT:            you don't know ?

%mor: pro|you v:aux|do~neg|not v|know ?

*CHI:  where penny go ?

%mor: adv:wh|where n|penny v|go ?

*MOT:            I don't know .

%mor: pro|I v:aux|do~neg|not v|know .

*MOT:            I hear them but I don't see them .

%mor: pro|I v|hear pro|them prep|but pro|I v:aux|do~neg|not v|see pro|them .

*CHI:  see [?] penny .

%mor: v|see n|penny .

*MOT:            you'd forgotten you had the cap on top .

%mor: pro|you~v:aux|will&COND v|forget&PERF pro|you v|have&PAST det|the n|cap prep|on n|top .

*CHI:  top .

%mor: n|top .

*CHI:  (po)tato (po)tato (po)tato (po)tato .

%mor: n|potato n|potato n|potato n|potato .

*MOT:            potato ["] ?

%mor: n|quote ?

*CHI:  (po)tato .

%mor: n|potato .

%spa:  $RES $IMIT

*MOT:            potato ["] ?

%mor: n|quote ?

*CHI:  potato # yeah .

%mor: n|potato co|yeah .

%spa:  $RES

*CHI:  where penny go ?

%mor: adv:wh|where n|penny v|go ?

*CHI:  turkey pine .

%mor: n|turkey n|pine .

*MOT:            <turkey pie> ["] ?

%mor: n|quote ?

*CHI:  turkey pine .

%mor: n|turkey n|pine .

%spa:  $IMIT

*MOT:            what about turkey pie ?

%mor: pro:wh|what prep|about n|turkey n|pie ?

*CHI:  turkey pine .

%mor: n|turkey n|pine .

%spa:  $IMIT

*CHI:  hot .

%mor: adj|hot .

%spa:  $IMIT

*CHI:  xxx roadgrader .

%mor: unk|xxx n|roadgrader .

%spa:  $IMIT

*MOT:            it was hot and he put it in the refrigerator # the cup .

%mor: pro|it v|be&PAST&13S adj|hot conj:coo|and pro|he v|put&ZERO pro|it prep|in det|the n|refrigerator det|the n|cup .

*CHI:  somebody .

%mor: pro:indef|somebody .

*CHI:  what dat some ?

%mor: pro:wh|what pro:dem|that pro:indef|some ?

*URS: <what dat some> ["] ?

%mor: n|quote ?

*URS: those are sardines .

%mor: pro:dem|those v|be&PRES n|sardine-PL .

*CHI:  sardine .

%mor: n|sardine .

%err:   <1> sardin = saUrdin $PHO ;

%spa:  $IMIT

*CHI:  op(en) dat .

%mor: v|open pro:dem|that .

*CHI:  open # Mommy ?

%mor: v|open n:prop|Mommy ?

*CHI:  sarbaby@wp [?] ?

%mor: wplay|sarbaby ?

*MOT:            sardines .

%mor: n|sardine-PL .

*CHI:  sardines .

%mor: n|sardine-PL .

%err:   <1> sardinz = saUrdinz $PHO ;

*MOT:            no # not sardines # sardines .

%mor: co|no neg|not n|sardine-PL n|sardine-PL .

%err:   <3> sardinz = saUrdinz $PHO ;

*CHI:  sardine .

%mor: n|sardine .

%spa:  $sc=1 $IMIT

*CHI:  turkey pine .

%mor: n|turkey n|pine .

*MOT:            what about turkey pie ?

%mor: pro:wh|what prep|about n|turkey n|pie ?

*MOT:            do you like turkey pie ?

%mor: v|do pro|you v|like n|turkey n|pie ?

*CHI:  yeah .

%mor: co|yeah .

%spa:  $RES

*CHI:  turkey [?] pine [?] turkey [?] pine [?] .

%mor: n|turkey n|pine n|turkey n|pine .

*MOT:            <turkey pie> ["] ?

%mor: n|quote ?

*CHI:  turkey [?] .

%mor: n|turkey .

%err:   <1> t3rki = t3rkIN $PHO ;

*CHI:  turkey pie .

%mor: n|turkey n|pie .

*CHI:  turkey pine turkey pine .

%mor: n|turkey n|pine n|turkey n|pine .

*MOT:            <turkey pie> ["] ?

%mor: n|quote ?

*CHI:  yeah .

%mor: co|yeah .

%spa:  $RES

*CHI:  pine [?] turkey [?] .

%mor: n|pine n|turkey .

*CHI:  where penny go ?

%mor: adv:wh|where n|penny v|go ?

*CHI:  there find .

%mor: adv:loc|there n|find .

*MOT:            did you find the penny ?

%mor: v|do&PAST pro|you v|find det|the n|penny ?

*CHI:  find penny .

%mor: v|find n|penny .


            This language sample is more complicated than any we have dealt with to this point since it contains utterances from several speakers. Since we are only interested in analyzing the child’s utterances, we need a command that searches each line of the input text for the character string ‘*CHI:  ’. This string includes a tab character after the colon, which is indistinguishable from an ordinary space on the computer screen; in a Perl pattern a tab can also be written explicitly as \t. We can use the command


   if ( $line =~ s/\*CHI:\t// ) {     #get the child's lines


to check whether the line begins with this sequence. This command uses the substitution operator s/// to look for the string ‘*CHI:  ’ and replace it with the null string. In this way we eliminate the speaker code at the same time that we identify the child’s utterances.
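The test-and-strip step can be sketched with one line from Figure 5.7, assuming the transcript separates the speaker code from the utterance with a tab (here written explicitly as \t):

```perl
# Strip the speaker code '*CHI:' plus the tab that follows it.
my $line = "*CHI:\tcome Cromer ?";

if ($line =~ s/\*CHI:\t//) {        # true only for the child's lines
   print "$line\n";                 # prints "come Cromer ?"
}
```

Lines from other speakers fail the substitution, so the body of the if statement never runs for them.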

            It would be convenient to use the split function to store the child’s words in an array, but before we do this, we need to eliminate any lines that have unintelligible words (identified by the sequence ‘xxx’) or questionable words (identified by the code ‘[?]’). This can be accomplished by using a Perl command to include lines that do NOT match these codes, e.g.


      if ( $line !~ m/xxx/ ) {                    #don't count xxx lines


This command uses the !~ operator to search for lines that do not contain the sequence ‘xxx’. The gap code ‘#’ should also be eliminated from the child’s lines before we count the number of words.
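The same filtering works for the questionable-word code, but note that the square brackets and the question mark are regex metacharacters, so matching the literal code [?] requires backslashes. The sample lines below are invented for illustration:

```perl
# Keep only the lines that do NOT contain the code [?].
my $questionable = 'make [?] noise .';
my $clear        = 'find penny .';

foreach my $line ($questionable, $clear) {
   if ($line !~ m/\[\?\]/) {        # backslashes match literal [ ? ]
      print "counting: $line\n";    # prints "counting: find penny ."
   }
}
```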

            Each word in the file is separated from its neighbors by whitespace. This makes it easy to store the words in an array using the split function, as shown in the following command.

 

       @words = split( " ", $line );          #put words in array


The number of words in the sentence is given by $#words, the index of the last element in the array @words. This index equals the number of words because each utterance ends with a punctuation mark, and that mark occupies the final slot in the array. The rest of the procedure is straightforward. I provide a program for calculating average sentence lengths or MLUw in Figure 5.8.
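The counting trick can be checked with one of Adam's utterances, pasted in as a literal string:

```perl
# Split one utterance on whitespace; the final '?' occupies the
# last array slot, so $#words equals the number of real words.
my @words = split( " ", 'where dat come from ?' );

print scalar(@words), " elements\n";   # prints "5 elements"
print "$#words words\n";               # prints "4 words"
```

Splitting on the string " " (rather than the pattern / /) tells Perl to split on runs of whitespace and to skip any leading whitespace, so stray double spaces do not produce empty elements that would inflate the count.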



Figure 5.8 A Perl Program for Calculating MLUw


#!/usr/local/bin/perl

# mluw.pl

# This program calculates the average sentence length in words

# It analyzes data from the file Adam05.txt


# Read in the data file and search for child's lines (*CHI:)


open text_in, "< Adam05.txt" or die "Cannot open Adam05.txt: $!";


while ($line = <text_in>) {

   chomp($line);


   if ( $line =~ s/\*CHI:\t// ) {     #get the child's lines

      $n = $n + 1;                                    #count child's lines


      if ( $line !~ m/xxx/ ) {                    #don't count xxx lines


         if ( $line !~ m/\[\?\]/ ) {     #don't count [?] lines


            $line =~ s/#//g ;                          #don't count #

            @words = split( " ", $line ); #put words in array


            $sent[$#words] = $sent[$#words] . $n . ' '; #sentence nos

            $count[$#words] = $count[$#words] + 1; #no of sentences

            $num[$#words] = $#words;                #no of words


         } #end [?] if


      } #end xxx if


   } #end child's if


} #end while


# Print the results


@out = reverse( @num );                   #start at the beginning


print "\nMLUw count\n";

print "\nNumber of \t Number of \tSentence";

print "\nWords \t\t Sentences \tNumbers\n";


$i = pop( @out );                    #eliminate zero element


while ( @out ) {


   $i = pop( @out );

   $wordtotal = $wordtotal + $i * $count[$i];

   $senttotal = $senttotal + $count[$i];


   @nums = split( / /, $sent[$i] );


   print "\n$i \t\t $count[$i] \t\t";


   for ($j = 0; $j < 13; $j = $j + 1) {

      print "$nums[$j] ";

   }


   print "\n\t\t\t\t";


   for ($j = $j; $j <= $#nums; $j = $j + 1) {

      print "$nums[$j] ";

   }


   print "\n";


} #end while


$MLUw = $wordtotal / $senttotal;

print "MLUw = $MLUw\n";



Lexical Concordances


            A lexical concordance is one of the most useful text analysis procedures available. A concordance provides the user with information about the context in which each word appears, making it easy to find words that the speaker has used in unusual ways, such as using a common noun as a verb. As a final example of Perl’s lexical utility, I provide a program in Figure 5.9 that produces a concordance for the text that a user specifies in the keyboard input. The program’s output displays each word together with every line (and its line number) in which the word appears.
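The program stores all of a word's context lines in a single hash string, separated by '#' characters, and later splits that string back into an array for display. A minimal sketch of the technique, with two invented line numbers and lines:

```perl
# Accumulate context lines for one word in a single hash string,
# then split the string back into an array of lines.
my %words = ();

$words{'penny'} = $words{'penny'} . '12 ' . 'where penny go ?' . '#';
$words{'penny'} = $words{'penny'} . '15 ' . 'find penny .' . '#';

my @lines = split( "#", $words{'penny'} );

foreach my $line (@lines) {
   print " $line\n";                # prints each stored context line
}
```

This scheme assumes that '#' never appears in the text itself; a text containing '#' would need a different separator.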



Figure 5.9. A Perl Concordance Program


#!/usr/local/bin/perl

#

# concord.pl

# This program produces a lexical concordance

# for the text named by the user


# Clear word hash

%words = ();


print "What is the text you wish to analyze? \n";

print "Please include the filename and extension \n\n";

 

$textfile = <STDIN>;                              # Read the filename from the keyboard input

chomp $textfile;                                  # Remove line return

print "Now analyzing $textfile\n\n";


open text_in, "< $textfile" or die "Cannot open $textfile: $!";

while ($line = <text_in>) {

   $line_no = $line_no + 1;                  # Count the line number

   chomp($line);                                   # Remove line returns

   $copy = $line;                                   # Copy the original line

   $line = lc($line);                               # Change to lower case

   $line = ' ' . $line . ' ';                                      # Add spaces

   $line =~ s/["(),;:.!?]/ /g;                    # Remove punctuation

   @words = split (" ", $line );                         # Put words into an array

      foreach $word (@words) {

         $words{$word} = $words{$word} . $line_no . ' ' . $copy . '#';              # Put lines in hash

      } #end foreach word

} #end while


close text_in;


# Display the results

 

foreach $word (sort keys %words) {              # Sort the word hash

   @lines = split ("#", $words{$word} );         # Put lines into an array

   print "$word\n";                                            # Print word

      foreach $line (@lines) {

         print " $line\n";                                      # Print line

      } #end foreach line

} #end foreach word



            These programs introduced some common programming techniques for text analysis. Text analysis programs often have to choose between case-sensitive and case-insensitive procedures. The programs illustrate one method for implementing a case-insensitive analysis by converting all upper-case characters to lower case before performing the analysis. Another common problem in text analysis is finding the target word in all of its contexts. Words are usually preceded and followed by spaces, but not always. The sample programs demonstrate one method for handling the exceptions by converting all punctuation marks to spaces and then extracting groups of letters surrounded by spaces. This method prevents the program from counting the word ‘the’ as an instance of the pronoun ‘he’, among other things.
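The case conversion used throughout these programs is a single built-in function. A one-line sketch, with an invented phrase:

```perl
# lc converts every upper-case character to lower case.
my $line = lc('The Hours');
print "$line\n";                    # prints "the hours"
```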

            It is easy to see how the same basic tricks can be used in new ways to enhance a program’s output. The programs in this chapter demonstrate how little Perl is necessary to generate sophisticated text analyses. They demonstrate ways in which Perl programs can obtain input from users and display responses on the terminal or in an output file. The programs show how to store information in variables and how to open and close files. One feature of the programs is the use of pattern matching to replace punctuation marks with spaces and to search for all instances of a target word in a line of text. There is more to learn about each of these Perl features, but we can accomplish quite a lot with the Perl statements demonstrated in these programs.



References


Brown, R. 1973. A First Language: The Early Stages. Cambridge, MA: Harvard University Press.

Ingram, D. 1981. Procedures for phonological analysis of children's language. Baltimore, MD: University Park Press.

Pye, C. 1987. PAL: The Pye Analysis of Language. Lawrence, KS.



On Line Text Resources


A considerable number of texts are now available in electronic form on line. The following list (courtesy of the Center for Research in Language) provides links to some of these databases.


CHILDES Child Language Data Exchange System. Child language productions from a variety of researchers.
http://childes.psy.cmu.edu/
WordNet 1.6 Lexical database (See http://www.cogsci.princeton.edu/~wn/)
Penn Treebank Penn's Linguistic Data Consortium (LDC) collection, including Brown (Kucera-Francis); Wall Street Journal, and other sources; some text is parsed and can be searched with the tgrep program. (See http://www.ldc.upenn.edu/)
North American News Text Corpus Large (~350 million word) corpus of newswire text; CD-ROM. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T21)
Spanish Language News Corpus Large (~172 million word) corpus of newswire text; CD-ROM. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T9)
European Languages News Corpus ~100 million words of French, 90 million words of German, and 15 million words of Portuguese; newswire text; CD-ROM. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T11)
Hansard Parallel Text in English and French Parallel English/French texts drawn from Canadian Parliament discussions; CD-ROM. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T20)
CELEX Lexical databases (word lemmas, phonology, morphology, frequency) for Dutch, German, and English. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L14)
British National Corpus (100 million word searchable corpus; Windows software for more extensive searching is also available)
http://sara.natcorp.ox.ac.uk/lookup.html
MRC Psycholinguistic Database Interface (Kucera-Francis; number of letters/phonemes/syllables; ratings of word familiarity, concreteness, imagability, meaningfulness; age of acquisition; etc.)
http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm
Oxford Text Archive (a collection of several thousand electronic texts and linguistic corpora)
http://ota.ahds.ac.uk/
Association for Computational Linguistics http://www.aclweb.org/
Institute for Scientific Information The Web of Science Citation Databases
http://isi1.isiknowledge.com/portal.cgi
PubMed http://www.ncbi.nlm.nih.gov/PubMed/