Perl Basics
Last week I demonstrated a technique for collecting information from web users. The only ingredient we are missing is a computer program that provides instructions for recording the data we collect. This is where a little knowledge of the computer programming language Perl comes in handy.
I think of a computer program as a recipe for creating information out of chaos. The words in a text or a set of user responses provide the raw ingredients for our information confection. The computer program provides the computer with the instructions it needs to create the final masterpiece out of the raw ingredients. We must be careful to provide the computer with a complete set of instructions, otherwise it is liable to produce more chaos than information. On the other hand, none of the information ingredients we use are dangerous. We can experiment by putting our program together in different ways without worrying about it blowing up in our face. There are no moving parts in the computer that we can break, so we are free to try different instructions until we find a combination that produces a result we can live with.
Unfortunately, computers have a very low tolerance for mistakes—none actually. The smallest misstep in our instructions provides the computer with an excuse to respond with such alarmist decrees as ‘Warning, syntax error.’ A misplaced comma can create this kind of havoc, and missing quotation marks are anathema. The hardest part of computer programming in the beginning is to make sure that you have typed in the program statements EXACTLY the way they appear in the examples. One of the essential skills for programming is the ability to spot where such deviations occur and fix them. This step is known as debugging, and skillful debuggers are worth their weight in memory chips. I will suggest a few techniques for finding programming errors that will, hopefully, save you a few sleepless nights.
Perl stands for Practical Extraction and Report Language. It was created by Larry Wall in 1987 to fill the gap between basic UNIX instructions and high-level programming languages such as BASIC. It quickly became one of the more widely used programming languages for retrieving and displaying information from databases on the web. Programs that handle the interface between the client’s computer and a web server are referred to collectively as the Common Gateway Interface (CGI). Perl provides one tool, albeit a powerful one, for CGI programming.
One of the features that I like most about Perl is that it provides a relatively simple set of procedures for text processing. These tools make it easy, with a modicum of programming, to input a text to the Perl program, extract its critical information and output the results. The combination of Perl’s text handling abilities and its CGI features makes it a good language for exploring natural language processing techniques on the web.
I need to add one word of caution at this point. There are probably as many different approaches to programming in Perl as there are Perl programmers. I approach Perl programming in this class from the perspective of producing programs that are optimally transparent to a beginning programmer. This approach uses more complicated statements and characters than the minimum that are strictly necessary, but the advantages of transparency far outweigh elite programming beliefs in an introductory course. This said, students without any previous programming experience are still apt to find some Perl statements to be obscure, if not downright offputting. I will do my best to explain Perl’s quirks and eccentricities.
We can begin learning Perl by writing a simple program. Figure 3.1 shows what a simple Perl program looks like. The numbers to the left of the program statements are not part of the program, so you should not type them when entering your program. I will use the numbers to refer to specific lines of the programs.
Figure 3.1. A First Perl Program
1 #!/usr/local/bin/perl
2 #hello.pl
3 #The first Perl program
4 print "Hello world!\n";
The first line in the program is the standard beginning of all Perl programs. It tells the computer’s operating system where to find the Perl interpreter. The Perl interpreter may be in a different location on your computer, so it is important to learn in which directory the Perl interpreter resides on your computer. Fortunately, most Unix and Linux systems put the Perl interpreter in a standard location. On a computer using Microsoft Windows, the Perl interpreter would probably be found in the c:\perl\bin directory. The Perl interpreter available from ActiveState (www.activestate.com) includes a directory map for Windows computers so that you can use the first line as shown in Figure 3.1. The best procedure is to try writing the program as shown in Figure 3.1 to see if it will run on your computer. If it does not run, and you are using the Windows operating system, try altering the first line to:
#!c:\perl\bin\perl.exe
The second and third lines in the program illustrate the form of a comment line in Perl. Once we venture beyond simple programs, it becomes more and more important to document what function each section or line of the program serves. Most computer languages use comment lines for this purpose, and require each comment line to begin with a special character. Perl uses the pound sign (#) to distinguish comment lines from the program lines that serve other functions. It is a good idea to add comment lines at the beginning of the program that state the name of the program and what the program does. If the program processes data from other programs or files, you should add the names of these programs or files to the comment lines at the beginning of the program. Then, when we return to the program after a few months or years, we will have all the information we need to run the program.
The fourth line in the program actually performs an operation. In this case, it prints the message “Hello world!” on the computer screen. Print is the common command that programming languages use to display information. The print command at one time was used to actually print information on paper, but over time this command has been extended to displaying information on computer terminals and passing content information to a web browser. The sequence of characters at the end of the line requires some explanation. Perl uses the combination of a backslash (\) and n to “print” a new line. The backslash serves as an escape character in Perl. When the Perl interpreter finds a backslash inside a double-quoted string, it assumes that the following character should not be treated literally. In our example, the backslash tells the Perl interpreter to treat the following ‘n’ as a signal to add a new line rather than printing the letter ‘n’. We also need to add a semicolon (;) at the end of the line. All Perl command lines must have a semicolon at the end to signal that the interpreter has reached the end of the command.
This Perl command provides an introduction to the arbitrary world of computer programming. Since computers come with limited character sets, programmers must resort to such tricks as the use of escape characters. The Perl guru Larry Wall could have used any character to signal the difference between literal and nonliteral characters, but chose the backslash. Likewise, web browsers ignore blank lines everywhere else on web documents except for the line specifying the content type. Just as one becomes used to driving on one side of the road, it is possible to become comfortable with the conventions that creators of computer languages have established. After you read and write several Perl programs you will find it easy to interpret commands that contain a whole series of escape characters. And do not forget the semicolon!
One other peculiarity in this program is the semicolon at the end of line 4. The Perl interpreter relies on semicolons to recognize the end of each program statement. This makes it possible to put several statements in the same line, but in practice it would be hard for a human programmer to interpret program lines with several statements. Most programmers only put one statement on each line to make it easier to understand their own programs.
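As a small illustration of the escape characters and semicolons discussed above, the sketch below prints a few strings. The file name escapes.pl and the sample strings are my own inventions for illustration. The escapes \t (a tab) and \\ (a literal backslash) are standard Perl, although the chapter has not introduced them.

```perl
#!/usr/local/bin/perl
#escapes.pl (hypothetical file name)
#Illustrates escape characters and statement-ending semicolons

# \n starts a new line, so this statement prints two lines
print "first line\nsecond line\n";

# \t inserts a tab
print "language:\tPerl\n";

# \" prints a quotation mark instead of ending the string
print "She said \"Hello world!\"\n";

# \\ prints a single backslash
print "c:\\perl\\bin\n";

# Two statements can share a line because each ends with a semicolon
print "one "; print "two\n";
```

Try removing one of the backslashes or semicolons and running the program again to see how the interpreter reacts.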
Running a Perl Program
Now that you have seen a Perl program it is time to type the program onto your computer and run it. You should use an editor that produces a simple text file, such as Notepad on a Windows computer or Pico on a Unix system. Open your editor and type in the program as shown in Figure 3.1. When you are finished, save the program as a file named ‘hello.pl’. Perl programs conventionally end with the extension ‘pl’. Once you have saved the program, you can run it by typing ‘perl hello.pl’ and pressing the Enter key. If you are working with the Windows operating system, you will have to open a DOS window. This window should be accessible by clicking on the ‘Start’ tab, selecting ‘Programs’ and then clicking on the ‘MS-DOS Prompt’ tab. Once you have opened this window you will need to change to the directory where you saved your hello.pl program. Use the DOS command ‘cd directory’ to change to a directory labeled ‘directory’. Type in ‘perl hello.pl’ and you should see the message ‘Hello world!’ on your screen.
Assuming you were successful in running this program, it can be instructive to see what happens if you make a mistake. Sooner or later everyone makes a mistake in entering a program on the computer, and it is a good idea to learn how to figure out why the program did not run correctly. Try deleting the first quotation mark in the fourth line and then save and run the program. You should see the message
String found where operator expected at hello.pl line 4, at end of line
(Missing semicolon on previous line?)
Can't find string terminator '"' anywhere before EOF at hello.pl line 4.
In this case, Perl provides an extensive warning that states there is a problem with the fourth line of the program. The interpreter even suggests that a semicolon might be missing in the previous line. The last line of the message states that the interpreter could not find a second quotation mark (i.e., string terminator) before the end of file (EOF). This is the main indication that the program is missing a quotation mark.
Part of learning a programming language is learning the art of interpreting the error messages the interpreter produces when the program does not run correctly. You will need to practice correcting Perl programs to gain a better understanding of the relation between the error messages shown on the computer and the actual errors in your program. You can use the -w command line option to tell the interpreter to print any warning messages. Type ‘perl -w hello.pl’ to run the program with this option. In addition to the other error messages, this option adds the warning
Unquoted string "n" may clash with future reserved word at hello.pl line 4.
Such additional information may save you many hours of trying to fix your program.
Input
Now that we have a simple program up and running, it is time to try something more elaborate. Perl reserves the expression <STDIN> for reading a line of data from the standard input, which is usually the keyboard. This option allows us to write a program that requests information from a user and then displays the result on the screen (the standard output, or STDOUT). The program shown in Figure 3.2 takes advantage of this technique to request information from a user and display it on the screen.
Figure 3.2. An Interactive Perl Program
1 #!/usr/local/bin/perl
2 #name.pl
3 #An interactive Perl program
4 print "Hello, what is your name?\n";
5 $name = <STDIN>;
6 print "Hi $name, where do you come from?\n";
7 $country = <STDIN>;
8 print "I wish I could see $country some day!\n";
This program is not much more advanced than our first program, but it does a lot more. Once again the program begins with the standard Perl initial statement and comment lines that give the program’s name and function. The fourth line is a simple print statement that prompts the user to enter a name. The fifth line introduces three new elements. The first of these is the scalar variable ‘$name’. A scalar variable can refer to either a number or a sequence of characters. The program uses the assignment operator = to assign the string of characters that the user types on the keyboard to the scalar variable $name. Perl uses the expression <STDIN> to obtain the information from the keyboard input. In effect, the statement in line 5 tells the program to wait for some keyboard input and assign it to the variable $name.
A warning is useful here regarding the interpretation of the assignment operator = in Perl. You should avoid thinking of this character as an ‘equal’ sign, and instead interpret it as ‘gets the value’ or ‘is assigned the value’. In effect, new information flows from the right to the left in an assignment statement and this line tells the program to store the new information in the box labeled ‘$name’. Perl uses == (two equal signs) as an equality operator that tests whether two numbers are equal, and the operator eq to test whether two strings are identical. It is extremely easy to confuse these operators with the assignment operator. Sooner or later you are very likely to discover that your program is not running correctly because you mistakenly used the assignment operator instead of an equality operator or vice versa.
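The difference between assigning a value and testing a value is easier to see in a short program. The following is only a sketch: the file name compare.pl and the variable names are invented for illustration, and the lines ‘use strict’ and ‘use warnings’ and the word ‘my’ (which declares a variable) are common Perl practice that the figures in this chapter do not use. The if statement it relies on is explained later in the chapter.

```perl
#!/usr/local/bin/perl
#compare.pl (hypothetical file name)
#Contrasts the assignment operator with the comparison operators
use strict;
use warnings;

# = stores the value on the right in the variable on the left
my $count = 3;
my $name  = "Perl";

# == tests whether two numbers are equal
my $number_result = "different";
if ( $count == 3 ) {
    $number_result = "same";
}

# eq tests whether two strings are identical, including upper and lower case
my $string_result = "different";
if ( $name eq "perl" ) {
    $string_result = "same";
}

print "Numbers: $number_result\n";
print "Strings: $string_result\n";
```

The numbers are reported as the same, while the strings are reported as different because ‘Perl’ and ‘perl’ differ in case. Writing ‘$count = 3’ inside the parentheses by mistake would assign 3 to $count instead of testing it, and the condition would then always count as true.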
Line 6 shows how to display the input on the screen by means of Perl’s built-in print function. Any interactive program of this sort requires a prompt to tell the user what information is required, an assignment statement to store the input, and a statement to display the output. Line 6 displays the user’s name and prompts the user for their origin. You should enter this program in the computer and save it as ‘name.pl’.
When you run the program you will find that the output is not completely satisfactory. The program displays the output up to and including the new information, and then jumps to a new line to complete the rest of the output. What happened? The input that you provide via <STDIN> includes those characters AS WELL AS a line return character that was sent when you pressed the Enter key. This is such a common problem that Perl includes a function specifically tailored to strip the line return character from the end of keyboard input. This is the chop() function. Note that chop removes the last character of a string whatever that character is; Perl’s related chomp() function removes the last character only if it is a line return. You can chop the line return from the input by inserting the following line after line 5. Parentheses enclosing the argument to the function are optional.
chop ($name);
Saving the program and running it again should eliminate this problem, at least for the first input. I leave it to you to figure out how to correct the second output.
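To see exactly what chop does, it helps to try it on strings whose contents we already know. The sketch below uses made-up strings in place of keyboard input, and it also shows Perl’s chomp function, which removes a trailing line return but leaves any other final character alone. The file name chop.pl and the variable names are assumptions for illustration, as are ‘use strict’, ‘use warnings’ and ‘my’, which the chapter’s figures do not use.

```perl
#!/usr/local/bin/perl
#chop.pl (hypothetical file name)
#Compares chop and chomp on strings with and without a line return
use strict;
use warnings;

# What <STDIN> would hand us if the user typed Larry and pressed Enter
my $with_return = "Larry\n";
chop $with_return;              # removes the line return

# A string that has no line return at the end
my $chopped = "Larry";
chop $chopped;                  # removes the final letter instead

my $chomped = "Larry";
chomp $chomped;                 # removes nothing, no line return present

print "chop after Enter:  [$with_return]\n";
print "chop, no return:   [$chopped]\n";
print "chomp, no return:  [$chomped]\n";
```

The square brackets in the output make it easy to see where each string ends. The second result loses the letter ‘y’, which is why many programmers prefer chomp when they only want to remove the line return.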
Pattern Matching
One of the main advantages of Perl over other programming languages is its built-in pattern matching function. We will make use of Perl’s pattern matching abilities many times in each of the following chapters, so this is a good time to introduce the concept. A pattern match compares a group of characters (a string) to another group of characters (the pattern) to see if the string contains the pattern. This function would be useful, for example, if you wished to find the word ‘needle’ in the phrase ‘a needle in a haystack’. The program in Figure 3.3 provides an example of pattern matching.
Figure 3.3. Perl Pattern Matching
1 #!/usr/local/bin/perl
2 #match.pl
3 #Demonstrates Perl matching
4
5 print "Please type in a sentence\n";
6 $sentence = <STDIN>;
7 chop $sentence;
8 $sentence = lc($sentence);
9
10 print "Now type in a word\n";
11 $word = <STDIN>;
12 chop $word;
13 $word = lc($word);
14
15 if ( $sentence =~ m/$word/ ) {
16 print "\nThe word \"$word\" appears in the sentence \n\"$sentence\"\n";
17 } #end if word is in sentence
18
19 else {
20 print "\nThe word \"$word\" does not appear in the sentence \n\"$sentence\"\n";
21 } #end else
Once again we see the standard Perl opening and comment lines. The program prompts the user for a sentence in line 5 and assigns this sentence to the scalar variable $sentence in line 6. Line 7 removes the new line character from the string $sentence while line 8 converts all upper case characters in $sentence to lower case. Converting between upper and lower case is often necessary in text analysis to avoid treating words such as ‘Perl’ and ‘perl’ as distinct forms. Since this program is looking for a word that appears in a sentence, we want to find the word whether it appears in upper or lower case. The program simply converts both the sentence and the word to lower case.
The actual pattern matching takes place in line 15. Perl compares the string stored in the variable $word with the string stored in the variable $sentence. The pattern to be matched appears between the slashes in the matching operator m/ /. Perl requires the binding operator =~ to test if the sentence contains the word. In effect, the program compares the word against each character of the sentence moving from left to right one character at a time. If the word occurs two times in the sentence, the matching operator will pick out the first occurrence.
I embedded the pattern match in a control statement that tests whether the match is true. Control statements allow the program to branch in several directions depending on the condition set in the statement. In line 15, the if statement tests the condition of a match for the word in the sentence. When this condition is true, the program processes the statements between the following curly braces. When the condition is false, the program branches to the else statement in line 19 and processes the statements between its braces. If the program finds our word anywhere in the sentence, even as part of a longer word, it prints the response “The word ‘$word’ appears in the sentence ‘$sentence’”. If the program does not find our word, it prints the response “The word ‘$word’ does not appear in the sentence ‘$sentence’”.
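Because the match in Figure 3.3 looks for the pattern anywhere in the sentence, it will also succeed when the word is only part of a longer word. The sketch below contrasts that behavior with one possible way of insisting on a whole word. It relies on two standard Perl features that this chapter has not introduced: \b, which matches a word boundary, and \Q...\E, which tells Perl to treat any punctuation in $word as ordinary characters. The file name, the sample strings and the variable names are invented for illustration, as are ‘use strict’, ‘use warnings’ and ‘my’.

```perl
#!/usr/local/bin/perl
#wholeword.pl (hypothetical file name)
#Contrasts a plain match with a whole-word match
use strict;
use warnings;

my $sentence = "a needle in a haystack";
my $word     = "hay";

# Plain match as in Figure 3.3: succeeds because 'hay' is inside 'haystack'
my $plain_match = "no";
if ( $sentence =~ m/$word/ ) {
    $plain_match = "yes";
}

# Whole-word match: \b requires a word boundary on both sides of the word
my $whole_word_match = "no";
if ( $sentence =~ m/\b\Q$word\E\b/ ) {
    $whole_word_match = "yes";
}

print "Plain match for \"$word\": $plain_match\n";
print "Whole-word match for \"$word\": $whole_word_match\n";
```

With these sample values the plain match succeeds and the whole-word match fails. Changing $word to ‘needle’ makes both tests succeed.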
Control statements increase the flexibility of computer programs enormously, but they can also create major problems. Perl requires all of the statements that are processed if the condition holds to appear between a pair of curly braces. Omitting either one of these braces will result in a compilation error. Many programmers separate the block of control statements from the other lines in the program by indenting each control block three spaces. Another common practice is to place a comment after the final curly brace in the control block in order to identify which control statement the last brace is attached to. These programming conventions make it easier to correct syntax errors in complex control blocks.
The output statements in lines 16 and 20 are also worth noting. We were able to output the values of the scalar variables directly in our previous programs since we were using the words in the output sentences. In the context of reporting the success of our match, however, we should place the word and sentence we are comparing in quotes. We run into a problem, though, if we just put ordinary quotes around the scalar variables as in the following statement
print "The word "$word" appears in the sentence "$sentence"";
Perl reports a compilation error due to a scalar variable being found where it expected an operator. The problem occurs because Perl treats the quotes as string delimiters and does not have a way to interpret this conjunction of strings and variables. One solution to this problem is to place the escape character “\” before each of the inner quotation marks to signal that they are ordinary characters rather than string delimiters, e.g.
print "The word \"$word\" appears in the sentence \"$sentence\"";
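Escaping each quotation mark is not the only option. Perl also provides the qq operator, which lets us choose a different pair of delimiters for a string so that ordinary quotation marks inside it need no backslash. The sketch below builds the same message both ways. The file name and sample values are invented for illustration, and ‘use strict’, ‘use warnings’ and ‘my’ are additions that the chapter’s figures do not use.

```perl
#!/usr/local/bin/perl
#quotes.pl (hypothetical file name)
#Two ways to put quotation marks inside a printed message
use strict;
use warnings;

my $word     = "needle";
my $sentence = "a needle in a haystack";

# Escape each inner quotation mark with a backslash
my $escaped = "The word \"$word\" appears in the sentence \"$sentence\"\n";

# Use qq{ } as the string delimiters so inner quotes need no escape
my $with_qq = qq{The word "$word" appears in the sentence "$sentence"\n};

print $escaped;
print $with_qq;
```

Both print statements produce the same line. Which form to use is a matter of taste; the backslash form is the one used in the figures of this chapter.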
Program Debugging
If you succeeded in getting the previous programs to work on your first try, congratulations. If the programs did not work for you, then you have had your first taste of program debugging. Sooner or later every programmer has to debug, or correct, a program. The Perl interpreter usually offers advice about where the error occurred, so if you can isolate the line that has the error, you can figure out how to fix it. The line count in the error report includes every line in your program, including the empty lines (as shown in Figure 3.3). You can quickly isolate the line where the computer found the error by counting down from the first line of the program. Many editing programs such as HTML-Kit (www.chami.com) will display a line count automatically. The punctuation in the programs is critical, so make sure that your lines correspond exactly to the lines in the examples. One simple debugging trick is to comment out the offending line of the program. If the interpreter reports an error in line 16 of the last program, for example, try placing a pound sign (#) at the beginning of the line. This converts the program line into a comment and makes it possible to see if the program will at least run without this line. If the program does run, then you can be sure that the problem occurs in this line of the program. Look it over carefully to be sure that you typed it in correctly.
Sometimes the program will compile and run, but will not produce the result you anticipated. If you comment out line 16 and run the last program you will find that it no longer reports a successful match. This happens because line 16 provides the only output that indicates a match. Another frequent debugging technique is to check the values of the variables in the program to make sure that they are behaving the way you think they should. The easiest way to check variable values is to insert a print command at an appropriate point in the program. We could insert the line
print "word is $word; sentence is $sentence\n";
immediately after line 13. This change would display the values of the variable $word and $sentence so that we can check that they correspond to the values we think they should have.
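One refinement of this technique that I find helpful is to put brackets around each variable in the debugging message, so that invisible characters such as a leftover line return show up in the output. The following sketch uses a made-up value in place of keyboard input. The file name and variable names are assumptions for illustration, as are ‘use strict’, ‘use warnings’ and ‘my’.

```perl
#!/usr/local/bin/perl
#debugprint.pl (hypothetical file name)
#Uses print statements with brackets to reveal a hidden line return
use strict;
use warnings;

# Stands in for keyboard input that has not yet been chopped
my $word = "cow\n";

# The closing bracket lands on the next line, exposing the line return
my $before_chop = "word is [$word]\n";
print $before_chop;

chop $word;

# Now the brackets sit tight around the word
my $after_chop = "word is [$word]\n";
print $after_chop;
```

The first message is split across two lines, which tells us that $word still ends in a line return. The second message appears on a single line once chop has done its work.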
Perl has its own debugger that can speed up the correction process. You can invoke the debugger by using the -d command line option when you run a Perl program. The -d option tells the Perl interpreter to run the program under the control of the Perl debugger. I typed the command
perl -d match.pl
on my computer using Windows 98 and received the message
Default die handler restored.
Loading DB routines from perl5db.pl version 1.07
Editor support available.
Enter 'h' or 'h h' for help, or 'perldoc perldebug' for more help.
main::(match.pl:5): print "Please type in a sentence\n";
DB<1>
The debugger readies the program for execution and stops at the first executable line of the program. This happens to be line 5 in our program. The debugger displays this line together with its line number and provides the user with the prompt DB<1>. The user can then type in one of the debugger commands.
Typing the command h prompts the debugger to display its help file, one window at a time. You can then type any key to see the next window of debugger commands. The command h, followed by a command, will display help information for that particular command. We can type the command s to execute a single line of the program
DB<1> s
The debugger responds with
Please type in a sentence
main::(match.pl:6): $sentence = <STDIN>;
DB<1>
We can execute several lines in the program by using a breakpoint. The breakpoint tells the debugger where to halt execution. The breakpoint can be any line of the main program or the first line of a subprogram. The command
DB<1> b 15
sets a breakpoint at the line that matches the word to the sentence. We can then use the c command to execute all of the program statements from our current position up to just before the breakpoint. If we set the breakpoint at line 15 and then use the c command we get a blank line. This indicates the debugger has paused for us to type in the sentence we wish to search. Typing in a sentence and pressing the enter key takes the program to the prompt for a word in line 10. Once again we see a blank line which indicates the debugger is waiting for us to enter a word. Entering a word and pressing the enter key results in the output
main::(match.pl:15): if ( $sentence =~ m/$word/ ) {
You can also add a line number to the c command itself. This sets a temporary breakpoint at that line and continues execution in a single step. The command
DB<1> c 15
has the same effect as setting the breakpoint at line 15 and entering the c command. Breakpoints can be deleted with the d command. The command
DB<1> d 15
deletes the breakpoint at line 15. You can delete all of the breakpoints in the program with the command D. (In more recent versions of the Perl debugger these two commands are written B 15 and B *.)
It is often useful to look at several lines of the program together. Typing the command
DB<1> w 15
allows you to see a window of program lines that precede and follow line 15. (More recent versions of the Perl debugger use the command v 15 for this purpose.) This command produces the output
12: chop $word;
13: $word = lc($word);
14
15==>b if ( $sentence =~ m/$word/ ) {
16: print "\nThe word \"$word\" appears in the sentence \n\"$sentence\"\n";
17 } #end if word is in sentence
18
19 else {
20: print "\nThe word \"$word\" does not appear in the sentence \n\"$sentence\"\n";
21 } #end else
DB<3>
We can use the p (for ‘print’) command to see the value of a variable. The command
DB<1> p $word, $sentence
displays the values of our variables $word and $sentence. It can also be useful to change the value of variables in order to test the program under different conditions. The x command evaluates any Perl expression and displays the result, so typing an assignment after x sets a variable to a new value. The following example shows how we can use this command to change the value of $word
DB<1> p $word
cow
DB<2> x $word = 'horse'
0 'horse'
DB<3> p $word
horse
The trace is another handy debugger tool. A program trace allows us to see every line of the program that executes, along with its line number. If we set a breakpoint at line 15, set the trace on by typing the trace command t, and then use the c command, we get
main::(match.pl:5): print "Please type in a sentence\n";
DB<1> b 15
DB<2> t
Trace = on
DB<2> c
Please type in a sentence
main::(match.pl:6): $sentence = <STDIN>;
We can turn off the trace by typing the t command again. We can terminate the debugger at any time by typing the quit command q. I have included a summary of these debugger commands at the end of the chapter. Debugging computer programs is as much of an art as writing them in the first place. Once you have gained some experience with the different mistakes that you can make in creating a program, debugging will become less difficult. Be patient with yourself as you pass through the early part of the learning curve.
Summary of Perl functions
\n                           add a new line
\"                           treat the character as a literal quotation mark
#                            a comment line
#!/usr/local/bin/perl        tells a Unix server where to find the Perl interpreter
<STDIN>                      refers to keyboard input
$string                      a scalar variable containing alphanumeric characters
=                            the assignment operator that assigns the value on the right side to the variable on the left side
==                           the equality operator that tests whether two numbers are equal (use eq to compare two strings)
=~                           the binding operator that ties a pattern to a variable
m/pattern/                   the matching operator that searches a string for pattern
chop($line)                  function that removes the last character (normally the new line) from the end of $line
if ( condition ) {command}   control statement that processes a command if its condition is true
lc($string)                  function that converts strings to lower case
print "message\n";           function that prints the message
-d                           command-line option to start Perl debugger
-w                           command-line option to show warnings
Perl Debugger Commands
Command        Action
h              Display the debugger help
s              Execute one line of the program, stepping into subprograms
n              Execute the next line, stepping over subprograms
b number       Set a breakpoint at line number
c              Execute up to the next breakpoint
c number       Execute up to the line number
d number       Delete breakpoint at line number
D              Delete all breakpoints
w number       Display window of lines around line number
p expression   Evaluate the expression and print the result
x expression   Evaluate the expression (for example, an assignment) and display the result
t              Turn trace on and off
q              Quit the debugger